PostgreSQL: 如何獲取一維數(shù)組的相同元素并根據(jù)相似度排序?
2011-11-02 22:24:09|? 分類: Postgres基礎 |舉報|字號 訂閱
?????
?????
????? 今天開發(fā)有個需求,表中有一個列為一維數(shù)組類型,現(xiàn)在需要找出表中具有相同元素的數(shù)據(jù),描述起來
可能有點費力,下面舉個例子就明白了。
一 需求演示
--1.1測試表
mydb=> \d test_array;
??? Table "mydb.test_array"
Column |?? Type?? | Modifiers
--------+----------+-----------
id???? | integer? |
phone? | bigint[] |
mydb=> select * from test_array;
id |??? phone???
----+-------------
? 1 | {1,2}
? 2 | {1,2,3}
? 3 | {2,3}
? 4 | {1,2,3,4}
? 5 | {1,2,3,4,5}
? 6 | {4,5,6}
?
? 備注: 給出一個 id, 然后找出與這個 id 對應的 phone 數(shù)組含有相同元素的記錄,相同的元素越多,我們
?????????? 就認為這兩個元素越相似,并根據(jù)相似度降序排序。
?
--1.2 找出與 id=1 的 phone 數(shù)組含有相同的元素的記錄?
mydb=> select id,phone from test_array where phone && (select phone from test_array where id=1) and id!=1;
id |??? phone???
----+-------------
? 2 | {1,2,3}
? 3 | {2,3}
? 4 | {1,2,3,4}
? 5 | {1,2,3,4,5}
?????
備注:上面SQL雖然能成功找出具有相同元素的記錄,但是不能根據(jù)相似度排序,今天總結(jié)了下,
?????????? 有以下方法實現(xiàn)上面功能。
方法一:使用intarray模塊比較 int4[] 的數(shù)組類型
--2.1 安裝 intarray 模塊
mydb=# create extension intarray;
CREATE EXTENSION???
?????
備注:intarray模塊里有個 "&" 函數(shù),可以找到數(shù)組元素的相同部分, 具體信息可查閱手冊
????????? http://www.postgresql.org/docs/9.1/static/intarray.html
--2.2 & 操作符使用?????
mydb=> select array[1,2,3] & array[1,2];
?column?
----------
{1,2}
(1 row)
--2.3 不支持 int8 類型的數(shù)組
mydb=> select array[11111111111,2,3] & array[11111111111,2];
ERROR:? operator does not exist: bigint[] & bigint[]
LINE 1: select array[11111111111,2,3] & array[11111111111,2];
????????????????????????????????????? ^
HINT:? No operator matches the given name and argument type(s). You might need to add explicit type casts.
??????? 備注:intarray 模塊雖然能比較并獲得數(shù)組的相同元素,但僅支持 int4 數(shù)組類型。
--2.4 源碼 & 操作符代碼
CREATE OPERATOR & (
??????? LEFTARG = _int4,
??????? RIGHTARG = _int4,
??????? COMMUTATOR = &,
??????? PROCEDURE = _int_inter
);
?? 備注:可以在 $PGHOME/share/extension 目錄下查閱 intarray--1.0.sql 文件。
方法二:創(chuàng)建 intersection 函數(shù),對 int8[] 數(shù)組類型進行比較
--3.1 create function
CREATE OR REPLACE FUNCTION intersection(anyarray, anyarray) RETURNS anyarray as $$
SELECT ARRAY(
??? SELECT $1[i]
??? FROM generate_series( array_lower($1, 1), array_upper($1, 1) ) i
??? WHERE ARRAY[$1[i]] && $2
);
$$ language sql;
?? 備注:這里我們開發(fā)組的一名同事找到的,感謝這位同事。
--3.2 測試
mydb=> select intersection(array[11111111111,2,3],array[11111111111,2,3]);
?? intersection???
-------------------
{11111111111,2,3}
(1 row)
? 備注:這次果然沒報錯了,這種方法雖然功能實現(xiàn)了,但效率如何呢?下面簡單測試下。
四 性能測試
--4.1創(chuàng)建測試表并插入數(shù)據(jù)
mydb=> create table array_test (skyid serial primary key,phone_list int8[]);
NOTICE:? CREATE TABLE will create implicit sequence "array_test_skyid_seq" for serial column "array_test.skyid"
NOTICE:? CREATE TABLE / PRIMARY KEY will create implicit index "array_test_pkey" for table "array_test"
CREATE TABLE
mydb=> insert into array_test(phone_list) select regexp_split_to_array(id1||';'||id2||';'||id3||';'||id4,';')::int8[] from phone ;
INSERT 0 100000
mydb=> select * from array_test limit 10;
skyid |? phone_list??
-------+---------------
???? 1 | {1,2,3,4}
???? 2 | {2,3,4,5}
???? 3 | {3,4,5,6}
???? 4 | {4,5,6,7}
???? 5 | {5,6,7,8}
???? 6 | {6,7,8,9}
???? 7 | {7,8,9,10}
???? 8 | {8,9,10,11}
???? 9 | {9,10,11,12}
??? 10 | {10,11,12,13}
(10 rows)
--4.2 查詢SQL
mydb=> select t2.skyid,t2.phone_list, array_length(intersection(t1.phone_list,t2.phone_list),1)
mydb-> from array_test t1, array_test t2
mydb-> where t1.skyid=1 and t1.skyid!=t2.skyid and t1.phone_list && t2.phone_list
mydb-> ;
skyid | phone_list | array_length
-------+------------+--------------
???? 2 | {2,3,4,5}? |??????????? 3
???? 3 | {3,4,5,6}? |??????????? 2
???? 4 | {4,5,6,7}? |??????????? 1
(3 rows)
--4.3 查看執(zhí)行計劃
mydb=> explain analyze select t2.skyid,t2.phone_list, array_length(intersection(t1.phone_list,t2.phone_list),1)
mydb-> from array_test t1, array_test t2
mydb-> where t1.skyid=8 and t1.skyid!=t2.skyid and t1.phone_list && t2.phone_list
mydb-> order by array_length(intersection(t1.phone_list,t2.phone_list),1)? desc;
???????????????????????????????????????????????????????????????? QUERY PLAN????????????????????????????????????????????????????????
???????
--------------------------------------------------------------------------------------------------------------------------------------------
Sort? (cost=3743.94..3745.19 rows=500 width=110) (actual time=1279.393..1279.423 rows=6 loops=1)
?? Sort Key: (array_length(intersection(t1.phone_list, t2.phone_list), 1))
?? Sort Method: quicksort? Memory: 17kB
?? ->? Nested Loop? (cost=0.00..3721.53 rows=500 width=110) (actual time=0.651..1279.292 rows=6 loops=1)
???????? Join Filter: ((t1.skyid <> t2.skyid) AND (t1.phone_list && t2.phone_list))
???????? ->? Index Scan using array_test_pkey on array_test t1? (cost=0.00..8.28 rows=1 width=57) (actual time=0.236..0.275 rows=1 loops=1)
?????????????? Index Cond: (skyid =
???????? ->? Seq Scan on array_test t2? (cost=0.00..2087.00 rows=100000 width=57) (actual time=0.013..608.045 rows=100000 loops=1)
Total runtime: 1279.619 ms
(9 rows)
--4.4創(chuàng)建 gin 索引
mydb=>? create index concurrently idx_array_test_phone_list on array_test using gin (phone_list);
CREATE INDEX?????
?????
??????
--4.5 再次查看PLAN
mydb=> explain analyze select t2.skyid,t2.phone_list, array_length(intersection(t1.phone_list,t2.phone_list),1)
mydb-> from array_test t1, array_test t2
mydb-> where t1.skyid=7 and t1.skyid!=t2.skyid and t1.phone_list && t2.phone_list
mydb-> order by array_length(intersection(t1.phone_list,t2.phone_list),1)? desc;
?????????????????????????????????????????????????????????????????? QUERY PLAN??????????????????????????????????????????????????????
????????????
-------------------------------------------------------------------------------------------------------------------------------------------------
Sort? (cost=1070.18..1071.43 rows=500 width=110) (actual time=1.185..1.215 rows=6 loops=1)
?? Sort Key: (array_length(intersection(t1.phone_list, t2.phone_list), 1))
?? Sort Method: quicksort? Memory: 17kB
?? ->? Nested Loop? (cost=19.88..1047.77 rows=500 width=110) (actual time=0.854..1.117 rows=6 loops=1)
???????? Join Filter: (t1.skyid <> t2.skyid)
???????? ->? Index Scan using array_test_pkey on array_test t1? (cost=0.00..8.28 rows=1 width=57) (actual time=0.231..0.239 rows=1 loops=1)
?????????????? Index Cond: (skyid = 7)
???????? ->? Bitmap Heap Scan on array_test t2? (cost=19.88..905.74 rows=500 width=57) (actual time=0.226..0.264 rows=7 loops=1)
?????????????? Recheck Cond: (t1.phone_list && phone_list)
?????????????? ->? Bitmap Index Scan on idx_array_test_phone_list? (cost=0.00..19.75 rows=500 width=0) (actual time=0.123..0.123 rows=7 loops=1)
???????????????????? Index Cond: (t1.phone_list && phone_list)
Total runtime: 1.399 ms
(12 rows)
?? 備注:由于測試是在虛擬機上進行,數(shù)據(jù)量并不大,但從上面看出上面的SQL在創(chuàng)建了 gin 類型索引后,
????????????? 執(zhí)行時間在 1.3? 毫秒左右,效率顯著提高。
????
五 參考
http://blog.163.com/digoal@126/blog/static/163877040201192624726272/
http://www.itfingers.com/Question/756871/postgres-function-to-return-the-intersection-of-2-arrays/zh
http://www.postgresql.org/docs/9.1/static/intarray.html
http://www.postgresql.org/docs/9.1/static/indexes-types.html
2011-11-02 22:24:09|? 分類: Postgres基礎 |舉報|字號 訂閱
?????
?????
????? 今天開發(fā)有個需求,表中有一個列為一維數(shù)組類型,現(xiàn)在需要找出表中具有相同元素的數(shù)據(jù),描述起來
可能有點費力,下面舉個例子就明白了。
一 需求演示
--1.1測試表
mydb=> \d test_array;
??? Table "mydb.test_array"
Column |?? Type?? | Modifiers
--------+----------+-----------
id???? | integer? |
phone? | bigint[] |
mydb=> select * from test_array;
id |??? phone???
----+-------------
? 1 | {1,2}
? 2 | {1,2,3}
? 3 | {2,3}
? 4 | {1,2,3,4}
? 5 | {1,2,3,4,5}
? 6 | {4,5,6}
?
? 備注: 給出一個 id, 然后找出與這個 id 對應的 phone 數(shù)組含有相同元素的記錄,相同的元素越多,我們
?????????? 就認為這兩個元素越相似,并根據(jù)相似度降序排序。
?
--1.2 找出與 id=1 的 phone 數(shù)組含有相同的元素的記錄?
mydb=> select id,phone from test_array where phone && (select phone from test_array where id=1) and id!=1;
id |??? phone???
----+-------------
? 2 | {1,2,3}
? 3 | {2,3}
? 4 | {1,2,3,4}
? 5 | {1,2,3,4,5}
?????
備注:上面SQL雖然能成功找出具有相同元素的記錄,但是不能根據(jù)相似度排序,今天總結(jié)了下,
?????????? 有以下方法實現(xiàn)上面功能。
方法一:使用intarray模塊比較 int4[] 的數(shù)組類型
--2.1 安裝 intarray 模塊
mydb=# create extension intarray;
CREATE EXTENSION???
?????
備注:intarray模塊里有個 "&" 函數(shù),可以找到數(shù)組元素的相同部分, 具體信息可查閱手冊
????????? http://www.postgresql.org/docs/9.1/static/intarray.html
--2.2 & 操作符使用?????
mydb=> select array[1,2,3] & array[1,2];
?column?
----------
{1,2}
(1 row)
--2.3 不支持 int8 類型的數(shù)組
mydb=> select array[11111111111,2,3] & array[11111111111,2];
ERROR:? operator does not exist: bigint[] & bigint[]
LINE 1: select array[11111111111,2,3] & array[11111111111,2];
????????????????????????????????????? ^
HINT:? No operator matches the given name and argument type(s). You might need to add explicit type casts.
??????? 備注:intarray 模塊雖然能比較并獲得數(shù)組的相同元素,但僅支持 int4 數(shù)組類型。
--2.4 源碼 & 操作符代碼
CREATE OPERATOR & (
??????? LEFTARG = _int4,
??????? RIGHTARG = _int4,
??????? COMMUTATOR = &,
??????? PROCEDURE = _int_inter
);
?? 備注:可以在 $PGHOME/share/extension 目錄下查閱 intarray--1.0.sql 文件。
方法二:創(chuàng)建 intersection 函數(shù),對 int8[] 數(shù)組類型進行比較
--3.1 create function
CREATE OR REPLACE FUNCTION intersection(anyarray, anyarray) RETURNS anyarray as $$
SELECT ARRAY(
??? SELECT $1[i]
??? FROM generate_series( array_lower($1, 1), array_upper($1, 1) ) i
??? WHERE ARRAY[$1[i]] && $2
);
$$ language sql;
?? 備注:這里我們開發(fā)組的一名同事找到的,感謝這位同事。
--3.2 測試
mydb=> select intersection(array[11111111111,2,3],array[11111111111,2,3]);
?? intersection???
-------------------
{11111111111,2,3}
(1 row)
? 備注:這次果然沒報錯了,這種方法雖然功能實現(xiàn)了,但效率如何呢?下面簡單測試下。
四 性能測試
--4.1創(chuàng)建測試表并插入數(shù)據(jù)
mydb=> create table array_test (skyid serial primary key,phone_list int8[]);
NOTICE:? CREATE TABLE will create implicit sequence "array_test_skyid_seq" for serial column "array_test.skyid"
NOTICE:? CREATE TABLE / PRIMARY KEY will create implicit index "array_test_pkey" for table "array_test"
CREATE TABLE
mydb=> insert into array_test(phone_list) select regexp_split_to_array(id1||';'||id2||';'||id3||';'||id4,';')::int8[] from phone ;
INSERT 0 100000
mydb=> select * from array_test limit 10;
skyid |? phone_list??
-------+---------------
???? 1 | {1,2,3,4}
???? 2 | {2,3,4,5}
???? 3 | {3,4,5,6}
???? 4 | {4,5,6,7}
???? 5 | {5,6,7,8}
???? 6 | {6,7,8,9}
???? 7 | {7,8,9,10}
???? 8 | {8,9,10,11}
???? 9 | {9,10,11,12}
??? 10 | {10,11,12,13}
(10 rows)
--4.2 查詢SQL
mydb=> select t2.skyid,t2.phone_list, array_length(intersection(t1.phone_list,t2.phone_list),1)
mydb-> from array_test t1, array_test t2
mydb-> where t1.skyid=1 and t1.skyid!=t2.skyid and t1.phone_list && t2.phone_list
mydb-> ;
skyid | phone_list | array_length
-------+------------+--------------
???? 2 | {2,3,4,5}? |??????????? 3
???? 3 | {3,4,5,6}? |??????????? 2
???? 4 | {4,5,6,7}? |??????????? 1
(3 rows)
--4.3 查看執(zhí)行計劃
mydb=> explain analyze select t2.skyid,t2.phone_list, array_length(intersection(t1.phone_list,t2.phone_list),1)
mydb-> from array_test t1, array_test t2
mydb-> where t1.skyid=8 and t1.skyid!=t2.skyid and t1.phone_list && t2.phone_list
mydb-> order by array_length(intersection(t1.phone_list,t2.phone_list),1)? desc;
???????????????????????????????????????????????????????????????? QUERY PLAN????????????????????????????????????????????????????????
???????
--------------------------------------------------------------------------------------------------------------------------------------------
Sort? (cost=3743.94..3745.19 rows=500 width=110) (actual time=1279.393..1279.423 rows=6 loops=1)
?? Sort Key: (array_length(intersection(t1.phone_list, t2.phone_list), 1))
?? Sort Method: quicksort? Memory: 17kB
?? ->? Nested Loop? (cost=0.00..3721.53 rows=500 width=110) (actual time=0.651..1279.292 rows=6 loops=1)
???????? Join Filter: ((t1.skyid <> t2.skyid) AND (t1.phone_list && t2.phone_list))
???????? ->? Index Scan using array_test_pkey on array_test t1? (cost=0.00..8.28 rows=1 width=57) (actual time=0.236..0.275 rows=1 loops=1)
?????????????? Index Cond: (skyid =

???????? ->? Seq Scan on array_test t2? (cost=0.00..2087.00 rows=100000 width=57) (actual time=0.013..608.045 rows=100000 loops=1)
Total runtime: 1279.619 ms
(9 rows)
--4.4創(chuàng)建 gin 索引
mydb=>? create index concurrently idx_array_test_phone_list on array_test using gin (phone_list);
CREATE INDEX?????
?????
??????
--4.5 再次查看PLAN
mydb=> explain analyze select t2.skyid,t2.phone_list, array_length(intersection(t1.phone_list,t2.phone_list),1)
mydb-> from array_test t1, array_test t2
mydb-> where t1.skyid=7 and t1.skyid!=t2.skyid and t1.phone_list && t2.phone_list
mydb-> order by array_length(intersection(t1.phone_list,t2.phone_list),1)? desc;
?????????????????????????????????????????????????????????????????? QUERY PLAN??????????????????????????????????????????????????????
????????????
-------------------------------------------------------------------------------------------------------------------------------------------------
Sort? (cost=1070.18..1071.43 rows=500 width=110) (actual time=1.185..1.215 rows=6 loops=1)
?? Sort Key: (array_length(intersection(t1.phone_list, t2.phone_list), 1))
?? Sort Method: quicksort? Memory: 17kB
?? ->? Nested Loop? (cost=19.88..1047.77 rows=500 width=110) (actual time=0.854..1.117 rows=6 loops=1)
???????? Join Filter: (t1.skyid <> t2.skyid)
???????? ->? Index Scan using array_test_pkey on array_test t1? (cost=0.00..8.28 rows=1 width=57) (actual time=0.231..0.239 rows=1 loops=1)
?????????????? Index Cond: (skyid = 7)
???????? ->? Bitmap Heap Scan on array_test t2? (cost=19.88..905.74 rows=500 width=57) (actual time=0.226..0.264 rows=7 loops=1)
?????????????? Recheck Cond: (t1.phone_list && phone_list)
?????????????? ->? Bitmap Index Scan on idx_array_test_phone_list? (cost=0.00..19.75 rows=500 width=0) (actual time=0.123..0.123 rows=7 loops=1)
???????????????????? Index Cond: (t1.phone_list && phone_list)
Total runtime: 1.399 ms
(12 rows)
?? 備注:由于測試是在虛擬機上進行,數(shù)據(jù)量并不大,但從上面看出上面的SQL在創(chuàng)建了 gin 類型索引后,
????????????? 執(zhí)行時間在 1.3? 毫秒左右,效率顯著提高。
????
五 參考
http://blog.163.com/digoal@126/blog/static/163877040201192624726272/
http://www.itfingers.com/Question/756871/postgres-function-to-return-the-intersection-of-2-arrays/zh
http://www.postgresql.org/docs/9.1/static/intarray.html
http://www.postgresql.org/docs/9.1/static/indexes-types.html
更多文章、技術(shù)交流、商務合作、聯(lián)系博主
微信掃碼或搜索:z360901061

微信掃一掃加我為好友
QQ號聯(lián)系: 360901061
您的支持是博主寫作最大的動力,如果您喜歡我的文章,感覺我的文章對您有幫助,請用微信掃描下面二維碼支持博主2元、5元、10元、20元等您想捐的金額吧,狠狠點擊下面給點支持吧,站長非常感激您!手機微信長按不能支付解決辦法:請將微信支付二維碼保存到相冊,切換到微信,然后點擊微信右上角掃一掃功能,選擇支付二維碼完成支付。
【本文對您有幫助就好】元
