PostgreSQL何以支援豐富的NoSQL特性？

PostgreSQL 全文檢索 NoSql · 發表 2018-09-26 10:47:27

摘要：在上一篇文章《PostgreSQL使用者應掌握的高階SQL特性》我們介紹了PostgreSQL的典型高階SQL特性。PostgreSQL不僅是關係型資料庫，同時支援豐富的NoSQL特性，所以本文將介紹PostgreSQL的NoSQ...

在上一篇文章 ofollow,noindex">《PostgreSQL使用者應掌握的高階SQL特性》我們介紹了PostgreSQL的典型高階SQL特性。PostgreSQL不僅是關係型資料庫，同時支援豐富的NoSQL特性，所以本文將介紹PostgreSQL的NoSQL特性，分以下三部分來介紹：

PostgreSQL的JSON和JSONB資料型別簡介；
JSON與JSONB讀寫效能測試；
PostgreSQL全文檢索支援JSON和JSONB（PosgreSQL 10新特性）。

一、JSON和JSONB資料型別

PostgreSQL支援非關係資料型別json (JavaScript Object Notation)，本節介紹json型別、json與jsonb差異、json與jsonb操作符和函式以及jsonb鍵值的追加、刪除、更新。

1、JSON型別簡介

PotgreSQL早在9.2版本已經提供了json型別，並且隨著大版本的演進，PostgreSQL對json的支援趨於完善，例如提供更多的json函式和操作符方便應用開發，一個簡單的json型別例子如下：

mydb=> SELECT '{"a":1,"b":2}'::json;
     json      
---------------
 {"a":1,"b":2}

為了更好演示json型別，接下來建立一張表，如下所示：

mydb=> CREATE TABLE test_json1 (id serial primary key,name json);

CREATE TABLE

以上示例定義欄位name為json型別，插入表資料，如下所示：

mydb=> INSERT INTO test_json1 (name)
VALUES ('{"col1":1,"col2":"francs","col3":"male"}');
INSERT 0 1
 
mydb=> INSERT INTO test_json1 (name)
VALUES ('{"col1":2,"col2":"fp","col3":"female"}');
INSERT 0 1

查詢表test_json1資料：

mydb=> SELECT * FROM test_json1;
 id |                   name                   
----+------------------------------------------
  1 | {"col1":1,"col2":"francs","col3":"male"}
  2 | {"col1":2,"col2":"fp","col3":"female"}

2、查詢JSON資料

通過->操作符可以查詢json資料的鍵值，如下所示：

mydb=> SELECT  name -> 'col2' FROM test_json1 WHERE id=1;
 ?column?
----------
 "francs"
(1 row)

如果想以文字格式返回json欄位鍵值可以使用->>符，如下所示：

mydb=> SELECT  name ->> 'col2' FROM test_json1 WHERE id=1;
 ?column?
----------
 francs
(1 row)

3、JSONB與JSON差異

PostgreSQL支援兩種JSON資料型別：json和jsonb，兩種型別在使用上幾乎完全相同，主要區別如下：

json儲存格式為文字，而jsonb儲存格式為二進位制，由於儲存格式的不同使得兩種json資料型別的處理效率不一樣，json型別以文字儲存並且儲存的內容和輸入資料一樣，當檢索json資料時必須重新解析，而jsonb以二進位制形式儲存已解析好的資料，當檢索jsonb資料時不需要重新解析，因此json寫入比jsonb快，但檢索比jsonb慢，後面會通過測試驗證兩者讀寫效能差異。

除了上述介紹的區別之外，json與jsonb在使用過程中還存在差異，例如jsonb輸出的鍵的順序和輸入不一樣，如下所示：

mydb=> SELECT '{"bar": "baz", "balance": 7.77, "active":false}'::jsonb;
                      jsonb                       
--------------------------------------------------
 {"bar": "baz", "active": false, "balance": 7.77}
(1 row)

而json的輸出鍵的順序和輸入完全一樣，如下所示：

mydb=> SELECT '{"bar": "baz", "balance": 7.77, "active":false}'::json;
                      json                       
-------------------------------------------------
 {"bar": "baz", "balance": 7.77, "active":false}
(1 row)

另外，jsonb型別會去掉輸入資料中鍵值的空格，如下所示：

mydb=> SELECT ' {"id":1,    "name":"francs"}'::jsonb;
            jsonb            
-----------------------------
 {"id": 1, "name": "francs"}
(1 row)

上例中id鍵與name鍵輸入時是有空格的，輸出顯示空格鍵被刪除，而json的輸出和輸入一樣，不會刪掉空格鍵：

mydb=> SELECT ' {"id":1,    "name":"francs"}'::json;
             json              
-------------------------------
  {"id":1,    "name":"francs"}
(1 row)

另外，jsonb會刪除重複的鍵，僅保留最後一個，如下所示：

mydb=> SELECT ' {"id":1,                                                              
"name":"francs",
"remark":"a good guy!",
"name":"test"
}'::jsonb;
                       jsonb                        
----------------------------------------------------
 {"id": 1, "name": "test", "remark": "a good guy!"}
(1 row)

上面name鍵重複，僅保留最後一個name鍵的值，而json資料型別會保留重複的鍵值。

相比json大多數應用場景建議使用jsonb，除非有特殊的需求，比如對json的鍵順序有特殊的要求。

4、JSONB與JSON操作符

PostgreSQL支援豐富的JSONB和JSON的操作符，舉例如下：

以文字格式返回json型別的欄位鍵值可以使用->>符，如下所示：

mydb=> SELECT  name ->> 'col2' FROM test_json1 WHERE id=1;
 ?column?
----------
 francs
(1 row)

字串是否作為頂層鍵值，如下所示：

mydb=> SELECT '{"a":1, "b":2}'::jsonb ? 'a';
 ?column?
----------
 t
(1 row)

刪除json資料的鍵/值，如下所示：

mydb=> SELECT '{"a":1, "b":2}'::jsonb - 'a';
 ?column?
----------
 {"b": 2}
(1 row)

5、JSONB與JSON函式

json與jsonb相關的函式非常豐富，舉例如下：

擴充套件最外層的json物件成為一組鍵/值結果集，如下所示：

mydb=> SELECT * FROM json_each('{"a":"foo", "b":"bar"}');
 key | value
-----+-------
 a   | "foo"
 b   | "bar"
(2 rows)

以文字形式返回結果，如下所示：

mydb=> SELECT * FROM json_each_text('{"a":"foo", "b":"bar"}');
 key | value
-----+-------
 a   | foo
 b   | bar
(2 rows)

一個非常重要的函式為row_to_json()函式，能夠將行作為json物件返回，此函式常用來生成json測試資料，比如將一個普通錶轉換成json型別表：

mydb=> SELECT * FROM test_copy WHERE id=1;
 id | name
----+------
  1 | a
(1 row)
 
mydb=> SELECT row_to_json(test_copy) FROM test_copy WHERE id=1;
     row_to_json     
---------------------
 {"id":1,"name":"a"}
(1 row)

返回最外層的json對像中的鍵的集合，如下所示：

mydb=> SELECT * FROM json_object_keys('{"a":"foo", "b":"bar"}');
 json_object_keys
------------------
 a
 b
(2 rows)

6、jsonb鍵/值的追加、刪除、更新

jsonb鍵/值追加可通過||操作符，如下增加sex鍵/值：

mydb=> SELECT '{"name":"francs","age":"31"}'::jsonb ||
'{"sex":"male"}'::jsonb;
                    ?column?                    
------------------------------------------------
 {"age": "31", "sex": "male", "name": "francs"}
(1 row)

jsonb鍵/值的刪除有兩種方法，一種是通過操作符號-刪除，另一種通過操作符#-刪除指定鍵/值。

通過操作符號-刪除鍵/值如下：

mydb=> SELECT '{"name": "James", "email": "james@localhost"}'::jsonb
 - 'email';
     ?column?      
-------------------
 {"name": "James"}
(1 row)
 
mydb=> SELECT '["red","green","blue"]'::jsonb - 0;
     ?column?      
-------------------
 ["green", "blue"]

第二種方法是通過操作符#-刪除指定鍵/值，通常用於有巢狀json資料刪除的場景，如下刪除巢狀contact中的fax鍵/值：

mydb=> SELECT '{"name": "James", "contact": {"phone": "01234 567890", "fax": "01987 543210"}}'::jsonb #- '{contact,fax}'::text[];
                        ?column?                         
---------------------------------------------------------
 {"name": "James", "contact": {"phone": "01234 567890"}}
(1 row)

刪除巢狀aliases中的位置為1的鍵/值，如下所示：

mydb=> SELECT '{"name": "James", "aliases": ["Jamie","The Jamester","J Man"]}'::jsonb #- '{aliases,1}'::text[];
                     ?column?                     
--------------------------------------------------
 {"name": "James", "aliases": ["Jamie", "J Man"]}
(1 row)

鍵/值的更新也有兩種方式，第一種方式為||操作符，||操作符可以連線json鍵，也可覆蓋重複的鍵值，如下修改age鍵的值：

mydb=> SELECT '{"name":"francs","age":"31"}'::jsonb ||
'{"age":"32"}'::jsonb;
            ?column?             
---------------------------------
 {"age": "32", "name": "francs"}
(1 row)

第二種方式是通過jsonb_set函式，語法如下：

jsonb_set(target jsonb, path text[], new_value jsonb[, create_missing boolean])

target指源jsonb資料，path指路徑，new_value指更新後的鍵值，create_missing 值為 true表示如果鍵不存在則新增，create_missing 值為 false表示如果鍵不存在則不新增，示例如下：

mydb=> SELECT jsonb_set('{"name":"francs","age":"31"}'::jsonb,'{age}','"32"'::jsonb,false);
            jsonb_set            
---------------------------------
 {"age": "32", "name": "francs"}
(1 row)
 
mydb=> SELECT jsonb_set('{"name":"francs","age":"31"}'::jsonb,'{sex}','"male"'::jsonb,true);
                   jsonb_set                    
------------------------------------------------
 {"age": "31", "sex": "male", "name": "francs"}
(1 row)

7、給JSONB型別建立索引

這一小節介紹給jsonb資料型別建立索引，jsonb資料型別支援GIN索引，為了便於說明，假如一個json欄位內容如下，並且以jsonb格式儲存。

{
  "id": 1,
  "user_id": 1440933,
  "user_name": "1_francs",
  "create_time": "2017-08-03 16:22:05.528432+08"
}

假如儲存以上jsonb資料的欄位名為user_info，表名為tbl_user_jsonb，在user_info欄位上建立GIN索引語法如下：

CREATE INDEX idx_gin ON tbl_user_jsonb USING gin(user_info);

jsonb上的GIN索引支援@>、?、 ?&、?|操作符，例如以下查詢將會使用索引：

SELECT * FROM tbl_user_jsonb WHERE user_info @> '{"user_name": "1_frans"}'

但是以下基於jsonb鍵值的查詢不會走索引idx_gin，如下所示：

SELECT * FROM tbl_user_jsonb WHERE user_info->>'user_name'= '1_francs';

如果要想提升基於jsonb型別的鍵值檢索效率，可以在jsonb資料型別對應的鍵值上建立索引，如下所示：

CREATE INDEX idx_gin_user_infob_user_name ON tbl_user_jsonb USING btree

((user_info ->> 'user_name'));

建立以上索引後，上述根據user_info->>'user_name'鍵值查詢的SQL將會走索引。

二、JSON與JSONB讀寫效能測試

前面介紹了jsonb資料型別索引建立相關內容，本部分將對json、jsonb讀寫效能進行簡單對比。json與jsonb讀寫效能存在差異，主要表現為json寫入時比jsonb快，但檢索時比jsonb慢，主要原因為：

json儲存格式為文字，而jsonb儲存格式為二進位制，儲存格式的不同使得兩種json資料型別的處理效率不一樣，json型別儲存的內容和輸入資料一樣，當檢索json資料時必須重新解析，而jsonb以二進位制形式儲存已解析好的資料，當檢索jsonb資料時不需要重新解析。

1、構建JSON、JSONB測試表

下面通過一個簡單的例子測試下json、jsonb的讀寫效能差異，計劃建立以下三張表：

quser_ini：基礎資料表，並插入200萬測試資料；
qtbl_user_json: json 資料型別表，200萬資料；
qtbl_user_jsonb：jsonb 資料型別表，200萬資料。

首先建立user_ini表並插入200萬測試資料，如下：

mydb=> CREATE TABLE user_ini(id int4 ,user_id int8, user_name character
varying(64),create_time timestamp(6) with time zone default
clock_timestamp());
CREATE TABLE
 
mydb=> INSERT INTO user_ini(id,user_id,user_name)
SELECT r,round(random()*2000000), r || '_francs'
FROM generate_series(1,2000000) as r;
INSERT 0 2000000

計劃使用user_ini表資料生成json、jsonb資料，建立user_ini_json、user_ini_jsonb表，如下所示：

mydb=> CREATE TABLE tbl_user_json(id serial, user_info json);
CREATE TABLE
mydb=> CREATE TABLE tbl_user_jsonb(id serial, user_info jsonb);
CREATE TABLE

2、JSON與JSONB表寫效能測試

根據user_ini資料通過row_to_json函式向表user_ini_json插入200萬json資料，如下：

mydb=> \timing
Timing is on.
 
mydb=> INSERT INTO tbl_user_json(user_info) SELECT row_to_json(user_ini)
FROM user_ini;
INSERT 0 2000000
Time: 13825.974 ms (00:13.826)

從以上結果看出tbl_user_json插入200萬資料花了13秒左右；接著根據user_ini表資料生成200萬jsonb資料並插入表tbl_user_jsonb，如下：

mydb=> INSERT INTO tbl_user_jsonb(user_info)
 SELECT row_to_json(user_ini)::jsonb FROM user_ini;    
INSERT 0 2000000
Time: 20756.993 ms (00:20.757)

從以上看出tbl_user_jsonb表插入200萬jsonb資料花了20秒左右，正好驗證了json資料寫入比jsonb快，比較兩表佔用空間大小，如下所示：

mydb=> \dt+ tbl_user_json
                       List of relations
 Schema |     Name      | Type  | Owner  |  Size  | Description
--------+---------------+-------+--------+--------+-------------
 pguser | tbl_user_json | table | pguser | 281 MB |
(1 row)
mydb=> \dt+ tbl_user_jsonb
                        List of relations
 Schema |      Name      | Type  | Owner  |  Size  | Description
--------+----------------+-------+--------+--------+-------------
 pguser | tbl_user_jsonb | table | pguser | 333 MB |
(1 row)

從佔用空間來看，同樣的資料量jsonb資料型別佔用空間比json稍大。

查詢tbl_user_json表的一條測試資料，如下：

mydb=> SELECT * FROM tbl_user_json LIMIT 1;
   id    |                                             user_info                                             
---------+------------------------------------------------------------------------------------
 2000001 | {"id":1,"user_id":1182883,"user_name":"1_francs","create_time":"2017-08-03T20:59:27.42741+08:00"}
(1 row)

3、JSON與JSONB表讀效能測試

對於json、jsonb讀效能測試我們選擇基於json、jsonb鍵值查詢的場景，例如，根據user_info欄位的user_name鍵的值查詢，如下所示：

mydb=> EXPLAIN ANALYZE SELECT * FROM tbl_user_jsonb WHERE user_info->>'user_name'='1_francs';
                          QUERY PLAN                                                     
-------------------------------------------------------------------------------------
 Seq Scan on tbl_user_jsonb  (cost=0.00..72859.90 rows=10042 width=143) (actual time=0.023..524.843 rows=1 loops=1)
   Filter: ((user_info ->> 'user_name'::text) = '1_francs'::text)
   Rows Removed by Filter: 1999999
 Planning time: 0.091 ms
 Execution time: 524.876 ms
(5 rows)

上述SQL執行時間為524毫秒左右，基於user_info欄位的user_name鍵值建立btree索引如下：

mydb=> CREATE INDEX idx_jsonb ON tbl_user_jsonb USING btree

((user_info->>'user_name'));

再次執行上述查詢，如下所示：

mydb=> EXPLAIN ANALYZE SELECT * FROM tbl_user_jsonb WHERE user_info->>'user_name'='1_francs';
                                   QUERY PLAN                                                         
-------------------------------------------------------------------------------------
 Bitmap Heap Scan on tbl_user_jsonb  (cost=155.93..14113.93 rows=10000 width=143) (actual time=0.027..0.027 rows=1 loops=1)
   Recheck Cond: ((user_info ->> 'user_name'::text) = '1_francs'::text)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on idx_jsonb  (cost=0.00..153.43 rows=10000 width=0) (actual time=0.021..0.021 rows=1 loops=1)
         Index Cond: ((user_info ->> 'user_name'::text) = '1_francs'::text)
 Planning time: 0.091 ms
 Execution time: 0.060 ms
(7 rows)

根據上述執行計劃看出走了索引，並且SQL時間下降到0.060ms。為更好地對比tbl_user_json、tbl_user_jsonb表基於鍵值查詢的效率，計劃根據user_info欄位id鍵進行範圍掃描對比效能，建立索引如下：

mydb=> CREATE INDEX idx_gin_user_info_id ON tbl_user_json USING btree
(((user_info ->> 'id')::integer));
CREATE INDEX
 
mydb=> CREATE INDEX idx_gin_user_infob_id ON tbl_user_jsonb USING btree
(((user_info ->> 'id')::integer));
CREATE INDEX

索引建立後，查詢tbl_user_json表如下：

mydb=> EXPLAIN ANALYZE SELECT id,user_info->'id',user_info->'user_name'
 FROM tbl_user_json
WHERE (user_info->>'id')::int4>1 AND (user_info->>'id')::int4<10000;
                               QUERY PLAN                                                            
-------------------------------------------------------------------------------------
 Bitmap Heap Scan on tbl_user_json  (cost=166.30..14178.17 rows=10329 width=68) (actual time=1.167..26.534 rows=9998 loops=1)
   Recheck Cond: ((((user_info ->> 'id'::text))::integer > 1) AND (((user_info ->> 'id'::text))::integer < 10000))
   Heap Blocks: exact=338
   ->  Bitmap Index Scan on idx_gin_user_info_id  (cost=0.00..163.72 rows=10329 width=0) (actual time=1.110..1.110 rows=19996 loops=
1)
         Index Cond: ((((user_info ->> 'id'::text))::integer > 1) AND (((user_info ->> 'id'::text))::integer < 10000))
 Planning time: 0.094 ms
 Execution time: 27.092 ms
(7 rows)

根據以上看出，查詢表tbl_user_json的user_info欄位id鍵值在1到10000範圍內的記錄走了索引，並且執行時間為27.092毫秒，接著測試tbl_user_jsonb表同樣SQL的檢索效能，如下所示：

mydb=> EXPLAIN ANALYZE SELECT id,user_info->'id',user_info->'user_name'
 FROM tbl_user_jsonb
WHERE (user_info->>'id')::int4>1 AND (user_info->>'id')::int4<10000;
                        QUERY PLAN                                                           
-------------------------------------------------------------------------------------
 Bitmap Heap Scan on tbl_user_jsonb  (cost=158.93..14316.93 rows=10000 width=68) (actual time=1.140..8.116 rows=9998 loops=1)
   Recheck Cond: ((((user_info ->> 'id'::text))::integer > 1) AND (((user_info ->> 'id'::text))::integer < 10000))
   Heap Blocks: exact=393
   ->  Bitmap Index Scan on idx_gin_user_infob_id  (cost=0.00..156.43 rows=10000 width=0) (actual time=1.058..1.058 rows=18992 loops
=1)
         Index Cond: ((((user_info ->> 'id'::text))::integer > 1) AND (((user_info ->> 'id'::text))::integer < 10000))
 Planning time: 0.104 ms
 Execution time: 8.656 ms
(7 rows)

根據以上看出，查詢表tbl_user_jsonb的user_info欄位id鍵值在1到10000範圍內的記錄走了索引並且執行時間為8.656毫秒，從這個測試看出jsonb檢索比json效率高。

從以上兩個測試看出，正好驗證了“json寫入比jsonb快，但檢索時比jsonb慢”的觀點，值得一提的是如果需要通過key/value進行檢索，例如以下：

SELECT * FROM tbl_user_jsonb WHERE user_info @> '{"user_name": "2_francs"}';

這時執行計劃為全表掃描，如下所示：

mydb=> EXPLAIN ANALYZE SELECT * FROM tbl_user_jsonb WHERE user_info @> '{"user_name": "2_francs"}';
                                QUERY PLAN                                                     
------------------------------------------------------------------------------------
 Seq Scan on tbl_user_jsonb  (cost=0.00..67733.00 rows=2000 width=143) (actual time=0.018..582.207 rows=1 loops=1)
   Filter: (user_info @> '{"user_name": "2_francs"}'::jsonb)
   Rows Removed by Filter: 1999999
 Planning time: 0.065 ms
 Execution time: 582.232 ms
(5 rows)

從以上看出執行時間為582毫秒左右，在tbl_user_jsonb欄位user_info上建立gin索引，如下所示：

mydb=> CREATE INDEX idx_tbl_user_jsonb_user_Info ON tbl_user_jsonb USING gin
 (user_Info);
CREATE INDEX

索引建立後，再次執行以下，如下所示：

mydb=> EXPLAIN ANALYZE SELECT * FROM tbl_user_jsonb WHERE user_info @> '{"user_name": "2_francs"}';
                                  QUERY PLAN                                                           
-------------------------------------------------------------------------------------
 Bitmap Heap Scan on tbl_user_jsonb  (cost=37.50..3554.34 rows=2000 width=143) (actual time=0.079..0.080 rows=1 loops=1)
   Recheck Cond: (user_info @> '{"user_name": "2_francs"}'::jsonb)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on idx_tbl_user_jsonb_user_info  (cost=0.00..37.00 rows=2000 width=0) (actual time=0.069..0.069 rows=1 loop
s=1)
         Index Cond: (user_info @> '{"user_name": "2_francs"}'::jsonb)
 Planning time: 0.094 ms
 Execution time: 0.114 ms
(7 rows)

從以上看出走了索引，並且執行時間下降到了0.114毫秒。

這部分內容測試了json、jsonb資料型別讀寫效能差異，驗證了json寫入時比jsonb快，但檢索時比jsonb慢的觀點。

三、全文檢索支援JSON和JSONB

接下來我們來介紹PostgreSQL 10的一個新特性：全文檢索支援json、jsonb資料型別。這部分我們會分兩部分來說明，第一部分簡單介紹PostgreSQL全文檢索，第二部分演示全文檢索對json、jsonb資料型別的支援。

1、PostgreSQL全文檢索簡介

對於大多數應用全文檢索很少放到資料庫中實現，一般使用單獨的全文檢索引擎，例如基於SQL全文檢索引擎Sphinx。PostgreSQL支援全文檢索，對於規模不大的應用如果不想搭建專門的搜尋引擎，PostgreSQL的全文檢索也可以滿足需求。

如果沒有使用專門的搜尋引擎，大部檢索需要通過資料庫like操作匹配，這種檢索方式主要缺點在於：

不能很好的支援索引，通常需全表掃描檢索資料，資料量大時檢索效能很低；
不提供檢索結果排序，當輸出結果資料量非常大時表現更加明顯。

PostgreSQL全文檢索能有效地解決這個問題，PostgreSQL全文檢索通過以下兩種資料型別來實現。

Tsvector

tsvector全文檢索資料型別代表一個被優化的可以基於搜尋的文件，將一串字串轉換成tsvector全文檢索資料型別，如下：

mydb=> SELECT 'Hello,cat,how are u? cat is smiling! '::tsvector;
                     tsvector                     
--------------------------------------------------
 'Hello,cat,how' 'are' 'cat' 'is' 'smiling!' 'u?'
(1 row)

可以看到，字串的內容被分隔成好幾段，但通過::tsvector只是做型別轉換，沒有進行資料標準化處理，對於英文全文檢索可通過函式to_tsvector進行資料標準化，如下所示：

mydb=> SELECT to_tsvector('english','Hello cat,');
    to_tsvector    
-------------------
 'cat':2 'hello':1
(1 row)

Tsquery

tsquery表示一個文字查詢，儲存用於搜尋的詞，並且支援布林操作&、|、!，將字串轉換成tsquery，如下所示：

mydb=> SELECT  'hello&cat'::tsquery;
     tsquery     
-----------------
 'hello' & 'cat'
(1 row)

上述只是轉換成tsquery型別，而並沒有做標準化，使用to_tsquery函式可以執行標準化，如下所示：

mydb=> SELECT to_tsquery( 'hello&cat' );
   to_tsquery    
-----------------
 'hello' & 'cat'
(1 row)

一個全文檢索示例如下，檢索字串是否包括hello和cat字元，本例中返回真。

mydb=> SELECT to_tsvector('english','Hello cat,how are u') @@
to_tsquery( 'hello&cat' );
 ?column?
----------
 t
(1 row)

檢索字串是否包含字元hello和dog，本例中返回假。

mydb=> SELECT to_tsvector('english','Hello cat,how are u') @@
 to_tsquery( 'hello&dog' );
 ?column?
----------
 f
(1 row)

有興趣的讀者可以測試tsquery的其他操作符，例如|、!等。

注意：這裡使用了帶雙引數的to_tsvector函式，函式to_tsvector雙引數的格式如下：

to_tsvector([ config regconfig , ] document text)，本節to_tsvector函式指定了config引數為english，如果不指定config引數，則預設使用default_text_search_config引數的配置。

英文全文檢索例子

下面演示一個英文全文檢索示例，建立一張測試表並插入200萬測試資料，如下所示：

mydb=> CREATE TABLE test_search(id int4,name text);
CREATE TABLE
mydb=> INSERT INTO test_search(id,name) SELECT n, n||'_francs'
FROM generate_series(1,2000000) n;
INSERT 0 2000000

執行以下SQL，查詢test_search表name欄位包含字元1_francs的記錄。

mydb=> SELECT * FROM test_search WHERE name LIKE '1_francs';
 id |   name   
----+----------
  1 | 1_francs
(1 row)

執行計劃如下：

mydb=> EXPLAIN ANALYZE SELECT * FROM test_search WHERE name LIKE '1_francs';
                                      QUERY PLAN                                                  
-------------------------------------------------------------------------------------Seq Scan on test_search  (cost=0.00..38465.04 rows=204 width=18) (actual time=0.022..261.766 rows=1 loops=1)
   Filter: (name ~~ '1_francs'::text)
   Rows Removed by Filter: 1999999
 Planning time: 0.101 ms
 Execution time: 261.796 ms
(5 rows)

以上執行計劃走了全表掃描，執行時間為261毫秒左右，效能很低，接著建立索引，如下所示：

mydb=> CREATE INDEX idx_gin_search ON test_search USING gin
(to_tsvector('english',name));
CREATE INDEX

執行以下SQL，查詢test_search表name欄位包含字元1_francs的記錄。

mydb=> SELECT * FROM test_search WHERE to_tsvector('english',name) @@
to_tsquery('english','1_francs');
 id |   name   
----+----------
  1 | 1_francs
(1 row)

再次檢視執行計劃和執行時間，如下所示：

mydb=> EXPLAIN ANALYZE SELECT * FROM test_search WHERE to_tsvector('english',name) @@
to_tsquery('english','1_francs');
                                        QUERY PLAN                                                        
-------------------------------------------------------------------------------------
 Bitmap Heap Scan on test_search  (cost=18.39..128.38 rows=50 width=36) (actual time=0.071..0.071 rows=1 loops=1)
   Recheck Cond: (to_tsvector('english'::regconfig, name) @@ '''1'' & ''franc'''::tsquery)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on idx_gin_search  (cost=0.00..18.38 rows=50 width=0) (actual time=0.064..0.064 rows=1 loops=1)
         Index Cond: (to_tsvector('english'::regconfig, name) @@ '''1'' & ''franc'''::tsquery)
 Planning time: 0.122 ms
 Execution time: 0.104 ms
(7 rows)

建立索引後，以上查詢走了索引並且執行時間下降到0.104毫秒，效能提升了3個數量級，值得一提的是如果SQL改成以下，則不走索引，如下所示：

mydb=> EXPLAIN ANALYZE SELECT * FROM test_search
 WHERE to_tsvector(name) @@ to_tsquery('1_francs');
                                 QUERY PLAN                                                    
-------------------------------------------------------------------------------------
 Seq Scan on test_search  (cost=0.00..1037730.00 rows=50 width=18) (actual time=0.036..10297.764 rows=1 loops=1)
   Filter: (to_tsvector(name) @@ to_tsquery('1_francs'::text))
   Rows Removed by Filter: 1999999
 Planning time: 0.098 ms
 Execution time: 10297.787 ms
(5 rows)

由於建立索引時使用的是to_tsvector('english',name)函式索引，帶了兩個引數，因此where條件中的to_tsvector函式帶兩個引數才能走索引，而to_tsvector(name)不走索引。

2、JSON、JSONB全文檢索實踐

在PostgreSQL 10版本之前全文檢索不支援json和jsonb資料型別，10版本的一個重要特性是全文檢索支援json和jsonb資料型別。

10版本與9.6版本to_tsvector函式的差異

先來看下9.6版本to_tsvector函式，如下：

[postgres@pghost1 ~]$ psql francs francs
psql (9.6.3)
Type "help" for help.
 
mydb=> \df *to_tsvector*
                                List of functions
   Schema   |       Name        | Result data type | Argument data types |  Type  
------------+-------------------+------------------+---------------------+--------
 pg_catalog | array_to_tsvector | tsvector         | text[]              | normal
 pg_catalog | to_tsvector       | tsvector         | regconfig, text     | normal
 pg_catalog | to_tsvector       | tsvector         | text                | normal
(3 rows)

從以上看出9.6版本to_tsvector函式的輸入引數僅支援text、text[]資料型別，接著看下10版本的to_tsvector函式，如下所示：

[postgres@pghost1 ~]$ psql mydb pguser
psql (10.0)
Type "help" for help.
mydb=> \df *to_tsvector*
                                List of functions
   Schema   |       Name        | Result data type | Argument data types |  Type  
------------+-------------------+------------------+---------------------+--------
 pg_catalog | array_to_tsvector | tsvector         | text[]              | normal
 pg_catalog | to_tsvector       | tsvector         | json                | normal
 pg_catalog | to_tsvector       | tsvector         | jsonb               | normal
 pg_catalog | to_tsvector       | tsvector         | regconfig, json     | normal
 pg_catalog | to_tsvector       | tsvector         | regconfig, jsonb    | normal
 pg_catalog | to_tsvector       | tsvector         | regconfig, text     | normal
 pg_catalog | to_tsvector       | tsvector         | text                | normal
(7 rows)

從以上看出，10版本的to_tsvector函式支援的資料型別增加了json和jsonb。

建立資料生成函式

為了便於生成測試資料，建立以下兩個函式用來隨機生成指定長度的字串，建立random_range(int4, int4)函式如下：

CREATE OR REPLACE FUNCTION random_range(int4, int4)
RETURNS int4
LANGUAGE SQL
AS $$
    SELECT ($1 + FLOOR(($2 - $1 + 1) * random() ))::int4;
$$;

接著建立random_text_simple(length int4)函式，此函式會呼叫random_range(int4, int4)函式。

CREATE OR REPLACE FUNCTION random_text_simple(length int4)
RETURNS text
LANGUAGE PLPGSQL
AS $$
DECLARE
    possible_chars text := '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ';
    output text := '';
    i int4;
    pos int4;
BEGIN
 
    FOR i IN 1..length LOOP
        pos := random_range(1, length(possible_chars));
        output := output || substr(possible_chars, pos, 1);
    END LOOP;
 
    RETURN output;
END;
$$;

random_text_simple(length int4)函式可以隨機生成指定長度字串，如下隨機生成含三位字元的字串：

mydb=> SELECT random_text_simple(3);
 random_text_simple
--------------------
 LL9
(1 row)

隨機生成含六位字元的字串，如下所示：

mydb=> SELECT random_text_simple(6);
 random_text_simple
--------------------
 B81BPW
(1 row)

後面會用到這個函式生成測試資料。

建立JSON測試表

建立user_ini測試表，並通過random_text_simple(length int4)函式插入100萬隨機生成六位字元的字串測試資料，如下所示：

mydb=> CREATE TABLE user_ini(id int4 ,user_id int8,
user_name character varying(64),
create_time timestamp(6) with time zone default clock_timestamp());
CREATE TABLE
 
mydb=> INSERT INTO user_ini(id,user_id,user_name)
SELECT r,round(random()*1000000), random_text_simple(6)
FROM generate_series(1,1000000) as r;
INSERT 0 1000000

建立tbl_user_search_json表，並通過row_to_json函式將表user_ini行資料轉換成json資料，如下所示：

mydb=> CREATE TABLE tbl_user_search_json(id serial, user_info json);
CREATE TABLE
 
mydb=> INSERT INTO tbl_user_search_json(user_info)
 SELECT row_to_json(user_ini) FROM user_ini;
INSERT 0 1000000

生成的資料如下：

mydb=> SELECT * FROM tbl_user_search_json LIMIT 1;
 id |                                            user_info                                            
----+-----------------------------------------------------------------------------------------------
  1 | {"id":1,"user_id":186536,"user_name":"KTU89H","create_time":"2017-08-05T15:59:25.359148+08:00"}
(1 row)

JSON資料全文檢索測試

使用全文檢索查詢表tbl_user_search_json的user_info欄位中包含KTU89H字元的記錄，如下所示：

mydb=> SELECT * FROM tbl_user_search_json
WHERE to_tsvector('english',user_info) @@ to_tsquery('ENGLISH','KTU89H');
 id |                                            user_info                                            
----+----------------------------------------------------------------------------------------
  1 | {"id":1,"user_id":186536,"user_name":"KTU89H","create_time":"2017-08-05T15:59:25.359148+08:00"}
(1 row)

以上SQL能正常執行說明全文檢索支援json資料型別，只是上述SQL走了全表掃描效能低，執行時間為8061毫秒，如下所示：

mydb=> EXPLAIN ANALYZE SELECT * FROM tbl_user_search_json
 WHERE to_tsvector('english',user_info) @@ to_tsquery('ENGLISH','KTU89H');
                                        QUERY PLAN                                                         
-----------------------------------------------------------------------------------
 Seq Scan on tbl_user_search_json  (cost=0.00..279513.00 rows=5000 width=104) (actual time=0.046..8061.858 rows=1 loops=1)
   Filter: (to_tsvector('english'::regconfig, user_info) @@ '''ktu89h'''::tsquery)
   Rows Removed by Filter: 999999
 Planning time: 0.091 ms
 Execution time: 8061.880 ms
(5 rows)

建立如下索引：

mydb=>  CREATE INDEX idx_gin_search_json ON tbl_user_search_json USING
gin(to_tsvector('english',user_info));
 CREATE INDEX

索引建立後，再次執行以下SQL，如下所示：

mydb=> EXPLAIN ANALYZE SELECT * FROM tbl_user_search_json WHERE to_tsvector('english',user_info) @@ to_tsquery('ENGLISH','KTU89H');
                                        QUERY PLAN                                                           
-------------------------------------------------------------------------------------
 Bitmap Heap Scan on tbl_user_search_json  (cost=50.75..7876.06 rows=5000 width=104) (actual time=0.024..0.024 rows=1 loops=1)
   Recheck Cond: (to_tsvector('english'::regconfig, user_info) @@ '''ktu89h'''::tsquery)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on idx_gin_search_json  (cost=0.00..49.50 rows=5000 width=0) (actual time=0.018..0.018 rows=1 loops=1)
         Index Cond: (to_tsvector('english'::regconfig, user_info) @@ '''ktu89h'''::tsquery)
 Planning time: 0.113 ms
 Execution time: 0.057 ms
(7 rows)

從上述執行計劃看出走了索引，並且執行時間降為0.057毫秒，效能非常不錯。

這一小節前一部分對PostgreSQL全文檢索的實現做了簡單介紹，並且給出了一個英文檢索的例子，後一部分通過示例介紹了PostgreSQL10的一個新特性，即全文檢索支援json、jsonb型別。

四、總結

本文介紹了PostgreSQL的NoSQL特性，首先介紹了json和jsonb資料型別，之後通過示例對比json、jsonb資料型別讀寫效能差異，最後介紹了PostgreSQL全文檢索對json、jsonb型別的支援（PostgreSQL 10新特性）；值得一提的是，PostgreSQL對中文全文檢索也是支援的，有興趣的讀者可自行測試。

PostgreSQL何以支援豐富的NoSQL特性？

一、JSON和JSONB資料型別

您可能也會喜歡…