Elasticsearch phrase search: how to make it more accurate?
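I have used Lucene for a long time, and its releases come quickly. Since ES appeared, I rarely use Lucene directly; most of the work gets done in one place inside the ES framework, and ES now accounts for roughly half of my projects.
ES is powerful, but one problem cannot be avoided when using it: Chinese word segmentation. It is critical, because it directly affects the precision and recall of search results.
Generally speaking, segmentation itself is solved fairly well by now. jieba is widely used, and there are also open-source Chinese segmenters from Tsinghua, Stanford, and Fudan. To use jieba in ES, you need a custom ES analysis plugin that loads jieba into ES.
A few years ago, for a project, I hacked together a simple ES plugin and open-sourced it on GitHub: the jieba analysis plugin for ES. Some users have picked it up, and I have been updating it on and off since.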
The key point, as reading the code will show, is that the following attributes have to be handled while processing tokens:
- CharTermAttribute
- OffsetAttribute
- TypeAttribute
- PositionIncrementAttribute
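The first three represent, respectively, the smallest unit of the segmentation result (the term), the term's offsets (startOffset and endOffset), and the token type, such as word, number, or letter.
The last attribute, PositionIncrementAttribute, is harder to understand. It needs special handling only in particular situations. Most of the time the default behavior is fine, but in those situations some documents will be missed. The rest of this post explains the attribute in detail, using examples to show how it affects results and how to fix the problem.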
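To make the four attributes concrete, here is a minimal Python sketch. It is not the Lucene API: the `Token` class and the `positions` helper are names made up for illustration. It assumes the usual Lucene convention that a token's position is the running sum of position increments, counted from -1, so that a first token with increment 1 lands on position 0.

```python
from dataclasses import dataclass
from itertools import accumulate


@dataclass
class Token:
    """Plain-Python stand-in for the four attributes (hypothetical, not Lucene)."""
    term: str                    # CharTermAttribute
    start_offset: int            # OffsetAttribute (start)
    end_offset: int              # OffsetAttribute (end)
    type: str = "word"           # TypeAttribute
    position_increment: int = 1  # PositionIncrementAttribute


def positions(tokens):
    """Position of each token: running sum of increments, starting from -1."""
    return [p - 1 for p in accumulate(t.position_increment for t in tokens)]


tokens = [
    Token("彙報", 37, 39),
    Token("彙報工作", 37, 41, position_increment=0),  # stacked on the previous token
    Token("工作", 39, 41),
]
print(positions(tokens))  # [0, 0, 1]
```

An increment of 0 puts a token on the same position as the one before it, which is the only way two overlapping terms can share a position.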
Let's first look at the segmentation output. The ES and plugin versions used here are:
- elasticsearch-6.4.0
- elasticsearch-jieba-plugin-6.4.0
With the plugin installed, start ES:
./bin/elasticsearch
Output like the following means the plugin was loaded successfully:
... [2018-10-26T23:04:12,572][INFO ][o.e.p.PluginsService] [z7z-6dR] loaded plugin [analysis-jieba] ...
Prepare a sample document:
現在 高階產品經理\n2003。4-2003。11 產品副經理\n向產品群經理彙報工作\n負責產品為:得普利麻\n2002。5-2003。3 產品副經理\n向產品群經理彙報工作\n負責推廣產品為:精分(思瑞康),麻醉(得普利麻)
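jieba has two segmentation modes:
- index mode: meant for indexing; it emits more terms, including overlapping ones, to favor recall.
- search mode: meant for query text; its tokens do not overlap, which favors precision.
Let's verify the plugin and see how the two modes differ. First, the search mode, using the following command: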
curl -X GET "localhost:9200/_analyze" -H 'Content-Type: application/json' -d' { "tokenizer" : "jieba_search", "text" : "現在 高階產品經理\n2003。4-2003。11 產品副經理\n向產品群經理彙報工作\n負責產品為:得普利麻\n2002。5-2003。3 產品副經理\n向產品群經理彙報工作\n負責推廣產品為:精分(思瑞康),麻醉(得普利麻)" }'
The output:
{ "tokens": [ { "token": "現在", "start_offset": 0, "end_offset": 2, "type": "word", "position": 0 }, { "token": " ", "start_offset": 2, "end_offset": 3, "type": "word", "position": 1 }, { "token": "高階", "start_offset": 3, "end_offset": 5, "type": "word", "position": 2 }, { "token": "產品", "start_offset": 5, "end_offset": 7, "type": "word", "position": 3 }, { "token": "經理", "start_offset": 7, "end_offset": 9, "type": "word", "position": 4 }, { "token": "\n", "start_offset": 9, "end_offset": 10, "type": "word", "position": 5 }, { "token": "2003", "start_offset": 10, "end_offset": 14, "type": "word", "position": 6 }, { "token": "。", "start_offset": 14, "end_offset": 15, "type": "word", "position": 7 }, { "token": "4", "start_offset": 15, "end_offset": 16, "type": "word", "position": 8 }, { "token": "-", "start_offset": 16, "end_offset": 17, "type": "word", "position": 9 }, { "token": "2003", "start_offset": 17, "end_offset": 21, "type": "word", "position": 10 }, { "token": "。", "start_offset": 21, "end_offset": 22, "type": "word", "position": 11 }, { "token": "11", "start_offset": 22, "end_offset": 24, "type": "word", "position": 12 }, { "token": " ", "start_offset": 24, "end_offset": 25, "type": "word", "position": 13 }, { "token": "產品", "start_offset": 25, "end_offset": 27, "type": "word", "position": 14 }, { "token": "副經理", "start_offset": 27, "end_offset": 30, "type": "word", "position": 15 }, { "token": "\n", "start_offset": 30, "end_offset": 31, "type": "word", "position": 16 }, { "token": "向", "start_offset": 31, "end_offset": 32, "type": "word", "position": 17 }, { "token": "產品", "start_offset": 32, "end_offset": 34, "type": "word", "position": 18 }, { "token": "群", "start_offset": 34, "end_offset": 35, "type": "word", "position": 19 }, { "token": "經理", "start_offset": 35, "end_offset": 37, "type": "word", "position": 20 }, { "token": "彙報工作", "start_offset": 37, "end_offset": 41, "type": "word", "position": 21 }, { "token": "\n", "start_offset": 41, 
"end_offset": 42, "type": "word", "position": 22 }, { "token": "負責", "start_offset": 42, "end_offset": 44, "type": "word", "position": 23 }, { "token": "產品", "start_offset": 44, "end_offset": 46, "type": "word", "position": 24 }, { "token": "為", "start_offset": 46, "end_offset": 47, "type": "word", "position": 25 }, { "token": ":", "start_offset": 47, "end_offset": 48, "type": "word", "position": 26 }, { "token": "得", "start_offset": 48, "end_offset": 49, "type": "word", "position": 27 }, { "token": "普利", "start_offset": 49, "end_offset": 51, "type": "word", "position": 28 }, { "token": "麻", "start_offset": 51, "end_offset": 52, "type": "word", "position": 29 }, { "token": "\n", "start_offset": 52, "end_offset": 53, "type": "word", "position": 30 }, { "token": "2002", "start_offset": 53, "end_offset": 57, "type": "word", "position": 31 }, { "token": "。", "start_offset": 57, "end_offset": 58, "type": "word", "position": 32 }, { "token": "5", "start_offset": 58, "end_offset": 59, "type": "word", "position": 33 }, { "token": "-", "start_offset": 59, "end_offset": 60, "type": "word", "position": 34 }, { "token": "2003", "start_offset": 60, "end_offset": 64, "type": "word", "position": 35 }, { "token": "。", "start_offset": 64, "end_offset": 65, "type": "word", "position": 36 }, { "token": "3", "start_offset": 65, "end_offset": 66, "type": "word", "position": 37 }, { "token": " ", "start_offset": 66, "end_offset": 67, "type": "word", "position": 38 }, { "token": "產品", "start_offset": 67, "end_offset": 69, "type": "word", "position": 39 }, { "token": "副經理", "start_offset": 69, "end_offset": 72, "type": "word", "position": 40 }, { "token": "\n", "start_offset": 72, "end_offset": 73, "type": "word", "position": 41 }, { "token": "向", "start_offset": 73, "end_offset": 74, "type": "word", "position": 42 }, { "token": "產品", "start_offset": 74, "end_offset": 76, "type": "word", "position": 43 }, { "token": "群", "start_offset": 76, "end_offset": 77, "type": "word", "position": 44 
}, { "token": "經理", "start_offset": 77, "end_offset": 79, "type": "word", "position": 45 }, { "token": "彙報工作", "start_offset": 79, "end_offset": 83, "type": "word", "position": 46 }, { "token": "\n", "start_offset": 83, "end_offset": 84, "type": "word", "position": 47 }, { "token": "負責", "start_offset": 84, "end_offset": 86, "type": "word", "position": 48 }, { "token": "推廣", "start_offset": 86, "end_offset": 88, "type": "word", "position": 49 }, { "token": "產品", "start_offset": 88, "end_offset": 90, "type": "word", "position": 50 }, { "token": "為", "start_offset": 90, "end_offset": 91, "type": "word", "position": 51 }, { "token": ":", "start_offset": 91, "end_offset": 92, "type": "word", "position": 52 }, { "token": "精分", "start_offset": 92, "end_offset": 94, "type": "word", "position": 53 }, { "token": "(", "start_offset": 94, "end_offset": 95, "type": "word", "position": 54 }, { "token": "思", "start_offset": 95, "end_offset": 96, "type": "word", "position": 55 }, { "token": "瑞康", "start_offset": 96, "end_offset": 98, "type": "word", "position": 56 }, { "token": ")", "start_offset": 98, "end_offset": 99, "type": "word", "position": 57 }, { "token": ",", "start_offset": 99, "end_offset": 100, "type": "word", "position": 58 }, { "token": "麻醉", "start_offset": 100, "end_offset": 102, "type": "word", "position": 59 }, { "token": "(", "start_offset": 102, "end_offset": 103, "type": "word", "position": 60 }, { "token": "得", "start_offset": 103, "end_offset": 104, "type": "word", "position": 61 }, { "token": "普利", "start_offset": 104, "end_offset": 106, "type": "word", "position": 62 }, { "token": "麻", "start_offset": 106, "end_offset": 107, "type": "word", "position": 63 }, { "token": ")", "start_offset": 107, "end_offset": 108, "type": "word", "position": 64 } ]}
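In the result, token corresponds to the term attribute, start_offset and end_offset correspond to the offset attribute, and type is roughly the part of speech. These are all easy to understand. So what does position mean? From observation, position is the relative position of a term/token after segmentation; it encodes order and distance.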
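In this example, 產品 ("product") and 經理 ("manager") in 高階產品經理 are adjacent (positions 3 and 4), with nothing in between. Now consider the following query: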
{ "query": { "match_phrase":{ "field1": { "query": "產品經理", "slop": 0 }}}}
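It finds our sample document. Note that slop defaults to 0 and can be omitted. With a slop of 0, the phrase 產品經理 must appear contiguously in the document, so what matches here is 產品經理 and not 產品|群|經理 (| marks a token boundary). With slop set to 1, 產品|群|經理 matches as well. In other words, slop is the extra gap in position that is tolerated between the query terms.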
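The relationship between slop and position can be sketched in a few lines of Python. This is a deliberately simplified model, not Lucene's sloppy-phrase algorithm: it handles only a two-term phrase whose terms appear in query order, and `min_slop` is a name invented for this illustration. The positions are copied from the search-mode output above.

```python
def min_slop(first_positions, second_positions):
    """Smallest slop for an in-order two-term phrase, or None if never in order."""
    gaps = [b - a - 1
            for a in first_positions
            for b in second_positions
            if b > a]
    return min(gaps) if gaps else None


# Positions copied from the search-mode output above.
adjacent = min_slop([3], [4])       # 產品|經理 in 高階產品經理
one_between = min_slop([18], [20])  # 產品|群|經理
print(adjacent, one_between)  # 0 1
```

Under this model a match_phrase query matches when the value returned is no larger than the slop given in the query.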
Now look at the index mode. It is more complicated, and this is where PositionIncrement comes into play. Using the same text as above:
curl -X GET "localhost:9200/_analyze" -H 'Content-Type: application/json' -d' { "tokenizer" : "jieba_index", "text" : "現在 高階產品經理\n2003。4-2003。11 產品副經理\n向產品群經理彙報工作\n負責產品為:得普利麻\n2002。5-2003。3 產品副經理\n向產品群經理彙報工作\n負責推廣產品為:精分(思瑞康),麻醉(得普利麻)" }'
The result is below; compare it carefully with the search-mode output.
{ "tokens": [ { "token": "現在", "start_offset": 0, "end_offset": 2, "type": "word", "position": 0 }, { "token": " ", "start_offset": 2, "end_offset": 3, "type": "word", "position": 1 }, { "token": "高階", "start_offset": 3, "end_offset": 5, "type": "word", "position": 2 }, { "token": "產品", "start_offset": 5, "end_offset": 7, "type": "word", "position": 3 }, { "token": "經理", "start_offset": 7, "end_offset": 9, "type": "word", "position": 4 }, { "token": "\n", "start_offset": 9, "end_offset": 10, "type": "word", "position": 5 }, { "token": "2003", "start_offset": 10, "end_offset": 14, "type": "word", "position": 6 }, { "token": "。", "start_offset": 14, "end_offset": 15, "type": "word", "position": 7 }, { "token": "4", "start_offset": 15, "end_offset": 16, "type": "word", "position": 8 }, { "token": "-", "start_offset": 16, "end_offset": 17, "type": "word", "position": 9 }, { "token": "2003", "start_offset": 17, "end_offset": 21, "type": "word", "position": 10 }, { "token": "。", "start_offset": 21, "end_offset": 22, "type": "word", "position": 11 }, { "token": "11", "start_offset": 22, "end_offset": 24, "type": "word", "position": 12 }, { "token": " ", "start_offset": 24, "end_offset": 25, "type": "word", "position": 13 }, { "token": "產品", "start_offset": 25, "end_offset": 27, "type": "word", "position": 14 }, { "token": "副經理", "start_offset": 27, "end_offset": 30, "type": "word", "position": 15 }, { "token": "經理", "start_offset": 28, "end_offset": 30, "type": "word", "position": 16 }, { "token": "\n", "start_offset": 30, "end_offset": 31, "type": "word", "position": 17 }, { "token": "向", "start_offset": 31, "end_offset": 32, "type": "word", "position": 18 }, { "token": "產品", "start_offset": 32, "end_offset": 34, "type": "word", "position": 19 }, { "token": "群", "start_offset": 34, "end_offset": 35, "type": "word", "position": 20 }, { "token": "經理", "start_offset": 35, "end_offset": 37, "type": "word", "position": 21 }, { "token": "彙報", "start_offset": 37, "end_offset": 
39, "type": "word", "position": 22 }, { "token": "彙報工作", "start_offset": 37, "end_offset": 41, "type": "word", "position": 22 }, { "token": "工作", "start_offset": 39, "end_offset": 41, "type": "word", "position": 23 }, { "token": "\n", "start_offset": 41, "end_offset": 42, "type": "word", "position": 24 }, { "token": "負責", "start_offset": 42, "end_offset": 44, "type": "word", "position": 25 }, { "token": "產品", "start_offset": 44, "end_offset": 46, "type": "word", "position": 26 }, { "token": "為", "start_offset": 46, "end_offset": 47, "type": "word", "position": 27 }, { "token": ":", "start_offset": 47, "end_offset": 48, "type": "word", "position": 28 }, { "token": "得", "start_offset": 48, "end_offset": 49, "type": "word", "position": 29 }, { "token": "普利", "start_offset": 49, "end_offset": 51, "type": "word", "position": 30 }, { "token": "麻", "start_offset": 51, "end_offset": 52, "type": "word", "position": 31 }, { "token": "\n", "start_offset": 52, "end_offset": 53, "type": "word", "position": 32 }, { "token": "2002", "start_offset": 53, "end_offset": 57, "type": "word", "position": 33 }, { "token": "。", "start_offset": 57, "end_offset": 58, "type": "word", "position": 34 }, { "token": "5", "start_offset": 58, "end_offset": 59, "type": "word", "position": 35 }, { "token": "-", "start_offset": 59, "end_offset": 60, "type": "word", "position": 36 }, { "token": "2003", "start_offset": 60, "end_offset": 64, "type": "word", "position": 37 }, { "token": "。", "start_offset": 64, "end_offset": 65, "type": "word", "position": 38 }, { "token": "3", "start_offset": 65, "end_offset": 66, "type": "word", "position": 39 }, { "token": " ", "start_offset": 66, "end_offset": 67, "type": "word", "position": 40 }, { "token": "產品", "start_offset": 67, "end_offset": 69, "type": "word", "position": 41 }, { "token": "副經理", "start_offset": 69, "end_offset": 72, "type": "word", "position": 42 }, { "token": "經理", "start_offset": 70, "end_offset": 72, "type": "word", "position": 43 }, { 
"token": "\n", "start_offset": 72, "end_offset": 73, "type": "word", "position": 44 }, { "token": "向", "start_offset": 73, "end_offset": 74, "type": "word", "position": 45 }, { "token": "產品", "start_offset": 74, "end_offset": 76, "type": "word", "position": 46 }, { "token": "群", "start_offset": 76, "end_offset": 77, "type": "word", "position": 47 }, { "token": "經理", "start_offset": 77, "end_offset": 79, "type": "word", "position": 48 }, { "token": "彙報", "start_offset": 79, "end_offset": 81, "type": "word", "position": 49 }, { "token": "彙報工作", "start_offset": 79, "end_offset": 83, "type": "word", "position": 49 }, { "token": "工作", "start_offset": 81, "end_offset": 83, "type": "word", "position": 50 }, { "token": "\n", "start_offset": 83, "end_offset": 84, "type": "word", "position": 51 }, { "token": "負責", "start_offset": 84, "end_offset": 86, "type": "word", "position": 52 }, { "token": "推廣", "start_offset": 86, "end_offset": 88, "type": "word", "position": 53 }, { "token": "產品", "start_offset": 88, "end_offset": 90, "type": "word", "position": 54 }, { "token": "為", "start_offset": 90, "end_offset": 91, "type": "word", "position": 55 }, { "token": ":", "start_offset": 91, "end_offset": 92, "type": "word", "position": 56 }, { "token": "精分", "start_offset": 92, "end_offset": 94, "type": "word", "position": 57 }, { "token": "(", "start_offset": 94, "end_offset": 95, "type": "word", "position": 58 }, { "token": "思", "start_offset": 95, "end_offset": 96, "type": "word", "position": 59 }, { "token": "瑞康", "start_offset": 96, "end_offset": 98, "type": "word", "position": 60 }, { "token": ")", "start_offset": 98, "end_offset": 99, "type": "word", "position": 61 }, { "token": ",", "start_offset": 99, "end_offset": 100, "type": "word", "position": 62 }, { "token": "麻醉", "start_offset": 100, "end_offset": 102, "type": "word", "position": 63 }, { "token": "(", "start_offset": 102, "end_offset": 103, "type": "word", "position": 64 }, { "token": "得", "start_offset": 103, 
"end_offset": 104, "type": "word", "position": 65 }, { "token": "普利", "start_offset": 104, "end_offset": 106, "type": "word", "position": 66 }, { "token": "麻", "start_offset": 106, "end_offset": 107, "type": "word", "position": 67 }, { "token": ")", "start_offset": 107, "end_offset": 108, "type": "word", "position": 68 } ]}
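Because of index mode, 產品副經理 ("deputy product manager") is split into 產品|副經理|經理. Sensible position values matter a great deal here. With the latest implementation of my plugin, the positions are 14, 15, and 16 respectively. This is correct, because it makes the results below come out right.
When we run the following search: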
{ "query": { "match_phrase":{ "field1": { "query": "產品經理" }}}, "highlight" : { "fields" : { "field1" : {}}}}
This hits our sample text: the contiguous 產品經理 matches and is highlighted, but 產品副經理 does not match and is not highlighted.
Now look at this example:
{ "query": { "match_phrase":{ "field1": { "query": "產品經理", "slop": 2 }}}, "highlight" : { "fields" : { "field1" : {}}}}
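Here the contiguous 產品經理 again matches and is highlighted. In addition, 產品副經理 matches, with 產品 and 經理 highlighted separately. The difference between these two examples deserves careful thought.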
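So how do we get position right? The key is how PositionIncrementAttribute is handled. With search-mode-style segmentation we normally run into no problems, even with the default PositionIncrementAttribute behavior: each token produced by the segmenter adds 1, and that running count is the position.
The default behavior does break down, however, in a case like the following.
Sample text:
中國人民解放軍勝利了。
With the default behavior, the output is: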
{ "tokens": [ { "token": "中國", "start_offset": 0, "end_offset": 2, "type": "word", "position": 0 }, { "token": "中國人", "start_offset": 0, "end_offset": 3, "type": "word", "position": 1 }, { "token": "中國人民解放軍", "start_offset": 0, "end_offset": 7, "type": "word", "position": 2 }, { "token": "國人", "start_offset": 1, "end_offset": 3, "type": "word", "position": 4 }, { "token": "人民", "start_offset": 2, "end_offset": 4, "type": "word", "position": 5 }, { "token": "解放", "start_offset": 4, "end_offset": 6, "type": "word", "position": 6 }, { "token": "解放軍", "start_offset": 4, "end_offset": 7, "type": "word", "position": 7 }, { "token": "勝利", "start_offset": 7, "end_offset": 9, "type": "word", "position": 8 }, { "token": "了", "start_offset": 9, "end_offset": 10, "type": "word", "position": 9 } ]}
With these position values, the following query fails to find the sample document, so data appears to go missing.
{ "query": { "match_phrase":{ "field1": { "query": "中國人民" }}}, "highlight" : { "fields" : { "field1" : {}}}}
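中國人民 is contiguous in the sample, but because of how position was assigned (中國 at 0, 人民 at 5), the slop needed has effectively become 4. The query therefore has to specify a fairly large slop to return the right document, and even then ranking is affected.
Here is the result with correct position values: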
{ "tokens": [ { "token": "中國", "start_offset": 0, "end_offset": 2, "type": "word", "position": 0 }, { "token": "中國人", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0 }, { "token": "中國人民解放軍", "start_offset": 0, "end_offset": 7, "type": "word", "position": 0 }, { "token": "國人", "start_offset": 1, "end_offset": 3, "type": "word", "position": 0 }, { "token": "人民", "start_offset": 2, "end_offset": 4, "type": "word", "position": 1 }, { "token": "解放", "start_offset": 4, "end_offset": 6, "type": "word", "position": 2 }, { "token": "解放軍", "start_offset": 4, "end_offset": 7, "type": "word", "position": 2 }, { "token": "勝利", "start_offset": 7, "end_offset": 9, "type": "word", "position": 3 }, { "token": "了", "start_offset": 9, "end_offset": 10, "type": "word", "position": 4 } ]}
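Here 中國 is at 0 and 人民 is at 1, so the phrase matches.
Essentially, when handling each token you have to decide whether its position increment is 1 or 0. Lucene's mechanism here is not great: it constrains the tokenizer implementation quite a lot and only considers English. The current implementation favors recall; in a few rare cases there are still some precision problems.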
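One way to make the 1-or-0 decision is from the offsets alone. The sketch below is a hypothetical heuristic, not the plugin's actual code: a token advances the position only if it starts at or after the end of the token that opened the current position; otherwise it is stacked with increment 0. The function names are invented for illustration. On the 中國人民解放軍勝利了 example it reproduces the corrected positions listed above, but it is not guaranteed to agree with what the plugin reports for other inputs.

```python
def position_increments(spans):
    """Hypothetical rule: advance only when a token starts at or after the end
    of the token that opened the current position."""
    increments = []
    anchor_end = None  # end offset of the token that opened the current position
    for start, end in spans:
        if anchor_end is None or start >= anchor_end:
            increments.append(1)
            anchor_end = end
        else:
            increments.append(0)  # overlaps the current position: stack it
    return increments


def to_positions(increments):
    """Running sum of increments, starting from -1."""
    result, position = [], -1
    for inc in increments:
        position += inc
        result.append(position)
    return result


# (start_offset, end_offset) pairs copied from the 中國人民解放軍勝利了 listing.
spans = [(0, 2), (0, 3), (0, 7), (1, 3), (2, 4), (4, 6), (4, 7), (7, 9), (9, 10)]
print(to_positions(position_increments(spans)))
# [0, 0, 0, 0, 1, 2, 2, 3, 4]
```

The rule leans toward recall: stacking overlapping tokens on one position lets more phrases match at slop 0, at the cost of occasionally matching phrases that are not literally contiguous.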
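From another angle, the problem can be tackled at the segmentation level: the segmenter should offer a finest-grained, non-overlapping segmentation, which would work better for indexing. With that, the default PositionIncrement would also be good enough. Next, I will look into whether jieba can be modified to support a third segmentation mode: finest-grained, non-overlapping segmentation.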