Elasticsearch phrase search: how to make it more accurate?

ElasticSearch · Chinese word segmentation · Published 2018-10-30 07:08:19

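I have used Lucene for a long time, and its releases come quickly. Since ES (Elasticsearch) appeared, I rarely use Lucene directly; most of the work is now done in one stop inside the ES framework, and ES accounts for almost half of our projects.

ES is powerful, but one problem cannot be avoided when using it: Chinese word segmentation. It is a crucial problem because it directly affects the precision and recall of search results.

Generally speaking, word segmentation itself is fairly well solved today. jieba is the most widely used segmenter, and there are also open-source Chinese segmenters from Tsinghua, Stanford, and Fudan. To use jieba in ES, you need a custom ES analysis plugin that loads jieba into ES.

A few years ago, for a project, I wrote a simple ES plugin and open-sourced it on GitHub: the jieba analysis plugin for ES. Some users have adopted it, and I have kept updating it on and off.

The key point, as you will see from reading the code, is that the following attributes must be handled while processing tokens: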

CharTermAttribute
OffsetAttribute
TypeAttribute
PositionIncrementAttribute

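The first three represent, respectively, the smallest unit of the segmentation result (the term), the token's offsets (startOffset and endOffset), and the token's type, for example word, number, or letter.

The last attribute, PositionIncrementAttribute, is harder to understand. In most cases the default behavior is fine, but certain situations need special handling; without it, some documents will be missed by search. Below we explain this attribute in detail, use examples to show how it affects results, and show how to fix the problem.

First, let us look at the segmentation output. The ES and plugin versions used are: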

elasticsearch-6.4.0
elasticsearch-jieba-plugin-6.4.0

Install the plugin and start ES:

./bin/elasticsearch

The following output indicates that the plugin loaded successfully:

...
[2018-10-26T23:04:12,572][INFO ][o.e.p.PluginsService] [z7z-6dR] loaded plugin [analysis-jieba]
...

Prepare the sample document:

現在 高階產品經理\n2003。4-2003。11 產品副經理\n向產品群經理彙報工作\n負責產品為：得普利麻\n2002。5-2003。3 產品副經理\n向產品群經理彙報工作\n負責推廣產品為：精分（思瑞康），麻醉（得普利麻）

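jieba has two segmentation modes:

index mode: suited to segmentation at index time; it emits more terms and favors recall.
search mode: suited to segmentation at query time; the tokens do not overlap, and it favors precision.

Let us verify the plugin and the effect of the two modes. First, look at search mode with the following command: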

curl -X GET "localhost:9200/_analyze" -H 'Content-Type: application/json' -d' { "tokenizer" : "jieba_search", "text" : "現在 高階產品經理\n2003。4-2003。11 產品副經理\n向產品群經理彙報工作\n負責產品為：得普利麻\n2002。5-2003。3 產品副經理\n向產品群經理彙報工作\n負責推廣產品為：精分（思瑞康），麻醉（得普利麻）" }'
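The same request can also be built from a script, which lets the JSON library take care of quoting and escaping the text. The sketch below is only an illustration: the helper names `build_analyze_request` and `fetch_tokens` and the default host are assumptions, not part of the plugin, and it uses POST, which the `_analyze` endpoint accepts in addition to GET. Sending the request requires a running ES node with the plugin installed.

```python
import json
import urllib.request


def build_analyze_request(tokenizer, text, host="http://localhost:9200"):
    """Build (but do not send) a request for the _analyze API.

    `host` is an assumed default; adjust it to your own node.
    """
    body = json.dumps({"tokenizer": tokenizer, "text": text}, ensure_ascii=False)
    return urllib.request.Request(
        url=f"{host}/_analyze",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_tokens(request):
    """Send the request and return the list of token dicts (needs a running node)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)["tokens"]


# A shortened piece of the sample text; "\n" here is a real newline character,
# which json.dumps escapes for us.
req = build_analyze_request("jieba_search", "現在 高階產品經理\n2003。4-2003。11 產品副經理")
print(req.full_url)
print(req.data.decode("utf-8"))
```

Calling `fetch_tokens(req)` against a live node should return the same kind of `tokens` list as the curl command.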

Check the output:

{
"tokens": [
{
"token": "現在",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 0
},
{
"token": " ",
"start_offset": 2,
"end_offset": 3,
"type": "word",
"position": 1
},
{
"token": "高階",
"start_offset": 3,
"end_offset": 5,
"type": "word",
"position": 2
},
{
"token": "產品",
"start_offset": 5,
"end_offset": 7,
"type": "word",
"position": 3
},
{
"token": "經理",
"start_offset": 7,
"end_offset": 9,
"type": "word",
"position": 4
},
{
"token": "\n",
"start_offset": 9,
"end_offset": 10,
"type": "word",
"position": 5
},
{
"token": "2003",
"start_offset": 10,
"end_offset": 14,
"type": "word",
"position": 6
},
{
"token": "。",
"start_offset": 14,
"end_offset": 15,
"type": "word",
"position": 7
},
{
"token": "4",
"start_offset": 15,
"end_offset": 16,
"type": "word",
"position": 8
},
{
"token": "-",
"start_offset": 16,
"end_offset": 17,
"type": "word",
"position": 9
},
{
"token": "2003",
"start_offset": 17,
"end_offset": 21,
"type": "word",
"position": 10
},
{
"token": "。",
"start_offset": 21,
"end_offset": 22,
"type": "word",
"position": 11
},
{
"token": "11",
"start_offset": 22,
"end_offset": 24,
"type": "word",
"position": 12
},
{
"token": " ",
"start_offset": 24,
"end_offset": 25,
"type": "word",
"position": 13
},
{
"token": "產品",
"start_offset": 25,
"end_offset": 27,
"type": "word",
"position": 14
},
{
"token": "副經理",
"start_offset": 27,
"end_offset": 30,
"type": "word",
"position": 15
},
{
"token": "\n",
"start_offset": 30,
"end_offset": 31,
"type": "word",
"position": 16
},
{
"token": "向",
"start_offset": 31,
"end_offset": 32,
"type": "word",
"position": 17
},
{
"token": "產品",
"start_offset": 32,
"end_offset": 34,
"type": "word",
"position": 18
},
{
"token": "群",
"start_offset": 34,
"end_offset": 35,
"type": "word",
"position": 19
},
{
"token": "經理",
"start_offset": 35,
"end_offset": 37,
"type": "word",
"position": 20
},
{
"token": "彙報工作",
"start_offset": 37,
"end_offset": 41,
"type": "word",
"position": 21
},
{
"token": "\n",
"start_offset": 41,
"end_offset": 42,
"type": "word",
"position": 22
},
{
"token": "負責",
"start_offset": 42,
"end_offset": 44,
"type": "word",
"position": 23
},
{
"token": "產品",
"start_offset": 44,
"end_offset": 46,
"type": "word",
"position": 24
},
{
"token": "為",
"start_offset": 46,
"end_offset": 47,
"type": "word",
"position": 25
},
{
"token": "：",
"start_offset": 47,
"end_offset": 48,
"type": "word",
"position": 26
},
{
"token": "得",
"start_offset": 48,
"end_offset": 49,
"type": "word",
"position": 27
},
{
"token": "普利",
"start_offset": 49,
"end_offset": 51,
"type": "word",
"position": 28
},
{
"token": "麻",
"start_offset": 51,
"end_offset": 52,
"type": "word",
"position": 29
},
{
"token": "\n",
"start_offset": 52,
"end_offset": 53,
"type": "word",
"position": 30
},
{
"token": "2002",
"start_offset": 53,
"end_offset": 57,
"type": "word",
"position": 31
},
{
"token": "。",
"start_offset": 57,
"end_offset": 58,
"type": "word",
"position": 32
},
{
"token": "5",
"start_offset": 58,
"end_offset": 59,
"type": "word",
"position": 33
},
{
"token": "-",
"start_offset": 59,
"end_offset": 60,
"type": "word",
"position": 34
},
{
"token": "2003",
"start_offset": 60,
"end_offset": 64,
"type": "word",
"position": 35
},
{
"token": "。",
"start_offset": 64,
"end_offset": 65,
"type": "word",
"position": 36
},
{
"token": "3",
"start_offset": 65,
"end_offset": 66,
"type": "word",
"position": 37
},
{
"token": " ",
"start_offset": 66,
"end_offset": 67,
"type": "word",
"position": 38
},
{
"token": "產品",
"start_offset": 67,
"end_offset": 69,
"type": "word",
"position": 39
},
{
"token": "副經理",
"start_offset": 69,
"end_offset": 72,
"type": "word",
"position": 40
},
{
"token": "\n",
"start_offset": 72,
"end_offset": 73,
"type": "word",
"position": 41
},
{
"token": "向",
"start_offset": 73,
"end_offset": 74,
"type": "word",
"position": 42
},
{
"token": "產品",
"start_offset": 74,
"end_offset": 76,
"type": "word",
"position": 43
},
{
"token": "群",
"start_offset": 76,
"end_offset": 77,
"type": "word",
"position": 44
},
{
"token": "經理",
"start_offset": 77,
"end_offset": 79,
"type": "word",
"position": 45
},
{
"token": "彙報工作",
"start_offset": 79,
"end_offset": 83,
"type": "word",
"position": 46
},
{
"token": "\n",
"start_offset": 83,
"end_offset": 84,
"type": "word",
"position": 47
},
{
"token": "負責",
"start_offset": 84,
"end_offset": 86,
"type": "word",
"position": 48
},
{
"token": "推廣",
"start_offset": 86,
"end_offset": 88,
"type": "word",
"position": 49
},
{
"token": "產品",
"start_offset": 88,
"end_offset": 90,
"type": "word",
"position": 50
},
{
"token": "為",
"start_offset": 90,
"end_offset": 91,
"type": "word",
"position": 51
},
{
"token": "：",
"start_offset": 91,
"end_offset": 92,
"type": "word",
"position": 52
},
{
"token": "精分",
"start_offset": 92,
"end_offset": 94,
"type": "word",
"position": 53
},
{
"token": "（",
"start_offset": 94,
"end_offset": 95,
"type": "word",
"position": 54
},
{
"token": "思",
"start_offset": 95,
"end_offset": 96,
"type": "word",
"position": 55
},
{
"token": "瑞康",
"start_offset": 96,
"end_offset": 98,
"type": "word",
"position": 56
},
{
"token": "）",
"start_offset": 98,
"end_offset": 99,
"type": "word",
"position": 57
},
{
"token": "，",
"start_offset": 99,
"end_offset": 100,
"type": "word",
"position": 58
},
{
"token": "麻醉",
"start_offset": 100,
"end_offset": 102,
"type": "word",
"position": 59
},
{
"token": "（",
"start_offset": 102,
"end_offset": 103,
"type": "word",
"position": 60
},
{
"token": "得",
"start_offset": 103,
"end_offset": 104,
"type": "word",
"position": 61
},
{
"token": "普利",
"start_offset": 104,
"end_offset": 106,
"type": "word",
"position": 62
},
{
"token": "麻",
"start_offset": 106,
"end_offset": 107,
"type": "word",
"position": 63
},
{
"token": "）",
"start_offset": 107,
"end_offset": 108,
"type": "word",
"position": 64
}
]}

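In the result, token corresponds to the term attribute, start_offset and end_offset correspond to the offset attribute, and type is similar to a part-of-speech tag. These are easy to understand, but what does position mean? From observation:

position is the relative position of each term/token after segmentation; it represents order and distance.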

In this example, 產品 (position 3) and 經理 (position 4) are adjacent, with no gap between them. This means that the following query

{
"query": {
"match_phrase":{
"field1": {
"query": "產品經理",
"slop": 0 
}}}}

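can find our sample document. Note that slop defaults to 0 and may be omitted. When slop is 0, the phrase 產品經理 must appear contiguously in the document, so the hit is 產品經理 and not 產品|群|經理 (| marks a token boundary). If slop is set to 1, 產品|群|經理 matches as well. In other words, slop is the allowed gap in position between the terms of the phrase.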
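The relationship between position and slop can be sketched in a few lines. This is a simplified model, not Lucene's actual phrase matcher: it only considers the two terms in query order and treats the number of positions between them as the required slop. The function names are hypothetical; the position lists are copied from the search-mode output above.

```python
def phrase_gaps(first_positions, second_positions):
    """For each occurrence of the first term, pair it with the nearest later
    occurrence of the second term and report the gap between them.

    gap == 0 means the two tokens are adjacent.
    """
    pairs = []
    for a in first_positions:
        later = [b for b in second_positions if b > a]
        if later:
            b = min(later)
            pairs.append((a, b, b - a - 1))
    return pairs


def matching_pairs(pairs, slop=0):
    """Keep only the pairs whose gap fits within the given slop."""
    return [(a, b) for a, b, gap in pairs if gap <= slop]


# Positions of 產品 and 經理 in the search-mode output above.
product = [3, 14, 18, 24, 39, 43, 50]
manager = [4, 20, 45]

pairs = phrase_gaps(product, manager)
print(matching_pairs(pairs, slop=0))  # [(3, 4)]
print(matching_pairs(pairs, slop=1))  # [(3, 4), (18, 20), (43, 45)]
```

With slop 0 only the contiguous 產品經理 at positions 3 and 4 qualifies; with slop 1 the two occurrences of 產品|群|經理 qualify as well.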

Now look at index mode, which is more complex; this is where PositionIncrement comes into play. Using the same text:

curl -X GET "localhost:9200/_analyze" -H 'Content-Type: application/json' -d' { "tokenizer" : "jieba_index", "text" : "現在 高階產品經理\n2003。4-2003。11 產品副經理\n向產品群經理彙報工作\n負責產品為：得普利麻\n2002。5-2003。3 產品副經理\n向產品群經理彙報工作\n負責推廣產品為：精分（思瑞康），麻醉（得普利麻）" }'

The result is below; compare it carefully with the search-mode output.

{
"tokens": [
{
"token": "現在",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 0
},
{
"token": " ",
"start_offset": 2,
"end_offset": 3,
"type": "word",
"position": 1
},
{
"token": "高階",
"start_offset": 3,
"end_offset": 5,
"type": "word",
"position": 2
},
{
"token": "產品",
"start_offset": 5,
"end_offset": 7,
"type": "word",
"position": 3
},
{
"token": "經理",
"start_offset": 7,
"end_offset": 9,
"type": "word",
"position": 4
},
{
"token": "\n",
"start_offset": 9,
"end_offset": 10,
"type": "word",
"position": 5
},
{
"token": "2003",
"start_offset": 10,
"end_offset": 14,
"type": "word",
"position": 6
},
{
"token": "。",
"start_offset": 14,
"end_offset": 15,
"type": "word",
"position": 7
},
{
"token": "4",
"start_offset": 15,
"end_offset": 16,
"type": "word",
"position": 8
},
{
"token": "-",
"start_offset": 16,
"end_offset": 17,
"type": "word",
"position": 9
},
{
"token": "2003",
"start_offset": 17,
"end_offset": 21,
"type": "word",
"position": 10
},
{
"token": "。",
"start_offset": 21,
"end_offset": 22,
"type": "word",
"position": 11
},
{
"token": "11",
"start_offset": 22,
"end_offset": 24,
"type": "word",
"position": 12
},
{
"token": " ",
"start_offset": 24,
"end_offset": 25,
"type": "word",
"position": 13
},
{
"token": "產品",
"start_offset": 25,
"end_offset": 27,
"type": "word",
"position": 14
},
{
"token": "副經理",
"start_offset": 27,
"end_offset": 30,
"type": "word",
"position": 15
},
{
"token": "經理",
"start_offset": 28,
"end_offset": 30,
"type": "word",
"position": 16
},
{
"token": "\n",
"start_offset": 30,
"end_offset": 31,
"type": "word",
"position": 17
},
{
"token": "向",
"start_offset": 31,
"end_offset": 32,
"type": "word",
"position": 18
},
{
"token": "產品",
"start_offset": 32,
"end_offset": 34,
"type": "word",
"position": 19
},
{
"token": "群",
"start_offset": 34,
"end_offset": 35,
"type": "word",
"position": 20
},
{
"token": "經理",
"start_offset": 35,
"end_offset": 37,
"type": "word",
"position": 21
},
{
"token": "彙報",
"start_offset": 37,
"end_offset": 39,
"type": "word",
"position": 22
},
{
"token": "彙報工作",
"start_offset": 37,
"end_offset": 41,
"type": "word",
"position": 22
},
{
"token": "工作",
"start_offset": 39,
"end_offset": 41,
"type": "word",
"position": 23
},
{
"token": "\n",
"start_offset": 41,
"end_offset": 42,
"type": "word",
"position": 24
},
{
"token": "負責",
"start_offset": 42,
"end_offset": 44,
"type": "word",
"position": 25
},
{
"token": "產品",
"start_offset": 44,
"end_offset": 46,
"type": "word",
"position": 26
},
{
"token": "為",
"start_offset": 46,
"end_offset": 47,
"type": "word",
"position": 27
},
{
"token": "：",
"start_offset": 47,
"end_offset": 48,
"type": "word",
"position": 28
},
{
"token": "得",
"start_offset": 48,
"end_offset": 49,
"type": "word",
"position": 29
},
{
"token": "普利",
"start_offset": 49,
"end_offset": 51,
"type": "word",
"position": 30
},
{
"token": "麻",
"start_offset": 51,
"end_offset": 52,
"type": "word",
"position": 31
},
{
"token": "\n",
"start_offset": 52,
"end_offset": 53,
"type": "word",
"position": 32
},
{
"token": "2002",
"start_offset": 53,
"end_offset": 57,
"type": "word",
"position": 33
},
{
"token": "。",
"start_offset": 57,
"end_offset": 58,
"type": "word",
"position": 34
},
{
"token": "5",
"start_offset": 58,
"end_offset": 59,
"type": "word",
"position": 35
},
{
"token": "-",
"start_offset": 59,
"end_offset": 60,
"type": "word",
"position": 36
},
{
"token": "2003",
"start_offset": 60,
"end_offset": 64,
"type": "word",
"position": 37
},
{
"token": "。",
"start_offset": 64,
"end_offset": 65,
"type": "word",
"position": 38
},
{
"token": "3",
"start_offset": 65,
"end_offset": 66,
"type": "word",
"position": 39
},
{
"token": " ",
"start_offset": 66,
"end_offset": 67,
"type": "word",
"position": 40
},
{
"token": "產品",
"start_offset": 67,
"end_offset": 69,
"type": "word",
"position": 41
},
{
"token": "副經理",
"start_offset": 69,
"end_offset": 72,
"type": "word",
"position": 42
},
{
"token": "經理",
"start_offset": 70,
"end_offset": 72,
"type": "word",
"position": 43
},
{
"token": "\n",
"start_offset": 72,
"end_offset": 73,
"type": "word",
"position": 44
},
{
"token": "向",
"start_offset": 73,
"end_offset": 74,
"type": "word",
"position": 45
},
{
"token": "產品",
"start_offset": 74,
"end_offset": 76,
"type": "word",
"position": 46
},
{
"token": "群",
"start_offset": 76,
"end_offset": 77,
"type": "word",
"position": 47
},
{
"token": "經理",
"start_offset": 77,
"end_offset": 79,
"type": "word",
"position": 48
},
{
"token": "彙報",
"start_offset": 79,
"end_offset": 81,
"type": "word",
"position": 49
},
{
"token": "彙報工作",
"start_offset": 79,
"end_offset": 83,
"type": "word",
"position": 49
},
{
"token": "工作",
"start_offset": 81,
"end_offset": 83,
"type": "word",
"position": 50
},
{
"token": "\n",
"start_offset": 83,
"end_offset": 84,
"type": "word",
"position": 51
},
{
"token": "負責",
"start_offset": 84,
"end_offset": 86,
"type": "word",
"position": 52
},
{
"token": "推廣",
"start_offset": 86,
"end_offset": 88,
"type": "word",
"position": 53
},
{
"token": "產品",
"start_offset": 88,
"end_offset": 90,
"type": "word",
"position": 54
},
{
"token": "為",
"start_offset": 90,
"end_offset": 91,
"type": "word",
"position": 55
},
{
"token": "：",
"start_offset": 91,
"end_offset": 92,
"type": "word",
"position": 56
},
{
"token": "精分",
"start_offset": 92,
"end_offset": 94,
"type": "word",
"position": 57
},
{
"token": "（",
"start_offset": 94,
"end_offset": 95,
"type": "word",
"position": 58
},
{
"token": "思",
"start_offset": 95,
"end_offset": 96,
"type": "word",
"position": 59
},
{
"token": "瑞康",
"start_offset": 96,
"end_offset": 98,
"type": "word",
"position": 60
},
{
"token": "）",
"start_offset": 98,
"end_offset": 99,
"type": "word",
"position": 61
},
{
"token": "，",
"start_offset": 99,
"end_offset": 100,
"type": "word",
"position": 62
},
{
"token": "麻醉",
"start_offset": 100,
"end_offset": 102,
"type": "word",
"position": 63
},
{
"token": "（",
"start_offset": 102,
"end_offset": 103,
"type": "word",
"position": 64
},
{
"token": "得",
"start_offset": 103,
"end_offset": 104,
"type": "word",
"position": 65
},
{
"token": "普利",
"start_offset": 104,
"end_offset": 106,
"type": "word",
"position": 66
},
{
"token": "麻",
"start_offset": 106,
"end_offset": 107,
"type": "word",
"position": 67
},
{
"token": "）",
"start_offset": 107,
"end_offset": 108,
"type": "word",
"position": 68
}
]}

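Because of index mode, 產品副經理 is split into 產品|副經理|經理. A sensible position assignment becomes very important here. With the latest implementation of my plugin, the positions are 14, 15, and 16 respectively. This is correct, because it produces the right results for the queries below.

When we run the following search: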

{
"query": {
"match_phrase":{
"field1": {
"query": "產品經理"
}}},
"highlight" : {
"fields" : {
"field1" : {}}}}

it hits our sample text: the contiguous 產品經理 matches and is highlighted, but 產品副經理 does not match and is not highlighted.

Now consider this example:

{
"query": {
"match_phrase":{
"field1": {
"query": "產品經理",
"slop": 2
}}},
"highlight" : {
"fields" : {
"field1" : {}}}}

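Here the contiguous 產品經理 matches and is highlighted; in addition, 產品副經理 matches, with 產品 and 經理 highlighted separately. The difference between these two examples is worth studying carefully.

So how do we handle position correctly? The key lies in PositionIncrementAttribute. With a segmentation like search mode we normally do not run into problems, even with the default PositionIncrementAttribute behavior: each token produced by the segmenter adds 1, which yields its position.

The default behavior, however, runs into problems in the following case.

Sample text: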

中國人民解放軍勝利了。

With the default behavior, the output is:

{
"tokens": [
{
"token": "中國",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 0
},
{
"token": "中國人",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 1
},
{
"token": "中國人民解放軍",
"start_offset": 0,
"end_offset": 7,
"type": "word",
"position": 2
},
{
"token": "國人",
"start_offset": 1,
"end_offset": 3,
"type": "word",
"position": 4
},
{
"token": "人民",
"start_offset": 2,
"end_offset": 4,
"type": "word",
"position": 5
},
{
"token": "解放",
"start_offset": 4,
"end_offset": 6,
"type": "word",
"position": 6
},
{
"token": "解放軍",
"start_offset": 4,
"end_offset": 7,
"type": "word",
"position": 7
},
{
"token": "勝利",
"start_offset": 7,
"end_offset": 9,
"type": "word",
"position": 8
},
{
"token": "了",
"start_offset": 9,
"end_offset": 10,
"type": "word",
"position": 9
}
]}

With these positions, the following query cannot find the sample document, so the document is effectively lost from the results.

{
"query": {
"match_phrase":{
"field1": {
"query": "中國人民"
}}},
"highlight" : {
"fields" : {
"field1" : {}}}}

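中國 and 人民 are contiguous in the sample text, but because of how position was assigned, the slop needed to match them has become 4. The query must therefore specify a fairly large slop to return the right document, and ranking is affected as well.

Here is the result with correct positions.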

{
"tokens": [
{
"token": "中國",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 0
},
{
"token": "中國人",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "中國人民解放軍",
"start_offset": 0,
"end_offset": 7,
"type": "word",
"position": 0
},
{
"token": "國人",
"start_offset": 1,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "人民",
"start_offset": 2,
"end_offset": 4,
"type": "word",
"position": 1
},
{
"token": "解放",
"start_offset": 4,
"end_offset": 6,
"type": "word",
"position": 2
},
{
"token": "解放軍",
"start_offset": 4,
"end_offset": 7,
"type": "word",
"position": 2
},
{
"token": "勝利",
"start_offset": 7,
"end_offset": 9,
"type": "word",
"position": 3
},
{
"token": "了",
"start_offset": 9,
"end_offset": 10,
"type": "word",
"position": 4
}
]}

Here 中國 is at position 0 and 人民 is at position 1, so the phrase matches.
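The two outputs can be compared directly. The sketch below uses the same simplified in-order model of slop as before and assumes the query 中國人民 is analyzed into the two tokens 中國 and 人民; `required_slop` is a hypothetical helper, and the (token, position) pairs are copied from the two outputs above.

```python
def required_slop(tokens, first, second):
    """Smallest slop that lets `first` be followed by `second`, or None if
    `second` never appears after `first`."""
    first_positions = [pos for tok, pos in tokens if tok == first]
    second_positions = [pos for tok, pos in tokens if tok == second]
    gaps = [b - a - 1 for a in first_positions for b in second_positions if b > a]
    return min(gaps) if gaps else None


# (token, position) pairs from the output with default position handling.
default_tokens = [
    ("中國", 0), ("中國人", 1), ("中國人民解放軍", 2), ("國人", 4), ("人民", 5),
    ("解放", 6), ("解放軍", 7), ("勝利", 8), ("了", 9),
]

# (token, position) pairs from the output with corrected positions.
corrected_tokens = [
    ("中國", 0), ("中國人", 0), ("中國人民解放軍", 0), ("國人", 0), ("人民", 1),
    ("解放", 2), ("解放軍", 2), ("勝利", 3), ("了", 4),
]

print(required_slop(default_tokens, "中國", "人民"))    # 4
print(required_slop(corrected_tokens, "中國", "人民"))  # 0
```

With the default positions a match_phrase query needs a slop of at least 4, whereas the corrected positions match with the default slop of 0.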

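Basically, when processing each token you have to decide whether its `positionIncrement` is 1 or 0. Lucene's mechanism here is not ideal: it places many constraints on how a segmenter can be implemented, and it was designed with only English in mind. The current implementation prioritizes recall, so in rare cases there can still be some precision problems.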
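To make the 1-or-0 decision concrete: in Lucene, a token's position is the running sum of position increments, counted from -1, so an increment of 0 stacks a token on the same position as the previous one. The sketch below reproduces the corrected positions shown above; the list of increments is read off that output rather than taken from the plugin's code, and `positions_from_increments` is a hypothetical helper.

```python
from itertools import accumulate


def positions_from_increments(increments):
    """Positions start at -1 and each token adds its own increment."""
    return [total - 1 for total in accumulate(increments)]


tokens = ["中國", "中國人", "中國人民解放軍", "國人", "人民", "解放", "解放軍", "勝利", "了"]

# Increment 1 starts a new position; increment 0 shares the previous one.
increments = [1, 0, 0, 0, 1, 1, 0, 1, 1]

for token, position in zip(tokens, positions_from_increments(increments)):
    print(position, token)
# positions: [0, 0, 0, 0, 1, 2, 2, 3, 4]
```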

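The problem can also be approached from the segmentation side: the segmenter should provide a finest-grained, non-overlapping segmentation, which works better for indexing. In that case the default PositionIncrement is sufficient. Next, I will look at whether jieba can be adapted to support a third segmentation mode: finest-grained and non-overlapping.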
