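Elasticsearch Study Notes (25): Elasticsearch Mapping in Detail and How Indexing Works Internally
First, a brief description of what a mapping is.
When we insert a few documents, ES automatically creates an index for us: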
PUT /website/_doc/1
{
  "post_date": "2017-01-01",
  "title": "my first article",
  "content": "this is my first article in this website",
  "author_id": 11400
}

PUT /website/_doc/2
{
  "post_date": "2017-01-02",
  "title": "my second article",
  "content": "this is my second article in this website",
  "author_id": 11400
}

PUT /website/_doc/3
{
  "post_date": "2017-01-03",
  "title": "my third article",
  "content": "this is my third article in this website",
  "author_id": 11400
}
View the mapping:
GET /website/_mapping

{
  "website" : {
    "mappings" : {
      "properties" : {
        "author_id" : { "type" : "long" },
        "content" : {
          "type" : "text",
          "fields" : {
            "keyword" : { "type" : "keyword", "ignore_above" : 256 }
          }
        },
        "post_date" : { "type" : "date" },
        "title" : {
          "type" : "text",
          "fields" : {
            "keyword" : { "type" : "keyword", "ignore_above" : 256 }
          }
        }
      }
    }
  }
}
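The mapping above was generated automatically when the documents were inserted; a mapping can also be created manually. This data structure and its related configuration, created automatically or manually for the type in an index, is called the mapping.
Below is a manually created mapping.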
PUT /test_mapping
{
  "mappings" : {
    "properties" : {
      "author_id" : { "type" : "long" },
      "content" : {
        "type" : "text",
        "fields" : {
          "keyword" : { "type" : "keyword", "ignore_above" : 256 }
        }
      },
      "post_date" : { "type" : "date" },
      "title" : {
        "type" : "text",
        "fields" : {
          "keyword" : { "type" : "keyword", "ignore_above" : 256 }
        }
      }
    }
  }
}
1. Exact match versus full-text search
(1) exact value
The field's entire value must match for the corresponding document to be returned.
Example:
GET /website/_search?q=post_date:2017

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 0, "relation" : "eq" },
    "max_score" : null,
    "hits" : [ ]
  }
}

GET /website/_search?q=post_date:2017-01-01

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 1, "relation" : "eq" },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "website",
        "_type" : "doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "post_date" : "2017-01-01",
          "title" : "my first article",
          "content" : "this is my first article in this website",
          "author_id" : 11400
        }
      }
    ]
  }
}
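(2) full text
Unlike exact value, full text does not simply match one complete value. The value is first split into words (tokenized) and then matched, and matches can also be made across abbreviations, tenses, letter case, synonyms, and so on.
Example: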
GET /website/_search?q=title:article

{
  "took" : 7,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 3, "relation" : "eq" },
    "max_score" : 0.087011375,
    "hits" : [
      {
        "_index" : "website",
        "_type" : "doc",
        "_id" : "1",
        "_score" : 0.087011375,
        "_source" : {
          "post_date" : "2017-01-01",
          "title" : "my first article",
          "content" : "this is my first article in this website",
          "author_id" : 11400
        }
      },
      {
        "_index" : "website",
        "_type" : "doc",
        "_id" : "2",
        "_score" : 0.087011375,
        "_source" : {
          "post_date" : "2017-01-02",
          "title" : "my second article",
          "content" : "this is my second in this website",
          "author_id" : 11400
        }
      },
      {
        "_index" : "website",
        "_type" : "doc",
        "_id" : "3",
        "_score" : 0.087011375,
        "_source" : {
          "post_date" : "2017-01-03",
          "title" : "my third article",
          "content" : "this is my third in this website",
          "author_id" : 11400
        }
      }
    ]
  }
}
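2. Core principles of the inverted index
The following walks through a simplified construction of an inverted index; in practice, building one is far more complex.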
doc1: I really liked my small dogs, and I think my mom also liked them.
doc2: He never liked any dogs, so I hope that my mom will not expect me to liked him.
Tokenize the text and build a first version of the inverted index:
word     doc1  doc2
I        *     *
really   *
liked    *     *
my       *     *
small    *
dogs     *     *
and      *
think    *
mom      *     *
also     *
them     *
He             *
never          *
any            *
so             *
hope           *
that           *
will           *
not            *
expect         *
me             *
to             *
him            *
Searching for mother like little dog returns no results, because none of the four query terms is in the index:
mother
like
little
dog
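This is clearly not the result we want. For example, mother and mom mean the same thing, yet the search finds nothing. A quick test, however, shows that ES can find the documents. When ES builds the inverted index it also processes each token, so that later searches are more likely to find related documents: it converts tense, singular and plural forms, synonyms, and letter case. This process is called normalization. Which of these steps actually run depends on the analyzer configured for the field.
mother -> mom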
liked -> like
small -> little
dogs -> dog
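Rebuild the inverted index with the normalized terms:
word     doc1  doc2
I        *     *
really   *
like     *     *
my       *     *
little   *
dog      *     *
and      *
think    *
mom      *     *
also     *
them     *
He             *
never          *
any            *
so             *
hope           *
that           *
will           *
not            *
expect         *
me             *
to             *
him            *
Query: mother like little dog, after tokenization and normalization: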
mother -> mom
like -> like
little -> little
dog -> dog
Both doc1 and doc2 are returned:
doc1:I really liked my small dogs, and I think my mom also liked them.
doc2:He never liked any dogs, so I hope that my mom will not expect me to liked him.
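The walkthrough above can be reproduced with a small Python sketch. It is only an illustration under stated assumptions: the NORMALIZE table is a hypothetical stand-in that hard-codes the four rewrites from this section, whereas a real analyzer would rely on stemmers and synonym filters. Lowercasing is left out to keep the terms identical to the tables.

```python
import re
from collections import defaultdict

# Hypothetical normalization table mirroring the four rewrites listed in this section.
# A real analyzer would use stemming and synonym filters instead of a fixed dict.
NORMALIZE = {"mother": "mom", "liked": "like", "small": "little", "dogs": "dog"}

def analyze(text, normalize=True):
    """Split text into word tokens and optionally map each token to its normalized form."""
    tokens = re.findall(r"[A-Za-z]+", text)
    if normalize:
        tokens = [NORMALIZE.get(token, token) for token in tokens]
    return tokens

def build_index(docs, normalize=True):
    """Return a dict mapping each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text, normalize):
            index[term].add(doc_id)
    return index

def search(index, query, normalize=True):
    """Return the ids of documents that contain at least one analyzed query term."""
    hits = set()
    for term in analyze(query, normalize):
        hits |= index.get(term, set())
    return hits

docs = {
    "doc1": "I really liked my small dogs, and I think my mom also liked them.",
    "doc2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}

raw_index = build_index(docs, normalize=False)
normalized_index = build_index(docs, normalize=True)

# Without normalization no query term is in the index.
print(sorted(search(raw_index, "mother like little dog", normalize=False)))  # []
# With the same normalization at index time and query time both documents match.
print(sorted(search(normalized_index, "mother like little dog")))  # ['doc1', 'doc2']
```

The important design point is that the query goes through the same analyze step as the documents; normalizing only one side would leave the terms mismatched.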
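3. A further summary of mapping
(1) When data is inserted directly into ES, ES automatically creates the index, along with the type and its mapping.
(2) The mapping automatically defines the data type of each field.
(3) Depending on the data type (for example text versus date), a field may be an exact value or full text.
(4) For an exact value, the whole value is added to the inverted index as a single term, with no tokenization. Full text goes through tokenization and normalization (tense, synonym, and case conversion) before it is added to the inverted index.
(5) At search time, whether a field is exact value or full text also determines how it is searched, and the search behavior stays consistent with how the inverted index was built. An exact value field is matched against the whole value; for a full text field, the query is tokenized and normalized before it is looked up in the inverted index.
(6) You can rely on ES dynamic mapping to create the mapping automatically, including the data types. You can also create the mapping for the index and type manually in advance and configure each field yourself, including its data type, indexing behavior, analyzer, and so on.
In essence, a mapping is the metadata of an index's type: it determines the data types, how the inverted index is built, and how searches behave.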
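Points (4) and (5) can be illustrated with a minimal Python sketch. The helper names index_terms and matches are hypothetical, and the full text branch only splits on whitespace and lowercases, which is far simpler than a real analyzer.

```python
def index_terms(value, full_text):
    """Return the set of terms a field value contributes to the inverted index."""
    if full_text:
        # full text: split into words and lowercase each one
        return {word.lower() for word in value.split()}
    # exact value: the whole value is one single term
    return {value}

def matches(field_value, query, full_text):
    """Analyze the query the same way as the field, then look for a shared term."""
    return bool(index_terms(field_value, full_text) & index_terms(query, full_text))

# Exact value: only the complete value matches.
print(matches("2017-01-01", "2017", full_text=False))        # False
print(matches("2017-01-01", "2017-01-01", full_text=False))  # True

# Full text: a single word from the value is enough.
print(matches("my first article", "article", full_text=True))   # True
# The same value treated as an exact value no longer matches that word.
print(matches("my first article", "article", full_text=False))  # False
```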
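4. Core mapping data types and dynamic mapping
(1) Core data types
text (string before ES 5.0): string type
byte: byte type
short: short integer
integer: integer
long: long integer
float: floating point
boolean: boolean type
date: date and time type
There are also more complex types such as arrays and objects. An array has no dedicated mapping type and takes the type of its elements, and an object is mapped through the types of its inner fields.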
(2) dynamic mapping
true or false -> boolean
123 -> long
123.45 -> float
2017-01-01 -> date
"hello world" -> text (string before ES 5.0)
(3) View the mapping
GET /{index}/_mapping

GET /test/_mapping

{
  "test" : {
    "mappings" : {
      "properties" : {
        "field1" : {
          "type" : "text",
          "fields" : {
            "keyword" : { "type" : "keyword", "ignore_above" : 256 }
          }
        },
        "field2" : {
          "type" : "text",
          "fields" : {
            "keyword" : { "type" : "keyword", "ignore_above" : 256 }
          }
        }
      }
    }
  }
}
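The dynamic mapping rules listed in (2) can be sketched in Python as follows. This is a rough approximation, not how ES implements type detection: infer_type is a hypothetical helper, and it assumes that only the yyyy-MM-dd date format is recognized.

```python
from datetime import datetime

def infer_type(value):
    """Guess a field type for a JSON value, following the rules listed in (2)."""
    # bool must be tested before int because bool is a subclass of int in Python
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        try:
            # assumption: only this one date format is detected
            datetime.strptime(value, "%Y-%m-%d")
            return "date"
        except ValueError:
            return "text"
    raise TypeError(f"unsupported value: {value!r}")

doc = {"post_date": "2017-01-01", "title": "my first article", "author_id": 11400}
print({field: infer_type(value) for field, value in doc.items()})
# {'post_date': 'date', 'title': 'text', 'author_id': 'long'}
```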
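5. Creating and modifying a mapping manually, and controlling whether string fields are analyzed
Note: a mapping can only be defined manually when the index is created, or extended later by adding new field mappings; the mapping of an existing field cannot be updated.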
# Create the index
PUT /website
{
  "mappings": {
    "properties": {
      "author_id": { "type": "long" },
      "title": { "type": "text", "analyzer": "standard" },
      "content": { "type": "text" },
      "post_date": { "type": "date" },
      "publisher_id": { "type": "text", "index": false }
    }
  }
}

# Try to modify a field's mapping (rejected: the index already exists)
PUT /website
{
  "mappings": {
    "properties": {
      "author_id": { "type": "text" }
    }
  }
}

{
  "error": {
    "root_cause": [
      {
        "type": "resource_already_exists_exception",
        "reason": "index [website/5xLohnJITHqCwRYInmBFmA] already exists",
        "index_uuid": "5xLohnJITHqCwRYInmBFmA",
        "index": "website"
      }
    ],
    "type": "resource_already_exists_exception",
    "reason": "index [website/5xLohnJITHqCwRYInmBFmA] already exists",
    "index_uuid": "5xLohnJITHqCwRYInmBFmA",
    "index": "website"
  },
  "status": 400
}

# Add a field to the mapping
PUT /website/_mapping
{
  "properties": {
    "new_field": { "type": "text" }
  }
}

{
  "acknowledged" : true
}
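The rule stated in the note for this section (new fields may be added, existing fields may not be changed) can be expressed as a toy model in Python. The function put_mapping is hypothetical and deliberately stricter and simpler than Elasticsearch's actual merge logic; it only illustrates the rule.

```python
def put_mapping(properties, update):
    """Return properties extended with `update`; reject any change to an existing field."""
    merged = dict(properties)
    for field, definition in update.items():
        if field in properties and properties[field] != definition:
            raise ValueError(f"mapping for existing field [{field}] cannot be changed")
        merged[field] = definition
    return merged

website = {"author_id": {"type": "long"}, "title": {"type": "text"}}

# Adding a new field is accepted.
website = put_mapping(website, {"new_field": {"type": "text"}})
print(sorted(website))  # ['author_id', 'new_field', 'title']

# Changing the type of an existing field is rejected.
try:
    put_mapping(website, {"author_id": {"type": "text"}})
except ValueError as err:
    print(err)  # mapping for existing field [author_id] cannot be changed
```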
6. Complex mapping types and the underlying structure of object data
(1)multivalue field
{ "tags": ["tag1", "tag2"] }
(2)empty field
null, []
(3)object field
PUT /test/_create/1
{
  "address": {
    "country": "china",
    "province": "guangdong",
    "city": "guangzhou"
  },
  "name": "jack",
  "age": 27,
  "join_date": "2017-01-01"
}

GET /test/_mapping

{
  "test" : {
    "mappings" : {
      "properties" : {
        "address" : {
          "properties" : {
            "city" : {
              "type" : "text",
              "fields" : {
                "keyword" : { "type" : "keyword", "ignore_above" : 256 }
              }
            },
            "country" : {
              "type" : "text",
              "fields" : {
                "keyword" : { "type" : "keyword", "ignore_above" : 256 }
              }
            },
            "province" : {
              "type" : "text",
              "fields" : {
                "keyword" : { "type" : "keyword", "ignore_above" : 256 }
              }
            }
          }
        },
        "age" : { "type" : "long" },
        "join_date" : { "type" : "date" },
        "name" : {
          "type" : "text",
          "fields" : {
            "keyword" : { "type" : "keyword", "ignore_above" : 256 }
          }
        }
      }
    }
  }
}

GET /test/_doc/1

{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "address" : {
      "country" : "china",
      "province" : "guangdong",
      "city" : "guangzhou"
    },
    "name" : "jack",
    "age" : 27,
    "join_date" : "2017-01-01"
  }
}
Note: these fields are indexed the same way as string fields, and data types must not be mixed.
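As for the underlying structure of object data: an object field is indexed as a flat set of fields named by dotted paths, such as address.country. The Python sketch below shows that flattening for the example document; the flatten helper is hypothetical and ignores details such as arrays of objects.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into a single level keyed by dotted paths."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # recurse into the inner object, extending the dotted prefix
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat

doc = {
    "address": {"country": "china", "province": "guangdong", "city": "guangzhou"},
    "name": "jack",
    "age": 27,
    "join_date": "2017-01-01",
}

for path, value in flatten(doc).items():
    print(path, "=", value)
```

This matches the mapping shown in this section, where address has no type of its own and only its inner fields (city, country, province) carry one.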