Elasticsearch Study Notes (25): Elasticsearch Mapping in Detail and Index Internals

Elasticsearch · inverted index · Chinese word segmentation · Published 2019-04-27 15:17:44

First, a brief description of what a mapping is.

Suppose we insert a few documents and let ES create the index for us automatically:

PUT /website/_doc/1
{
"post_date": "2017-01-01",
"title": "my first article",
"content": "this is my first article in this website",
"author_id": 11400
}
PUT /website/_doc/2
{
"post_date": "2017-01-02",
"title": "my second article",
"content": "this is my second article in this website",
"author_id": 11400
}
PUT /website/_doc/3
{
"post_date": "2017-01-03",
"title": "my third article",
"content": "this is my third article in this website",
"author_id": 11400
}

View the mapping:

GET /website/_mapping
{
"website" : {
"mappings" : {
"properties" : {
"author_id" : {
"type" : "long"
},
"content" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"post_date" : {
"type" : "date"
},
"title" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}

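The mapping above was generated automatically when the data was inserted; a mapping can also be created manually. This data structure and its related configuration, created automatically or manually for the type in an index, is called a mapping.

Below is a manually created mapping.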

PUT /test_mapping
{
"mappings" : {
"properties" : {
"author_id" : {
"type" : "long"
},
"content" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"post_date" : {
"type" : "date"
},
"title" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}

1. Exact match versus full-text search

(1) exact value

The field must match the query in its entirety for the corresponding document to be returned.

Example:

GET /website/_search?q=post_date:2017
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
}
}

GET /website/_search?q=post_date:2017-01-01
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "website",
"_type" : "doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"post_date" : "2017-01-01",
"title" : "my first article",
"content" : "this is my first article in this website",
"author_id" : 11400
}
}
]
}
}

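(2) full text

Full text differs from exact value: instead of matching only a complete value, the value is first split into terms (tokenized) and then matched, and matches can also be made across abbreviations, tenses, letter case, synonyms, and so on.

Example: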

GET /website/_search?q=title:article
{
"took" : 7,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 0.087011375,
"hits" : [
{
"_index" : "website",
"_type" : "doc",
"_id" : "1",
"_score" : 0.087011375,
"_source" : {
"post_date" : "2017-01-01",
"title" : "my first article",
"content" : "this is my first article in this website",
"author_id" : 11400
}
},
{
"_index" : "website",
"_type" : "doc",
"_id" : "2",
"_score" : 0.087011375,
"_source" : {
"post_date" : "2017-01-02",
"title" : "my second article",
"content" : "this is my second in this website",
"author_id" : 11400
}
},
{
"_index" : "website",
"_type" : "doc",
"_id" : "3",
"_score" : 0.087011375,
"_source" : {
"post_date" : "2017-01-03",
"title" : "my third article",
"content" : "this is my third in this website",
"author_id" : 11400
}
}
]
}
}
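
The difference between the two behaviours can be modelled in a few lines of Python. This is only a rough sketch: the function names `exact_match` and `full_text_match` are made up for illustration, and real ES parses dates and runs a configurable analyzer rather than comparing plain strings.

```python
def exact_match(field_value: str, query: str) -> bool:
    """Exact value: the whole field value must equal the query."""
    return field_value == query


def full_text_match(field_value: str, query: str) -> bool:
    """Full text: match if any query term is one of the field's terms."""
    field_terms = set(field_value.lower().split())
    query_terms = query.lower().split()
    return any(term in field_terms for term in query_terms)


print(exact_match("2017-01-01", "2017"))               # False
print(exact_match("2017-01-01", "2017-01-01"))         # True
print(full_text_match("my first article", "article"))  # True
```

The first call mirrors the `post_date:2017` search that returned no hits, and the last call mirrors `title:article`, which matched all three documents.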

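2. Core principle of the inverted index

The following walks through a simplified construction of an inverted index; in practice the process is far more complex.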

doc1: I really liked my small dogs, and I think my mom also liked them.

doc2: He never liked any dogs, so I hope that my mom will not expect me to liked him.

Tokenize the documents and build a preliminary inverted index:

word     doc1   doc2
I        *      *
really   *
liked    *      *
my       *      *
small    *
dogs     *      *
and      *
think    *
mom      *      *
also     *
them     *
He              *
never           *
any             *
so              *
hope            *
that            *
will            *
not             *
expect          *
me              *
to              *
him             *

Searching for "mother like little dog" returns no results, because none of the query terms appears in the index:

mother

like

little

dog
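
The preliminary index and the failed search can be reproduced with a small Python sketch. The helper names (`tokenize`, `build_index`, `search`) are illustrative assumptions and not part of ES; the tokenizer simply extracts runs of letters and does no normalization at all.

```python
import re
from collections import defaultdict

DOCS = {
    "doc1": "I really liked my small dogs, and I think my mom also liked them.",
    "doc2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}


def tokenize(text):
    """Split text into words, dropping punctuation; no normalization."""
    return re.findall(r"[A-Za-z]+", text)


def build_index(docs):
    """Map each word to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return every document that contains at least one query word."""
    hits = set()
    for word in tokenize(query):
        hits |= index.get(word, set())
    return hits


index = build_index(DOCS)
print(sorted(index["liked"]))                   # ['doc1', 'doc2']
print(search(index, "mother like little dog"))  # set()
```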

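This is clearly not the result we want. For example, mother and mom mean the same thing, yet the search finds nothing. In practice, however, ES can return such matches: when it builds the inverted index it also processes each token, so that later searches are more likely to find related documents. Typical steps are converting tenses, singular and plural forms, synonyms, and letter case. This process is called normalization. Which of these conversions actually run depends on the analyzer configured for the field.

mother -> mom

liked -> like

small -> little

dogs -> dog

The inverted index is then rebuilt with the normalized terms: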

word     doc1   doc2
I        *      *
really   *
like     *      *
my       *      *
little   *
dog      *      *
and      *
think    *
mom      *      *
also     *
them     *
He              *
never           *
any             *
so              *
hope            *
that            *
will            *
not             *
expect          *
me              *
to              *
him             *

Query: "mother like little dog", after tokenization and normalization:

mother -> mom

like -> like

little -> little

dog -> dog

Both doc1 and doc2 are returned:

doc1: I really liked my small dogs, and I think my mom also liked them.

doc2: He never liked any dogs, so I hope that my mom will not expect me to liked him.
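
The effect of normalization can be sketched in Python as follows. The `NORMALIZE` table is a hypothetical stand-in that contains only the four conversions from the example; a real analyzer derives such conversions from stemmers and synonym filters rather than from a hard-coded dictionary.

```python
import re
from collections import defaultdict

DOCS = {
    "doc1": "I really liked my small dogs, and I think my mom also liked them.",
    "doc2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}

# Hypothetical normalization table, taken from the example in the text.
NORMALIZE = {"mother": "mom", "liked": "like", "small": "little", "dogs": "dog"}


def analyze(text):
    """Tokenize, then map each word to its normalized form."""
    words = re.findall(r"[A-Za-z]+", text)
    return [NORMALIZE.get(word, word) for word in words]


def build_index(docs):
    """Map each normalized term to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Analyze the query the same way as the documents, then look up each term."""
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return hits


index = build_index(DOCS)
print(sorted(search(index, "mother like little dog")))  # ['doc1', 'doc2']
```

The important design point is that `analyze` is applied both when indexing and when searching, so the query terms and the indexed terms end up in the same normalized form.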

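3. Further summary of mapping

(1) When data is inserted directly into ES, ES automatically creates the index, along with the type and its corresponding mapping.

(2) The mapping automatically defines the data type of each field.

(3) Depending on the data type (for example text versus date), a field may be treated as exact value or as full text.

(4) For an exact value field, the entire value is put into the inverted index as a single term, without being split; a full text field goes through several processing steps, tokenization and normalization (tense conversion, synonym conversion, case conversion), before it is put into the inverted index.

(5) At search time, whether a field is exact value or full text also determines how a search against it behaves, consistent with how its inverted index was built: an exact value search matches against the whole value, whereas a full text search tokenizes and normalizes the query before looking it up in the inverted index.

(6) You can rely on ES dynamic mapping to create the mapping automatically, including the data types; or you can create the mapping for the index and type manually in advance and configure each field yourself, including its data type, indexing behavior, analyzer, and so on.

In essence, a mapping is the metadata of an index's type: it determines the data types, how the inverted index is built, and how searches behave.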

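4. Core mapping data types and dynamic mapping

(1) Core data types

text (called string in older versions): string type

byte: byte type

short: short integer

integer: integer

long: long integer

float: floating point

boolean: boolean type

date: date/time type

There are also more complex types, such as arrays and object; underneath, their values are still stored as fields of the simple types above.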

(2) dynamic mapping

true or false -> boolean

123 -> long

123.45 -> float

"2017-01-01" -> date

"hello world" -> text

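The inference rules above can be approximated with the following Python sketch. The function name `infer_type` is an assumption made for illustration, and the date check only recognises the `yyyy-MM-dd` form used in these notes, whereas real ES date detection works from a configurable list of formats.

```python
from datetime import datetime


def infer_type(value):
    """Guess the ES field type for a JSON value (simplified)."""
    # bool must be tested before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        try:
            datetime.strptime(value, "%Y-%m-%d")
            return "date"
        except ValueError:
            return "text"
    raise TypeError(f"unsupported value: {value!r}")


doc = {"post_date": "2017-01-01", "title": "my first article", "author_id": 11400}
print({field: infer_type(value) for field, value in doc.items()})
# {'post_date': 'date', 'title': 'text', 'author_id': 'long'}
```

The printed result agrees with the mapping that ES generated for the website index at the start of these notes, apart from the keyword sub-field that ES adds to text fields.
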
(3) View the mapping

GET /{index}/_mapping


GET /test/_mapping
{
"test" : {
"mappings" : {
"properties" : {
"field1" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"field2" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}

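5. Manually creating and modifying mappings, and customizing whether string fields are analyzed

Note: a mapping can only be created manually when the index is created, or extended by adding mappings for new fields; the mapping of an existing field cannot be updated.

# Create the index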
PUT /website
{
"mappings": {
"properties": {
"author_id": {
"type": "long"
},
"title": {
"type": "text",
"analyzer": "standard"
},
"content": {
"type": "text"
},
"post_date": {
"type": "date"
},
"publisher_id": {
"type": "text",
"index": false
}
}
}
}
# Attempt to modify the mapping of an existing field (rejected, see the error response)
PUT /website
{
"mappings": {
"properties": {
"author_id": {
"type": "text"
}
}
}
}
{
"error": {
"root_cause": [
{
"type": "resource_already_exists_exception",
"reason": "index [website/5xLohnJITHqCwRYInmBFmA] already exists",
"index_uuid": "5xLohnJITHqCwRYInmBFmA",
"index": "website"
}
],
"type": "resource_already_exists_exception",
"reason": "index [website/5xLohnJITHqCwRYInmBFmA] already exists",
"index_uuid": "5xLohnJITHqCwRYInmBFmA",
"index": "website"
},
"status": 400
}
# Add a new field to the mapping
PUT /website/_mapping
{
"properties": {
"new_field": {
"type": "text"
}
}
}
{
"acknowledged" : true
}
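
The rule "new fields may be added, existing fields may not be changed" can be expressed as a small Python sketch. `put_mapping` and `MappingConflict` are hypothetical names, and the check is deliberately strict: it rejects any change to an existing field, which is a simplification of what the real mapping API does.

```python
class MappingConflict(Exception):
    """Raised when an update tries to change an existing field."""


def put_mapping(existing, update):
    """Return a new properties dict with `update` merged into `existing`."""
    merged = dict(existing)
    for field, definition in update.items():
        if field in existing and existing[field] != definition:
            raise MappingConflict(
                f"cannot change mapping of existing field [{field}]"
            )
        merged[field] = definition
    return merged


properties = {"author_id": {"type": "long"}, "title": {"type": "text"}}

# Adding a new field is accepted.
properties = put_mapping(properties, {"new_field": {"type": "text"}})
print(sorted(properties))  # ['author_id', 'new_field', 'title']

# Changing the type of an existing field is rejected.
try:
    put_mapping(properties, {"author_id": {"type": "text"}})
except MappingConflict as err:
    print(err)  # cannot change mapping of existing field [author_id]
```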

6. Complex mapping types and the underlying structure of object data

(1) multivalue field

{
"tags": ["tag1", "tag2"]
}

(2) empty field

null, []

(3) object field

PUT /test/_create/1
{
"address": {
"country": "china",
"province": "guangdong",
"city": "guangzhou"
},
"name": "jack",
"age": 27,
"join_date": "2017-01-01"
}
GET /test/_mapping
{
"test" : {
"mappings" : {
"properties" : {
"address" : {
"properties" : {
"city" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"country" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"province" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
},
"age" : {
"type" : "long"
},
"join_date" : {
"type" : "date"
},
"name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}

GET /test/_doc/1

{
"_index" : "test",
"_type" : "_doc",
"_id" : "1",
"_version" : 1,
"_seq_no" : 0,
"_primary_term" : 1,
"found" : true,
"_source" : {
"address" : {
"country" : "china",
"province" : "guangdong",
"city" : "guangzhou"
},
"name" : "jack",
"age" : 27,
"join_date" : "2017-01-01"
}
}
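
Underneath, an object field is not stored as a nested structure: its inner fields are flattened into dotted paths such as address.country, which is why the mapping lists them under address.properties. A minimal Python sketch of that flattening, assuming a helper called `flatten` that is purely illustrative:

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into a single dict keyed by dotted paths."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


doc = {
    "address": {"country": "china", "province": "guangdong", "city": "guangzhou"},
    "name": "jack",
    "age": 27,
    "join_date": "2017-01-01",
}

for path, value in flatten(doc).items():
    print(path, "=", value)
# address.country = china
# address.province = guangdong
# address.city = guangzhou
# name = jack
# age = 27
# join_date = 2017-01-01
```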

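Note: when indexing, a multivalue field is handled the same way as a string field, and the data types of its values cannot be mixed.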