Learning Druid: Query Syntax
Foreword
I have been working on real-time data queries with Druid recently. This article is a brief translation of the official English documentation, with my own understanding added, plus a list of the problems I ran into and how I solved them, so they are not forgotten.
All demo examples in this article come from the official documentation.
Druid Query Overview
Druid is queried through REST-style HTTP requests to a query-serving node: the client POSTs a JSON object to the query endpoint. You can query from a shell script, or build the POST request with Google's ARC (Advanced REST Client) extension.
Building a Query
Shell script
curl -X POST '<queryable_host>:<port>/druid/v2/?pretty' -H 'Content-Type:application/json' -d @<query_json_file>
Here <queryable_host>:<port> is the IP address and service port of the machine running the broker, historical, or realtime process, and <query_json_file> is the path to the JSON query file.
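For readers who prefer scripting the request, here is a minimal Python sketch of what the curl command above does. The broker address, port, data source, and query body are placeholder values, not from the original article:

```python
import json
import urllib.request

def build_request(host, port, query):
    """Build (but do not send) the POST request the curl command above issues."""
    url = f"http://{host}:{port}/druid/v2/?pretty"
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )

# Hypothetical broker address and a minimal timeseries query.
req = build_request("localhost", 8082, {
    "queryType": "timeseries",
    "dataSource": "my_dataSource",
    "granularity": "day",
    "aggregations": [{"type": "count", "name": "count"}],
    "intervals": ["2013-08-31/2013-09-04"],
})
# Sending would be urllib.request.urlopen(req), which requires a running broker.
```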
ARC extension
See the figure below.
Druid Queries
1 Druid Query Types
Different query scenarios call for different query types. Druid offers many query types, each configured through a different Query spec. They can be grouped into three categories:
1. Aggregation queries: timeseries (Timeseries), top-N (TopN), group-by (GroupBy)
2. Metadata queries: time boundary (Time Boundary), segment metadata (Segment Metadata), data source (DataSource)
3. Search queries (Search)
Aggregation queries are by far the most common; the other types are simpler and used in fewer scenarios, and the demos in the official documentation cover them, so this article focuses on aggregation queries. In general, Timeseries and TopN outperform GroupBy; GroupBy is the most flexible query type but also the most expensive. Timeseries is clearly faster than GroupBy because it does not need to group by any dimension; when grouping and sorting on a single dimension, TopN beats GroupBy.
2 Main Query Properties
A Druid query mainly contains the following properties:
1. queryType: the query type, i.e. timeseries, topN, groupBy, etc.
2. dataSource: the data source, similar to a table in MySQL
3. granularity: the aggregation granularity, e.g. none, all, week, day, hour
4. filter: the filter conditions, similar to a WHERE clause in MySQL
5. aggregator: the aggregation, similar to COUNT, SUM, etc. in MySQL
2.1 granularity
2.1.1 Simple granularities
The simple granularities are: all, none, second, minute, fifteen_minute, thirty_minute, hour, day, week, month, quarter, year. Query results are limited by the smallest granularity the data was ingested with: if the data was built at hour granularity, querying at minute granularity still returns hour-granularity data.
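The truncation behavior can be illustrated with a small Python sketch. This is a local illustration only, not Druid code:

```python
from datetime import datetime

def truncate(ts: datetime, granularity: str) -> datetime:
    """Floor a timestamp to a simple granularity, mirroring how Druid
    zero-fills everything below the requested granularity."""
    if granularity == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    if granularity == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(granularity)

t = datetime(2013, 8, 31, 1, 2, 33)  # 2013-08-31T01:02:33Z from the data below
```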
Suppose the data stored in Druid was ingested at millisecond granularity, in the following format:
{"timestamp": "2013-08-31T01:02:33Z", "page": "AAA", "language" : "en"}
{"timestamp": "2013-09-01T01:02:33Z", "page": "BBB", "language" : "en"}
{"timestamp": "2013-09-02T23:32:45Z", "page": "CCC", "language" : "en"}
{"timestamp": "2013-09-03T03:32:45Z", "page": "DDD", "language" : "en"}
Submit a groupBy query at hour granularity:
{
  "queryType": "groupBy",
  "dataSource": "my_dataSource",
  "granularity": "hour",
  "dimensions": ["language"],
  "aggregations": [
    { "type": "count", "name": "count" }
  ],
  "intervals": ["2000-01-01T00:00Z/3000-01-01T00:00Z"]
}
In the result of an hour-granularity groupBy, the timestamp is truncated to the hour, and everything below hour granularity is zero-filled; likewise, a day-granularity query zero-fills hours and below. Timestamps are in UTC. The query result is:
[
  { "version" : "v1", "timestamp" : "2013-08-31T01:00:00.000Z", "event" : { "count" : 1, "language" : "en" } },
  { "version" : "v1", "timestamp" : "2013-09-01T01:00:00.000Z", "event" : { "count" : 1, "language" : "en" } },
  { "version" : "v1", "timestamp" : "2013-09-02T23:00:00.000Z", "event" : { "count" : 1, "language" : "en" } },
  { "version" : "v1", "timestamp" : "2013-09-03T03:00:00.000Z", "event" : { "count" : 1, "language" : "en" } }
]
If the granularity is set to day instead, the data is aggregated per day and the result is:
[
  { "version" : "v1", "timestamp" : "2013-08-31T00:00:00.000Z", "event" : { "count" : 1, "language" : "en" } },
  { "version" : "v1", "timestamp" : "2013-09-01T00:00:00.000Z", "event" : { "count" : 1, "language" : "en" } },
  { "version" : "v1", "timestamp" : "2013-09-02T00:00:00.000Z", "event" : { "count" : 1, "language" : "en" } },
  { "version" : "v1", "timestamp" : "2013-09-03T00:00:00.000Z", "event" : { "count" : 1, "language" : "en" } }
]
If the granularity is set to none, the data is returned at the smallest granularity it was built with, i.e. no aggregation is performed; since this data was built at millisecond granularity, the results are at millisecond granularity too:
[
  { "version" : "v1", "timestamp" : "2013-08-31T01:02:33.000Z", "event" : { "count" : 1, "language" : "en" } },
  { "version" : "v1", "timestamp" : "2013-09-01T01:02:33.000Z", "event" : { "count" : 1, "language" : "en" } },
  { "version" : "v1", "timestamp" : "2013-09-02T23:32:45.000Z", "event" : { "count" : 1, "language" : "en" } },
  { "version" : "v1", "timestamp" : "2013-09-03T03:32:45.000Z", "event" : { "count" : 1, "language" : "en" } }
]
If the granularity is set to all, a single row is returned that summarizes the whole queried interval:
[
  { "version" : "v1", "timestamp" : "2000-01-01T00:00:00.000Z", "event" : { "count" : 4, "language" : "en" } }
]
2.1.2 Duration granularity
You can aggregate over an arbitrary duration (in milliseconds); results are returned in UTC. The optional origin property sets the starting point of the buckets; if not specified, it defaults to 1970-01-01T00:00:00Z.
A two-hour duration, starting from 1970-01-01T00:00:00:
{"type": "duration", "duration": 7200000}
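Duration bucketing is simple arithmetic on epoch milliseconds. This Python sketch (an illustration, not Druid code) shows how the optional origin shifts bucket boundaries:

```python
def duration_bucket_start(t_ms: int, duration_ms: int, origin_ms: int = 0) -> int:
    """Start of the duration bucket containing t_ms (epoch milliseconds).
    origin defaults to 1970-01-01T00:00:00Z, i.e. epoch 0, as in Druid."""
    return origin_ms + ((t_ms - origin_ms) // duration_ms) * duration_ms

TWO_HOURS = 7_200_000  # the "duration": 7200000 example above
```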
2.1.3 Period granularity
A convenient special case of duration granularity for calendar periods such as years, months, days, and hours, written in ISO 8601 notation. Unless specified otherwise, years start in January, months start on the 1st, and weeks start on Monday.
The general format is as follows, where timeZone is optional (default UTC) and origin is optional (default 1970-01-01T00:00:00):
{"type": "period", "period": "P2D", "timeZone": "America/Los_Angeles"}
Common period values:
month: P2M is a two-month granularity
week: P2W is a two-week granularity
day: P1D is a one-day granularity
hour: PT1H is a one-hour granularity
second: PT0.750S is a 750-millisecond granularity
For example, a groupBy query with a one-day period granularity:
{
  "queryType": "groupBy",
  "dataSource": "my_dataSource",
  "granularity": {"type": "period", "period": "P1D", "timeZone": "America/Los_Angeles"},
  "dimensions": ["language"],
  "aggregations": [
    { "type": "count", "name": "count" }
  ],
  "intervals": ["1999-12-31T16:00:00.000-08:00/2999-12-31T16:00:00.000-08:00"]
}
The query result is:
[
  { "version" : "v1", "timestamp" : "2013-08-30T00:00:00.000-07:00", "event" : { "count" : 1, "language" : "en" } },
  { "version" : "v1", "timestamp" : "2013-08-31T00:00:00.000-07:00", "event" : { "count" : 1, "language" : "en" } },
  { "version" : "v1", "timestamp" : "2013-09-02T00:00:00.000-07:00", "event" : { "count" : 2, "language" : "en" } }
]
The official example uses the America/Los_Angeles time zone. For China, the time zone is usually written as follows; see the time zone list linked in the official documentation for more:
"granularity": { "period": "PT1H", "timeZone": "+08:00", "type": "period" }
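Why a daily period with timeZone +08:00 produces buckets whose boundaries fall at 16:00 UTC can be checked with a short Python sketch (a local illustration only, not Druid code):

```python
from datetime import datetime, timedelta, timezone

def day_bucket_start_utc(t: datetime, utc_offset_hours: int) -> datetime:
    """Start of the local calendar day containing t, expressed in UTC.
    Illustrates why a P1D period with timeZone +08:00 yields buckets
    that begin at 16:00 UTC of the previous day."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    local = t.astimezone(tz)
    local_midnight = local.replace(hour=0, minute=0, second=0, microsecond=0)
    return local_midnight.astimezone(timezone.utc)

t = datetime(2013, 9, 1, 2, 30, tzinfo=timezone.utc)  # 10:30 Beijing time
```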
2.2 Filters
A filter is a JSON object representing one filter condition, equivalent to a WHERE condition in MySQL. The main filter types are: selector filter, regular expression filter, logical expression filters (AND, OR, NOT), in filter, bound filter, search filter, JavaScript filter, and extraction filter.
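Since logical filters nest, building them by hand gets verbose. A few helper functions make composing them less error-prone; these helpers are hypothetical, not part of any Druid client library:

```python
def selector(dimension, value):
    """A selector filter: dimension = value."""
    return {"type": "selector", "dimension": dimension, "value": value}

def and_(*filters):
    return {"type": "and", "fields": list(filters)}

def or_(*filters):
    return {"type": "or", "fields": list(filters)}

def not_(f):
    return {"type": "not", "field": f}

# WHERE language = 'en' AND NOT page = 'AAA'
f = and_(selector("language", "en"), not_(selector("page", "AAA")))
```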
2.2.1 Selector filter
Equivalent to WHERE <dimension_string> = '<dimension_value_string>'
JSON format:
"filter": { "type": "selector", "dimension": <dimension_string>, "value": <dimension_value_string> }
2.2.2 Regular expression filter
Similar to the selector filter, but matching against a regular expression, using standard Java regular expression syntax:
"filter": { "type": "regex", "dimension": <dimension_string>, "pattern": <pattern_string> }
2.2.3 Logical expression filters
AND
"filter": { "type": "and", "fields": [<filter>, <filter>, ...] }
OR
"filter": { "type": "or", "fields": [<filter>, <filter>, ...] }
NOT
"filter": { "type": "not", "field": <filter> }
IN
Equivalent to:
SELECT COUNT(*) AS 'Count' FROM `table` WHERE `outlaw` IN ('Good', 'Bad', 'Ugly')
{ "type": "in", "dimension": "outlaw", "values": ["Good", "Bad", "Ugly"] }
BOUND
Numeric: 21 <= age <= 31
{ "type": "bound", "dimension": "age", "lower": "21", "upper": "31" , "ordering": "numeric" }
Numeric: 21 < age < 31
{ "type": "bound", "dimension": "age", "lower": "21", "lowerStrict": true, "upper": "31" , "upperStrict": true, "ordering": "numeric" }
String: 'foo' <= name <= 'hoo'
{ "type": "bound", "dimension": "name", "lower": "foo", "upper": "hoo" }
2.3 Aggregations
Aggregations describe how to summarize the data at query time (Druid can also apply aggregations at ingestion time, which pre-aggregates the data). The common aggregator types are: count, sum, min/max, approximate, and miscellaneous.
2.3.1 Count aggregator
Counts the number of rows matching the query, like COUNT in MySQL:
{ "type" : "count", "name" : <output_name> }
Note: the count returned by a Druid count query is not necessarily equal to the number of rows ingested, because Druid may have already rolled the data up at ingestion time according to the configured aggregations.
2.3.2 Sum aggregator
The sum aggregator type should match the type of the underlying column in the Druid table.
longSum
{ "type" : "longSum", "name" : <output_name>, "fieldName" : <metric_name> }
doubleSum
{ "type" : "doubleSum", "name" : <output_name>, "fieldName" : <metric_name> }
2.3.3 MIN/MAX aggregator
doubleMin
{ "type" : "doubleMin", "name" : <output_name>, "fieldName" : <metric_name> }
doubleMax
{ "type" : "doubleMax", "name" : <output_name>, "fieldName" : <metric_name> }
The long variants (longMin/longMax) are analogous and not repeated here. For the other aggregator types, see the examples in the official documentation.
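The semantics of these aggregators can be mirrored locally. The rows and the aggregate helper below are illustrative only; field names are made up:

```python
rows = [
    {"language": "en", "added": 57.0},
    {"language": "en", "added": 200.0},
    {"language": "ja", "added": 17.0},
]

def aggregate(rows, agg):
    """Apply one aggregator spec to a list of rows (local illustration only)."""
    if agg["type"] == "count":
        return len(rows)
    vals = [r[agg["fieldName"]] for r in rows]
    if agg["type"] in ("longSum", "doubleSum"):
        return sum(vals)
    if agg["type"] in ("longMin", "doubleMin"):
        return min(vals)
    if agg["type"] in ("longMax", "doubleMax"):
        return max(vals)
    raise ValueError(agg["type"])
```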
2.4 Aggregation Queries
2.4.1 Timeseries query
query
{
  "queryType": "timeseries",
  "dataSource": "sample_datasource",
  "granularity": "day",
  "descending": "true",
  "filter": {
    "type": "and",
    "fields": [
      { "type": "selector", "dimension": "sample_dimension1", "value": "sample_value1" },
      {
        "type": "or",
        "fields": [
          { "type": "selector", "dimension": "sample_dimension2", "value": "sample_value2" },
          { "type": "selector", "dimension": "sample_dimension3", "value": "sample_value3" }
        ]
      }
    ]
  },
  "aggregations": [
    { "type": "longSum", "name": "sample_name1", "fieldName": "sample_fieldName1" },
    { "type": "doubleSum", "name": "sample_name2", "fieldName": "sample_fieldName2" }
  ],
  "postAggregations": [
    {
      "type": "arithmetic",
      "name": "sample_divide",
      "fn": "/",
      "fields": [
        { "type": "fieldAccess", "name": "postAgg__sample_name1", "fieldName": "sample_name1" },
        { "type": "fieldAccess", "name": "postAgg__sample_name2", "fieldName": "sample_name2" }
      ]
    }
  ],
  "intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ]
}
Query properties
Property | Description | Required |
queryType | string, must be "timeseries" | yes |
dataSource | string, the data source | yes |
descending | sort order, default false | no |
intervals | the time range to query | yes |
granularity | aggregation granularity (see above) | yes |
filter | filter conditions (see above) | no |
aggregations | aggregations (see above) | yes |
postAggregations | post-aggregations | no |
context | query context | no |
The result of the query above:
[
  {
    "timestamp": "2012-01-01T00:00:00.000Z",
    "result": { "sample_name1": <some_value>, "sample_name2": <some_value>, "sample_divide": <some_value> }
  },
  {
    "timestamp": "2012-01-02T00:00:00.000Z",
    "result": { "sample_name1": <some_value>, "sample_name2": <some_value>, "sample_divide": <some_value> }
  }
]
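The sample_divide field is produced by an arithmetic postAggregation evaluated over each aggregated result row. A local Python sketch of that evaluation (assuming, as documented for arithmetic post-aggregators, that division by zero yields 0; the row values are made up):

```python
def arithmetic_post_agg(result_row, post_agg):
    """Evaluate an "arithmetic" postAggregation over one result row."""
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
           "*": lambda a, b: a * b,
           "/": lambda a, b: 0.0 if b == 0 else a / b}  # Druid's "/" returns 0 on divide-by-zero
    vals = [result_row[f["fieldName"]] for f in post_agg["fields"]]
    acc = vals[0]
    for v in vals[1:]:
        acc = ops[post_agg["fn"]](acc, v)
    return acc

row = {"sample_name1": 10.0, "sample_name2": 4.0}
post = {"type": "arithmetic", "name": "sample_divide", "fn": "/",
        "fields": [{"type": "fieldAccess", "fieldName": "sample_name1"},
                   {"type": "fieldAccess", "fieldName": "sample_name2"}]}
```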
2.4.2 TopN query
A TopN query returns an ordered result set for a given dimension according to a metric spec; conceptually, it can be thought of as an ordered groupBy over a single dimension. In some cases, TopN queries are faster than groupBy queries. TopN results are returned as a JSON array. Each node ranks only its own top K results, where K defaults to a maximum of 1000 in Druid; in practice, if you ask for the top 1000 items, the ordering of items 1-999 is guaranteed correct, while the ordering of later items is not. You can increase the threshold value to improve ordering accuracy.
query
{
  "queryType": "topN",
  "dataSource": "sample_data",
  "dimension": "sample_dim",
  "threshold": 5,
  "metric": "count",
  "granularity": "all",
  "filter": {
    "type": "and",
    "fields": [
      { "type": "selector", "dimension": "dim1", "value": "some_value" },
      { "type": "selector", "dimension": "dim2", "value": "some_other_val" }
    ]
  },
  "aggregations": [
    { "type": "longSum", "name": "count", "fieldName": "count" },
    { "type": "doubleSum", "name": "some_metric", "fieldName": "some_metric" }
  ],
  "postAggregations": [
    {
      "type": "arithmetic",
      "name": "sample_divide",
      "fn": "/",
      "fields": [
        { "type": "fieldAccess", "name": "some_metric", "fieldName": "some_metric" },
        { "type": "fieldAccess", "name": "count", "fieldName": "count" }
      ]
    }
  ],
  "intervals": [ "2013-08-31T00:00:00.000/2013-09-03T00:00:00.000" ]
}
Query properties
Property | Description | Required |
queryType | string, must be "topN" | yes |
dataSource | string, the data source | yes |
dimension | the dimension to rank on | yes |
intervals | the time range to query | yes |
granularity | aggregation granularity (see above) | yes |
filter | filter conditions (see above) | no |
aggregations | aggregations (see above) | yes |
postAggregations | post-aggregations | no |
threshold | the N in top N | yes |
metric | string or JSON object specifying the metric to sort the top N results by | yes |
context | query context | no |
The query result looks like:
[
  {
    "timestamp": "2013-08-31T00:00:00.000Z",
    "result": [
      { "dim1": "dim1_val", "count": 111, "some_metrics": 10669, "average": 96.11711711711712 },
      { "dim1": "another_dim1_val", "count": 88, "some_metrics": 28344, "average": 322.09090909090907 },
      { "dim1": "dim1_val3", "count": 70, "some_metrics": 871, "average": 12.442857142857143 },
      { "dim1": "dim1_val4", "count": 62, "some_metrics": 815, "average": 13.14516129032258 },
      { "dim1": "dim1_val5", "count": 60, "some_metrics": 2787, "average": 46.45 }
    ]
  }
]
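The approximation described above (each node ranks only its local top K before the broker merges) can be simulated in a few lines of Python. The node counts below are made-up data; the point is that values pruned on one node can undercount and reorder the merged result:

```python
def local_top_k(counts, k):
    """Each data node ranks only its own top K values."""
    return sorted(counts.items(), key=lambda kv: -kv[1])[:k]

def approx_top_k(node_counts, k):
    """Broker-side merge of per-node top-K lists: counts a node pruned
    are lost, so ordering beyond the threshold is not guaranteed."""
    merged = {}
    for counts in node_counts:
        for dim, c in local_top_k(counts, k):
            merged[dim] = merged.get(dim, 0) + c
    return sorted(merged.items(), key=lambda kv: -kv[1])[:k]

def exact_top_k(node_counts, k):
    """What a full merge (e.g. a groupBy) would compute."""
    merged = {}
    for counts in node_counts:
        for dim, c in counts.items():
            merged[dim] = merged.get(dim, 0) + c
    return sorted(merged.items(), key=lambda kv: -kv[1])[:k]

# Two hypothetical nodes; "z" is ranked low on node 1 and gets pruned there.
nodes = [{"x": 5, "y": 4, "z": 3}, {"z": 4, "x": 1, "y": 1}]
```

Raising the threshold (the K each node keeps) shrinks this error, which is exactly the advice given above.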
2.4.3 GroupBy query
Similar to a GROUP BY query in MySQL.
query
{
  "queryType": "groupBy",
  "dataSource": "sample_data",
  "dimensions": ["dim1", "dim2"],
  "granularity": "all",
  "filter": {
    "type": "and",
    "fields": [
      { "type": "selector", "dimension": "dim1", "value": "some_value" },
      { "type": "selector", "dimension": "dim2", "value": "some_other_val" }
    ]
  },
  "aggregations": [
    { "type": "longSum", "name": "count", "fieldName": "count" },
    { "type": "doubleSum", "name": "some_metric", "fieldName": "some_metric" }
  ],
  "postAggregations": [
    {
      "type": "arithmetic",
      "name": "sample_divide",
      "fn": "/",
      "fields": [
        { "type": "fieldAccess", "name": "some_metric", "fieldName": "some_metric" },
        { "type": "fieldAccess", "name": "count", "fieldName": "count" }
      ]
    }
  ],
  "intervals": [ "2013-08-31T00:00:00.000/2013-09-03T00:00:00.000" ]
}
Query properties
Property | Description | Required |
queryType | string, must be "groupBy" | yes |
dataSource | string, the data source | yes |
dimensions | the dimensions to group by | yes |
intervals | the time range to query | yes |
granularity | aggregation granularity (see above) | yes |
filter | filter conditions (see above) | no |
aggregations | aggregations (see above) | yes |
postAggregations | post-aggregations | no |
limitSpec | limits the number of results returned, like a LIMIT clause in MySQL | no |
having | like a HAVING clause in MySQL | no |
context | query context | no |
The query result looks like:
[
  {
    "version": "v1",
    "timestamp": "2013-08-31T00:00:00.000Z",
    "event": { "dim1": "dim1_val", "count": 111, "some_metric": 10669, "sample_divide": 96.11711711711712 }
  },
  {
    "version": "v1",
    "timestamp": "2013-08-31T00:00:00.000Z",
    "event": { "dim1": "another_dim1_val", "count": 88, "some_metric": 28344, "sample_divide": 322.09090909090907 }
  }
]
groupBy on multi-value fields
A Druid field can hold multiple values. For a groupBy on a multi-value field, a row matches if any one of its values satisfies the filter, and the result then contains one group per value of the matching rows. The example below illustrates this.
Underlying data:
{"timestamp": "2011-01-12T00:00:00.000Z", "tags": ["t1","t2","t3"]}  #row1
{"timestamp": "2011-01-13T00:00:00.000Z", "tags": ["t3","t4","t5"]}  #row2
{"timestamp": "2011-01-14T00:00:00.000Z", "tags": ["t5","t6","t7"]}  #row3
{"timestamp": "2011-01-14T00:00:00.000Z", "tags": []}  #row4
Query:
{
  "queryType": "groupBy",
  "dataSource": "test",
  "intervals": [ "1970-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z" ],
  "filter": { "type": "selector", "dimension": "tags", "value": "t3" },
  "granularity": { "type": "all" },
  "dimensions": [
    { "type": "default", "dimension": "tags", "outputName": "tags" }
  ],
  "aggregations": [ { "type": "count", "name": "count" } ]
}
Result: row1 and row2 match, and every tag value of the matching rows appears in the output:
[
  { "timestamp": "1970-01-01T00:00:00.000Z", "event": { "count": 1, "tags": "t1" } },
  { "timestamp": "1970-01-01T00:00:00.000Z", "event": { "count": 1, "tags": "t2" } },
  { "timestamp": "1970-01-01T00:00:00.000Z", "event": { "count": 2, "tags": "t3" } },
  { "timestamp": "1970-01-01T00:00:00.000Z", "event": { "count": 1, "tags": "t4" } },
  { "timestamp": "1970-01-01T00:00:00.000Z", "event": { "count": 1, "tags": "t5" } }
]
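This expansion behavior can be reproduced with a short Python sketch over the same four rows (local illustration only, not Druid code):

```python
rows = [
    {"timestamp": "2011-01-12", "tags": ["t1", "t2", "t3"]},  # row1
    {"timestamp": "2011-01-13", "tags": ["t3", "t4", "t5"]},  # row2
    {"timestamp": "2011-01-14", "tags": ["t5", "t6", "t7"]},  # row3
    {"timestamp": "2011-01-14", "tags": []},                  # row4
]

def group_by_multivalue(rows, dimension, selector_value):
    """A selector filter matches a row if ANY of its values matches; the
    groupBy then emits one group per value of the matching rows."""
    counts = {}
    for row in rows:
        if selector_value in row[dimension]:
            for v in row[dimension]:
                counts[v] = counts.get(v, 0) + 1
    return counts
```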
Problems Encountered and Solutions
Problem 1: querying in the Beijing time zone
Solution: set the timeZone of a period granularity:
"granularity": { "period": "PT1H", "timeZone": "+08:00", "type": "period" }
Problem 2: getting the latest build time of a dataSource
Solution: set requireMessageWatermark=true in the context, then read the value from the header of the HTTP response:
if (HttpStatus.SC_OK == httpResponse.getStatusLine().getStatusCode()) {
    HttpEntity entity = httpResponse.getEntity();
    result.setResultStr(EntityUtils.toString(entity, "utf-8"));
    String waterMark = httpResponse.getHeaders(DRUID_WATERMARK_HEADER)[0].getValue();
    if (StringUtil.isNotEmpty(waterMark)) {
        JSONObject obj = JSONObject.parseObject(waterMark);
        waterMark = obj.getString(DRUID_MESSAGE_WATERMARK);
        result.setDruidWatermarkStr(waterMark);
    } // end if
}
Problem 3: setting different priorities for different queries
Solution: set the priority property in the context.
Problem 4: setting a query timeout
Solution: set the timeout property in the context.
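Both solutions amount to merging a context block into the query JSON. A small Python helper sketch (the helper itself is hypothetical; priority and timeout are the context keys named above):

```python
def with_context(query, priority=None, timeout_ms=None):
    """Return a copy of the query with priority/timeout set in its context."""
    ctx = dict(query.get("context", {}))
    if priority is not None:
        ctx["priority"] = priority
    if timeout_ms is not None:
        ctx["timeout"] = timeout_ms
    return {**query, "context": ctx}

q = with_context({"queryType": "timeseries", "dataSource": "test"},
                 priority=10, timeout_ms=5000)
```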
Problem 5: returning only the wanted values in a multi-value groupBy
Solution: use a listFiltered dimension spec to keep only the wanted values of the multi-value column:
{
  "queryType": "groupBy",
  "dataSource": "test",
  "intervals": [ "1970-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z" ],
  "filter": { "type": "selector", "dimension": "tags", "value": "t3" },
  "granularity": { "type": "all" },
  "dimensions": [
    {
      "type": "listFiltered",
      "delegate": { "type": "default", "dimension": "tags", "outputName": "tags" },
      "values": ["t3"]
    }
  ],
  "aggregations": [ { "type": "count", "name": "count" } ]
}
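In terms of the earlier multi-value simulation, a listFiltered dimension spec simply restricts which values of the matching rows are emitted as groups. A self-contained Python sketch (illustration only):

```python
def list_filtered_group_by(rows, dimension, selector_value, allow):
    """Multi-value groupBy where only values in `allow` are emitted as groups,
    mimicking the listFiltered dimension spec above."""
    counts = {}
    for row in rows:
        if selector_value in row[dimension]:
            for v in row[dimension]:
                if v in allow:
                    counts[v] = counts.get(v, 0) + 1
    return counts

rows = [
    {"tags": ["t1", "t2", "t3"]},  # row1
    {"tags": ["t3", "t4", "t5"]},  # row2
    {"tags": ["t5", "t6", "t7"]},  # row3
    {"tags": []},                  # row4
]
```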