Sqoop: importing directly from MySQL into a Hive table

Hive Sqoop MySQL · Published 2019-02-19 10:53:11


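The MySQL database has grown too large for data analysis, so the data needs to move from MySQL to Hadoop.

1. Problems encountered

When moving data from MySQL into Hive I wanted to use the Parquet format, but it never succeeded and kept reporting:

Hive import and create hive table is not compatible with importing into ParquetFile format.

Neither approach worked with sqoop: importing from MySQL straight into Hive, or exporting MySQL to Parquet files first and then loading those files into a Hive external table.

Saving as Avro fails in the same way.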
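A commonly used workaround, which this post did not try, is to let sqoop import into a text-format table and then convert the data inside Hive with a CREATE TABLE ... STORED AS PARQUET AS SELECT statement. The following is only a sketch under assumptions: tanktest.track_app is the text table created in section 3.3, track_app_parquet is a hypothetical name for the converted copy, and the statement is merely printed unless RUN_HIVE=1 is set.

```shell
# Assumptions: tanktest.track_app is the text-format table imported by sqoop;
# track_app_parquet is a hypothetical name for the Parquet copy.
SRC_TABLE="tanktest.track_app"
DST_TABLE="tanktest.track_app_parquet"

# CTAS: Hive reads the text rows and writes them back out as Parquet files.
HQL="CREATE TABLE ${DST_TABLE} STORED AS PARQUET AS SELECT * FROM ${SRC_TABLE};"

# Show the statement for review.
echo "$HQL"

# Execute only on request, so the snippet is harmless on a machine without Hive.
if [ "${RUN_HIVE:-0}" = "1" ]; then
    hive -e "$HQL"
fi
```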

2. Installing sqoop

Download: http://mirrors.shu.edu.cn/apache/sqoop/1.4.7/

# tar zxvf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
# cp -r sqoop-1.4.7.bin__hadoop-2.6.0 /bigdata/sqoop

3. Configuring sqoop

3.1 Setting the user environment variables

# cd ~
# vim .bashrc
export SQOOP_HOME=/bigdata/sqoop
export PATH=$ZOOKEEPER_HOME/bin:$SPARK_HOME/bin:$HIVE_HOME/bin:/bigdata/hadoop/bin:$SQOOP_HOME/bin:$PATH

# source .bashrc
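A quick way to confirm that the new PATH entry took effect is to test whether the sqoop bin directory appears in PATH. This is a small sketch; path_contains is a helper name made up for this example, and /bigdata/sqoop is the install path used in this post.

```shell
# Returns 0 when directory $1 is one of the entries of the colon-separated list $2.
path_contains() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Fall back to the install path used in this post when SQOOP_HOME is not set.
SQOOP_HOME="${SQOOP_HOME:-/bigdata/sqoop}"

if path_contains "$SQOOP_HOME/bin" "$PATH"; then
    echo "sqoop bin directory is on PATH"
else
    echo "sqoop bin directory is missing from PATH; re-check .bashrc"
fi
```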

3.2 Configuring sqoop-env.sh

# vim /bigdata/sqoop/conf/sqoop-env.sh

#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/bigdata/hadoop

#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/bigdata/hadoop

#set the path to where bin/hbase is available
#export HBASE_HOME=

#Set the path to where bin/hive is available
export HIVE_HOME=/bigdata/hive
export HIVE_CONF_DIR=/bigdata/hive/conf  # must be set, otherwise sqoop reports that HiveConf cannot be found

#Set the path for where zookeeper config dir is
export ZOOCFGDIR=/bigdata/zookeeper/conf
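Since a wrong path in sqoop-env.sh only shows up later as a confusing runtime error, it can help to check the configured directories first. A minimal sketch, assuming the paths used in this post; missing_dirs is a helper name invented for this example.

```shell
# Prints every directory from the argument list that does not exist.
# Returns 1 when at least one directory is missing, 0 otherwise.
missing_dirs() {
    status=0
    for d in "$@"; do
        if [ ! -d "$d" ]; then
            echo "missing: $d"
            status=1
        fi
    done
    return $status
}

# Paths as configured in sqoop-env.sh above; adjust them to the local layout.
missing_dirs /bigdata/hadoop /bigdata/hive /bigdata/hive/conf /bigdata/zookeeper/conf \
    || echo "fix the paths in sqoop-env.sh before running sqoop"
```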

3.3 Importing the data

# sqoop import \
--connect jdbc:mysql://10.0.0.237:3306/bigdata \
--username root \
--password ******* \
--table track_app \
-m 1 \
--warehouse-dir /user/hive/warehouse/tanktest.db \
--hive-database tanktest \
--create-hive-table \
--hive-import \
--hive-table track_app

The import now works, but after the Hive import the files stored on HDFS are in plain text format.

hive> describe formatted track_app;
OK
# col_name data_type comment 

id int
log_date int
log_time int
user_id int
ticket string 

# Detailed Table Information
Database: tanktest
Owner: root
CreateTime: Fri Feb 15 18:08:55 CST 2019
LastAccessTime: UNKNOWN
Retention: 0
Location: hdfs://bigserver1:9000/user/hive/warehouse/tanktest.db/track_app
Table Type: MANAGED_TABLE
Table Parameters:
 comment Imported by sqoop on 2019/02/15 18:08:42
 numFiles 1
 numRows 0
 rawDataSize 0
 totalSize 208254
 transient_lastDdlTime 1550225337 

# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat //text format; the files can also be opened and read directly on HDFS
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
 field.delim \u0001
 line.delim \n
 serialization.format \u0001
Time taken: 0.749 seconds, Fetched: 56 row(s)

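Note:

After importing into the Hive table, run a SQL query (through hive or spark-sql) to check whether the data can be read. If the query returns no rows, the import did not succeed.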
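One way to make that check concrete is to compare the row count in MySQL with the row count in Hive. The sketch below assumes the mysql and hive command-line clients are installed and reuses the host, database and table names from section 3.3; check_counts is a helper name invented here, and the real queries run only when RUN_CHECK=1 is set.

```shell
# Compares two row counts and reports whether the import looks complete.
# Returns 1 on a mismatch.
check_counts() {
    src="$1"
    dst="$2"
    if [ "$src" -eq "$dst" ]; then
        echo "OK: $dst rows in Hive match MySQL"
    else
        echo "MISMATCH: MySQL has $src rows, Hive has $dst"
        return 1
    fi
}

# Run the real queries only on request. mysql prompts for the password (-p);
# -N and -B strip the header and table borders so only the number is printed.
if [ "${RUN_CHECK:-0}" = "1" ]; then
    src_rows=$(mysql -h 10.0.0.237 -u root -p -N -B -e 'SELECT COUNT(*) FROM bigdata.track_app')
    dst_rows=$(hive -S -e 'SELECT COUNT(*) FROM tanktest.track_app')
    check_counts "$src_rows" "$dst_rows"
fi
```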

4. sqoop arguments

Import and export arguments explained

Common arguments:

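--connect <jdbc-uri>: the JDBC connection string for the RDBMS, for example --connect jdbc:mysql://MYSQL_SERVER:PORT/DBNAME.

--connection-manager <class-name>: the connection manager class to use.

--hadoop-home <hdir>: overrides the Hadoop home directory taken from the environment.

--username <username>: the user name for connecting to the RDBMS.

--password <password>: the password for connecting to the RDBMS, given in plain text on the command line.

--password-file <password-file>: reads the password from a file.

-P: prompts for the RDBMS password interactively on the console.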
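Because --password leaves the password in the shell history and the process list, --password-file is often the safer choice. The sketch below shows one way to create such a file; the file location and the MYSQL_PW variable with its placeholder value are assumptions made for this example. Sqoop uses the entire file content as the password, so the file must not end with a newline.

```shell
# Hypothetical location for the password file.
PW_FILE="${PW_FILE:-${HOME:-/tmp}/.sqoop-mysql.pw}"

# Remove an older read-only copy so the file can be rewritten.
rm -f "$PW_FILE"

# Create the file readable by the owner only.
umask 077

# printf writes the password without a trailing newline; a newline would
# become part of the password that sqoop sends to MySQL.
# "changeme" is a placeholder, not a real password.
printf '%s' "${MYSQL_PW:-changeme}" > "$PW_FILE"
chmod 400 "$PW_FILE"

# sqoop resolves --password-file against HDFS by default, so a local file
# is referenced with a file:// URI. These arguments replace "--password *******".
echo "--password-file file://$PW_FILE"
```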

Import control arguments:

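--append: appends the imported data to a dataset that already exists in HDFS.

--as-sequencefile: imports the data as SequenceFiles.

--as-textfile: imports the data as plain text (the default).

--columns <col,col,col…>: imports only the listed columns, comma-separated, for example --columns "id,name".

--delete-target-dir: deletes the import target directory if it already exists.

--direct: direct mode, which is faster (not supported for HBase).

--split-by: the column used to split the import among mappers. The primary key is used by default and is the recommended choice; the column must be given explicitly when the table has no primary key.

--inline-lob-limit <n>: sets the maximum size of a LOB that is stored inline.

--fetch-size <n>: reads n entries, i.e. n rows, from the database at a time.

-e,--query <statement>: imports the result of the SQL statement <statement>.

--target-dir <d>: the HDFS destination directory.

--warehouse-dir <d>: the HDFS parent directory under which the imported data is stored, for example --warehouse-dir /user/hive/warehouse/; it is created first if it does not exist.

--table <table-name>: the source table that is to be imported.

--where <where clause>: the WHERE clause to apply during the import. When a statement is wrapped in double quotes, remember to escape the placeholder as \$CONDITIONS (it is required by --query). OR, subqueries and joins cannot be used here.

-z,--compress: enables compression.

--null-string <null-string>: the string written in place of a null value in string columns.

--null-non-string <null-string>: the string written in place of a null value in non-string columns. Both --null-* arguments are optional; if they are not set, the string "null" is used.

--autoreset-to-one-mapper: falls back to a single mapper when the table has no primary key and no --split-by column is given (cannot be combined with --split-by).

-m,--num-mappers <n>: runs n map tasks in parallel for the import; the default is 4.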
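To show how several of these arguments fit together, here is a sketch of a free-form query import. The connection string and column names come from the track_app example earlier in the post; the date filter and the /tmp/track_app_query target directory are assumptions for illustration. The command is only printed unless RUN_SQOOP=1 is set.

```shell
# Single quotes keep the shell from expanding $CONDITIONS; inside double
# quotes it would have to be written as \$CONDITIONS.
QUERY='SELECT id, user_id, ticket FROM track_app WHERE log_date >= 20190101 AND $CONDITIONS'

# The positional parameters hold the argument list, so every value
# (including the query with its spaces) stays a single word.
# --query requires --target-dir, and --split-by when more than one mapper is used.
set -- import \
    --connect jdbc:mysql://10.0.0.237:3306/bigdata \
    --username root -P \
    --query "$QUERY" \
    --split-by id \
    --target-dir /tmp/track_app_query \
    -m 4

# Show the command for review.
echo sqoop "$@"

# Run it only on request.
if [ "${RUN_SQOOP:-0}" = "1" ]; then
    sqoop "$@"
fi
```

sqoop replaces $CONDITIONS with a different range condition on the --split-by column for each mapper, which is why the placeholder has to reach sqoop unexpanded.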

Incremental import arguments:

--check-column <column>: Source column to check for incremental change

--incremental <import-type>: Define an incremental import of type 'append' or 'lastmodified'

--last-value <value>: Last imported value in the incremental check column
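The three incremental arguments are normally used together. The sketch below imports only rows whose id is greater than the last imported value; the LAST_ID variable, its placeholder value 0 and the /tmp/track_app_incr directory are assumptions for this example, and the command is only printed unless RUN_SQOOP=1 is set.

```shell
# LAST_ID is the largest id that was already imported; 0 is a placeholder.
LAST_ID="${LAST_ID:-0}"

# append mode: import rows where the check column is greater than --last-value.
set -- import \
    --connect jdbc:mysql://10.0.0.237:3306/bigdata \
    --username root -P \
    --table track_app \
    --target-dir /tmp/track_app_incr \
    --incremental append \
    --check-column id \
    --last-value "$LAST_ID" \
    -m 1

# Show the command for review.
echo sqoop "$@"

# Run it only on request.
if [ "${RUN_SQOOP:-0}" = "1" ]; then
    sqoop "$@"
fi
```

At the end of an incremental import sqoop prints the value to pass as --last-value on the next run.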

Hive arguments:

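--create-hive-table: infers the column types and creates the table directly. --hive-overwrite can be used in its place; the table must not already exist in Hive, otherwise the job fails.

--hive-database <database-name>: the Hive database to import the HDFS data into.

--hive-table <table-name>: the table name to use in Hive.

--hive-delims-replacement <arg>: replaces \n, \r and \01 in string fields with a user-defined string when importing into Hive.

--hive-drop-import-delims: drops \n, \r and \01 from string fields when importing into Hive, so that embedded line breaks or delimiters do not split or truncate the imported rows.

--hive-home <dir>: the Hive installation directory (overrides $HIVE_HOME).

--hive-import: imports the HDFS data into Hive, creating the Hive table automatically and using Hive's default delimiters.

--hive-overwrite: overwrites the data in the Hive table (must be used together with --hive-import; the table is created first if it does not exist in Hive). Without it, data is appended.

--hive-partition-key <partition-key>: the Hive partition key.

--hive-partition-value <partition-value>: the Hive partition value.

--map-column-hive <arg>: overrides the mapping from SQL types to Hive types for the given columns.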
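As an illustration of the Hive arguments, the sketch below re-imports track_app and replaces the existing table contents. The choice of flags is an assumption about a typical refresh job rather than something taken from the original import; the command is only printed unless RUN_SQOOP=1 is set.

```shell
# Refresh an existing Hive table: overwrite its data, strip delimiter
# characters from string fields, and pin the Hive type of the ticket column.
set -- import \
    --connect jdbc:mysql://10.0.0.237:3306/bigdata \
    --username root -P \
    --table track_app \
    --hive-import \
    --hive-overwrite \
    --hive-database tanktest \
    --hive-table track_app \
    --hive-drop-import-delims \
    --map-column-hive ticket=STRING \
    -m 1

# Show the command for review.
echo sqoop "$@"

# Run it only on request.
if [ "${RUN_SQOOP:-0}" = "1" ]; then
    sqoop "$@"
fi
```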

HBase arguments:

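--column-family <family>: the target column family when importing into HBase. By default the split-by column, or failing that the primary key, is used as the row key.

--hbase-create-table: creates the HBase table if it is missing.

--hbase-row-key <col>: the column to use as the row key; if the input table has a composite primary key, list the columns separated by commas.

--hbase-table <table-name>: the HBase table to import into.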
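Put together, an HBase import could look like the sketch below. It assumes HBASE_HOME has been set in sqoop-env.sh (it is commented out in section 3.2); the HBase table name track_app and the column family info are hypothetical. The command is only printed unless RUN_SQOOP=1 is set.

```shell
# Import track_app into HBase: each MySQL row becomes one HBase row keyed by id,
# with the remaining columns stored in the "info" column family (hypothetical name).
set -- import \
    --connect jdbc:mysql://10.0.0.237:3306/bigdata \
    --username root -P \
    --table track_app \
    --hbase-table track_app \
    --column-family info \
    --hbase-row-key id \
    --hbase-create-table \
    -m 1

# Show the command for review.
echo sqoop "$@"

# Run it only on request.
if [ "${RUN_SQOOP:-0}" = "1" ]; then
    sqoop "$@"
fi
```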