hive从mysql导入数据量变多的解决方案

Z时代
2024-01-10
分类：IT

原始导数命令：


bin/sqoop import -connect jdbc:mysql://192.168.169.128:3306/yubei -username root -password 123456 -table yl_city_mgr_evt_info --split-by rec_id -m 4 --fields-terminated-by "\t" --lines-terminated-by "\n" --hive-import --hive-overwrite -create-hive-table -delete-target-dir -hive-database default -hive-table yl_city_mgr_evt_info

原因分析：可能是mysql中字段里面有'\n'等分隔符，导入hive时默认以'n'作换行符，导致hive中的记录数变多。

解决方法：

导入数据时加上--hive-drop-import-delims选项，会删除字段中的\n,\r,\01。

最终导数命令：


bin/sqoop import -connect jdbc:mysql://192.168.169.128:3306/yubei -username root -password 123456 -table yl_city_mgr_evt_info --split-by rec_id -m 4 --hive-drop-import-delims --fields-terminated-by "\t" --lines-terminated-by "\n" --hive-import --hive-overwrite -create-hive-table -delete-target-dir -hive-database default -hive-table yl_city_mgr_evt_info

参考官方文档：https://sqoop.apache.org/docs/1.4.7/SqoopUserGuide.html

补充：Sqoop导入MySQL数据到Hive遇到的坑

1.sqoop导入到HDFS

1.1执行sqoop job，会自动更新last value


# sqoop 增量导入脚本
bin/sqoop job --create sqoop_hdfs_test02 -- import \
--connect jdbc:mysql://localhost:3306/pactera_test \
--username root \
--password 123456 \
--table student \
--target-dir /user/sqoop/test002/ \
--fields-terminated-by "\t" \
--check-column last_modified \
--incremental lastmodified \
--last-value "2018-12-12 00:03:00" \
--append

说明：--append 参数是必须的，要不然第二次运行job 会报错，如下：