Error while mapping the data from CSV file to a Hive table on HDFS


Problem description

I am trying to load a dataframe into a Hive table by following the steps below:

  1. Read the source table and save the dataframe as a CSV file on HDFS

val yearDF = spark.read.format("jdbc")
  .option("url", connectionUrl)
  .option("dbtable", s"(${execQuery}) as year2016")
  .option("user", devUserName)
  .option("password", devPassword)
  .option("partitionColumn", "header_id")
  .option("lowerBound", 199199)
  .option("upperBound", 284058)
  .option("numPartitions", 10)
  .load()

  • Order the columns as per my Hive table columns. My Hive table columns are present in a string in the format:

    val hiveCols = "col1:coldatatype|col2:coldatatype|col3:coldatatype|col4:coldatatype...col200:datatype"
    val schemaList        = hiveCols.split("\\|")
    val hiveColumnOrder   = schemaList.map(e => e.split("\\:")).map(e => e(0)).toSeq
    val finalDF           = yearDF.selectExpr(hiveColumnOrder:_*)
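
For illustration, here is how the parsing above behaves on a small, hypothetical schema string (the three column names below are made up to match the question's data; the real string has 200 entries):

```scala
// Hypothetical schema string in the same "name:type|name:type" format as hiveCols
val hiveCols = "header_id:bigint|line_num:int|debit_rate:decimal(38,30)"

val schemaList      = hiveCols.split("\\|")
val hiveColumnOrder = schemaList.map(e => e.split("\\:")).map(e => e(0)).toSeq

// The declared column order is preserved, ready to feed into selectExpr
println(hiveColumnOrder.mkString(","))  // header_id,line_num,debit_rate
```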
    

    我在"execQuery"中读取的列顺序与"hiveColumnOrder"相同,为确保顺序,我再次使用selectExpr

    The order of columns that I read in "execQuery" are same as "hiveColumnOrder" and just to make sure of the order, I select the columns in yearDF once again using selectExpr

    Saving the dataframe as a CSV file on HDFS:

    newDF.write.format("CSV").save("hdfs://username/apps/hive/warehouse/database.db/lines_test_data56/")
    

  • Once I save the dataframe, I take the same columns from "hiveCols" and prepare a DDL to create a Hive table at the same location, with fields comma separated, as given below:

    create table if not exists schema.tablename(col1 coldatatype,col2 coldatatype,col3 coldatatype,col4 coldatatype...col200 datatype)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 'hdfs://username/apps/hive/warehouse/database.db/lines_test_data56/';

    After I load the dataframe into the created table, the problem I am facing is that when I query the table, I get improper output. For example, if I apply the below query on the dataframe before saving it as a file:

    finalDF.createOrReplaceTempView("tmpTable")
    spark.sql("select header_id,line_num,debit_rate,debit_rate_text,credit_rate,credit_rate_text,activity_amount,activity_amount_text,exchange_rate,exchange_rate_text,amount_cr,amount_cr_text from tmpTable where header_id=19924598 and line_num=2").show()
    

    I get the output properly, and all the values are correctly aligned to the columns:

    [19924598,2,null,null,381761.40000000000000000000,381761.4,-381761.40000000000000000000,-381761.4,0.01489610000000000000,0.014896100000000,5686.76000000000000000000,5686.76]
    

    But after saving the dataframe to a CSV file, creating a table on top of it (step 4), and applying the same query to the created table, I see the data is jumbled and improperly mapped to the columns:

    select header_id,line_num,debit_rate,debit_rate_text,credit_rate,credit_rate_text,activity_amount,activity_amount_text,exchange_rate,exchange_rate_text,amount_cr,amount_cr_text from schema.tablename where header_id=19924598 and line_num=2
    
    +-----------+----------+------------+-----------------+-------------+------------------+-----------------+----------------------+---------------+--------------------+-----------+----------------+
    | header_id | line_num | debit_rate | debit_rate_text | credit_rate | credit_rate_text | activity_amount | activity_amount_text | exchange_rate | exchange_rate_text | amount_cr | amount_cr_text |
    +-----------+----------+------------+-----------------+-------------+------------------+-----------------+----------------------+---------------+--------------------+-----------+----------------+
    | 19924598  | 2        | NULL       |                 | 381761.4    |                  | 5686.76         | 5686.76              | NULL          | -5686.76           | NULL      |                |
    +-----------+----------+------------+-----------------+-------------+------------------+-----------------+----------------------+---------------+--------------------+-----------+----------------+
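
A plausible mechanism for this jumbling, sketched in plain Scala with made-up values (it is an assumption that the text columns contain commas, e.g. formatted amounts): Spark's CSV writer quotes any field that contains the delimiter, but a table declared with FIELDS TERMINATED BY ',' uses LazySimpleSerDe, which splits on every comma and does not interpret the quotes, so one quoted value spills across two columns and shifts everything after it.

```scala
// Made-up CSV line: the third field is a formatted amount containing a comma,
// so Spark's CSV writer wraps it in quotes when saving.
val csvLine = "19924598,2,\"5,686.76\",0.0148961"

// A naive split on ',' (what FIELDS TERMINATED BY ',' effectively does)
// ignores the quotes, turning the 4 written fields into 5 columns:
val naiveSplit = csvLine.split(",")
println(naiveSplit.length)  // 5 -- the quoted value spilled into two columns
```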
    

    So I tried a different approach, where I created the Hive table upfront and inserted data into it from the dataframe:

    • Run the DDL from step 4 above
    • finalDF.createOrReplaceTempView("tmpTable")
    • spark.sql("insert into schema.table select * from tmpTable")

    Even this way fails if I run the aforementioned select query once the job is completed. I tried refreshing the table using refresh table schema.table and msck repair table schema.table, just to see if there was any problem with the metadata, but nothing seems to work.

    Could anyone let me know what is causing this phenomenon? Is there any problem with the way I am handling the data here?

    Recommended answer

    I used the row format SerDe org.apache.hadoop.hive.serde2.OpenCSVSerde in the Hive DDL. It also has ',' as its default separator char, so I didn't have to give any other delimiter.
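
For reference, a sketch of what the resulting DDL could look like (column list abbreviated exactly as in the question; substitute your real columns and types):

```sql
create table if not exists schema.tablename(col1 coldatatype,col2 coldatatype,col3 coldatatype,col4 coldatatype...col200 datatype)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION 'hdfs://username/apps/hive/warehouse/database.db/lines_test_data56/';
```

One caveat worth noting: OpenCSVSerde exposes every column as STRING regardless of the declared type, so queries that compare or aggregate numeric columns may need explicit casts.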

