Error while mapping the data from CSV file to a Hive table on HDFS


Problem description

I am trying to load a dataframe into a Hive table by following the steps below:

  1. Read the source table and save the dataframe as a CSV file on HDFS

val yearDF = spark.read.format("jdbc")
  .option("url", connectionUrl)
  .option("dbtable", s"(${execQuery}) as year2016")
  .option("user", devUserName)
  .option("password", devPassword)
  .option("partitionColumn", "header_id")
  .option("lowerBound", 199199)
  .option("upperBound", 284058)
  .option("numPartitions", 10)
  .load()

  • Order the columns as per my Hive table columns. My Hive table columns are present in a string in the format:

    val hiveCols = "col1:coldatatype|col2:coldatatype|col3:coldatatype|col4:coldatatype...col200:datatype"
    val schemaList        = hiveCols.split("\\|")
    val hiveColumnOrder   = schemaList.map(e => e.split("\\:")).map(e => e(0)).toSeq
    val finalDF           = yearDF.selectExpr(hiveColumnOrder:_*)
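
For illustration, here is how the parsing above behaves on a small, hypothetical schema string (the three column names below are made up to match the question's data; the real string has 200 entries):

```scala
// Hypothetical schema string in the same "name:type|name:type" format as hiveCols
val hiveCols = "header_id:bigint|line_num:int|debit_rate:decimal(38,30)"

val schemaList      = hiveCols.split("\\|")
val hiveColumnOrder = schemaList.map(e => e.split("\\:")).map(e => e(0)).toSeq

// The declared column order is preserved, ready to feed into selectExpr
println(hiveColumnOrder.mkString(","))  // header_id,line_num,debit_rate
```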
    

    我在"execQuery"中读取的列顺序与"hiveColumnOrder"相同,为确保顺序,我再次使用selectExpr

    The order of columns that I read in "execQuery" are same as "hiveColumnOrder" and just to make sure of the order, I select the columns in yearDF once again using selectExpr

    Saving the dataframe as a CSV file on HDFS:

    newDF.write.format("CSV").save("hdfs://username/apps/hive/warehouse/database.db/lines_test_data56/")
    

  • Once I save the dataframe, I take the same columns from "hiveCols" and prepare a DDL to create a Hive table at the same location, with fields comma separated, as given below:

    create table if not exists schema.tablename(col1 coldatatype,col2 coldatatype,col3 coldatatype,col4 coldatatype...col200 datatype)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 'hdfs://username/apps/hive/warehouse/database.db/lines_test_data56/';

    After I load the dataframe into the created table, the problem I am facing is that when I query the table, I get improper output. For example, if I apply the below query on the dataframe before saving it as a file:

    finalDF.createOrReplaceTempView("tmpTable")
    spark.sql("select header_id,line_num,debit_rate,debit_rate_text,credit_rate,credit_rate_text,activity_amount,activity_amount_text,exchange_rate,exchange_rate_text,amount_cr,amount_cr_text from tmpTable where header_id=19924598 and line_num=2").show()
    

    I get the output properly, and all the values are correctly aligned to the columns:

    [19924598,2,null,null,381761.40000000000000000000,381761.4,-381761.40000000000000000000,-381761.4,0.01489610000000000000,0.014896100000000,5686.76000000000000000000,5686.76]
    

    But after saving the dataframe to a CSV file, creating a table on top of it (step 4), and applying the same query to the created table, I see the data is jumbled and improperly mapped to the columns:

    select header_id,line_num,debit_rate,debit_rate_text,credit_rate,credit_rate_text,activity_amount,activity_amount_text,exchange_rate,exchange_rate_text,amount_cr,amount_cr_text from schema.tablename where header_id=19924598 and line_num=2
    
    +-----------+----------+------------+-----------------+-------------+------------------+-----------------+----------------------+---------------+--------------------+-----------+----------------+
    | header_id | line_num | debit_rate | debit_rate_text | credit_rate | credit_rate_text | activity_amount | activity_amount_text | exchange_rate | exchange_rate_text | amount_cr | amount_cr_text |
    +-----------+----------+------------+-----------------+-------------+------------------+-----------------+----------------------+---------------+--------------------+-----------+----------------+
    | 19924598  | 2        | NULL       |                 | 381761.4    |                  | 5686.76         | 5686.76              | NULL          | -5686.76           | NULL      |                |
    +-----------+----------+------------+-----------------+-------------+------------------+-----------------+----------------------+---------------+--------------------+-----------+----------------+
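
A plausible mechanism for this jumbling, sketched in plain Scala with made-up values (it is an assumption that the text columns contain commas, e.g. formatted amounts): Spark's CSV writer quotes any field that contains the delimiter, but a table declared with FIELDS TERMINATED BY ',' uses LazySimpleSerDe, which splits on every comma and does not interpret the quotes, so one quoted value spills across two columns and shifts everything after it.

```scala
// Made-up CSV line: the third field is a formatted amount containing a comma,
// so Spark's CSV writer wraps it in quotes when saving.
val csvLine = "19924598,2,\"5,686.76\",0.0148961"

// A naive split on ',' (what FIELDS TERMINATED BY ',' effectively does)
// ignores the quotes, turning the 4 written fields into 5 columns:
val naiveSplit = csvLine.split(",")
println(naiveSplit.length)  // 5 -- the quoted value spilled into two columns
```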
    

    So I tried a different approach, where I created the Hive table upfront and inserted data into it from the dataframe:

    • Run the DDL from step 4 above
    • finalDF.createOrReplaceTempView("tmpTable")
    • spark.sql("insert into schema.table select * from tmpTable")

    Even this way fails if I run the aforementioned select query once the job is completed. I tried refreshing the table using refresh table schema.table and msck repair table schema.table, just to see if there was any problem with the metadata, but nothing seems to work.

    Could anyone let me know what is causing this phenomenon? Is there any problem with the way I am handling the data here?

    Recommended answer

    I used the row format SerDe org.apache.hadoop.hive.serde2.OpenCSVSerde in the Hive DDL. It also has ',' as its default separator char, so I didn't have to give any other delimiter.
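
For reference, a sketch of what the resulting DDL could look like (column list abbreviated exactly as in the question; substitute your real columns and types):

```sql
create table if not exists schema.tablename(col1 coldatatype,col2 coldatatype,col3 coldatatype,col4 coldatatype...col200 datatype)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION 'hdfs://username/apps/hive/warehouse/database.db/lines_test_data56/';
```

One caveat worth noting: OpenCSVSerde exposes every column as STRING regardless of the declared type, so queries that compare or aggregate numeric columns may need explicit casts.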

