Hive - create hive table from specific data of three csv files in hdfs


Question


I have three .csv files, each in a different HDFS directory. I now want to make a Hive internal table with data from those three files: four columns from the first file, three columns from the second file, and two columns from the third file. The first file shares a unique id column with the second file, and the second file shares another unique id column with the third file. Both unique ids are present in the second file; using these ids I would like to left-outer-join the files into one table.


file 1: '/directory_1/sub_directory_1/table1_data_on_01_01_2014.csv'
file 2: '/directory_2/sub_directory_2/table2_data_on_01_01_2014.csv'
file 3: '/directory_3/sub_directory_3/table3_data_on_01_01_2014.csv'

Contents of file 1:

unique_id_1,age,department,reason_of_visit,--more columns--,,,
id_11,entry_12,entry_13,entry_14,--more entries--
id_12,entry_22,entry_23,entry_24,--more entries--
id_13,entry_32,entry_33,entry_34,--more entries--

Contents of file 2:

unique_id_1,date_of_transaction,transaction_fee,unique_id_2,--more columns--,,,
id_11,entry_121,entry_131,id_21,--more entries--
id_12,entry_221,entry_231,id_22,--more entries--
id_13,entry_321,entry_331,id_23,--more entries--

Contents of file 3:

unique_id_2,diagnosis,gender,--more columns--,,,
id_21,entry_141,entry_151,--more entries--
id_22,entry_241,entry_151,--more entries--
id_23,entry_341,entry_151,--more entries--


I now want to make an internal table like this:

unique_id_1 age department reason_of_visit date_of_transaction transaction_fee unique_id_2 diagnosis gender
id_11 entry_12 entry_13 entry_14 entry_121 entry_131 id_21 entry_141 entry_151
id_12 entry_22 entry_23 entry_24 entry_221 entry_231 id_22 entry_241 entry_251
id_13 entry_32 entry_33 entry_34 entry_321 entry_331 id_23 entry_341 entry_251

How do I do this?

Answer


@Naveen Kumar The solution here is to create external tables for your 3 sources. Next, create a combined internal table with the schema for the columns you need from the 3 sources. I call these temp or staging tables. Once these staging tables are created, you should be able to do your joined select as an INSERT INTO combined_table SELECT ...
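A minimal HiveQL sketch of the approach above, assuming the files are comma-delimited with a header row. The column names come from the question; the staging/combined table names, the column types, and the `--more columns--` omissions are placeholders you would fill in for your actual data:

```sql
-- External (staging) tables pointing at the existing HDFS directories.
-- LOCATION takes the directory, not the file; skip.header.line.count
-- drops the header row.
CREATE EXTERNAL TABLE staging_table1 (
  unique_id_1 STRING,
  age INT,
  department STRING,
  reason_of_visit STRING
  -- more columns --
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/directory_1/sub_directory_1/'
TBLPROPERTIES ('skip.header.line.count'='1');

CREATE EXTERNAL TABLE staging_table2 (
  unique_id_1 STRING,
  date_of_transaction STRING,
  transaction_fee STRING,
  unique_id_2 STRING
  -- more columns --
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/directory_2/sub_directory_2/'
TBLPROPERTIES ('skip.header.line.count'='1');

CREATE EXTERNAL TABLE staging_table3 (
  unique_id_2 STRING,
  diagnosis STRING,
  gender STRING
  -- more columns --
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/directory_3/sub_directory_3/'
TBLPROPERTIES ('skip.header.line.count'='1');

-- Internal (managed) table holding only the columns you need.
CREATE TABLE combined_table (
  unique_id_1 STRING,
  age INT,
  department STRING,
  reason_of_visit STRING,
  date_of_transaction STRING,
  transaction_fee STRING,
  unique_id_2 STRING,
  diagnosis STRING,
  gender STRING
);

-- Populate it with left outer joins on the two id columns.
INSERT INTO TABLE combined_table
SELECT t1.unique_id_1, t1.age, t1.department, t1.reason_of_visit,
       t2.date_of_transaction, t2.transaction_fee, t2.unique_id_2,
       t3.diagnosis, t3.gender
FROM staging_table1 t1
LEFT OUTER JOIN staging_table2 t2 ON t1.unique_id_1 = t2.unique_id_1
LEFT OUTER JOIN staging_table3 t3 ON t2.unique_id_2 = t3.unique_id_2;
```

Because the external tables are only metadata over the existing CSV directories, dropping them afterwards does not delete the source files, while `combined_table` is fully managed by Hive.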

