Process unstructured and multi-line CSV in Hadoop

Question

I would like to process data in Hadoop MapReduce. My data is in the format below: unstructured, with fields that span multiple lines and with unterminated quotation marks.

    2/1/2013 5:16,Edward Felton,2,8/1/2012 3:57,Working on all the digital elements for our big event in Sydney in a couple of weeks... for more visit http://www.xy.com/au/geworks/,324005862,2,18200695
    12/28/2012 19:28,Laura McCullum,2,7/26/2012 18:03,"The Day You Give Them Jive  <br>
<a href="http://youtu.be/qfq9LVD2Qr4" > http://youtu.be/qfq9LVD2Qr4 <br>
 <br>
'Like' if you have always wanted to destroy a cube!",502114904,2,18400313
    11/21/2012 13:35,Timothy Widdowson,4,8/17/2012 12:38,"Can a table really replace a laptop...

With the new Windows tablets on the horizon and the Apple / Android devices out there I have been wondering if it is possible to really work with just and tablet. 

My mission:
-For one whole week I will be working with just my iPad. 

Hardware:
-Apple iPad
-Apple keyboard.
-Apple to HDMI connector.
-HDMI capable monitor.
- InCase iPad stand.

:-)",105001439,1,19301609
    3/15/2013 13:43,Mary Romeo,3,8/16/2012 22:23,"HOW TO SHORTEN LONG LINKS YOU'RE POSTING <br>
The attached image describes how to shorten a long url before posting it.  In 4 easy steps the 3-4 line urls can become a tiny link to post.",213022329,1,19901561
    11/30/2012 2:17,Lu Yin Zhong,3,8/29/2012 1:29,working on 2013 comms plan...need big ideas!!,302014449,2,20300666
    3/5/2013 22:15,Tim Steigert,12,8/29/2012 15:36,"Looking up 1024 email addresses. Manually? Probably a day! Doing it with SSOget, the add-in for  #[&quot;excel&quot;]? 5 minutes! Effort saved and  #[&quot;productivity&quot;] gained? Priceless! Now go get it and enjoy it for yourself! :)<br>http://sc.xy.com/*SSOget @@@data@@@{&quot;image&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;}",100011871,11,20400713
    11/1/2012 20:46,Pranay Jain,2,8/30/2012 14:26,Do people agree with the iCloud restrictions that Airwatch will put on Personal iOS devices that have email?,212065316,0,20700913
    11/9/2012 18:32,Monica Sharma,5,9/7/2012 11:42,hhghghghghgh hg h gh gh gh gh gh gh ghghghghgghhhghghghghgh hg h gh gh gh gh gh gh ghghghghgghhhghghghghgh hg h gh gh gh gh gh gh ghghghghgghhhghghghghgh hg h gh gh gh gh gh gh ghghghghgghhhghghghghgh hg h gh gh gh gh gh gh ghghghghgghhhghghghghgh hg h gh gh gh gh gh gh ghghghghgghhhghghghghgh hg h gh gh gh gh gh gh ghghghghgghhhghghghghgh hg h gh gh gh gh gh gh ghghghghgghhhghghghghgh hg h gh gh,502000192,5,21400516

Please provide a code snippet showing how to handle this data. Thanks in advance!

Recommended answer

Because you are dealing with multi-line data, you cannot use a simple TextInputFormat to access it: TextInputFormat treats every physical line as a separate record, so a quoted field containing line breaks would be split across several records. You therefore need a custom InputFormat for CSV files.

Currently there is no built-in way of processing multi-line CSV files in Hadoop (see https://issues.apache.org/jira/browse/MAPREDUCE-2208), but luckily there is some code on GitHub you can try: https://github.com/mvallebr/CSVInputFormat.
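To illustrate what a custom input format has to do at its core, here is a rough plain-Java sketch of grouping physical lines into logical CSV records. It is not the code from the GitHub project above; the class and method names (MultiLineCsvSketch, groupLines) are hypothetical, and it assumes that quotes are balanced, so a record ends on the first line that leaves an even number of double quotes seen so far. Data with unterminated quotes would need to be cleaned up first. In a real Hadoop job this logic would sit inside a RecordReader, and the file would most likely have to be treated as non-splittable (for example by overriding FileInputFormat.isSplitable), since a split boundary can fall in the middle of a quoted field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MultiLineCsvSketch {

    /** Counts the double-quote characters in one physical line. */
    static int countQuotes(String line) {
        int count = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == '"') {
                count++;
            }
        }
        return count;
    }

    /**
     * Groups physical lines into logical records. A line with an odd number
     * of quotes toggles the "inside a quoted field" state; a record is
     * emitted whenever a line ends outside of a quoted field.
     */
    static List<String> groupLines(Iterable<String> lines) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean insideQuotes = false;
        for (String line : lines) {
            if (!insideQuotes && line.trim().isEmpty()) {
                continue; // skip blank lines between records
            }
            if (insideQuotes) {
                current.append('\n'); // keep the line break inside the field
            }
            current.append(line);
            if (countQuotes(line) % 2 != 0) {
                insideQuotes = !insideQuotes;
            }
            if (!insideQuotes) {
                records.add(current.toString());
                current.setLength(0);
            }
        }
        if (insideQuotes) {
            // Unterminated quote at end of input: emit what was collected.
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "1,Edward,plain text,10",
                "2,Laura,\"first line <br>",
                "",
                "last line\",20");
        for (String record : groupLines(lines)) {
            System.out.println("RECORD: " + record.replace("\n", "\\n"));
        }
    }
}
```

Counting quote parity is deliberately simple: it does not parse fields, it only decides where one record ends and the next begins, which is the part TextInputFormat gets wrong.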

As for the unterminated quotations, you may need to pre-process the data and clean it up first. One simple rule is to escape a quotation mark (") when there is no separator immediately before or after it:


  • escape: a"b => a\"b
  • leave unchanged: a;"b and a";b (here ; stands for the field separator)
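The rule above could be sketched in plain Java roughly as follows. The class and method names (QuoteEscapeSketch, escapeStrayQuotes) are hypothetical, the separator is passed in as a parameter (the sample data uses a comma), and treating the start and end of a line like a separator is an extra assumption that the rule itself does not state. Note also that the downstream CSV parser has to accept backslash-escaped quotes for this cleanup to help.

```java
public class QuoteEscapeSketch {

    /**
     * Escapes every double quote that has no separator directly before
     * or after it. Quotes next to a separator (or, by assumption, at the
     * very start or end of the line) are taken to be real field delimiters
     * and are left unchanged.
     */
    static String escapeStrayQuotes(String line, char separator) {
        StringBuilder out = new StringBuilder(line.length() + 8);
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                boolean sepBefore = i == 0 || line.charAt(i - 1) == separator;
                boolean sepAfter = i == line.length() - 1
                        || line.charAt(i + 1) == separator;
                if (!sepBefore && !sepAfter) {
                    out.append('\\'); // stray quote inside a field
                }
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeStrayQuotes("a\"b", ';'));   // a\"b
        System.out.println(escapeStrayQuotes("a;\"b", ';'));  // a;"b
        System.out.println(escapeStrayQuotes("a\";b", ';'));  // a";b
    }
}
```

This is a heuristic: a quote inside a field that happens to sit next to a separator character, or at the end of a physical line within a multi-line field, would still be mistaken for a delimiter, so the output is worth spot-checking on real data.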

Another option is to fix the application that produces the invalid CSV so that it escapes the data properly.
