Spark reading WARC file with custom InputFormat


Problem description

I need to process a .warc file through Spark, but I can't seem to find a straightforward way of doing so. I would prefer to use Python and to not read the whole file into an RDD through wholeTextFiles() (because the whole file would be processed at a single node(?)); therefore, it seems the only/best way is through a custom Hadoop InputFormat used with .hadoopFile() in Python.

However, I could not find an easy way of doing this. Splitting a .warc file into entries is as simple as splitting on \n\n\n; so how can I achieve this without writing a ton of extra (useless) code, as shown in various "tutorials" online? Can it all be done in Python?

i.e., how can I split a .warc file into entries without reading the whole thing with wholeTextFiles()?

Solution

If the delimiter is \n\n\n, you can use textinputformat.record.delimiter:

sc.newAPIHadoopFile(
  path,
  # Read with the standard line-oriented TextInputFormat...
  'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
  'org.apache.hadoop.io.LongWritable',  # key class: byte offset of each record
  'org.apache.hadoop.io.Text',          # value class: the record text
  # ...but override the record delimiter, so each \n\n\n-separated
  # entry becomes one record instead of one line.
  conf={'textinputformat.record.delimiter': '\n\n\n'}
)
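
For context, here is a minimal usage sketch, assuming a running SparkContext and a hypothetical input path. newAPIHadoopFile returns an RDD of (key, value) pairs, where the key is the LongWritable byte offset of the record and the value is its Text content, so the entries themselves are the values:

from pyspark import SparkContext

sc = SparkContext(appName='warc-entries')

# 'data/example.warc' is a hypothetical path; any Hadoop-readable URI works.
rdd = sc.newAPIHadoopFile(
  'data/example.warc',
  'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
  'org.apache.hadoop.io.LongWritable',
  'org.apache.hadoop.io.Text',
  conf={'textinputformat.record.delimiter': '\n\n\n'}
)

# Drop the byte-offset keys; each remaining element is one entry's text.
entries = rdd.map(lambda kv: kv[1])
print(entries.take(1))

Because each input split is processed independently, the entries are read in parallel across the cluster, rather than on a single node as with wholeTextFiles().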
