Spark reading WARC file with custom InputFormat
Problem Description
I need to process a .warc file through Spark, but I can't seem to find a straightforward way of doing so. I would prefer to use Python, and to not read the whole file into an RDD through wholeTextFiles() (because the whole file would then be processed at a single node?), so it seems the only/best way is through a custom Hadoop InputFormat used with .hadoopFile() in Python.
However, I could not find an easy way of doing this. Splitting a .warc file into entries is as simple as splitting on \n\n\n; so how can I achieve this without writing a ton of extra (useless) code, as shown in various "tutorials" online? Can it all be done in Python?

i.e., how can I split a WARC file into entries without reading the whole thing with wholeTextFiles?
If the delimiter is \n\n\n, you can use textinputformat.record.delimiter:
sc.newAPIHadoopFile(
    path,
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={'textinputformat.record.delimiter': '\n\n\n'}
)
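To illustrate what this delimiter setting does, here is a plain-Python sketch (no Spark required) of the splitting that Hadoop's TextInputFormat performs when textinputformat.record.delimiter is set: the input stream is cut at every occurrence of the delimiter, and each piece becomes one record. The sample WARC-like records below are made up for illustration:

```python
# Plain-Python sketch of record splitting on a custom delimiter,
# mirroring what textinputformat.record.delimiter does inside Hadoop.

DELIMITER = "\n\n\n"

def split_records(stream: str):
    """Split raw text into records on the delimiter, dropping empty pieces."""
    return [r for r in stream.split(DELIMITER) if r.strip()]

# A made-up two-record WARC-like sample:
sample = (
    "WARC/1.0\nWARC-Type: response\nWARC-Target-URI: http://example.com/\n"
    "\nHello, world."
    + DELIMITER +
    "WARC/1.0\nWARC-Type: response\nWARC-Target-URI: http://example.org/\n"
    "\nAnother page."
)

records = split_records(sample)
print(len(records))                 # 2
print(records[0].splitlines()[0])   # WARC/1.0
```

In the Spark case, each such record arrives as the value of a (LongWritable, Text) pair, so entries are distributed across partitions rather than landing on one node as with wholeTextFiles().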