Spark reading WARC file with custom InputFormat


Problem description

I need to process a .warc file through Spark, but I can't seem to find a straightforward way of doing so. I would prefer to use Python and to not read the whole file into an RDD through wholeTextFiles() (because the whole file would be processed at a single node(?)); therefore, it seems the only/best way is through a custom Hadoop InputFormat used with .hadoopFile() in Python.

However, I could not find an easy way of doing this. Splitting a .warc file into entries is as simple as splitting on \n\n\n; so how can I achieve this without writing a ton of extra (useless) code, as shown in various "tutorials" online? Can it all be done in Python?

i.e., how can I split a .warc file into entries without reading the whole thing with wholeTextFiles()?

Solution

If the delimiter is \n\n\n, you can use textinputformat.record.delimiter:

sc.newAPIHadoopFile(
  path,
  # Read with the standard line-oriented TextInputFormat...
  'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
  'org.apache.hadoop.io.LongWritable',  # key class: byte offset of each record
  'org.apache.hadoop.io.Text',          # value class: the record text
  # ...but override the record delimiter, so each \n\n\n-separated
  # entry becomes one record instead of one line.
  conf={'textinputformat.record.delimiter': '\n\n\n'}
)
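
For context, here is a minimal usage sketch, assuming a running SparkContext and a hypothetical input path. newAPIHadoopFile returns an RDD of (key, value) pairs, where the key is the LongWritable byte offset of the record and the value is its Text content, so the entries themselves are the values:

from pyspark import SparkContext

sc = SparkContext(appName='warc-entries')

# 'data/example.warc' is a hypothetical path; any Hadoop-readable URI works.
rdd = sc.newAPIHadoopFile(
  'data/example.warc',
  'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
  'org.apache.hadoop.io.LongWritable',
  'org.apache.hadoop.io.Text',
  conf={'textinputformat.record.delimiter': '\n\n\n'}
)

# Drop the byte-offset keys; each remaining element is one entry's text.
entries = rdd.map(lambda kv: kv[1])
print(entries.take(1))

Because each input split is processed independently, the entries are read in parallel across the cluster, rather than on a single node as with wholeTextFiles().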
