Open a WARC file with Python

Problem description

I'm trying to open a WARC file with Python using the warc library documented at the following link: http://warc.readthedocs.org/en/latest/

When I open the file:

import warc
f = warc.open("00.warc.gz")

Everything works, and the f object is:

<warc.warc.WARCFile instance at 0x1151d34d0>

However, when I try to read every record in the file using:

for record in f:
     print record['WARC-Target-URI'], record['Content-Length']

I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/xxx/anaconda/lib/python2.7/site-packages/warc/warc.py", line 390, in __iter__
    record = self.read_record()
  File "/Users/xxx/anaconda/lib/python2.7/site-packages/warc/warc.py", line 373, in read_record
    header = self.read_header(fileobj)
  File "/Users/xxx/anaconda/lib/python2.7/site-packages/warc/warc.py", line 331, in read_header
    raise IOError("Bad version line: %r" % version_line)
IOError: Bad version line: 'WARC/0.18\n'

Is this because the warc library I'm using does not support my WARC file's version, or is something else going on?

Recommended answer

The ClueWeb09 dataset is distributed in the WARC 0.18 format. However, it has several issues: some of its records are malformed.

The most prevalent problem is an extra newline in the WARC header. There are also a few cases of other malformed headers.
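One way to see this kind of malformation for yourself is to look at the raw bytes at the start of the file before handing it to any parser. The following is a minimal sketch using only the standard library, written for Python 3; the helper name peek_raw_lines and its count parameter are made up for illustration and are not part of the warc library.

```python
import gzip
import itertools


def peek_raw_lines(path, count=10):
    """Return the first `count` raw lines of a gzip-compressed file as bytes.

    Lines are kept undecoded so that the line terminators (b"\\n" or
    b"\\r\\n") and any stray blank lines remain visible.
    """
    with gzip.open(path, "rb") as stream:
        # In binary mode, iteration splits on b"\n" only, so a CRLF
        # terminator shows up as a trailing b"\r\n" on each line.
        return list(itertools.islice(stream, count))
```

Printing repr(line) for each returned line shows at a glance whether the version line ends in \r\n or in a bare \n, and whether blank lines appear where header fields are expected.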

Moreover, it does not use the standard \r\n end-of-line markers, which is actually your problem: the version line in your traceback, 'WARC/0.18\n', ends with a bare \n.
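To illustrate why a bare \n can trigger a "Bad version line" error, here is a small Python 3 sketch that contrasts a strict version-line check with a tolerant one. The strict pattern is an assumption about how a CRLF-only parser might behave, not code copied from the warc library, and the names STRICT_VERSION, TOLERANT_VERSION and parse_version are hypothetical.

```python
import re

# Assumption: a strict parser insists on CRLF after the version number.
STRICT_VERSION = re.compile(rb"WARC/(\d+\.\d+)\r\n")
# A tolerant variant accepts either CRLF or a bare LF.
TOLERANT_VERSION = re.compile(rb"WARC/(\d+\.\d+)\r?\n")


def parse_version(line, tolerant=False):
    """Return the WARC version string from a raw version line (bytes).

    Raises IOError when the line does not match the chosen pattern,
    mirroring the error message shown in the traceback.
    """
    pattern = TOLERANT_VERSION if tolerant else STRICT_VERSION
    match = pattern.fullmatch(line)
    if match is None:
        raise IOError("Bad version line: %r" % line)
    return match.group(1).decode("ascii")
```

With the strict pattern, b"WARC/0.18\n" is rejected even though the version itself is well formed; with tolerant=True the same line is accepted. A reader built for ClueWeb09 needs some form of this leniency.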

The warc-clueweb library can handle it. It is a Python library written specifically for working with ClueWeb09 WARC files. According to its documentation:

Only minor modifications to the original library were made. The original documentation of the warc library still holds.
