Splitting a WARC file into chunks based on the header: WARC/1.0 Python

Views: 81 Published: 2021/4/30 20:00:38 Tags: python html dictionary file-processing warc

Problem description

I'm new to programming and am trying to process a WARC file by splitting it into chunks and then storing each chunk in a dictionary.

Each chunk should start with the WARC/1.0 header, and the chunks are separated by 3 empty lines. I would also like to remove the first 2 paragraphs:

WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2020-08-04T01:43:40Z
WARC-Record-ID: <urn:uuid:959ea654-33fd-466b-b1bf-f08aa8abe774>
Content-Length: 500
Content-Type: application/warc-fields
WARC-Filename: CC-MAIN-20200804014340-20200804044340-00045.warc.gz

isPartOf: CC-MAIN-2020-34
publisher: Common Crawl
description: Wide crawl of the web for August 2020
operator: Common Crawl Admin (info@commoncrawl.org)
hostname: ip-10-67-67-22.ec2.internal
software: Apache Nutch 1.17 (modified, https://github.com/commoncrawl/nutch/)
robots: checked via crawler-commons 1.2-SNAPSHOT (https://github.com/crawler-commons/crawler-commons)
format: WARC File Format 1.1
conformsTo: http://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/

# Keep everything from here on:

WARC/1.0
WARC-Type: request
WARC-Date: 2020-08-04T03:25:25Z
WARC-Record-ID: <urn:uuid:6c0b749a-4d02-4a77-ab93-9bc4ba094cdc>
Content-Length: 303
Content-Type: application/http; msgtype=request
WARC-Warcinfo-ID: <urn:uuid:959ea654-33fd-466b-b1bf-f08aa8abe774>
WARC-IP-Address: 104.254.66.40
WARC-Target-URI: http://00.auto.sohu.com/d/details?cityCode=450100&planId=1450&trimId=145372

I've tried using a generator to group the chunks, but it returns a single group (the whole file). Is there a simple way to separate these?

I can't import any libraries.

Any help would be much appreciated!
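
One possible approach, sketched without any imports: start a new chunk whenever a line is exactly `WARC/1.0`, then drop the leading `warcinfo` record (the "first 2 paragraphs" above) before filling the dictionary. The function names `split_warc_records` and `records_to_dict`, the integer dictionary keys, and the shortened `sample` text are all illustrative assumptions, not part of the original question. The sketch also assumes the file has already been read as uncompressed text.

```python
def split_warc_records(text, header="WARC/1.0"):
    """Split WARC text into records, each beginning at a header line."""
    records = []
    current = None  # lines of the record being built; None until the first header
    for line in text.splitlines():
        if line.strip() == header:
            if current is not None:
                records.append(current)
            current = [line]
        elif current is not None:
            current.append(line)
    if current is not None:
        records.append(current)
    # Join each record and strip the blank separator lines at its end.
    return ["\n".join(rec).strip() for rec in records]


def records_to_dict(records, skip_types=("warcinfo",)):
    """Store records in a dict keyed 0, 1, 2, ..., skipping unwanted WARC-Type values."""
    result = {}
    for record in records:
        warc_type = None
        for line in record.splitlines():
            if line == "":
                break  # end of the record's header block
            if line.startswith("WARC-Type:"):
                warc_type = line.split(":", 1)[1].strip()
                break
        if warc_type in skip_types:
            continue
        result[len(result)] = record
    return result


# Shortened, hypothetical sample modelled on the records in the question.
sample = (
    "WARC/1.0\n"
    "WARC-Type: warcinfo\n"
    "Content-Length: 500\n"
    "\n"
    "isPartOf: CC-MAIN-2020-34\n"
    "\n\n\n"
    "WARC/1.0\n"
    "WARC-Type: request\n"
    "WARC-Target-URI: http://00.auto.sohu.com/d/details?cityCode=450100&planId=1450&trimId=145372\n"
    "\n\n\n"
)

chunks = records_to_dict(split_warc_records(sample))
for key, chunk in chunks.items():
    print(key, chunk.splitlines()[1])
```

Splitting on the header line rather than on runs of blank lines avoids depending on the exact number of separator lines. The trade-off is that a payload containing a line that is exactly `WARC/1.0` would be split incorrectly; reading each record's `Content-Length` would be more robust, but it takes more code.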
