Gathering Formatted Content From Multiple Webpages


Question

I'm doing a research project, and need the contents of a show's transcript for the data. The problem is, the transcripts are formatted for the particular wiki (Arrested Development wiki), whereas I need them to be machine readable.

What's the best way to go about downloading all of these transcripts and reformatting them? Is Python's HTMLParser my best bet?

Answer

I wrote a script in Python that takes the link of the wiki transcript as input and writes a plaintext version of the transcript to a text file. I hope this helps with your project.

import re
import urllib.request

link = input("Link to transcript: ")
filename = link.split("/")[-1] + ".txt"

# Fetch the page (the original answer used pycurl; urllib from the
# standard library does the same job here)
with urllib.request.urlopen(link) as response:
    html = response.read().decode("utf-8", errors="replace")

# Speaker names on the wiki are wrapped in <b>...</b> and end with a
# colon, so every occurrence of ':</b>' marks the end of a speaker tag
speaker_positions = [m.start() for m in re.finditer(':</b>', html)]

with open(filename, 'w') as out:
    for pos in speaker_positions:
        # Skip matches immediately followed by another tag (no dialogue)
        if html[pos + 5] == "<":
            continue

        # Walk backwards from the colon to the '>' that opens the <b>
        # tag, collecting the speaker's name in reverse
        searchpos = pos - 1
        line = ""
        while html[searchpos] != ">":
            line += html[searchpos]
            searchpos -= 1
        line = line[::-1] + ": "

        # Walk forwards from the end of ':</b>' to the next tag,
        # collecting the spoken dialogue
        searchpos = pos + 5
        while html[searchpos] != "<":
            line += html[searchpos]
            searchpos += 1

        # Strip non-breaking space entities and write the line out
        line = line.replace("&#160;", "")
        out.write(line + "\n")
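Since the question asks whether Python's HTMLParser is a good fit: it is a reasonable alternative to the index arithmetic above. A minimal sketch (the class name and the sample HTML snippet are illustrative, not from the wiki itself) that subclasses the standard library's html.parser.HTMLParser to pull plain text out of markup:

```python
from html.parser import HTMLParser

class TranscriptParser(HTMLParser):
    """Collects the text content of a page, skipping script/style blocks."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True decodes entities like &#160;
        self.in_skip = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.in_skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.in_skip = False

    def handle_data(self, data):
        # Called for every run of text between tags
        if not self.in_skip:
            self.parts.append(data)

    def text(self):
        return "".join(self.parts)

parser = TranscriptParser()
parser.feed("<p><b>Michael:</b> I don't understand the question.</p>")
print(parser.text())  # prints "Michael: I don't understand the question."
```

Unlike scanning for ':</b>' by hand, the parser copes with nested tags and decodes character references such as &#160; automatically, at the cost of needing extra logic if you only want the speaker lines rather than all page text.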
