Gathering Formatted Content From Multiple Webpages
Question
I'm doing a research project, and need the contents of a show's transcript for the data. The problem is, the transcripts are formatted for the particular wiki (Arrested Development wiki), whereas I need them to be machine readable.
What's the best way to go about downloading all of these transcripts and reformatting them? Is Python's HTMLParser my best bet?
Answer
I wrote a Python script that takes the link of a wiki transcript as input and writes a plaintext version of that transcript to a text file. I hope this helps with your project.
# Python 2 script: depends on pycurl (and the Python 2-only cStringIO/raw_input).
from pycurl import *
import cStringIO
import re

link = raw_input("Link to transcript: ")
filename = link.split("/")[-1] + ".txt"

# Download the page into an in-memory buffer.
buf = cStringIO.StringIO()
c = Curl()
c.setopt(c.URL, link)
c.setopt(c.WRITEFUNCTION, buf.write)
c.perform()
html = buf.getvalue()
buf.close()

Speaker = ""
# On the wiki, each speaker's name is bolded and ends with ':</b>'.
SpeakerPositions = [m.start() for m in re.finditer(':</b>', html)]

file = open(filename, 'w')
for x in range(0, len(SpeakerPositions)):
    # Skip matches that are immediately followed by another tag.
    if html[SpeakerPositions[x] + 5] != "<":

        # Scan backwards from the match to the preceding '>'
        # to collect the speaker's name (in reverse).
        searchpos = SpeakerPositions[x] - 1
        char = ""
        while char != ">":
            char = html[searchpos]
            searchpos = searchpos - 1
            if char != ">":
                Speaker += char
        Speaker = Speaker[::-1]
        Speaker += ": "

        # Scan forwards from just past '</b>' to the next '<'
        # to collect the spoken line.
        searchpos = SpeakerPositions[x] + 5
        char = ""
        while char != "<":
            char = html[searchpos]
            searchpos = searchpos + 1
            if char != "<":
                Speaker += char

        # Normalise non-breaking space entities; the original post's
        # replacement string appears to have been mangled in transit.
        Speaker = Speaker.replace("&nbsp;", " ")
        file.write(Speaker + "\n")
        Speaker = ""
file.close()
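The script above is Python 2 only (cStringIO, raw_input, pycurl). To address the question about HTMLParser directly: the standard library's html.parser can do the tag-stripping without manual string scanning. The sketch below is a minimal Python 3 illustration, not the answerer's method; the sample string and the TranscriptParser class are hypothetical stand-ins for the wiki's `<b>Speaker:</b> line` markup.

```python
from html.parser import HTMLParser

class TranscriptParser(HTMLParser):
    """Collects the plain text between tags, so that
    '<b>Michael:</b> hello' comes out as 'Michael: hello'."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        text = data.strip()
        if text:
            self.parts.append(text)

    def text(self):
        return " ".join(self.parts)

# Hypothetical snippet in the same shape as the wiki transcript markup.
sample = "<p><b>Michael:</b> I don't know what I expected.</p>"
parser = TranscriptParser()
parser.feed(sample)
print(parser.text())  # Michael: I don't know what I expected.
```

For a real transcript page you would fetch the HTML first (for example with urllib.request) and feed the whole document to the parser, then filter the collected lines down to the speaker/dialogue pairs you need.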