Iteratively parsing HTML (with lxml?)
Question
I'm currently trying to iteratively parse a very large HTML document (I know.. yuck) to reduce the amount of memory used. The problem I'm having is that I'm getting XML syntax errors such as:
lxml.etree.XMLSyntaxError: Attribute name redefined, line 134, column 59
which causes everything to stop.
Is there a way to iteratively parse HTML without choking on syntax errors?
At the moment I'm extracting the line number from the XML syntax error exception, removing that line from the document, and then restarting the process. Seems like a pretty disgusting solution. Is there a better way?
This is what I'm currently doing:
import re
from lxml import etree

context = etree.iterparse(tfile, events=('start', 'end'), html=True)
in_table = False
header_row = True
while context:
    try:
        event, el = next(context)
        # do something
        # remove old elements
        while el.getprevious() is not None:
            del el.getparent()[0]
    except StopIteration:
        break  # end of document
    except etree.XMLSyntaxError as e:
        print(e.msg)
        # pull the offending line number out of the error message
        lineno = int(re.search(r'line (\d+),', e.msg).group(1))
        remove_line(tfilename, lineno)
        # restart parsing from the top of the edited file
        tfile = open(tfilename)
        context = etree.iterparse(tfile, events=('start', 'end'), html=True)
    except KeyError:
        print('oops keyerror')
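The question never shows `remove_line`, so here is a minimal sketch of what such a helper might look like, together with the line-number extraction. Both `extract_lineno` and this `remove_line` body are assumptions written for illustration, not the asker's actual code. It also makes the cost of the workaround visible: every syntax error means rewriting the whole file and reparsing from the first line.

```python
import os
import re
import tempfile


def extract_lineno(message):
    """Return the 1-based line number named in an error message, or None."""
    match = re.search(r'line (\d+),', message)
    return int(match.group(1)) if match else None


def remove_line(filename, lineno):
    """Rewrite `filename` without its 1-based line `lineno` (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(filename))
    # Copy in binary mode so the remaining bytes are left untouched.
    with open(filename, 'rb') as src, \
            tempfile.NamedTemporaryFile('wb', dir=directory, delete=False) as dst:
        for number, line in enumerate(src, start=1):
            if number != lineno:
                dst.write(line)
    # Swap the filtered copy into place once both files are closed.
    os.replace(dst.name, filename)
```

Note that deleting a line shifts every later line up by one, so a line number is only meaningful against the file as it stood when that error was raised.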
Recommended answer
The perfect solution ended up being Python's very own HTMLParser [docs].
This is the (pretty bad) code I ended up using:
from html.parser import HTMLParser


class MyParser(HTMLParser):
    def __init__(self):
        self.finished = False
        self.in_table = False
        self.in_row = False
        self.in_cell = False
        self.current_row = []
        self.current_cell = ''
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if not self.in_table:
            # ignore everything until the table with id="dgResult" opens
            if tag == 'table':
                if ('id' in attrs) and (attrs['id'] == 'dgResult'):
                    self.in_table = True
        else:
            if tag == 'tr':
                self.in_row = True
            elif tag == 'td':
                self.in_cell = True
            elif (tag == 'a') and (len(self.current_row) == 7):
                # the eighth cell holds a link: keep its URL, not its text
                url = attrs['href']
                self.current_cell = url

    def handle_endtag(self, tag):
        if tag == 'tr':
            if self.in_table:
                if self.in_row:
                    self.in_row = False
                    print(self.current_row)
                    self.current_row = []
        elif tag == 'td':
            if self.in_table:
                if self.in_cell:
                    self.in_cell = False
                    self.current_row.append(self.current_cell.strip())
                    self.current_cell = ''
        elif (tag == 'table') and self.in_table:
            self.finished = True

    def handle_data(self, data):
        # skip the text of the eighth cell; its URL was stored in handle_starttag
        if not len(self.current_row) == 7:
            if self.in_cell:
                self.current_cell += data
With that code I could then do this:
parser = MyParser()
for line in myfile:
    parser.feed(line)
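For anyone who wants to try the approach without the original document, here is a small self-contained sketch of the same idea. `RowCollector`, `parse_in_chunks`, `chunk_size` and the sample markup are all made up for illustration and are not part of the code above. It feeds fixed-size chunks instead of lines, which may help when a huge HTML file has very few line breaks, and the sample repeats an attribute on one tag, the kind of markup behind the "Attribute name redefined" error in the question.

```python
import io
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <td>, grouped into one list per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, None outside <tr>
        self._cell = None  # text pieces of the cell being read, None outside <td>

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by malformed markup

    def handle_data(self, data):
        # Text can arrive in several pieces when a chunk boundary splits it.
        if self._cell is not None:
            self._cell.append(data)


def parse_in_chunks(fileobj, chunk_size=64 * 1024):
    """Feed a text-mode file object to the parser one chunk at a time."""
    parser = RowCollector()
    for chunk in iter(lambda: fileobj.read(chunk_size), ''):
        parser.feed(chunk)  # incomplete tags are buffered until the next feed
    parser.close()
    return parser.rows


sample = io.StringIO(
    '<table id="dgResult">'
    '<tr><td class="a" class="b">one</td><td>two</td></tr>'
    '<tr><td>three</td><td>four</td></tr>'
    '</table>'
)
rows = parse_in_chunks(sample, chunk_size=16)
print(rows)  # [['one', 'two'], ['three', 'four']]
```

Only one chunk plus the current row is held in memory at a time, although this sketch does keep every finished row in `rows`. For a really large table you would probably process each row as it completes, the way the answer's code prints it.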