Parsing very large HTML file with Python (ElementTree?)


Problem description


I asked about using BeautifulSoup to parse a very large (270MB) HTML file, got a memory error, and was pointed toward ElementTree as a solution.

I was trying to use their event-driven parsing, documented here. Testing it with the smaller settings file worked fine:

>>> import xml.etree.ElementTree as ET
>>> settings = open('S:\\Documents\\FacebookData\\html\\settings.htm')
>>> for event, element in ET.iterparse(settings, events=("start", "end")):
...     print("%5s, %4s, %s" % (event, element.tag, element.text))

That successfully prints out the elements. However, running the same code on 'messages.htm' instead of 'settings.htm', just to see whether it works before beginning the actual coding, gives this result:

Traceback (most recent call last):
  File "<pyshell#16>", line 1, in <module>
    for event, element in ET.iterparse(source, events=("start", "end")):
  File "C:\Program Files (x86)\Python\lib\xml\etree\ElementTree.py", line 1294, in __next__
    for event in self._parser.read_events():
  File "C:\Program Files (x86)\Python\lib\xml\etree\ElementTree.py", line 1277, in read_events
    raise event
  File "C:\Program Files (x86)\Python\lib\xml\etree\ElementTree.py", line 1235, in feed
    self._parser.feed(data)
  File "<string>", line None
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 6

I'm wondering if this is because ET is just better suited to parsing XML documents? If this is the case, and there's no workaround, then I'm back to square one. Any suggestions on how to parse this file, along with how to debug along the way would be greatly appreciated!

Solution

A good solution for parsing HTML or XML is lxml and xpath.
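On the question of whether ET is only suited to XML: ElementTree's parser expects well-formed XML, and ordinary HTML often is not (unclosed tags such as `<br>` are common). The sketch below illustrates that with two made-up strings; `parses_as_xml`, `well_formed`, and `typical_html` are hypothetical names chosen for this example, and the actual cause of the error in 'messages.htm' may be different.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(markup):
    """Return True if ElementTree's XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Every tag is closed, so this is valid XML as well as HTML.
well_formed = "<html><body><p>hello</p></body></html>"

# <br> and <p> are never closed: fine for browsers, an error for an XML parser.
typical_html = "<html><body><p>hello<br>world</body></html>"

print(parses_as_xml(well_formed))
print(parses_as_xml(typical_html))
```

The first string is accepted and the second is rejected, which is why a parser built for HTML is the safer choice for real-world pages.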

To use xpath:

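from lxml import etree

# Read the whole file into memory and parse it with lxml's lenient HTML parser
with open('result.html', 'r') as f:
    data = f.read()
doc = etree.HTML(data)

# Print the text of every cell in each matching table row
for tr in doc.xpath('//table/tr[@class="trmenu1"]'):
    print(tr.xpath('./td/text()'))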
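Note that the snippet above reads the entire file into memory and builds a full tree, which may still be heavy for a 270MB file. One possible alternative, swapped in here for illustration and not part of the original answer, is the standard library's event-driven `html.parser.HTMLParser`, fed in chunks so the whole document is never held at once. This is a rough sketch under assumptions: `RowTextCollector`, `collect_rows`, and the chunk size are hypothetical, and the `trmenu1` class is borrowed from the xpath example.

```python
import io
from html.parser import HTMLParser


class RowTextCollector(HTMLParser):
    """Collect cell text from <tr class="trmenu1"> rows as events arrive."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr" and dict(attrs).get("class") == "trmenu1":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def collect_rows(stream, chunk_size=64 * 1024):
    """Feed a text stream to the parser one chunk at a time."""
    parser = RowTextCollector()
    for chunk in iter(lambda: stream.read(chunk_size), ""):
        parser.feed(chunk)
    parser.close()
    return parser.rows


# A tiny in-memory stand-in for a large file opened in text mode.
sample = io.StringIO(
    '<table><tr class="trmenu1"><td>a</td><td>b</td></tr>'
    '<tr><td>skip</td></tr></table>'
)
print(collect_rows(sample))
```

With a real file, `collect_rows(open('result.html', 'r'))` would be the equivalent call. The sketch still accumulates every matching row in `rows`; for very large inputs it would probably be better to process each row as soon as its closing `</tr>` is seen.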
