Parsing and extracting information from large HTML files with python and lxml


Question


I would like to parse large HTML files and extract information from them through XPath. To do that, I'm using python and lxml. However, lxml doesn't seem to work well with large files; it parses correctly only files whose size isn't larger than around 16 MB. The fragment of code that tries to extract information from the HTML through XPath is the following:

import lxml.html

tree = lxml.html.fragment_fromstring(htmlCode)
links = tree.xpath("//*[contains(@id, 'item')]/div/div[2]/p/text()")
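For reference, the approach itself is valid on small inputs; a minimal, self-contained sketch (with hypothetical markup shaped to match the XPath) runs fine:

```python
import lxml.html

# hypothetical small fragment whose structure matches the XPath below
htmlCode = '<div id="item1"><div><div>x</div><div><p>hello</p></div></div></div>'

tree = lxml.html.fragment_fromstring(htmlCode)
links = tree.xpath("//*[contains(@id, 'item')]/div/div[2]/p/text()")
print(links)  # ['hello']
```

The failure reported above only shows up once the input grows past roughly 16 MB.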


The variable htmlCode contains the HTML code read from a file. I also tried using the parse method to read the code from the file instead of getting it directly from a string, but that didn't work either. Since the file's contents are read successfully, I think the problem is related to lxml. I've been looking for other libraries to parse HTML and use XPath, but it looks like lxml is the main library used for that.


Is there another method/function of lxml that deals better with large HTML files?

Answer


If the file is very large, you can use iterparse with the html=True argument to parse the file without any validation. Since iterparse hands you elements one at a time instead of a full tree, you need to recreate the conditions of your XPath expression manually.

from lxml import etree
import unicodedata

# tag of the elements to process (adjust to the elements you need)
TAG = '{http://www.mediawiki.org/xml/export-0.8/}text'

def fast_iter(context, func, *args, **kwargs):
    # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    # Author: Liza Daly
    # modified to call func() only on the event and elem needed
    for event, elem in context:
        if event == 'end' and elem.tag == TAG:
            func(elem, *args, **kwargs)
        # free memory used by elements that are already processed
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

def process_element(elem, fout):
    global counter
    normalized = unicodedata.normalize('NFKD', str(elem.text)) \
            .encode('ASCII', 'ignore').lower()
    fout.write(normalized.replace(b'\n', b' ') + b'\n')
    if counter % 10000 == 0:
        print("Doc " + str(counter))
    counter += 1

def main():
    fin = open("large_file", 'rb')
    fout = open('output.txt', 'wb')
    context = etree.iterparse(fin, html=True)
    global counter
    counter = 0
    fast_iter(context, process_element, fout)

if __name__ == "__main__":
    main()
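Since iterparse yields individual elements, the original XPath //*[contains(@id, 'item')]/div/div[2]/p/text() has to be rebuilt as manual parent and sibling checks. A sketch of what that could look like (the markup and the matches() helper are hypothetical, chosen only to mirror that XPath):

```python
from io import BytesIO
from lxml import etree

# hypothetical markup: two blocks whose id contains 'item', one that doesn't
html = (b"<html><body>"
        b"<section id='item1'><div><div>x</div><div><p>first</p></div></div></section>"
        b"<section id='other'><div><div>x</div><div><p>skip</p></div></div></section>"
        b"<section id='item2'><div><div>x</div><div><p>second</p></div></div></section>"
        b"</body></html>")

def matches(p):
    # p must sit in the 2nd <div> child of a <div> whose own
    # parent has an id containing 'item'
    div2 = p.getparent()
    if div2 is None or div2.tag != 'div':
        return False
    outer = div2.getparent()
    if outer is None or outer.tag != 'div':
        return False
    divs = [c for c in outer if c.tag == 'div']
    if len(divs) < 2 or divs[1] is not div2:
        return False
    holder = outer.getparent()
    return holder is not None and 'item' in (holder.get('id') or '')

texts = []
for event, elem in etree.iterparse(BytesIO(html), html=True, events=('end',)):
    if elem.tag == 'p' and matches(elem):
        texts.append(elem.text)

print(texts)  # ['first', 'second']
```

The same parent-walking checks would go inside process_element() (or the iterparse loop) in the answer's code, in place of a TAG comparison.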

Source
