Need python lxml syntax help for parsing html


Problem description

I am brand new to python, and I need some help with the syntax for finding and iterating through html tags using lxml. Here are the use-cases I am dealing with:

HTML file is fairly well formed (but not perfect). Has multiple tables on screen, one containing a set of search results, and one each for a header and footer. Each result row contains a link for the search result detail.

  1. I need to find the middle table with the search result rows (this one I was able to figure out):

    self.mySearchTables = self.mySearchTree.findall(".//table")
    self.myResultRows = self.mySearchTables[1].findall(".//tr")

  2. I need to find the links contained in this table (this is where I'm getting stuck):

        for searchRow in self.myResultRows:
            searchLink = searchRow.findall(".//a")
    

    It doesn't seem to actually locate the link elements.

    I need the plain text of the link. I imagine it would be something like searchLink.text if I actually got the link elements in the first place.
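
    As a minimal self-contained sketch of these two steps (the search-result markup here is invented for illustration), finding the rows, collecting their links, and reading each link's text looks like this:

    ```python
    from lxml.html import fromstring

    # Hypothetical search-results markup, invented for illustration;
    # the real page has header and footer tables around this one.
    html = """
    <table>
      <tr><td><a href="/detail/1">First result</a></td></tr>
      <tr><td><a href="/detail/2">Second result</a></td></tr>
    </table>
    """

    tree = fromstring(html)
    rows = tree.findall(".//tr")        # every row of the table
    links = [a for row in rows for a in row.findall(".//a")]

    # .text is the text directly inside each <a> element
    print([a.text for a in links])      # ['First result', 'Second result']
    ```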

    Finally, in the actual API reference for lxml, I wasn't able to find information on the find and findall calls. I gleaned these from bits of code I found on Google. Am I missing something about how to effectively find and iterate over HTML tags using lxml?

Recommended answer

Okay, first, regarding parsing the HTML: if you follow the recommendation of zweiterlinde and S.Lott, at least use the version of BeautifulSoup included with lxml. That way you will also reap the benefit of a nice XPath or CSS selector interface.

However, I personally prefer Ian Bicking's HTML parser included in lxml.

Secondly, .find() and .findall() come from lxml's effort to be compatible with ElementTree, and those two methods are described under XPath Support in the ElementTree documentation.
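
That ElementTree-compatible subset already covers descendant searches and simple attribute predicates. A minimal sketch using the standard library's ElementTree (markup invented for illustration):

```python
import xml.etree.ElementTree as ET

# find()/findall() accept only the limited XPath subset documented
# for ElementTree, not full XPath 1.0.
xml = "<table><tr><td><a href='/detail/1'>hit</a></td></tr></table>"
root = ET.fromstring(xml)

links = root.findall(".//a")        # descendant search: supported
first = root.find(".//a")           # first match only
print(first.text, first.get("href"))

# Predicates like [@attr='value'] are supported, but XPath functions
# such as contains() or text() are not part of this subset.
print(len(root.findall(".//a[@href='/detail/1']")))
```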

Those two functions are fairly easy to use, but their XPath support is very limited. I recommend trying either the full lxml xpath() method or, if you are already familiar with CSS, the cssselect() method.

Here are some examples, with an HTML string parsed like this:

    from lxml.html import fromstring
    mySearchTree = fromstring(your_input_string)
    

Using the CSS selector method (note that in current lxml versions, cssselect() requires the separate cssselect package), your program would look roughly like this:

    # Find all 'a' elements inside 'tr' table rows with a CSS selector
    for a in mySearchTree.cssselect('tr a'):
        print('found "%s" link to href "%s"' % (a.text, a.get('href')))
    

The equivalent using the xpath() method would be:

    # Find all 'a' elements inside 'tr' table rows with XPath
    for a in mySearchTree.xpath('.//tr//a'):
        print('found "%s" link to href "%s"' % (a.text, a.get('href')))
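
One caveat about reading link text: .text returns only the text that appears before an element's first child, so a link whose visible text is wrapped in child tags needs lxml's text_content() instead. A minimal sketch with invented markup:

```python
from lxml.html import fromstring

# Invented markup: the link's visible text is split by a child <b> tag
doc = fromstring(
    '<table><tr><td><a href="/d"><b>Result</b> 42</a></td></tr></table>'
)
a = doc.xpath('.//a')[0]

print(a.text)            # None: no text before the first child element
print(a.text_content())  # 'Result 42': all text inside the element
```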
    
