How to make XPath select multiple table elements with identical id attributes?

Problem description

I'm currently trying to extract information from a badly formatted web page. Specifically, the page has used the same id attribute for multiple table elements. The markup is equivalent to something like this:

<body>
    <div id="random_div">
        <p>Some content.</p>
        <table id="table_1">
            <tr>
                <td>Important text 1.</td>
            </tr>
        </table>
        <h4>Some heading in between</h4>
        <table id="table_1">
            <tr>
                <td>Important text 2.</td>
                <td>Important text 3.</td>
            </tr>
        </table>
        <p>How about some more text here.</p>
        <table id="table_1">
            <tr>
                <td>Important text 4.</td>
                <td>Important text 5.</td>
            </tr>
        </table>
    </div>
</body>

Clearly this is incorrectly formatted HTML, due to the multiple use of the same id for an element.

I'm using XPath to try and extract all the text in the various table elements, utilising the language through the Scrapy framework.

My call looks something like this:

hxs.select('//div[contains(@id, "random_div")]//table[@id="table_1"]//text()').extract()

Thus the XPath expression is: //div[contains(@id, "random_id")]//table[@id="table_1"]//text()

This returns: [u'Important text 1.'], i.e., the contents of the first table that matches the id value "table_1". It seems to me that once it has come across an element with a certain id it ignores any future occurrences in the markup. Can anyone confirm this?

Update

Thanks for the fast responses below. I have tested my code on a page hosted locally, which has the same test format as above and the correct response is returned, i.e.,

`[u'Important text 1.', u'Important text 2.', . . . . ,u'Important text 5.']`

There is therefore nothing wrong with either the XPath expression or the Python calls I'm making.
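As a rough cross-check that duplicate id values do not by themselves stop an XPath-style query from matching every table, here is a minimal sketch using Python's standard-library `xml.etree.ElementTree` rather than Scrapy or lxml. That is a swapped-in tool, chosen only because it needs no installation; it supports a limited XPath subset, so the `td` cells are selected and their `.text` read instead of using `//text()`. The `MARKUP` constant is the sample from the question, slightly condensed.

```python
import xml.etree.ElementTree as ET

# The simplified markup from the question (well-formed, so an XML parser accepts it).
MARKUP = """
<body>
    <div id="random_div">
        <p>Some content.</p>
        <table id="table_1">
            <tr><td>Important text 1.</td></tr>
        </table>
        <h4>Some heading in between</h4>
        <table id="table_1">
            <tr><td>Important text 2.</td><td>Important text 3.</td></tr>
        </table>
        <p>How about some more text here.</p>
        <table id="table_1">
            <tr><td>Important text 4.</td><td>Important text 5.</td></tr>
        </table>
    </div>
</body>
"""

root = ET.fromstring(MARKUP)

# Every table with id="table_1" is matched, not only the first one.
cells = root.findall(".//div[@id='random_div']//table[@id='table_1']//td")
texts = [cell.text for cell in cells]
print(texts)
```

All five cell texts come back in document order, which is consistent with the local test described above.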

I guess this means that there is a problem on the webpage itself which is either screwing up XPath or the html parser, which is libxml2.

Does anyone have any advice as to how I can dig into this a bit more?

Update 2

I have successfully isolated the problem. It is actually with the underlying parsing library, which is lxml (which provides Python bindings for the libxml2 C library).

The problem is that the parser is unable to deal with vertical tabs. I have no idea who coded up the site I am dealing with, but it is full of vertical tabs. Web browsers seem to be able to ignore these, which is why running the XPath queries from Firebug on the site in question, for example, is successful.

Further, because the above simplified example doesn't contain vertical tabs, it works fine. For anyone who comes across this issue in Scrapy (or in Python generally), the following fix worked for me, to remove vertical tabs from the HTML responses:

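from scrapy.selector import HtmlXPathSelector

def parse_item(self, response):
    # Remove all vertical tabs from the HTML response before parsing.
    # response.replace() returns a copy with the new body; newer Scrapy
    # versions do not allow assigning to response.body directly.
    response = response.replace(body=response.body.replace("\v", ""))
    hxs = HtmlXPathSelector(response)
    items = hxs.select('//div[contains(@id, "random_div")]'
                       '//table[@id="table_1"]//text()').extract()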
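The same clean-up can be sketched outside Scrapy as a small standalone helper. This is only an illustration under assumptions: the name `strip_control_chars` is hypothetical, it works on an already-decoded string, and it removes every ASCII control character except tab, newline and carriage return, which is broader than the vertical-tab-only fix above.

```python
# Map every ASCII control character except tab (0x09), newline (0x0A)
# and carriage return (0x0D) to None, which str.translate treats as "delete".
_CONTROL_CHARS = {
    code: None for code in range(0x20) if code not in (0x09, 0x0A, 0x0D)
}


def strip_control_chars(html):
    """Return html with stray control characters (including \\v) removed."""
    return html.translate(_CONTROL_CHARS)


# A cell polluted with a vertical tab (0x0B), as on the problem site.
dirty = "<td>Important\x0b text 1.</td>\n"
cleaned = strip_control_chars(dirty)
print(repr(cleaned))
```

Ordinary whitespace such as the trailing newline is left alone, so the layout of the markup is unchanged and only the characters that upset the parser are dropped.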

Recommended answer

With Firebug, this expression:

//table[@id='table_1']//td/text()

gives me this:

[<TextNode textContent="Important text 1.">,
 <TextNode textContent="Important text 2.">,
 <TextNode textContent="Important text 3.">,
 <TextNode textContent="Important text 4.">,
 <TextNode textContent="Important text 5.">]

I included the td filtering to give a nicer result, since otherwise, you would get the whitespace and newlines between the tags. But all in all, it seems to work.
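To illustrate the whitespace point, here is a small sketch with the standard-library `xml.etree.ElementTree` (an assumption made for convenience, not the Firebug or Scrapy call). `itertext()` plays roughly the role of `//text()` and walks every text fragment under the table, while iterating over `td` elements corresponds to the narrower `//td/text()` query.

```python
import xml.etree.ElementTree as ET

# One of the tables from the question, with its original indentation.
table = ET.fromstring("""<table id="table_1">
    <tr>
        <td>Important text 2.</td>
        <td>Important text 3.</td>
    </tr>
</table>""")

# Every text fragment under the table, including indentation between tags.
all_text = list(table.itertext())

# Only the text directly inside td cells.
cell_text = [td.text for td in table.iter("td")]

# Fragments that consist of nothing but whitespace.
whitespace_only = [t for t in all_text if not t.strip()]

print(len(all_text), len(cell_text), len(whitespace_only))
```

The unfiltered list is longer than the cell-only list, and the extra entries are purely the newlines and indentation between the tags.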

What I noticed was that you query for //div[contains(@id, "random_id")], while your HTML snippet has a tag that reads <div id="random_div"> -- the _id and _div being different. I don't know Scrapy so I can't really say if that does something, but couldn't that be your issue as well?
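XPath's `contains()` is a plain substring test, so the mismatch can be checked with Python's `in` operator as a stand-in. This is only a model of the predicate, not an XPath evaluation; the variable names are made up for the illustration.

```python
# The id actually present in the HTML snippet.
div_id = "random_div"

# Models contains(@id, "random_id"), the predicate quoted in the question text.
matches_quoted_query = "random_id" in div_id

# Models contains(@id, "random_div"), the predicate used in the Python call.
matches_python_call = "random_div" in div_id

print(matches_quoted_query, matches_python_call)
```

The predicate as quoted in the prose would match no `div` at all, whereas the one in the Python call does match, so the discrepancy looks like a typo in the write-up rather than the cause of the partial result.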
