How do I determine the correct XPath?

Question

I am making a web scraper in Scrapy for wunderground.com, but many of the XPaths I choose return empty arrays. I found a different question on the same topic (here), which is actually why I switched my code to wunderground.com. However, the answer given there is specific to one exact element. How can I determine the correct XPaths for the other elements?

The code is as follows:

# -*- coding: utf-8 -*-
import scrapy

from wunderground_scraper.items import WundergroundScraperItem


class WundergroundComSpider(scrapy.Spider):
    name = "wunderground"
    allowed_domains = ["www.wunderground.com"]
    start_urls = (
        'http://www.wunderground.com/q/zmw:10001.5.99999',
    )

    def parse(self, response):
        # response.xpath() is a shortcut for Selector(response).xpath()
        info_set = response.xpath('//div[@id="current"]')
        items = []
        for i in info_set:
            item = WundergroundScraperItem()

            # WORKS FINE
            item['temperature'] = i.xpath('div/div/div/div/span/span/text()').extract()
            item['temperature'] = item['temperature'][0]

            # EDITED XPATH FROM OTHER QUESTION
            item['humidity'] = i.xpath('.//td[dfn="Humidity"]/following-sibling::td//text()').extract()
            item['humidity'] = item['humidity'][2]

            # RETURNS EMPTY ARRAY
            item['chance_rain'] = i.xpath('div/div/div/div/a/strong/text()').extract()

            items.append(item)
        return items

Recommended answer

Generally, the answer to "how do I determine the correct XPath expression" is going to be either "by inspection" (that is, look at the document you're trying to query) or "through trial and error" (start with general expressions and then narrow them down until you get what you want).
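As a rough illustration of the trial-and-error approach, the sketch below starts with a broad expression and narrows it step by step. The markup in SAMPLE is a hypothetical, simplified stand-in and not the real wunderground.com page. It uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, so the final step walks up to the row instead of using the following-sibling axis that Scrapy's selectors accept.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page structure differs.
SAMPLE = """
<div id="current">
  <table>
    <tr><td><dfn>Pressure</dfn></td><td><span>30.01</span> in</td></tr>
    <tr><td><dfn>Humidity</dfn></td><td><span>52</span>%</td></tr>
  </table>
</div>
"""

root = ET.fromstring(SAMPLE)

# Step 1: a deliberately broad expression, to confirm the cells exist at all.
all_cells = root.findall(".//td")

# Step 2: narrow to the label cell whose <dfn> child reads "Humidity".
label_cells = root.findall(".//td[dfn='Humidity']")

# Step 3: step up to the enclosing row and take the value cell
# that sits next to the label cell.
rows = root.findall(".//td[dfn='Humidity']/..")
value_cell = rows[0].findall("td")[1]
humidity = "".join(value_cell.itertext()).strip()

print(len(all_cells), len(label_cells), humidity)
```

The same narrowing can be done interactively against a live response in `scrapy shell <url>`, checking after each step that the result is not empty.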

In this case, you've run into a very common problem: the page you see in your browser is partially rendered locally using JavaScript. The element that contains the chance of precipitation is included as part of a <script> resource, which from the perspective of your XML parser (a) is simply an opaque blob of text and (b) doesn't even contain the information you're looking for, because it needs to be filled in by the script first. It's not until the page is rendered with JavaScript that the element is actually instantiated in the document.
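To make the "opaque blob of text" point concrete, here is a minimal sketch using the standard library's html.parser. The SAMPLE markup, the template syntax, and the TagCollector class are all hypothetical and only mimic the situation described. Markup inside the <script> element is never reported as elements, so no element-based query can reach it, and the text that is there holds only an unfilled placeholder.

```python
from html.parser import HTMLParser

# Hypothetical markup: the "chance of rain" element exists only as
# template text inside a <script> element.
SAMPLE = """
<div id="current">
  <span class="temp">71</span>
  <script type="text/template">
    <a href="#"><strong>{{ chance_of_rain }}%</strong></a>
  </script>
</div>
"""


class TagCollector(HTMLParser):
    """Records every start tag seen, plus the raw text inside <script>."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.script_text = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.script_text.append(data)


collector = TagCollector()
collector.feed(SAMPLE)
collector.close()

script_source = "".join(collector.script_text)

print(collector.tags)               # element tags the parser actually saw
print("strong" in collector.tags)   # the templated element is not among them
print(script_source.strip())        # only a placeholder, no real value
```

A quick check along the same lines is to compare the browser's "view source" output with the rendered DOM in the developer tools: if the element appears only in the latter, it was created by a script.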

It's not going to be possible to extract this data from the document source.
