Scrapy + Splash: scraping an element inside inner HTML


Question

I'm using Scrapy + Splash to crawl webpages and extract data from Google ad banners and other ads, and I'm having difficulty getting Scrapy to follow the XPath into them.

Splash makes sure the code is rendered, so I don't run into the usual problem Scrapy has with scripts, where it reads the script's content instead of its resulting HTML. But I can't seem to find a way to write the XPath needed to reach the element nodes I need (the ad's href link).

If I inspect the element in Google and copy its XPath, it simply gives me //*[@id="aw0"], which I feel would work if the iframe's HTML were all there was here. But it returns empty no matter how I write it, and I feel it's probably because XPath doesn't elegantly handle HTML documents stacked within HTML documents.

The XPath to the iframe that contains the Google ad code is //*[@id="google_ads_iframe_/87824813/hola/blogs/home_0"] (the numbers are constant).

Is there a way to stack these XPaths together so that Scrapy follows the trail into the container I need? Or should I be parsing the Splash response object directly in some other way, rather than relying on response.xpath / response.css for this?

Recommended answer

The problem is that iframe content is not returned as part of the page's HTML. You can either try to fetch the iframe content directly (by its src), or use the render.json endpoint with the iframes=1 option:

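import parsel
from scrapy_splash import SplashRequest

# ... inside a spider callback:
    # render.json with iframes=1 also returns the rendered HTML of child frames
    yield SplashRequest(url, self.parse_result, endpoint='render.json',
                        args={'html': 1, 'iframes': 1})

def parse_result(self, response):
    # response.data is the decoded JSON; each child frame carries its own 'html'
    iframe_html = response.data['childFrames'][0]['html']
    sel = parsel.Selector(text=iframe_html)
    item = {
        'my_field': sel.xpath(...),
        # ...
    }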
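
The snippet above reads only the first child frame. On a real page the ad may sit in a different or more deeply nested frame, so it can help to walk the whole frame tree. The following is a minimal sketch, assuming every frame dictionary in the render.json output carries 'html' and 'childFrames' keys as in the snippet above. The names iter_frames and frames_containing and the sample data are hypothetical, made up for illustration.

```python
def iter_frames(frame):
    """Yield every frame nested below `frame`, depth-first."""
    for child in frame.get("childFrames", []):
        yield child
        yield from iter_frames(child)


def frames_containing(data, marker):
    """Return the nested frames whose HTML contains the literal `marker`."""
    return [f for f in iter_frames(data) if marker in f.get("html", "")]


# Hypothetical stand-in for response.data (not real Splash output).
sample = {
    "html": "<html><body><iframe id='outer'></iframe></body></html>",
    "childFrames": [
        {
            "html": "<html><body><iframe></iframe></body></html>",
            "childFrames": [
                {
                    "html": '<html><body><a id="aw0" '
                            'href="https://example.com/ad">Ad</a></body></html>',
                    "childFrames": [],
                },
            ],
        },
        {"html": "<html><body>no ad here</body></html>", "childFrames": []},
    ],
}

matches = frames_containing(sample, 'id="aw0"')
print(len(matches))
```

Each matching frame's 'html' string can then be passed to parsel.Selector(text=...) and queried with an XPath such as //*[@id="aw0"]/@href, the element id mentioned in the question. The substring check is only a cheap pre-filter and depends on the exact quoting in the rendered HTML, so it may need loosening for real pages.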

As of Splash 2.3.3, the /execute endpoint doesn't support fetching iframe content.
