Empty list with scrapy and Xpath

Problem description

I'm starting to use Scrapy and XPath to scrape some pages. I'm just trying simple things in IPython, and I get a response on some pages, such as IMDB, but when I try others, such as www.bbb.org, I always get an empty list. This is what I'm doing:

scrapy shell 'http://www.bbb.org/central-western-massachusetts/business-reviews/auto-repair-and-service/toms-automotive-in-fitchburg-ma-211787'

BBB Accreditation

A BBB Accredited Business since 02/12/2010

BBB has determined that Tom's Automotive meets BBB accreditation standards, which include a commitment to......"

The XPath of this paragraph is:

'//*[@id="business-accreditation-content"]/p[2]'

So I use:

data = response.xpath('//*[@id="business-accreditation-content"]/p[2]').extract()

But data is an empty list. I'm getting the XPath from Chrome, and it works on other pages, but here I get nothing regardless of which part of the page I try.

Solution

The website actually checks for the User-Agent header.

See what it returns if you don't specify it:

$ scrapy shell 'http://www.bbb.org/central-western-massachusetts/business-reviews/auto-repair-and-service/toms-automotive-in-fitchburg-ma-211787'
In [1]: print(response.body)
Out[1]: 123

In [2]: response.xpath('//*[@id="business-accreditation-content"]/p[2]').extract()
Out[2]: []

Yes, that's right: the response body contains only 123 when the request arrives with an unexpected User-Agent.

Now with the header (note the specified -s command-line argument):

$ scrapy shell 'http://www.bbb.org/central-western-massachusetts/business-reviews/auto-repair-and-service/toms-automotive-in-fitchburg-ma-211787' -s USER_AGENT='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36'
In [1]: response.xpath('//*[@id="business-accreditation-content"]/p[2]').extract()
Out[1]: [u'<p itemprop="description">BBB has determined that Tom\'s Automotive meets <a href="http://www.bbb.org/central-western-massachusetts/for-businesses/about-bbb-accreditation/bbb-code-of-business-practices-bbb-accreditation-standards/" lang="LS30TPCERNY5b60c87311af50cf82720b237d8ef866">BBB accreditation standards</a>, which include a commitment to make a good faith effort to resolve any consumer complaints. BBB Accredited Businesses pay a fee for accreditation review/monitoring and for support of BBB services to the public.</p>']
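
To confirm that a site really varies its response by User-Agent, it can help to reproduce the comparison outside Scrapy. The following is a minimal sketch using only the Python standard library. The names `build_request`, `fetch_body`, and `BROWSER_UA` are hypothetical helpers introduced here, not part of Scrapy or of the original answer, and the site may no longer behave as it did when the answer was written.

```python
import urllib.request

# Assumption: the same browser string used in the shell example above.
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/46.0.2490.80 Safari/537.36"
)


def build_request(url, user_agent=None):
    """Build a request, attaching a User-Agent header only when one is given."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    return urllib.request.Request(url, headers=headers)


def fetch_body(url, user_agent=None, timeout=10):
    """Download the page and return the raw response body as bytes."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Calling `fetch_body(url)` and `fetch_body(url, BROWSER_UA)` for the same URL and comparing the lengths of the two results shows whether the server is inspecting the header. When no User-Agent is supplied, urllib sends its own default "Python-urllib/x.y" value, which plays the role of the unexpected agent here.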

This was an example from the shell. In a real Scrapy project, you would need to set the USER_AGENT project setting. Alternatively, you can rotate user agents with the help of this middleware: scrapy-fake-useragent.
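
In a project, the USER_AGENT setting would typically live in the project's settings.py. A minimal sketch, assuming the same browser string as in the shell example (any current browser User-Agent could be substituted):

```python
# settings.py (fragment)
# Assumption: reuses the browser string from the shell example; replace as needed.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/46.0.2490.80 Safari/537.36"
)
```

If only one spider needs the header, the same key can instead go in that spider's `custom_settings` class attribute, which overrides the project-wide value for that spider.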
