使用 Xpath 提取值时 Scrapy 中的空列表 [英] Empty List From Scrapy When Using Xpath to Extract Values

查看:62
本文介绍了使用 Xpath 提取值时 Scrapy 中的空列表的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

真的需要这个社区的帮助.

我的问题是当我在python中使用代码时

response.xpath("//div[contains(@class,'check-prices-widget-not-sponsor')]/a/div[contains(@class,'check-prices-widget-非赞助链接')]").extract()

在scrapy shell中提取供应商名称,输出为空.我真的不知道为什么会这样,在我看来问题可能是网站信息正在动态更新?

此网页抓取的网址为:

如您所见,您需要的信息并未显示.然而,这个 https://cruiseline.com/destination/caribbean/cruise/best?sort=rank,ship_status&&direction=desc&page=1&per_page=10&sailing_counts=0 是静态,所以这就是为什么你可以抓取你需要的东西.

我给你提供了两种抓取动态网站的方法(当然还有更多):

1.Splash(官方文档):在您的 Spider 中,使用 SplashRequest 而不是 scrapy.Request 生成您的网址.

2.Selenium + PhantomJS(官方文档)

祝你刮刮乐!:)

Really need the help from this community.

My question is that when I used the code in python

response.xpath("//div[contains(@class,'check-prices-widget-not-sponsored')]/a/div[contains(@class,'check-prices-widget-not-sponsored-link')]").extract()

to extract the vendor name in scrapy shell, the output is empty. I really did not know why that happened, and it seems to me that the problem might be the website info is updating dynamically?

The url for this web scraping is: https://cruiseline.com/cruise/7-night-bahamas-florida-new-york-roundtrip-32860, and what I need is the Vendor name and Price for each vendor. Besides the attached pic is the screenshot of "the inspect". enter image description here

However, the similar code works to extract price in the following page url ('https://cruiseline.com/destination/caribbean/cruise/best?sort=rank,ship_status&&direction=desc&page=1&per_page=10&sailing_counts=0')

Prices = response.xpath(
        "//div[contains(@class,'featured-cruise-price-inner-price')]/span/descendant::text()").extract()

Really appreciate the help!

解决方案

I tried this url in scrapy shell:https://cruiseline.com/cruise/7-night-bahamas-florida-new-york-roundtrip-32860, and i also got nothing with

response.xpath("//div[contains(@class,'check-prices-widget-not-sponsored')]/a/div[contains(@class,'check-prices-widget-not-sponsored-link')]").extract()

Then I used view(response) command to figure out what the spider sees, and found out that the site is dynamic, which means if you want to scrape info on that website, you need to execute the js codes that show the info.

Here are the screenshots:

As you can see, the info you need doesn't show. However, this one https://cruiseline.com/destination/caribbean/cruise/best?sort=rank,ship_status&&direction=desc&page=1&per_page=10&sailing_counts=0 is static, so that's why you can scrape what you need.

I got two ways for you to scrape dynamic website(of course, there are more):

1.Splash(Official Doc): In your Spider, yield your url with SplashRequest instead of scrapy.Request.

2.Selenium + PhantomJS(Official Doc)

Good luck with your scraping! :)

这篇关于使用 Xpath 提取值时 Scrapy 中的空列表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆