XPath not working for screen scraping


Problem Description

I am using Scrapy for a screen scraping project and am having problems with an XPath.

I am trying to get the 94,218 from the page below, but the XPaths and CSS I have used are not working.

It's from this page: https://fancy.com/things/280558613/I%27m-Fine-T-Shirt

I have tried multiple XPaths and CSS with Scrapy but everything is returning blank.

Here are some examples:

response.xpath('/html/body/div[1]/div[1]/div[1]/aside/div[1]/div/div/a[2]/text()').extract()

response.xpath('//*[@id="sidebar"]/div[1]/div/div/a[2]/text()').extract()

response.xpath('//*[contains(concat(" ", @class, " "), concat(" ", "fancyd_list", " "))]').extract()

response.xpath(".//*[@id='sidebar']/div[1]/div/div/a[2]/text()")

I've tried Firebug, Firepath, Chrome Dev Tools and different plugins, but none of the XPaths or CSS seem to work. Can someone assist?

The code on the actual page is:

<a href="#" class="fancyd_list "/>
    6
</a>

Some of the XPaths work, but they contain no text, so it looks like this: <a href="#" class="fancyd_list "/></a>

I've also tried using BeautifulSoup, but it has the same problem:

print soup.find_all('a',class_='fancyd_list')
[<a class="fancyd_list " href="#"></a>, <a class="fancyd_list " href="#"></a>]

Thanks!

Solution

The problem here is that the provided URL is returning HTML with a malformed <a> tag in the following:

<a href="#" class="fancyd_list "/>  # Malformed HTML, <a> tag closes here
    94,218
</a>

The first line here contains a / before the closing bracket, which by HTML standards indicates the completion of the <a> tag. Since, as far as Scrapy is concerned, the <a> element is already finished, you can't fetch the text that falls outside the tag.
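
To make that concrete, here is a minimal reproduction sketch of my own (not part of the original answer) that feeds the malformed fragment straight into Scrapy's Selector. On the parser Scrapy uses, the slash should close the <a>, leaving the number as a text node after the element rather than inside it:

from scrapy.selector import Selector

# The malformed fragment quoted above, wrapped in a <div> for context.
html = '<div><a href="#" class="fancyd_list "/>\n    94,218\n</a></div>'
sel = Selector(text=html)

# No text *inside* the element, because the parser already closed it:
print(sel.xpath('//a[contains(@class, "fancyd_list")]/text()').extract())
# expected: []

# The number survives as the text node immediately *after* the empty <a>:
print(sel.xpath('//a[contains(@class, "fancyd_list")]/following-sibling::text()').extract())
# expected: ['\n    94,218\n'] (whitespace included)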

The previous recommendation of using BeautifulSoup may be a good idea here, because it handles malformed HTML much better.
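
For what it's worth, whether BeautifulSoup copes with this depends on which parser it is given. A small sketch of my own, assuming the html5lib parser is installed (pip install beautifulsoup4 html5lib); html5lib follows the browser parsing rules, which ignore the stray slash in the start tag, so the number should stay inside the <a>:

from bs4 import BeautifulSoup

# Same malformed fragment as above.
html = '<div><a href="#" class="fancyd_list "/>\n    94,218\n</a></div>'

# html5lib parses like a browser: the "/" in the start tag is ignored,
# so the <a> stays open and "94,218" becomes its text content.
soup = BeautifulSoup(html, 'html5lib')
link = soup.find('a', class_='fancyd_list')
print(link.get_text(strip=True))   # expected: 94,218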

Another option for this example would be to fix the HTML yourself, via something similar to:

import re

new_body = re.sub(r'<a href="#" class="fancyd_list "/>', '<a href="#" class="fancyd_list ">', response.body)
response = response.replace(body=new_body)

You would then be able to select from the response via

response.xpath("//div[@class='frm']/div[@class='figure-button']/a[contains(@class, 'fancyd_list')]/text()").extract()

The reason I'm using "contains" is that the class name (for me) appears with a space at the end of its name, so Scrapy's check of "a[@class='fancyd_list']" will fail, because "fancyd_list" != "fancyd_list ".
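
Putting the pieces together, here is a rough end-to-end sketch (mine, not the answerer's) of how the re.sub repair and the XPath might sit inside a spider. The spider name, the yielded item key, and the use of response.text instead of the answer's response.body (so the code also works on Python 3, where the raw body is bytes) are my assumptions; the URL and the selectors are the ones quoted above and may no longer match the live page:

import re

import scrapy


class FancySpider(scrapy.Spider):
    """Rough sketch only; the selectors come from the answer above."""
    name = 'fancy_count'  # assumed name
    start_urls = ['https://fancy.com/things/280558613/I%27m-Fine-T-Shirt']

    def parse(self, response):
        # Repair the self-closing <a .../> so the count ends up inside the tag.
        # Uses response.text (decoded string) rather than response.body (bytes).
        new_body = re.sub(
            r'<a href="#" class="fancyd_list "/>',
            '<a href="#" class="fancyd_list ">',
            response.text,
        )
        response = response.replace(body=new_body)

        count = response.xpath(
            "//div[@class='frm']/div[@class='figure-button']"
            "/a[contains(@class, 'fancyd_list')]/text()"
        ).extract_first()
        yield {'fancy_count': count.strip() if count else None}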
