XPath not working for screen scraping
I am using Scrapy for a screen scraping project and am having problems with an XPath.
I am trying to get the 94,218 from the image below, but the XPaths and CSS selectors I have used are not working.
It's from this page: https://fancy.com/things/280558613/I%27m-Fine-T-Shirt
I have tried multiple XPaths and CSS with Scrapy but everything is returning blank.
Here are some examples:
response.xpath('/html/body/div[1]/div[1]/div[1]/aside/div[1]/div/div/a[2]/text()').extract()
response.xpath('//*[@id="sidebar"]/div[1]/div/div/a[2]/text()').extract()
response.xpath('//*[contains(concat(" ", @class, " "), concat(" ", "fancyd_list", " "))]').extract()
response.xpath(".//*[@id='sidebar']/div[1]/div/div/a[2]/text()")
I've tried Firebug, Firepath, Chrome Dev Tools and different plugins, but none of the XPaths or CSS selectors seem to work. Can someone assist?
The code on the actual page is:
<a href="#" class="fancyd_list "/>
6
</a>
Some of the XPaths work, but they contain no text, so it looks like this: <a href="#" class="fancyd_list "/></a>
I've also tried using BeautifulSoup, but it has the same problem:
print soup.find_all('a',class_='fancyd_list')
[<a class="fancyd_list " href="#"></a>, <a class="fancyd_list " href="#"></a>]
Thanks!
The problem here is that the provided URL is returning HTML with a malformed <a> tag:
<a href="#" class="fancyd_list "/> <!-- malformed HTML: the <a> tag closes here -->
94,218
</a>
The first line here contains a / just before the closing bracket, which the parser treats as self-closing the <a> tag (as it would in XML syntax). As far as Scrapy is concerned the <a> element is already complete at that point, so the number that follows sits outside the element and a/text() has nothing to return.
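To see why the text ends up outside the element, here is a minimal sketch using only the standard library's XML parser. It is an illustration, not Scrapy's own parser: the snippet is simplified (the stray </a> is dropped so a strict XML parser accepts it), and the assumption is that Scrapy's more lenient HTML parser builds an analogous tree.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical snippet: the stray </a> is removed so that the
# strict XML parser accepts it. The self-closing <a .../> is kept as-is.
snippet = '<div><a href="#" class="fancyd_list "/>94,218</div>'

root = ET.fromstring(snippet)
link = root.find("a")

# The self-closed element has no text of its own.
print(link.text)
# The number is stored as the text that trails the element.
print(link.tail)
```

In a tree shaped like this, any selector that asks for the text inside the <a> element comes back empty, which matches the blank results described in the question.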
The previous recommendation of using BeautifulSoup may still be a good idea here, because it can handle malformed HTML much better, although the result depends on which underlying parser it is configured to use.
Another option for this example is to fix the HTML yourself, with something similar to:
import re
# response.body is bytes, so the pattern and the replacement are bytes as well
new_body = re.sub(br'<a href="#" class="fancyd_list "/>', b'<a href="#" class="fancyd_list ">', response.body)
response = response.replace(body=new_body)
You would then be able to select the text from the response via:
response.xpath("//div[@class='frm']/div[@class='figure-button']/a[contains(@class, 'fancyd_list')]/text()").extract()
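The HTML repair step can be tried in isolation before wiring it into a spider. The sketch below is an assumption-laden illustration: raw_body is a made-up stand-in for response.body, and unself_close_anchors is a hypothetical helper that generalizes the exact-match substitution to any self-closed <a .../> start tag. The standard library XML parser is used only to check that the repaired snippet now keeps the number inside the element.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for response.body (bytes in Scrapy).
raw_body = (
    b'<div class="figure-button">'
    b'<a href="#" class="fancyd_list "/>\n94,218\n</a>'
    b'</div>'
)


def unself_close_anchors(body):
    """Rewrite self-closed <a .../> start tags as ordinary <a ...> start tags."""
    return re.sub(br'(<a\b[^>]*?)\s*/>', br'\1>', body)


fixed_body = unself_close_anchors(raw_body)

# After the repair the snippet is well formed, and the text is inside <a>.
link_text = ET.fromstring(fixed_body).find("a").text.strip()

print(fixed_body)
print(link_text)
```

In a spider, the repaired bytes would be passed to response.replace(body=fixed_body) as in the answer, and the XPath would then be run against the new response.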
The reason I'm using contains() is that the class name (for me) appears with a space at the end of its name, so an exact check such as a[@class='fancyd_list'] will fail, because "fancyd_list" != "fancyd_list ".
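The effect of the trailing space can be reproduced with a small standard-library sketch. It runs on an already-repaired, hypothetical snippet, and because ElementTree's limited XPath support has no contains(), the tolerant match is written as a plain Python token check instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already-repaired snippet; note the trailing space in class.
root = ET.fromstring('<div><a href="#" class="fancyd_list ">94,218</a></div>')

# Exact comparison without the trailing space finds nothing.
exact = root.findall(".//a[@class='fancyd_list']")

# Exact comparison that includes the trailing space finds the link.
with_space = root.findall(".//a[@class='fancyd_list ']")

# Token-based check: split the attribute on whitespace and test membership.
by_token = [
    a for a in root.iter("a")
    if "fancyd_list" in a.get("class", "").split()
]

print(len(exact), len(with_space), len(by_token))
```

The token check is the stricter cousin of contains(): it ignores stray whitespace but does not match longer class names that merely include the same substring.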