XPath not working for screen scraping

Views: 127 | Posted: 2016/8/5 19:14:41 | Tags: python web-scraping beautifulsoup scrapy screen-scraping


Problem description


I am using Scrapy for a screen scraping project and am having problems with an XPath.

I am trying to get the number 94,218 from the page linked below, but the XPath and CSS selectors I have used are not working.

It's from this page: https://fancy.com/things/280558613/I%27m-Fine-T-Shirt

I have tried multiple XPaths and CSS with Scrapy but everything is returning blank.

Here are some examples:

response.xpath('/html/body/div[1]/div[1]/div[1]/aside/div[1]/div/div/a[2]/text()').extract()

response.xpath('//*[@id="sidebar"]/div[1]/div/div/a[2]/text()').extract()

response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "fancyd_list", " " ))]').extract()

response.xpath(".//*[@id='sidebar']/div[1]/div/div/a[2]/text()")

I've tried Firebug, FirePath, Chrome Dev Tools and different plugins, but none of the XPath or CSS selectors seem to work. Can someone assist?

The code on the actual page is:

<a href="#" class="fancyd_list "/>
    6
</a>

Some of the XPaths work, but they contain no text, so it looks like this: <a href="#" class="fancyd_list "/></a>

I've also tried using BeautifulSoup, but it has the same problem:

print soup.find_all('a',class_='fancyd_list')
[<a class="fancyd_list " href="#"></a>, <a class="fancyd_list " href="#"></a>]

Thanks!

Solution

The problem here is that the provided URL returns HTML with a malformed <a> tag:

<a href="#" class="fancyd_list "/>  # Malformed HTML, <a> tag closes here
    94,218
</a>

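The first line here contains a / just before the closing bracket. That is XML-style self-closing syntax, and the parser behind Scrapy's selectors treats the <a> element as complete at that point. Since to Scrapy the <a> element is already closed, the number that follows lies outside the tag, and selecting the element's text() returns nothing.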
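To make the effect concrete, here is a minimal sketch that uses Python's standard-library html.parser rather than the parser Scrapy uses, so treat it as an illustration of the same self-closing behaviour and not as a reproduction of Scrapy's internals. The class TextOwnerParser and the helper text_owners are hypothetical names, and the surrounding <div> is an assumption added so the text has some enclosing element.

```python
from html.parser import HTMLParser


class TextOwnerParser(HTMLParser):
    """Record each non-blank text node together with the tag that encloses it."""

    def __init__(self):
        super().__init__()
        self.open_tags = []     # stack of currently open tags
        self.text_owners = []   # (enclosing tag or None, text) pairs

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # "<a .../>" opens and closes in one step, so nothing stays on the stack.
        pass

    def handle_endtag(self, tag):
        # Ignore stray closing tags such as the "</a>" left after "<a .../>".
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            owner = self.open_tags[-1] if self.open_tags else None
            self.text_owners.append((owner, text))


def text_owners(html):
    parser = TextOwnerParser()
    parser.feed(html)
    parser.close()
    return parser.text_owners


malformed = '<div><a href="#" class="fancyd_list "/>\n    94,218\n</a></div>'
fixed = '<div><a href="#" class="fancyd_list ">\n    94,218\n</a></div>'

print(text_owners(malformed))  # [('div', '94,218')]  text belongs to the <div>
print(text_owners(fixed))      # [('a', '94,218')]    text belongs to the <a>
```

With the slash in place the number is attributed to the parent element, which matches the empty <a> elements reported in the question.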

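The previous recommendation of using BeautifulSoup may still be a good idea here, because it generally handles malformed HTML better. Note, however, that the question reports the same empty result with BeautifulSoup, so the outcome likely depends on which underlying parser it is configured to use.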

Another option you can have for this example would be to fix the HTML yourself, via something similar to:

import re
# On Python 3, response.body is bytes, so the pattern and replacement are bytes too
new_body = re.sub(br'<a href="#" class="fancyd_list "/>', b'<a href="#" class="fancyd_list ">', response.body)
response = response.replace(body=new_body)
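If the attributes of the broken tag vary from page to page, the literal pattern above can be loosened. The following is only a sketch of that idea, not part of the original answer: fix_self_closed_anchors is a hypothetical helper, and the regular expression assumes that no attribute value contains ">" and that no unquoted attribute value ends in "/".

```python
import re

# Matches "<a ...attributes... />" and captures the attributes.
# \b keeps tags such as <abbr/> or <area/> from matching.
SELF_CLOSED_A = re.compile(rb'<a\b([^>]*?)\s*/>')


def fix_self_closed_anchors(body):
    """Turn XML-style self-closed <a .../> tags into ordinary opening tags."""
    return SELF_CLOSED_A.sub(rb'<a\1>', body)


sample = b'<a href="#" class="fancyd_list "/>\n    94,218\n</a>'
print(fix_self_closed_anchors(sample))
```

In a spider, the result could be passed to response.replace(body=...) exactly as in the snippet above; tags that are not self-closed are left untouched.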

You would then be able to select the text from the repaired response via:

response.xpath("//div[@class='frm']/div[@class='figure-button']/a[contains(@class, 'fancyd_list')]/text()").extract()

The reason I'm using "contains" is that the class name (for me) appears with a space at the end of its name, so an exact check such as "a[@class='fancyd_list']" fails, because "fancyd_list" != "fancyd_list ".
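The trailing-space mismatch can be reproduced without Scrapy. This sketch uses the standard-library xml.etree.ElementTree on an already repaired, well-formed snippet; ElementTree supports exact attribute predicates but not contains(), so the tolerant match is written as a plain Python check. The helper has_class is a hypothetical name, and the snippet is an assumption modelled on the markup quoted above.

```python
import xml.etree.ElementTree as ET

# Repaired markup, with the trailing space in the class attribute preserved.
snippet = '<div><a href="#" class="fancyd_list ">94,218</a></div>'
root = ET.fromstring(snippet)

# Exact comparison: "fancyd_list" != "fancyd_list ", so nothing matches.
exact = root.findall(".//a[@class='fancyd_list']")


def has_class(element, name):
    """True if `name` is one of the whitespace-separated class tokens."""
    return name in element.get("class", "").split()


# Token-based comparison ignores leading and trailing whitespace.
by_token = [a for a in root.iter("a") if has_class(a, "fancyd_list")]

print(len(exact))                  # 0
print([a.text for a in by_token])  # ['94,218']
```

Note that contains(@class, 'fancyd_list') is a substring test, so it would also match a class such as "fancyd_list_old"; the concat(" ", @class, " ") form tried in the question is the stricter, token-based XPath equivalent of has_class.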
