Cannot retrieve XPath using Scrapy


Question

Hello, I am trying to use XPath to get the title and link from the cells with class listCell. I believe my code is right because I get no errors, but the CSV output file is empty. I also tested my spider on other websites such as Amazon and it worked fine, but it does not work for this website. Please help!

    def parse(self, response):
        self.log("\n\n\n We got data! \n\n\n")
        hxs = HtmlXPathSelector(response)
        # Select every row of the table inside the form with id "listForm".
        sites = hxs.select('//form[@id="listForm"]/table/tbody/tr')
        items = []
        for site in sites:
            item = CarrierItem()
            # Text and href of the links inside the listCell cells of this row.
            item['title'] = site.select('.//td[@class="listCell"]/a/text()').extract()
            item['link'] = site.select('.//td[@class="listCell"]/a/@href').extract()
            items.append(item)
        return items

Here is my HTML. Could it be failing because the page has JavaScript in it?

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title> Carrier IQ DIS 2.4 :: All Devices</title>
<script type="text/javascript" src="/dis/js/main.js">
<script type="text/javascript" src="/dis/js/validate.js">
<link rel="stylesheet" type="text/css" href="/dis/css/portal.css">
<link rel="stylesheet" type="text/css" href="/dis/css/style.css">
<script type="text/javascript">

    ....

<form id="listForm" name="listForm" method="POST" action="">
<table>
<thead>
<tbody>
<tr>
<td class="crt">1</td>
<td class="listCell" align="center">
<a href="/dis/packages.jsp?view=list&show=perdevice&device_gid=3651746C4173775343535452414567746D75643855673D3D53564A6151624D41716D534C68395A6337634E2F62413D3D&hwdid=probe0&mdn=6505550000&subscrbid=6505550000&maxlength=100">6505550000</a>
</td>
<td class="listCell" align="center">
<a href="/dis/packages.jsp?view=list&show=perdevice&device_gid=3651746C4173775343535452414567746D75643855673D3D53564A6151624D41716D534C68395A6337634E2F62413D3D&hwdid=probe0&subscrbid=6505550000&mdn=6505550000&maxlength=100">probe0</a>
</td>
<td class="listCell" align="center">
<td class="listCell" align="center">
<td class="cell" align="center">2013-07-01 13:39:38.820</td>
<td class="cell" align="left">1 - SMS_PullRequest_CS</td>
<td class="listCell" align="right">
<td class="listCell" align="center">
<td class="listCell" align="center">
</tr>
</tbody>
</table>
</form>

Output

    C:\Users\ye831c\Documents\Big Data\Scrapy\carrier>scrapy crawl dis -o iqDis.csv -t csv
2013-07-01 10:50:18-0500 [scrapy] INFO: Scrapy 0.16.5 started (bot: carrier)
2013-07-01 10:50:18-0500 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Enabled item pipelines:
2013-07-01 10:50:19-0500 [dis] INFO: Spider opened
2013-07-01 10:50:19-0500 [dis] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-01 10:50:19-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp> (referer: None)
2013-07-01 10:50:19-0500 [dis] DEBUG: Redirecting (302) to <GET https://qvpweb01.ciq.labs.att.com:8080/dis/> from <POST https://qvpweb01.ciq.labs.att.com:8080/dis/login>
2013-07-01 10:50:20-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.labs.att.com:8080/dis/> (referer: https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp)
2013-07-01 10:50:20-0500 [dis] DEBUG:


    Successfully logged in. Let's start crawling!



2013-07-01 10:50:21-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.labs.att.com:8080/dis/> (referer: https://qvpweb01.ciq.labs.att.com:8080/dis/)
2013-07-01 10:50:21-0500 [dis] DEBUG:


     We got data!



2013-07-01 10:50:21-0500 [dis] INFO: Closing spider (finished)
2013-07-01 10:50:21-0500 [dis] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 1382,
     'downloader/request_count': 4,
     'downloader/request_method_count/GET': 3,
     'downloader/request_method_count/POST': 1,
     'downloader/response_bytes': 147888,
     'downloader/response_count': 4,
     'downloader/response_status_count/200': 3,
     'downloader/response_status_count/302': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2013, 7, 1, 15, 50, 21, 221000),
     'log_count/DEBUG': 12,
     'log_count/INFO': 4,
     'request_depth_max': 2,
     'response_received_count': 3,
     'scheduler/dequeued': 4,
     'scheduler/dequeued/memory': 4,
     'scheduler/enqueued': 4,
     'scheduler/enqueued/memory': 4,
     'start_time': datetime.datetime(2013, 7, 1, 15, 50, 19, 42000)}
2013-07-01 10:50:21-0500 [dis] INFO: Spider closed (finished)

Recommended answer

Try simplifying your XPath:

sites = hxs.select('//form[@id="listForm"]//tr')

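The tbody element is, in many cases, not present in the HTML the server sends; the browser generates it when it builds the page. An XPath copied from the browser's inspector can therefore match nothing in the raw response that Scrapy receives, which leaves the output file empty without raising any error.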
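
The effect can be reproduced without Scrapy. The sketch below is only an illustration: it uses Python's standard-library xml.etree.ElementTree instead of Scrapy's selectors so that it runs without any installation, and RAW_HTML is a hypothetical, simplified version of the page in which the rows sit directly under the table, on the assumption that the server sends no tbody. The shortened href values and the extract_links helper are also made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup as the server might send it:
# the <tr> rows are direct children of <table>, with no <tbody> wrapper.
RAW_HTML = """
<html><body>
<form id="listForm" name="listForm" method="POST" action="">
<table>
<tr>
<td class="crt">1</td>
<td class="listCell" align="center"><a href="/dis/packages.jsp?view=list">6505550000</a></td>
<td class="listCell" align="center"><a href="/dis/packages.jsp?view=list">probe0</a></td>
<td class="cell" align="center">2013-07-01 13:39:38.820</td>
</tr>
</table>
</form>
</body></html>
""".strip()

root = ET.fromstring(RAW_HTML)

# Path as copied from a browser inspector: it requires a <tbody>
# that does not exist in this source, so it matches no rows.
strict_rows = root.findall(".//form[@id='listForm']/table/tbody/tr")

# Relaxed path from the answer: any <tr> below the form, at any depth.
relaxed_rows = root.findall(".//form[@id='listForm']//tr")


def extract_links(rows):
    """Return (text, href) pairs for each link inside a listCell cell."""
    links = []
    for row in rows:
        for anchor in row.findall("./td[@class='listCell']/a"):
            links.append((anchor.text, anchor.get("href")))
    return links


print(len(strict_rows))             # 0: the strict path finds nothing
print(extract_links(relaxed_rows))  # both links are found
```

The strict path fails silently, which matches the symptom in the question: the spider logs "We got data!" but the loop over the rows never runs. In Scrapy itself, a comparable check is to open the page in scrapy shell and evaluate both expressions against the actual response.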
