Failed to crawl element of specific website with scrapy spider

Problem Description

I want to get the website addresses of some jobs, so I wrote a Scrapy spider. I want to extract all of the values with the XPath //article/dl/dd/h2/a[@class="job-title"]/@href, but when I execute the spider with the command:

scrapy crawl auseek -a addsthreshold=3

The variable "urls" used to store the extracted values is empty. Can someone help me figure out why?

Here is my code:

from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.conf import settings
from scrapy.mail import MailSender
from scrapy.xlib.pydispatch import dispatcher
from scrapy.exceptions import CloseSpider
from scrapy import log
from scrapy import signals

from scrapy.http import Request   # needed for the Request objects yielded below

from myProj.items import ADItem
import time
import urlparse                   # needed for urlparse.urljoin in parse_start_url

class AuSeekSpider(CrawlSpider):
    name = "auseek"
    result_address = []
    addressCount = int(0)
    addressThresh = int(0)
    allowed_domains = ["seek.com.au"]
    start_urls = [
        "http://www.seek.com.au/jobs/in-australia/"
    ]

    def __init__(self,**kwargs):
        super(AuSeekSpider, self).__init__()
        self.addressThresh = int(kwargs.get('addsthreshold'))
        print 'init finished...'

    def parse_start_url(self,response):
        print 'This is start url function'
        log.msg("Pipeline.spider_opened called", level=log.INFO)
        hxs = Selector(response)
        urls = hxs.xpath('//article/dl/dd/h2/a[@class="job-title"]/@href').extract()
        print 'urls is:',urls
        print 'test element:',urls[0].encode("ascii")
        for postfix in urls:
            # each entry is already the href string extracted by the XPath
            print 'postfix:', postfix
            url = urlparse.urljoin(response.url, postfix)
            yield Request(url, callback=self.parse_ad)

        return 


    def parse_ad(self, response):
        print 'this is parse_ad function'
        hxs = Selector(response) 

        item = ADItem()
        log.msg("Pipeline.parse_ad called", level=log.INFO)
        item['name'] = str(self.name)
        item['picNum'] = str(6)
        item['link'] = response.url
        item['date'] = time.strftime('%Y%m%d',time.localtime(time.time()))

        self.addressCount = self.addressCount + 1
        if self.addressCount > self.addressThresh:
            raise CloseSpider('Get enough website address')
        return item

The problem is:

urls = hxs.xpath('//article/dl/dd/h2/a[@class="job-title"]/@href').extract()

urls is empty when I try to print it out. I just can't figure out why it doesn't work or how to correct it. Thanks for your help.

Recommended Answer

Scrapy does not evaluate JavaScript. If you run the following command, you will see that the raw HTML does not contain the anchors you are looking for.

curl http://www.seek.com.au/jobs/in-australia/ | grep job-title

You should try PhantomJS or Selenium instead.
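
Not part of the original answer, but as a minimal sketch of the Selenium route: render the page in a headless browser first, then feed the rendered HTML into a Scrapy Selector and reuse the same XPath. The PhantomJS driver and the absence of explicit waits are assumptions here; adapt them to whatever browser driver you have installed.

from selenium import webdriver
from scrapy.selector import Selector

# Render the JavaScript-driven page in a headless browser first.
driver = webdriver.PhantomJS()
driver.get("http://www.seek.com.au/jobs/in-australia/")

# Feed the rendered HTML into a Scrapy Selector and reuse the original XPath.
sel = Selector(text=driver.page_source)
urls = sel.xpath('//article/dl/dd/h2/a[@class="job-title"]/@href').extract()
print 'urls is:', urls

driver.quit()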

After examining the network requests in Chrome, the job listings appear to originate from a JSONP request. It should be easy to retrieve whatever you need from it.
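
The exact endpoint was linked in the original answer and is not reproduced here, so the URL below is only a placeholder. As a rough sketch of the idea: a JSONP response is ordinary JSON wrapped in a callback such as callback({...}), so strip the padding and parse the rest.

import json
import re
import urllib2

# Placeholder URL -- substitute the real endpoint found in Chrome's network tab.
JSONP_URL = "http://www.seek.com.au/path/to/jsonp-endpoint"

body = urllib2.urlopen(JSONP_URL).read()

# Strip the "callback( ... )" wrapper to leave plain JSON, then parse it.
match = re.search(r'\((.*)\)\s*;?\s*$', body, re.S)
data = json.loads(match.group(1) if match else body)
print data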
