Scrapy output issue


Problem description


I am having trouble displaying my items the way I want. My code is as follows:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import request
from scrapy.selector import HtmlXPathSelector
from texashealth.items import TexashealthItem

class texashealthspider(CrawlSpider):

    name="texashealth"
    allowed_domains=['jobs.texashealth.org']
    start_urls=['http://jobs.texashealth.org/search/?&q=&title=Filter%3A%20title&facility=Filter%3A%20facility&location=Filter%3A%20city&date=Filter%3A%20date']

    rules=(
    Rule(SgmlLinkExtractor(allow=("search/",)), callback="parse_health", follow=True),
    #Rule(SgmlLinkExtractor(allow=("startrow=\d",)),callback="parse_health",follow=True),
    )

    def parse_health(self, response):
        hxs=HtmlXPathSelector(response)
        titles=hxs.select('//tbody/tr/td')
        items = []

        for titles in titles:
            item=TexashealthItem()
            item['title']=titles.select('span[@class="jobTitle"]/a/text()').extract()
            item['link']=titles.select('span[@class="jobTitle"]/a/@href').extract()
            item['shifttype']=titles.select('span[@class="jobShiftType"]/text()').extract()
            item['location']=titles.select('span[@class="jobLocation"]/text()').extract()
            items.append(item)
        print items
        return items

The output that is displayed (the printed list of items) looks as follows:

[
    TexashealthItem(location=[], link=[u'/job/Fort-Worth-ULTRASONOGRAPHER-II-Job-TX-76101/31553900/'], shifttype=[], title=[u'ULTRASONOGRAPHER II Job']), 
    TexashealthItem(location=[], link=[], shifttype=[u'Texas Health Fort Worth'], title=[]), 
    TexashealthItem(location=[u'Fort Worth, TX, US'], link=[], shifttype=[], title=[]), 
    TexashealthItem(location=[], link=[], shifttype=[], title=[]), 
    TexashealthItem(location=[], link=[u'/job/Kaufman-RN-Acute-ICU-Full-Time-Kaufman-Job-TX-75142/35466900/'], shifttype=[], title=[u'RN--Telemetry--Full Time--Kaufman Job']), 
    TexashealthItem(location=[], link=[], shifttype=[u'Texas Health Kaufman'], title=[]), 
    TexashealthItem(location=[u'Kaufman, TX, US'], link=[], shifttype=[], title=[]), 
    TexashealthItem(location=[], link=[], shifttype=[], title=[]), 
    TexashealthItem(location=[], link=[u'/job/Fort-Worth-NURSE-PRACTITIONER-Occ-Med-Full-Time-Alliance-Job-TX-76101/35465400/'], shifttype=[], title=[u'NURSE PRACTITIONER-Occ Med-Full Time-Alliance Job']), 
    TexashealthItem(location=[], link=[], shifttype=[u'Texas Health Alliance'], title=[]), 
    TexashealthItem(location=[u'Fort Worth, TX, US'], link=[], shifttype=[], title=[]), 
    TexashealthItem(location=[], link=[], shifttype=[], title=[])
]

As you can see above, the fields of each job are split across separate items: the title and link appear in one entry, and the rest of the fields appear in other, separate entries.

Can I get a solution so that all the fields are displayed together in one shot?

Thank you for your help.

Solution

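You should loop over table rows (tr elements), not table cells (td elements). Each td holds only one of the spans, so looping over cells produces one item per cell with only that cell's fields filled in.

I suggest you use hxs.select('//table[@id="searchresults"]/tbody/tr') and then use .//span... in each loop iteration, so that every XPath is evaluated relative to the current row.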

# One selector per table row, so each item collects all fields of one job.
rows=hxs.select('//table[@id="searchresults"]/tbody/tr')
items = []
for row in rows:
    item=TexashealthItem()  # create a fresh item for every row
    item['title']=row.select('.//span[@class="jobTitle"]/a/text()').extract()
    item['link']=row.select('.//span[@class="jobTitle"]/a/@href').extract()
    item['shifttype']=row.select('.//span[@class="jobShiftType"]/text()').extract()
    item['location']=row.select('.//span[@class="jobLocation"]/text()').extract()
    items.append(item)
return items
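
To see the difference between looping over cells and looping over rows without running the spider, here is a minimal sketch. It uses Python's standard xml.etree.ElementTree as a stand-in for the Scrapy selector, and the SAMPLE markup is an assumption reconstructed from the output in the question (the real page may be structured differently). The helper name extract is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup modelled on the question's output: four cells per row,
# each span in its own cell. The real page may differ.
SAMPLE = """
<table id="searchresults">
  <tbody>
    <tr>
      <td><span class="jobTitle"><a href="/job/Fort-Worth-ULTRASONOGRAPHER-II-Job-TX-76101/31553900/">ULTRASONOGRAPHER II Job</a></span></td>
      <td><span class="jobShiftType">Texas Health Fort Worth</span></td>
      <td><span class="jobLocation">Fort Worth, TX, US</span></td>
      <td></td>
    </tr>
    <tr>
      <td><span class="jobTitle"><a href="/job/Kaufman-RN-Acute-ICU-Full-Time-Kaufman-Job-TX-75142/35466900/">RN--Telemetry--Full Time--Kaufman Job</a></span></td>
      <td><span class="jobShiftType">Texas Health Kaufman</span></td>
      <td><span class="jobLocation">Kaufman, TX, US</span></td>
      <td></td>
    </tr>
  </tbody>
</table>
"""


def extract(node):
    """Collect the four fields from the spans found anywhere below `node`."""
    link = node.find('.//span[@class="jobTitle"]/a')
    shift = node.find('.//span[@class="jobShiftType"]')
    location = node.find('.//span[@class="jobLocation"]')
    return {
        "title": link.text if link is not None else None,
        "link": link.get("href") if link is not None else None,
        "shifttype": shift.text if shift is not None else None,
        "location": location.text if location is not None else None,
    }


table = ET.fromstring(SAMPLE)

# Looping over cells: one partially filled item per td (eight in total).
per_cell = [extract(td) for td in table.findall("./tbody/tr/td")]

# Looping over rows: one complete item per tr (two in total).
per_row = [extract(tr) for tr in table.findall("./tbody/tr")]

for item in per_row:
    print(item)
```

With this sample, per_cell reproduces the fragmented pattern from the question (title and link in one entry, shift type in the next, location in the next, then an empty one), while per_row yields one complete entry per job.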
