Scrapy: output each element of a list-valued item on its own row


Question

I am new to Scrapy and have spent the past week or more looking everywhere for a solution to my problem. I am trying to scrape the tabular data for UFC 1 at http://ufcstats.com/event-details/6420efac0578988b.

My spider works, but it returns each field as a list of strings. For example: 'winner': ['Royce Gracie', 'Jason DeLucia', 'Royce Gracie', 'Gerard Gordeau', 'Ken Shamrock', 'Royce Gracie', 'Kevin Rosier', 'Gerard Gordeau']. When I export to CSV, the event winners, losers and other stats are written as a list of strings in a single row. I want each list element written to its own row. I have been able to sort this out in pandas, but that feels like unnecessary work and I doubt it will scale well.

I would like the CSV output to be laid out like the table on the page. I don't know whether this should be done in the spider itself, in items/item loaders, or in pipelines.

This seems like a common issue, but I haven't been able to figure out a Scrapy solution.

I have tried iterating with for loops in the spider code, using my standard item loader, and using item input and/or output processors, along with everything else I could find in various examples on SO, but I haven't been able to get the desired output. I was able to troubleshoot my earlier issues, though. I'm quite stuck, and any help would be greatly appreciated.

# items.py
import scrapy
# Newer Scrapy releases expose these processors in itemloaders.processors.
from scrapy.loader.processors import (
    Identity, TakeFirst, Compose, MapCompose, Join,
)


def compact(s):
    # Return None for empty strings so that MapCompose drops them.
    return s if s else None


class StatsItem(scrapy.Item):
    # define the fields for your item here like:
    event_name = scrapy.Field(input_processor=MapCompose(str.strip, compact))
    event_date = scrapy.Field(input_processor=MapCompose(str.strip, compact))
    event_loc = scrapy.Field(input_processor=MapCompose(str.strip, compact))
    attendance = scrapy.Field(input_processor=MapCompose(str.strip, compact))
    f_info = scrapy.Field(input_processor=MapCompose(str.strip, compact))
    winner = scrapy.Field(input_processor=MapCompose(str.strip))
    loser = scrapy.Field(input_processor=MapCompose(str.strip))

# spider code
import scrapy
from ..items import StatsItem
from scrapy.loader import ItemLoader
#from scrapy.loader.processors import Join, MapCompose, TakeFirst


class StatsSpider(scrapy.Spider):
    name = 'stats'
    allowed_domains = ['fcstats...']
    start_urls = ['http://fcstats.../']

    custom_settings = {
        # specifies exported fields and order
        'FEED_EXPORT_FIELDS': [
            'event_name', 'event_date', 'event_loc', 'attendance',
            'winner',  #'w_str', 'w_td', 'w_sub', 'w_pass', 'w_wclass', 'w_method', 'w_mthdtl', 'w_round', 'w_time',
            'loser',  #'l_str', 'l_td', 'l_sub', 'l_pass', 'l_wclass', 'l_method', 'l_mthdtl', 'l_round', 'l_time',
            'f_info',
        ],
    }

    def parse(self, response):
        rev_orderd_events = response.css('tr.b-statistics__table-row')[::-1]
        # full event_links
        # event_links = rev_orderd_events.css('i>a::attr(href)').extract()
        # for url in event_links:
        #     yield scrapy.Request(url=url, callback=self.parse_event)
        event_links = rev_orderd_events.css('i>a::attr(href)').extract_first()
        yield scrapy.Request(url=event_links, callback=self.parse_event)

    # follow links
    def parse_event(self, response):
        #sel = Selector(response)
        pg = response.css('div.l-page__container')
        #fights = response.css('tr.b-fight-details__table-row.b-fight-details__table-row__hover.js-fight-details-click')
        #table = response.css('table.b-fight-details__table.b-fight-details__table_style_margin-top.b-fight-details__table_type_event-details.js-fight-table')

        for match in pg:
            # The loader is bound to the whole response, so every add_css
            # below collects all matching nodes on the page into one list.
            il = ItemLoader(StatsItem(), response=response)
            il.add_css('event_name', 'h2.b-content__title>span::text')
            il.add_css('event_date', 'ul.b-list__box-list>li:nth-child(1)::text')
            il.add_css('event_loc', 'ul.b-list__box-list>li:nth-child(2)::text')
            il.add_css('attendance', 'ul.b-list__box-list>li:nth-child(3)::text')
            il.add_css('winner', 'p.b-fight-details__table-text:nth-child(odd)>a::text')
            il.add_css('loser', 'p.b-fight-details__table-text:nth-child(even)>a::text')
            il.add_css('f_info', 'td p.b-fight-details__table-text::text')
            yield il.load_item()

Actual result:

event_name  event_date  event_loc   attendance  winner  loser   f_info

UFC 1: The Beginning    12-Nov-93   Denver, Colorado, USA   2,800   Royce Gracie,Jason DeLucia,Royce Gracie,Gerard Gordeau,Ken Shamrock,Royce Gracie,Kevin Rosier,Gerard Gordeau    Gerard Gordeau,Trent Jenkins,Ken Shamrock,Kevin Rosier,Patrick Smith,Art Jimmerson,Zane Frazier,Teila Tuli  1,0,1,0,1,0,2,0,Open Weight,SUB,Rear Naked Choke,1,1:44,3,1,1,0,1,0,1,0,Open Weight,SUB,Rear Naked Choke,1,0:52,0,0,0,0,1,0,2,0,Open Weight,SUB,Rear Naked Choke,1,0:57,11,0,0,0,0,0,0,0,Open Weight,KO/TKO,1,0:59,1,4,1,0,2,0,0,0,Open Weight,SUB,Heel Hook,1,1:49,0,0,1,0,0,0,2,0,Open Weight,SUB,Other,1,2:18,15,12,0,0,0,0,0,0,Open Weight,KO/TKO,1,4:20,3,0,0,0,0,0,0,0,Open Weight,KO/TKO,Kick,1,0:26

Expected result is more like:

event_name  event_date  event_loc   attendance  winner  loser   f_info

UFC 1: The Beginning    12-Nov-93   Denver, Colorado, USA   2,800   Royce Gracie, Gerard Gordeau, 1,0,1,0,1,0,2,0,Open Weight,SUB,Rear Naked Choke,1,1:44,

UFC 1: The Beginning    12-Nov-93   Denver, Colorado, USA   2,800 Jason DeLucia, Trent Jenkins 3,1,1,0,1,0,1,0,Open Weight,SUB,Rear Naked Choke,1,0:52 ....

*Edited for clarity
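For illustration only, here is one way the "one row per fight" shape could be produced after the fact from an item whose fields are parallel lists. This is a minimal sketch using only the standard library, not the approach taken in the answer; `explode_item` and its parameters are hypothetical names. It only applies to fields whose lists line up one-to-one (such as `winner` and `loser`). It would not work for `f_info`, where the number of values per fight varies in the output shown above.

```python
import csv
import io


def explode_item(item, shared_fields, per_fight_fields):
    """Yield one flat dict per fight from an item whose values are lists.

    Hypothetical helper: assumes every per-fight list has the same length
    and every shared field holds exactly one usable value at index 0.
    """
    lengths = {len(item[name]) for name in per_fight_fields}
    if len(lengths) != 1:
        raise ValueError("per-fight lists differ in length; cannot pair them up")
    # Event-level values are repeated on every output row.
    shared = {name: item[name][0] for name in shared_fields}
    for values in zip(*(item[name] for name in per_fight_fields)):
        row = dict(shared)
        row.update(zip(per_fight_fields, values))
        yield row


# Sample data taken from the output shown in the question (first two fights).
item = {
    "event_name": ["UFC 1: The Beginning"],
    "winner": ["Royce Gracie", "Jason DeLucia"],
    "loser": ["Gerard Gordeau", "Trent Jenkins"],
}
rows = list(explode_item(item, ["event_name"], ["winner", "loser"]))

# Write the rows to an in-memory CSV to show the resulting layout.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["event_name", "winner", "loser"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Pairing lists by position is fragile: a single missing cell shifts every later row, which is why scoping the selectors to each table row is usually the safer choice.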

Recommended answer

Thanks to @umair and @Catalina_Chircu.

def parse_event(self, response):

    pg = response.css('div.l-page__container')

    for event in response.css('div.b-fight-details'):
        event_name = pg.css('h2.b-content__title>span::text').extract_first()
        event_date = event.css('ul.b-list__box-list>li:nth-child(1)::text').extract()
        event_loc  = event.css('ul.b-list__box-list>li:nth-child(2)::text').extract()
        attendance = event.css('ul.b-list__box-list>li:nth-child(3)::text').extract()


        for fights in event.css('tr')[1:]: 
            il = ItemLoader(StatsItem(), selector=fights)
            il.add_value('event_name', event_name)
            il.add_value('event_date', event_date)
            il.add_value('event_loc', event_loc)
            il.add_value('attendance', attendance)
            il.add_css('winner', 'td.b-fight-details__table-col:nth-child(2) p.b-fight-details__table-text:nth-child(odd)>a::text')
            il.add_css('loser', 'td.b-fight-details__table-col:nth-child(2) p.b-fight-details__table-text:nth-child(even)>a::text')
            #il.add_css('f_info', ':nth-child(3) p.b-fight-details__table-text::text')
            il.add_css('w_str' ,'td.b-fight-details__table-col:nth-child(3)>p:nth-child(odd)::text')
            il.add_css('l_str' ,'td.b-fight-details__table-col:nth-child(3)>p:nth-child(even)::text')
            il.add_css('w_td'  ,'td.b-fight-details__table-col:nth-child(4)>p:nth-child(odd)::text')
            il.add_css('l_td'  ,'td.b-fight-details__table-col:nth-child(4)>p:nth-child(even)::text')
            il.add_css('w_sub' ,'td.b-fight-details__table-col:nth-child(5)>p:nth-child(odd)::text')
            il.add_css('l_sub' ,'td.b-fight-details__table-col:nth-child(5)>p:nth-child(even)::text')
            il.add_css('w_pass','td.b-fight-details__table-col:nth-child(6)>p:nth-child(odd)::text')
            il.add_css('l_pass','td.b-fight-details__table-col:nth-child(6)>p:nth-child(even)::text')
            il.add_css('w_wclass','td.b-fight-details__table-col:nth-child(7)>p:nth-child(1)::text')
            il.add_css('l_wclass','td.b-fight-details__table-col:nth-child(7)>p:nth-child(1)::text')
            il.add_css('w_method','td.b-fight-details__table-col:nth-child(8)>p:nth-child(odd)::text')
            il.add_css('l_method','td.b-fight-details__table-col:nth-child(8)>p:nth-child(odd)::text')
            il.add_css('w_mthdtl','td.b-fight-details__table-col:nth-child(8)>p:nth-child(even)::text')
            il.add_css('l_mthdtl','td.b-fight-details__table-col:nth-child(8)>p:nth-child(even)::text')
            il.add_css('w_round','td.b-fight-details__table-col:nth-child(9)>p:nth-child(odd)::text')
            il.add_css('l_round','td.b-fight-details__table-col:nth-child(9)>p:nth-child(odd)::text')
            il.add_css('w_time','td.b-fight-details__table-col:nth-child(10)>p:nth-child(odd)::text')
            il.add_css('l_time','td.b-fight-details__table-col:nth-child(10)>p:nth-child(odd)::text')
            yield il.load_item()

Together with the associated item input/output processors, this gives me most of what I was hoping for.
