Creating loop to parse table data in scrapy/python


Problem description


I have a Python script using Scrapy that scrapes data from a website, allocates it to 3 fields and then generates a .csv file. It works, but with one major problem: every field contains all of the data, rather than the data being separated out per table row. I'm sure this is because my loop isn't working: when it finds the xpath it grabs all the data for every row before moving on to get the data for the other 2 fields, instead of creating separate rows.

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    divs = hxs.select('//tr[@class="someclass"]')
    for div in divs:
        item = TestBotItem()
        item['var1'] = div.select('//table/tbody/tr[*]/td[2]/p/span[2]/text()').extract()
        item['var2'] = div.select('//table/tbody/tr[*]/td[3]/p/span[2]/text()').extract() 
        item['var3'] = div.select('//table/tbody/tr[*]/td[4]/p/text()').extract()
        return item


The tr index marked with * increases with each entry on the website I need to crawl, and the other two paths slot in below it. How do I edit this so it grabs the data for, say, //table/tbody/tr[3] only, stores it for all three fields, and then moves on to //table/tbody/tr[4], and so on?

Update


This works correctly; however, I'm now trying to add some validation to the pipelines.py file to drop any records where var1 is more than 100%. I'm certain my code below is wrong. Also, does using "yield" instead of "return" stop the pipeline from being used?

from scrapy.exceptions import DropItem 

class TestbotPipeline(object):
    def process_item(self, item, spider):
        if item('var1') > 100%:
            return item
        else:
            raise Dropitem(item)

Recommended answer


I think this is what you are looking for:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    divs = hxs.select('//tr[@class="someclass"]')
    for div in divs:
        item = TestBotItem()
        item['var1'] = div.select('./td[2]/p/span[2]/text()').extract()
        item['var2'] = div.select('./td[3]/p/span[2]/text()').extract() 
        item['var3'] = div.select('./td[4]/p/text()').extract()

        yield item


You loop over the trs, use relative XPath expressions (./td...), and in each iteration use the yield instruction.
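The relative paths matter because an absolute expression like //table/tbody/tr[*]/td[2]/... searches the whole document no matter which row it is evaluated against, whereas ./td[2] only looks inside the current row. Here is a minimal sketch of that idea using the standard library's xml.etree.ElementTree in place of Scrapy selectors (the sample HTML and cell values are made up for illustration):

```python
import xml.etree.ElementTree as ET

html = """
<table>
  <tr class='someclass'><td>a1</td><td>b1</td></tr>
  <tr class='someclass'><td>a2</td><td>b2</td></tr>
</table>
"""
root = ET.fromstring(html)
rows = root.findall(".//tr[@class='someclass']")

items = []
for row in rows:
    # './td' is evaluated relative to the current row, so each
    # iteration only sees that row's cells, never the whole table.
    items.append([td.text for td in row.findall('./td')])
# items == [['a1', 'b1'], ['a2', 'b2']]
```

With an absolute path inside the loop, every iteration would instead collect the cells of all rows, which is exactly the symptom described in the question.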


You can also append each item to a list and return that list outside of the loop, like this (it's equivalent to the code above):

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    divs = hxs.select('//tr[@class="someclass"]')
    items = []

    for div in divs:

        item = TestBotItem()
        item['var1'] = div.select('./td[2]/p/span[2]/text()').extract()
        item['var2'] = div.select('./td[3]/p/span[2]/text()').extract() 
        item['var3'] = div.select('./td[4]/p/text()').extract()

        items.append(item)

    return items

