Creating loop to parse table data in scrapy/python
Question
I have a python script using scrapy, which scrapes the data from a website, allocates it to 3 fields and then generates a .csv. It works OK but with one major problem: every field contains all of the data, rather than the data being separated out per table row. I'm sure this is due to my loop not working; when it finds the xpath it just grabs all the data for every row before moving on to get data for the other 2 fields, instead of creating separate rows.
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    divs = hxs.select('//tr[@class="someclass"]')
    for div in divs:
        item = TestBotItem()
        item['var1'] = div.select('//table/tbody/tr[*]/td[2]/p/span[2]/text()').extract()
        item['var2'] = div.select('//table/tbody/tr[*]/td[3]/p/span[2]/text()').extract()
        item['var3'] = div.select('//table/tbody/tr[*]/td[4]/p/text()').extract()
    return item
The tr with the * increases in number with each entry on the website I need to crawl, and the other two paths slot in below. How do I edit this so it grabs the first set of data for, say, //table/tbody/tr[3] only, stores it for all three fields, and then moves on to //table/tbody/tr[4], etc.?
Update
This works correctly. However, I'm trying to add some validation to the pipelines.py file to drop any records where var1 is more than 100%. I'm certain my code below is wrong. Also, does using "yield" instead of "return" stop the pipeline being used?
from scrapy.exceptions import DropItem

class TestbotPipeline(object):
    def process_item(self, item, spider):
        if item('var1') > 100%:
            return item
        else:
            raise Dropitem(item)
Answer
I think this is what you are looking for:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    divs = hxs.select('//tr[@class="someclass"]')
    for div in divs:
        item = TestBotItem()
        item['var1'] = div.select('./td[2]/p/span[2]/text()').extract()
        item['var2'] = div.select('./td[3]/p/span[2]/text()').extract()
        item['var3'] = div.select('./td[4]/p/text()').extract()
        yield item
You loop over the trs and then use relative XPath expressions (./td...), and in each iteration you use the yield instruction. The key point is that an XPath expression starting with // always searches from the document root regardless of the context node, which is why each field was collecting every row's data; a path starting with ./ is evaluated relative to the current row.
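The root-relative vs row-relative distinction can be demonstrated without scrapy, using the standard library's ElementTree on a tiny stand-in table (the markup and cell values below are hypothetical, not from the real site):

```python
import xml.etree.ElementTree as ET

# Minimal two-row stand-in for the scraped page (hypothetical data).
TABLE = """
<table>
  <tr class="someclass"><td>r1c1</td><td>r1c2</td></tr>
  <tr class="someclass"><td>r2c1</td><td>r2c2</td></tr>
</table>
"""

root = ET.fromstring(TABLE)
rows = root.findall('.//tr')

for row in rows:
    # Searching from the document root (like the //table/... paths in the
    # question) returns every row's second cell on every iteration:
    from_root = [td.text for td in root.findall('.//tr/td[2]')]
    # Searching relative to the current row returns only that row's cell:
    from_row = row.find('./td[2]').text
    print(from_root, '->', from_row)
```

This prints `['r1c2', 'r2c2']` for the root-scoped search on both iterations, while the row-scoped search yields `r1c2` and then `r2c2`, one per row, which is the behavior the question is after.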
You can also append each item to a list and return that list outside of the loop, like this (it's equivalent to the code above):
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    divs = hxs.select('//tr[@class="someclass"]')
    items = []
    for div in divs:
        item = TestBotItem()
        item['var1'] = div.select('./td[2]/p/span[2]/text()').extract()
        item['var2'] = div.select('./td[3]/p/span[2]/text()').extract()
        item['var3'] = div.select('./td[4]/p/text()').extract()
        items.append(item)
    return items