scrapy xpath selector repeats data


Problem Description

I am trying to extract the business name and address from each listing and export them to a CSV, but I am having problems with the output. I think bizs = hxs.select("//div[@class='listing_content']") may be causing the problems.

yp_spider.py

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from yp.items import Biz

class MySpider(BaseSpider):
    name = "ypages"
    allowed_domains = ["yellowpages.com"]
    start_urls = ["http://www.yellowpages.com/sanfrancisco/restaraunts"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        bizs = hxs.select("//div[@class='listing_content']")
        items = []

        for biz in bizs:
            item = Biz()
            item['name'] = biz.select("//h3/a/text()").extract()
            item['address'] = biz.select("//span[@class='street-address']/text()").extract()
            print item
            items.append(item)

items.py

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/topics/items.html

from scrapy.item import Item, Field

class Biz(Item):
    name = Field()
    address = Field()

    def __str__(self):
        return "Website: name=%s address=%s" %  (self.get('name'), self.get('address'))

The output from 'scrapy crawl ypages -o list.csv -t csv' is a long list of business names followed by locations, and it repeats the same data several times.

Solution

You should add a "." to make the XPath relative. Here is the relevant passage from the Scrapy documentation (http://doc.scrapy.org/en/0.16/topics/selectors.html):

At first, you may be tempted to use the following approach, which is wrong, as it actually extracts all <p> elements from the document, not only those inside <div> elements:

>>> for p in divs.select('//p'):  # this is wrong - gets all <p> from the whole document
...     print p.extract()

This is the proper way to do it (note the dot prefixing the .//p XPath):

>>> for p in divs.select('.//p'):  # extracts all <p> inside
...     print p.extract()
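The difference is easy to reproduce outside Scrapy. Here is a minimal sketch using only the standard library (xml.etree.ElementTree supports a subset of XPath; the markup below is a hypothetical stand-in for the Yellow Pages listing structure, not the real page):

```python
import xml.etree.ElementTree as ET

# Hypothetical markup mimicking the listing structure from the question.
html = """
<html><body>
  <div class="listing_content"><h3><a>Cafe One</a></h3></div>
  <div class="listing_content"><h3><a>Cafe Two</a></h3></div>
</body></html>
"""
root = ET.fromstring(html)
listings = root.findall(".//div[@class='listing_content']")

# Wrong: searching from the document root inside the loop is what an
# absolute XPath like //h3/a does in Scrapy - every listing sees every name.
wrong = [[a.text for a in root.findall(".//h3/a")] for _ in listings]

# Right: searching relative to each listing (the .//h3/a of the answer)
# yields exactly one name per listing.
right = [[a.text for a in biz.findall(".//h3/a")] for biz in listings]

print(wrong)  # [['Cafe One', 'Cafe Two'], ['Cafe One', 'Cafe Two']]
print(right)  # [['Cafe One'], ['Cafe Two']]
```

Applied to the spider above, this means writing biz.select(".//h3/a/text()") and biz.select(".//span[@class='street-address']/text()") so each item only picks up text from its own listing.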
