How To Remove White Space in Scrapy Spider Data


Problem Description

I am writing my first spider in Scrapy and attempting to follow the documentation. I have implemented ItemLoaders. The spider extracts the data, but the data contains many line returns. I have tried many ways to remove them, but nothing seems to work. The replace_escape_chars utility is supposed to work, but I can't figure out how to use it with the ItemLoader. Some people also use unicode.strip, but again, I can't seem to get it to work. Some try to use these in items.py and others in the spider. How can I clean the data of these line returns (\r\n)? My items.py file only contains the item names and Field(). The spider code is below:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.loader import XPathItemLoader
from scrapy.utils.markup import replace_escape_chars
from ccpstore.items import Greenhouse

class GreenhouseSpider(BaseSpider):
    name = "greenhouse"
    allowed_domains = ["domain.com"]
    start_urls = [
        "http://www.domain.com",
    ]

    def parse(self, response):
        items = []
        l = XPathItemLoader(item=Greenhouse(), response=response)
        l.add_xpath('name', '//div[@class="product_name"]')
        l.add_xpath('title', '//h1')
        l.add_xpath('usage', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl00_liItem"]')
        l.add_xpath('repeat', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl02_liItem"]')
        l.add_xpath('direction', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl03_liItem"]')
        items.append(l.load_item())

        return items

Answer

It turns out that there were also many blank spaces in the data, so combining Steven's answer with some more research allowed all tags, line returns, and duplicate spaces to be removed. The working code is below. Note the addition of text() on the loader lines, which removes the tags, and the split and join processors, which remove the spaces and line returns.

from scrapy.contrib.loader.processor import Join, MapCompose

    def parse(self, response):
        items = []
        l = XPathItemLoader(item=Greenhouse(), response=response)
        # Split each extracted string on whitespace (dropping \r\n and runs
        # of spaces), strip escape characters, then rejoin with single spaces.
        l.default_input_processor = MapCompose(lambda v: v.split(), replace_escape_chars)
        l.default_output_processor = Join()
        l.add_xpath('title', '//h1/text()')
        l.add_xpath('usage', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl00_liItem"]/text()')
        l.add_xpath('repeat', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl02_liItem"]/text()')
        l.add_xpath('direction', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl03_liItem"]/text()')
        items.append(l.load_item())
        return items

