Cleaning data scraped using Scrapy


Problem description

I have recently started using Scrapy and am trying to clean some data I have scraped and want to export to CSV, namely the following three examples:


  • Example 1 – remove certain text

  • Example 2 – remove/replace unwanted characters

  • Example 3 – split comma-separated text

Example 1 data looks as follows:


Text I want,Text I don’t want

Using the following code:

'Scraped 1': response.xpath('//div/div/div/div/h1/span/text()').extract()

Example 2 data looks as follows:

Â – but I want to change this to £

Using the following code:

'Scraped 2': response.xpath('//html/body/div/div/section/div/form/div/div/em/text()').extract()

Example 3 data looks as follows:


Item 1,Item 2,Item 3,Item 4,Item 4,Item5 – ultimately I want to split this into separate columns in a CSV file

Using the following code:

'Scraped 3': response.xpath('//div/div/div/ul/li/p/text()').extract()

I have tried using str.replace(), but can't seem to get it to work, e.g.:

'Scraped 1': response.xpath('//div/div/div/h1/span/text()').extract((str.replace(",Text I don't want",""))

I am looking into this, but would appreciate it if anyone could point me in the right direction!

Code below:

import scrapy
from scrapy.loader import ItemLoader
from tutorial.items import Product


class QuotesSpider(scrapy.Spider):
    name = "quotes_product"
    start_urls = [
        'http://www.unitestudents.com/',
    ]

    # Step 1
    def parse(self, response):
        for city in response.xpath('//select[@id="frm_homeSelect_city"]/option[not(contains(text(),"Select your city"))]/text()').extract(): # Select all cities listed in the select (exclude the "Select your city" option)
            yield scrapy.Request(response.urljoin("/"+city), callback=self.parse_citypage)

    # Step 2
    def parse_citypage(self, response):
        for url in response.xpath('//div[@class="property-header"]/h3/span/a/@href').extract(): #Select for each property the url
            yield scrapy.Request(response.urljoin(url), callback=self.parse_unitpage)


    # Step 3
    def parse_unitpage(self, response):
        for final in response.xpath('//div/div/div[@class="content__btn"]/a/@href').extract(): #Select final page for data scrape
            yield scrapy.Request(response.urljoin(final), callback=self.parse_final)

    #Step 4 
    def parse_final(self, response):
        unitTypes = response.xpath('//html/body/div').extract()
        for unitType in unitTypes: # There can be multiple unit types so we yield an item for each unit type we can find.
            l = ItemLoader(item=Product(), response=response)
            l.add_xpath('area_name', '//div/ul/li/a/span/text()')
            l.add_xpath('type', '//div/div/div/h1/span/text()')
            l.add_xpath('period', '/html/body/div/div/section/div/form/h4/span/text()')
            l.add_xpath('duration_weekly', '//html/body/div/div/section/div/form/div/div/em/text()')
            l.add_xpath('guide_total', '//html/body/div/div/section/div/form/div/div/p/text()')
            l.add_xpath('amenities','//div/div/div/ul/li/p/text()')
            return l.load_item()

However, I'm getting the following error:

value = self.item.fields[field_name].get(key, default)
KeyError: 'type'
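
For context, this KeyError is raised when the ItemLoader tries to populate a field the item class does not declare, so the Product class in tutorial/items.py most likely lacks a 'type' field. A minimal item definition that would satisfy the loader above might look like this (the field names are taken from the add_xpath calls; the class body is an assumption, since tutorial/items.py is not shown):

import scrapy

class Product(scrapy.Item):
    # One Field per name passed to l.add_xpath(...) in parse_final.
    area_name = scrapy.Field()
    type = scrapy.Field()
    period = scrapy.Field()
    duration_weekly = scrapy.Field()
    guide_total = scrapy.Field()
    amenities = scrapy.Field()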


Answer

You have the right idea with str.replace, although I would suggest Python's re regular-expressions library, as it is more powerful. The documentation is top notch and you can find some useful code samples there.
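
A minimal sketch of the regex approach, mirroring Example 1 above (the pattern and sample string are illustrative):

import re

# Drop everything from the first comma onwards, keeping the leading text.
raw = "Text I want,Text I don't want"
cleaned = re.sub(r",.*$", "", raw)
print(cleaned)  # -> Text I want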

I am not familiar with the scrapy library, but it looks like .extract() returns a list of strings. If you want to transform these using str.replace or one of the regex functions, you will need to use a list comprehension:

'Selector 1': [ x.replace('A', 'B') for x in response.xpath('...').extract() ]
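
Applied to the question's own selectors, that pattern might look like this (the replacement strings are assumptions based on the sample data shown above):

# Example 1: strip the unwanted trailing text.
'Scraped 1': [x.replace(",Text I don't want", "") for x in response.xpath('//div/div/div/div/h1/span/text()').extract()],
# Example 2: repair the mis-decoded currency symbol.
'Scraped 2': [x.replace("Â", "£") for x in response.xpath('//html/body/div/div/section/div/form/div/div/em/text()').extract()],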

Edit: Regarding the separate columns – if the data is already comma-separated, just write it directly to a file! If you want to split the comma-separated data to do some transformations, you can use str.split like this:

"A,B,C".split(",") # returns [ "A", "B", "C" ]

In this case, the data returned from .extract() will be a list of comma-separated strings. If you use a list comprehension as above, you will end up with a list-of-lists.
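
For example, using the question's 'Scraped 3' selector (the variable name is illustrative):

# Each extracted string becomes its own list of column values.
amenities = [x.split(",") for x in response.xpath('//div/div/div/ul/li/p/text()').extract()]
# e.g. ["Item 1,Item 2", "Item 3,Item 4"] -> [["Item 1", "Item 2"], ["Item 3", "Item 4"]]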

If you want something more sophisticated than splitting on each comma, you can use Python's csv library.
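
For instance, a minimal sketch of writing the split values out as separate CSV columns (the filename and sample rows are illustrative):

import csv

rows = ["Item 1,Item 2,Item 3", "Item 4,Item 5"]
with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for row in rows:
        # str.split turns each comma-separated string into a list,
        # and csv.writer puts each element in its own column.
        writer.writerow(row.split(","))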
