Scrapy exporting weird symbols into csv file

Problem description


Ok, so here's the issue. I'm a beginner who has just started to delve into scrapy/python.


I use the code below to scrape a website and save the results into a csv. When I look in the command prompt, it turns words like Officiële into Offici\xeble. In the csv file, it changes it to officiÃ«le. I think this is because it's saving in unicode instead of UTF-8? However, I have no clue how to change my code, and I've been trying all morning so far.


Could anyone help me out here? I'm specifically looking at making sure item["publicatietype"] works properly. How can I encode/decode it? What do I need to write? I tried using replace('ë', 'ë'), but that gives me an error (non-ASCII character, but no encoding declared).
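As for the "non-ASCII character, but no encoding declared" error: in Python 2, a source file that contains non-ASCII literals (such as 'ë') needs an encoding declaration on its first or second line. A minimal sketch of that declaration (it only fixes the source-file error, not the CSV output):

# -*- coding: utf-8 -*-
# Must appear on the first or second line of the spider's .py file.
# With this declaration Python 2 accepts non-ASCII literals such as u'ë'
# in the source code; it does not change how Scrapy encodes the CSV feed.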

import scrapy
from scrapy import Spider
from scrapy.exceptions import DropItem
# ThingsToGather is the project's Item subclass, defined in the project's items.py,
# e.g. from myproject.items import ThingsToGather


class pagespider(Spider):
    name = "OBSpider"
    #max_pages is put here to prevent endless loops; make it as large as you need. It will try to go up to that page
    #even if there's nothing there. A number that is too high will just take far too much time and yield no results
    max_pages = 1

    def start_requests(self):
        for i in range(self.max_pages):
            yield scrapy.Request("https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=%d&sorttype=1&sortorder=4" % (i+1), callback = self.parse)


    def parse(self, response):
        for sel in response.xpath('//div[@class = "lijst"]/ul/li'):
            item = ThingsToGather()
            item["titel"] = ' '.join(sel.xpath('a/text()').extract())
            deeplink = ''.join(["https://zoek.officielebekendmakingen.nl/", ' '.join(sel.xpath('a/@href').extract())])
            request = scrapy.Request(deeplink, callback=self.get_page_info)
            request.meta['item'] = item
            yield request

    def get_page_info(self, response):
        for sel in response.xpath('//*[@id="Inhoud"]'):
            item = response.meta['item']

    #it loads some general info from the header. If this string is less than 5 characters, the site is probably a faulty link (e.g. an error 404). If this is the case, the item is dropped. Otherwise it continues.

            if len(' '.join(sel.xpath('//div[contains(@class, "logo-nummer")]/div[contains(@class, "nummer")]/text()').extract())) < 5:
                raise DropItem()
            else:
                item["filename"] = ' '.join(sel.xpath('//*[@id="downloadPdfHyperLink"]/@href').extract())
                item['publicatiedatum'] = sel.xpath('//span[contains(@property, "http://purl.org/dc/terms/available")]/text()').extract()
                item["publicatietype"] = sel.xpath('//span[contains(@property, "http://purl.org/dc/terms/type")]/text()').extract()
                item["filename"] = ' '.join(sel.xpath('//*[@id="downloadPdfHyperLink"]/@href').extract())
                item = self.__normalise_item(item, response.url)

    #if the string is less than 5 characters, the required data is not on the page and needs to be
    #retrieved from the technical information link. If it is the proper link, the item is complete and is yielded in the else clause.
                if len(item['publicatiedatum']) < 5:
                    tech_inf_link = ''.join(["https://zoek.officielebekendmakingen.nl/", ' '.join(sel.xpath('//*[@id="technischeInfoHyperlink"]/@href').extract())])
                    request = scrapy.Request(tech_inf_link, callback=self.get_date_info)
                    request.meta['item'] = item
                    yield request 
                else:
                    yield item

    def get_date_info (self, response):
        for sel in response.xpath('//*[@id="Inhoud"]'):
            item = response.meta['item']
            item["filename"] = sel.xpath('//span[contains(@property, "http://standaarden.overheid.nl/oep/meta/publicationName")]/text()').extract()
            item['publicatiedatum'] = sel.xpath('//span[contains(@property, "http://purl.org/dc/terms/available")]/text()').extract()
            item['publicatietype'] = sel.xpath('//span[contains(@property, "http://purl.org/dc/terms/type")]/text()').extract()
            item["filename"] = ' '.join(sel.xpath('//*[@id="downloadPdfHyperLink"]/@href').extract())
            item = self.__normalise_item(item, response.url)    
            return item

    # commands below are intended to clean up strings. Everything is sent to __normalise_item to clean unwanted characters (strip) and double spaces (split)

    def __normalise_item(self, item, base_url):
        for key, value in vars(item).values()[0].iteritems():
            item[key] = self.__normalise(item[key])

        item ['titel']= item['titel'].replace(';', '& ')
        return item

    def __normalise(self, value):
        value = value if type(value) is not list else ' '.join(value)
        value = value.strip()
        value = " ".join(value.split())
        return value

Answer:


See the comment by paul trmbrth below. The problem is not Scrapy, it's Excel.


For anyone coming across this question as well, the tl;dr is: import the data in Excel (via the Data menu in the ribbon) and switch the file origin encoding from Windows (ANSI), or whatever it is set to, to Unicode (UTF-8).
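Alternatively, and assuming a Scrapy version that supports the FEED_EXPORT_ENCODING setting (1.1 or later), you can make Excel detect UTF-8 automatically by writing a byte order mark in front of the feed:

# settings.py (assumes Scrapy >= 1.1, which added FEED_EXPORT_ENCODING)
# 'utf-8-sig' writes a UTF-8 BOM at the start of the file, which Excel uses
# to auto-detect the encoding when the CSV is opened by double-clicking.
FEED_EXPORT_ENCODING = 'utf-8-sig'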

Recommended answer


Officiële will be represented as u'Offici\xeble' in Python 2, as seen in the python shell session example below (no need to worry about the \xXX characters, it's just how Python represents non-ASCII Unicode characters)

$ python
Python 2.7.9 (default, Apr  2 2015, 15:33:21) 
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> u'Officiële'
u'Offici\xeble'
>>> u'Offici\u00EBle'
u'Offici\xeble'
>>> 


I think this is because it's saving in unicode instead of UTF-8

UTF-8 is an encoding, Unicode is not.


ë, a.k.a U+00EB, a.k.a LATIN SMALL LETTER E WITH DIAERESIS, will be UTF-8 encoded as 2 bytes, \xc3 and \xab

>>> u'Officiële'.encode('UTF-8')
'Offici\xc3\xable'
>>> 


In the csv file, it changes it to officiÃ«le.


If you see this, it probably means you need to set the input encoding to UTF-8 when opening the CSV file in your program.
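For example, reading the feed from Python with an explicit encoding (a small sketch, using test.csv as a stand-in for your feed file):

import io

# Decoding the file as UTF-8 gives the original text back;
# decoding it as Latin-1/ANSI is what produces the weird symbols.
with io.open('test.csv', encoding='utf-8') as f:
    content = f.read()

print content    # Officiële ... displayed correctly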


The Scrapy CSV exporter will write Python Unicode strings as UTF-8 encoded strings in the output file.


Scrapy selectors will output Unicode strings:

$ scrapy shell "https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4"
2016-03-15 10:44:51 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
(...)
2016-03-15 10:44:52 [scrapy] DEBUG: Crawled (200) <GET https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4> (referer: None)
(...)
In [1]: response.css('div.menu-bmslink > ul > li > a::text').extract()
Out[1]: 
[u'Offici\xeble bekendmakingen vandaag',
 u'Uitleg nieuwe nummering Handelingen vanaf 1 januari 2011',
 u'Uitleg nieuwe\r\n            nummering Staatscourant vanaf 1 juli 2009']

In [2]: for t in response.css('div.menu-bmslink > ul > li > a::text').extract():
   ...:     print t
   ...:     
Officiële bekendmakingen vandaag
Uitleg nieuwe nummering Handelingen vanaf 1 januari 2011
Uitleg nieuwe
            nummering Staatscourant vanaf 1 juli 2009


Let's see what a spider extracting these strings in items will get you as CSV:

$ cat testspider.py
import scrapy


class TestSpider(scrapy.Spider):
    name = 'testspider'
    start_urls = ['https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4']

    def parse(self, response):
        for t in response.css('div.menu-bmslink > ul > li > a::text').extract():
            yield {"link": t}


Run the spider and ask for CSV output:

$ scrapy runspider testspider.py -o test.csv
2016-03-15 11:00:13 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
2016-03-15 11:00:13 [scrapy] INFO: Optional features available: ssl, http11
2016-03-15 11:00:13 [scrapy] INFO: Overridden settings: {'FEED_FORMAT': 'csv', 'FEED_URI': 'test.csv'}
2016-03-15 11:00:14 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2016-03-15 11:00:14 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-03-15 11:00:14 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-03-15 11:00:14 [scrapy] INFO: Enabled item pipelines: 
2016-03-15 11:00:14 [scrapy] INFO: Spider opened
2016-03-15 11:00:14 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-15 11:00:14 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-15 11:00:14 [scrapy] DEBUG: Crawled (200) <GET https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4> (referer: None)
2016-03-15 11:00:14 [scrapy] DEBUG: Scraped from <200 https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4>
{'link': u'Offici\xeble bekendmakingen vandaag'}
2016-03-15 11:00:14 [scrapy] DEBUG: Scraped from <200 https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4>
{'link': u'Uitleg nieuwe nummering Handelingen vanaf 1 januari 2011'}
2016-03-15 11:00:14 [scrapy] DEBUG: Scraped from <200 https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4>
{'link': u'Uitleg nieuwe\r\n            nummering Staatscourant vanaf 1 juli 2009'}
2016-03-15 11:00:14 [scrapy] INFO: Closing spider (finished)
2016-03-15 11:00:14 [scrapy] INFO: Stored csv feed (3 items) in: test.csv
2016-03-15 11:00:14 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 488,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 12018,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 3, 15, 10, 0, 14, 991735),
 'item_scraped_count': 3,
 'log_count/DEBUG': 5,
 'log_count/INFO': 8,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2016, 3, 15, 10, 0, 14, 59471)}
2016-03-15 11:00:14 [scrapy] INFO: Spider closed (finished)


Check content of CSV file:

$ cat test.csv
link
Officiële bekendmakingen vandaag
Uitleg nieuwe nummering Handelingen vanaf 1 januari 2011
"Uitleg nieuwe
            nummering Staatscourant vanaf 1 juli 2009"
$ hexdump -C test.csv 
00000000  6c 69 6e 6b 0d 0a 4f 66  66 69 63 69 c3 ab 6c 65  |link..Offici..le|
00000010  20 62 65 6b 65 6e 64 6d  61 6b 69 6e 67 65 6e 20  | bekendmakingen |
00000020  76 61 6e 64 61 61 67 0d  0a 55 69 74 6c 65 67 20  |vandaag..Uitleg |
00000030  6e 69 65 75 77 65 20 6e  75 6d 6d 65 72 69 6e 67  |nieuwe nummering|
00000040  20 48 61 6e 64 65 6c 69  6e 67 65 6e 20 76 61 6e  | Handelingen van|
00000050  61 66 20 31 20 6a 61 6e  75 61 72 69 20 32 30 31  |af 1 januari 201|
00000060  31 0d 0a 22 55 69 74 6c  65 67 20 6e 69 65 75 77  |1.."Uitleg nieuw|
00000070  65 0d 0a 20 20 20 20 20  20 20 20 20 20 20 20 6e  |e..            n|
00000080  75 6d 6d 65 72 69 6e 67  20 53 74 61 61 74 73 63  |ummering Staatsc|
00000090  6f 75 72 61 6e 74 20 76  61 6e 61 66 20 31 20 6a  |ourant vanaf 1 j|
000000a0  75 6c 69 20 32 30 30 39  22 0d 0a                 |uli 2009"..|
000000ab


You can verify that ë is correctly encoded as the two bytes c3 ab.
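The same check can be made from Python by reading the raw bytes of the feed file (again assuming it is named test.csv):

# Read the feed as raw bytes and confirm that ë was written as the
# two-byte UTF-8 sequence C3 AB.
with open('test.csv', 'rb') as f:
    data = f.read()

print b'\xc3\xab' in data   # True: the file is UTF-8 encoded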


I can see the file data correctly when using LibreOffice, for example, with "Character set" set to Unicode (UTF-8) in the import dialog.


You are probably using Latin-1. When Latin-1 is used instead of UTF-8 as the input encoding setting (in LibreOffice again), the two UTF-8 bytes of ë are displayed as the two characters Ã«, giving officiÃ«le.
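The same effect is easy to reproduce in a Python shell:

>>> utf8_bytes = u'Officiële'.encode('utf-8')   # the bytes Scrapy writes to the CSV
>>> print utf8_bytes.decode('utf-8')            # read back as UTF-8: correct
Officiële
>>> print utf8_bytes.decode('latin-1')          # read back as Latin-1: mojibake
OfficiÃ«le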
