Python - problem with accented chars when scraping data from website


Question


I'm Nicola, a new Python user without a real background in computer programming, so I'd really appreciate some help with a problem I have. I wrote some code to scrape data from this webpage:

http://finanzalocale.interno.it/sitophp/showQuadro.php?codice=2080500230&tipo=CO&descr_ente=MODENA&anno=2009&cod_modello=CCOU&sigla=MO&tipo_cert=C&isEuro=0&quadro=02


Basically, the goal of my code is to scrape the data from all the tables in the page and write them to a txt file. Here is my code:

#!/usr/bin/env python


from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import urllib2, os


def extract(soup):
    table = soup.findAll("table")[1]
    for row in table.findAll('tr')[1:19]:
        col = row.findAll('td')
        voce = col[0].string
        accertamento = col[1].string
        competenza = col[2].string
        residui = col[3].string
        record = (voce, accertamento, competenza, residui)
        print >> outfile, "|".join(record)

    table = soup.findAll("table")[2]
    for row in table.findAll('tr')[1:21]:
        col = row.findAll('td')
        voce = col[0].string
        accertamento = col[1].string
        competenza = col[2].string
        residui = col[3].string
        record = (voce, accertamento, competenza, residui)
        print >> outfile, "|".join(record)

    table = soup.findAll("table")[3]
    for row in table.findAll('tr')[1:44]:
        col = row.findAll('td')
        voce = col[0].string
        accertamento = col[1].string
        competenza = col[2].string
        residui = col[3].string
        record = (voce, accertamento, competenza, residui)
        print >> outfile, "|".join(record)

    table = soup.findAll("table")[4]
    for row in table.findAll('tr')[1:18]:
        col = row.findAll('td')
        voce = col[0].string
        accertamento = col[1].string
        competenza = col[2].string
        residui = col[3].string
        record = (voce, accertamento, competenza, residui)
        print >> outfile, "|".join(record)

    table = soup.findAll("table")[5]
    for row in table.findAll('tr')[1:]:
        col = row.findAll('td')
        voce = col[0].string
        accertamento = col[1].string
        competenza = col[2].string
        residui = col[3].string
        record = (voce, accertamento, competenza, residui)
        print >> outfile, "|".join(record)

    table = soup.findAll("table")[6]
    for row in table.findAll('tr')[1:]:
        col = row.findAll('td')
        voce = col[0].string
        accertamento = col[1].string
        competenza = col[2].string
        residui = col[3].string
        record = (voce, accertamento, competenza, residui)
        print >> outfile, "|".join(record)


outfile = open("modena_quadro02.txt", "w")
br = Browser()
br.set_handle_robots(False)
url = "http://finanzalocale.interno.it/sitophp/showQuadro.php?codice=2080500230&tipo=CO&descr_ente=MODENA&anno=2009&cod_modello=CCOU&sigla=MO&tipo_cert=C&isEuro=0&quadro=02"
page1 = br.open(url)
html1 = page1.read()
soup1 = BeautifulSoup(html1)
extract(soup1)
outfile.close()


Everything would work fine, but the first column of some tables in that page contains words with accented characters. When I run the code, I get the following:

Traceback (most recent call last):
File "modena2.py", line 158, in <module>
  extract(soup1)
File "modena2.py", line 98, in extract
  print >> outfile, "|".join(record)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 32: ordinal not in range(128)


I know that the problem is with the encoding of the accented characters. I tried to find a solution to this, but it really goes beyond my knowledge. I want to thank in advance everybody who is going to help me. I really appreciate it! And sorry if the question is too basic, but, as I said, I'm just getting started with Python and I'm learning everything by myself.


Thanks! Nicola

Recommended answer


The issue is with printing Unicode text to a binary file:

>>> print >>open('e0.txt', 'wb'), u'\xe0'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 0: ordinal not in range(128)
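
In Python 2, a file opened with open() has no encoding of its own, so Unicode text written to it is implicitly encoded with the default ASCII codec. The sketch below makes that implicit step explicit; the helper name try_ascii is hypothetical and exists only for this illustration.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def try_ascii(text):
    """Return the ASCII-encoded bytes, or None if the text cannot be encoded."""
    try:
        return text.encode('ascii')
    except UnicodeEncodeError:
        # Same exception type as in the traceback above.
        return None


# Plain ASCII text encodes without trouble.
print(try_ascii(u'abc') is not None)
# u'\xe0' (a with grave accent) is outside the ASCII range, so encoding fails.
print(try_ascii(u'\xe0') is None)
```

Any row whose first column contains an accented character hits this failure, which matches the asker's observation that only some tables trigger the error.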


To fix it, either encode the Unicode text into bytes (u'\xe0'.encode('utf-8')) or open the file in text mode:

#!/usr/bin/env python
from __future__ import print_function
import io

with io.open('e0.utf8.txt', 'w', encoding='utf-8') as file:
    print(u'\xe0', file=file)
