Python - problem with accented chars when scraping data from website

Views: 159 | Published: 2016/8/5 19:17:33 | Tags: python unicode beautifulsoup web-scraping diacritics


Problem description

I'm Nicola, a new Python user without a real background in computer programming, so I'd really appreciate some help with a problem I have. I wrote some code to scrape data from this webpage:

http://finanzalocale.interno.it/sitophp/showQuadro.php?codice=2080500230&tipo=CO&descr_ente=MODENA&anno=2009&cod_modello=CCOU&sigla=MO&tipo_cert=C&isEuro=0&quadro=02

Basically, the goal of my code is to scrape the data from all the tables on the page and write them to a text file. Here is my code:

#!/usr/bin/env python


from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import urllib2, os


def extract(soup):
    table = soup.findAll("table")[1]
    for row in table.findAll('tr')[1:19]:
        col = row.findAll('td')
        voce = col[0].string
        accertamento = col[1].string
        competenza = col[2].string
        residui = col[3].string
        record = (voce, accertamento, competenza, residui)
        print >> outfile, "|".join(record)

    table = soup.findAll("table")[2]
    for row in table.findAll('tr')[1:21]:
        col = row.findAll('td')
        voce = col[0].string
        accertamento = col[1].string
        competenza = col[2].string
        residui = col[3].string
        record = (voce, accertamento, competenza, residui)
        print >> outfile, "|".join(record)

    table = soup.findAll("table")[3]
    for row in table.findAll('tr')[1:44]:
        col = row.findAll('td')
        voce = col[0].string
        accertamento = col[1].string
        competenza = col[2].string
        residui = col[3].string
        record = (voce, accertamento, competenza, residui)
        print >> outfile, "|".join(record)

    table = soup.findAll("table")[4]
    for row in table.findAll('tr')[1:18]:
        col = row.findAll('td')
        voce = col[0].string
        accertamento = col[1].string
        competenza = col[2].string
        residui = col[3].string
        record = (voce, accertamento, competenza, residui)
        print >> outfile, "|".join(record)

    table = soup.findAll("table")[5]
    for row in table.findAll('tr')[1:]:
        col = row.findAll('td')
        voce = col[0].string
        accertamento = col[1].string
        competenza = col[2].string
        residui = col[3].string
        record = (voce, accertamento, competenza, residui)
        print >> outfile, "|".join(record)

    table = soup.findAll("table")[6]
    for row in table.findAll('tr')[1:]:
        col = row.findAll('td')
        voce = col[0].string
        accertamento = col[1].string
        competenza = col[2].string
        residui = col[3].string
        record = (voce, accertamento, competenza, residui)
        print >> outfile, "|".join(record)


outfile = open("modena_quadro02.txt", "w")
br = Browser()
br.set_handle_robots(False)
url = "http://finanzalocale.interno.it/sitophp/showQuadro.php?codice=2080500230&tipo=CO&descr_ente=MODENA&anno=2009&cod_modello=CCOU&sigla=MO&tipo_cert=C&isEuro=0&quadro=02"
page1 = br.open(url)
html1 = page1.read()
soup1 = BeautifulSoup(html1)
extract(soup1)
outfile.close()

Everything would work fine, except that the first column of some tables on that page contains words with accented characters. When I run the code, I get the following:

Traceback (most recent call last):
  File "modena2.py", line 158, in <module>
    extract(soup1)
  File "modena2.py", line 98, in extract
    print >> outfile, "|".join(record)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 32: ordinal not in range(128)

I know that the problem is with the encoding of the accented characters. I tried to find a solution, but it really goes beyond my knowledge. Thanks in advance to everybody who is going to help me; I really appreciate it! And sorry if the question is too basic but, as I said, I'm just getting started with Python and I'm learning everything by myself.

Thanks! Nicola
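
The traceback is consistent with how Python 2 handles this case: BeautifulSoup returns cell text as unicode strings, and printing a unicode string to a file opened with plain open() makes Python encode it implicitly with the ASCII codec, which cannot represent a character such as u'\xe0' (à). A minimal sketch of one common way around this is to open the output file with an explicit encoding and write unicode text to it. The helper name write_records, the sample row, and the choice of UTF-8 for the output file are assumptions made for illustration; they are not taken from the question, and the page's own encoding has not been checked here.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def write_records(path, records, encoding="utf-8"):
    """Write each record as one pipe-delimited line using an explicit encoding.

    Hypothetical helper for illustration; UTF-8 is an assumed output encoding.
    """
    # io.open gives a text-mode file that encodes unicode on write,
    # in both Python 2 and Python 3.
    with io.open(path, "w", encoding=encoding) as outfile:
        for record in records:
            outfile.write(u"|".join(record) + u"\n")


# Made-up sample row containing the character from the traceback (u'\xe0').
record = (u"Addizionale comunale, citt\xe0", u"1.000", u"900", u"100")
line = u"|".join(record)

# Reproduce the underlying failure: ASCII cannot represent the accented char.
try:
    line.encode("ascii")
except UnicodeEncodeError as exc:
    print("ASCII encoding failed: %s" % exc.reason)

# Writing through a file with an explicit encoding succeeds.
demo_path = os.path.join(tempfile.mkdtemp(), "modena_quadro02.txt")
write_records(demo_path, [record])

with io.open(demo_path, encoding="utf-8") as infile:
    print(infile.read() == line + u"\n")
```

In the original script, the equivalent change would be to open outfile with io.open and an encoding, and to replace each print >> outfile statement with a write of the joined unicode string plus a newline.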
