使用带有重音和不同字符的美丽汤 [英] Using Beautiful Soup with accents and different characters

查看:36
本文介绍了使用带有重音和不同字符的美丽汤的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用《美丽的汤》来吸引过去奥运会的奖牌获得者.在某些比赛和运动员的名字中,口音的使用是绊脚石.我在网上看到过类似的问题,但是我是Python的新手,很难将其应用于我的代码.

I'm using Beautiful Soup to pull medal winners from past Olympics. It's tripping over the use of accents in some of the events and athlete names. I've seen similar problems posted online but I'm new to Python and having trouble applying them to my code.

如果我打印汤,口音看起来很好.但是当我开始解析汤(并将其写入CSV文件)时,带重音的字符会出现乱码."LouisPerrée"成为"LouisPerr√©e"

If I print my soup, the accents appear fine. but when I start parsing the soup (and write it to a CSV file) the accented characters become garbled. 'Louis Perrée' becomes 'Louis Perr√©e'

from BeautifulSoup import BeautifulSoup
import urllib2

response = urllib2.urlopen('http://www.databaseolympics.com/sport/sportevent.htm?sp=FEN&enum=130')
html = response.read()
soup = BeautifulSoup(html)

g = open('fencing_medalists.csv','w"')
t = soup.findAll("table", {'class' : 'pt8'})

for table in t:
    rows = table.findAll('tr')
    for tr in rows:
        cols = tr.findAll('td')
        for td in cols:
            theText=str(td.find(text=True))
            #theText=str(td.find(text=True)).encode("utf-8")
            if theText!="None":
                g.write(theText)
            else: 
                g.write("")
            g.write(",")
        g.write("\n")

非常感谢您的帮助.

推荐答案

如果要处理unicode,请始终将从磁盘或网络读取的响应视为字节包而不是字符串.

If you're dealing with unicode, always treat the response read from disk or network as bag of bytes instead of string.

您的CSV文件中的文本可能是utf-8编码的,应首先对其进行解码.

The text in your CSV file is probably utf-8 encoded, which should be decoded first.

import codecs
# ...
content = response.read()
html = codecs.decode(content, 'utf-8')

此外,您还需要在将unicode文本编码为utf-8之前将其写入输出文件.使用 codecs.open 打开输出文件,并指定编码.它将透明地为您处理输出编码.

Also you need to encode your unicode text to utf-8 before writing it to output file. Use codecs.open to open the output file, specifying encoding. It will transparently handle output encoding for you.

g = codecs.open('fencing_medalists.csv', 'wb', encoding='utf-8')

并对字符串编写代码进行以下更改:

and make the following changes to string writing code:

    theText = td.find(text=True)
    if theText is not None:
        g.write(unicode(theText))

BeautifulSoup可能会进行自动unicode解码,因此您可以跳过 codecs.decode 响应.

BeautifulSoup probably does automatic unicode decoding, so you could skip the codecs.decode on response.

这篇关于使用带有重音和不同字符的美丽汤的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆