How to return plain text from Beautiful Soup instead of unicode


Question


I am using BeautifulSoup4 to scrape this web page; however, I'm getting weird Unicode text back from BeautifulSoup.

Here is my code:

    site = "http://en.wikipedia.org/wiki/"+a+"_"+str(b)
    hdr = {'User-Agent': 'Mozilla/5.0'}
    req = urllib2.Request(site,headers=hdr)  
    req.add_header('Accept-encoding', 'gzip') #Header to check for gzip
    page = urllib2.urlopen(req)
    if page.info().get('Content-Encoding') == 'gzip': #IF checks gzip
        data = page.read()
        data = StringIO.StringIO(data)
        gzipper = gzip.GzipFile(fileobj=data)
        html = gzipper.read()
        soup = BeautifulSoup(html, fromEncoding='gbk')
    else:
        soup = BeautifulSoup(page)

    section = soup.find('span', id='Events').parent
    events = section.find_next('ul').find_all('li')
    print soup.originalEncoding
    for x in events:
        print x

Basically I want x to be in plain English. I get, instead, things that look like this:

<li><a href="/wiki/153_BC" title="153 BC">153 BC</a> â€" <a href="/wiki/Roman_consul" title="Roman consul">Roman consuls</a> begin their year in office.</li>

There's only one example in this particular string, but you get the idea.

Related: I go on to cut up this string with some regex and other string-cutting methods. Should I switch it to plain text before or after I cut it up? I'm assuming it doesn't matter, but seeing as I'm deferring to SO anyway, I thought I'd ask.

If anyone knows how to fix this, I'd appreciate it. Thanks.

EDIT: Thanks, J.F., for the tip. I now use this for loop:

    for x in events:
        x = x.encode('ascii')
        x = str(x)
        #Find Content
        regex2 = re.compile(">[^>]*<")
        textList = re.findall(regex2, x)
        text = "".join(textList)
        text = text.replace(">", "")
        text = text.replace("<", "")
        contents.append(text)

However, I still get things like this:

2013 &#8211; At least 60 people are killed and 200 injured in a stampede after celebrations at F&#233;lix Houphou&#235;t-Boigny Stadium in Abidjan, Ivory Coast.
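The leftover `&#8211;`-style tokens are HTML numeric character references, which the regex above does not decode. As a side note, Python's standard-library HTML parser can strip tags and decode those references in one pass; a minimal Python 3 sketch (the original question is Python 2, so this is an illustration of the idea, not the asker's code):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text content of markup, decoding character references."""
    def __init__(self):
        # convert_charrefs=True makes the parser turn &#233; into 'é' before
        # handle_data() is called, so no manual unescaping is needed.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

extractor = TextExtractor()
extractor.feed('2013 &#8211; celebrations at F&#233;lix Houphou&#235;t-Boigny Stadium')
print("".join(extractor.parts))
```

Unlike the regex approach, this handles nested tags and entity references uniformly.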

EDIT: Here is how I make my Excel spreadsheet (CSV) and send in my list:

    rows = zip(days, contents)
    with open("events.csv", "wb") as f:
        writer = csv.writer(f)
        for row in rows:
            writer.writerow(row)

So the CSV file is created during the program, and everything is imported after the lists are generated. I just need it to be readable text at that point.

Solution

There is no such thing as plain text. What you see are bytes being interpreted as text using the wrong character encoding: either the encoding of the strings differs from the one your terminal uses, or the error was introduced earlier by decoding the web page itself with the wrong encoding (for example, forcing `fromEncoding='gbk'` on a UTF-8 page).
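The `â€"` garbage in the question is a classic instance of this: the UTF-8 bytes of an en dash read back under a single-byte codec. A minimal Python 3 sketch (cp1252 as the terminal's codec is an assumption; latin-1 behaves similarly):

```python
# U+2013 EN DASH as Wikipedia serves it, i.e. encoded as UTF-8
dash_bytes = "\u2013".encode("utf-8")   # three bytes: b'\xe2\x80\x93'

# The same three bytes misread one-at-a-time as cp1252 (a common
# Windows console default) come out as three unrelated characters.
mojibake = dash_bytes.decode("cp1252")
print(mojibake)  # 'â', '€', and a curly quote
```

Same bytes, two decodings: the text was never "wrong", only the codec used to display it.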

print x calls str(x), which returns a UTF-8 encoded byte string for BeautifulSoup objects.

Try:

print unicode(x)

Or:

print x.encode('ascii')
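In Python 3 terms (the question is Python 2), `unicode(x)` corresponds to working with `str`, and the UTF-8 output of `str(x)` corresponds to `bytes`; decoding the bytes with the codec they were actually encoded in recovers the text. A sketch with a hard-coded snippet standing in for BeautifulSoup's output:

```python
# What Python 2's str(x) effectively produced: UTF-8 encoded bytes
raw = "153 BC \u2013 Roman consuls begin their year in office.".encode("utf-8")

# What Python 2's unicode(x) gave you: real text, obtained by decoding
# with the encoding the bytes were written in (UTF-8 for Wikipedia).
text = raw.decode("utf-8")
print(text)
```

Decoding with the right codec, rather than stripping "strange" characters, is the general fix.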
