Use soup.get_text() with UTF-8

Question

I need to get all the text from a page using BeautifulSoup. The BeautifulSoup documentation says that soup.get_text() does this. When I tried it on reddit.com, I got this error:


UnicodeEncodeError in soup.py:16
  'cp932' codec can't encode character u'\xa0' in position 2262: illegal multibyte sequence

I get similar errors on most of the sites I checked.
I also got similar errors from soup.prettify(), but I fixed those by changing the call to soup.prettify('UTF-8'). Is there a way to fix this for get_text() as well? Thanks in advance!

Update June 24
I've found some code that seems to work for other people, but I still need to use UTF-8 instead of the default encoding. Code:


import re

texts = soup.findAll(text=True)

def visible(element):
    # Skip text nodes that belong to elements the browser does not render.
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    elif re.match('', str(element)):
        return False
    elif re.match('\n', str(element)):
        return False
    return True

visible_texts = filter(visible, texts)

print visible_texts

The error is different, though. Progress?


UnicodeEncodeError in soup.py:29
  'ascii' codec can't encode character u'\xbb' in position 1: ordinal not in range(128)

Recommended answer

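soup.get_text() returns a Unicode string, and that is why you're getting the error: when the string is printed, Python has to encode it for the console, and the console's codec (cp932 in your traceback) cannot represent characters such as u'\xa0'.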
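The failure can be reproduced without BeautifulSoup or a network connection. The sketch below assumes Python 3 and uses a made-up sample string containing the non-breaking space from the traceback; the helper name try_encode is hypothetical and only there for illustration.

```python
# Sample text containing U+00A0 (non-breaking space), the character named
# in the first traceback. The string itself is invented for this example.
text = u"reddit\xa0front page"


def try_encode(s, codec):
    """Return the encoded bytes, or a short message if the codec cannot represent s."""
    try:
        return s.encode(codec)
    except UnicodeEncodeError as exc:
        return "failed: %s" % exc.reason


# cp932 has no mapping for U+00A0, so encoding fails.
cp932_result = try_encode(text, "cp932")
# UTF-8 can represent every Unicode character, so encoding succeeds.
utf8_result = try_encode(text, "utf-8")

print(cp932_result)
print(utf8_result)
```

The same Unicode string encodes cleanly to UTF-8 but not to cp932, which suggests the problem lies in the output encoding rather than in get_text() itself.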

You can solve this in a number of ways, including setting the encoding at the shell level:

export PYTHONIOENCODING=UTF-8
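To see what this variable changes, one option is to start a child interpreter with PYTHONIOENCODING set and inspect the bytes it writes. This is a minimal sketch assuming Python 3.5 or later; the child script and the variable names are made up for the illustration.

```python
import os
import subprocess
import sys

# Code run by the child interpreter: report the stdout encoding, then print
# the two characters from the tracebacks (U+00A0 and U+00BB).
child_code = "import sys; print(sys.stdout.encoding); print(u'\\xa0\\xbb')"

# Copy the current environment and add the variable, as `export` would in a shell.
env = dict(os.environ, PYTHONIOENCODING="UTF-8")

result = subprocess.run(
    [sys.executable, "-c", child_code],
    env=env,
    stdout=subprocess.PIPE,
    check=True,
)

# First line: the encoding name the child reports. Second line: the raw bytes it wrote.
lines = result.stdout.splitlines()
print(lines)
```

With the variable set, the child's stdout encodes text as UTF-8 regardless of the console's default codec, so the two characters arrive as the byte pairs C2 A0 and C2 BB.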

You can also reload sys and set the default encoding by including this in your script. This works only in Python 2; Python 3 has no sys.setdefaultencoding().

import sys

if __name__ == "__main__":
    reload(sys)  # Python 2 only: restores sys.setdefaultencoding, which site.py removes at startup
    sys.setdefaultencoding("utf-8")
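On Python 3, a comparable effect can be had by wrapping the output stream in a UTF-8 text writer. The sketch below is an assumption about how one might do this, not part of the original answer; utf8_writer is a hypothetical helper, and an in-memory io.BytesIO stands in for sys.stdout.buffer so the result can be inspected.

```python
import io


def utf8_writer(binary_stream):
    """Wrap a binary stream so that text written to it is encoded as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8", newline="\n")


# In-memory stand-in for sys.stdout.buffer.
buffer = io.BytesIO()
out = utf8_writer(buffer)

# Characters from the tracebacks: U+00BB and U+00A0.
print(u"\xbb visible text \xa0", file=out)
out.flush()

written = buffer.getvalue()
print(written)
```

In a real script the wrapper would be applied as sys.stdout = utf8_writer(sys.stdout.buffer); on Python 3.7 and later, sys.stdout.reconfigure(encoding="utf-8") is a shorter route to the same result.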

Or you can encode the string as UTF-8 in code. For your reddit problem, something like the following would work:

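import urllib
from bs4 import BeautifulSoup

url = "https://www.reddit.com/r/python"
html = urllib.urlopen(url).read()  # Python 2 API
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly

# get text
text = soup.get_text()

# encode the Unicode string to UTF-8 bytes before printing
print(text.encode('utf-8'))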
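For readers on Python 3, the same idea can be sketched without a network request. The example below is an illustration under stated assumptions: it requires the third-party bs4 package, parses a small invented HTML snippet instead of fetching reddit, and passes from_encoding because the sample bytes are known to be UTF-8.

```python
from bs4 import BeautifulSoup

# Invented sample page, given as UTF-8 bytes. It contains U+00E9, U+00A0
# and U+00BB, characters that cp932 or ASCII consoles may fail to encode.
html = b"<html><body><p>caf\xc3\xa9\xc2\xa0\xc2\xbb menu</p></body></html>"

# from_encoding tells BeautifulSoup how to decode the bytes; html.parser is
# the parser bundled with the Python standard library.
soup = BeautifulSoup(html, "html.parser", from_encoding="utf-8")

# get_text() returns a Unicode string (str in Python 3).
text = soup.get_text()

# Encoding explicitly yields bytes that no longer depend on the console codec.
encoded = text.encode("utf-8")
print(encoded)
```

In Python 3, printing bytes shows their repr, so to emit the raw UTF-8 you would write encoded to sys.stdout.buffer or to a file opened in binary mode.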
