How to identify the character encoding of a website?

Question

What I'm trying to do: I fetch a list of URIs from a database, download each page, remove the stopwords, count how often each word appears in the page, and then try to save the result in MongoDB.

The problem: when I try to save the result in the database, I get the error bson.errors.InvalidDocument: the document must be a valid utf-8

It appears to be related to byte sequences such as '\xc3someotherstrangewords' and '\xe2something'. When I process the web pages I try to remove the punctuation, but I can't remove the accents, because that would leave me with the wrong words.

What I have already tried:

I tried to identify the character encoding from the header of the web page.

I tried using chardet.

I tried using re.compile(r"[^a-zA-Z]") and/or unicode(variable, 'ascii', 'ignore'). That isn't good for non-English languages, because both approaches remove the accented characters.

What I want to know: does anyone know how to identify the characters and convert them to the right word/encoding? For example, take '\xe2' from a web page and convert it to 'â'.

(English isn't my first language, so please forgive any mistakes.) The source code is available if anyone wants to see it.

Recommended answer

Finding the correct character encoding of a website is not easy, because the information in the header might be wrong. BeautifulSoup does a pretty good job of guessing the character encoding and automatically decodes the page to Unicode.

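# Python 3 version; the original Python 2 call was urllib.urlopen
from urllib.request import urlopen

from bs4 import BeautifulSoup

url = 'http://www.google.de'
with urlopen(url) as fh:
    html = fh.read()  # raw bytes, encoding not yet known

# BeautifulSoup guesses the encoding of the bytes and decodes them to Unicode
soup = BeautifulSoup(html, 'html.parser')

# text is a Unicode string
text = soup.body.get_text()
# encoded_text is a UTF-8 byte string; on Python 3, pymongo also accepts text directly
encoded_text = text.encode('utf-8')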
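
If BeautifulSoup is not available, the same idea can be approximated with the standard library alone. The sketch below is not what BeautifulSoup does internally. It is a simple fallback chain under stated assumptions: try strict UTF-8 first, then the charset declared in the header (if any), then Windows-1252, and finally Latin-1, which accepts every byte. The function name decode_bytes and its declared parameter are hypothetical, chosen for illustration.

```python
def decode_bytes(raw, declared=None):
    """Decode raw page bytes to str and return (text, encoding_used).

    `declared` is the charset taken from the HTTP header, if there is one.
    It is only a hint: it is used when the bytes are not valid UTF-8 and
    decode cleanly with it.
    """
    candidates = ["utf-8"]
    if declared:
        candidates.append(declared)
    candidates.append("windows-1252")

    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: fall through to the next candidate.
            continue

    # Last resort: Latin-1 maps every byte to a character, so it never fails.
    return raw.decode("latin-1"), "latin-1"


# b'\xe2' alone is not valid UTF-8, but it is 'â' in Windows-1252.
text, used = decode_bytes(b"\xe2")
print(text, used)

# The same letter encoded as UTF-8 is the two bytes \xc3\xa2.
text, used = decode_bytes(b"\xc3\xa2")
print(text, used)
```

This heuristic can still pick the wrong single-byte encoding, since almost any byte sequence decodes "successfully" as Windows-1252 or Latin-1. For real pages, a detector such as the one built into BeautifulSoup is likely to give better results.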

See also the answers to this question.
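
Once the text is a proper Unicode string, the punctuation step described in the question no longer needs to throw away accented letters. The following is a minimal sketch, assuming Python 3, where \w in a str pattern is Unicode-aware. The names WORD_RE, STOPWORDS, and word_frequencies are hypothetical, and the stopword set is a placeholder rather than a real list.

```python
import re
import unicodedata
from collections import Counter

# In Python 3, \w matches Unicode word characters for str patterns.
# [^\W\d_] narrows that to letters only, so accented letters are kept
# while punctuation, digits, and underscores are dropped.
WORD_RE = re.compile(r"[^\W\d_]+")

# Placeholder stopword set, for illustration only.
STOPWORDS = {"the", "a", "de", "la"}


def word_frequencies(text):
    """Count words in `text`, keeping accents and skipping stopwords."""
    # NFC joins a base letter and a combining accent into one character,
    # so 'a' + U+0302 is treated the same as the single character 'â'.
    text = unicodedata.normalize("NFC", text)
    words = (w.lower() for w in WORD_RE.findall(text))
    return Counter(w for w in words if w not in STOPWORDS)


freq = word_frequencies("Olá, você! A ação de você.")
print(freq.most_common(1))
```

The resulting Counter can be turned into a plain dict with dict(freq) before it is stored.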
