Beautiful Soup and Unicode Problems

Question

I'm using BeautifulSoup to parse some web pages.

Occasionally I run into a "unicode hell" error like the following:

Looking at the source of this article on TheAtlantic.com [http://www.theatlantic.com/education/archive/2013/10/why-are-hundreds-of-harvard-students-studying-ancient-chinese-philosophy/280356/]

We see this in the og:description meta property:

<meta property="og:description" content="The professor who teaches&nbsp;Classical Chinese Ethical and Political Theory claims, &quot;This course will change your life.&quot;" />

When BeautifulSoup parses it, I see this:

>>> print repr(description)
u'The professor who teaches\xa0Classical Chinese Ethical and Political Theory claims, "This course will change your life."'

If I try encoding it to UTF-8, as this SO comment suggests: http://stackoverflow.com/a/10996267/442650

>>> print repr(description.encode('utf8'))
'The professor who teaches\xc2\xa0Classical Chinese Ethical and Political Theory claims, "This course will change your life."'

Just when I thought I had all my Unicode issues under control, I find I still don't quite understand what's going on, so I'm going to lay out a few questions:

1- Why would BeautifulSoup convert the &nbsp; to \xa0 [a Latin charset space character]? The charset and headers on this page are UTF-8, and I thought BeautifulSoup used that data to determine the encoding. Why wasn't it replaced with a <space>?

2- Is there a common way to normalize whitespace for conversion?

3- When I encoded to UTF-8, why did \xa0 become the sequence \xc2\xa0?

I can pipe everything through unicodedata.normalize('NFKD', string) to help get me where I want to be, but I'd love to understand what's wrong and avoid problems like this in the future.

Recommended answer

You aren't encountering a problem. Everything is behaving as intended.

&nbsp; indicates a non-breaking space character. It isn't replaced with a space because it doesn't represent a space; it represents a non-breaking space. Replacing it with a space would lose information: namely, that a text rendering engine shouldn't put a line break where that space occurs.

The Unicode code point for non-breaking space is U+00A0, which is written in a Unicode string in Python as \xa0.
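
As a minimal sketch of that mapping, the snippet below resolves the entity with the standard library's html.unescape rather than with BeautifulSoup itself, so it only illustrates the entity-to-code-point step. It assumes Python 3 (the session in the question is Python 2), and the sample string is a shortened, illustrative piece of the og:description text.

```python
import html
import unicodedata

# Resolve the HTML entity to the character it names.
decoded = html.unescape("teaches&nbsp;Classical")

nbsp = decoded[7]  # the character right after "teaches"
print(repr(nbsp))              # '\xa0'
print(hex(ord(nbsp)))          # 0xa0
print(unicodedata.name(nbsp))  # NO-BREAK SPACE
```

The escape \xa0 is just how Python spells U+00A0 in a string literal; it says nothing about Latin-1 or any other byte encoding.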

The UTF-8 encoding of U+00A0 is, in hexadecimal, the two-byte sequence C2 A0, or written in a Python string representation, \xc2\xa0. In UTF-8, anything beyond the 7-bit ASCII set needs two or more bytes to represent it. In this case, the highest bit set is the eighth bit, so the code point fits in the two-byte form (in binary) 110xxxxx 10xxxxxx, where the x's are the bits of the code point, padded with leading zeros to 11 bits. A0 is 10100000 in binary (00010 100000 when split into groups of 5 and 6 bits), so encoded in UTF-8 it becomes 11000010 10100000, or C2 A0.
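
The bit packing described above can be worked through by hand. This is a sketch in Python 3; utf8_two_byte is a hypothetical helper written for illustration, and it covers only the two-byte range U+0080 to U+07FF, not UTF-8 in general.

```python
def utf8_two_byte(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("two-byte UTF-8 covers only U+0080..U+07FF")
    lead = 0b11000000 | (code_point >> 6)           # 110xxxxx: top 5 bits
    trail = 0b10000000 | (code_point & 0b00111111)  # 10xxxxxx: low 6 bits
    return bytes([lead, trail])

encoded = utf8_two_byte(0xA0)
print(encoded)                            # b'\xc2\xa0'
print(encoded == "\xa0".encode("utf-8"))  # True
print(encoded.decode("utf-8") == "\xa0")  # True
```

For U+00A0 the top five bits are 00010 and the low six are 100000, which gives the lead byte C2 and the trail byte A0, matching what description.encode('utf8') printed in the question.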

Many people use &nbsp; in HTML to get spaces which aren't collapsed by the usual HTML whitespace collapsing rules (in HTML, all runs of consecutive spaces, tabs, and newlines get interpreted as a single space unless one of the CSS white-space rules is applied), but that's not really what it is intended for; it is supposed to be used for things like names, such as "Mr. Miyagi", where you don't want a line break between the "Mr." and "Miyagi". I'm not sure why it was used in this particular case; it seems out of place here, but that's more of a problem with your source than with the code that interprets it.
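
To illustrate the collapsing rule, here is a rough sketch in Python 3 of HTML-style whitespace collapsing. collapse_html_whitespace is a hypothetical helper, and it is a simplification of what browsers actually do: it merges runs of ASCII whitespace and deliberately leaves U+00A0 alone, which is why &nbsp; survives collapsing.

```python
import re

# ASCII whitespace as HTML treats it: space, tab, LF, FF, CR.
# U+00A0 is intentionally not in this set.
_HTML_WHITESPACE_RUN = re.compile(r"[ \t\n\f\r]+")

def collapse_html_whitespace(text: str) -> str:
    """Replace each run of ASCII whitespace with a single space."""
    return _HTML_WHITESPACE_RUN.sub(" ", text)

sample = "Mr.\xa0Miyagi   said \t\n hello"
print(repr(collapse_html_whitespace(sample)))
# 'Mr.\xa0Miyagi said hello'
```

Note that Python 3's own notion of whitespace is broader: str.split() with no arguments and the regex class \s both treat \xa0 as whitespace, so " ".join(text.split()) would turn the non-breaking space into a plain one.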

Now, if you don't really care about layout, so you don't mind whether or not text layout algorithms choose that as a place to wrap, and would like to interpret this merely as a regular space, normalizing using NFKD is a perfectly reasonable answer (or NFKC if you prefer pre-composed accents to decomposed accents). The NFKC and NFKD normalizations map characters so that most characters that represent essentially the same semantic value in most contexts are expanded out. For instance, ligatures are expanded out (ﬃ -> ffi), archaic long s characters are converted into s (ſ -> s), Roman numeral characters are expanded into their individual letters (Ⅳ -> IV), and non-breaking spaces are converted into normal spaces. For some characters, NFKC or NFKD normalization may lose information that is important in some contexts: ℌ and ℍ will both normalize to H, but in mathematical texts they can be used to refer to different things.
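
The mappings listed above can be checked with unicodedata.normalize. This is a sketch in Python 3; the labels and the shortened description string are illustrative, not taken from the page.

```python
import unicodedata

# Characters named in the answer, written as escapes to avoid font issues.
samples = {
    "no-break space": "\u00a0",
    "ffi ligature": "\ufb03",
    "long s": "\u017f",
    "Roman numeral four": "\u2163",
    "black-letter H": "\u210c",
    "double-struck H": "\u210d",
}
results = {label: unicodedata.normalize("NFKC", ch) for label, ch in samples.items()}
for label, value in results.items():
    print(label, "->", repr(value))

# Applied to text like the og:description value from the question:
description = "teaches\xa0Classical Chinese"
normalized = unicodedata.normalize("NFKC", description)
print(repr(normalized))  # 'teaches Classical Chinese'

# NFKD and NFKC differ only in whether accents end up decomposed.
print(len(unicodedata.normalize("NFKD", "\u00e9")))  # 2: 'e' plus combining acute
print(len(unicodedata.normalize("NFKC", "\u00e9")))  # 1: precomposed e-acute
```

If the non-breaking space is the only character you want to change, description.replace("\xa0", " ") is a narrower alternative that leaves ligatures, numerals, and mathematical letters untouched.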
