Character encoding conversion


Problem description





Here's what I'm trying to do:

- scrape some html content from various sources

The issue I'm running into:

- some of the sources have incorrectly encoded characters... for
example, cp1252 curly quotes that were likely the result of the author
copying and pasting content from Word

I've searched and read for many hours, but have not found a solution
for handling the case where the page author does not use the character
encoding that they have specified.

Things I have tried include encode()/decode(), and replacement lookup
tables (i.e. something like
http://groups-beta.google.com/group/...991de6ced3406b
). However, I am still unable to convert the characters to something
meaningful. In the case of the lookup table, this failed as all of
the improperly encoded characters were returning as ? rather than
their original encoding.

I'm using urllib and htmllib to open, read, and parse the html
fragments, Python 2.3 on OS X 10.3

Any ideas or pointers would be greatly appreciated.

-Dylan Schiemann
http://www.dylanschiemann.com/

Recommended answer

Dylan wrote:
Things I have tried include encode()/decode()





This should work. If you somehow manage to guess the encoding,
e.g. guess it as cp1252, then

htmlstring.decode("cp1252").encode("us-ascii", "xmlcharrefreplace")

will give you a file that contains only ASCII characters, and
character references for everything else.
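
A minimal sketch of that conversion in present-day Python 3, where the page body is a bytes object (the thread itself uses Python 2.3, where htmlstring is a byte string). The helper name to_ascii_with_charrefs and the sample bytes are made up for illustration:

```python
def to_ascii_with_charrefs(raw: bytes, encoding: str) -> bytes:
    """Decode raw bytes with the guessed encoding, then re-encode as ASCII.

    Any character outside ASCII becomes a numeric character reference.
    """
    text = raw.decode(encoding)
    return text.encode("us-ascii", "xmlcharrefreplace")


# Hypothetical input: cp1252 curly quotes, as pasted from Word.
sample = b"\x93quoted\x94"
print(to_ascii_with_charrefs(sample, "cp1252"))
```

The two curly quotes come back as the references &#8220; and &#8221;, which any browser renders correctly regardless of the declared page encoding.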

Now, how should you guess the encoding? Here is a strategy:
1. use the encoding that was sent through the HTTP header. Be
absolutely certain to not ignore this encoding.
2. use the encoding in the XML declaration (if any).
3. use the encoding in the http-equiv meta element (if any)
4. use UTF-8
5. use Latin-1, and check that there are no characters in the
range(128,160)
6. use cp1252
7. use Latin-1
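
Steps 1 to 3 depend on finding what the page declares about itself. One rough way to collect those declarations is sketched below in Python 3. The function name declared_encodings and the regular expressions are assumptions for illustration, not something given in this thread, and real pages vary more than these patterns cover:

```python
import re

# Assumed patterns; they only handle the common spellings.
_HEADER_CHARSET = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.I)
_XML_DECL = re.compile(rb"<\?xml[^>]*encoding\s*=\s*[\"']([\w.:-]+)[\"']", re.I)
_META_CHARSET = re.compile(rb"<meta[^>]+charset\s*=\s*[\"']?([\w.:-]+)", re.I)


def declared_encodings(content_type, body):
    """Return the declared encodings in the order of steps 1 to 3.

    content_type is the value of the HTTP Content-Type header (or None),
    body is the raw page as bytes.
    """
    found = []
    # Step 1: charset parameter of the HTTP header.
    if content_type:
        match = _HEADER_CHARSET.search(content_type)
        if match:
            found.append(match.group(1))
    # Steps 2 and 3: XML declaration, then the http-equiv meta element.
    # Only the start of the document is searched.
    for pattern in (_XML_DECL, _META_CHARSET):
        match = pattern.search(body[:2048])
        if match:
            found.append(match.group(1).decode("ascii"))
    return found
```

The declarations are matched against the raw bytes because the body cannot be decoded reliably before its encoding is known.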

Work through steps 1 to 6 in order, and check each time whether you
manage to decode the input. Notice that in step 5, decoding will
always succeed; consider it a failure if you get any control
characters (from range(128, 160)). If everything else fails, step 7
tries Latin-1 again as the last resort.
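
The seven steps above can be sketched roughly as follows in Python 3. This is one possible reading of the strategy, not code from the thread. The names decode_html and has_c1_controls are hypothetical, and the declared argument is assumed to hold whatever encodings the caller collected for steps 1 to 3:

```python
def has_c1_controls(text):
    """True if text contains code points in range(128, 160)."""
    return any(128 <= ord(ch) < 160 for ch in text)


def decode_html(raw, declared=()):
    """Return (text, encoding_used), trying the seven steps in order."""
    # Steps 1-4: declared encodings (header, XML declaration, meta), then UTF-8.
    for name in list(declared) + ["utf-8"]:
        try:
            return raw.decode(name), name
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding; try the next candidate
    # Step 5: Latin-1 always decodes, so reject results with control characters.
    text = raw.decode("latin-1")
    if not has_c1_controls(text):
        return text, "latin-1"
    # Step 6: cp1252 maps most of bytes 128-159 to printable characters.
    try:
        return raw.decode("cp1252"), "cp1252"
    except UnicodeDecodeError:
        pass
    # Step 7: fall back to the Latin-1 result, control characters and all.
    return text, "latin-1"


print(decode_html(b"\x93hi\x94"))
```

One caveat with this reading: a page that declares Latin-1 but actually contains cp1252 bytes is accepted at step 1, because Latin-1 never fails to decode. Whether to apply the control-character check to declared encodings as well is a design choice the thread does not settle.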

When you find the first encoding that decodes correctly, encode
the result with ascii and xmlcharrefreplace, and you won't need to
worry about the encoding anymore.

Regards,
Martin







I have a similar problem, with characters like ??üA?ü? and so on. I am
extracting some content out of webpages, and they deliver whatever,
sometimes not even giving any encoding information in the header. But
your solution sounds quite good, I just do not know if
- it works with the characters I mentioned
- what encoding do you have in the end
- and how exactly are you doing all this? All with somestring.decode()
or... Can you please give an example for these 7 steps?
Thanks in advance for the help
Chris


Christian Ergh wrote:
- it works with the characters I mentioned

It does.

- what encoding do you have in the end

US-ASCII

- and how exactly are you doing all this? All with somestring.decode()
or... Can you please give an example for these 7 steps?



I could, but I don't have the time - just try to come up with some
code, and I'll try to comment on it.

Regards,
Martin

