How to solve this weird python encoding issue?


Problem Description


I'm doing some NLP tasks on a corpus of strings from the web - and as you'd expect, there are encoding issues. Here are a few examples:

they don’t serve sushi : the apostrophe in don't is not the standard ' but the UTF-8 byte sequence \xe2\x80\x99 (U+2019, right single quotation mark)
Delicious food – Wow   : the dash before Wow is \xe2\x80\x93 (U+2013, an en dash)
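
A quick way to see exactly what you're dealing with (a minimal sketch, assuming Python 2 and UTF-8 source bytes as above) is to compare the repr of the raw line with the repr of its decoded form:

>>> line = 'they don\xe2\x80\x99t serve sushi'   # raw bytes as read from the web
>>> line
'they don\xe2\x80\x99t serve sushi'
>>> line.decode('utf-8')                          # bytes -> unicode code points
u'they don\u2019t serve sushi'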

So now, I'm going to read such lines, pass them to NLTK for parsing, and use the parse information to train a CRF model through Mallet.

Let's begin with the solutions I've been seeing everywhere on Stack Overflow. Here are a few experiments:

st = "they don’t serve sushi"

st.encode('utf-8')
Out[2]: 'they don\xc3\xa2\xe2\x82\xac\xe2\x84\xa2t serve sushi'

st.decode('utf-8')
Out[3]: u'they don\u2019t serve sushi'

So these are just trial-and-error attempts to see if something might work.

I finally used the encoded sentence and passed it to the next part - POS tagging using NLTK. posTags = nltk.pos_tag(tokens) throws an ugly exception known to everyone:

 File "C:\Users\user\workspacePy\_projectname_\CRF\FeatureGen.py", line 95, in getSentenceFeatures
    posTags = nltk.pos_tag(tokens)
  File "C:\Users\user\Anaconda\lib\site-packages\nltk\tag\__init__.py", line 101, in pos_tag
    return tagger.tag(tokens)
  File "C:\Users\user\Anaconda\lib\site-packages\nltk\tag\sequential.py", line 61, in tag
    tags.append(self.tag_one(tokens, i, tags))
  File "C:\Users\user\Anaconda\lib\site-packages\nltk\tag\sequential.py", line 81, in tag_one
    tag = tagger.choose_tag(tokens, index, history)
  File "C:\Users\user\Anaconda\lib\site-packages\nltk\tag\sequential.py", line 634, in choose_tag
    featureset = self.feature_detector(tokens, index, history)
  File "C:\Users\user\Anaconda\lib\site-packages\nltk\tag\sequential.py", line 736, in feature_detector
    'prevtag+word': '%s+%s' % (prevtag, word.lower()),
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

And when I tried decoding, it says UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 42: ordinal not in range(128) in the line where I'm decoding the string.

So my current solution is to remove all the non-ASCII characters. But that totally changes the words, which causes a serious loss of data for a unigram/bigram (word combination) based model.
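
For reference, here is a minimal sketch of that lossy stripping (assuming Python 2 and UTF-8 input); note how don’t collapses into a different token dont, which is exactly the data loss described:

>>> st = 'they don\xe2\x80\x99t serve sushi'
>>> st.decode('utf-8').encode('ascii', 'ignore')
'they dont serve sushi'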

What should be the right approach?

Solution

In your example st is a str (a list of bytes). It was encoded in some form (utf8 by the looks of it), but you should think of it as a list of bytes, and you need to know how it was encoded in order to decode it (though utf8 is generally a good first punt).

>>> st = "they don’t serve sushi"
>>> st
'they don\xe2\x80\x99t serve sushi'
>>> type(st)
<type 'str'>

>>> st.encode('utf8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 8: ordinal not in range(128)

So st.encode is nonsensical here. It's already encoded (as utf8, by the looks of things). For some mad reason, in python2 str.encode will first decode into a unicode and then encode back to a str. It chooses to decode as ascii by default, but your data is encoded as utf8. So the error you're seeing is in the decode step of your encode operation! It's looking at that list of bytes e2,80,99 and saying - 'hmmm, those aren't real ascii characters'.
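
To make that implicit two-step concrete (an illustration added here, not from the original answer): calling .encode('utf8') on a str behaves like an ascii decode followed by a utf8 encode, and it's the hidden decode that raises. Doing the decode explicitly with the right codec works fine:

>>> st = 'they don\xe2\x80\x99t serve sushi'
>>> st.decode('ascii').encode('utf8')   # what st.encode('utf8') implicitly does
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 8: ordinal not in range(128)
>>> st.decode('utf8')                   # explicit decode with the right codec
u'they don\u2019t serve sushi'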

Let's start with unicode data instead (notice the u):

>>> st = u"they don’t serve sushi"
>>> st
u'they don\u2019t serve sushi'
>>> type(st)
<type 'unicode'>
>>> st.encode('utf8')
'they don\xe2\x80\x99t serve sushi'

Really, all of this is python2's fault. Python3 won't let you get away with the shenanigans of thinking of unicode and str as the same thing.

The rule of thumb is: always work with unicode within your code. Only encode/decode when you're getting data in and out of the system, and generally encode as utf8 unless you have some other specific requirement.
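
Here is a minimal end-to-end sketch of that rule (assuming Python 2, NLTK installed, and a hypothetical UTF-8 file reviews.txt): decode once at the input boundary, keep everything unicode inside, and encode once at the output boundary.

import codecs
import nltk

# Decode at the boundary: codecs.open hands back unicode lines directly.
with codecs.open('reviews.txt', 'r', encoding='utf-8') as f:
    for line in f:
        tokens = nltk.word_tokenize(line)   # unicode in, unicode tokens out
        tags = nltk.pos_tag(tokens)         # no implicit ascii decode anywhere
        # Encode at the boundary when writing the result back out.
        print u' '.join(tok for tok, _ in tags).encode('utf-8')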

In python2 you can ensure that 'data' in your code is automatically unicode u'data':

from __future__ import unicode_literals

>>> st = "they don’t serve sushi"
>>> st
u'they don\u2019t serve sushi'
>>> type(st)
<type 'unicode'>
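
One caveat the original answer doesn't mention (standard Python 2 behaviour, noted here as an addition): if that literal lives in a .py file rather than the interactive interpreter, the file also needs a source-encoding declaration, or Python 2 will reject the non-ASCII bytes in the literal with a SyntaxError.

# -*- coding: utf-8 -*-
from __future__ import unicode_literals

st = "they don’t serve sushi"   # a unicode object: u'they don\u2019t serve sushi'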
