Replacing special characters in a pandas DataFrame

Question

I have a huge DataFrame that is encoded in iso8859_15.

A few of its columns contain names and places in Brazil, so some values contain special characters such as "í" or "Ô".

I have the replacements for them in a dictionary: {'í':'i', 'á':'a', ...}

I tried replacing them in a couple of ways (below), but none of them worked.

df.replace(dictionary, regex=True, inplace=True)  # tried both with and without regex=True and inplace=True

Also:

df.update(pd.Series(dic))

Neither produced the expected output, which would be for strings such as "NÍCOLAS" to become "NICOLAS".

Any help?

Recommended answer

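The documentation for pandas.DataFrame.replace says you have to provide a nested dictionary: the first level is keyed by column name, and for each column you provide a second dictionary of substitution pairs.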

So, this should work:

>>> df=pd.DataFrame({'a': ['NÍCOLAS','asdč'], 'b': [3,4]})
>>> df
         a  b
0  NÍCOLAS  3
1     asdč  4

>>> df.replace({'a': {'č': 'c', 'Í': 'I'}}, regex=True)
         a  b
0  NICOLAS  3
1     asdc  4
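The question mentions trying replace both with and without regex. A minimal sketch of why that flag matters here, reusing the example frame above (the variable names exact and substring are only for illustration): without regex=True, replace compares each key against the whole cell value, so a single accented letter inside a longer string is never matched.

```python
import pandas as pd

df = pd.DataFrame({"a": ["NÍCOLAS", "asdč"], "b": [3, 4]})
mapping = {"a": {"č": "c", "Í": "I"}}

# Without regex=True, keys must equal the entire cell value,
# so nothing in column "a" changes.
exact = df.replace(mapping)

# With regex=True, each key is treated as a pattern and
# matching substrings are substituted.
substring = df.replace(mapping, regex=True)

print(exact["a"].tolist())
print(substring["a"].tolist())
```

Because the keys are interpreted as regular expressions in the second call, any key containing characters such as "." or "(" would need escaping, for example with re.escape.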

Edit: it seems pandas also accepts a non-nested translation dictionary. In that case, the problem is probably character encoding, particularly if you use Python 2. Assuming your CSV load function decoded the file's characters properly (as true Unicode code points), you should make sure your translation/substitution dictionary is also defined with Unicode characters, like this:

dictionary = {u'í': 'i', u'á': 'a'}

If you have a definition like this (and are using Python 2):

dictionary = {'í': 'i', 'á': 'a'}

then the actual keys in that dictionary are multibyte byte strings. Which bytes they contain depends on the character encoding of the source file, but presuming you use UTF-8, you'll get:

dictionary = {'\xc3\xa1': 'a', '\xc3\xad': 'i'}

That would explain why pandas fails to replace those characters. So, be sure to use Unicode literals in Python 2: u'this is unicode string'.
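The mismatch can be reproduced in Python 3 as a rough emulation (Python 2 itself is not used here, and the names text, good_key and bad_key are made up for the illustration): the UTF-8 bytes of "í" are reinterpreted as two separate characters, which is roughly what a Python 2 byte-string key looks like next to properly decoded text.

```python
text = "N\u00edCOLAS"   # decoded text; "\u00ed" is the single code point for í
good_key = "\u00ed"     # what u'í' gives you: one character

# Emulate a Python 2 'í' literal from a UTF-8 source file: two bytes,
# viewed here as two Latin-1 characters instead of one code point.
utf8_bytes = good_key.encode("utf-8")
bad_key = utf8_bytes.decode("latin-1")

print(utf8_bytes)                    # b'\xc3\xad'
print(len(good_key), len(bad_key))   # 1 2
print(text.replace(bad_key, "i"))    # unchanged: the two-character key never occurs
print(text.replace(good_key, "i"))   # NiCOLAS
```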

In Python 3, on the other hand, all strings are Unicode strings, and you don't have to use the u prefix (in fact, the unicode type from Python 2 is renamed to str in Python 3, and the old str from Python 2 is now bytes in Python 3).
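Putting the pieces together for Python 3, here is a hedged end-to-end sketch. The CSV content, the column names and the in-memory buffer are invented for the example; the point is only that the file is decoded with the encoding named in the question and the dictionary uses ordinary string keys.

```python
import io

import pandas as pd

# Hypothetical sample data, encoded the way the question describes.
raw = "name,city\nNÍCOLAS,São Paulo\nJOSÉ,Belém\n".encode("iso8859_15")

# Decode while loading so the cells hold real Unicode strings.
df = pd.read_csv(io.BytesIO(raw), encoding="iso8859_15")

# Plain str keys are already Unicode in Python 3.
dictionary = {"Í": "I", "É": "E", "ã": "a", "é": "e"}

# Flat (non-nested) dictionary: applied to every column.
cleaned = df.replace(dictionary, regex=True)
print(cleaned)
```

With a real file, the io.BytesIO buffer would be replaced by the file path, and the dictionary would need an entry for every accented character that actually occurs in the data.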
