Remove non-ASCII characters from string columns in pandas

Question

I have a pandas DataFrame with multiple columns in which the values are mixed with unwanted characters.

columnA        columnB    columnC        ColumnD
\x00A\x00B     NULL       \x00C\x00D     123
\x00E\x00F     NULL       NULL           456

What I'd like to do is make the DataFrame look like this:

columnA        columnB    columnC        ColumnD
AB             NULL       CD             123
EF             NULL       NULL           456

With the code below I can remove '\x00' from columnA, but columnC is tricky because it contains NULL in some rows.

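import numpy as np

col_names = cols_to_clean  # list of column names to process
fixer = dict.fromkeys([0x00], u'')  # translation table that deletes '\x00'
for i in col_names:
    # skip columns that contain nulls or hold integers
    if not df[i].isnull().any():
        if df[i].dtype != np.int64:
            df[i] = df[i].map(lambda x: x.translate(fixer))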

Is there an efficient way to remove the unwanted characters from columnC?

Recommended answer

In general, to remove non-ASCII characters, use str.encode with errors='ignore' and then decode the result back to a string:

df['col'] = df['col'].str.encode('ascii', 'ignore').str.decode('ascii')
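
As a quick check of that one-liner, here is a minimal sketch on a made-up Series (the sample values and the names `s` and `cleaned` are illustrative, not from the question). The `.str` methods skip missing values, so a missing entry passes through unchanged instead of raising.

```python
import pandas as pd

# Illustrative data: two strings with non-ASCII characters and one missing value.
s = pd.Series(["caf\u00e9", "na\u00efve", None])

# Encode to ASCII bytes, dropping anything that cannot be encoded,
# then decode the bytes back to text.
cleaned = s.str.encode("ascii", "ignore").str.decode("ascii")

print(cleaned.tolist())
```

The accented characters are dropped and the missing value stays missing, which is why this approach does not need the null check used in the question's loop.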

To do this on multiple string columns at once, use

u = df.select_dtypes(object)
df[u.columns] = u.apply(
    lambda x: x.str.encode('ascii', 'ignore').str.decode('ascii'))

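That still won't remove the null characters in your columns, though, because '\x00' is itself a valid ASCII character. To strip those, replace them with a regex. Note that the pattern below, \W+, matches every run of non-word characters, so it also removes spaces and punctuation: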

df2 = df.replace(r'\W+', '', regex=True)
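
If only the null characters should go, a narrower pattern is one option. The sketch below rebuilds the question's sample data, assuming NULL means a missing value (NaN), and targets `\x00` alone. The DataFrame construction is an assumption about what the real data looks like, not something taken from the original post.

```python
import numpy as np
import pandas as pd

# Sample data modeled on the question; NULL is assumed to be NaN.
df = pd.DataFrame({
    "columnA": ["\x00A\x00B", "\x00E\x00F"],
    "columnB": [np.nan, np.nan],
    "columnC": ["\x00C\x00D", np.nan],
    "ColumnD": [123, 456],
})

# Replace only the null character; NaN cells and numeric columns are untouched.
df2 = df.replace(r"\x00", "", regex=True)

print(df2)
```

Because the regex replacement only acts on string values, there is no need to skip columns that contain NaN or to special-case the integer column.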
