Remove non-ASCII characters from string columns in pandas
Problem description
I have a pandas DataFrame with multiple columns whose values are mixed with unwanted characters.
columnA columnB columnC ColumnD
\x00A\X00B NULL \x00C\x00D 123
\x00E\X00F NULL NULL 456
What I'd like to do is make this DataFrame look like the one below.
columnA columnB columnC ColumnD
AB NULL CD 123
EF NULL NULL 456
With my code below, I can remove '\x00' from columnA, but columnC is tricky because it is mixed with NULL in certain rows.
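import numpy as np

col_names = cols_to_clean
fixer = dict.fromkeys([0x00], u'')  # translation table that deletes the NUL character
for i in col_names:
    # columns containing any missing value are skipped entirely
    if df[i].isnull().any() == False:
        if df[i].dtype != np.int64:
            df[i] = df[i].map(lambda x: x.translate(fixer))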
Is there an efficient way to remove the unwanted characters from columnC?
Recommended answer
In general, to remove non-ASCII characters, use str.encode with errors='ignore', then decode the result back to a string:
df['col'] = df['col'].str.encode('ascii', 'ignore').str.decode('ascii')
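As a small illustration of the round trip above, here is a sketch on a made-up Series (the sample values are assumptions, not data from the question). It shows that non-ASCII characters are dropped while missing values stay missing:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: two strings with non-ASCII letters, one missing value.
s = pd.Series(["caf\u00e9", "na\u00efve", np.nan, "plain"])

# Encode to ASCII, silently dropping anything that cannot be encoded,
# then decode the bytes back to text.
cleaned = s.str.encode("ascii", "ignore").str.decode("ascii")

print(cleaned.tolist())
```

Because the .str accessor skips missing values, no isnull() check is needed before applying it.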
To perform this on multiple string columns, use:
u = df.select_dtypes(object)
df[u.columns] = u.apply(
    lambda x: x.str.encode('ascii', 'ignore').str.decode('ascii'))
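That still won't handle the null characters in your columns, though: '\x00' (NUL) is itself an ASCII character, so the encode/decode round trip keeps it. To remove them, use a regex replacement. Note that \W+ matches every run of non-word characters, so it also strips spaces and punctuation: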
df2 = df.replace(r'\W+', '', regex=True)
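If only the NUL character should go, a narrower pattern may be preferable. The following is a minimal sketch, assuming the question's data looks like the reconstructed frame below (with NULL represented as NaN, which is an assumption). It also checks that the ASCII round trip alone leaves '\x00' in place:

```python
import numpy as np
import pandas as pd

# Reconstruction of the question's example; NULL is assumed to be NaN.
df = pd.DataFrame({
    "columnA": ["\x00A\x00B", "\x00E\x00F"],
    "columnB": [np.nan, np.nan],
    "columnC": ["\x00C\x00D", np.nan],
    "ColumnD": [123, 456],
})

# NUL is a valid ASCII character, so the encode/decode round trip keeps it.
roundtrip = df["columnA"].str.encode("ascii", "ignore").str.decode("ascii")

# Target only the NUL character; numeric columns and missing values are untouched.
cleaned = df.replace(r"\x00", "", regex=True)

print(cleaned)
```

Unlike \W+, this pattern leaves spaces and punctuation inside the strings alone, so a value such as 'a b' would be unchanged.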