Remove non-ASCII characters from pandas column
Question
I have been working on this issue for a while. I am trying to remove non-ASCII characters from the DB_user column and replace them with spaces, but I keep getting errors. This is what my data frame looks like:
+-----------------------------------------------------------
| DB_user source count |
+-----------------------------------------------------------
| ???/"Ò|Z?)?]??C %??J A 10 |
| ?D$ZGU ;@D??_???T(?) B 3 |
| ?Q`H??M'?Y??KTK$?ً???ЩJL4??*?_?? C 2 |
+-----------------------------------------------------------
I am using this function, which I came across while researching the problem on SO:
def filter_func(string):
    for i in range(0, len(string)):
        if ord(string[i]) < 32 or ord(string[i]) > 126:
            break
    return ''
And then using the apply function:
df['DB_user'] = df.apply(filter_func,axis=1)
I keep getting this error:
'ord() expected a character, but string of length 66 found', u'occurred at index 2'
However, I thought that by looping inside filter_func I was handling this, since I pass a single character to ord. So the moment it hits a non-ASCII character, that character should be replaced by a space.
Could somebody help me out?
Thanks!
Recommended answer
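Your code fails because you are not applying it to each character. With df.apply(filter_func, axis=1) the function receives a whole row, so string[i] is an entire cell value (a 66-character string in your case), and ord raises because it accepts only a single character. Apply the function to the column and iterate over the characters of each value instead: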
df['DB_user'] = df["DB_user"].apply(lambda x: ''.join([" " if ord(i) < 32 or ord(i) > 126 else i for i in x]))
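To see why the original call raised, the following sketch reproduces the error without pandas. It assumes a plain list is a fair stand-in for the row that apply(..., axis=1) hands to the function; the row values are made up for illustration.

```python
def filter_func(string):
    # Same check as in the question: look at string[i] one index at a time.
    for i in range(len(string)):
        if ord(string[i]) < 32 or ord(string[i]) > 126:
            break
    return ''

# Hypothetical stand-in for one row: [DB_user, source, count].
row = ["?D$ZGU ;@D", "B", 3]

try:
    filter_func(row)
    error_message = None
except TypeError as exc:
    # row[0] is the whole DB_user string, not one character, so ord() rejects it.
    error_message = str(exc)

print(error_message)
```

Indexing the row yields whole cell values, which is why the loop never gets down to individual characters.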
You can also simplify the condition with a chained comparison:
''.join([i if 32 <= ord(i) <= 126 else " " for i in x])
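As a quick check of the per-character approach, here is a minimal sketch that wraps the comprehension in a helper and runs it on one plain string. The helper name to_printable_ascii and the sample value are made up for illustration; the bounds are inclusive so that the space (32) and ~ (126) are kept.

```python
def to_printable_ascii(text, fill=" "):
    """Replace every character outside code points 32..126 with `fill`."""
    return "".join(ch if 32 <= ord(ch) <= 126 else fill for ch in text)

# Sample mixing a control character, "~", a space and a non-ASCII letter.
sample = "a\x01b~ \u00e9"
cleaned = to_printable_ascii(sample)
print(repr(cleaned))
```

Applied to the column, this would be df["DB_user"].apply(to_printable_ascii), assuming every value in the column is a string.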
You could also use string.printable to filter the characters:
from string import printable
st = set(printable)
df["DB_user"] = df["DB_user"].apply(lambda x: ''.join([" " if i not in st else i for i in x]))
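One thing worth knowing before choosing this variant: string.printable is not exactly the 32..126 range used in the earlier snippets. The short sketch below compares the two sets; the variable names are only for illustration.

```python
from string import printable

# Characters with code points 32..126, as in the ord()-based filter.
ascii_32_to_126 = {chr(code) for code in range(32, 127)}

# What printable contains beyond that range, and what it lacks (if anything).
extra_in_printable = set(printable) - ascii_32_to_126
missing_from_printable = ascii_32_to_126 - set(printable)

print(sorted(extra_in_printable))
print(missing_from_printable)
```

The extra members are the whitespace control characters (tab, newline, carriage return, vertical tab and form feed), so a filter based on printable leaves those in place rather than turning them into spaces.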
The fastest option is to use translate:
# Python 2 only: string.maketrans builds a table for byte strings (str)
from string import maketrans
del_chars = "".join(chr(i) for i in range(32) + range(127, 256))
trans = maketrans(del_chars, " " * len(del_chars))
df['DB_user'] = df["DB_user"].apply(lambda s: s.translate(trans))
Interestingly, that is faster than:
df['DB_user'] = df["DB_user"].str.translate(trans)
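The translate snippet above is Python 2 code: it adds two range results together and imports maketrans from the string module, neither of which works on current Python 3. There, strings are Unicode and a 256-entry table would not cover characters such as the Arabic and Cyrillic ones in the sample data. One possible alternative, a different technique from the answer's and not benchmarked here, is a regular expression that matches anything outside the 32..126 range. The names below are made up for illustration.

```python
import re

# Matches any single character outside the printable ASCII range 0x20..0x7E.
NON_PRINTABLE_ASCII = re.compile(r"[^\x20-\x7E]")

def replace_non_ascii(text, fill=" "):
    """Replace each character outside 0x20..0x7E with `fill`."""
    return NON_PRINTABLE_ASCII.sub(fill, text)

# Sample with an Arabic mark, a Cyrillic letter and a control character.
sample = "Q`H\u064b\u0429JL4\x07*"
cleaned = replace_non_ascii(sample)
print(repr(cleaned))
```

On the column this could be written as df["DB_user"].apply(replace_non_ascii), or with the vectorized string accessor as df["DB_user"].str.replace(r"[^\x20-\x7E]", " ", regex=True), assuming a pandas version whose str.replace accepts the regex argument.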