Remove non-ASCII characters from pandas column


Question

I have been trying to work on this issue for a while. I am trying to remove non-ASCII characters from the DB_user column and replace them with spaces, but I keep getting errors. This is how my data frame looks:



+-----------------------------------------------------------
|      DB_user                            source   count  |                                             
+-----------------------------------------------------------
| ???/"Ò|Z?)?]??C %??J                      A        10   |                                       
| ?D$ZGU   ;@D??_???T(?)                    B         3   |                                       
| ?Q`H??M'?Y??KTK$?ً???ЩJL4??*?_??        C         2   |                                        
+-----------------------------------------------------------

I was using this function, which I had come across while researching the problem on SO.

def filter_func(string):
   for i in range(0,len(string)):
      if (ord(string[i]) < 32 or ord(string[i]) > 126):
           break
      return ''

And then using the apply function:

df['DB_user'] = df.apply(filter_func,axis=1)

I keep getting the error:



'ord() expected a character, but string of length 66 found', u'occurred at index 2'

However, I thought that by using the loop in the filter_func function, I was dealing with this by passing a single character to ord. So the moment it hits a non-ASCII character, that character should be replaced by a space.

Could somebody help me out?

Thanks!

Recommended answer

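Your code fails because ord is never given a single character. df.apply(filter_func, axis=1) passes each row to the function, so string[i] is a whole cell value rather than one character, and ord raises an error because it accepts only a single character. You would need to apply the check to each character of each DB_user value: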

  df['DB_user'] = df["DB_user"].apply(lambda x: ''.join([" " if ord(i) < 32 or ord(i) > 126 else i for i in x]))
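To make the difference concrete, here is a minimal sketch. The sample frame and the helper name to_printable_ascii are made up for illustration and are not the asker's data. It shows what each style of apply passes to the function, then runs the per-character replacement on the column.

```python
import pandas as pd

# Hypothetical sample data standing in for the frame in the question.
df = pd.DataFrame({
    "DB_user": ["abc\u00d2def", "x\ty"],
    "source": ["A", "B"],
    "count": [10, 3],
})

# DataFrame.apply with axis=1 hands the function one whole row (a Series).
row_types = df.apply(lambda row: type(row).__name__, axis=1).tolist()

# Series.apply hands the function one cell value at a time (here, a str).
cell_types = df["DB_user"].apply(lambda v: type(v).__name__).tolist()


def to_printable_ascii(text):
    """Replace every character outside code points 32..126 with a space."""
    return "".join(" " if ord(c) < 32 or ord(c) > 126 else c for c in text)


df["DB_user"] = df["DB_user"].apply(to_printable_ascii)

print(row_types)                # what axis=1 passes
print(cell_types)               # what Series.apply passes
print(df["DB_user"].tolist())   # cleaned column
```

Because the function receives one string per call, iterating over it yields single characters, which is what ord expects.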

You can also simplify the join using a chained comparison:

   ''.join([i if 32 <= ord(i) <= 126 else " " for i in x])
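The bounds of the chained comparison matter. The following sketch (the function name clean_ascii is only illustrative) checks that the inclusive form 32 <= ord(c) <= 126 is the exact negation of ord(c) < 32 or ord(c) > 126. With strict < on both sides, the space character (32) and the tilde (126) would be replaced as well.

```python
def clean_ascii(text):
    """Keep characters with code points 32..126, replace the rest with a space."""
    return "".join(c if 32 <= ord(c) <= 126 else " " for c in text)


# The inclusive chained comparison agrees with the negated "or" test
# for every code point in the range checked here.
conditions_agree = all(
    (32 <= o <= 126) == (not (o < 32 or o > 126)) for o in range(256)
)

# Tab and the accented letter are replaced; the space and "~" are kept.
sample = clean_ascii("a b~\tc\u00e9")
print(conditions_agree, repr(sample))
```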

You could also use string.printable to filter the chars:

from string import printable
st = set(printable)
df["DB_user"] = df["DB_user"].apply(lambda x: ''.join([" " if  i not in  st else i for i in x]))
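One detail to keep in mind: string.printable is not the same set as code points 32 to 126. It also contains the ASCII whitespace control characters, so the string.printable variant leaves tabs and newlines in place while the ord-based variants turn them into spaces. A small sketch of the difference:

```python
from string import printable

# Characters that string.printable accepts but the 32..126 range rejects.
extra = sorted(c for c in printable if not 32 <= ord(c) <= 126)

allowed = set(printable)
text = "a\tb\u00e9"

# Filtering by string.printable keeps the tab.
by_printable = "".join(c if c in allowed else " " for c in text)

# Filtering by code point range replaces the tab with a space.
by_range = "".join(c if 32 <= ord(c) <= 126 else " " for c in text)

print(extra)
print(repr(by_printable), repr(by_range))
```

Which behavior is preferable depends on whether tabs and line breaks should survive in the column.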

The fastest is to use translate:

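# Python 2 only: string.maketrans works on byte strings and was removed in Python 3
from string import maketrans

# every byte value outside the printable ASCII range 32..126
del_chars = "".join(chr(i) for i in range(32) + range(127, 256))
trans = maketrans(del_chars, " " * len(del_chars))

df['DB_user'] = df["DB_user"].apply(lambda s: s.translate(trans))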

Interestingly, that is faster than:

  df['DB_user'] = df["DB_user"].str.translate(trans)
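The maketrans snippet above is Python 2 code that works on byte strings. As a hedged sketch of the same idea in Python 3, str.translate accepts any object that supports indexing by code point, and a LookupError from that object means "keep the character unchanged". The class name NonPrintableToSpace is a hypothetical helper, not part of pandas or the standard library, and no timing claim is made for this variant.

```python
class NonPrintableToSpace:
    """Translation table for str.translate (hypothetical helper name)."""

    def __getitem__(self, codepoint):
        if 32 <= codepoint <= 126:
            # LookupError tells str.translate to keep the character as-is.
            raise LookupError(codepoint)
        # Anything else, including code points above 255, becomes a space.
        return " "


table = NonPrintableToSpace()
cleaned = "h\u00e9llo\tw\u0639rld".translate(table)
print(cleaned)
```

On a column it could be applied the same way as the other variants, for example df["DB_user"].apply(lambda s: s.translate(table)).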
