Remove non-ASCII characters from pandas column


Problem description

I have been working on this issue for a while. I am trying to remove non-ASCII characters from the DB_user column and replace them with spaces, but I keep getting errors. This is what my data frame looks like:


+-----------------------------------------------------------
|      DB_user                            source   count  |                                             
+-----------------------------------------------------------
| ???/"Ò|Z?)?]??C %??J                      A        10   |                                       
| ?D$ZGU   ;@D??_???T(?)                    B         3   |                                       
| ?Q`H??M'?Y??KTK$?ً???ЩJL4??*?_??        C         2   |                                        
+-----------------------------------------------------------

I was using this function, which I had come across while researching the problem on SO.

def filter_func(string):
    for i in range(0, len(string)):
        if ord(string[i]) < 32 or ord(string[i]) > 126:
            break
        return ''

And then using the apply function:

df['DB_user'] = df.apply(filter_func,axis=1)

I keep getting the error:


'ord() expected a character, but string of length 66 found', u'occurred at index 2'

However, I thought that by looping inside filter_func I was passing one character at a time to 'ord', so that any non-ASCII character would be replaced by a space as soon as the loop reached it.

Could somebody help me out?

Thanks!

Solution

Your code fails because df.apply(filter_func, axis=1) passes each row to the function, not each character. string[i] is therefore a whole cell value, and ord raises an error because it accepts only a single character. You would need:

  df['DB_user'] = df["DB_user"].apply(lambda x: ''.join([" " if ord(i) < 32 or ord(i) > 126 else i for i in x]))
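The error can be reproduced without pandas, which may make the cause easier to see. In this minimal sketch a plain list stands in for one DataFrame row (an assumption for illustration; with axis=1 the function actually receives a pandas Series per row), and the sample values are taken loosely from the table in the question:

```python
# A plain list stands in for one DataFrame row (illustrative assumption;
# df.apply(..., axis=1) passes a pandas Series for each row).
row = ["?D$ZGU", "B", 3]

try:
    ord(row[0])  # row[0] is the whole cell value, not a single character
except TypeError as exc:
    message = str(exc)
    print(message)

# Iterating over the cell value itself yields single characters, which ord() accepts
codes = [ord(ch) for ch in row[0]]
print(codes)
```

Indexing a row yields a cell, while iterating over a cell yields characters, which is why the fix applies the function to the DB_user column rather than to the whole frame.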

You can also simplify the join using a chained comparison:

   ''.join([i if 32 <= ord(i) <= 126 else " " for i in x])
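For readability, the same per-character logic can be pulled into a named function. This is only a sketch: the function name to_printable_ascii and the sample string are hypothetical, not from the original post.

```python
def to_printable_ascii(text):
    """Replace every character outside the printable ASCII range (32-126) with a space."""
    return "".join(ch if 32 <= ord(ch) <= 126 else " " for ch in text)


# Hypothetical sample value: contains an accented letter, a Cyrillic letter and a tab
sample = "caf\u00e9 \u0429JL4\ttab"
cleaned = to_printable_ascii(sample)
print(repr(cleaned))
```

With pandas, the function would be applied to the column so that it receives one string at a time: df["DB_user"] = df["DB_user"].apply(to_printable_ascii). Each character is replaced one for one, so the cleaned string keeps the length of the original.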

You could also use string.printable to filter the characters (it also contains whitespace such as tab and newline, so those are kept rather than replaced):

from string import printable
st = set(printable)
df["DB_user"] = df["DB_user"].apply(lambda x: ''.join([" " if  i not in  st else i for i in x]))
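The string.printable approach is not quite identical to the 32 to 126 range check. A short sketch that lists the characters the two approaches treat differently (the variable names are illustrative):

```python
from string import printable

printable_set = set(printable)

# Characters that string.printable accepts but the 32-126 range check rejects
extra = sorted(ch for ch in printable_set if not 32 <= ord(ch) <= 126)
print([repr(ch) for ch in extra])
```

If tabs or newlines inside DB_user should also become spaces, the range check is the safer choice of the two.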

The fastest is to use translate:

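# Python 2 only: string.maketrans builds a table for byte strings (str)
from string import maketrans

# Control characters (0-31) and everything from DEL (127) upward
del_chars = "".join(chr(i) for i in range(32) + range(127, 256))
# Map each of those characters to a space
trans = maketrans(del_chars, " " * len(del_chars))

df['DB_user'] = df["DB_user"].apply(lambda s: s.translate(trans))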
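The snippet above imports maketrans from the string module, which exists only in Python 2. On Python 3, one possible equivalent is to hand str.translate a dict subclass whose __missing__ hook decides what to do with each code point. This is a sketch under that assumption; the names AsciiOnlyTable and clean_with_translate are made up for illustration, and no timing claim is made for it.

```python
class AsciiOnlyTable(dict):
    """Table for str.translate: keeps printable ASCII, maps everything else to a space."""

    def __missing__(self, codepoint):
        # str.translate looks up each character by its integer code point
        return codepoint if 32 <= codepoint <= 126 else " "


trans = AsciiOnlyTable()


def clean_with_translate(text):
    """Replace control and non-ASCII characters in text with spaces."""
    return text.translate(trans)


print(repr(clean_with_translate("caf\u00e9\tx~")))
```

Applied to the column, this would read df["DB_user"] = df["DB_user"].apply(clean_with_translate). Unlike a table built from range(256), the hook also covers code points above 255, such as Cyrillic letters.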

Interestingly, that is faster than:

  df['DB_user'] = df["DB_user"].str.translate(trans)
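A regular expression is another option. It is not part of the original answer, so treat it as an alternative technique rather than the answerer's method. The sketch below uses only the standard re module; the names NON_PRINTABLE_ASCII and replace_with_regex are illustrative.

```python
import re

# Matches any single character outside the printable ASCII range 0x20-0x7E
NON_PRINTABLE_ASCII = re.compile(r"[^\x20-\x7E]")


def replace_with_regex(text):
    """Replace each character outside printable ASCII with a space."""
    return NON_PRINTABLE_ASCII.sub(" ", text)


print(repr(replace_with_regex("a\u0429b\n~")))
```

The same pattern can be passed to the pandas string accessor, for example df["DB_user"].str.replace(r"[^\x20-\x7E]", " ", regex=True), which avoids writing an explicit apply.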
