Using vectorized logical not in pandas to filter a frame


Question

I have a pandas DataFrame that I would like to prune. I want to take out the rows where the section is 2 and the identifier does not start with a digit. First I would like to count them. If I run this:

len(analytic_events[analytic_events['section']==2].index)

I get the result 1247669.

When I narrow things down and run this:

len(analytic_events[(analytic_events['section']==2) & ~(analytic_events['identifier'][0].isdigit())].index)

I get exactly the same answer: 1247669.

I know, for example, that ten of the rows have this as their identifier:

.help.your_tools.subtopic2

which does not start with a digit, and that 15,000 rows have this as their identifier:

240.1007

which does start with a digit.

Why is my filter passing all the rows rather than just those whose identifier does not start with a digit?

Recommended answer

Use the .str accessor to apply string methods element-wise: .str[0] takes the first character of each string and .str.isdigit() tests it. Then call sum() on the mask to count the True values:

mask= ((analytic_events['section']==2) & 
       ~(analytic_events['identifier'].str[0].str.isdigit()))

print (mask.sum())
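
As a minimal sketch of how this mask could serve both steps in the question (count first, then prune), the following runs it on the small sample frame used later in this answer. The names n_matching and pruned are illustrative and do not come from the original post.

```python
import pandas as pd

# Small sample frame (same values as the test frame used in this answer).
analytic_events = pd.DataFrame({
    'section': [2, 2, 2, 3, 2],
    'identifier': ['4hj', '8hj', 'gh', 'th', 'h6h'],
})

# True for rows in section 2 whose identifier does NOT start with a digit.
mask = ((analytic_events['section'] == 2) &
        ~analytic_events['identifier'].str[0].str.isdigit())

n_matching = int(mask.sum())      # how many rows would be taken out
pruned = analytic_events[~mask]   # keep every row that does not match

print(n_matching)
print(pruned)
```

Here ~ is applied to a boolean Series, so it negates each row individually, which is the behaviour the question was after.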

If performance is important and the column has no missing values, a list comprehension is an alternative:

import numpy as np
# x[0] assumes every identifier is a non-empty string (no NaN/None)
arr = ~np.array([x[0].isdigit() for x in analytic_events['identifier']])
mask = ((analytic_events['section']==2) & arr)
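
The list comprehension fails on missing values and on empty strings, because x[0] cannot be evaluated for them. One possible way to tolerate both is sketched below. The helper starts_with_digit is hypothetical (it is not part of pandas or NumPy), the sample data is invented for illustration, and the sketch assumes that missing or empty identifiers should count as "not starting with a digit".

```python
import numpy as np
import pandas as pd

def starts_with_digit(value):
    """Hypothetical helper: True only for non-empty strings whose first character is a digit."""
    # value[:1] is '' for an empty string, and ''.isdigit() is False.
    return isinstance(value, str) and value[:1].isdigit()

# Invented sample data containing an empty string and a missing value.
analytic_events = pd.DataFrame({
    'section': [2, 2, 2, 3, 2, 2],
    'identifier': ['4hj', '', 'gh', 'th', None, '240.1007'],
})

# Element-wise negation of a NumPy boolean array.
arr = ~np.array([starts_with_digit(x) for x in analytic_events['identifier']])
mask = (analytic_events['section'] == 2) & arr

print(mask.sum())
```

Whether missing identifiers belong in the mask depends on the data, so the assumption above should be checked against the real frame before pruning.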

Why is my filter passing all the rows rather than just those whose identifier does not start with a digit?

Testing the output of your expression on a small sample shows what happens:

import pandas as pd

analytic_events = pd.DataFrame(
    {'section': [2, 2, 2, 3, 2],
     'identifier': ['4hj', '8hj', 'gh', 'th', 'h6h']})

print (analytic_events)
   section identifier
0        2        4hj
1        2        8hj
2        2         gh
3        3         th
4        2        h6h

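Indexing the column with [0] returns the first value of the column (a single string), not the first character of every row: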

print ((analytic_events['identifier'][0]))
4hj

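Calling isdigit() on that scalar returns a single Python bool, and applying ~ to it gives the integer -1 (the bitwise NOT of False, i.e. of 0):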

print ((analytic_events['identifier'][0].isdigit()))
False

print (~(analytic_events['identifier'][0].isdigit()))
-1

When chained with the first mask using &, that scalar is applied to every row, and because -1 is truthy the result is True wherever the first mask is True:

print ((analytic_events['section']==2) & ~(analytic_events['identifier'][0].isdigit()))
0     True
1     True
2     True
3    False
4     True
Name: section, dtype: bool

So it behaves as if the second mask did not exist:

print (analytic_events['section']==2)
0     True
1     True
2     True
3    False
4     True
Name: section, dtype: bool
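
To make the contrast concrete, here is a small sketch that puts the scalar check from the question next to the element-wise check on the same sample frame. The variable names are illustrative only.

```python
import pandas as pd

analytic_events = pd.DataFrame({
    'section': [2, 2, 2, 3, 2],
    'identifier': ['4hj', '8hj', 'gh', 'th', 'h6h'],
})
ids = analytic_events['identifier']

# Scalar path: [0] picks the row labelled 0, so isdigit() runs once,
# on the whole string '4hj', and yields a single Python bool.
first_value = ids[0]
scalar_check = first_value.isdigit()

# Element-wise path: .str[0] takes the first character of every row,
# and .str.isdigit() yields one boolean per row.
first_chars = ids.str[0]
elementwise_check = first_chars.str.isdigit()

print(type(scalar_check).__name__, scalar_check)
print(elementwise_check.tolist())
```

Only the element-wise result has one value per row, which is what a row filter needs.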
