Using vectorized logical not in pandas to filter a frame
Question
I have a pandas data frame I would like to prune. I want to take out the rows where the section is 2 and the identifier does not start with a digit. First I would like to count them. If I run this
len(analytic_events[analytic_events['section']==2].index)
I get the result 1247669
When I narrow things down and run this
len(analytic_events[(analytic_events['section']==2) & ~(analytic_events['identifier'][0].isdigit())].index)
I get exactly the same answer: 1247669
I know, for example, that ten of the rows have this as their identifier
.help.your_tools.subtopic2
which does not start with a digit, and that 15,000 rows have this as their identifier
240.1007
which does start with a digit.
Why is my filter passing all the rows rather than just those whose identifier does not start with a digit?
Answer
Use the .str accessor to apply string methods to every value in the column, .str[0] to take the first character of each string, and finally sum() to count the True values:
mask = ((analytic_events['section'] == 2) &
        ~(analytic_events['identifier'].str[0].str.isdigit()))
print(mask.sum())
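As a minimal sketch of how such a mask can be used for both counting and pruning, the example below builds a small made-up frame (the sample values and the names starts_with_digit and pruned are illustrative, not taken from the original data). It swaps in str.match with na=False as an alternative to str[0].str.isdigit(), on the assumption that the identifier column might contain missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the real frame is much larger.
analytic_events = pd.DataFrame({
    'section': [2, 2, 2, 3, 2],
    'identifier': ['240.1007', '.help.your_tools.subtopic2', 'gh', 'th', np.nan],
})

# str.match anchors at the start of each string; na=False treats
# missing identifiers as "does not start with a digit".
starts_with_digit = analytic_events['identifier'].str.match(r'\d', na=False)

# Rows to take out: section 2 and identifier not starting with a digit.
mask = (analytic_events['section'] == 2) & ~starts_with_digit

print(mask.sum())                # number of rows that would be removed
pruned = analytic_events[~mask]  # keep everything else
print(pruned)
```

Indexing with ~mask keeps the rows that are not flagged, so the count and the pruning step share one Boolean Series.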
If performance matters and the column has no missing values or empty strings, use a list comprehension:
import numpy as np
# Boolean array: True where the first character is not a digit
arr = ~np.array([x[0].isdigit() for x in analytic_events['identifier']])
mask = ((analytic_events['section']==2) & arr)
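The list comprehension above raises a TypeError on missing values and an IndexError on empty strings. If the column cannot be guaranteed clean, one possible defensive variant looks like the sketch below; the sample data and the name starts_with_digit are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with an empty string and a missing value.
analytic_events = pd.DataFrame({
    'section': [2, 2, 2, 3, 2],
    'identifier': ['4hj', '', None, 'th', 'h6h'],
})

# x[:1] is '' for an empty string, and ''.isdigit() is False;
# the isinstance check skips None/NaN instead of raising TypeError.
starts_with_digit = np.array([
    isinstance(x, str) and x[:1].isdigit()
    for x in analytic_events['identifier']
])

# Section 2 and identifier not starting with a digit.
mask = (analytic_events['section'] == 2) & ~starts_with_digit
print(mask.sum())
```

Here empty and missing identifiers are counted as not starting with a digit; whether that is the right policy depends on the data.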
Why is my filter passing all the rows rather than just those whose identifier does not start with a digit?
Testing your expression on a small sample shows what goes wrong:
analytic_events = pd.DataFrame(
{'section':[2,2,2,3,2],
'identifier':['4hj','8hj','gh','th','h6h']})
print (analytic_events)
section identifier
0 2 4hj
1 2 8hj
2 2 gh
3 3 th
4 2 h6h
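Indexing the column with [0] returns the first value of the column (a single string), not the first character of each value: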
print ((analytic_events['identifier'][0]))
4hj
Calling isdigit() on that scalar returns a single Boolean, and ~ turns it into an integer:
print ((analytic_events['identifier'][0].isdigit()))
False
print (~(analytic_events['identifier'][0].isdigit()))
-1
~False evaluates to the integer -1 (bitwise inversion of a Python bool). When -1 is combined with the first mask using &, it is broadcast to every row and acts like True:
print ((analytic_events['section']==2) & ~(analytic_events['identifier'][0].isdigit()))
0 True
1 True
2 True
3 False
4 True
Name: section, dtype: bool
So the expression behaves exactly as if the second condition were not there:
print (analytic_events['section']==2)
0 True
1 True
2 True
3 False
4 True
Name: section, dtype: bool
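To make the contrast concrete, here is a small sketch comparing the scalar expression with the per-row one on the same sample frame. The names scalar, per_row, broken and fixed are illustrative, and a literal True is used as a stand-in for the scalar that the original expression produces.

```python
import pandas as pd

analytic_events = pd.DataFrame({
    'section': [2, 2, 2, 3, 2],
    'identifier': ['4hj', '8hj', 'gh', 'th', 'h6h'],
})

# Scalar: [0] picks the row labelled 0, so isdigit() runs once,
# on the whole string '4hj'.
scalar = analytic_events['identifier'][0].isdigit()

# Vectorized: .str[0] takes the first character of every row.
per_row = analytic_events['identifier'].str[0].str.isdigit()

print(type(scalar).__name__, scalar)
print(per_row.tolist())

# A single scalar on the right-hand side is applied to every row,
# so it filters nothing; a per-row Series filters row by row.
broken = (analytic_events['section'] == 2) & True
fixed = (analytic_events['section'] == 2) & ~per_row
print(broken.sum(), fixed.sum())
```

Note that isdigit() on the whole string is False for '4hj' even though it starts with a digit, which is a second reason the original expression cannot work as a first-character test.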