Unexpected behaviour from applying np.isin() on a pandas dataframe
Question
While working on an answer to another question, I stumbled upon some unexpected behaviour.
Consider the following DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('AAcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('BaaBbA'),
})
print(df)
   A  B  E  F
0  A  4  5  B   # <-- row contains 'A' and 5
1  A  5  3  a   # <-- row contains 'A' and 5
2  c  4  6  a
3  d  5  9  B
4  e  5  2  b
5  f  4  4  A
If we try to find all rows that contain both 'A' and 5, we can use jezrael's answer:
cond = [['A'],[5]]
print( np.logical_and.reduce([df.isin(x).any(axis=1) for x in cond]) )
which (correctly) yields: [ True  True False False False False]
However, if we use:
cond = [['A'],[5]]
print( df.apply(lambda x: np.isin([cond],[x]).all(),axis=1) )
this yields:
0 False
1 False
2 False
3 False
4 False
5 False
dtype: bool
Closer inspection of the second attempt reveals that:
- np.isin(['A',5],df.loc[0]) "wrongly" yields array([ True, False]), likely due to numpy inferring a dtype of <U1, and consequently 5 != '5'
- np.isin(['A',5],['A',4,5,'B']) "correctly" yields array([ True, True]), which means we can (and should) use df.loc[0].values.tolist() in the .apply() method above
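The two observations above can be reproduced without pandas. The following is a minimal sketch, assuming that a row of the DataFrame reaches np.isin as an object-dtype array; the variable names are illustrative only. The search values are coerced to a string array in both calls, so the outcome depends on whether the array being searched still holds the integer 5 or the string '5'.

```python
import numpy as np

needles = ['A', 5]

# Mixing a str and an int in a plain list yields a unicode array,
# so the integer 5 is stored as the string '5'.
print(np.array(needles).dtype.kind)  # 'U'

# Stand-in for a DataFrame row: object dtype keeps 4 and 5 as integers.
object_haystack = np.array(['A', 4, 5, 'B'], dtype=object)

# Stand-in for row.values.tolist(): numpy coerces everything to strings.
string_haystack = np.array(['A', 4, 5, 'B'])

against_object = np.isin(needles, object_haystack)  # '5' is compared with the integer 5
against_string = np.isin(needles, string_haystack)  # '5' is compared with the string '5'

print(against_object)
print(against_string)
```

Under that assumption, the first call corresponds to passing df.loc[0] directly and the second to passing df.loc[0].values.tolist().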
The question, boiled down:
Why do I need to specify x.values.tolist() in one case, and can directly use x in the other?
print( np.logical_and.reduce([df.isin(x).any(axis=1) for x in cond]) )
print( df.apply(lambda x: np.isin([cond],x.values.tolist()).all(),axis=1 ) )
Even worse is what happens if we search for [4,5]:
cond = [[4],[5]]
## this returns False for row 0
print( df.apply(lambda x: np.isin([cond],x.values.tolist() ).all() ,axis=1) )
## this returns True for row 0
print( df.apply(lambda x: np.isin([cond],x.values ).all() ,axis=1) )
Recommended answer
I think the DataFrame mixes string columns with integer columns, so looping by rows yields a Series with mixed types, and numpy then coerces the values to strings.
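To make the dtype argument concrete, here is a small sketch that inspects what a row looks like before and after conversion. It reuses the DataFrame from the question; the names mixed_row and numeric_row are introduced only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('AAcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('BaaBbA'),
})

mixed_row = df.iloc[0]                  # spans string and integer columns
numeric_row = df[['B', 'E']].iloc[0]    # spans integer columns only

# A row across mixed columns is an object Series: 4 and 5 stay integers.
print(mixed_row.dtype)

# Converting that row to a list and back to an array coerces to strings.
mixed_as_array = np.array(mixed_row.tolist())
print(mixed_as_array.dtype.kind)        # 'U'

# A row across integer columns stays numeric after the same round trip.
numeric_as_array = np.array(numeric_row.tolist())
print(numeric_as_array.dtype.kind)      # 'i'
```

This suggests why casting only cond to strings is not enough in general: a row taken from numeric columns alone would still hold integers.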
A possible solution is to convert cond to an array and then cast its values to strings:
cond = [[4],[5]]
print(df.apply(lambda x: np.isin(np.array(cond).astype(str), x.values.tolist()).all(),axis=1))
0 True
1 False
2 False
3 False
4 False
5 False
dtype: bool
Unfortunately, a general solution (one that also works when a row contains only numeric columns) needs to convert both cond and the Series:
f = lambda x: np.isin(np.array(cond).astype(str), x.astype(str).tolist()).all()
print (df.apply(f, axis=1))
Or convert all of the data:
f = lambda x: np.isin(np.array(cond).astype(str), x.tolist()).all()
print (df.astype(str).apply(f, axis=1))
If you use sets in pure Python, it works nicely:
print(df.apply(lambda x: set([4,5]).issubset(x),axis=1) )
0 True
1 False
2 False
3 False
4 False
5 False
dtype: bool
print(df.apply(lambda x: set(['A',5]).issubset(x),axis=1) )
0 True
1 True
2 False
3 False
4 False
5 False
dtype: bool