Are the outcomes of the numpy.where method on a pandas dataframe calculated on the full array or the filtered array?
Question
I want to use numpy.where on a pandas dataframe to check for the existence of a certain string in a column. If the string is present, apply a split function and take the second list element; if not, just take the first character. However, the following code doesn't work; it throws an IndexError: list index out of range because the first entry contains no underscore:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':['a','a_1','b_','b_2_3']})
df["B"] = np.where(df.A.str.contains('_'),df.A.apply(lambda x: x.split('_')[1]),df.A.str[0])
Calling np.where with only the condition returns an array of indices for which the condition holds true, so I was under the impression that the split command would only be applied to that subset of the data:
np.where(df.A.str.contains('_'))
Out[14]: (array([1, 2, 3], dtype=int64),)
But apparently the split command is applied to the entire unfiltered array, which seems odd to me, as that looks like a potentially large number of unnecessary operations that would slow down the calculation.
I'm merely wondering if this is an expected outcome or an issue with either pandas or numpy.
Recommended answer
Python isn't a "lazy" language, so code is evaluated immediately. Generators/iterators do introduce some laziness, but that doesn't apply here.
If we split your line of code, we get the following statements:
df.A.str.contains('_')
df.A.apply(lambda x: x.split('_')[1])
df.A.str[0]
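Python has to evaluate all three of these statements, each over the whole column, before it can pass the results as arguments to np.where. The IndexError is therefore raised by the apply call on the entry 'a', before np.where ever runs.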
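A minimal sketch of that point, assuming the same four strings as in the question: it builds only the second argument and never calls np.where, yet the error still appears. The flag name `raised` is just an illustrative helper.

```python
import pandas as pd

s = pd.Series(['a', 'a_1', 'b_', 'b_2_3'])

# Build only what would become the second argument of np.where.
# np.where itself is never called in this snippet.
raised = False
try:
    if_true = s.apply(lambda x: x.split('_')[1])
except IndexError as exc:
    # 'a'.split('_') is ['a'], so indexing [1] fails on the first row
    raised = True
    print('IndexError raised without np.where:', exc)
```

Since the exception comes from the apply call alone, the condition passed to np.where has no chance to filter anything.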
To see all this happening, we can rewrite the above as little functions that display some output:
def fn_contains(x):
    print('contains', x)
    return '_' in x

def fn_split(x):
    s = x.split('_')
    print('split', x, s)
    # check for errors here
    if len(s) > 1:
        return s[1]
    # no underscore: fall through and return None

def fn_first(x):
    print('first', x)
    return x[0]
Then you can run them on your data with:
s = pd.Series(['a','a_1','b_','b_2_3'])
np.where(
    s.apply(fn_contains),
    s.apply(fn_split),
    s.apply(fn_first)
)
You'll see all the contains calls, then all the split calls, then all the first calls, each running over every row; only after that does np.where pick between the results. This is basically what's happening "inside" numpy/pandas when you execute things.
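If the aim is to avoid both the error and the work on rows that don't match, two possible approaches are sketched below. Neither is part of the original answer, and the names `mask` and `b_vectorized` are illustrative. The first keeps np.where but uses the pandas string accessor, which yields a missing value instead of raising when the split has no second element. The second applies the split only to the rows selected by a boolean mask.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a_1', 'b_', 'b_2_3']})
mask = df['A'].str.contains('_')

# Option 1: still evaluated on every row, but .str[1] gives NaN
# (rather than an IndexError) where the split has fewer than two parts.
b_vectorized = np.where(mask, df['A'].str.split('_').str[1], df['A'].str[0])

# Option 2: start from the fallback value, then overwrite only the
# masked rows, so the lambda never sees entries without an underscore.
df['B'] = df['A'].str[0]
df.loc[mask, 'B'] = df.loc[mask, 'A'].apply(lambda x: x.split('_')[1])

print(list(b_vectorized))
print(df['B'].tolist())
```

Option 2 is the one that matches the intuition from the question, since the split is computed on the filtered subset only; whether it is actually faster depends on how selective the mask is.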