Are the outcomes of the numpy.where method on a pandas dataframe calculated on the full array or the filtered array?


Problem description

I want to use numpy.where on a pandas DataFrame to check whether a column contains a certain string. If the string is present, apply a split function and take the second list element; if not, just take the first character. However, the following code doesn't work: it throws an IndexError: list index out of range, because the first entry contains no underscore:

import pandas as pd
import numpy as np
df = pd.DataFrame({'A':['a','a_1','b_','b_2_3']})
df["B"] = np.where(df.A.str.contains('_'),df.A.apply(lambda x: x.split('_')[1]),df.A.str[0])

Calling np.where with only the condition returns an array of the indices for which the condition holds true, so I was under the impression that the split command would only be applied to that subset of the data:

np.where(df.A.str.contains('_'))
Out[14]: (array([1, 2, 3], dtype=int64),)

But apparently the split command is applied to the entire unfiltered array, which seems odd to me, as that looks like a potentially large number of unnecessary operations that would slow down the calculation.

I'm merely wondering whether this is an expected outcome or an issue with either pandas or numpy.

Recommended answer

Python isn't a "lazy" language, so code is evaluated immediately. Generators and iterators do introduce some laziness, but that doesn't apply here.

If we split your line of code, we get the following expressions:

  1. df.A.str.contains('_')
  2. df.A.apply(lambda x: x.split('_')[1])
  3. df.A.str[0]

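Python has to evaluate all three of these expressions before it can pass the results as arguments to np.where. In particular, the apply call with split runs on every row, including 'a', which is where the IndexError comes from.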

To see all this happening, we can rewrite the above as little functions that display some output:

def fn_contains(x):
    print('contains', x)
    return '_' in x

def fn_split(x):
    s = x.split('_')
    print('split', x, s)
    # check for errors here
    if len(s) > 1:
        return s[1]

def fn_first(x):
    print('first', x)
    return x[0]

Then you can run them on your data with:

s = pd.Series(['a','a_1','b_','b_2_3'])
np.where(
  s.apply(fn_contains),
  s.apply(fn_split),
  s.apply(fn_first)
)

You'll see everything being executed in order: every contains call, then every split call, then every first call, and only then does np.where combine the results. This is basically what's happening "inside" numpy/pandas when you execute things.
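If the goal is for the split to run only on the rows that contain an underscore, one possible approach (a sketch, not part of the original answer) is to build the boolean mask first and assign through it. The names `mask` and `vectorized` are illustrative. The second variant assumes the pandas behaviour that `.str[1]` yields NaN rather than raising when a split result has no second element, so it can safely be passed to np.where even though it is still computed for every row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a_1', 'b_', 'b_2_3']})

# Boolean mask of the rows whose value contains a literal underscore.
mask = df['A'].str.contains('_', regex=False)

# Variant 1: start with the first character everywhere, then overwrite
# only the masked rows. The split runs on that subset alone.
df['B'] = df['A'].str[0]
df.loc[mask, 'B'] = df.loc[mask, 'A'].str.split('_').str[1]

# Variant 2: keep np.where, but use the vectorized string accessor.
# Both branches are still evaluated on the full column; rows without a
# second element become NaN instead of raising IndexError.
vectorized = np.where(mask, df['A'].str.split('_').str[1], df['A'].str[0])

print(df['B'].tolist())     # ['a', '1', '', '2']
print(vectorized.tolist())  # ['a', '1', '', '2']
```

Note that 'b_' gives an empty string, because splitting it on '_' produces ['b', ''].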
