Interpolating values from a dataframe based on a column value

Question

Suppose I have the following problem:

import pandas as pd
import numpy as np

xp = [0.0, 0.5, 1.0]

np.random.seed(100)
df = pd.DataFrame(np.random.rand(10, 4), columns=['x0', 'y1', 'y2', 'y3'])

df
      x0     y1     y2     y3
0 0.5434 0.2784 0.4245 0.8448
1 0.0047 0.1216 0.6707 0.8259
2 0.1367 0.5751 0.8913 0.2092
3 0.1853 0.1084 0.2197 0.9786
4 0.8117 0.1719 0.8162 0.2741
5 0.4317 0.9400 0.8176 0.3361
6 0.1754 0.3728 0.0057 0.2524
7 0.7957 0.0153 0.5988 0.6038
8 0.1051 0.3819 0.0365 0.8904
9 0.9809 0.0599 0.8905 0.5769

I would like to add an interpolated column named interp. The x-coordinate at which to interpolate is in column x0, the x-coordinates of the data points are given by xp, and the y-coordinates of the data points are in columns y1, y2 and y3.

So far, I have come up with the following:

df['interp'] = df.apply(lambda x: np.interp(x.x0, xp, [x.y1, x.y2, x.y3]), axis=1)

df
      x0     y1     y2     y3  interp
0 0.5434 0.2784 0.4245 0.8448  0.4610
1 0.0047 0.1216 0.6707 0.8259  0.1268
2 0.1367 0.5751 0.8913 0.2092  0.6616
3 0.1853 0.1084 0.2197 0.9786  0.1496
4 0.8117 0.1719 0.8162 0.2741  0.4783
5 0.4317 0.9400 0.8176 0.3361  0.8344
6 0.1754 0.3728 0.0057 0.2524  0.2440
7 0.7957 0.0153 0.5988 0.6038  0.6018
8 0.1051 0.3819 0.0365 0.8904  0.3093
9 0.9809 0.0599 0.8905 0.5769  0.5889

However, the dataframe on which this calculation will be performed contains more than a million rows, so I would prefer a method that is faster than apply. Any ideas?

np.interp only seems to take 1-D arrays, which is why I went with apply.

Recommended answer

One good way to make this faster is pandas.DataFrame.eval():

TL;DR

Total seconds for 10 calls, by number of rows
Rows:     100   1000  10000    1E5    1E6    1E7
apply:  0.076  0.734  7.812
eval:   0.056  0.053  0.058  0.087  0.338  2.887

As these timings show, eval() has a lot of setup overhead: anything up to 10,000 rows takes essentially the same time. At 10,000 rows, however, it is already about two orders of magnitude faster than apply, so the overhead is certainly worth it for large data sets. (apply was only timed up to 10,000 rows.)

What is it?

From the docs:

pandas.eval(expr, parser='pandas', engine=None, truediv=True, 
            local_dict=None, global_dict=None, resolvers=(),
            level=0, target=None, inplace=None)

Evaluate a Python expression as a string using various backends.

The following arithmetic operations are supported: +, -, *, /, **, %, // (python engine only) along with the following boolean operations: | (or), & (and), and ~ (not). Additionally, the 'pandas' parser allows the use of and, or, and not with the same semantics as the corresponding bitwise operators. Series and DataFrame objects are supported and behave as they would with plain ol’ Python evaluation.
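As a small illustration of the description above, here is a minimal sketch of DataFrame.eval() on a toy frame. The frame `toy` and its columns `a` and `b` are made up for this example and are not part of the original question.

```python
import pandas as pd

# Hypothetical toy frame, used only to illustrate DataFrame.eval().
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4.0, 5.0, 6.0]})

# Column names are referenced directly inside the expression string.
total = toy.eval('a + b * 2')

# A comparison yields booleans; multiplying by it acts as a 0/1 mask.
masked = toy.eval('a * (b >= 5)')

print(total.tolist())   # [9.0, 12.0, 15.0]
print(masked.tolist())  # [0.0, 2.0, 3.0]
```

The second expression uses the same masking idea that the answer relies on: rows where the comparison is false contribute zero.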

The trick used for this problem:

The code below exploits the fact that, with three data points, the interpolation always falls in one of only two segments. It computes the interpolant for both segments and then discards the unused one by multiplying by a boolean test (i.e., 0 or 1).

The actual expression passed to eval is:

((y2-y1) / 0.5 * (x0-0.0) + y1) * (x0 < 0.5)+((y3-y2) / 0.5 * (x0-0.5) + y2) * (x0 >= 0.5)
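The same two-segment arithmetic can be written with plain NumPy arrays, which may make the expression easier to follow. This is only a sketch: the random data, the seed, and the names `seg_lo`, `seg_hi`, `two_segment` and `reference` are assumptions for illustration, not part of the original answer.

```python
import numpy as np

xp = [0.0, 0.5, 1.0]

rng = np.random.default_rng(0)  # arbitrary seed, illustration only
x0 = rng.random(1000)
y1, y2, y3 = rng.random((3, 1000))

# Linear interpolant on each segment, evaluated for every row.
seg_lo = (y2 - y1) / (xp[1] - xp[0]) * (x0 - xp[0]) + y1
seg_hi = (y3 - y2) / (xp[2] - xp[1]) * (x0 - xp[1]) + y2

# Boolean masks act as 0/1 multipliers, keeping one segment per row.
two_segment = seg_lo * (x0 < xp[1]) + seg_hi * (x0 >= xp[1])

# Row-by-row reference using np.interp.
reference = np.array([np.interp(x, xp, [a, b, c])
                      for x, a, b, c in zip(x0, y1, y2, y3)])

matches = np.allclose(two_segment, reference)
print(matches)
```

One difference worth keeping in mind: np.interp clamps to the end values when x0 lies outside [xp[0], xp[-1]], whereas this formula extrapolates linearly. For x0 values in [0, 1), as in the question, the two agree.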

Code:

import pandas as pd
import numpy as np

xp = [0.0, 0.5, 1.0]

np.random.seed(100)

def method1():
    df['interp'] = df.apply(
        lambda x: np.interp(x.x0, xp, [x.y1, x.y2, x.y3]), axis=1)

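def method2():
    # One linear-segment expression per interval of xp. The trailing
    # comparison evaluates to 0 or 1 and masks out the unused segment.
    exp = '((y%d-y%d) / %s * (x0-%s) + y%d) * (x0 %s %s)'
    exp_1 = exp % (2, 1, xp[1] - xp[0], xp[0], 1, '<', xp[1])
    exp_2 = exp % (3, 2, xp[2] - xp[1], xp[1], 2, '>=', xp[1])

    df['interp2'] = df.eval(exp_1 + '+' + exp_2)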

from timeit import timeit

def runit(stmt):
    print("%s: %.3f" % (
        stmt, timeit(stmt + '()', number=10,
                     setup='from __main__ import ' + stmt)))

def runit_size(size):
    global df
    df = pd.DataFrame(
        np.random.rand(size, 4), columns=['x0', 'y1', 'y2', 'y3'])

    print('Rows: %d' % size)
    if size <= 10000:
        runit('method1')
    runit('method2')

for i in (100, 1000, 10000, 100000, 1000000, 10000000):
    runit_size(i)

print(df.head())

Results:

         x0        y1        y2        y3    interp   interp2
0  0.060670  0.949837  0.608659  0.672003  0.908439  0.908439
1  0.462774  0.704273  0.181067  0.647582  0.220021  0.220021
2  0.568109  0.954138  0.796690  0.585310  0.767897  0.767897
3  0.455355  0.738452  0.812236  0.927291  0.805648  0.805648
4  0.826376  0.029957  0.772803  0.521777  0.608946  0.608946
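The first five rows agree in both columns. To check agreement over a whole frame rather than just the head, something like the following sketch could be used. The frame name `check`, the 1,000-row size, and the 1e-12 tolerance are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

xp = [0.0, 0.5, 1.0]

np.random.seed(100)
check = pd.DataFrame(np.random.rand(1000, 4),
                     columns=['x0', 'y1', 'y2', 'y3'])

# Reference column: row-wise np.interp via apply.
check['interp'] = check.apply(
    lambda r: np.interp(r.x0, xp, [r.y1, r.y2, r.y3]), axis=1)

# Same expression as the one passed to eval in the answer.
expr = ('((y2-y1) / 0.5 * (x0-0.0) + y1) * (x0 < 0.5)'
        '+((y3-y2) / 0.5 * (x0-0.5) + y2) * (x0 >= 0.5)')
check['interp2'] = check.eval(expr)

# Largest absolute difference between the two columns.
max_abs_diff = (check['interp'] - check['interp2']).abs().max()
print(max_abs_diff < 1e-12)
```

Any difference between the two columns should be at the level of floating-point rounding, since both compute the same piecewise-linear function.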
