Interpolating values from a dataframe based on a column value
Question
Let's say I have the following problem:
import pandas as pd
import numpy as np
xp = [0.0, 0.5, 1.0]
np.random.seed(100)
df = pd.DataFrame(np.random.rand(10, 4), columns=['x0', 'y1', 'y2', 'y3'])
df
x0 y1 y2 y3
0 0.5434 0.2784 0.4245 0.8448
1 0.0047 0.1216 0.6707 0.8259
2 0.1367 0.5751 0.8913 0.2092
3 0.1853 0.1084 0.2197 0.9786
4 0.8117 0.1719 0.8162 0.2741
5 0.4317 0.9400 0.8176 0.3361
6 0.1754 0.3728 0.0057 0.2524
7 0.7957 0.0153 0.5988 0.6038
8 0.1051 0.3819 0.0365 0.8904
9 0.9809 0.0599 0.8905 0.5769
I would like to compute an interpolated column named interp. The x-coordinate at which to interpolate is contained in column x0, the x-coordinates of the data points are given by xp, and the y-coordinates of the data points are contained in columns y1, y2 and y3.
So far, I came up with the following:
df['interp'] = df.apply(lambda x: np.interp(x.x0, xp, [x.y1, x.y2, x.y3]), axis=1)
df
x0 y1 y2 y3 interp
0 0.5434 0.2784 0.4245 0.8448 0.4610
1 0.0047 0.1216 0.6707 0.8259 0.1268
2 0.1367 0.5751 0.8913 0.2092 0.6616
3 0.1853 0.1084 0.2197 0.9786 0.1496
4 0.8117 0.1719 0.8162 0.2741 0.4783
5 0.4317 0.9400 0.8176 0.3361 0.8344
6 0.1754 0.3728 0.0057 0.2524 0.2440
7 0.7957 0.0153 0.5988 0.6038 0.6018
8 0.1051 0.3819 0.0365 0.8904 0.3093
9 0.9809 0.0599 0.8905 0.5769 0.5889
However, the dataframe on which this calculation will be performed contains more than a million rows, so I'd prefer a faster method than apply. Any ideas?
np.interp only seems to take 1-D arrays, which is why I went with apply.
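To make the per-row calculation concrete, here is a minimal sketch that reproduces the interp value for row 0 of the table above. It uses the rounded values as displayed, so the result agrees with the table's 0.4610 only up to that rounding. The variable names value and by_hand are illustrative.

```python
import numpy as np

xp = [0.0, 0.5, 1.0]            # x-coordinates of the data points
x0 = 0.5434                     # row 0: where to interpolate
fp = [0.2784, 0.4245, 0.8448]   # row 0: y1, y2, y3 (rounded, as displayed)

# np.interp takes 1-D xp and fp, so each call handles one row's data points.
value = np.interp(x0, xp, fp)

# The same result by hand: x0 lies in the second segment [0.5, 1.0].
by_hand = fp[1] + (fp[2] - fp[1]) / (xp[2] - xp[1]) * (x0 - xp[1])

print(value, by_hand)
```

Because fp differs from row to row, a single np.interp call cannot cover the whole frame, which is what pushes the question toward apply.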
Recommended answer
One good solution for making this faster is pandas.DataFrame.eval():
TL;DR
Seconds (total for 10 calls) by number of rows
Rows:     100    1000   10000     1E5     1E6     1E7
apply:  0.076   0.734   7.812
eval:   0.056   0.053   0.058   0.087   0.338   2.887
As these timings show, eval() has a lot of setup overhead: anything up to 10,000 rows takes essentially the same time. At 10,000 rows, however, it is about two orders of magnitude faster than apply, so the overhead is certainly worth it for large data sets.
What is it?
From the docs:
pandas.eval(expr, parser='pandas', engine=None, truediv=True,
local_dict=None, global_dict=None, resolvers=(),
level=0, target=None, inplace=None)
Evaluate a Python expression as a string using various backends.
The following arithmetic operations are supported: +, -, *, /, **, %, // (python engine only) along with the following boolean operations: | (or), & (and), and ~ (not). Additionally, the 'pandas' parser allows the use of and, or, and not with the same semantics as the corresponding bitwise operators. Series and DataFrame objects are supported and behave as they would with plain ol’ Python evaluation.
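As a small illustration of the description above, the sketch below evaluates two expression strings against a toy frame. The frame and its column names a and b are made up for this example. Column names are resolved directly inside the string.

```python
import pandas as pd

# Toy data, purely illustrative.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# Arithmetic on whole columns, written as a string expression.
total = df.eval("a + 2 * b")

# A comparison yields a boolean Series, one value per row.
mask = df.eval("a >= 2")

print(total.tolist())
print(mask.tolist())
```

The whole expression is evaluated column-wise in one pass rather than once per row, which is where the speedup over apply comes from.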
The trick used for this problem:
The code below exploits the fact that the interpolation always falls in one of only two segments. It computes the interpolant for both segments, then discards the unused one by multiplying by a boolean test (i.e., 0 or 1).
The actual expression passed to eval is:
((y2-y1) / 0.5 * (x0-0.0) + y1) * (x0 < 0.5)+((y3-y2) / 0.5 * (x0-0.5) + y2) * (x0 >= 0.5)
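The same two-segment masking idea can be checked in plain NumPy, independently of eval. The following is a sketch on random data, assuming x0 lies in [0, 1) as in the question. The names seg_lo, seg_hi and expected are illustrative and do not come from the answer.

```python
import numpy as np

rng = np.random.default_rng(0)
x0, y1, y2, y3 = rng.random((4, 1000))   # four columns of 1000 rows each

# Interpolant of each segment, computed for every row.
seg_lo = (y2 - y1) / 0.5 * (x0 - 0.0) + y1   # meaningful where x0 < 0.5
seg_hi = (y3 - y2) / 0.5 * (x0 - 0.5) + y2   # meaningful where x0 >= 0.5

# Boolean masks act as 0/1 multipliers, keeping one segment per row.
interp = seg_lo * (x0 < 0.5) + seg_hi * (x0 >= 0.5)

# Reference: row-by-row np.interp, as in the question's apply version.
expected = np.array([
    np.interp(x, [0.0, 0.5, 1.0], [a, b, c])
    for x, a, b, c in zip(x0, y1, y2, y3)
])
assert np.allclose(interp, expected)
```

The two masks are mutually exclusive and together cover every row, so exactly one segment contributes to each result.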
Code:
import pandas as pd
import numpy as np
from timeit import timeit

xp = [0.0, 0.5, 1.0]
np.random.seed(100)

def method1():
    # Baseline: row-wise np.interp via apply
    df['interp'] = df.apply(
        lambda x: np.interp(x.x0, xp, [x.y1, x.y2, x.y3]), axis=1)

def method2():
    # One expression per segment; the boolean test zeroes out
    # the segment that does not contain x0
    exp = '((y%d-y%d) / %s * (x0-%s) + y%d) * (x0 %s 0.5)'
    exp_1 = exp % (2, 1, xp[1] - xp[0], xp[0], 1, '<')
    exp_2 = exp % (3, 2, xp[2] - xp[1], xp[1], 2, '>=')
    df['interp2'] = df.eval(exp_1 + '+' + exp_2)

def runit(stmt):
    # Print the total time for 10 calls of the named function
    print("%s: %.3f" % (
        stmt, timeit(stmt + '()', number=10,
                     setup='from __main__ import ' + stmt)))

def runit_size(size):
    global df
    df = pd.DataFrame(
        np.random.rand(size, 4), columns=['x0', 'y1', 'y2', 'y3'])
    print('Rows: %d' % size)
    if size <= 10000:
        # apply is only timed up to 10,000 rows
        runit('method1')
    runit('method2')

for i in (100, 1000, 10000, 100000, 1000000, 10000000):
    runit_size(i)

print(df.head())
Results:
x0 y1 y2 y3 interp interp2
0 0.060670 0.949837 0.608659 0.672003 0.908439 0.908439
1 0.462774 0.704273 0.181067 0.647582 0.220021 0.220021
2 0.568109 0.954138 0.796690 0.585310 0.767897 0.767897
3 0.455355 0.738452 0.812236 0.927291 0.805648 0.805648
4 0.826376 0.029957 0.772803 0.521777 0.608946 0.608946