Alternate method to avoid loop in pandas dataframe

Views: 58 Published: 2020/5/4 5:51:04 Tags: python, performance, python-2.7, loops, pandas


Problem description

I have the following dataframe:

import pandas as pd

table2 = pd.DataFrame({
    'Product Type': ['A', 'B', 'C', 'D'],
    'State_1_Value': [10, 11, 12, 13],
    'State_2_Value': [20, 21, 22, 23],
    'State_3_Value': [30, 31, 32, 33],
    'State_4_Value': [40, 41, 42, 43],
    'State_5_Value': [50, 51, 52, 53],
    'State_6_Value': [60, 61, 62, 63],
    'Lower_Bound': [-1, 1, .5, 5],
    'Upper_Bound': [1, 2, .625, 15],
    'sim_1': [0, 0, .61, 7],
    'sim_2': [1, 1.5, .7, 9],
})

>>> table2
   Lower_Bound Product Type  State_1_Value  State_2_Value  State_3_Value  \
0         -1.0            A             10             20             30   
1          1.0            B             11             21             31   
2          0.5            C             12             22             32   
3          5.0            D             13             23             33   

   State_4_Value  State_5_Value  State_6_Value  Upper_Bound  sim_1  sim_2  
0             40             50             60        1.000   0.00    1.0  
1             41             51             61        2.000   0.00    1.5  
2             42             52             62        0.625   0.61    0.7  
3             43             53             63       15.000   7.00    9.0

I then wrote the following code, which produces a new DataFrame with a modified output for each 'sim' column:

for i in range(1, 3):
    # Rescale sim_i so that Lower_Bound maps to 1 and Upper_Bound maps to 6.
    table2['Bucket%s' % i] = 5 * (table2['sim_%s' % i] - table2['Lower_Bound']) / (table2['Upper_Bound'] - table2['Lower_Bound']) + 1
    # Neighbouring state numbers, clamped to 1..5 (lv) and 2..6 (hv).
    # .loc replaces .ix, which was removed in pandas 1.0.
    table2['lv'] = table2['Bucket%s' % i].map(int)
    table2['hv'] = table2['Bucket%s' % i].map(int) + 1
    table2.loc[table2['lv'] < 1, 'lv'] = 1
    table2.loc[table2['lv'] > 5, 'lv'] = 5
    table2.loc[table2['hv'] > 6, 'hv'] = 6
    table2.loc[table2['hv'] < 2, 'hv'] = 2
    # Look up the two state values for each row.
    table2['nLower'] = table2.apply(lambda row: row['State_%s_Value' % row['lv']], axis=1)
    table2['nHigher'] = table2.apply(lambda row: row['State_%s_Value' % row['hv']], axis=1)
    # Interpolate linearly between the two state values.
    table2['Final_value_%s' % i] = (table2['nHigher'] - table2['nLower']) * (table2['Bucket%s' % i] - table2['lv']) + table2['nLower']
df = table2.filter(regex="Final_value|Type")

Output:

>>> df
  Product Type  Final_value_1  Final_value_2
0            A           35.0           60.0
1            B          -39.0           36.0
2            C           56.0           92.0
3            D           23.0           33.0

I want to run this on 10,000 sims, and each loop iteration currently takes about 0.25 seconds. Is there a way to modify this code to avoid the loop and make it more time-efficient?

If you're curious about what this code is trying to accomplish, see my somewhat disorganized, self-answered question: Pandas DataFrame: Complex linear interpolation
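For readers who do not want to follow the link, the per-cell calculation performed by the loop can be sketched as a plain Python function. This is only an illustration of the arithmetic in the code above; the helper name `interpolate_state` and its parameters are hypothetical and do not appear in the original post.

```python
def interpolate_state(sim, lower, upper, state_values):
    """Interpolate between adjacent state values for a single sim draw.

    state_values holds the six State_k_Value entries of one product.
    Hypothetical helper, not part of the original post.
    """
    # Rescale sim so that lower maps to 1 and upper maps to 6.
    bucket = 5 * (sim - lower) / (upper - lower) + 1
    # Neighbouring states, clamped so that lv is in 1..5 and hv is in 2..6.
    lv = min(max(int(bucket), 1), 5)
    hv = min(max(int(bucket) + 1, 2), 6)
    n_lower = state_values[lv - 1]
    n_higher = state_values[hv - 1]
    # Linear interpolation; extrapolates when bucket lies outside [1, 6].
    return (n_higher - n_lower) * (bucket - lv) + n_lower


# Product A, sim_1: bucket = 5 * (0 - (-1)) / (1 - (-1)) + 1 = 3.5
result_a = interpolate_state(0, -1, 1, [10, 20, 30, 40, 50, 60])
print(result_a)  # 35.0

# Product B, sim_1 lies below Lower_Bound, so the value is extrapolated.
result_b = interpolate_state(0, 1, 2, [11, 21, 31, 41, 51, 61])
print(result_b)  # -39.0
```

Both values agree with the first column of the output shown above.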

Recommended answer

I was able to accomplish this without loops using the following code:

As a result, on my 10k x 200 table it ran in 3 minutes instead of the previous 2 hours.

Unfortunately, I now need to run it on a 10k x 4k table, and there I hit a MemoryError, though that may be outside the scope of this question.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Product Type': ['A', 'B', 'C', 'D'],
    'State_1_Value': [10, 11, 12, 13],
    'State_2_Value': [20, 21, 22, 23],
    'State_3_Value': [30, 31, 32, 33],
    'State_4_Value': [40, 41, 42, 43],
    'State_5_Value': [50, 51, 52, 53],
    'State_6_Value': [60, 61, 62, 63],
    'Lower_Bound': [-1, 1, .5, 5],
    'Upper_Bound': [1, 2, .625, 15],
    'sim_1': [0, 0, .61, 7],
    'sim_2': [1, 1.5, .7, 9],
})

# Select columns by name rather than position (.ix was removed in pandas 1.0).
sims = df.filter(regex='^sim_')
states = df.filter(regex='^State_').to_numpy()

# Rescale every sim column at once: Lower_Bound maps to 1, Upper_Bound to 6.
span = df['Upper_Bound'] - df['Lower_Bound']
buckets = sims.sub(df['Lower_Bound'], axis=0).div(span, axis=0) * 5 + 1

# Truncate toward zero (as int() does), then clamp to valid state numbers.
low = buckets.astype(int).clip(lower=1, upper=5)
high = (buckets.astype(int) + 1).clip(lower=2, upper=6)

# Fancy indexing: cell [r, c] picks states[r, low[r, c] - 1].
rows = np.arange(len(df))[:, None]
low_value = states[rows, low.to_numpy() - 1]
high_value = states[rows, high.to_numpy() - 1]

# Interpolate linearly between the two neighbouring state values.
df1 = (buckets - low) * (high_value - low_value) + low_value
df1['Product Type'] = df['Product Type']
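
The MemoryError on the 10k x 4k table is not resolved in the answer. One possible way to reduce peak memory, offered here as an untested-at-scale assumption rather than something the author reports trying, is to run the same calculation on batches of sim columns with plain NumPy arrays, so that the intermediate arrays only ever cover one batch. The function name `interpolate_in_chunks` and the `chunk_size` parameter are hypothetical.

```python
import numpy as np
import pandas as pd


def interpolate_in_chunks(df, chunk_size=500):
    """Hypothetical helper: interpolate the sim_* columns batch by batch."""
    states = df.filter(regex='^State_').to_numpy(dtype=float)
    lower = df['Lower_Bound'].to_numpy(dtype=float)[:, None]
    span = df['Upper_Bound'].to_numpy(dtype=float)[:, None] - lower
    sim_cols = [c for c in df.columns if c.startswith('sim_')]
    out = np.empty((len(df), len(sim_cols)))

    for start in range(0, len(sim_cols), chunk_size):
        cols = sim_cols[start:start + chunk_size]
        # Same rescaling as the answer, restricted to this batch of columns.
        buckets = (df[cols].to_numpy(dtype=float) - lower) / span * 5 + 1
        base = np.trunc(buckets).astype(np.int64)
        low = np.clip(base, 1, 5)
        high = np.clip(base + 1, 2, 6)
        # take_along_axis picks states[r, low[r, c] - 1] for every cell.
        low_value = np.take_along_axis(states, low - 1, axis=1)
        high_value = np.take_along_axis(states, high - 1, axis=1)
        out[:, start:start + len(cols)] = (
            (high_value - low_value) * (buckets - low) + low_value
        )

    result = pd.DataFrame(out, columns=sim_cols, index=df.index)
    result.insert(0, 'Product Type', df['Product Type'])
    return result


# Usage on the sample data from the question, one column per batch.
sample = pd.DataFrame({
    'Product Type': ['A', 'B', 'C', 'D'],
    'State_1_Value': [10, 11, 12, 13],
    'State_2_Value': [20, 21, 22, 23],
    'State_3_Value': [30, 31, 32, 33],
    'State_4_Value': [40, 41, 42, 43],
    'State_5_Value': [50, 51, 52, 53],
    'State_6_Value': [60, 61, 62, 63],
    'Lower_Bound': [-1, 1, .5, 5],
    'Upper_Bound': [1, 2, .625, 15],
    'sim_1': [0, 0, .61, 7],
    'sim_2': [1, 1.5, .7, 9],
})
result = interpolate_in_chunks(sample, chunk_size=1)
print(result)
```

The output array still holds one float per cell, so a 10k x 4k float64 result alone is roughly 320 MB; batching only limits the size of the temporaries. Whether that is enough depends on the available memory and on how the input table itself is stored.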
