Pandas, append column based on unique subset of column values


Problem Description

I have a dataframe with many rows. I am appending a column using data produced from a custom function, like this:

import numpy

df['new_column'] = numpy.vectorize(fx)(df['col_a'], df['col_b'])
# takes 180964.377 ms

It works fine; what I am trying to do is speed it up. There are really only a small number of unique combinations of col_a and col_b, so many of the iterations are redundant. I was thinking maybe pandas would just figure that out on its own, but I don't think that is the case. Consider this:

print(len(df.index))  # prints 127255
df_unique = df.copy().drop_duplicates(['col_a', 'col_b'])
print(len(df_unique.index))  # prints 9834

I also convinced myself of the possible speedup by running this:

df_unique['new_column'] = numpy.vectorize(fx)(df_unique['col_a'], df_unique['col_b'])
# takes 14611.357 ms

Since there is a lot of redundant data, what I am trying to do is update the large dataframe (df, 127255 rows) while running the fx function only the minimum number of times (9834). This is possible because of all the duplicate rows for col_a and col_b. Of course this means there will be multiple rows in df with the same values for col_a and col_b, but that is OK; the other columns of df differ and make each row unique.
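
To make the intended saving concrete, here is a minimal sketch of the idea (illustrative only; fx and the column names are taken from the snippets above): call fx once per unique (col_a, col_b) pair, then broadcast the results to every row through a plain dictionary lookup.

# sketch: evaluate fx once per unique pair, then fan the results back out
pairs = df[['col_a', 'col_b']].drop_duplicates()  # ~9834 unique pairs
lookup = {(a, b): fx(a, b) for a, b in zip(pairs['col_a'], pairs['col_b'])}
df['new_column'] = [lookup[(a, b)] for a, b in zip(df['col_a'], df['col_b'])]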

Before I write a normal iterative for loop to go through the df_unique dataframe and do a conditional update on df, I wanted to ask whether there is a more "pythonic", neat way of doing this kind of update. Thanks a lot.

** UPDATE **

I created the simple for loop mentioned above, like this:

df = ...
df_unique = df.copy().drop_duplicates(['col_a', 'col_b'])
df_unique['new_column'] = np.vectorize(fx)(df_unique['col_a'], df_unique['col_b'])
# for each unique pair, write its result into every matching row of df
for index, row in df_unique.iterrows():
    df.loc[(df['col_a'] == row['col_a']) & (df['col_b'] == row['col_b']), 'new_column'] = row['new_column']
# takes 165971.890 ms

So this for loop gives a slight performance increase, but not nearly as much as I would have expected.

FYI

This is the fx function. It queries a MySQL database.

from datetime import datetime, timedelta

import pandas

def fx(col_a, col_b):
    # signature matches the numpy.vectorize call above;
    # col_b is unused in this snippet as posted
    exp_date = datetime.strptime(col_a, '%m/%d/%Y')
    if exp_date.weekday() == 5:  # shift Saturdays back to Friday
        exp_date -= timedelta(days=1)

    p = pandas.read_sql("select stat from table where a = '%s' and b_date = '%s';"
                        % (col_a, exp_date.strftime('%Y-%m-%d')), engine)
    if len(p.index) == 0:
        return None
    else:
        return p.iloc[0].stat  # the query selects stat, so read stat back
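
As an aside: interpolating values into the SQL string with % formatting is fragile and open to injection; pandas.read_sql accepts a params argument so the database driver does the quoting instead. A minimal sketch of the same query in that style (the %s placeholder syntax is an assumption that holds for the common MySQL drivers; other drivers use different placeholders):

p = pandas.read_sql("select stat from table where a = %s and b_date = %s;",
                    engine, params=(col_a, exp_date.strftime('%Y-%m-%d')))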

Recommended Answer

UPDATE:

If you can manage to read the three columns ['stat', 'a', 'b_date'] of table into a tab DataFrame, then you can merge it like this:

tab = pd.read_sql('select stat,a,b_date from table', engine)
df.merge(tab, left_on=[...], right_on=[...], how='left')
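
Spelled out with assumed join keys (a sketch only; the left_on columns below are guesses, so map them to whichever df columns actually correspond to table.a and table.b_date):

import pandas as pd

# pull the lookup table once, then join it onto df on the assumed keys
tab = pd.read_sql('select stat, a, b_date from table', engine)
df = df.merge(tab, left_on=['col_a', 'col_b'], right_on=['a', 'b_date'], how='left')
df = df.rename(columns={'stat': 'new_column'}).drop(columns=['a', 'b_date'])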

Old answer:

You can merge/join your precalculated df_unique DataFrame with the original df DataFrame:

df['new_column'] = df.merge(df_unique, on=['col_a','col_b'], how='left')['new_column']
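
An end-to-end toy run of that approach (made-up data and a trivial stand-in for fx, just to show the per-pair results fanning back out to every row):

import numpy as np
import pandas as pd

# 6 rows, but only 3 unique (col_a, col_b) pairs
df = pd.DataFrame({'col_a': list('aabbcc'),
                   'col_b': [1, 1, 2, 2, 3, 3],
                   'col_c': range(6)})

def fx(a, b):  # stand-in for the expensive per-pair function
    return '%s-%s' % (a, b)

df_unique = df.drop_duplicates(['col_a', 'col_b']).copy()
df_unique['new_column'] = np.vectorize(fx)(df_unique['col_a'], df_unique['col_b'])

# merge returns a frame with a fresh RangeIndex, so pull the values
# out before assigning back
df['new_column'] = df.merge(df_unique[['col_a', 'col_b', 'new_column']],
                            on=['col_a', 'col_b'], how='left')['new_column'].values
print(df)

The .values on the last line guards against index misalignment; with a default-indexed df, the one-liner above works as-is.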
