Pandas: Assign values of column up to a limit set by dictionary values


Problem description

How can I get rid of the iterrows() loop? Can this be done faster with NumPy or pandas?

import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8)*0  })
print(df)
#      A      B  C
# 0  foo    one  0
# 1  bar    one  0
# 2  foo    two  0
# 3  bar  three  0
# 4  foo    two  0
# 5  bar    two  0
# 6  foo    one  0
# 7  foo  three  0

selDict = {"foo":2, "bar":3}

This works:

for i, r in df.iterrows():
    if selDict[r["A"]] > 0:
        selDict[r["A"]] -= 1
        # DataFrame.set_value was removed in pandas 1.0; .at is the replacement
        df.at[i, 'C'] = 1

print(df)
#      A      B  C
# 0  foo    one  1
# 1  bar    one  1
# 2  foo    two  1
# 3  bar  three  1
# 4  foo    two  0
# 5  bar    two  1
# 6  foo    one  0
# 7  foo  three  0

Recommended answer

Here is one approach:

1) Helper functions:

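def argsort_unique(idx):
    # Original idea : http://stackoverflow.com/a/41242285/3293881 by @Andras
    # Inverse of a permutation: same result as idx.argsort(), without sorting
    n = idx.size
    sidx = np.empty(n,dtype=int)
    sidx[idx] = np.arange(n)
    return sidx

def get_bin_arr(grplens, stop1_idx):
    # A group cannot receive more 1s than it has elements
    count_stops_corr = np.minimum(stop1_idx, grplens)

    limsc = np.maximum(grplens, count_stops_corr)
    L = limsc.sum()

    # First position of each group in the group-sorted layout
    starts = np.r_[0,limsc[:-1].cumsum()]

    shift_arr = np.zeros(L,dtype=int)
    stops = starts + count_stops_corr
    stops = stops[stops<L]

    # +1 where a run of 1s begins, -1 where it ends; the cumulative sum then
    # yields the 0/1 mask. The unbuffered ufunc.at calls count repeated indices,
    # which occur when a full group is followed by a group with limit 0.
    np.add.at(shift_arr, starts, 1)
    np.subtract.at(shift_arr, stops, 1)
    bin_arr = shift_arr.cumsum()
    return bin_arr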
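As a small illustration of what argsort_unique is used for, the sketch below builds the inverse of a sorting permutation and uses it to map group-sorted data back to the original row order. The name inverse_permutation and the sample labels are made up for this example; only the technique comes from the helper above.

```python
import numpy as np

def inverse_permutation(perm):
    # Hypothetical name: same idea as argsort_unique above
    inv = np.empty(perm.size, dtype=int)
    inv[perm] = np.arange(perm.size)
    return inv

values = np.array(["foo", "bar", "foo", "bar", "res"])

# Stable sort so equal labels keep their original relative order
order = values.argsort(kind="stable")
sorted_values = values[order]

# Indexing the sorted data with the inverse permutation undoes the sort
inv = inverse_permutation(order)
restored = sorted_values[inv]

print(order)     # indices that sort the labels
print(inv)       # indices that undo that sort
print(restored)  # same as the original labels
```

The inverse is built with one scatter assignment, so it avoids the second sort that order.argsort() would perform.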

A possibly faster alternative, with a helper function based on slicing in a loop:

def get_bin_arr(grplens, stop1_idx):
    stop1_idx_corr = np.minimum(stop1_idx, grplens)    
    clens = grplens.cumsum()
    out = np.zeros(clens[-1],dtype=int)    
    out[:stop1_idx_corr[0]] = 1
    for i,j in zip(clens[:-1], clens[:-1] + stop1_idx_corr[1:]):
        out[i:j] = 1
    return out
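To make the intended output of get_bin_arr concrete, here is a loop-free reference sketch that can be used to cross-check either helper. It is not part of the original answer; the function name bin_arr_reference and the sample inputs are assumptions for illustration. For groups laid out one after another, it marks the first min(limit, length) positions of each group with 1.

```python
import numpy as np

def bin_arr_reference(grplens, limits):
    # Hypothetical reference implementation, not from the original answer
    grplens = np.asarray(grplens)
    limits = np.asarray(limits)
    # Offset of each group's first element in the concatenated layout
    starts = np.cumsum(grplens) - grplens
    # Position of every element within its own group: 0, 1, ..., len-1
    within = np.arange(grplens.sum()) - np.repeat(starts, grplens)
    # An element gets a 1 while its in-group position is below the limit
    return (within < np.repeat(limits, grplens)).astype(int)

# Three groups of lengths 3, 5, 3 with limits 3, 2, 1
mask = bin_arr_reference([3, 5, 3], [3, 2, 1])
print(mask)

# A full group followed by a group whose limit is 0
edge = bin_arr_reference([2, 2], [2, 0])
print(edge)
```

It allocates a few temporary arrays of the full length, so it is meant as a correctness check rather than a speed claim.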

2) Main function:

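def out_C(A, selDict):
    # list(...) is needed on Python 3, where keys()/values() return views
    k = np.array(list(selDict.keys()))
    v = np.array(list(selDict.values()))
    unq, C = np.unique(A, return_counts=True)
    # Position of each dictionary key among the sorted unique labels
    # (assumes every key of selDict occurs in A)
    sidx3 = np.searchsorted(unq, k)
    lims = np.zeros(len(unq),dtype=int)
    lims[sidx3] = v
    bin_arr = get_bin_arr(C, lims)
    # A stable sort keeps rows with the same label in their original order
    sidx2 = A.argsort(kind='stable')
    out = bin_arr[argsort_unique(sidx2)]
    return out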

Sample run:

Original approach:

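def org_app(df, selDict):
    df['C'] = 0
    d = selDict.copy()  # work on a copy so the caller's limits are not consumed
    for i, r in df.iterrows():
        if d[r["A"]] > 0:
            d[r["A"]] -= 1
            # DataFrame.set_value was removed in pandas 1.0; .at is the replacement
            df.at[i, 'C'] = 1
    return df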

Case 1:

>>> df = pd.DataFrame({'A': 'foo bar foo bar res foo bar res foo foo res'.split()})
>>> selDict = {"foo":2, "bar":3, "res":1}
>>> org_app(df, selDict)
      A  C
0   foo  1
1   bar  1
2   foo  1
3   bar  1
4   res  1
5   foo  0
6   bar  1
7   res  0
8   foo  0
9   foo  0
10  res  0
>>> out_C(df.A.values, selDict)
array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0])

Case 2:

>>> selDict = {"foo":20, "bar":30, "res":10}
>>> org_app(df, selDict)
      A  C
0   foo  1
1   bar  1
2   foo  1
3   bar  1
4   res  1
5   foo  1
6   bar  1
7   res  1
8   foo  1
9   foo  1
10  res  1
>>> out_C(df.A.values, selDict)
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
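For comparison, the same result can probably be obtained with pandas alone by numbering the rows inside each group and comparing that number with the group's limit. This is a different technique from the NumPy answer above, offered as a minimal sketch: the function name assign_up_to_limit and its key parameter are hypothetical, and treating labels missing from the dictionary as limit 0 is an assumption (the original loop would raise a KeyError for them).

```python
import pandas as pd

def assign_up_to_limit(df, limits, key="A"):
    # Hypothetical helper, not from the original answer
    # 0-based running count of each row within its group, in row order
    rank = df.groupby(key).cumcount()
    # Per-row limit looked up from the dictionary; missing labels get 0 (assumption)
    cap = df[key].map(limits).fillna(0)
    # 1 for the first `limit` rows of each group, 0 afterwards
    return (rank < cap).astype(int)

df = pd.DataFrame({"A": "foo bar foo bar foo bar foo foo".split()})
df["C"] = assign_up_to_limit(df, {"foo": 2, "bar": 3})
print(df)
```

No timing is claimed here; whether this beats the NumPy version would need to be measured on the actual data.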
