Lookup value in dictionary between range pandas


Problem description

I have a "lookup" table formatted as such:

Min | Max | Val
  1 |  99 | "Principal"
100 | 199 | "Partner"
... | ... | ...

There is a CURRENT_POINTS series in my dataframe whose values fall between Min and Max (inclusive).

Question: how do I create a VAL column based on the above lookup table? My initial thought was to use df.lookup, but there are 800K rows in df, so the two tables are not equally sized.

Thanks in advance for your help!

Any ideas?

Recommended answer

I would use the pd.cut() method.

Assume you have the following DataFrames:

In [187]: lkp
Out[187]:
   Min  Max  Val
0    1   99  AAA
1  100  199  BBB
2  200  299  CCC
3  300  399  DDD

In [188]: df
Out[188]:
   CURRENT_POINTS
0              55
1              10
2              20
3             144
4             194
5             143
6             397
7             233
8             128
9             215

Using the cut() method we can produce a new column of category dtype, which might save a lot of memory:

In [189]: df['Val'] = pd.cut(df.CURRENT_POINTS,
     ...:                    bins=[0] + lkp[['Min','Max']].stack()[1::2].tolist(),
     ...:                    labels=lkp.Val.tolist())
     ...:

In [190]: df
Out[190]:
   CURRENT_POINTS  Val
0              55  AAA
1              10  AAA
2              20  AAA
3             144  BBB
4             194  BBB
5             143  BBB
6             397  DDD
7             233  CCC
8             128  BBB
9             215  CCC

In [191]: df.dtypes
Out[191]:
CURRENT_POINTS       int32
Val               category
dtype: object
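
For readers who want to reproduce the session above as a plain script, here is a minimal, self-contained sketch. The sample values are copied from the frames shown earlier. As a simplification (an assumption of this sketch, not part of the original answer), the bin edges are built from the Max column alone, which gives the same edges as the stack()[1::2] expression when the ranges are sorted and contiguous.

```python
import pandas as pd

# Lookup table and data, copied from the example above.
lkp = pd.DataFrame({
    "Min": [1, 100, 200, 300],
    "Max": [99, 199, 299, 399],
    "Val": ["AAA", "BBB", "CCC", "DDD"],
})
df = pd.DataFrame(
    {"CURRENT_POINTS": [55, 10, 20, 144, 194, 143, 397, 233, 128, 215]}
)

# Bin edges: one below the smallest Min, followed by every Max.
# pd.cut uses right-closed intervals by default: (0, 99], (99, 199], ...
bins = [int(lkp["Min"].iloc[0]) - 1] + lkp["Max"].tolist()

df["Val"] = pd.cut(df["CURRENT_POINTS"], bins=bins, labels=lkp["Val"].tolist())
print(df)
print(df.dtypes)
```

Because the intervals are closed on the right, integer points equal to a Max value (for example 99) land in the range that ends there, which matches the inclusive Min/Max semantics of the question.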

Category dtype can save a lot of memory:

In [192]: big = pd.concat([df] * 10**5, ignore_index=True)

In [193]: big.shape
Out[193]: (1000000, 2)

In [194]: big['str_col'] = 'AAA'

In [198]: big.dtypes
Out[198]:
CURRENT_POINTS       int32
Val               category
str_col             object
dtype: object

In [195]: big.memory_usage()
Out[195]:
Index                  80
CURRENT_POINTS    4000000
Val               1000032     # <--- `category` column takes 1 byte per row (plus 32 bytes overhead)
str_col           8000000

In [197]: big.head()
Out[197]:
   CURRENT_POINTS  Val str_col
0              55  AAA     AAA
1              10  AAA     AAA
2              20  AAA     AAA
3             144  BBB     AAA
4             194  BBB     AAA

NOTE: compare the memory usage of the category column Val with that of the str_col column (dtype: object). The figure shown for str_col counts only the 8-byte object pointers, because memory_usage() was called without deep=True.
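
The comparison can also be made with deep=True, which additionally counts the Python string objects behind an object column. The following is a rough sketch on a smaller, made-up Series (the names n, as_object and as_category are illustrative only); exact byte counts depend on the platform and pandas version, so none are quoted here.

```python
import pandas as pd

n = 100_000  # smaller than the 1,000,000 rows above, to keep the demo quick
labels = ["AAA", "BBB", "CCC", "DDD"]

# The same repeated labels stored two ways.
as_object = pd.Series(labels * (n // 4), dtype=object)
as_category = as_object.astype("category")

# Shallow usage of an object column: only the pointers to the strings.
shallow = as_object.memory_usage(index=False)
# Deep usage: pointers plus the string objects they refer to.
deep = as_object.memory_usage(index=False, deep=True)
# Category column: small integer codes plus the four distinct labels.
cat = as_category.memory_usage(index=False, deep=True)

print("object (shallow):", shallow)
print("object (deep):   ", deep)
print("category (deep): ", cat)
```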

Explanation:

bins:

In [199]: lkp[['Min','Max']]
Out[199]:
   Min  Max
0    1   99
1  100  199
2  200  299
3  300  399

In [200]: lkp[['Min','Max']].stack()
Out[200]:
0  Min      1
   Max     99
1  Min    100
   Max    199
2  Min    200
   Max    299
3  Min    300
   Max    399
dtype: int64

In [201]: lkp[['Min','Max']].stack()[1::2].tolist()
Out[201]: [99, 199, 299, 399]

In [202]: [0] + lkp[['Min','Max']].stack()[1::2].tolist()
Out[202]: [0, 99, 199, 299, 399]

labels:

In [203]: lkp.Val.tolist()
Out[203]: ['AAA', 'BBB', 'CCC', 'DDD']
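
To see how these bins and labels line up at the boundaries, the sketch below runs pd.cut on a few hand-picked edge values (the points Series is made up for illustration). It relies on the default right=True behaviour, where each interval excludes its left edge and includes its right edge, and on the fact that values outside all bins come back as NaN.

```python
import pandas as pd

bins = [0, 99, 199, 299, 399]
labels = ["AAA", "BBB", "CCC", "DDD"]

# Hand-picked boundary values: 0 and 400 lie outside every range.
points = pd.Series([0, 1, 99, 100, 399, 400])

result = pd.cut(points, bins=bins, labels=labels)
print(pd.DataFrame({"points": points, "Val": result}))
```

Here 0 and 400 are left unlabelled (NaN), so data containing out-of-range points may need a wider first or last bin, or a follow-up fillna step.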

NOTE: lkp must be sorted by ['Min', 'Max'] before using it for bins and labels.

Here is a small demo for sorting:

In [2]: lkp
Out[2]:
   Min  Max  Val
0  300  399  DDD
1  100  199  BBB
2    1   99  AAA
3  200  299  CCC

In [4]: lkp = lkp.sort_values(['Min','Max'])

In [5]: lkp
Out[5]:
   Min  Max  Val
2    1   99  AAA
1  100  199  BBB
3  200  299  CCC
0  300  399  DDD
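
Sorting alone does not guarantee that the ranges are contiguous, which the Max-based bin edges implicitly assume. As a hedged sketch (the contiguity check is an addition for illustration, not part of the original answer), one could sort the unsorted table from the demo and verify that each Min directly follows the previous Max before deriving bins and labels:

```python
import pandas as pd

# Unsorted lookup table, as in the sorting demo above.
lkp = pd.DataFrame({
    "Min": [300, 100, 1, 200],
    "Max": [399, 199, 99, 299],
    "Val": ["DDD", "BBB", "AAA", "CCC"],
})

lkp = lkp.sort_values(["Min", "Max"]).reset_index(drop=True)

# Each Min should be exactly one more than the previous Max; otherwise
# values falling in a gap would silently get the next range's label.
prev_max = lkp["Max"].iloc[:-1].to_numpy()
next_min = lkp["Min"].iloc[1:].to_numpy()
contiguous = bool((next_min == prev_max + 1).all())
assert contiguous, "lookup ranges have gaps or overlaps"

bins = [int(lkp["Min"].iloc[0]) - 1] + lkp["Max"].tolist()
labels = lkp["Val"].tolist()
print(bins)
print(labels)
```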
