How to speed up pandas row filtering by string matching?


Question

I often need to filter a pandas DataFrame df with df[df['col_name']=='string_value'], and I want to speed up this row selection. Is there a quick way to do that?

For example,

In [1]: df = mul_df(3000,2000,3).reset_index()

In [2]: timeit df[df['STK_ID']=='A0003']
1 loops, best of 3: 1.52 s per loop

Can the 1.52 s be shortened?

Note:

mul_df() is a function that creates a multilevel DataFrame:

>>> mul_df(4,2,3)
                 COL000  COL001  COL002
STK_ID RPT_Date                        
A0000  B000      0.6399  0.0062  1.0022
       B001     -0.2881 -2.0604  1.2481
A0001  B000      0.7070 -0.9539 -0.5268
       B001      0.8860 -0.5367 -2.4492
A0002  B000     -2.4738  0.9529 -0.9789
       B001      0.1392 -1.0931 -0.2077
A0003  B000     -1.1377  0.5455 -0.2290
       B001      1.0083  0.2746 -0.3934

Below is the code of mul_df():

import itertools
import numpy as np
import pandas as pd

def mul_df(level1_rownum, level2_rownum, col_num, data_ty='float32'):
    ''' create multilevel dataframe, for example: mul_df(4,2,6)'''

    index_name = ['STK_ID','RPT_Date']
    col_name = ['COL'+str(x).zfill(3) for x in range(col_num)]

    first_level_dt = [['A'+str(x).zfill(4)]*level2_rownum for x in range(level1_rownum)]
    first_level_dt = list(itertools.chain(*first_level_dt)) #flatten the list
    second_level_dt = ['B'+str(x).zfill(3) for x in range(level2_rownum)]*level1_rownum

    dt = pd.DataFrame(np.random.randn(level1_rownum*level2_rownum, col_num), columns=col_name, dtype = data_ty)
    dt[index_name[0]] = first_level_dt
    dt[index_name[1]] = second_level_dt

    rst = dt.set_index(index_name, drop=True, inplace=False)
    return rst

Solution

I have long wanted to add binary search indexes to DataFrame objects. You can take the DIY approach of sorting by the column and doing this yourself:

In [11]: df = df.sort_values('STK_ID') # df.sort('STK_ID') on pandas before 0.17; skip this if you're sure it's sorted

In [12]: df['STK_ID'].searchsorted('A0003', 'left')
Out[12]: 6000

In [13]: df['STK_ID'].searchsorted('A0003', 'right')
Out[13]: 8000

In [14]: timeit df[6000:8000]
10000 loops, best of 3: 134 µs per loop

This is fast because the binary search finds the slice bounds in O(log n) comparisons, and the positional slice retrieves a view and does not copy any data.
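
The sort-then-searchsorted approach above can be wrapped in a small helper. This is only a minimal sketch, assuming a recent pandas in which Series.searchsorted returns an integer for a scalar value. The name select_sorted and the tiny sample frame are hypothetical and are not part of the original answer.

```python
import numpy as np
import pandas as pd


def select_sorted(df, column, value):
    """Return the rows where df[column] == value.

    Assumes df is already sorted by `column` (hypothetical helper).
    """
    keys = df[column]
    # Binary search for the first and one-past-last positions of `value`.
    lo = keys.searchsorted(value, side="left")
    hi = keys.searchsorted(value, side="right")
    # A positional slice avoids building a boolean mask over every row.
    return df.iloc[lo:hi]


# Small illustrative frame; the real case in the question has 6,000,000 rows.
df = pd.DataFrame({
    "STK_ID": ["A0002", "A0000", "A0001", "A0001", "A0003", "A0000"],
    "COL000": np.arange(6, dtype="float32"),
})

# Sort once up front; every later lookup reuses the sorted order.
df = df.sort_values("STK_ID", kind="stable").reset_index(drop=True)

fast = select_sorted(df, "STK_ID", "A0001")
slow = df[df["STK_ID"] == "A0001"]  # the boolean-mask filter from the question
missing = select_sorted(df, "STK_ID", "A9999")  # absent key gives an empty frame
```

The one-time sort costs O(n log n), so this only pays off when the same column is filtered repeatedly. If the column is not actually sorted, searchsorted returns meaningless positions without raising an error, so it may be worth checking df[column].is_monotonic_increasing first.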
