How to speed up pandas row filtering by string matching?
Question
我经常需要通过 df [df ['col_name'] =='string_value'] $ c $过滤pandas dataframe
df
我想加快行选择操作,有没有一个快速的方法来做到这一点?
例如,
In [1]:df = mul_df(3000,2000,3).reset_index()
In [2]:timeit df [ df ['STK_ID'] =='A0003']
1个循环,最好是3:每循环1.52秒
$
mul_df()
是创建多级数据框的函数:
> >> mul_df(4,2,3)
COL000 COL001 COL002
STK_ID RPT_Date
A0000 B000 0.6399 0.0062 1.0022
B001 -0.2881 -2.0604 1.2481
A0001 B000 0.7070 -0.9539 - 0.5268
B001 0.8860 -0.5367 -2.4492
A0002 B000 -2.4738 0.9529 -0.9789
B001 0.1392 -1.0931 -0.2077
A0003 B000 -1.1377 0.5455 -0.2290
B001 1.0083 0.2746 -0.3934
以下是mul_df()的代码:
import itertools
import numpy as np
import pandas as pd
def mul_df(level1_rownum,level2_rownum,col_num,data_ty ='float32'):
'''创建多级数据框,例如:mul_df(4,2,6)'''
index_name = ['STK_ID','RPT_Date']
col_name = ['COL'+ str(x).zfill(3)for x in range(col_num)]
first_level_dt = [['A'+ str(x).zfill (4)] * level2_rownum for x在范围内(level1_rownum)]
first_level_dt = list(itertools.chain(* first_level_dt))#flatten the list
second_level_dt = ['B'+ str(x).zfill(3) (level2_rownum)] * level1_rownum
dt = pd.DataFrame(np.random.randn(level1_rownum * level2_rownum,col_num),columns = col_name,dtype = data_ty)
dt [index_name [0 ]] = first_level_dt
dt [index_name [1]] = second_level_dt
rst = dt.set_index(index_name,drop = True,inplace = False)
return rst
我一直希望将二进制搜索索引添加到DataFrame对象中。你可以采取自己的DIY方法排序,并自己做这个:
在[11]:df = df.sort ('STK_ID')#如果你确定它已经排序了,就跳过这个
在[12]:df ['STK_ID']。searchsorted('A0003','left')
[13]:6000
在[13]:df ['STK_ID']。searchsorted('A0003','right')
Out [13]:8000
In [14]:timeit df [6000:8000]
10000循环,最好3:每循环134μs
这很快,因为它总是检索视图并且不复制任何数据。
I often need to filter a pandas DataFrame df by df[df['col_name'] == 'string_value'], and I want to speed up this row selection operation. Is there a quick way to do that?
For example,
In [1]: df = mul_df(3000, 2000, 3).reset_index()
In [2]: timeit df[df['STK_ID'] == 'A0003']
1 loops, best of 3: 1.52 s per loop
Can the 1.52 s be shortened?
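For anyone who wants to reproduce the baseline without the full mul_df helper (defined further down), a minimal self-contained stand-in with a much smaller frame looks like this; the sizes here are illustrative, not the question's 6,000,000 rows:

```python
import numpy as np
import pandas as pd

# Small stand-in for the question's DataFrame: a repeated string key
# column plus one float data column.
n_keys, rows_per_key = 100, 20
df = pd.DataFrame({
    'STK_ID': np.repeat(['A' + str(i).zfill(4) for i in range(n_keys)],
                        rows_per_key),
    'COL000': np.random.randn(n_keys * rows_per_key).astype('float32'),
})

# The baseline row selection being timed in the question: a full
# boolean-mask scan over the string column.
subset = df[df['STK_ID'] == 'A0003']
```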
Note:
mul_df()
is a function that creates a multilevel DataFrame:
>>> mul_df(4,2,3)
COL000 COL001 COL002
STK_ID RPT_Date
A0000 B000 0.6399 0.0062 1.0022
B001 -0.2881 -2.0604 1.2481
A0001 B000 0.7070 -0.9539 -0.5268
B001 0.8860 -0.5367 -2.4492
A0002 B000 -2.4738 0.9529 -0.9789
B001 0.1392 -1.0931 -0.2077
A0003 B000 -1.1377 0.5455 -0.2290
B001 1.0083 0.2746 -0.3934
Below is the code of mul_df():
import itertools
import numpy as np
import pandas as pd

def mul_df(level1_rownum, level2_rownum, col_num, data_ty='float32'):
    ''' create multilevel dataframe, for example: mul_df(4,2,6)'''
    index_name = ['STK_ID', 'RPT_Date']
    col_name = ['COL' + str(x).zfill(3) for x in range(col_num)]
    first_level_dt = [['A' + str(x).zfill(4)] * level2_rownum for x in range(level1_rownum)]
    first_level_dt = list(itertools.chain(*first_level_dt))  # flatten the list
    second_level_dt = ['B' + str(x).zfill(3) for x in range(level2_rownum)] * level1_rownum
    dt = pd.DataFrame(np.random.randn(level1_rownum * level2_rownum, col_num),
                      columns=col_name, dtype=data_ty)
    dt[index_name[0]] = first_level_dt
    dt[index_name[1]] = second_level_dt
    rst = dt.set_index(index_name, drop=True, inplace=False)
    return rst
I have long wanted to add binary search indexes to DataFrame objects. You can take the DIY approach of sorting by the column and doing this yourself:
In [11]: df = df.sort_values('STK_ID')  # skip this if you're sure it's sorted; older pandas used df.sort
In [12]: df['STK_ID'].searchsorted('A0003', side='left')
Out[12]: 6000
In [13]: df['STK_ID'].searchsorted('A0003', side='right')
Out[13]: 8000
In [14]: timeit df[6000:8000]
10000 loops, best of 3: 134 µs per loop
This is fast because it always retrieves views and does not copy any data.
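The same sort-then-binary-search idea can be written as a self-contained sketch with current pandas (sort_values and Series.searchsorted); the key names mirror the question, but the sizes here are small and illustrative:

```python
import numpy as np
import pandas as pd

# Build a frame with a repeated, zero-padded string key, then shuffle
# it so the sort step below is actually needed.
keys = ['A' + str(i).zfill(4) for i in range(50)]
df = pd.DataFrame({
    'STK_ID': np.repeat(keys, 10),
    'COL000': np.random.randn(500).astype('float32'),
}).sample(frac=1, random_state=0)

# Sort once by the key, then locate the contiguous block for one key
# with two binary searches instead of a full boolean-mask scan.
df = df.sort_values('STK_ID').reset_index(drop=True)
lo = df['STK_ID'].searchsorted('A0003', side='left')
hi = df['STK_ID'].searchsorted('A0003', side='right')
block = df.iloc[lo:hi]  # positional slice of the sorted frame
```

The one-time sort is the price paid up front; every subsequent lookup is O(log n) for the two searchsorted calls plus a cheap positional slice.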