Search for an optimized selection in a pandas.DataFrame


Question

What is the most efficient way to select some rows of a pandas.DataFrame containing N columns (strings, integers and floats), according to the following selection:

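  • Go through all combinations of values in 2 columns (integers).
  • For each distinct combination, keep only one row (with all its columns): the one with the minimum value in a third column (float).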

For instance, for combinations of (titi, tutu), with tete as the third column:

  toto  tata  titi  tutu  tete
0    a    18   600   700   4.5
1    b    18   600   800  10.1
2    c    18   600   700  12.6
3    d     3   300   400   3.4
4    a    16   900  1000   6.0
5    a    18   600   800  10.1
6    c     3   300   400   3.0
7    a    16   900  1000   6.0

must give:

  toto  tata  titi  tutu  tete
0    a    18   600   700   4.5
1    b    18   600   800  10.1
4    a    16   900  1000   6.0
6    c     3   300   400   3.0

For the moment, I started with the following code:

import pandas
indicesToKeep = []
indicesToRemove = []
reader = pandas.read_csv('/Users/steph/work/perso/sof/test.csv')
columns = reader.columns
for i in reader['titi'].unique():
    #temp = reader[[:]].query('titi == i')#does not work !
    temp = reader.loc[(reader.titi == i),columns]
    for j in temp['tutu'].unique():
        temp2 = temp.loc[(temp.tutu == j),columns]
        minimum = min(temp2.tete)
        indicesToKeep.append(min(
                temp2[temp2.tete==minimum].index.tolist()))
################
# compute the complement of indicesToKeep
#but I don't remember the pythonic syntax
for i in range(len(reader)):
    if i not in indicesToKeep:
        indicesToRemove.append(i)
############################
reader = reader.drop(indicesToRemove)            

Notes:

  • I am sure this is not optimized.
  • I use the old "loc" method because I don't know how to use "query".
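
On the two side points raised above (the failing "query" call and the complement of indicesToKeep), here is a minimal sketch. It assumes the sample table from the question is built inline rather than read from test.csv, and the names indices_to_keep and indices_to_remove are illustrative. Inside query(), a local Python variable is referenced with the @ prefix; the complement of a list of index labels can be taken with Index.difference.

```python
import pandas as pd

# Sample data copied from the question (stands in for test.csv).
reader = pd.DataFrame({
    "toto": list("abcdaaca"),
    "tata": [18, 18, 18, 3, 16, 18, 3, 16],
    "titi": [600, 600, 600, 300, 900, 600, 300, 900],
    "tutu": [700, 800, 700, 400, 1000, 800, 400, 1000],
    "tete": [4.5, 10.1, 12.6, 3.4, 6.0, 10.1, 3.0, 6.0],
})

# Inside query(), prefix a local Python variable with @.
i = 600
temp = reader.query("titi == @i")

# Complement of a list of index labels, without an explicit loop.
indices_to_keep = [0, 1, 4, 6]  # illustrative values: the rows expected above
indices_to_remove = reader.index.difference(indices_to_keep)
result = reader.drop(indices_to_remove)
```

Since the kept labels are already known, reader.loc[indices_to_keep] would give the same rows without computing the complement at all.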

Recommended answer

IIUC, sort_values + drop_duplicates does this. If you are using pandas, try not to use for loops: most of the time they are slower than the vectorized method.

df.sort_values('tete').drop_duplicates(['titi','tutu']).sort_index()
Out[583]: 
  toto  tata  titi  tutu  tete
0    a    18   600   700   4.5
1    b    18   600   800  10.1
4    a    16   900  1000   6.0
6    c     3   300   400   3.0
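
For a self-contained check, the sketch below rebuilds the sample table inline (an assumption, since the answer does not show how df was created). It also passes kind="mergesort" to sort_values, which is an addition to the answer's one-liner: the sample has ties on tete (rows 1 and 5, rows 4 and 7), and a stable sort makes sure the row with the lowest index is the one kept. A groupby/idxmin variant is shown as a possible alternative, not as the answer's method.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "toto": list("abcdaaca"),
    "tata": [18, 18, 18, 3, 16, 18, 3, 16],
    "titi": [600, 600, 600, 300, 900, 600, 300, 900],
    "tutu": [700, 800, 700, 400, 1000, 800, 400, 1000],
    "tete": [4.5, 10.1, 12.6, 3.4, 6.0, 10.1, 3.0, 6.0],
})

# Sort by tete with a stable algorithm so tied rows keep their original order,
# keep the first row of each (titi, tutu) pair, then restore the index order.
result = (
    df.sort_values("tete", kind="mergesort")
      .drop_duplicates(["titi", "tutu"])
      .sort_index()
)

# Alternative sketch: idxmin gives the label of the first minimum in each group.
alt = df.loc[df.groupby(["titi", "tutu"])["tete"].idxmin()].sort_index()

print(result)
```

Both variants avoid Python-level loops; the sort-based one is usually the simpler to read, while the groupby one states the "minimum per group" intent more directly.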
