How to efficiently sample combinations of rows in a pandas DataFrame


Problem description

Let's say I have a pandas DataFrame with a certain number of columns and rows. I want to find the combination of 5 rows that together yield the highest total in a score column while their total in a constraint column stays within some threshold. Below is a little toy example to illustrate it better:

Below is a simplified example of my code, and I am wondering whether this "brute force" approach is a smart way to tackle the problem. Is there any chance to do it more efficiently, either with other Python libraries or with tricks that make it run faster? (I thought about Cython, but I think itertools is already implemented in C, so there wouldn't be much benefit.) Also, I wouldn't know how to use multiprocessing here, since itertools returns lazy iterators. I would welcome any discussion and ideas!

Thanks!

Sorry, I forgot to mention that there is a second constraint: the combination of rows has to meet certain category criteria, e.g.:

  • 1 row from category a
  • 2 rows from category b
  • 2 rows from category c

So, to summarize the problem: I want to find a combination of k rows that maximizes the score s, given that the k rows belong to certain categories and together don't exceed a certain threshold in a constraint column.
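
The summary above can be written as a small integer program. The symbols are assumptions introduced here for illustration and do not appear in the original post: $x_i \in \{0, 1\}$ marks whether row $i$ is selected, $s_i$ is its score, $c_i$ is its value in the constraint column, $T$ is the threshold, $G_g$ is the set of rows in category $g$, and $k_g$ is the number of rows required from that category (here $k_a = 1$, $k_b = 2$, $k_c = 2$, so $k = 5$).

```latex
\begin{aligned}
\max_{x}\quad & \sum_{i} s_i x_i \\
\text{s.t.}\quad & \sum_{i} c_i x_i \le T, \\
& \sum_{i \in G_g} x_i = k_g \quad \text{for each category } g, \\
& x_i \in \{0, 1\}.
\end{aligned}
```

Written this way, the task is a knapsack-type problem with an extra count requirement per category.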

from itertools import chain, combinations, product

# based on the suggested answer:
# sort by best score per constraint ratio:
r = df['score_column'] / df['constraint_column']
r = r.sort_values(ascending=False)  # Series.sort was removed from pandas
df = df.loc[r.index]                # .ix was removed; use label-based .loc


df_a = df[df['col1'] == some_criterion] # rows from category a
df_b = df[df['col2'] == some_criterion] # rows from category b
df_c = df[df['col3'] == some_criterion] # rows from category c

score = 0.0

for i in product(
            combinations(df_a.index, r=1), 
            combinations(df_b.index, r=2), 
            combinations(df_c.index, r=2)):

    # flatten the three index tuples; .loc needs a list, not a set
    indexes = list(set(chain.from_iterable(i)))

    df_cur = df.loc[indexes]

    if df_cur['constraint_column'].values.sum() > some_threshold:
        continue


    new_score = df_cur['score_column'].values.sum()
    if new_score > score:
        score = new_score


    # based on the suggested answer:
    # break here, since it can't get any better if the threshold is exactly
    # matched since we sorted by the best score/constraint ratio previously.

    if df_cur['constraint_column'].values.sum() == some_threshold:
        break 
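
The loop above builds a small DataFrame for every candidate combination, which is likely where most of the time goes. One possible way to keep the exhaustive search but move the arithmetic into NumPy is sketched below. It is a sketch under stated assumptions, not the asker's method: the function name `best_combination`, the `quotas` dictionary, and a single `category` column (instead of `col1`, `col2`, `col3`) are hypothetical. It sums costs and scores once per category-level combination and then combines the categories with `np.add.outer`.

```python
from functools import reduce
from itertools import combinations

import numpy as np
import pandas as pd


def best_combination(df, quotas, threshold):
    """Exhaustive search over row combinations that satisfy per-category quotas.

    Assumes df has the columns 'category', 'constraint_column', 'score_column'
    and that quotas maps a category label to the number of rows required.
    Returns (best_score, row_labels), or (None, []) if nothing is feasible.
    """
    per_category = []
    for label, k in quotas.items():
        rows = df[df["category"] == label]
        combos = list(combinations(range(len(rows)), k))
        idx = np.array(combos, dtype=int).reshape(len(combos), k)
        per_category.append((
            rows.index.to_numpy()[idx],                             # row labels
            rows["constraint_column"].to_numpy()[idx].sum(axis=1),  # cost per combo
            rows["score_column"].to_numpy()[idx].sum(axis=1),       # score per combo
        ))

    # One array axis per category: entry [i, j, ...] is the total for picking
    # combo i from the first category, combo j from the second, and so on.
    cost = reduce(np.add.outer, [c for _, c, _ in per_category])
    score = reduce(np.add.outer, [s for _, _, s in per_category])

    feasible = cost <= threshold
    if not feasible.any():
        return None, []

    masked = np.where(feasible, score, -np.inf)
    pick = np.unravel_index(np.argmax(masked), masked.shape)
    labels = [lab for (labs, _, _), i in zip(per_category, pick)
              for lab in labs[i].tolist()]
    return float(masked[pick]), labels


# Hypothetical toy data, only to show the call.
toy = pd.DataFrame({
    "category": ["a", "a", "b", "b", "b", "c", "c", "c"],
    "constraint_column": [2, 3, 1, 2, 3, 1, 2, 4],
    "score_column": [5, 9, 2, 4, 7, 1, 3, 8],
})
best_score, best_rows = best_combination(toy, {"a": 1, "b": 2, "c": 2}, threshold=10)
```

The full grid of totals is held in memory, so its size is the product of the number of combinations per category. For large categories this grows quickly, and processing the grid in chunks would be the natural next step.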

Recommended answer

I think you can solve this by greedily taking the best rows according to a "score per constraint" metric:

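constraint = 6  # remaining budget; use whatever value you want here
df['s_per_c'] = df.score / df.constraint
# DataFrame.sort was removed from pandas; sort_values replaces it
df.sort_values('s_per_c', inplace=True, ascending=False)

total = 0
for i, r in df.iterrows():
    if r.constraint > constraint:
        continue  # this row no longer fits in the remaining budget
    constraint -= r.constraint
    total += r.score
    if constraint == 0:
        break  # budget fully used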

My logic here is that every time I take a score, I want to make sure that I can afford it ("constraint") and that I'm getting the best bang for my buck ("s_per_c").
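
One caveat worth keeping in mind: picking rows by score-per-constraint ratio is a greedy heuristic, and for whole rows (no fractions) it is not guaranteed to find the best total. The sketch below wraps the loop above in a hypothetical helper, `greedy_pick` (the name and the two-row toy data are assumptions for illustration), and runs it on a case where the row with the best ratio blocks a better choice.

```python
import pandas as pd


def greedy_pick(df, budget):
    """Greedy heuristic: take rows by descending score/constraint while they fit.

    Assumes columns 'score' and 'constraint' with strictly positive constraints.
    Returns (total_score, chosen_row_labels).
    """
    order = (df["score"] / df["constraint"]).sort_values(ascending=False).index
    chosen, total = [], 0
    for label in order:
        cost = df.at[label, "constraint"]
        if cost > budget:
            continue  # row does not fit in what is left of the budget
        budget -= cost
        total += df.at[label, "score"]
        chosen.append(label)
    return total, chosen


# Row 0 has the better ratio (3/2 = 1.5) but row 1 has the better score (4).
toy = pd.DataFrame({"score": [3, 4], "constraint": [2, 3]})
greedy_total, greedy_rows = greedy_pick(toy, budget=3)
```

With a budget of 3, the greedy pass takes row 0 first and then cannot afford row 1, while taking row 1 alone would score higher. The ratio ordering is therefore a fast way to get a good candidate, and an exhaustive or exact search is still needed when the true optimum matters.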
