Python Pandas自连接,用于合并笛卡尔积,以产生所有组合和总和 [英] Python Pandas self join for merge cartesian product to produce all combinations and sum

查看:184
本文介绍了Python Pandas自连接,用于合并笛卡尔积,以产生所有组合和总和的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我是Python的新手,它似乎具有很大的灵活性,并且比传统的RDBMS系统快.

I am brand new to Python, seems like it has a lot of flexibility and is faster than traditional RDBMS systems.

通过一个非常简单的过程来创建随机的幻想团队.我来自RDBMS背景(Oracle SQL),对于这种数据处理来说似乎并不是最佳选择.

Working on a very simple process to create random fantasy teams. I come from an RDBMS background (Oracle SQL) and that does not seem to be optimal for this data processing.

我使用从csv文件读取的熊猫制作了一个数据框,现在有一个包含两列的简单数据框-播放器,薪水:

I made a dataframe using pandas read from csv file and now have a simple dataframe with two columns -- Player, Salary:

`                    Name  Salary
0              Jason Day   11700
1         Dustin Johnson   11600
2           Rory McIlroy   11400
3          Jordan Spieth   11100
4         Henrik Stenson   10500
5         Phil Mickelson   10200
6            Justin Rose    9800
7             Adam Scott    9600
8          Sergio Garcia    9400
9          Rickie Fowler    9200`

我想通过python(pandas)做的事情是产生6名球员的所有组合,这些球员的薪水在一定的45000-50000之间.

What I am trying to do via python (pandas) is produce all combinations of 6 players which salary is between a certain amount 45000 -- 50000.

在查找python选项时,我发现itertools组合很有趣,但是它会生成大量的组合列表,而不会过滤薪水总和.

In looking up python options, I found the itertools combination interesting, but it would result a massive list of combinations without filtering the sum of salary.

在传统的SQL中,我会使用SUM进行大规模的合并笛卡尔连接,但随后我会把播放器放在不同的位置.

In traditional SQL, I would do a massive merge cartesian join w/ SUM, but then I get the players in different spots..

然后是A,B,C,C,B,A.

Such as A, B, C then, C, B, A..

我无法很好运行的传统SQL是这样的:

My traditional SQL which doesn't work well enough is something like this:

` SELECT distinct
ONE.name AS "1", 
  TWO.name AS "2",
    THREE.name AS "3",
      FOUR.name AS "4", 
  FIVE.name AS "5", 
  SIX.name AS "6",
   sum(one.salary + two.salary + three.salary + four.salary + five.salary + six.salary) as salary
  FROM 
  nl.pgachamp2 ONE,nl.pgachamp2 TWO,nl.pgachamp2 THREE, nl.pgachamp2 FOUR,nl.pgachamp2 FIVE,nl.pgachamp2 SIX
 where ONE.name != TWO.name
 and ONE.name != THREE.name
 and one.name != four.name
 and one.name != five.name
 and TWO.name != THREE.name
 and TWO.name != four.name
 and two.name != five.name
 and TWO.name != six.name
 and THREE.name != four.name
 and THREE.name != five.name
 and three.name != six.name
 and five.name != six.name
 and four.name != six.name
 and four.name != five.name
 and one.name != six.name
 group by ONE.name, TWO.name, THREE.name, FOUR.name, FIVE.name, SIX.name`

在Pandas/Python中有没有办法做到这一点?

Is there a way to do this in Pandas/Python?

任何可以指向的文档都很棒!

Any documentation that can be pointed to would be great!

推荐答案

这是使用简单算法的非熊猫解决方案.它从按薪水排序的球员列表中递归地产生组合.这样一来,它就可以跳过生成超过薪金上限的组合.

Here's a non-Pandas solution using a simple algorithm. It generates combinations recursively from a list of players sorted by salary. This lets it skip generating combinations that exceed the salary cap.

正如piRSquared所提到的,在问题中所述的薪水限额之内,没有6人的球队,所以我选择了限额以产生少量球队.

As piRSquared mentions, there are no teams of 6 that fall within the salary limits stated in the question, so I chose limits to generate a small number of teams.

#!/usr/bin/env python3

''' Limited combinations

    Generate combinations of players whose combined salaries fall within given limits

    See http://stackoverflow.com/q/38636460/4014959

    Written by PM 2Ring 2016.07.28
'''

data = '''\
0              Jason Day   11700
1         Dustin Johnson   11600
2           Rory McIlroy   11400
3          Jordan Spieth   11100
4         Henrik Stenson   10500
5         Phil Mickelson   10200
6            Justin Rose    9800
7             Adam Scott    9600
8          Sergio Garcia    9400
9          Rickie Fowler    9200
'''

data = [s.split() for s in data.splitlines()]
all_players = [(' '.join(u[1:-1]), int(u[-1])) for u in data]
all_players.sort(key=lambda t: t[1])
for i, row in enumerate(all_players):
    print(i, row)
print('- '*40)

def choose_teams(free, num, team=(), value=0):
    num -= 1
    for i, p in enumerate(free):
        salary = all_players[p][1]
        newvalue = value + salary
        if newvalue <= hi:
            newteam = team + (p,)
            if num == 0:
                if newvalue >= lo:
                    yield newteam, newvalue
            else:
                yield from choose_teams(free[i+1:], num, newteam, newvalue)
        else:
            break

#Salary limits
lo, hi = 55000, 60500

#Indices of players that can be chosen for a team 
free = tuple(range(len(all_players)))

for i, (t, s) in enumerate(choose_teams(free, 6), 1):
    team = [all_players[p] for p in t]
    names, sals = zip(*team)
    assert sum(sals) == s
    print(i, t, names, s)

输出

0 ('Rickie Fowler', 9200)
1 ('Sergio Garcia', 9400)
2 ('Adam Scott', 9600)
3 ('Justin Rose', 9800)
4 ('Phil Mickelson', 10200)
5 ('Henrik Stenson', 10500)
6 ('Jordan Spieth', 11100)
7 ('Rory McIlroy', 11400)
8 ('Dustin Johnson', 11600)
9 ('Jason Day', 11700)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
1 (0, 1, 2, 3, 4, 5) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Phil Mickelson', 'Henrik Stenson') 58700
2 (0, 1, 2, 3, 4, 6) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Phil Mickelson', 'Jordan Spieth') 59300
3 (0, 1, 2, 3, 4, 7) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Phil Mickelson', 'Rory McIlroy') 59600
4 (0, 1, 2, 3, 4, 8) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Phil Mickelson', 'Dustin Johnson') 59800
5 (0, 1, 2, 3, 4, 9) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Phil Mickelson', 'Jason Day') 59900
6 (0, 1, 2, 3, 5, 6) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Henrik Stenson', 'Jordan Spieth') 59600
7 (0, 1, 2, 3, 5, 7) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Henrik Stenson', 'Rory McIlroy') 59900
8 (0, 1, 2, 3, 5, 8) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Henrik Stenson', 'Dustin Johnson') 60100
9 (0, 1, 2, 3, 5, 9) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Henrik Stenson', 'Jason Day') 60200
10 (0, 1, 2, 3, 6, 7) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Jordan Spieth', 'Rory McIlroy') 60500
11 (0, 1, 2, 4, 5, 6) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Phil Mickelson', 'Henrik Stenson', 'Jordan Spieth') 60000
12 (0, 1, 2, 4, 5, 7) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Phil Mickelson', 'Henrik Stenson', 'Rory McIlroy') 60300
13 (0, 1, 2, 4, 5, 8) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Phil Mickelson', 'Henrik Stenson', 'Dustin Johnson') 60500
14 (0, 1, 3, 4, 5, 6) ('Rickie Fowler', 'Sergio Garcia', 'Justin Rose', 'Phil Mickelson', 'Henrik Stenson', 'Jordan Spieth') 60200
15 (0, 1, 3, 4, 5, 7) ('Rickie Fowler', 'Sergio Garcia', 'Justin Rose', 'Phil Mickelson', 'Henrik Stenson', 'Rory McIlroy') 60500
16 (0, 2, 3, 4, 5, 6) ('Rickie Fowler', 'Adam Scott', 'Justin Rose', 'Phil Mickelson', 'Henrik Stenson', 'Jordan Spieth') 60400


如果您使用的旧版本的Python不支持yield from语法,则可以替换


If you're using an older version of Python that doesn't support the yield from syntax you can replace

yield from choose_teams(free[i+1:], num, newteam, newvalue)

使用

for t, v in choose_teams(free[i+1:], num, newteam, newvalue):
    yield t, v

这篇关于Python Pandas自连接,用于合并笛卡尔积,以产生所有组合和总和的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆