Python Pandas自连接,用于合并笛卡尔积,以产生所有组合和总和 [英] Python Pandas self join for merge cartesian product to produce all combinations and sum
问题描述
我是Python的新手,它似乎具有很大的灵活性,并且比传统的RDBMS系统快.
I am brand new to Python, seems like it has a lot of flexibility and is faster than traditional RDBMS systems.
通过一个非常简单的过程来创建随机的幻想团队.我来自RDBMS背景(Oracle SQL),对于这种数据处理来说似乎并不是最佳选择.
Working on a very simple process to create random fantasy teams. I come from an RDBMS background (Oracle SQL) and that does not seem to be optimal for this data processing.
我使用从csv文件读取的熊猫制作了一个数据框,现在有一个包含两列的简单数据框-播放器,薪水:
I made a dataframe using pandas read from csv file and now have a simple dataframe with two columns -- Player, Salary:
` Name Salary
0 Jason Day 11700
1 Dustin Johnson 11600
2 Rory McIlroy 11400
3 Jordan Spieth 11100
4 Henrik Stenson 10500
5 Phil Mickelson 10200
6 Justin Rose 9800
7 Adam Scott 9600
8 Sergio Garcia 9400
9 Rickie Fowler 9200`
我想通过python(pandas)做的事情是产生6名球员的所有组合,这些球员的薪水在一定的45000-50000之间.
What I am trying to do via python (pandas) is produce all combinations of 6 players which salary is between a certain amount 45000 -- 50000.
在查找python选项时,我发现itertools组合很有趣,但是它会生成大量的组合列表,而不会过滤薪水总和.
In looking up python options, I found the itertools combination interesting, but it would result a massive list of combinations without filtering the sum of salary.
在传统的SQL中,我会使用SUM进行大规模的合并笛卡尔连接,但随后我会把播放器放在不同的位置.
In traditional SQL, I would do a massive merge cartesian join w/ SUM, but then I get the players in different spots..
然后是A,B,C,C,B,A.
Such as A, B, C then, C, B, A..
我无法很好运行的传统SQL是这样的:
My traditional SQL which doesn't work well enough is something like this:
` SELECT distinct
ONE.name AS "1",
TWO.name AS "2",
THREE.name AS "3",
FOUR.name AS "4",
FIVE.name AS "5",
SIX.name AS "6",
sum(one.salary + two.salary + three.salary + four.salary + five.salary + six.salary) as salary
FROM
nl.pgachamp2 ONE,nl.pgachamp2 TWO,nl.pgachamp2 THREE, nl.pgachamp2 FOUR,nl.pgachamp2 FIVE,nl.pgachamp2 SIX
where ONE.name != TWO.name
and ONE.name != THREE.name
and one.name != four.name
and one.name != five.name
and TWO.name != THREE.name
and TWO.name != four.name
and two.name != five.name
and TWO.name != six.name
and THREE.name != four.name
and THREE.name != five.name
and three.name != six.name
and five.name != six.name
and four.name != six.name
and four.name != five.name
and one.name != six.name
group by ONE.name, TWO.name, THREE.name, FOUR.name, FIVE.name, SIX.name`
在Pandas/Python中有没有办法做到这一点?
Is there a way to do this in Pandas/Python?
任何可以指向的文档都很棒!
Any documentation that can be pointed to would be great!
推荐答案
这是使用简单算法的非熊猫解决方案.它从按薪水排序的球员列表中递归地产生组合.这样一来,它就可以跳过生成超过薪金上限的组合.
Here's a non-Pandas solution using a simple algorithm. It generates combinations recursively from a list of players sorted by salary. This lets it skip generating combinations that exceed the salary cap.
正如piRSquared所提到的,在问题中所述的薪水限额之内,没有6人的球队,所以我选择了限额以产生少量球队.
As piRSquared mentions, there are no teams of 6 that fall within the salary limits stated in the question, so I chose limits to generate a small number of teams.
#!/usr/bin/env python3
''' Limited combinations
Generate combinations of players whose combined salaries fall within given limits
See http://stackoverflow.com/q/38636460/4014959
Written by PM 2Ring 2016.07.28
'''
data = '''\
0 Jason Day 11700
1 Dustin Johnson 11600
2 Rory McIlroy 11400
3 Jordan Spieth 11100
4 Henrik Stenson 10500
5 Phil Mickelson 10200
6 Justin Rose 9800
7 Adam Scott 9600
8 Sergio Garcia 9400
9 Rickie Fowler 9200
'''
data = [s.split() for s in data.splitlines()]
all_players = [(' '.join(u[1:-1]), int(u[-1])) for u in data]
all_players.sort(key=lambda t: t[1])
for i, row in enumerate(all_players):
print(i, row)
print('- '*40)
def choose_teams(free, num, team=(), value=0):
num -= 1
for i, p in enumerate(free):
salary = all_players[p][1]
newvalue = value + salary
if newvalue <= hi:
newteam = team + (p,)
if num == 0:
if newvalue >= lo:
yield newteam, newvalue
else:
yield from choose_teams(free[i+1:], num, newteam, newvalue)
else:
break
#Salary limits
lo, hi = 55000, 60500
#Indices of players that can be chosen for a team
free = tuple(range(len(all_players)))
for i, (t, s) in enumerate(choose_teams(free, 6), 1):
team = [all_players[p] for p in t]
names, sals = zip(*team)
assert sum(sals) == s
print(i, t, names, s)
输出
0 ('Rickie Fowler', 9200)
1 ('Sergio Garcia', 9400)
2 ('Adam Scott', 9600)
3 ('Justin Rose', 9800)
4 ('Phil Mickelson', 10200)
5 ('Henrik Stenson', 10500)
6 ('Jordan Spieth', 11100)
7 ('Rory McIlroy', 11400)
8 ('Dustin Johnson', 11600)
9 ('Jason Day', 11700)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1 (0, 1, 2, 3, 4, 5) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Phil Mickelson', 'Henrik Stenson') 58700
2 (0, 1, 2, 3, 4, 6) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Phil Mickelson', 'Jordan Spieth') 59300
3 (0, 1, 2, 3, 4, 7) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Phil Mickelson', 'Rory McIlroy') 59600
4 (0, 1, 2, 3, 4, 8) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Phil Mickelson', 'Dustin Johnson') 59800
5 (0, 1, 2, 3, 4, 9) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Phil Mickelson', 'Jason Day') 59900
6 (0, 1, 2, 3, 5, 6) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Henrik Stenson', 'Jordan Spieth') 59600
7 (0, 1, 2, 3, 5, 7) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Henrik Stenson', 'Rory McIlroy') 59900
8 (0, 1, 2, 3, 5, 8) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Henrik Stenson', 'Dustin Johnson') 60100
9 (0, 1, 2, 3, 5, 9) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Henrik Stenson', 'Jason Day') 60200
10 (0, 1, 2, 3, 6, 7) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Jordan Spieth', 'Rory McIlroy') 60500
11 (0, 1, 2, 4, 5, 6) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Phil Mickelson', 'Henrik Stenson', 'Jordan Spieth') 60000
12 (0, 1, 2, 4, 5, 7) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Phil Mickelson', 'Henrik Stenson', 'Rory McIlroy') 60300
13 (0, 1, 2, 4, 5, 8) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Phil Mickelson', 'Henrik Stenson', 'Dustin Johnson') 60500
14 (0, 1, 3, 4, 5, 6) ('Rickie Fowler', 'Sergio Garcia', 'Justin Rose', 'Phil Mickelson', 'Henrik Stenson', 'Jordan Spieth') 60200
15 (0, 1, 3, 4, 5, 7) ('Rickie Fowler', 'Sergio Garcia', 'Justin Rose', 'Phil Mickelson', 'Henrik Stenson', 'Rory McIlroy') 60500
16 (0, 2, 3, 4, 5, 6) ('Rickie Fowler', 'Adam Scott', 'Justin Rose', 'Phil Mickelson', 'Henrik Stenson', 'Jordan Spieth') 60400
如果您使用的旧版本的Python不支持yield from
语法,则可以替换
If you're using an older version of Python that doesn't support the yield from
syntax you can replace
yield from choose_teams(free[i+1:], num, newteam, newvalue)
使用
for t, v in choose_teams(free[i+1:], num, newteam, newvalue):
yield t, v
这篇关于Python Pandas自连接,用于合并笛卡尔积,以产生所有组合和总和的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!