确定占总数 90% 的记录 [英] identify records that make up 90% of total
问题描述
我有一份报告,其中确定了总体数字/趋势的关键驱动因素.我想自动化该功能,以便能够根据该数字的百分比列出/识别基础记录.例如,如果南部(地区)小部件销售的净变化为 -5,000.00,但有正面和负面影响 - 我想确定至少约 90% (-4,500.00) 的所有潜在驱动因素 -从最大到最小总共 5,000.00.
I have a report that identifies key drivers of an overall number/trend. I would like to automate the functionality to be able to list/identify the underlying records based on a percentage of that number. For example if the net change for sales of widgets in the south(region) is -5,000.00, but there are positives and negatives- I would like to identify at least ~90% (-4,500.00) of all underlying drivers that make up that -5,000.00 total from largest to smallest.
data
region OfficeLocation sales
South 1 -500
South 2 300
South 3 -1000
South 4 -2000
South 5 300
South 6 -700
South 7 -400
South 8 800
North 11 300
North 22 -400
North 33 1000
North 44 800
North 55 900
North 66 -800
对于 South,总销售额为 -3200.我想确定/列出构成此移动至少 90% 的驱动因素(按降序排列)-因此 -3200 的 90% 将是 2880.以及 South office 3 & 的定向移动/销售;4 = -3000 将是此请求的输出:
for South, the total sales is -3200. I would like to identify/list the drivers that make up at least 90% of this move(in descending order)- so 90% of -3200 would be 2880. And the directional moves/sales for South office 3 & 4 = -3000 would be the output for this request:
region OfficeLocation sales
South 3 -1000
South 4 -2000
对于 North,总销售额为 +1800.我想确定/列出构成此移动至少 90% 的驱动因素(按降序排列)-因此 1800 的至少 90% 将是 1620.以及 South office 3 & 的定向移动/销售;4 = -3000 将是此请求的输出:
for North, the total sales is +1800. I would like to identify/list the drivers that make up at least 90% of this move(in descending order)- so at least 90% of 1800 would be 1620. And the directional moves/sales for South office 3 & 4 = -3000 would be the output for this request:
区域办事处位置销售北 33 1000北 44 800
region OfficeLocation sales North 33 1000 North 44 800
上面的数据集有南/北的积极和消极趋势.您能提供的任何帮助将不胜感激!
Dataset above has both positive and negative trends for south/north. Any help you can provide would be greatly appreciated!
推荐答案
正如评论中提到的,不清楚在 'North'
情况下该怎么做,因为那里的总和为正,但忽略这一点,您可以执行以下操作:
As mentioned in the comment, it isn't clear what to do in the 'North'
case as the sum is positive there, but ignoring that, you could do something like the following:
In [200]: df[df.groupby('region').sales.apply(lambda g: g <= g.loc[(g.sort_values().cumsum() > 0.9*g.sum()).idxmin()])]
Out[200]:
region OfficeLocation sales
2 South 3 -1000
3 South 4 -2000
13 North 66 -800
如果在积极的情况下,您想找到尽可能少的元素,它们共同具有占销售额总和的 90% 的特性,则可以采用上述解决方案:
If, in the positive case, you want to find as few elements as possible that together have the property that they make up 90% of the sum of the sales, the above solution can be adopted as follows:
def is_driver(group):
s = group.sum()
if s > 0:
group *= -1
s *= -1
a = group.sort_values().cumsum() > 0.9*s
return group <= group.loc[a.idxmin()]
In [168]: df[df.groupby('region').sales.apply(is_driver)]
Out[168]:
region OfficeLocation sales
2 South 3 -1000
3 South 4 -2000
10 North 33 1000
12 North 55 900
注意,在平局的情况下,只挑出一个元素.
Note that in the case of a tie, only one element is picked out.
这篇关于确定占总数 90% 的记录的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!