Applying calculation per groups within R dataframe
Question
I have data like this:
object category country
495647 1 RUS
477462 2 GER
431567 3 USA
449136 1 RUS
367260 1 USA
495649 1 RUS
477461 3 GER
431562 3 USA
449133 2 RUS
367264 2 USA
...
where one object appears in various (category, country) pairs and countries share a single list of categories.
I'd like to add another column to that: a category weight per country, i.e. the number of objects appearing in a category within a country, normalized so that the weights sum to 1 within a country (summing only over unique (category, country) pairs).
I could do something like:
aggregate(df$object, list(df$category, df$country), length)
and then calculate the weight from there, but what's a more efficient and elegant way of doing that directly on the original data?
Desired example output:
object category country weight
495647 1 RUS .75
477462 2 GER .5
431567 3 USA .5
449136 1 RUS .75
367260 1 USA .25
495649 1 RUS .75
477461 3 GER .5
431562 3 USA .5
449133 2 RUS .25
367264 2 USA .25
...
The weights above sum to one within each country when summed over unique (category, country) pairs.
Recommended answer
Responding specifically with the final sentence in mind ("What's a more efficient and elegant way of doing that directly on the original data?"), it just so happens that data.table has a new feature for this.
install.packages("data.table", repos="http://R-Forge.R-project.org")
# Needed version 1.8.1 from R-Forge when this was written. := by group has since
# been released to CRAN, so install.packages("data.table") is enough today.
With the data in DT:
> DT[, countcat:=.N, by=list(country,category)] # add 'countcat' column
category country countcat
1: 1 RUS 3
2: 2 GER 1
3: 3 USA 2
4: 1 RUS 3
5: 1 USA 1
6: 1 RUS 3
7: 3 GER 1
8: 3 USA 2
9: 2 RUS 1
10: 2 USA 1
> DT[, weight:=countcat/.N, by=country] # add 'weight' column
category country countcat weight
1: 1 RUS 3 0.75
2: 2 GER 1 0.50
3: 3 USA 2 0.50
4: 1 RUS 3 0.75
5: 1 USA 1 0.25
6: 1 RUS 3 0.75
7: 3 GER 1 0.50
8: 3 USA 2 0.50
9: 2 RUS 1 0.25
10: 2 USA 1 0.25
:= adds a column by reference to the data and is an 'old' feature. The new feature is that it now works by group. .N is a symbol that holds the number of rows in each group.
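The two grouped assignments shown above can be reproduced end to end. The following is a minimal sketch, assuming data.table is installed from CRAN. The original answer does not show how DT was built, so the construction here is an assumption: it uses the ten sample rows from the desired output in the question. The final check is also an addition, included only to confirm the normalization.

```r
library(data.table)

# Assumed construction of DT from the ten sample rows in the question
DT <- data.table(
  object   = c(495647, 477462, 431567, 449136, 367260,
               495649, 477461, 431562, 449133, 367264),
  category = c(1, 2, 3, 1, 1, 1, 3, 3, 2, 2),
  country  = c("RUS", "GER", "USA", "RUS", "USA",
               "RUS", "GER", "USA", "RUS", "USA")
)

# Step 1: number of rows in each (country, category) pair
DT[, countcat := .N, by = list(country, category)]

# Step 2: divide by the number of rows in each country
DT[, weight := countcat / .N, by = country]

# Check (not in the original answer): over unique (country, category)
# pairs, the weights should sum to 1 within each country
check <- unique(DT[, list(country, category, weight)])[
  , list(total = sum(weight)), by = country]
print(check)
```

Because both assignments use :=, DT itself is modified in place and there is no need to assign the result back to a variable.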
These operations are memory efficient and should scale to large data, e.g. 1e8 or 1e9 rows.
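One way to see the "by reference" behaviour is to compare the table's memory address before and after a grouped assignment. This is a small sketch on a hypothetical three-row table, using data.table's address() helper. It only illustrates that the table is not copied, and says nothing about timings at 1e8 rows.

```r
library(data.table)

# Hypothetical toy table, for illustration only
DT <- data.table(country = c("RUS", "RUS", "GER"), category = c(1, 2, 2))

before <- address(DT)          # memory address of the table object
DT[, n := .N, by = country]    # add a column by reference, per group
after <- address(DT)

# TRUE when the assignment updated the existing table rather than a copy
same_object <- identical(before, after)
print(same_object)
```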
If you don't wish to keep the intermediate column countcat, just remove it afterwards. Again, this is an efficient operation that runs instantly regardless of the size of the table (by moving pointers internally).
> DT[,countcat:=NULL] # remove 'countcat' column
category country weight
1: 1 RUS 0.75
2: 2 GER 0.50
3: 3 USA 0.50
4: 1 RUS 0.75
5: 1 USA 0.25
6: 1 RUS 0.75
7: 3 GER 0.50
8: 3 USA 0.50
9: 2 RUS 0.25
10: 2 USA 0.25
>
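For comparison, the same weights can be computed without any package. The sketch below is not part of the original answer: it swaps in base R's ave() in place of data.table, and df is assumed to be a plain data frame holding the ten sample rows from the question. Unlike :=, this approach does not modify the data by reference, so it is better suited to small or medium data.

```r
# Assumed data frame built from the ten sample rows in the question
df <- data.frame(
  object   = c(495647, 477462, 431567, 449136, 367260,
               495649, 477461, 431562, 449133, 367264),
  category = c(1, 2, 3, 1, 1, 1, 3, 3, 2, 2),
  country  = c("RUS", "GER", "USA", "RUS", "USA",
               "RUS", "GER", "USA", "RUS", "USA"),
  stringsAsFactors = FALSE
)

idx <- seq_len(nrow(df))

# For each row, the number of rows sharing its (country, category) pair
pair_n <- ave(idx, df$country, df$category, FUN = length)

# For each row, the number of rows sharing its country
country_n <- ave(idx, df$country, FUN = length)

df$weight <- pair_n / country_n
print(df)
```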