Summarize a data.table with unreliable data

Views: 84  Published: 2017/3/12 12:22:48  Tags: r data.table

Question

I have a data.table of events recording, say, user ID, country of residence, and event. E.g.,

library(data.table)
dt <- data.table(user=c(rep(3, 5), rep(4, 5)),
                 country=c(rep(1,4),rep(2,6)),
                 event=1:10, key="user")

As you can see, the data is somewhat corrupted: event 5 reports user 3 as being in country 2 (or maybe he traveled - it does not matter to me here). So when I try to summarize the data:

dt[, country[.N] , by=user]
   user V1
1:    3  2
2:    4  2

I get the wrong country for user 3. Ideally, I would like to get the most common country for a user and the percentage of time he spent there:

   user country support
1:    3       1     0.8
2:    4       2     1.0

How do I do that?

The actual data has ~10^7 rows, so the solution has to scale (this is why I am using data.table and not data.frame after all).

Recommended answer

Another way:

Edited. table(.) was the culprit. Changed it to complete data.table syntax.

dt.out<- dt[, .N, by=list(user,country)][, list(country[which.max(N)], 
               max(N)/sum(N)), by=user]
setnames(dt.out, c("V1", "V2"), c("country", "support"))
#    user country support
# 1:    3       1     0.8
# 2:    4       2     1.0
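
For readers who want to see what each link of the chain does, here is a minimal sketch that splits the same approach into two steps. The intermediate names counts and summary_dt are made up for illustration, and the output columns are named directly inside the second step instead of through setnames; this is one possible way to write it, not necessarily how the answer's author would.

```r
library(data.table)

# Same toy data as in the question.
dt <- data.table(user = c(rep(3, 5), rep(4, 5)),
                 country = c(rep(1, 4), rep(2, 6)),
                 event = 1:10, key = "user")

# Step 1: count events per (user, country) pair.
# For the toy data this gives three rows:
# (user 3, country 1, N = 4), (user 3, country 2, N = 1), (user 4, country 2, N = 5).
counts <- dt[, .N, by = .(user, country)]

# Step 2: per user, keep the country with the highest count and its share
# of that user's events. which.max() returns the first maximum, so a tie
# goes to the country that appears first for that user.
# summary_dt is a hypothetical name used only in this sketch.
summary_dt <- counts[, .(country = country[which.max(N)],
                         support = max(N) / sum(N)),
                     by = user]

print(summary_dt)
```

Both steps are grouped aggregations, so the work stays inside data.table rather than calling table(.) once per user. How it performs on ~10^7 rows is not measured here.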
