R将ID中重复的行与数据框中的不同列类型组合在一起 [英] R combining duplicate rows by ID with different column types in a dataframe

查看:192
本文介绍了R将ID中重复的行与数据框中的不同列类型组合在一起的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个数据框,其中有一个列ID作为标识符,还有一些其他不同类型的列(因数和数字)。看起来像这样

I have a dataframe with a column ID as an identifier and some other columns of different types (factors and numerics). It looks like this

df <- data.frame(id    = c(1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4),
                 abst  = c(0, NA, 2, NA, NA, NA, 0, 0, NA, 2, NA, 3, 4),
                 farbe = as.factor(c("keine", NA, "keine", NA, NA, NA, "keine", "keine", NA, NA, NA, "rot", "rot")),
                 gier  = c(0, NA, 5, NA, NA, NA, 0, 0, NA, 1, NA, 6, 2))

现在,我想合并重复的ID。数字列定义为所有相同ID的平均值(不包含NA!)。因子列合并为一个。可以省略NA。

Now I want to combine the duplicate IDs. The numeric columns are defined as the mean value of all identical IDs (without the NAs!). The factor columns are combined into one. The NAs can be omitted.

最后结果应该像这样

dfRes <- data.frame(id    = c(1, 2, 3, 4),
                    abst  = c(1, 0, 0, 3),
                    farbe = as.factor(c("keine", "keine", "keine", "rot")),
                    gier  = c(2.5, 0, 0, 3))

我希望有一种方法可以快速进行计算,因为我有大约一百万个观测值。
预先感谢!

I hope there is a way to do the calculations fast, because I have about a million observations. Thanks in advance!

编辑(添加): farbe可能不是唯一的。在这种情况下,我认为对我的数据最好的方法是有一个重复的行,但只有一个不同的 farbe,因此有2个相同的ID,所有 farbe的值都相同但不同。这应该是非常罕见的情况,但是是一个很大的补充。

Edit (additions): "farbe" may not be unique. In this case I think the best idea for my data is to have a duplicate row but only with a different "farbe", so there are 2 identical IDs and all the same but different values for "farbe". This should be just very rare case, but a great addition.

我的真实数据中有很多数字和因子列。还可以创建一个解决方案,所以我不必定义每个列?

I have a lot more numeric and factor columns in my real data. Is it also possible to create a solution, so I don't have to define every single column?

推荐答案

编辑

看到了您对非唯一因子列和按类型选择列的编辑。这将起作用,但是我会考虑一种更清洁的方法来进行报告并报告(我相信有一个简单的方法)。如果您想像原始示例一样手动指定列并且您具有非唯一因素,只需将 unlist() unique()以下面的相同方式。或者,您可以考虑使用 paste() collapse =; 之类的一行结合所有因子水平这样。如果要更改最终数据表的列顺序,请在数据表上使用 setcolorder()

Just saw your edit about non-unique factor columns and selecting columns by type. This will work but I will think about a cleaner way to do this and report back (I am sure there is a simple way). If you want to manually specify columns like the original example and you have non-unique factors, just use unlist() with unique() in the same fashion as below. Alternatively, you could consider combining all factor levels on one line using paste() with collapse = "; " or something to that effect. If you want to change the column order for the final data.table, use setcolorder() on the data.table

setDT(df)

# For selecting columns later
num_cols <- sapply(df, is.numeric)
num_cols[names(num_cols) == "id"] <- FALSE
fac_cols <- sapply(df, is.factor)

df[, lapply(.SD, mean, na.rm = T), by = id, .SDcols = num_cols][
  df[, lapply(.SD, function(i) unlist(unique(i[!is.na(i)]))), by = id, .SDcols = fac_cols], on = "id"]

   id abst gier farbe
1:  1    1  2.5 keine
2:  2    0  0.0 keine
3:  3    0  0.0 keine
4:  4    3  3.0  rot2
5:  4    3  3.0   rot

工作原理
它加入数字列摘要

How it works: It joins the numeric column summary

df[, lapply(.SD, mean, na.rm = TRUE), by = id, .SDcols = num_cols]

具有因子列摘要

df[, lapply(.SD, function(i) unlist(unique(i[!is.na(i)]))), by = id, .SDcols = fac_cols]

要编辑的数据

df <- data.frame(id    = c(1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4),
                 abst  = c(0, NA, 2, NA, NA, NA, 0, 0, NA, 2, NA, 3, 4),
                 farbe = as.factor(c("keine", NA, "keine", NA, NA, NA, "keine", "keine", NA, NA, "rot2", "rot", "rot")),
                 gier  = c(0, NA, 5, NA, NA, NA, 0, 0, NA, 1, NA, 6, 2))

原始答案

这是许多 data.table 解决方案之一。这按因子列对data.table进行排序,以便在汇总时可以获取最高值。我也将其转换回纯data.frame,但如果您不想这样做,则不必这样做。希望这对您有帮助!

Here is one of many data.table solutions. This orders the data.table by the factor column so it can grab the top value while summarizing. I also converted it back to a pure data.frame but you do not have to do that if you do not want to. Hope this helps!

此外,这还假定农场对于每个都是相同的id

Also, this assumes that farbe will be the same for each id

library(data.table)

setDT(df)

df <- df[order(farbe), .(abst = mean(abst, na.rm = TRUE),
                         farbe = farbe[1],
                         gier = mean(gier, na.rm = TRUE)), by = id]

setDF(df)
df
  id abst farbe gier
1  1    1 keine  2.5
2  2    0 keine  0.0
3  3    0 keine  0.0
4  4    3   rot  3.0

这篇关于R将ID中重复的行与数据框中的不同列类型组合在一起的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆