基于“附近"的聚合.行值 [英] Aggregating based on "near" row values

查看:96
本文介绍了基于“附近"的聚合.行值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个非常混乱的数据框(网络抓取),不幸的是其中有许多双甚至三重条目.大多数数据框如下所示:

I have a very messy dataframe (webscraped) that unfortunately has many double and even triple entries in it. Most of the dataframe looks like this:

> df1<-data.frame(var1=c("a","a","b","b","c","c","d","d"),var2=c("right.a",NA,"right.b",NA,"right.c",NA,"right.d",NA),var3=c("correct.a","correct.a","correct.b","correct.b","correct.c","correct.c","correct.d","correct.d"))
> df1
  var1    var2      var3
1    a right.a correct.a
2    a    <NA> correct.a
3    b right.b correct.b
4    b    <NA> correct.b
5    c right.c correct.c
6    c    <NA> correct.c
7    d right.d correct.d
8    d    <NA> correct.d

"var1"是我需要用于汇总的ID变量.我的目标是要有一个像这样的数据框:

"var1" is my ID variable that I need to use to aggregate. My goal is to have a dataframe that looks like this:

  var1    var2      var3
1    a right.a correct.a
2    b right.b correct.b
3    c right.c correct.c
4    d right.d correct.d

但是,主要问题是,并非整个数据帧看起来都像这样.实际上,我还有其他部分看起来像这样:

However, the main problem is, that not the whole dataframe looks like this. In fact, I have other parts that look like this:

> df2<-data.frame(var1=c("e","e","e","f","f","g","g","g"),var2=c(NA,NA,"right.e",NA,NA,NA,"right.g",NA),var3=c("correct.e","correct.e",NA,"correct.f",NA,"correct.g","correct.g",NA))
> df2
  var1    var2      var3
1    e    <NA> correct.e
2    e    <NA> correct.e
3    e right.e      <NA>
4    f    <NA> correct.f
5    f    <NA>      <NA>
6    g    <NA> correct.g
7    g right.g   wrong.g
8    g    <NA>      <NA>

和其他变体.最后,每个ID都应包含一行,其中正确和正确的是var2和var3.此时,我迷路了:我的var1 不唯一.但是,我知道一起"的重复ID在数据帧中分组(如我的示例所示);例如在第4102和4103行中可能还会有另一个"a".

And other variations. In the end, every ID should have one row with the correct and right var2 and var3 in it. At this point, I get lost: My var1 is not unique. However, I know that duplicate IDs that "belong" together are grouped in the dataframe (as seen in my examples); e.g. there might be another "a" in row 4102 and 4103.

我认为应该使用以var1作为ID的聚合,但是要告诉R聚合应该只检查var1的+ -2行.有任何想法如何编码吗?

What I think would be the way to go is using aggregate with var1 as ID,but in addition telling R that aggregate should just check +-2 rows of var1 when doing so. Any ideas how to code this?

谢谢!

推荐答案

这是使用data.table

library(data.table)

setDT(df1)[, .(var2[!is.na(var2)][1], var3[!is.na(var3)][1]), by=var1]
   var1      V1        V2
1:    a right.a correct.a
2:    b right.b correct.b
3:    c right.c correct.c
4:    d right.d correct.d

setDT(df2)[, .(var2[!is.na(var2)][1], var3[!is.na(var3)][1]), by=var1]
   var1      V1        V2
1:    e right.e correct.e
2:    f      NA correct.f
3:    g right.g correct.g

例如,在var2[!is.na(var2)][1]中的想法是从var2中获取第一个非缺失值.如果缺少所有值,则返回NA. var1对两个变量执行此操作.

The idea in var2[!is.na(var2)][1] for example, to take the first non-missing value from var2. If all values are missing, then this returns NA. This operation is performed for both variables by var1.

如果您有两个以上的变量,则可以切换到lapply.例如,以下.

If you have more than two variables, you might switch to lapply. For example, the following.

df1[, lapply(.SD, function(i) i[!is.na(i)][1]), by=var1]
   var1    var2      var3
1:    a right.a correct.a
2:    b right.b correct.b
3:    c right.c correct.c
4:    d right.d correct.d

在一个以上的var1具有有效值且由不丢失的var2表示的情况下,则可以通过联接来达到预期的结果.

In an instance where more than one var1 has a valid value, and this is indicated by a non-missing var2, then you can reach the intended result with a join.

评论中的数据,

df1<-data.frame(var1=c("a","a","b","b","c","c","d","d","a","a"),
                var2=c("right.a",NA,"right.b",NA,"right.c",NA,"right.d",NA,"right.a1",NA),
                var3=c("correct.a","correct.a","correct.b","correct.b","correct.c","correct.c","correct.d","correct.d","correct.a1","correct.a1"))

然后使用此数据

setDT(df1)[df1[, .(var2=var2[!is.na(var2)]), by=var1], on=.(var1, var2)]
   var1     var2       var3
1:    a  right.a  correct.a
2:    a right.a1 correct.a1
3:    b  right.b  correct.b
4:    c  right.c  correct.c
5:    d  right.d  correct.d

此处,var1的所有非缺失var2观测值都合并到原始数据集上.

Here, all non-missing var2 observations by var1 are merged onto the original data set.

这篇关于基于“附近"的聚合.行值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆