使用变量名的大写和非大写版本合并列,而不指定变量名 [英] Coalesce columns with capitalised and non capitalised versions of variables names without specifying the variable names

查看:42
本文介绍了使用变量名的大写和非大写版本合并列,而不指定变量名的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个(大)数据框,如下所示:

I have a (large) data frame as follows:

library(data.table)
DT <- fread(
"ID country year A b B a
4   NLD   2002  NA   1   NA   0
5   NLD   2002  NA   0   NA   1
6   NLD   2006  NA   1   NA   1
7   NLD   2006  NA   0   NA   0
8   NLD   2006  0   NA   0   NA
9   GBR   2002  0   NA   0   NA
10  GBR   2002  0   NA   0   NA
11  GBR   2002  0   NA   0   NA
12  GBR   2006  1   NA   1   NA
13  GBR   2006  1   NA   0   NA",
header = TRUE)

我只想合并变量 A a ,以及 B b .

I would simply like to merge variables A and a, and, B and b.

问题是我必须对1000个以上的变量执行此操作,因此我要避免指定不需要检查的列名或需要检查的列名.

The problem is that I have to do this for more than 1000 variables, so I would like to avoid specifying either the column names which need not to be checked or the ones that do.

我希望有一个解决方案,首先将列分为没有大写字母的组和没有大写字母的组.

I was hoping for a solution that first splits the columns into a group for which there is no non-capitalised alternative, and a group for which there is.

据我了解,这里的解决方案:

As far as I understand the solution here:

基于R中模式的Coalesce列

仍然需要提供需要检查大小写的变量名称.如果我误解了这种解决方案(很有可能),请告诉我.无论如何,如前所述,我需要一种无需专门指定变量的解决方案.

It still requires to provide the variables names for which the case needs to be checked. If I misunderstand this solution, which is very much possible, please let me know. In any case, as explained, I need a solution without specifically specifying the variables.

我在此处找到了一个良好的开端.

I found a good start here.

但是,该解决方案与我需要的解决方案略有不同.

That solution has however a slightly different approach than the one I need.

如何使这样的变量合并以 tolower(varname)== varname 之类的条件为条件?

How do I make such a variable merge conditional on something like tolower(varname) == varname ?

所需的输出:

DT <- fread(
"ID country year A B
4   NLD   2002  0  1
5   NLD   2002  1  0
6   NLD   2006  1  1
7   NLD   2006  0  0 
8   NLD   2006  0  0 
9   GBR   2002  0  0 
10  GBR   2002  0  0
11  GBR   2002  0  0
12  GBR   2006  1  1 
13  GBR   2006  1  0 ",
header = TRUE)

推荐答案

OP正在使用 data.table ,因此该问题值得一个 data.table 答案.

The OP is using data.table, so the question deserves a data.table answer.

下面的方法通常类似于 sindri_baldur的答案,但在重要细节上有所不同.特别是

The approach below is similar to sindri_baldur's answer in general but differs in important details. In particular,

  • 它还将合并多个列,例如"CC","cc","cC" ,涵盖了不同的变量名编写方式,例如大写,小写上下驼色的情况下.
  • 它将返回已合并的列的描述.
  • it will also coalesce multiple columns like "CC", "cc", "cC" covering the different ways of writing variable names, e.g., upper case, lower case, as well as lower and upper camel case.
  • it will return a description of the columns which have been coalesced.
library(data.table)
library(magrittr) # piping is used to improve readability
names(DT) %>% 
  data.table(orig = ., lc = tolower(.)) %>% 
  .[, {
    if (.N > 1L) {
      new <- toupper(.BY)
      old <- setdiff(orig, new)
      DT[, (new) := fcoalesce(.SD), .SDcols = orig]
      DT[, (old) := NULL]
      sprintf("Coalesced %s onto %s", toString(old), new)
    }
  }, by = lc]
DT[]

   lc                 V1
1:  a Coalesced a onto A
2:  b Coalesced b onto B

DT[]

    ID country year A B
 1:  4     NLD 2002 0 1
 2:  5     NLD 2002 1 0
 3:  6     NLD 2006 1 1
 4:  7     NLD 2006 0 0
 5:  8     NLD 2006 0 0
 6:  9     GBR 2002 0 0
 7: 10     GBR 2002 0 0
 8: 11     GBR 2002 0 0
 9: 12     GBR 2006 1 1
10: 13     GBR 2006 1 0

用于另一个用例

DT2 <- fread(
  "ID country year A b B a CC cc cC
4   NLD   2002  NA   1   NA   0   1  NA  NA    
5   NLD   2002  NA   0   NA   1  NA   2  NA
6   NLD   2006  NA   1   NA   1  NA  NA   3
7   NLD   2006  NA   0   NA   0  NA  NA  NA  
8   NLD   2006  0   NA   0   NA   1  NA  NA
9   GBR   2002  0   NA   0   NA  NA   2  NA
10  GBR   2002  0   NA   0   NA  NA  NA   3
11  GBR   2002  0   NA   0   NA   1  NA  NA
12  GBR   2006  1   NA   1   NA  NA   2  NA
13  GBR   2006  1   NA   0   NA  NA  NA   3",
  header = TRUE)
DT <- copy(DT2)

以上代码返回

   lc                       V1
1:  a       Coalesced a onto A
2:  b       Coalesced b onto B
3: cc Coalesced cc, cC onto CC

DT[]

    ID country year A B CC
 1:  4     NLD 2002 0 1  1
 2:  5     NLD 2002 1 0  2
 3:  6     NLD 2006 1 1  3
 4:  7     NLD 2006 0 0 NA
 5:  8     NLD 2006 0 0  1
 6:  9     GBR 2002 0 0  2
 7: 10     GBR 2002 0 0  3
 8: 11     GBR 2002 0 0  1
 9: 12     GBR 2006 1 1  2
10: 13     GBR 2006 1 0  3

说明

  1. 将列名转换为data.table,并在表中添加小写版本的列名 lc .
  2. 我们使用 by = data.table 的功能对 any for 循环>表达,即使有副作用.因此,在 data.table 范围内,对于 lc 的每个不同值,通过引用 更新 DT .是在第1步中即时"创建的,但前提是该组中有不止一列.
  1. The column names are turned into a data.table with a additional column lc of the lower case versions of the column names.
  2. Instead of a for loop we use grouping by = and data.table's feature to evaluate any expression, even with side effects. So, DT is updated by reference for each distinct value of lc within the scope of the data.table which was created on-the-fly in step 1 but only if there is more than one column in the group.

未来扩展

这种方法可以扩展到合并使用下划线,点或空格的列.列名中的"",例如"var_1","VAR.1","Var 1" .

这篇关于使用变量名的大写和非大写版本合并列,而不指定变量名的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆