使用变量名的大写和非大写版本合并列,而不指定变量名 [英] Coalesce columns with capitalised and non capitalised versions of variables names without specifying the variable names
问题描述
我有一个(大)数据框,如下所示:
I have a (large) data frame as follows:
library(data.table)
DT <- fread(
"ID country year A b B a
4 NLD 2002 NA 1 NA 0
5 NLD 2002 NA 0 NA 1
6 NLD 2006 NA 1 NA 1
7 NLD 2006 NA 0 NA 0
8 NLD 2006 0 NA 0 NA
9 GBR 2002 0 NA 0 NA
10 GBR 2002 0 NA 0 NA
11 GBR 2002 0 NA 0 NA
12 GBR 2006 1 NA 1 NA
13 GBR 2006 1 NA 0 NA",
header = TRUE)
我只想合并变量 A
和 a
,以及 B
和 b
.
I would simply like to merge variables A
and a
, and, B
and b
.
问题是我必须对1000个以上的变量执行此操作,因此我要避免指定不需要检查的列名或需要检查的列名.
The problem is that I have to do this for more than 1000 variables, so I would like to avoid specifying either the column names which need not to be checked or the ones that do.
我希望有一个解决方案,首先将列分为没有大写字母的组和没有大写字母的组.
I was hoping for a solution that first splits the columns into a group for which there is no non-capitalised alternative, and a group for which there is.
据我了解,这里的解决方案:
As far as I understand the solution here:
仍然需要提供需要检查大小写的变量名称.如果我误解了这种解决方案(很有可能),请告诉我.无论如何,如前所述,我需要一种无需专门指定变量的解决方案.
It still requires to provide the variables names for which the case needs to be checked. If I misunderstand this solution, which is very much possible, please let me know. In any case, as explained, I need a solution without specifically specifying the variables.
我在此处找到了一个良好的开端.
I found a good start here.
但是,该解决方案与我需要的解决方案略有不同.
That solution has however a slightly different approach than the one I need.
如何使这样的变量合并以 tolower(varname)== varname
之类的条件为条件?
How do I make such a variable merge conditional on something like tolower(varname) == varname
?
所需的输出:
DT <- fread(
"ID country year A B
4 NLD 2002 0 1
5 NLD 2002 1 0
6 NLD 2006 1 1
7 NLD 2006 0 0
8 NLD 2006 0 0
9 GBR 2002 0 0
10 GBR 2002 0 0
11 GBR 2002 0 0
12 GBR 2006 1 1
13 GBR 2006 1 0 ",
header = TRUE)
推荐答案
OP正在使用 data.table
,因此该问题值得一个 data.table
答案.
The OP is using data.table
, so the question deserves a data.table
answer.
下面的方法通常类似于 sindri_baldur的答案,但在重要细节上有所不同.特别是
The approach below is similar to sindri_baldur's answer in general but differs in important details. In particular,
- 它还将合并多个列,例如
"CC","cc","cC"
,涵盖了不同的变量名编写方式,例如大写,小写上下驼色的情况下. - 它将返回已合并的列的描述.
- it will also coalesce multiple columns like
"CC", "cc", "cC"
covering the different ways of writing variable names, e.g., upper case, lower case, as well as lower and upper camel case. - it will return a description of the columns which have been coalesced.
library(data.table)
library(magrittr) # piping is used to improve readability
names(DT) %>%
data.table(orig = ., lc = tolower(.)) %>%
.[, {
if (.N > 1L) {
new <- toupper(.BY)
old <- setdiff(orig, new)
DT[, (new) := fcoalesce(.SD), .SDcols = orig]
DT[, (old) := NULL]
sprintf("Coalesced %s onto %s", toString(old), new)
}
}, by = lc]
DT[]
lc V1
1: a Coalesced a onto A
2: b Coalesced b onto B
DT[]
ID country year A B
1: 4 NLD 2002 0 1
2: 5 NLD 2002 1 0
3: 6 NLD 2006 1 1
4: 7 NLD 2006 0 0
5: 8 NLD 2006 0 0
6: 9 GBR 2002 0 0
7: 10 GBR 2002 0 0
8: 11 GBR 2002 0 0
9: 12 GBR 2006 1 1
10: 13 GBR 2006 1 0
用于另一个用例
DT2 <- fread(
"ID country year A b B a CC cc cC
4 NLD 2002 NA 1 NA 0 1 NA NA
5 NLD 2002 NA 0 NA 1 NA 2 NA
6 NLD 2006 NA 1 NA 1 NA NA 3
7 NLD 2006 NA 0 NA 0 NA NA NA
8 NLD 2006 0 NA 0 NA 1 NA NA
9 GBR 2002 0 NA 0 NA NA 2 NA
10 GBR 2002 0 NA 0 NA NA NA 3
11 GBR 2002 0 NA 0 NA 1 NA NA
12 GBR 2006 1 NA 1 NA NA 2 NA
13 GBR 2006 1 NA 0 NA NA NA 3",
header = TRUE)
DT <- copy(DT2)
以上代码返回
lc V1
1: a Coalesced a onto A
2: b Coalesced b onto B
3: cc Coalesced cc, cC onto CC
DT[]
ID country year A B CC
1: 4 NLD 2002 0 1 1
2: 5 NLD 2002 1 0 2
3: 6 NLD 2006 1 1 3
4: 7 NLD 2006 0 0 NA
5: 8 NLD 2006 0 0 1
6: 9 GBR 2002 0 0 2
7: 10 GBR 2002 0 0 3
8: 11 GBR 2002 0 0 1
9: 12 GBR 2006 1 1 2
10: 13 GBR 2006 1 0 3
说明
- 将列名转换为data.table,并在表中添加小写版本的列名
lc
. - 我们使用
by =
和data.table
的功能对 any for 循环>表达,即使有副作用.因此,在data.table
范围内,对于lc
的每个不同值,通过引用 更新DT
.是在第1步中即时"创建的,但前提是该组中有不止一列.
- The column names are turned into a data.table with a additional column
lc
of the lower case versions of the column names. - Instead of a
for
loop we use groupingby =
anddata.table
's feature to evaluate any expression, even with side effects. So,DT
is updated by reference for each distinct value oflc
within the scope of thedata.table
which was created on-the-fly in step 1 but only if there is more than one column in the group.
未来扩展
这种方法可以扩展到合并使用下划线,点或空格的列.列名中的"",例如"var_1","VAR.1","Var 1"
.
这篇关于使用变量名的大写和非大写版本合并列,而不指定变量名的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!