R: Make unique the duplicated levels in all factor columns in a data frame
For several days I have been stuck on a problem in R: using a loop to make the duplicated levels unique in multiple factor columns of a data frame. This is part of a larger project.
I have more than 200 SPSS data sets, where the number of cases varies between 4,000 and 23,000 and the number of variables varies between 120 and 1,200 (an excerpt of one of the SPSS data sets can be found here). The files contain both numeric and factor variables, and many of the factor variables have duplicated levels. I used read.spss from the foreign package to import them into data frames, keeping the value labels because I need them for further use. During the import, R warns me about the duplicated levels in the factor columns:
> adn <- read.spss("/tmp/adn_110.sav", use.value.labels = TRUE,
use.missings = TRUE, to.data.frame = TRUE)
Warning messages:
1: In read.spss("/tmp/adn_110.sav", use.value.labels = TRUE, use.missings = TRUE, :
/tmp/adn_110.sav: Unrecognized record type 7, subtype 18 encountered in system file
2: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, :
duplicated levels in factors are deprecated
3: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, :
duplicated levels in factors are deprecated
The data frame, exported as .RData, can be found here. When I use table (for example) to get the counts for each level of any factor column, all duplicated levels are displayed, but the counts for a duplicated level are all added to its first occurrence, and 0 is returned for the others:
> table(adn[["adn01"]], useNA = "ifany")
        Incorrect         Incorrect Partially correct Partially correct
                8                 0                 4                 0
          Correct              <NA>
                2                 1
Warning message:
In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, :
duplicated levels in factors are deprecated
I know I can easily convert the factor with as.numeric when calling table. However, I need the level names displayed in the output. I can use make.unique to make the levels of an individual factor column unique, appending a number to the end of each duplicated level:
> levels(adn[["adn01"]]) <- make.unique(levels(adn[["adn01"]]), sep = " ")
Works like a charm. Then table shows me the correct counts:
> table(adn[["adn01"]], useNA = "ifany")
          Incorrect         Incorrect 1   Partially correct
                  5                   3                   1
Partially correct 1             Correct                <NA>
                  3                   2                   1
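To see what make.unique does in isolation, here is a minimal sketch on a hand-typed character vector. The labels vector is made up to mirror the adn01 levels shown above; it is not read from the data. The first occurrence of each label is left alone, and each later repeat gets the separator plus a running number.

```r
# Made-up labels that mirror the duplicated levels of adn01
labels <- c("Incorrect", "Incorrect",
            "Partially correct", "Partially correct", "Correct")

# Separator chosen by the caller (a space, as in the call above)
with_space <- make.unique(labels, sep = " ")
with_space
# "Incorrect" "Incorrect 1" "Partially correct" "Partially correct 1" "Correct"

# Default separator is "."
with_dot <- make.unique(labels)
with_dot
# "Incorrect" "Incorrect.1" "Partially correct" "Partially correct.1" "Correct"
```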
However, doing this for each factor column in each of the more than 200 files, where the number of variables varies between 120 and 1,200, would be the mission of a lifetime. And if the files change, I will have to redo everything. I naively thought that looping through the columns would be easy. However, make.unique requires a character vector of names. I have tried the following:
> lapply(adn[ , 1:length(adn)], make.unique(as.vector(attr(adn[ , 1:length(adn)],
"levels"))))
Error in make.unique(as.vector(attr(adn[, 1:length(adn)], "levels"))) :
'names' must be a character vector
No luck. I have tried many other things over the last few days, including classical for loops. Still the same: 'names' must be a character vector. I guess the problem is in indexing the levels attribute of the columns, which is a list component, but I can't figure out what is wrong. An additional issue may be that not all columns are factors. Can someone help?
EDIT:
The solution provided by akrun works perfectly. Thank you once again!
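One plausible reading of the error in the question, sketched on a small made-up data frame (toy is hypothetical, not the adn data): lapply expects a function as its second argument, but the attempt passes the result of a make.unique call instead. That call is evaluated first, and because the levels are stored on each factor column rather than on the data frame, attr() on the data frame returns NULL, which make.unique rejects.

```r
# Toy data frame, made up for illustration (not the adn data)
toy <- data.frame(id = 1:3, grp = factor(c("a", "b", "a")))

# The levels attribute belongs to each factor column, not to the data frame
df_levels  <- attr(toy, "levels")       # NULL
col_levels <- attr(toy$grp, "levels")   # "a" "b"

# make.unique() runs before lapply() is ever reached, and it fails on NULL
err <- tryCatch(make.unique(as.vector(df_levels)),
                error = function(e) e)
inherits(err, "error")   # TRUE
conditionMessage(err)    # the "'names' must be a character vector" message
```

Passing a function that works on one column at a time, and restricting it to the factor columns, avoids both problems.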
Try
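load('adn.RData')
# Logical index marking which columns are factors
indx <- sapply(adn, is.factor)
# Make the levels of every factor column unique (repeats get .1, .2, ...)
adn[indx] <- lapply(adn[indx], function(x) {
  levels(x) <- make.unique(levels(x))
  x
})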
table(adn[['adn01']], useNA='ifany')
#           Incorrect         Incorrect.1   Partially correct Partially correct.1
#                   5                   3                   1                   3
#             Correct                <NA>
#                   2                   1
table(adn[['adn03']], useNA='ifany')
#         Incorrect Partially correct           Correct              <NA>
#                 6                 3                 5                 1
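For readers without the adn.RData file, the same idea can be tried on toy data. This is only a sketch: the helper name uniquify_factor_levels and the toy data frame are hypothetical. Recent versions of R refuse to build a factor with duplicated levels through factor(), so structure() is used here purely to imitate what the import produced.

```r
# Hypothetical helper wrapping the approach above in a reusable function
uniquify_factor_levels <- function(dat) {
  is_fac <- vapply(dat, is.factor, logical(1))
  if (any(is_fac)) {
    dat[is_fac] <- lapply(dat[is_fac], function(x) {
      levels(x) <- make.unique(levels(x))
      x
    })
  }
  dat
}

# Imitate an imported factor with a duplicated level (codes 1, 2, 1, 3)
dup <- structure(c(1L, 2L, 1L, 3L),
                 levels = c("Incorrect", "Incorrect", "Correct"),
                 class = "factor")

# Mixed data frame: two numeric columns and one factor column
toy <- data.frame(id = 1:4, score = c(0.5, 1.2, 0.3, 2.0))
toy$answer <- dup

fixed <- uniquify_factor_levels(toy)
levels(fixed$answer)   # "Incorrect" "Incorrect.1" "Correct"
table(fixed$answer)
```

The numeric columns pass through untouched, and the underlying integer codes of the factor are kept, so only the labels change.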
Update
If you have multiple files, you can read them into a list and then do the processing on the list. For example, assuming that the files are in the working directory:
library(foreign)
files <- list.files(pattern='^adn\\d+')
lst1 <- lapply(files, function(x) read.spss(x, use.value.labels = TRUE,
                 use.missings = TRUE, to.data.frame = TRUE)) #not tested
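Before running the import on 200 files, it may be worth checking which names the pattern actually matches. The sketch below assumes file names shaped like the adn_110.sav from the question; the other names are made up. The pattern '^adn\\d+' requires a digit directly after adn, so a name with an underscore would be skipped. The second pattern is one possible alternative, not a tested recommendation.

```r
# Only adn_110.sav appears in the question; the other names are made up
candidates <- c("adn_110.sav", "adn110.sav", "adn_110.RData", "notes.txt")

# Pattern as written above: digits must follow "adn" immediately
strict <- grepl("^adn\\d+", candidates)
strict    # FALSE TRUE FALSE FALSE

# Alternative: optional underscore, and only .sav files
relaxed <- grepl("^adn_?\\d+\\.sav$", candidates)
relaxed   # TRUE TRUE FALSE FALSE
```

The same regular expression can be passed to the pattern argument of list.files.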
For testing purposes, I am creating lst1 with the same dataset adn.
adn1 <- adn
lst1 <- list(adn, adn1)
Now, apply make.unique to the factor columns of each list element:
lst2 <- lapply(lst1, function(dat) {
indx <- sapply(dat, is.factor)
dat[indx] <- lapply(dat[indx], function(x){
levels(x) <- make.unique(levels(x))
x})
dat})
lapply(lst2, function(x) table(x[['adn01']], useNA='ifany'))
# [[1]]
#           Incorrect         Incorrect.1   Partially correct Partially correct.1
#                   5                   3                   1                   3
#             Correct                <NA>
#                   2                   1
# [[2]]
#           Incorrect         Incorrect.1   Partially correct Partially correct.1
#                   5                   3                   1                   3
#             Correct                <NA>
#                   2                   1
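With this many files, a quick check that no factor column was missed can help. The sketch below is an assumption-laden illustration: the helper name count_dup_levels and the toy list are hypothetical, and structure() is used only to imitate a factor with duplicated levels. It counts, for each data frame in a list, how many factor columns still carry duplicated levels.

```r
# Hypothetical helper: number of factor columns with duplicated levels
count_dup_levels <- function(dat) {
  is_fac <- vapply(dat, is.factor, logical(1))
  sum(vapply(dat[is_fac],
             function(x) anyDuplicated(levels(x)) > 0,
             logical(1)))
}

# Imitated factor with a duplicated level, as the import would produce
bad <- structure(c(1L, 2L, 3L), levels = c("p", "p", "q"), class = "factor")
with_dup <- data.frame(id = 1:3)
with_dup$bad <- bad

# Toy list standing in for a list of imported data frames
toy_list <- list(
  clean    = data.frame(id = 1:3, grp = factor(c("x", "y", "x"))),
  numeric  = data.frame(id = 1:2, val = c(0.1, 0.2)),
  with_dup = with_dup
)

remaining <- vapply(toy_list, count_dup_levels, integer(1))
remaining   # clean = 0, numeric = 0, with_dup = 1
```

Running the same check on lst2 should give zero for every element once make.unique has been applied.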