R: Make unique the duplicated levels in all factor columns in a data frame


Problem description

For several days now I have been stuck on a problem in R: using a loop to make the duplicated levels in multiple factor columns of a data frame unique. This is part of a larger project.

I have more than 200 SPSS data sets where the number of cases varies between 4,000 and 23,000 and the number of variables varies between 120 and 1,200 (an excerpt of one of the SPSS data sets can be found here). The files contain both numeric and factor variables, and many of the factor variables have duplicated levels. I have used read.spss from the foreign package to import them into data frames, keeping the value labels because I need them for further use. During the import, R warns me about the duplicated levels in the factor columns:

> adn <- read.spss("/tmp/adn_110.sav", use.value.labels = TRUE,
use.missings = TRUE, to.data.frame = TRUE)
Warning messages:
1: In read.spss("/tmp/adn_110.sav", use.value.labels = TRUE, use.missings = TRUE,  :
  /tmp/adn_110.sav: Unrecognized record type 7, subtype 18 encountered in system file
2: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels,  :
  duplicated levels in factors are deprecated
3: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels,  :
  duplicated levels in factors are deprecated

The data frame, exported as .RData, can be found here. When I use table (for example) to get the counts for each level of any factor column, all duplicated levels are displayed, but the counts for all duplicated levels are added to the first occurrence of the duplicate levels and for all others 0s are returned:

> table(adn[["adn01"]], useNA = "ifany")
  Incorrect         Incorrect Partially correct Partially correct 
          8                 0                 4                 0 
    Correct              <NA> 
          2                 1 
Warning message:
In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels,  :
  duplicated levels in factors are deprecated

I know I can easily treat the factor as.numeric when calling table. However, I need the level names displayed in the output. I can use make.unique to make the levels for individual factor columns unique, appending a number at the end of the duplicate levels:

> levels(adn[["adn01"]]) <- make.unique(levels(adn[["adn01"]]), sep = " ")

Works like a charm. Then table shows me the correct counts:

> table(adn[["adn01"]], useNA = "ifany")

          Incorrect         Incorrect 1   Partially correct 
                  5                   3                   1 
Partially correct 1             Correct                <NA> 
                  3                   2                   1 
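
To see what make.unique does in isolation, here is a minimal sketch on a plain character vector. The labels vector is made up to mirror the adn01 levels shown above; it is not read from the actual data set.

```r
# Hypothetical labels mirroring the duplicated levels of adn01
labels <- c("Incorrect", "Incorrect", "Partially correct",
            "Partially correct", "Correct")

# Default separator is "."; repeated entries get a numeric suffix
unique_default <- make.unique(labels)

# A space as separator, as used in the call above
unique_space <- make.unique(labels, sep = " ")

print(unique_default)
print(unique_space)
```

The first occurrence of each label is left unchanged; only later repeats get a suffix.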

However, doing this for each factor column in each of the more than 200 files, where the number of variables varies between 120 and 1,200, would be the mission of a lifetime. And if the files change, I will have to redo everything. I naively thought looping through the columns would be easy. However, make.unique requires a character vector of names. I have tried the following:

> lapply(adn[ , 1:length(adn)], make.unique(as.vector(attr(adn[ , 1:length(adn)],
"levels"))))
Error in make.unique(as.vector(attr(adn[, 1:length(adn)], "levels"))) : 
  'names' must be a character vector

No luck. I have tried many other things in the last few days, including classical for loops. The result is always the same: 'names' must be a character vector. I guess the problem is in indexing the levels attribute of the columns, which is a list component, but I can't figure out what is wrong. An additional issue may be that not all columns are factors. Can someone help?

EDIT:

The solution provided by akrun works perfectly. Thank you once again!

Solution

Try

 load('adn.RData')
 indx <- sapply(adn, is.factor)   # TRUE for factor columns only
 adn[indx] <- lapply(adn[indx], function(x) {
                   levels(x) <- make.unique(levels(x))   # append .1, .2, ... to repeated levels
                   x })


 table(adn[['adn01']], useNA='ifany')

 #     Incorrect         Incorrect.1   Partially correct Partially correct.1 
 #             5                   3                   1                   3 
 #       Correct                <NA> 
 #             2                   1 


  table(adn[['adn03']], useNA='ifany')

  #  Incorrect Partially correct           Correct              <NA> 
  #          6                 3                 5                 1 
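
As a side note on the error in the question, the sketch below suggests one likely cause: the levels attribute lives on each factor column, not on the data frame itself, so attr() on the whole data frame returns NULL and make.unique rejects it. The small data frame df is invented for illustration.

```r
# Toy data frame (hypothetical): one factor column, one integer column
df <- data.frame(a = factor(c("x", "y")), b = 1:2)

# The data frame as a whole carries no "levels" attribute
df_levels <- attr(df, "levels")

# Each factor column carries its own
col_levels <- attr(df[["a"]], "levels")

# Passing NULL to make.unique fails; capture the message instead of stopping
msg <- tryCatch(make.unique(as.vector(df_levels)),
                error = function(e) conditionMessage(e))

print(df_levels)
print(col_levels)
print(msg)
```

This is why the code above first selects the factor columns and then works on levels(x) one column at a time.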

Update

If you have multiple files, you can read them into a list and then do the processing on the list. For example, assuming the files are in the working directory:

library(foreign)  # provides read.spss
files <- list.files(pattern='^adn\\d+')
lst1 <- lapply(files, function(x) read.spss(x, use.value.labels = TRUE,
          use.missings = TRUE, to.data.frame = TRUE)) #not tested

For testing purposes, I am creating lst1 with the same dataset adn.

adn1 <- adn
lst1 <- list(adn, adn1)

Now apply make.unique to the factor columns of each list element:

lst2 <- lapply(lst1, function(dat) {
    indx <- sapply(dat, is.factor)
    dat[indx] <- lapply(dat[indx], function(x) {
        levels(x) <- make.unique(levels(x))
        x
    })
    dat
})


  lapply(lst2, function(x) table(x[['adn01']], useNA='ifany'))
  # [[1]]

  #    Incorrect         Incorrect.1   Partially correct Partially correct.1 
  #            5                   3                   1                   3 
  #      Correct                <NA> 
  #            2                   1 

  # [[2]]

  #    Incorrect         Incorrect.1   Partially correct Partially correct.1 
  #            5                   3                   1                   3 
  #      Correct                <NA> 
  #            2                   1 
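
The two steps above could also be wrapped in a small reusable function. This is only a sketch: make_levels_unique is a hypothetical name, the sep argument is added so the space separator from the question can be used, and vapply replaces sapply as a stylistic choice so the result is always a logical vector. The toy factor is built with structure() to imitate the duplicated levels that read.spss produced; it is not the real adn data.

```r
# Hypothetical helper: make levels unique in every factor column of a data frame
make_levels_unique <- function(dat, sep = ".") {
  is_fac <- vapply(dat, is.factor, logical(1))
  dat[is_fac] <- lapply(dat[is_fac], function(x) {
    levels(x) <- make.unique(levels(x), sep = sep)
    x
  })
  dat
}

# Toy factor with deliberately duplicated levels (set directly as attributes)
dup <- structure(c(1L, 2L, 3L, 4L, 5L, 2L),
                 levels = c("Incorrect", "Incorrect", "Partially correct",
                            "Partially correct", "Correct"),
                 class = "factor")
toy <- data.frame(id = 1:6)
toy$adn01 <- dup

# One data frame
fixed <- make_levels_unique(toy, sep = " ")

# A list of data frames, as with lst1 above
fixed_list <- lapply(list(first = toy, second = toy),
                     make_levels_unique, sep = " ")

print(levels(fixed$adn01))
```

Non-factor columns such as id pass through unchanged, and the underlying integer codes of the factor are kept, so only the labels differ afterwards.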
