R data.table: how to replace positive values with column names across multiple binary data columns


Question

I'm using R v3.2.1 and data.table v1.9.6. I have a data.table like the example below. It contains some binary-coded columns, classed as character with the values "0" and "1", and also a string vector containing phrases that share some words with the binary column names. My ultimate goal is to create a wordcloud using both the words in the string vector and the positive responses in the binary vectors. To do this, I first need to convert the positive responses in the binary vectors to their column names, and that is where I'm getting stuck.

A similar question has been asked here, but it is not quite the same: that poster starts with a matrix, and the suggested solution does not seem to work with a more complicated data set. I also have columns other than my binary columns that have ones in them, so the solution needs to first accurately identify my binary columns.

Here is some sample data:

library(data.table)

id <- c(1,2,3,4,5)
age <- c("5", "1", "11", "20", "21")
apple <- c("0", "1", NA, "1", "0")
pear <- c("1", "1", "1", "0", "0")
banana <- c("0", "1", "1", NA, "1")
favfood <- c("i love pear juice", "i eat chinese pears and crab apples every sunday", "i also like apple tart", "i like crab apple juice", "i hate most fruit except bananas" )

df <- as.data.frame(cbind(id, age, apple, pear, banana, favfood), stringsAsFactors=FALSE)
dt <- data.table(df)
dt[, id := as.numeric(id)]

This is what the data looks like:

    id age apple pear banana                                          favfood
1:  1   5     0    1      0                                i love pear juice
2:  2   1     1    1      1 i eat chinese pears and crab apples every sunday
3:  3  11    NA    1      1                           i also like apple tart
4:  4  20     1    0     NA                          i like crab apple juice
5:  5  21     0    0      1                 i hate most fruit except bananas

Thus the wordcloud should have a frequency of 1 for apples if apple==1 or favfood contains the string "apple" or both, and so on.

Here is my attempt (which doesn't do what I want, but gets about halfway):

# First define the logic columns.
# I've done this by name here but in my real data set this won't work because there are too many    
logicols <- c("apple", "pear", "banana")

# Next identify the location of the "1"s within the subset of logic columns:
ones <- which(dt==1 & colnames(dt) %in% logicols, arr.ind=T)

# Lastly, convert the "1"s in the subset to their column names:
dt[ones, ]<-colnames(dt)[ones[,2]]

This gives:

> dt
   id age apple pear banana                                          favfood
1:  1   5     0 pear      0                                i love pear juice
2:  2   1     1 pear banana i eat chinese pears and crab apples every sunday
3:  3  11    NA    1 banana                           i also like apple tart
4:  4  20     1    0     NA                          i like crab apple juice
5:  5  21     0    0      1                 i hate most fruit except bananas

There are two problems with this approach:

(a) Identifying the columns to convert by name is not convenient for my real data set because there are many of them. How can I identify this subset of columns without including other columns that contain 1s but have other values in them as well (in this example "age" contains a 1 but is clearly not a logic column)? I have deliberately coded "age" as a character column in the example because, in my real data set, there are character columns that contain 1s but are not logic columns. The feature that sets my logic columns apart is that they are character columns containing only the values 0 and 1, or missing values (NA).

(b) The index has not picked up all the 1s in the logic columns. Does anyone know why this is (e.g. the 1 in the second row of the "apple" column is not converted)?

Many thanks for your help. I'm sure I'm missing something relatively simple, but I'm quite stuck on this.
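On point (b), the output above is consistent with R's vector recycling. `dt==1` is a 5 x 6 logical matrix, while `colnames(dt) %in% logicols` is a length-6 vector. `&` recycles that vector down the columns (column-major order), not across them, so the mask does not line up with the column names. A minimal base-R sketch, using a made-up all-TRUE matrix `m` in place of `dt==1` and an illustrative vector `keep` (neither name is from the original post):

```r
# Hypothetical 5 x 6 matrix standing in for the result of dt == 1
m <- matrix(TRUE, nrow = 5, ncol = 6,
            dimnames = list(NULL, c("id", "age", "apple", "pear", "banana", "favfood")))

# Length-6 mask, one entry per column: FALSE FALSE TRUE TRUE TRUE FALSE
keep <- colnames(m) %in% c("apple", "pear", "banana")

# Recycled in column-major order: keep[1] meets m[1, 1], keep[2] meets m[2, 1], ...
wrong <- m & keep

# Repeating each entry nrow(m) times lines the mask up with the columns
right <- m & rep(keep, each = nrow(m))

colSums(wrong)  # TRUE values are scattered over every column
colSums(right)  # TRUE values appear only in apple, pear and banana
```

With the sample data, the misaligned mask happens to keep rows 1 and 2 of "pear" and rows 2 and 3 of "banana" and nothing in "apple", which matches the partial conversion shown above.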

Recommended answer

Thanks to @Frank for pointing out that the logic/binary columns should have been converted to the correct class with as.logical().

This greatly simplifies identifying the values to change, and the indexing now seems to work as well:

library(data.table)

# Starting with the data in its original format:
id <- c(1,2,3,4,5)
age <- c("5", "1", "11", "20", "21")
apple <- c("0", "1", NA, "1", "0")
pear <- c("1", "1", "1", "0", "0")
banana <- c("0", "1", "1", NA, "1")
favfood <- c("i love pear juice", "i eat chinese pears and crab apples every sunday", "i also like apple tart", "i like crab apple juice", "i hate most fruit except bananas" )

df <- as.data.frame(cbind(id, age, apple, pear, banana, favfood), stringsAsFactors=FALSE)

# Convert the "0" / "1" character columns to logical with a helper function:

recode.multi <- function(data, recode.cols, old.var, new.var, format = as.numeric){
  # Recode values in multiple columns of a data.frame
  #
  # Args:        data: a data.frame
  #       recode.cols: a character vector containing the names of the
  #                    columns to recode
  #           old.var: a character vector containing the values to be recoded
  #           new.var: a vector containing the desired recoded values
  #            format: a function describing the desired format, e.g.
  #                    as.character, as.numeric, as.factor, etc.

  # check old.var and new.var are of equal length
  if(length(old.var) != length(new.var)){
    stop("'old.var' and 'new.var' are of differing lengths")
  }

  # convert format of selected columns to character
  if(length(recode.cols) == 1){
    data[, recode.cols] = as.character(data[, recode.cols])
  } else {
    data[, recode.cols] = data.frame(lapply(data[, recode.cols], as.character), stringsAsFactors=FALSE)
  }

  # recode old values to new values for selected columns
  for(i in seq_along(old.var)){
    data[, recode.cols][data[, recode.cols] == old.var[i]] = new.var[i]
  }

  # convert recoded columns to desired format
  data[, recode.cols] = sapply(data[, recode.cols], format)

  data
}

df = recode.multi(data = df, recode.cols = c(unlist(strsplit("apple pear banana", split=" "))), old.var = c("0", "1", NA), new.var = c(FALSE, TRUE, NA), format = as.logical)

dt <- data.table(df)
dt[, id := as.numeric(id)]

# Identify the values to swap with column names:
convtoname <- which(dt==TRUE, arr.ind=T)

# Make the swap:
dt[convtoname, ]<-colnames(dt)[convtoname[,2]]

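This gives the result below. The fruit columns are converted as intended, but note that the id in row 1 is also replaced, because `1 == TRUE` evaluates to TRUE in R and the comparison runs over every column: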

> dt
   id age apple  pear banana                                          favfood
1: id   5 FALSE  pear  FALSE                                i love pear juice
2:  2   1 apple  pear banana i eat chinese pears and crab apples every sunday
3:  3  11    NA  pear banana                           i also like apple tart
4:  4  20 apple FALSE     NA                          i like crab apple juice
5:  5  21 FALSE FALSE banana                 i hate most fruit except bananas
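
For point (a), and to leave non-binary columns such as `id` untouched, one possible alternative is to detect the binary columns by their content and then update only those columns. The sketch below is not part of the original answer. It assumes the original character coding ("0", "1", NA) and uses a trimmed copy of the sample data; `is_binary` and `bincols` are names introduced here for illustration.

```r
library(data.table)

# Trimmed copy of the sample data (favfood omitted for brevity)
dt <- data.table(
  id     = c(1, 2, 3, 4, 5),
  age    = c("5", "1", "11", "20", "21"),
  apple  = c("0", "1", NA, "1", "0"),
  pear   = c("1", "1", "1", "0", "0"),
  banana = c("0", "1", "1", NA, "1")
)

# A column counts as binary if it is character and holds only "0", "1" or NA
is_binary <- function(x) is.character(x) && all(x %in% c("0", "1", NA))
bincols <- names(dt)[sapply(dt, is_binary)]

# Replace "1" with the column name, one column at a time, by reference
for (col in bincols) {
  set(dt, i = which(dt[[col]] == "1"), j = col, value = col)
}
```

Because each column is handled separately, "age" keeps its "1" and `id` stays numeric. Values of "0" and NA are left as they are here; whether they should be blanked out instead depends on what the wordcloud step expects. Note that a character column consisting only of NA would also pass the `is_binary` check.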
