Split a column of concatenated comma-delimited data and recode output as factors

Problem description

I am trying to clean up some data that was entered incorrectly. The question behind the variable allows multiple responses out of five choices, numbered 1 to 5. The data has been entered in the following manner (this is just an example; the actual data frame has many more variables and many more observations):

data
          V1
1    1, 2, 3
2    1, 2, 4
3 2, 3, 4, 5
4    1, 3, 4
5    1, 3, 5
6 2, 3, 4, 5

Here's some code to recreate that example data:

data = data.frame(V1 = c("1, 2, 3", "1, 2, 4", "2, 3, 4, 5", 
                         "1, 3, 4", "1, 3, 5", "2, 3, 4, 5"))

What I actually need is for the data to be treated as binary, like a set of "yes/no" questions, and entered in a data frame that looks more like this:

data
    V1.1  V1.2  V1.3  V1.4  V1.5
1      1     1     1    NA    NA
2      1     1    NA     1    NA
3     NA     1     1     1     1
4      1    NA     1     1    NA
5      1    NA     1    NA     1
6     NA     1     1     1     1

The actual variable names don't matter at the moment; I can easily fix that. It also doesn't matter much whether the missing elements are "0", "NA", or blank; again, that's something I can fix later.

I've tried using the transform function from the reshape package as well as a few different things with strsplit, but I can't get either to do what I am looking for. I've also looked at many other related questions on Stack Overflow, but they don't seem to be quite the same problem.

Recommended answer

You just need to write a function and use apply. First, some dummy data:

##Make sure you're not using factors
dd = data.frame(V1 = c("1, 2, 3", "1, 2, 4", "2, 3, 4, 5", 
                         "1, 3, 4", "1, 3, 5", "2, 3, 4, 5"), 
                     stringsAsFactors=FALSE)
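
The stringsAsFactors=FALSE argument matters mainly on R versions before 4.0.0, where data.frame() converted strings to factors by default. Since strsplit() expects character input, a column that has already been stored as a factor can be converted back first. The following is a minimal sketch; the `raw` data frame is a hypothetical example, not part of the original data.

```r
# Hypothetical data frame whose column was stored as a factor
raw <- data.frame(V1 = c("1, 2, 3", "2, 3, 4, 5"), stringsAsFactors = TRUE)

is.factor(raw$V1)   # TRUE

# Convert the factor back to plain character strings before splitting
raw$V1 <- as.character(raw$V1)
is.character(raw$V1)
```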

Next, create a function that takes in a row and transforms it as necessary:

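make_row = function(i, ncol=5) {
  ##Start with all zeros (could make the default NA if needed)
  m = numeric(ncol)
  ##Split on commas; as.numeric tolerates the leading spaces
  v = as.numeric(strsplit(i, ",")[[1]])
  ##Mark each selected choice with a 1
  m[v] = 1
  return(m)
}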
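
The question's desired output shows NA for unselected choices rather than 0. As the comment in the function hints, that only requires a different starting vector. Here is a sketch of such a variant; the name `make_row_na` is made up for illustration.

```r
# Hypothetical variant of make_row: unselected choices stay NA instead of 0
make_row_na <- function(i, ncol = 5) {
  m <- rep(NA_real_, ncol)                  # start with all NA
  v <- as.numeric(strsplit(i, ",")[[1]])    # positions of the selected choices
  m[v] <- 1                                 # mark the selected choices
  m
}

make_row_na("1, 2, 4")
```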

Then use apply and transpose the result:

t(apply(dd, 1, make_row))
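
Because the real data frame has many more variables, it may be more convenient to apply the function to one column at a time instead of to whole rows. The sketch below is a variant of the approach above, not the answer's exact code: it uses sapply on the single column, names the result columns in the V1.1 to V1.5 style from the question, and recodes them as factors. The "no"/"yes" labels are an assumption chosen for illustration.

```r
dd <- data.frame(V1 = c("1, 2, 3", "1, 2, 4", "2, 3, 4, 5",
                        "1, 3, 4", "1, 3, 5", "2, 3, 4, 5"),
                 stringsAsFactors = FALSE)

make_row <- function(i, ncol = 5) {
  m <- numeric(ncol)
  v <- as.numeric(strsplit(i, ",")[[1]])
  m[v] <- 1
  m
}

# sapply over one column returns a 5 x 6 matrix; transpose it to 6 x 5
res <- t(sapply(dd$V1, make_row, USE.NAMES = FALSE))
colnames(res) <- paste0("V1.", 1:5)

# Recode each indicator column as a factor (labels are an assumption)
out <- as.data.frame(res)
out[] <- lapply(out, factor, levels = c(0, 1), labels = c("no", "yes"))
str(out)
```

Repeating the sapply step for each multiple-response column and combining the pieces with cbind would extend this to the other variables.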
