R: Split Variable Column into multiple (unbalanced) columns by comma
Question
I have a dataset of 25 variables and over 2 million observations. One of my variables is a combination of a few different "categories", which I want to split so that each column shows one category (similar to what split does in Stata). For example:
# Name   Age  Number  Events                        First
# Karen  24   8       Triathlon/IM,Marathon,10k,5k  0
# Kurt   39   2       Half-Marathon,10k             0
# Leah   18   0                                     1
I would like it to look like:
# Name   Age  Number  Events_1       Events_2  Events_3  Events_4  First
# Karen  24   8       Triathlon/IM   Marathon  10k       5k        0
# Kurt   39   2       Half-Marathon  10k       NA        NA        0
# Leah   18   0       NA             NA        NA        NA        1
I have looked through Stack Overflow but have not found anything that works (everything gives me an error of some sort). Any suggestions would be greatly appreciated.
Note: It may not be important, but the largest number of categories one person has is 19, so I would need to create Events_1:Events_19.
Comment: Previous Stack Overflow answers have suggested the separate function, but it does not seem to work with my dataset. When I call the function the program runs, but when it finishes nothing has changed: there is no output and no error. When I tried other suggestions from other threads, I received error messages. However, I finally got it to work by using the cSplit function. Thanks for the help!!!
Recommended answer
From Ananda's splitstackshape package:
library(splitstackshape)
cSplit(df, "Events", sep=",")
#    Name   Age  Number  First  Events_1       Events_2  Events_3  Events_4
#1:  Karen  24   8       0      Triathlon/IM   Marathon  10k       5k
#2:  Kurt   39   2       0      Half-Marathon  10k       NA        NA
#3:  Leah   18   0       1      NA             NA        NA        NA
Or using tidyr:
library(tidyr)
separate(df, 'Events', paste("Events", 1:4, sep="_"), sep=",", extra="drop")
#   Name   Age  Number  Events_1       Events_2  Events_3  Events_4  First
#1  Karen  24   8       Triathlon/IM   Marathon  10k       5k        0
#2  Kurt   39   2       Half-Marathon  10k       <NA>      <NA>      0
#3  Leah   18   0       NA             <NA>      <NA>      <NA>      1
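The call above hardcodes four target columns, while the question mentions up to 19 events per person. Below is a minimal sketch of one way to derive the column count from the data instead. It assumes a small stand-in data frame with a true missing value rather than the " NA" string in the posted data, and the names pieces, n_max, new_cols and wide are illustrative. Because separate() returns a new data frame rather than changing df in place, the result is assigned to a variable:

```r
# Stand-in data (assumption: Events is a character column with a real NA).
df <- data.frame(
  Name   = c("Karen", "Kurt", "Leah"),
  Age    = c(24L, 39L, 18L),
  Number = c(8L, 2L, 0L),
  Events = c("Triathlon/IM,Marathon,10k,5k", "Half-Marathon,10k", NA),
  First  = c(0L, 0L, 1L),
  stringsAsFactors = FALSE
)

# Count the comma-separated pieces in each row and take the largest count.
pieces   <- strsplit(as.character(df$Events), ",", fixed = TRUE)
n_max    <- max(lengths(pieces))
new_cols <- paste0("Events_", seq_len(n_max))

# Only run the tidyr step if the package is installed.
if (requireNamespace("tidyr", quietly = TRUE)) {
  # fill = "right" pads short rows with NA without emitting a warning.
  wide <- tidyr::separate(df, "Events", into = new_cols, sep = ",", fill = "right")
  print(wide)
}
```

With the real dataset, n_max would come out as 19 if the question's description holds, so no column count needs to be typed by hand.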
Using the data.table package:
library(data.table)
setDT(df)[,paste0("Events_", 1:4) := tstrsplit(Events, ",")][,-"Events", with=FALSE]
#    Name   Age  Number  First  Events_1       Events_2  Events_3  Events_4
#1:  Karen  24   8       0      Triathlon/IM   Marathon  10k       5k
#2:  Kurt   39   2       0      Half-Marathon  10k       NA        NA
#3:  Leah   18   0       1      NA             NA        NA        NA
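For comparison, the same reshaping can be sketched in base R without extra packages. This is not part of the original answer. It assumes a small stand-in data frame with a true missing value, and the names pieces, n_max, padded and wide are illustrative:

```r
# Stand-in data (assumption: Events is a character column with a real NA).
df <- data.frame(
  Name   = c("Karen", "Kurt", "Leah"),
  Age    = c(24L, 39L, 18L),
  Number = c(8L, 2L, 0L),
  Events = c("Triathlon/IM,Marathon,10k,5k", "Half-Marathon,10k", NA),
  First  = c(0L, 0L, 1L),
  stringsAsFactors = FALSE
)

# Split each value on commas; a missing value yields a single NA piece.
pieces <- strsplit(as.character(df$Events), ",", fixed = TRUE)
n_max  <- max(lengths(pieces))

# Pad every row to n_max entries: indexing past the end of a vector gives NA.
# trimws() removes stray spaces around each piece.
padded <- do.call(rbind, lapply(pieces, function(p) trimws(p)[seq_len(n_max)]))
colnames(padded) <- paste0("Events_", seq_len(n_max))

# Drop the original column and append the new ones.
wide <- cbind(
  df[setdiff(names(df), "Events")],
  as.data.frame(padded, stringsAsFactors = FALSE)
)
print(wide)
```

On 2 million rows this is likely to be slower than cSplit or tstrsplit, so it is meant mainly to show what the packaged functions do.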
Data
df <- structure(list(Name = structure(1:3, .Label = c("Karen", "Kurt",
"Leah "), class = "factor"), Age = c(24L, 39L, 18L), Number = c(8L,
2L, 0L), Events = structure(c(3L, 2L, 1L), .Label = c(" NA",
" Half-Marathon,10k", " Triathlon/IM,Marathon,10k,5k"
), class = "factor"), First = c(0L, 0L, 1L)), .Names = c("Name",
"Age", "Number", "Events", "First"), class = "data.frame", row.names = c(NA,
-3L))