Split a single column into multiple columns based on a set of values


Problem description


Edit: I've solved my initial problem using a suggestion from G. Grothendieck (thanks again), which was exactly the clean approach I was after. The initial post is below. In reality, though, my file is a little more subtle than this template.

It actually looks like this:

A1
100
200
txt 
A2
STRING
300
400
txt txt
txt
txt txt txt
A3
STRING
STRING
150
250
A2
.
.
.

A well-known STRING can appear right after an "A something" value: sometimes it does not occur, sometimes it occurs once, and sometimes several times. I didn't notice the multiple occurrences at first, so, thinking it only ever occurred once, I wrote a loop to handle it:

for (i in 1:nrow(raw_data)) {
  if (is.na(raw_data[i,2])) {
    raw_data <- raw_data[-c(i)]
  } else if (raw_data[i,2] == "STRING") {
    # shift the values left by one column, overwriting the STRING
    raw_data[i,2] = raw_data[i,3]
    raw_data[i,3] = raw_data[i,4]
    raw_data[i,4] = raw_data[i,5]
    raw_data[i,5] = raw_data[i,6]
    raw_data[i,6] = raw_data[i,7]
    raw_data[i,7] = raw_data[i,8]
    raw_data[i,8] = raw_data[i,9]
    raw_data[i,9] = raw_data[i,10]
    raw_data[i,10] = raw_data[i,12]
    # column 11 flags that the STRING was found in this row
    raw_data[i,11] = "Yes"
    if (is.na(raw_data[i,13])) {
      raw_data[i,12] = NA
    } else raw_data[i,12] = raw_data[i,13]
  }
}

Basically, I'm assigning "Yes" in column 11 to say that the string was found. I clearly should record the number of occurrences here instead of Yes/No (so 0 by default, then 1, 2, ...). All the other column values are shifted to the left so that they go back to the columns where they are expected to be.

How can I adapt this, if possible, to the fact that in reality I may have several occurrences of STRING? Might I have to change my approach entirely?

For those of you who like a challenge: I'm starting to wonder whether my processing is really efficient for this file. What about processing each line of the original file directly, since we know that anything like A1, A2, etc. should go in column 1, and so on?

Anyhow, thanks to those who will look into this and try :)


Initial post: I have a dataset in R that consists of a single column containing variables that I would ideally like in multiple columns. The structure is as follows:

A1
100
200
txt 
A2
300
400
txt txt
txt
txt txt txt
A3
150
250
A2
.
.
.

Ideally, this is the result I'm after:

A1 | 100 | 200 | txt  
A2 | 300 | 400 | txt txt | txt | txt txt
A3 | 150 | 250
A2 |  .  |  .  |  .

The set {A1;A2;A3} is known. The main difficulty I'm hitting right now is that the number of columns is unknown.

I've started by transposing my data, and was thinking of looping over the single row: each time I see one of the values in my set {A1;A2;A3}, I start a new row with this value in column 1, so that column 1 only contains {A1;A2;A3} values.

I'm convinced that there is a cleaner way of doing such a task.

Thanks ahead of time for your assistance with this!

Solution

Create a grouping variable g and use it with tapply to convert the data from long form to a list, v. Then convert each component of v to a "ts" object and cbind the "ts" objects together (since "ts" objects can be bound together and are automatically padded with NAs), transposing the result to give matrix m. Finally, convert m to a data.frame and apply type.convert to each column to fix the column types. The two lines marked ## can be omitted if a matrix, m, is sufficient as the answer.

No packages are used.

g <- cumsum(DF[[1]] %in% c("A1", "A2", "A3"))
v <- tapply(DF[[1]], g, c, simplify = FALSE)
m <- t(do.call(cbind, lapply(v, ts)))
DFout <- as.data.frame(m, stringsAsFactors = FALSE)   ##
DFout[] <- lapply(DFout, type.convert, as.is = TRUE)  ##

giving:

> DFout
  V1  V2  V3      V4   V5          V6
1 A1 100 200    txt  <NA>        <NA>
2 A2 300 400 txt txt  txt txt txt txt
3 A3 150 250    <NA> <NA>        <NA>
4 A2  NA  NA    <NA> <NA>        <NA>
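
The padding step can also be done without "ts" objects. The following is only a sketch of an alternative, not part of the original answer: it splits the column on the same grouping variable and pads every record to the longest length with `length<-`, which fills with NA. The names keys, n and m2 are introduced here for illustration, and DF is rebuilt inline so the snippet runs on its own.

```r
# Same values as the answer's DF, rebuilt here so the sketch is self-contained.
DF <- data.frame(V1 = c("A1", "100", "200", "txt ", "A2", "300", "400",
                        "txt txt", "txt", "txt txt txt", "A3", "150", "250", "A2"),
                 stringsAsFactors = FALSE)

keys <- c("A1", "A2", "A3")   # the known set that starts each record

# g increases by 1 at every key, so each record gets its own id.
g <- cumsum(DF[[1]] %in% keys)

# One character vector per record.
v <- split(DF[[1]], g)

# Pad every record to the length of the longest one; `length<-` fills with NA.
n <- max(lengths(v))
m2 <- t(sapply(v, `length<-`, n))
```

m2 is a character matrix with one row per record; the two lines marked ## in the answer could then be applied to it in the same way as to m.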

Note: The input in reproducible form is:

DF <- structure(list(V1 = c("A1", "100", "200", "txt ", "A2", "300", 
"400", "txt txt", "txt", "txt txt txt", "A3", "150", "250", "A2"
)), .Names = "V1", row.names = c(NA, -14L), class = "data.frame")
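
For the edited question, where STRING may occur zero, one or several times after a key, one possible adaptation is sketched below. It is an assumption-laden illustration rather than part of the original answer: it assumes every line exactly equal to "STRING" is a marker and never a real data value, counts the markers per record, and drops them before reshaping so that no left-shifting is needed. The input vector raw and the names is_marker, n_marker, n_string and out are hypothetical.

```r
# Hypothetical input modelled on the edited question.
raw <- c("A1", "100", "200", "txt",
         "A2", "STRING", "300", "400", "txt txt", "txt", "txt txt txt",
         "A3", "STRING", "STRING", "150", "250",
         "A2")
keys <- c("A1", "A2", "A3")

# Record id: increases by 1 at every key.
g <- cumsum(raw %in% keys)

# Assumption: any line equal to "STRING" is a marker, not data.
is_marker <- raw == "STRING"

# Number of markers in each record (0 when there is none).
n_marker <- as.vector(tapply(is_marker, g, sum))

# Drop the markers, then reshape the remaining values record by record.
v <- split(raw[!is_marker], g[!is_marker])
n <- max(lengths(v))
m <- t(sapply(v, `length<-`, n))

# One row per record, plus the marker count as an extra column.
out <- data.frame(m, n_string = n_marker, stringsAsFactors = FALSE)
```

Because the markers are removed before the reshape, the remaining values land in their expected columns directly, and n_string records the count instead of a Yes/No flag.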
