R substr function on multiple columns


Question

I have 3 columns. The first column has a unique ID; the second and third columns hold string data and some NA values. I need to extract information from column 2 into separate columns, and then do the same for column 3. I am building a function, shown below, using for loops. I need to split each column after the third letter. For example, in the V1 column below, I need to break AAAbbb into AAA and bbb and put them in separate columns. I know I can use substr to do this. I am new to R, please help.

UID      V1       V2
Z001NL   AAAbbb   IADSFO
Z001NP   IADSFO   NA
Z0024G   SFOHNL   NLSFO0

Here is my code:

test=read.csv("c:/some/path/in/windows/test.csv", header=TRUE)

substring_it = function(test)
{
for(i in 1:3){
for(j in 2:3){
answer = transform(test, code 1 = substr((test[[j,i]]), 1, 3), code2 = substr((test[j,i]), 4, 6))

}
}
return(answer)

}

hello = substring_it(test)

test is the data frame that I read in.

I need this as output:

UID      V1.1   V1.2   V2.1   V2.2
Z001NL   AAA    bbb    IAD    SFO
Z001NP   IAD    SFO    NA     NA
Z0024G   SFO    HNL    NLS    SFO

Recommended answer

You can use sapply to apply a function to each element of a vector. That is useful here, because you can call sapply on the columns of your original data frame (test) to build the columns of the new data frame.

Here's a solution that does this:

test = data.frame(UID = c('Z001NL', 'Z001NP', 'Z0024G'), 
  V1 = c('AAAbbb', 'IADSFO', 'SFOHNL'),
  V2 = c('IADSFO', NA, 'NLSFO0'))

substring_it = function(x){
  # x is a data frame
  c1 = sapply(x[,2], function(x) substr(x, 1, 3))
  c2 = sapply(x[,2], function(x) substr(x, 4, 6))
  c3 = sapply(x[,3], function(x) substr(x, 1, 3))
  c4 = sapply(x[,3], function(x) substr(x, 4, 6))
  return(data.frame(UID=x[,1], c1, c2, c3, c4))
}

substring_it(test)
# returns:
#     UID  c1  c2   c3   c4
#1 Z001NL AAA bbb  IAD  SFO
#2 Z001NP IAD SFO <NA> <NA>
#3 Z0024G SFO HNL  NLS  FO0
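
Because substr is itself vectorized over its first argument and propagates NA, the sapply wrapper is not strictly required. The following is a minimal sketch of the same idea that calls substr directly on each column and uses the V1.1-style column names from the question. The function name substring_direct is made up for illustration and is not part of the original answer.

```r
# Same example data as above; stringsAsFactors = FALSE keeps the columns as
# character vectors regardless of the R version in use.
test = data.frame(UID = c('Z001NL', 'Z001NP', 'Z0024G'),
  V1 = c('AAAbbb', 'IADSFO', 'SFOHNL'),
  V2 = c('IADSFO', NA, 'NLSFO0'),
  stringsAsFactors = FALSE)

# substring_direct is a hypothetical name used only for this sketch.
substring_direct = function(x){
  # x is a data frame: column 1 is the ID, columns 2 and 3 are split.
  # substr() works on whole vectors and returns NA for NA input.
  data.frame(UID = x[,1],
    V1.1 = substr(x[,2], 1, 3),
    V1.2 = substr(x[,2], 4, 6),
    V2.1 = substr(x[,3], 1, 3),
    V2.2 = substr(x[,3], 4, 6),
    stringsAsFactors = FALSE)
}

result = substring_direct(test)
print(result)
```

Calling substr directly also sidesteps a side effect of sapply on character input: sapply attaches the input strings as names, and data.frame can pick those names up as row names.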

Here's a way to loop over columns if you have to do this many times. I'm not sure what order your original data frame's columns are in, or what order you want the new data frame's columns to end up in, so you may need to adjust the "pos" counter. I also assumed that the columns to be split are columns 2 through 201 ("colindex"), so you will probably have to change that range to match your data.

newcolumns = list()
pos = 1 #counter for column index of new data frame
for(colindex in 2:201){
    newcolumns[[pos]] = sapply(test[,colindex], function(x) substr(x, 1, 3))
    newcolumns[[pos+1]] = sapply(test[,colindex], function(x) substr(x, 4, 6))
    pos = pos+2
}
newdataframe = data.frame(UID = test[,1], newcolumns)
# update "names(newdataframe)" as needed
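
The same loop can also name the new columns as it creates them, which removes the need for the "pos" counter. This is a sketch under the assumption that every column except the first should be split into two pieces of equal width; the helper name split_columns and its cols and width arguments are hypothetical.

```r
test = data.frame(UID = c('Z001NL', 'Z001NP', 'Z0024G'),
  V1 = c('AAAbbb', 'IADSFO', 'SFOHNL'),
  V2 = c('IADSFO', NA, 'NLSFO0'),
  stringsAsFactors = FALSE)

# split_columns, cols and width are hypothetical names used for illustration.
split_columns = function(x, cols = names(x)[-1], width = 3){
  newcolumns = list()
  for(col in cols){
    values = as.character(x[[col]])  # also copes with factor columns
    # first `width` characters, then the following `width` characters
    newcolumns[[paste0(col, ".1")]] = substr(values, 1, width)
    newcolumns[[paste0(col, ".2")]] = substr(values, width + 1, 2 * width)
  }
  # the named list expands into one data frame column per element
  data.frame(UID = x[[1]], newcolumns, stringsAsFactors = FALSE)
}

newdataframe = split_columns(test)
print(names(newdataframe))
```

Indexing by column name rather than by position means the range does not have to be edited when the number of columns changes; pass a different cols vector if only some columns should be split.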
