R substr function on multiple columns
Problem description
I have 3 columns. The first column has a unique ID; the second and third columns have string data and some NA values. I need to extract information from column 2 and put it in separate columns, and do the same for column 3. I am building a function, shown below, that uses for loops. I need to split the columns after the third letter. For example, in the V1 column below, I need to break AAAbbb into AAA and bbb and put them in separate columns. I know I can use substr to do this. I am new to R, please help.
UID    * V1     * V2
Z001NL * AAAbbb * IADSFO
Z001NP * IADSFO * NA
Z0024G * SFOHNL * NLSFO0
Here is my code.
test=read.csv("c:/some/path/in/windows/test.csv", header=TRUE)
substring_it = function(test)
{
for(i in 1:3){
for(j in 2:3){
answer = transform(test, code 1 = substr((test[[j,i]]), 1, 3), code2 = substr((test[j,i]), 4, 6))
}
}
return(answer)
}
hello = substring_it(test)
test is the data frame that I will read in.
I need this as my output:
UID    * V1.1 * V1.2 * V2.1 * V2.2
Z001NL * AAA  * bbb  * IAD  * SFO
Z001NP * IAD  * SFO  * NA   * NA
Z0024G * SFO  * HNL  * NLS  * SFO
Recommended answer
You can use sapply to apply a function to each element of a vector. This could be useful here, since you could use sapply on the columns of your original data frame (test) to create the columns for your new data frame.
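As a small illustration of that idea, the sketch below runs substr through sapply on a single character vector. The vector codes is a hypothetical stand-in for one column of test. Since substr is itself vectorized, calling it directly on the vector gives the same values, so sapply is a convenience here rather than a requirement.

```r
# Hypothetical stand-in for one column of the data frame
codes <- c("AAAbbb", "IADSFO", NA)

# sapply calls the function once per element; USE.NAMES = FALSE keeps
# the result from being named after the input strings
first_part <- sapply(codes, function(s) substr(s, 1, 3), USE.NAMES = FALSE)

# substr is vectorized, so a direct call works on the whole vector too
second_part <- substr(codes, 4, 6)

first_part   # "AAA" "IAD" NA
second_part  # "bbb" "SFO" NA
```

Missing values pass through unchanged: substr returns NA for an NA input, so no special handling is needed for the NA entries.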
Here is a solution:
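test = data.frame(UID = c('Z001NL', 'Z001NP', 'Z0024G'),
                  V1 = c('AAAbbb', 'IADSFO', 'SFOHNL'),
                  V2 = c('IADSFO', NA, 'NLSFO0'),
                  stringsAsFactors = FALSE)
substring_it = function(x){
  # x is a data frame: ID in column 1, strings to split in columns 2 and 3
  # USE.NAMES = FALSE keeps sapply from naming results after the input
  # strings, which would otherwise end up as row names of the output
  c1 = sapply(x[,2], function(s) substr(s, 1, 3), USE.NAMES = FALSE)
  c2 = sapply(x[,2], function(s) substr(s, 4, 6), USE.NAMES = FALSE)
  c3 = sapply(x[,3], function(s) substr(s, 1, 3), USE.NAMES = FALSE)
  c4 = sapply(x[,3], function(s) substr(s, 4, 6), USE.NAMES = FALSE)
  return(data.frame(UID=x[,1], c1, c2, c3, c4))
}
substring_it(test)
# returns:
#      UID  c1  c2   c3   c4
# 1 Z001NL AAA bbb  IAD  SFO
# 2 Z001NP IAD SFO <NA> <NA>
# 3 Z0024G SFO HNL  NLS  FO0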
Here's a way to loop over columns if you have to do this many times. I'm not sure what order your original data frame's columns are in or what order you want the new data frame's columns to end up in, so you may need to adjust the "pos" counter. I also assumed the columns to be split are columns 2 through 201 (the "colindex" range), so you'll probably have to change that.
newcolumns = list()
pos = 1  # counter for column index of new data frame
for(colindex in 2:201){  # assumed range of columns to split; adjust to your data
  # first three characters, then characters four through six, of each value
  newcolumns[[pos]] = sapply(test[,colindex], function(x) substr(x, 1, 3), USE.NAMES = FALSE)
  newcolumns[[pos+1]] = sapply(test[,colindex], function(x) substr(x, 4, 6), USE.NAMES = FALSE)
  pos = pos+2
}
newdataframe = data.frame(UID = test[,1], newcolumns)
# update "names(newdataframe)" as needed
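If the goal is the V1.1 / V1.2 naming from the question, the loop can also build the column names as it goes. The following is a minimal sketch under stated assumptions, not the code from the answer above: the helper name split_columns and its arguments cols and width are hypothetical, the first column is assumed to hold the ID, and substr is called directly on each column because it is vectorized.

```r
# Hypothetical helper: split each selected column into two parts after
# the first `width` characters, naming the parts <col>.1 and <col>.2
split_columns <- function(df, cols = names(df)[-1], width = 3) {
  out <- df[1]  # assumption: the first column is the ID column
  for (col in cols) {
    values <- as.character(df[[col]])  # also handles factor columns
    out[[paste0(col, ".1")]] <- substr(values, 1, width)
    out[[paste0(col, ".2")]] <- substr(values, width + 1, 2 * width)
  }
  out
}

test <- data.frame(UID = c("Z001NL", "Z001NP", "Z0024G"),
                   V1  = c("AAAbbb", "IADSFO", "SFOHNL"),
                   V2  = c("IADSFO", NA, "NLSFO0"),
                   stringsAsFactors = FALSE)

result <- split_columns(test)
names(result)  # "UID" "V1.1" "V1.2" "V2.1" "V2.2"
```

Deriving each new name from the source column name avoids a separate renaming step and keeps the output order tied to the order of cols.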