In R, split a character vector by a specific character; save 3rd piece in new vector
Question
I have a vector of data in the form ‘aaa_9999_1’ where the first part is an alpha-location code, the second is the four digit year, and the final is a unique point identifier. E.g., there are multiple sil_2007_X points, each with a different last digit. I need to split this field, using the "_" character and save only the unique ID number into a new vector. I tried:
oss$point <- unlist(strsplit(oss$id, split='_', fixed=TRUE))[3]
based on a response here: R remove part of string. I get a single response of "1". If I just run
strsplit(oss$id, split='_', fixed=TRUE)
I can generate the split list:
> head(oss$point)
[[1]]
[1] "sil" "2007" "1"
[[2]]
[1] "sil" "2007" "2"
[[3]]
[1] "sil" "2007" "3"
[[4]]
[1] "sil" "2007" "4"
[[5]]
[1] "sil" "2007" "5"
[[6]]
[1] "sil" "2007" "6"
Adding the [3] at the end just gives me the [[3]] result: "sil" "2007" "3". What I want is a vector of the 3rd part (the unique number) of all records. I feel like I’m close to understanding this, but it is taking too much time (like most of a day) on a deadline project. Thanks for any feedback.
Recommended answer
strsplit creates a list, so I would try the following:
lapply(strsplit(oss$id, split='_', fixed=TRUE), `[`, 3) ## Output a list
sapply(strsplit(oss$id, split='_', fixed=TRUE), `[`, 3) ## Output a vector (even though a list is also a vector)
The `[` means to extract the third element: it is R's subsetting function, and the 3 is passed to it as the index. If you prefer a vector, substitute sapply for lapply.
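To make the role of `[` a little more concrete, here is a minimal sketch showing that it is an ordinary function that can be called directly or handed to sapply. The vectors x and pieces are made-up illustrations, not the asker's data.

```r
# `[` is a regular function: calling it directly is the same as x[3]
x <- c("sil", "2007", "1")
direct  <- x[3]
as_call <- `[`(x, 3)

# Hypothetical split result, shaped like the output of strsplit()
pieces <- list(c("sil", "2007", "1"), c("sil", "2007", "2"))

# Passing `[` with the extra argument 3 ...
by_bracket <- sapply(pieces, `[`, 3)
# ... is shorthand for an explicit anonymous function
by_lambda  <- sapply(pieces, function(p) p[3])

by_bracket
# [1] "1" "2"
```

Both forms give the same result; the anonymous function is longer but may be easier to read if the backtick syntax is unfamiliar.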
Here is an example:
mystring <- c("A_B_C", "D_E_F")
lapply(strsplit(mystring, "_"), `[`, 3)
# [[1]]
# [1] "C"
#
# [[2]]
# [1] "F"
sapply(strsplit(mystring, "_"), `[`, 3)
# [1] "C" "F"
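Applied to data shaped like the question's, the same idea might look like the sketch below. The data frame here is a hypothetical stand-in for oss, and vapply is swapped in for sapply only because it checks that every result is a single string. The as.character() call is a precaution in case the id column is a factor, which strsplit would reject as a non-character argument.

```r
# Hypothetical stand-in for the asker's data frame `oss`
oss <- data.frame(id = c("sil_2007_1", "sil_2007_2", "sil_2007_3"),
                  stringsAsFactors = FALSE)

# Split each id on the literal "_" (fixed = TRUE means no regex is involved)
pieces <- strsplit(as.character(oss$id), split = "_", fixed = TRUE)

# Take the third piece of every id; FUN.VALUE asserts one string per id
oss$point <- vapply(pieces, `[`, FUN.VALUE = character(1), 3)

oss$point
# [1] "1" "2" "3"
```

If an id has fewer than three pieces, the third element comes back as NA rather than raising an error, so it may be worth checking the result with anyNA().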
If there is an easily definable pattern, gsub might be a good option too, and it avoids splitting. See the comments for improved (more robust) versions along the same lines from DWin and Josh O'Brien.
gsub(".*_.*_(.*)", "\\1", mystring)
# [1] "C" "F"
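One way the pattern could be tightened is sketched below. This is an assumption on my part and not necessarily what the comments proposed. It anchors the match at the start and uses [^_]* so that each group stops at the next underscore. The four-piece string "A_B_C_D" is added only to show where the two patterns differ.

```r
mystring <- c("A_B_C", "D_E_F", "A_B_C_D")

# Greedy version from above: with extra pieces it returns the LAST one
greedy <- gsub(".*_.*_(.*)", "\\1", mystring)
greedy
# [1] "C" "F" "D"

# Anchored version: always captures the third underscore-separated piece
third <- sub("^[^_]*_[^_]*_([^_]*).*$", "\\1", mystring)
third
# [1] "C" "F" "C"
```

Note that sub and gsub return the input unchanged when the pattern does not match, so strings with fewer than two underscores pass through as-is.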
And, finally, just for fun, you can expand on the unlist approach to make it work by recycling a vector of TRUEs and FALSEs to extract every third item (since we know in advance that all the splits will result in an identical structure).
unlist(strsplit(mystring, "_"), use.names = FALSE)[c(FALSE, FALSE, TRUE)]
# [1] "C" "F"
If you're extracting not by numeric position, but just looking to extract the last value after a delimiter, you have a few different alternatives.
Use a greedy regex:
gsub(".*_(.*)", "\\1", mystring)
# [1] "C" "F"
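Two base R variations on the same idea are sketched below, assuming every string is non-empty. The first deletes everything up to the last underscore instead of capturing what follows it. The second splits and takes the final piece, whatever the number of pieces. The two-piece string "G_H" is added only for illustration.

```r
mystring <- c("A_B_C", "D_E_F", "G_H")

# Remove everything up to and including the last "_" (".*" is greedy)
last_regex <- sub(".*_", "", mystring)

# Or split on the literal "_" and keep the final piece of each string
last_split <- vapply(strsplit(mystring, "_", fixed = TRUE),
                     function(p) p[length(p)],
                     FUN.VALUE = character(1))

last_regex
# [1] "C" "F" "H"
```

The two results are the same here; the split version would raise an error on an empty string, since there would be no last piece to return.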
Use a convenience function like stri_extract* from the "stringi" package:
library(stringi)
stri_extract_last_regex(mystring, "[A-Z]+")
# [1] "C" "F"