In R, split a character vector by a specific character; save 3rd piece in new vector


Question

I have a vector of data in the form 'aaa_9999_1', where the first part is an alpha location code, the second is the four-digit year, and the last is a unique point identifier. For example, there are multiple sil_2007_X points, each with a different last digit. I need to split this field on the "_" character and save only the unique ID number into a new vector. I tried:

oss$point <- unlist(strsplit(oss$id, split='_', fixed=TRUE))[3]

This was based on a response to the question "R remove part of string", but I get a single result of "1". If I just run

strsplit(oss$id, split='_', fixed=TRUE)

I can generate the split list:

> head(oss$point)
[[1]]
[1] "sil"  "2007" "1"   

[[2]]
[1] "sil"  "2007" "2"   

[[3]]
[1] "sil"  "2007" "3"   

[[4]]
[1] "sil"  "2007" "4"   

[[5]]
[1] "sil"  "2007" "5"   

[[6]]
[1] "sil"  "2007" "6"  

Adding [3] at the end just gives me the [[3]] result: "sil" "2007" "3". What I want is a vector of the third part (the unique number) of all records. I feel like I'm close to understanding this, but it is taking too much time (most of a day) on a deadline project. Thanks for any feedback.

Recommended answer

strsplit creates a list, so I would try the following:

lapply(strsplit(oss$id, split='_', fixed=TRUE), `[`, 3) ## Output a list
sapply(strsplit(oss$id, split='_', fixed=TRUE), `[`, 3) ## Output a vector (even though a list is also a vector)

Here `[` is the extraction function and 3 is passed to it as the index, so the third element of each list component is extracted. If you prefer a vector, substitute lapply with sapply.

Here is an example:

mystring <- c("A_B_C", "D_E_F")

lapply(strsplit(mystring, "_"), `[`, 3)
# [[1]]
# [1] "C"
# 
# [[2]]
# [1] "F"
sapply(strsplit(mystring, "_"), `[`, 3)
# [1] "C" "F"
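
To connect this back to the data in the question, here is a minimal sketch using a small made-up data frame named oss (the values are hypothetical; only the id format comes from the question). It shows why the original attempt returns a single "1", and it uses vapply, a stricter relative of sapply that checks that each result is a single character value. Note that a record with fewer than three pieces would yield NA rather than an error.

```r
# Hypothetical stand-in for the asker's data; only the id format is from the question.
oss <- data.frame(id = c("sil_2007_1", "sil_2007_2", "sil_2007_3"),
                  stringsAsFactors = FALSE)

parts <- strsplit(oss$id, split = "_", fixed = TRUE)

# unlist() flattens everything into one long vector, so [3] is only
# the third piece of the FIRST record.
flat <- unlist(parts)
flat[3]
# [1] "1"

# Take the third piece of every record instead.
oss$point <- vapply(parts, `[`, 3, FUN.VALUE = character(1))
oss$point
# [1] "1" "2" "3"
```

If oss$id were stored as a factor, it would need as.character() first, since strsplit only accepts character input.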

If there is an easily definable pattern, gsub might be a good option too, and it avoids splitting. See the comments from DWin and Josh O'Brien for improved (more robust) versions along the same lines.

gsub(".*_.*_(.*)", "\\1", mystring)
# [1] "C" "F"
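
The pattern above relies on greedy .* matches, which can also swallow underscores. As a hedged sketch (not necessarily the versions from the comments mentioned above), an anchored pattern built from [^_]* makes each piece explicit. The ids vector is made up for illustration.

```r
ids <- c("A_B_C", "sil_2007_12", "A_B_C_D")  # made-up examples

# Greedy pattern from the text: with four pieces it returns the LAST piece.
greedy <- gsub(".*_.*_(.*)", "\\1", ids)
greedy
# [1] "C"  "12" "D"

# Anchored pattern: always the third underscore-delimited piece.
anchored <- sub("^[^_]*_[^_]*_([^_]*).*$", "\\1", ids)
anchored
# [1] "C"  "12" "C"
```

With either pattern, a string that has fewer than two underscores does not match and is returned unchanged, so it is worth checking the input if that can happen.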

And finally, just for fun, you can expand on the unlist approach and make it work by recycling a vector of TRUEs and FALSEs to extract every third item (since we know in advance that all the splits will result in an identical structure).

unlist(strsplit(mystring, "_"), use.names = FALSE)[c(FALSE, FALSE, TRUE)]
# [1] "C" "F"
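
The recycling trick silently returns the wrong values if any element splits into a different number of pieces. A minimal sketch of a guard, assuming every record should have exactly three pieces (lengths() requires R 3.2.0 or later):

```r
mystring <- c("A_B_C", "D_E_F")
parts <- strsplit(mystring, "_", fixed = TRUE)

# Stop early if any record does not split into exactly three pieces;
# otherwise the recycled logical index would pick the wrong elements.
stopifnot(all(lengths(parts) == 3L))

third <- unlist(parts, use.names = FALSE)[c(FALSE, FALSE, TRUE)]
third
# [1] "C" "F"
```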

If you're not extracting by numeric position, but just want the last value after a delimiter, you have a few different alternatives.

Use a greedy regex:

gsub(".*_(.*)", "\\1", mystring)
# [1] "C" "F"

Use a convenience function like stri_extract* from the "stringi" package:

library(stringi)
stri_extract_last_regex(mystring, "[A-Z]+")
# [1] "C" "F"
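
For completeness, here is a base R sketch of the same "last piece" extraction without extra packages, using tail() on each split result. This is one possible alternative, not something from the original answer; the third element of mystring is added for illustration.

```r
mystring <- c("A_B_C", "D_E_F", "G_H")  # "G_H" added for illustration

# Split each string, then keep only the final piece of each result.
last_piece <- vapply(strsplit(mystring, "_", fixed = TRUE),
                     function(p) tail(p, 1),
                     FUN.VALUE = character(1))
last_piece
# [1] "C" "F" "H"
```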
