Create new column with dplyr mutate and substring of existing column
Problem description
I have a data frame with a column of strings and want to extract a substring of each string into a new column.
Here is some sample code and data. I want to take the string after the final underscore character in the id column in order to create a new_id column. Each id entry always contains exactly 2 underscore characters, and it is always the final substring that I want.
library(dplyr)

df = data.frame(id = I(c("abcd_123_ABC", "abc_5234_NHYK")), x = c(1.0, 2.0))

# Attempt: take the third "_"-separated piece of each id
df = df %>% dplyr::mutate(new_id = strsplit(id, split = "_")[[1]][3])
I was expecting strsplit to act on each row in turn.
However, the new_id column contains ABC in every row, whereas I would like ABC in row 1 and NHYK in row 2. Do you know why this fails and how to achieve what I want?
Recommended answer
You can use stringr::str_extract:
library(stringr)
df %>%
dplyr::mutate(new_id = str_extract(id, "[^_]+$"))
#>              id x new_id
#> 1  abcd_123_ABC 1    ABC
#> 2 abc_5234_NHYK 2   NHYK
The regex says: match one or more (+) characters that are not _ (the negated character class [^_]), followed by the end of the string ($).
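As for why the original attempt fails, here is a minimal base R sketch (not part of the original answer). It assumes the ids are held in a plain character vector rather than the I() column from the question, and the variable names (pieces, first_only, last_piece, via_sub) are made up for illustration. strsplit is already vectorised and returns a list with one element per row, but [[1]][3] then picks only the third piece of the first row, and mutate recycles that single value down the whole column.

```r
# Illustrative ids, copied from the question's sample data
id <- c("abcd_123_ABC", "abc_5234_NHYK")

# strsplit returns a list with one character vector per input string
pieces <- strsplit(id, split = "_")

# [[1]] selects only the first row's pieces, so this is a single value;
# inside mutate() a length-1 result is recycled to every row
first_only <- pieces[[1]][3]

# Row-wise alternative: take the last piece of each list element
last_piece <- vapply(pieces, function(p) p[length(p)], character(1))

# Regex alternative in base R: the greedy ".*_" consumes everything up to
# and including the final underscore, leaving only the trailing substring
via_sub <- sub(".*_", "", id)

print(first_only)
print(last_piece)
print(via_sub)
```

Either last_piece-style or sub-style expression should be usable inside dplyr::mutate in place of the str_extract call, since both return one value per row.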