Create new column with dplyr mutate and substring of existing column

Problem description

I have a data frame with a column of strings and want to extract a substring of each string into a new column.

Here is some sample code and data. I want to take the string after the final underscore in the id column and use it to create a new_id column. Each id entry always contains exactly 2 underscores, and the substring I want is always the last one.

df = data.frame( id = I(c("abcd_123_ABC","abc_5234_NHYK")), x = c(1.0,2.0) )

require(dplyr)

df = df %>% dplyr::mutate(new_id = strsplit(id, split="_")[[1]][3])

I was expecting strsplit to act on each row in turn.

However, the new_id column contains ABC in every row, whereas I would like ABC in row 1 and NHYK in row 2. Do you know why this fails and how to achieve what I want?

Recommended answer

You can use stringr::str_extract:

library(stringr)

 df %>%
   dplyr::mutate(new_id = str_extract(id, "[^_]+$"))


#>              id x new_id
#> 1  abcd_123_ABC 1    ABC
#> 2 abc_5234_NHYK 2   NHYK

The regex says: match one or more (+) characters that are not _ (the negated character class [^_]), followed by the end of the string ($).
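As for why the original attempt put ABC in every row: strsplit is already vectorised and returns a list with one element per input string, so [[1]][3] picks the third piece of the first string only, and mutate recycles that single value down the column. The following is a minimal base R sketch of that behaviour and of two possible fixes; it uses the two sample ids from the question as a plain character vector, and the variable names (ids, parts, and so on) are made up for illustration.

```r
# The two sample ids from the question, as a plain character vector.
ids <- c("abcd_123_ABC", "abc_5234_NHYK")

# strsplit() returns a list: one character vector of pieces per input string.
parts <- strsplit(ids, split = "_")

# [[1]] selects the pieces of the FIRST string only, so [3] is a single value.
# Inside mutate() that length-1 result is recycled to every row.
first_row_only <- parts[[1]][3]

# Fix 1: take the third piece of every list element, giving one value per row.
new_id_split <- vapply(parts, function(p) p[3], character(1))

# Fix 2: a base R regex alternative that deletes everything up to and
# including the last underscore (the greedy .* runs to the final "_").
new_id_sub <- sub(".*_", "", ids)

print(first_row_only)
print(new_id_split)
print(new_id_sub)
```

Either expression should also work inside mutate in place of the original strsplit call, for example new_id = sub(".*_", "", id). The sub version does not depend on there being exactly two underscores, whereas the vapply version relies on the wanted piece always being the third.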
