Why is `str_extract` only catching some of these values?

Question

I have a table that has a "membership type" column that includes a zillion different membership levels that we've used over the years.

example <- data.frame(membership = c(
  "Legacy Payment ID #3564, Payment Record #0, Period Paid: 1 Flag: N",
  "Legacy Payment ID #3611, Payment Record #0, Period Paid: 2 Flag: N",
  "Legacy Payment ID #4105, Payment Record #0, Period Paid: 1 Flag: G",
  "Legacy Payment ID #4136, Payment Record #0, Period Paid: 1 Flag: N",
  "Legacy Payment ID #5191, Payment Record #0, Period Paid: 1 Flag: N ",
  "Individual (2 yr)",
  "Individual Producer (Yearly)",
  "Student Membership (Yearly)"
))

I expected that I could add a second column holding at least a rough set of possible values for the membership term, using str_extract:

library(stringr)
example$term <-  example$membership %>% 
  str_extract(c("Period Paid: 1","Period Paid: 2","Yearly", "2 yr"))

But that's only catching half the values and I can't find a pattern in what it is skipping.

1   Legacy Payment ID #3564, Payment Record #0, Period Paid: 1 Flag: N  Period Paid: 1
2   Legacy Payment ID #3611, Payment Record #0, Period Paid: 2 Flag: N  Period Paid: 2
3   Legacy Payment ID #4105, Payment Record #0, Period Paid: 1 Flag: G  NA
4   Legacy Payment ID #4136, Payment Record #0, Period Paid: 1 Flag: N  NA
5   Legacy Payment ID #5191, Payment Record #0, Period Paid: 1 Flag: N  Period Paid: 1
6   Legacy Payment ID #5238, Payment Record #0, Period Paid: 1 Flag: N  NA
7   Legacy Payment ID #5287, Payment Record #0, Period Paid: 1 Flag: N  NA
8   Legacy Payment ID #5306, Payment Record #0, Period Paid: 1 Flag: N  NA
9   Legacy Payment ID #5739, Payment Record #0, Period Paid: 2 Flag: G  NA
10  Individual (2 yr)                                                   NA
11  Individual Producer (Yearly)                                        Yearly
12  Student Membership (Yearly)                                         NA

The only difference between row 4 and row 5 is the Payment ID. Why is it only finding the search value in Row 5?

And how do I fix it? But mostly, why?

Recommended answer

We can collapse the patterns into a single regular expression with | (alternation):

library(stringr)
library(dplyr)

pattern_vec <- c("Period Paid: 1", "Period Paid: 2", "Yearly", "2 yr")

# Collapse the patterns into one regex so every row is tested against all of them
example %>%
  mutate(term = str_extract(membership, str_c(pattern_vec, collapse = "|")))
#                                                       membership           term
#1  Legacy Payment ID #3564, Payment Record #0, Period Paid: 1 Flag: N Period Paid: 1
#2  Legacy Payment ID #3611, Payment Record #0, Period Paid: 2 Flag: N Period Paid: 2
#3  Legacy Payment ID #4105, Payment Record #0, Period Paid: 1 Flag: G Period Paid: 1
#4  Legacy Payment ID #4136, Payment Record #0, Period Paid: 1 Flag: N Period Paid: 1
#5 Legacy Payment ID #5191, Payment Record #0, Period Paid: 1 Flag: N  Period Paid: 1
#6                                                   Individual (2 yr)           2 yr
#7                                        Individual Producer (Yearly)         Yearly
#8                                         Student Membership (Yearly)         Yearly
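
As a small illustration of why the collapsed pattern works, the sketch below shows the single regex string that str_c builds and applies it to a few values. The value "Lifetime" and the string containing "Period Paid: 3" are hypothetical inputs added for illustration, and the generalised pattern at the end is an assumption about how the idea might be extended, not part of the original answer.

```r
library(stringr)

pattern_vec <- c("Period Paid: 1", "Period Paid: 2", "Yearly", "2 yr")

# One length-1 pattern: the alternatives are joined with the regex OR operator.
single_pattern <- str_c(pattern_vec, collapse = "|")
print(single_pattern)

# A length-1 pattern is applied to every string, so no row is skipped.
# "Lifetime" is a hypothetical value that matches none of the alternatives.
x <- c("Individual (2 yr)", "Student Membership (Yearly)", "Lifetime")
terms <- str_extract(x, single_pattern)
print(terms)

# Assumed extension: use \\d+ so that any period number or year count matches.
general_pattern <- "Period Paid: \\d+|Yearly|\\d+ yr"
y <- c("Payment Record #0, Period Paid: 3 Flag: N", "Individual (2 yr)")
general_terms <- str_extract(y, general_pattern)
print(general_terms)
```

Because the patterns are pasted into a regular expression, any pattern containing regex metacharacters such as parentheses or dots would need escaping before being collapsed this way.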

str_extract is vectorized over both 'string' and 'pattern'. A single pattern is applied to every string, but when 'pattern' is a vector of length > 1 the match is done elementwise: the 1st value of 'membership' is matched against the 1st pattern, the 2nd against the 2nd, and so on. In the OP's case the lengths differ (8 rows in the example, 4 patterns), so the pattern vector is recycled: after row 4 it starts again from the first pattern. That is why row 5 is compared with "Period Paid: 1" and matches, while row 4 is compared with "2 yr" and returns NA.
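
The difference between a single pattern and a pattern vector can be seen on a tiny input. This is a minimal sketch; the `fruit` vector and the patterns are made up for illustration, and the pattern vectors are kept the same length as the input so the example does not depend on recycling.

```r
library(stringr)

fruit <- c("apple", "banana", "cherry")

# Length-1 pattern: tested against every element.
one_pattern <- str_extract(fruit, "an")
print(one_pattern)

# Pattern vector of the same length: element i is tested only against pattern i.
# "app" is looked for in all three strings, but only "apple" contains it.
same_each <- str_extract(fruit, c("app", "app", "app"))
print(same_each)

# Different pattern per element: each string meets only its own pattern,
# so "an" in "banana" would never be found by the pattern paired with "apple".
pairwise <- str_extract(fruit, c("an", "an", "rry"))
print(pairwise)
```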

To check the recycling, you can replicate pattern_vec explicitly with rep and compare the outputs:

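# Explicit recycling: repeat pattern_vec until there is one pattern per row
out1 <- example %>%
  mutate(term = str_extract(membership, rep(pattern_vec, length.out = n())))

# Implicit recycling, as in the question. As far as I know, stringr 1.5.0 and
# later follow the tidyverse recycling rules and raise an error here instead.
out2 <- example %>%
  mutate(term = str_extract(membership, pattern_vec))
identical(out1, out2)
#[1] TRUE  (with stringr versions that recycle silently)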



out1
#                                                           membership           term
#1  Legacy Payment ID #3564, Payment Record #0, Period Paid: 1 Flag: N Period Paid: 1
#2  Legacy Payment ID #3611, Payment Record #0, Period Paid: 2 Flag: N Period Paid: 2
#3  Legacy Payment ID #4105, Payment Record #0, Period Paid: 1 Flag: G           <NA>
#4  Legacy Payment ID #4136, Payment Record #0, Period Paid: 1 Flag: N           <NA>
#5 Legacy Payment ID #5191, Payment Record #0, Period Paid: 1 Flag: N  Period Paid: 1
#6                                                   Individual (2 yr)           <NA>
#7                                        Individual Producer (Yearly)         Yearly
#8                                         Student Membership (Yearly)           <NA>
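
To see exactly which pattern each row was compared against, the pairing can be laid out side by side. The sketch below rebuilds the example data and makes the recycling explicit with rep_len, so it should not depend on how the installed stringr version handles mismatched lengths. The names `paired_pattern` and `pairing` are introduced here for illustration only.

```r
library(stringr)

membership <- c(
  "Legacy Payment ID #3564, Payment Record #0, Period Paid: 1 Flag: N",
  "Legacy Payment ID #3611, Payment Record #0, Period Paid: 2 Flag: N",
  "Legacy Payment ID #4105, Payment Record #0, Period Paid: 1 Flag: G",
  "Legacy Payment ID #4136, Payment Record #0, Period Paid: 1 Flag: N",
  "Legacy Payment ID #5191, Payment Record #0, Period Paid: 1 Flag: N ",
  "Individual (2 yr)",
  "Individual Producer (Yearly)",
  "Student Membership (Yearly)"
)
pattern_vec <- c("Period Paid: 1", "Period Paid: 2", "Yearly", "2 yr")

# One pattern per row: the 4 patterns repeat from the start after row 4.
paired_pattern <- rep_len(pattern_vec, length(membership))

# Show each row number next to the only pattern it was tested against.
pairing <- data.frame(
  row = seq_along(membership),
  paired_pattern = paired_pattern,
  term = str_extract(membership, paired_pattern)
)
print(pairing)
```

Rows 3 and 4 are paired with "Yearly" and "2 yr", which is why they return NA even though they contain "Period Paid: 1", while row 5 is paired with "Period Paid: 1" again and matches.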

Note from the OP:

A post on RStudio Community that helped me (OP) understand the explanation above:

When fed with a single pattern, str_replace_all will compare that pattern against every element. However, if you pass it a vector, it will try to respect the order, so compare the first pattern with the first object, then the second pattern with the second object.
