Truncating the end of a string in R after a character that can be present zero or more times


Question

I have the following data:

temp<-c("AIR BAGS:FRONTAL" ,"SERVICE BRAKES HYDRAULIC:ANTILOCK",
    "PARKING BRAKE:CONVENTIONAL",
    "SEATS:FRONT ASSEMBLY:POWER ADJUST",
    "POWER TRAIN:AUTOMATIC TRANSMISSION",
    "SUSPENSION",
    "ENGINE AND ENGINE COOLING:ENGINE",
    "SERVICE BRAKES HYDRAULIC:ANTILOCK",
    "SUSPENSION:FRONT",
    "ENGINE AND ENGINE COOLING:ENGINE",
    "VISIBILITY:WINDSHIELD WIPER/WASHER:LINKAGES")

I would like to create a new vector that keeps only the text before the first ":" when a ":" is present, and the whole string when there is no ":".

I have tried using:

library(stringr)  # provides str_split()
temp <- data.frame(matrix(unlist(str_split(temp, pattern = ":", n = 2)),
                          ncol = 2, byrow = TRUE))

but it does not work in the cases where there is no ":". Those elements split into a single piece, so the unlisted pieces no longer pair up into two columns.
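To make the failure concrete, here is a minimal sketch that uses base R's strsplit() as a stand-in for str_split() (the vector x and the helper names are made up for illustration). It shows that the split pieces have unequal lengths, and that taking the first piece of each element sidesteps the problem without a matrix.

```r
# Small illustrative sample; x is a hypothetical stand-in for the data above
x <- c("AIR BAGS:FRONTAL", "SUSPENSION", "SEATS:FRONT ASSEMBLY:POWER ADJUST")

# Split on a literal colon; the result is a list with one vector per element
pieces <- strsplit(x, ":", fixed = TRUE)

# Number of pieces per element: these differ, which is why
# matrix(unlist(...), ncol = 2) misaligns the rows
piece_counts <- lengths(pieces)
print(piece_counts)

# Keep only the first piece of each element
first_piece <- vapply(pieces, function(p) p[1], character(1))
print(first_piece)
```

This is a split-based alternative, not the approach taken in the answer below, and it assumes no element is an empty string.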

I know this question is very similar to "truncate string from a certain character in R", which used:

sub("^[^.]*", "", x)

But I am not very familiar with regular expressions and have struggled to reverse that example so that it keeps only the beginning of the string.

Recommended answer

You can solve this with a simple regex:

sub("(.*?):.*", "\\1", x)
 [1] "AIR BAGS"                  "SERVICE BRAKES HYDRAULIC"  "PARKING BRAKE"             "SEATS"                    
 [5] "POWER TRAIN"               "SUSPENSION"                "ENGINE AND ENGINE COOLING" "SERVICE BRAKES HYDRAULIC" 
 [9] "SUSPENSION"                "ENGINE AND ENGINE COOLING" "VISIBILITY"     


How the regex works:

  • "(.*?):.*" matches any run of characters (.*), made non-greedy by the ?, so the group in parentheses captures as little as possible. The group must be followed by a colon and then any remaining characters.
  • "\\1" replaces the whole match with the text captured inside the parentheses.
  • If the string contains no colon, the pattern does not match and sub() returns the string unchanged.
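Because everything from the first colon onward is discarded, the same result can also be obtained without a capture group. The sketch below is an alternative to the pattern in the answer, not a restatement of it; the vector x and the result names are made up for illustration.

```r
# Small illustrative sample; x is a hypothetical stand-in for the data above
x <- c("AIR BAGS:FRONTAL", "SUSPENSION", "SEATS:FRONT ASSEMBLY:POWER ADJUST")

# Pattern from the answer: capture lazily up to the first colon, keep the group
with_group <- sub("(.*?):.*", "\\1", x)

# Alternative: delete the first colon and everything after it.
# sub() replaces only the leftmost match, which starts at the first colon.
without_group <- sub(":.*", "", x)

print(without_group)
print(identical(with_group, without_group))
```

Strings without a colon are left untouched by both calls, since sub() returns its input unchanged when the pattern does not match.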

The key point is that regex quantifiers are greedy by default. Making the first .* non-greedy means the captured group stops at the first colon instead of extending to the last one. The .* after the colon is still greedy, so it consumes the rest of the string.
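The difference only shows up when a string contains more than one colon. Here is a small sketch comparing the two forms; the vector x and the result names are illustrative, and perl = TRUE is an assumption added here to pin down the regex engine (the answer's call does not use it).

```r
# Small illustrative sample; x is a hypothetical stand-in for the data above
x <- c("SEATS:FRONT ASSEMBLY:POWER ADJUST", "SUSPENSION:FRONT", "SUSPENSION")

# Greedy group: (.*) extends as far as it can, so it ends at the LAST colon
greedy <- sub("(.*):.*", "\\1", x, perl = TRUE)

# Lazy group: (.*?) stops as early as it can, so it ends at the FIRST colon
lazy <- sub("(.*?):.*", "\\1", x, perl = TRUE)

print(greedy)  # "SEATS:FRONT ASSEMBLY" "SUSPENSION" "SUSPENSION"
print(lazy)    # "SEATS" "SUSPENSION" "SUSPENSION"
```

With a single colon, or none, both patterns give the same result.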
