Truncating the end of a string in R after a character that can be present zero or more times
Question
I have the following data:
temp<-c("AIR BAGS:FRONTAL" ,"SERVICE BRAKES HYDRAULIC:ANTILOCK",
"PARKING BRAKE:CONVENTIONAL",
"SEATS:FRONT ASSEMBLY:POWER ADJUST",
"POWER TRAIN:AUTOMATIC TRANSMISSION",
"SUSPENSION",
"ENGINE AND ENGINE COOLING:ENGINE",
"SERVICE BRAKES HYDRAULIC:ANTILOCK",
"SUSPENSION:FRONT",
"ENGINE AND ENGINE COOLING:ENGINE",
"VISIBILITY:WINDSHIELD WIPER/WASHER:LINKAGES")
I would like to create a new vector that retains only the text before the first ":" where a ":" is present, and the whole string where it is not.
I have tried using:
library(stringr)  # provides str_split()
temp=data.frame(matrix(unlist(str_split(temp,pattern=":",n=2)),
ncol=2, byrow=TRUE))
but it does not work in the cases where there is no ":".
I know this question is very similar to "truncate string from a certain character in R", which used:
sub("^[^.]*", "", x)
But I am not very familiar with regular expressions and have struggled to reverse that example so that it retains only the beginning of the string.
Recommended answer
You can solve this with a simple regex (x is the character vector, called temp in the question):
sub("(.*?):.*", "\\1", x)
[1] "AIR BAGS" "SERVICE BRAKES HYDRAULIC" "PARKING BRAKE" "SEATS"
[5] "POWER TRAIN" "SUSPENSION" "ENGINE AND ENGINE COOLING" "SERVICE BRAKES HYDRAULIC"
[9] "SUSPENSION" "ENGINE AND ENGINE COOLING" "VISIBILITY"
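As a quick check of the call above, here is a minimal sketch on three strings from the question's data, with two, one, and zero colons. The second call is an assumption on my part rather than part of the original answer: it deletes everything from the first colon onwards instead of capturing the text before it, and it should give the same result for this data.

```r
# Small sample taken from the question's data
x <- c("SEATS:FRONT ASSEMBLY:POWER ADJUST",
       "AIR BAGS:FRONTAL",
       "SUSPENSION")

# Pattern from the answer: lazy capture group, then the first colon and the rest
with_group <- sub("(.*?):.*", "\\1", x)

# Alternative sketch (not from the answer): drop the first colon and all that follows
without_group <- sub(":.*", "", x)

print(with_group)
print(without_group)
```

In both cases the string without a colon comes back unchanged, because sub() only rewrites elements in which the pattern matches.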
How the regex works:
"(.*?):.*" : Look for a repeated set of any characters (.*), but modify it with ? so that it is not greedy. This should be followed by a colon and then any character (repeated).
"\\1" : Substitute the entire string with the bit found inside the parentheses.
The bit to understand is that regex quantifiers are greedy by default. Making the first one non-greedy means the captured group stops at the first colon and so cannot include a colon, since the first character after the parentheses is a colon. The quantifier after the colon is back to the default, i.e. greedy, so it consumes the rest of the string. If a string contains no colon, the pattern does not match at all and sub() returns that string unchanged.
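To make the greedy versus non-greedy difference concrete, the following sketch runs both variants on one string from the question that contains two colons. The variable names greedy and lazy are only illustrative.

```r
x <- "SEATS:FRONT ASSEMBLY:POWER ADJUST"

# Greedy group: (.*) grabs as much as it can, so it runs up to the LAST colon
greedy <- sub("(.*):.*", "\\1", x)

# Non-greedy group: (.*?) grabs as little as it can, so it stops at the FIRST colon
lazy <- sub("(.*?):.*", "\\1", x)

print(greedy)  # "SEATS:FRONT ASSEMBLY"
print(lazy)    # "SEATS"
```

Only the non-greedy form gives the text before the first colon, which is what the question asks for.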