How to gsub on the text between two words in R?


Question

I would like to place a \n before a specific unknown word in my text. I know that the first occurrence of the unknown word will be between "Tree" and "Lake".

Example text:

text
[1]  "TreeRULakeSunWater" 
[2]  "A B C D"

"Tree" and "Lake" never change, but the word between them always changes, so I cannot look for "RU" in my regex.

What I am currently doing:

if (grepl(".*Tree\\s*|Lake.*",  text)) { text <- gsub(".*Tree\\s*|Lake.*", "\n\\1", text)}

The problem with the above is that gsub replaces all of text and leaves just \nRU:

text
[1] "\nRU"

I have also tried:

if (grepl(".*Tree *(.*?) *Lake.*",  text)) { text <- gsub(".*Tree *(.*?) *Lake.*", "\n\\1", text)}

What I would like text to look like after gsub:

text
[1] "Tree \nRU LakeSunWater"
[2] "A B C D"

Following Wiktor Stribizew's comment, I was able to do a successful gsub:

gsub("Tree(\\w+)Lake", "Tree \n\\1 Lake", text)

But this only replaces occurrences where "RU" is between "Tree" and "Lake", which is the first occurrence of the unknown word. The unknown word (in this case "RU") will show up many times in the text, and I would like to place \n in front of every occurrence of "RU" when "RU" is a whole word.

New example text:

text
[1] "TreeRULakeSunWater"
[2] "A B C RU D"

What I want for the new example:

text
[1] "Tree \nRU LakeSunWater"
[2] "A B C \nRU D"

Any help will be appreciated. Please let me know if further information is needed.

Recommended answer

You need to find the unknown word between "Tree" and "Lake" first. You can use

unknown_word <- gsub(".*Tree(\\w+)Lake.*", "\\1", text)

The pattern matches any characters up to the last Tree in a string, then captures the unknown word (\w+ = one or more word characters) up to Lake, and then matches the rest of the string. gsub is applied to every string in the vector; elements that do not match the pattern are returned unchanged. You can access the first result with the [[1]] index.
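As a minimal sketch of this extraction step, assuming the two-element example vector from the question: because non-matching elements come back unchanged, it can help to keep only the elements that really contain Tree<word>Lake before taking the first one. The names `extracted` and `has_match` are illustrative and not part of the original answer.

```r
text <- c("TreeRULakeSunWater", "A B C RU D")

# sub() rewrites matching elements to the captured word and
# returns non-matching elements as they were.
extracted <- sub(".*Tree(\\w+)Lake.*", "\\1", text)

# Flag the elements that actually contain Tree<word>Lake
# (has_match is a hypothetical helper name).
has_match <- grepl("Tree\\w+Lake", text)

# Take the unknown word from the first matching element.
unknown_word <- extracted[has_match][1]
print(unknown_word)
```

With the example data this gives the same word as unknown_word[[1]] in the answer, but it would not silently pick up an unrelated string if the first element happened not to match.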

Then, once you know the word, replace it with

gsub(paste0("[[:space:]]*(", unknown_word[[1]], ")[[:space:]]*"), " \n\\1 ", text)

See the IDEONE demo.

Here, the pattern is [[:space:]]*( + unknown_word[[1]] + )[[:space:]]*. It matches zero or more whitespace characters on both sides of the unknown word, and the unknown word itself (captured into Group 1). In the replacement, the surrounding whitespace is collapsed to a single space on each side (or added if there was none), \n is placed before the word, and \\1 restores the unknown word. You may replace [[:space:]] with \\s.
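To make the constructed pattern concrete, here is a small sketch that assumes the unknown word has already been extracted as "RU". The third element, "TRUE", is an added illustration (not from the question) showing that this version also matches the word inside longer words.

```r
text <- c("TreeRULakeSunWater", "A B C RU D", "TRUE")
unknown_word <- "RU"  # assumed to have been extracted already

# Build the pattern; it becomes "[[:space:]]*(RU)[[:space:]]*".
pattern <- paste0("[[:space:]]*(", unknown_word, ")[[:space:]]*")

# Replace every occurrence with: space, newline, the word, space.
result <- gsub(pattern, " \n\\1 ", text)
print(result)
```

Since the word was captured with \w+, it contains only letters, digits, and underscores, so pasting it into the pattern needs no regex escaping in this sketch.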

Update

If you only need to add a newline before occurrences of RU that are whole words, use the \b word boundary:

> gsub(paste0("[[:space:]]*\\b(", unknown_word[[1]], ")\\b[[:space:]]*"), " \n\\1 ", text)
[1] "TreeRULakeSunWater" "A B C \nRU D"   
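The whole-word version leaves the occurrence between Tree and Lake untouched, while the output requested in the question changes both. One possible way to combine the two steps is sketched below; this is an assumption about how the pieces could fit together, not something stated in the original answer, and the names `hit` and `out` are illustrative.

```r
text <- c("TreeRULakeSunWater", "A B C RU D")

# Step 1: recover the unknown word from the first element
# that contains Tree<word>Lake.
hit <- grep("Tree\\w+Lake", text)[1]
unknown_word <- sub(".*Tree(\\w+)Lake.*", "\\1", text[hit])

# Step 2: put a newline before every whole-word occurrence.
out <- gsub(paste0("[[:space:]]*\\b(", unknown_word, ")\\b[[:space:]]*"),
            " \n\\1 ", text)

# Step 3: handle the occurrence between Tree and Lake.
# fixed = TRUE treats both strings literally (no regex).
out <- sub(paste0("Tree", unknown_word, "Lake"),
           paste0("Tree \n", unknown_word, " Lake"),
           out, fixed = TRUE)
print(out)
```

Step 3 uses sub rather than gsub, so only the first Tree<word>Lake occurrence in each string is rewritten; switch to gsub if that sequence can repeat within one string.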
