Extract a sample of words around a particular word using stringr in R

Question

I've seen a couple of similar questions posted on SO regarding this topic, but they are either poorly worded (example) or about a different language (example).

In my scenario, I consider everything that is surrounded by white space to be a word. Emoticons, numbers, strings of letters that aren't really words: I don't care. I just want to get some context around the string that was found, without having to read the entire file to figure out whether it's a valid match.

I tried using the following, but it takes a while to run if you've got a long text file:

text <- "He served both as Attorney General and Lord Chancellor of England. After his death, he remained extremely influential through his works, especially as philosophical advocate and practitioner of the scientific method during the scientific revolution. Bacon has been called the father of empiricism.[6] His works argued for the possibility of scientific knowledge based only upon inductive and careful observation of events in nature. Most importantly, he argued this could be achieved by use of a skeptical and methodical approach whereby scientists aim to avoid misleading themselves. While his own practical ideas about such a method, the Baconian method, did not have a long lasting influence, the general idea of the importance and possibility of a skeptical methodology makes Bacon the father of scientific method. This marked a new turn in the rhetorical and theoretical framework for science, the practical details of which are still central in debates about science and methodology today. Bacon was knighted in 1603 and created Baron Verulam in 1618[4] and Viscount St. Alban in 1621;[3][b] as he died without heirs, both titles became extinct upon his death. Bacon died of pneumonia in 1626, with one account by John Aubrey stating he contracted the condition while studying the effects of freezing on the preservation of meat."

stringr::str_extract(text, "(.*?\\s){1,10}Verulam(\\s.*?){1,10}")

I'm assuming there is a much faster, more efficient way to do this, yes?

Recommended answer

Try this:

stringr::str_extract(text, "([^\\s]+\\s){3}Verulam(\\s[^\\s]+){3}")
# alternately, if you like " " more than \\s:
# stringr::str_extract(text, "(?:[^ ]+ ){3}Verulam(?: [^ ]+){3}")

#[1] "and created Baron Verulam in 1618[4] and"

Change the number inside the {} to suit your needs.
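
If the number of context words changes often, the pattern can be built from parameters instead of edited by hand. The sketch below is only an illustration: `words_around`, its arguments, and `sample_text` are hypothetical names that do not appear in the original answer. It uses `{0,n}` rather than `{n}` so that a keyword near the start or end of the text still matches, and it wraps the keyword in `\Q...\E` so that it is treated literally (this assumes the keyword itself does not contain `\E`). Note that the keyword is matched as a substring, so "Verulam" would also match inside a longer word.

```r
library(stringr)

# Hypothetical helper (not from the original answer): return `keyword`
# together with up to `before` words preceding it and up to `after`
# words following it, where a "word" is any run of non-whitespace.
words_around <- function(text, keyword, before = 3, after = 3) {
  pattern <- paste0(
    "(?:\\S+\\s+){0,", before, "}",  # up to `before` words, each followed by whitespace
    "\\Q", keyword, "\\E",           # the keyword, quoted so it is matched literally
    "(?:\\s+\\S+){0,", after, "}"    # up to `after` words, each preceded by whitespace
  )
  str_extract_all(text, pattern)[[1]]
}

# Shortened sample taken from the question's text
sample_text <- "Bacon was knighted in 1603 and created Baron Verulam in 1618[4] and Viscount St. Alban in 1621"

middle <- words_around(sample_text, "Verulam")
at_start <- words_around(sample_text, "Bacon", before = 3, after = 2)
print(middle)
print(at_start)
```

Because `str_extract_all` is used, the helper returns every non-overlapping match, which is handy when the keyword occurs more than once in a file.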

You can use non-capturing (?:) groups, too, though I'm not sure yet whether that will improve speed.

stringr::str_extract(text, "(?:[^\\s]+\\s){3}Verulam(?:\\s[^\\s]+){3}")
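
One way to check the speed question is to time both variants on a long input. The following is a rough sketch, not a rigorous benchmark: `long_text` is synthetic data invented for illustration, the repetition counts are arbitrary, and the timings will depend on your data, machine, and stringr version. It relies only on base R's `system.time()`.

```r
library(stringr)

# Assumption: a synthetic long input made by repeating a filler sentence,
# with the sentence of interest placed in the middle.
filler <- str_c(rep("lorem ipsum dolor sit amet", 20000), collapse = " ")
long_text <- str_c(filler, " and created Baron Verulam in 1618[4] and ", filler)

capturing     <- "([^\\s]+\\s){3}Verulam(\\s[^\\s]+){3}"
non_capturing <- "(?:[^\\s]+\\s){3}Verulam(?:\\s[^\\s]+){3}"

# Both patterns should extract the same text
res_capturing     <- str_extract(long_text, capturing)
res_non_capturing <- str_extract(long_text, non_capturing)
print(identical(res_capturing, res_non_capturing))

# Time several repeated runs of each pattern
print(system.time(for (i in 1:10) str_extract(long_text, capturing)))
print(system.time(for (i in 1:10) str_extract(long_text, non_capturing)))
```

If the two timings are close, the choice between capturing and non-capturing groups is mostly a matter of readability for this use case.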
