Regex to extract a set number of words around a matched word

Problem description

I was looking around for a way to grab the words around a found match, but the solutions I found were much too complicated for my case. All I need is a regex to grab, let's say, 10 words before and after a matched word. Would anybody be able to help me set up a pattern to do that?

For example, let's take the sentence (won't make sense):

    sentence = "The hairy yellow, stinkin' dog, sat round' the c4mpfir3 and ate the brown/yellow smore's that the kids(*adults) were makin."

And let's say we want to match 3 words before and after smore's (already cleaned to match). The output would be:

   "ate the brown/yellow smore's that the were"

Now let's take the example of taking one word before and after stinkin':

   "yellow, stinkin' dog"

Another example. "sat":

   "yellow, stinkin' dog, round' the and"

Let's make a new sentence now:

   sentence = "If the problem is still there after 30 minutes. Give up"

If I were trying to match the word "there" and take 2 words before and after, the output would be:

   "is still there after minutes"

I know it's not 10, but I think you get the example? If not, let me know and I will provide more. As I made this, I realized how much more I want than I originally thought. I'm rather new to regex, but I'm going to give the pattern a shot.

    ('[a-zA-Z\'.,/]{3}(word_to_match)[a-zA-Z\'.,/]{3}')

Thanks

Recommended answer

Here's a likely definition of "word": A string of non-space characters. Here's another: A string of letters and digits, but no punctuation. Python has convenient shortcuts for both.

\w is any "word" character in the second sense (letters and digits, plus the underscore), and \W is any other character. Use it like this, with grab standing in for the word you want to match:

    import re
    # sentence is the text to search; 'grab' is the matched word
    m = re.search(r'((\w+\W+){0,4}grab(\W+\w+){0,4})', sentence)
    print(m.group(1))
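
As a minimal sketch of how that pattern could be wrapped for reuse, assuming a hypothetical helper named words_around (not part of the original answer). It escapes the target word with re.escape and adds lookarounds so the target is not matched inside a longer word; both are my additions, not something the answer prescribes.

```python
import re

def words_around(sentence, word, n):
    # Hypothetical helper: up to n words (runs of \w) on each side of `word`.
    # (?<!\w) and (?!\w) keep `word` from matching inside a longer word.
    pattern = r'(?:\w+\W+){0,%d}(?<!\w)%s(?!\w)(?:\W+\w+){0,%d}' % (
        n, re.escape(word), n)
    m = re.search(pattern, sentence)
    return m.group(0) if m else None

sentence = "If the problem is still there after 30 minutes. Give up"
print(words_around(sentence, "there", 2))  # is still there after 30
```

Note that "30" counts as a word under this definition, so the result differs from the output the question asks for, which skips it.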

If you prefer the first definition, just use \S (any character that's not a space) and \s (any space character):

    re.search(r'((\S+\s+){0,4}grab(\s+\S+){0,4})', sentence)
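
To see what the non-space definition does with the sentence from the question, here is a small sketch. The helper name tokens_around is made up for illustration, and re.escape is my addition so that punctuation in the target word is treated literally.

```python
import re

def tokens_around(sentence, word, n):
    # Hypothetical helper: a "word" is any run of non-space characters,
    # so punctuation stays attached to its neighbours.
    pattern = r'(?:\S+\s+){0,%d}%s(?:\s+\S+){0,%d}' % (n, re.escape(word), n)
    m = re.search(pattern, sentence)
    return m.group(0) if m else None

sentence = ("The hairy yellow, stinkin' dog, sat round' the c4mpfir3 and ate "
            "the brown/yellow smore's that the kids(*adults) were makin.")
print(tokens_around(sentence, "smore's", 3))
# ate the brown/yellow smore's that the kids(*adults)
print(tokens_around(sentence, "stinkin'", 1))
# yellow, stinkin' dog,
```

With this definition "brown/yellow" and "kids(*adults)" each count as one word, and the trailing comma after "dog" is kept, so some cleanup may still be needed afterwards.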

You'll notice I'm matching zero to four words before and after. That way if your word is third in the sentence, you'll still get a match. (Searches are "greedy" so you'll always get four if it's possible).
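
The difference between an optional and a mandatory word count can be checked with a short sketch. It uses the second sentence from the question and the word "problem", which has only two words in front of it; the variable names are mine.

```python
import re

sentence = "If the problem is still there after 30 minutes. Give up"

# Optional context: up to four words on each side.
flexible = re.search(r'(?:\w+\W+){0,4}problem(?:\W+\w+){0,4}', sentence)
# Mandatory context: exactly four words on each side.
strict = re.search(r'(?:\w+\W+){4}problem(?:\W+\w+){4}', sentence)

print(flexible.group(0))  # If the problem is still there after
print(strict)             # None
```

The {0,4} version takes the two words that exist before "problem" and a full four after it, while the {4} version fails because four preceding words are not available.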
