How to make a group for each word in a sentence?

Question

This may be a silly question but...

Say you have a sentence like:

The quick brown fox

Or you might get a sentence like:

The quick brown fox jumped over the lazy dog

The simple regexp (\w*) finds the first word "The" and puts it in a group.

For the first sentence, you could write (\w*)\s*(\w*)\s*(\w*)\s*(\w*)\s* to put each word in its own group, but that assumes you know the number of words in the sentence.

Is it possible to write a regular expression that puts each word in any arbitrary sentence into its own group? It would be nice if you could do something like (?:(\w*)\s*)* to have it group each instance of (\w*), but that doesn't work.

I am doing this in Python, and my use case is obviously a little more complex than "The quick brown fox", so it would be nifty if Regex could do this in one line, but if that's not possible then I assume the next best solution is to loop over all the matches using re.findall() or something similar.

Thanks for any insight you may have.

Edit: For completeness' sake, here's my actual use case and how I solved it with your help. Thanks again.

>>> import re
>>> s = '1 0 5 test1 5 test2 5 test3 5 test4 5 test5'
>>> s = re.match(r'^\d+\s\d+\s?(.*)', s).group(1)  # drop the two leading numbers
>>> print(s)
5 test1 5 test2 5 test3 5 test4 5 test5
>>> names = re.findall(r'\d+\s(\w+)', s)  # capture the word after each number
>>> print(names)
['test1', 'test2', 'test3', 'test4', 'test5']

Solution

I don't believe that it is possible. A regex has a fixed number of capture groups, one per pair of parentheses in the pattern. If you repeat a group, as in '((\w+)\s+){0,99}', each repetition just overwrites the same first and second groups, so you are left with the last capture only; no new group is created for each match found.
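
To illustrate the point about repeated groups, here is a minimal sketch. The pattern r"(?:(\w+)\s*)+" and the variable names are mine, chosen for the demonstration, not taken from the question. It shows that a repeated group keeps only its last capture, while re.findall returns every match.

```python
import re

sentence = "The quick brown fox"

# A repeated capture group is still a single group: every iteration
# overwrites the previous capture, so only the last word survives.
match = re.match(r"(?:(\w+)\s*)+", sentence)
last_only = match.groups()

# re.findall scans the whole string and returns every non-overlapping match.
words = re.findall(r"\w+", sentence)

print(last_only)  # ('fox',)
print(words)      # ['The', 'quick', 'brown', 'fox']
```

So if all you need is the list of words, re.findall (or re.finditer, if you also want match positions) is the usual way to get one result per word.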

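You could use str.split with an explicit separator, but that splits only on that one literal string, not on a class of characters. (Called with no arguments, str.split() does split on runs of whitespace, which may already be enough for plain sentences.)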

Instead, you can use re.split, which splits on a regular expression. Giving it '\s' splits on every single whitespace character; you probably want '\s+' so that each run of whitespace is consumed greedily as one separator.

>>> import re
>>> help(re.split)
Help on function split in module re:

split(pattern, string, maxsplit=0)
    Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.

>>> re.split(r'\s+', 'The   quick brown\t fox')
['The', 'quick', 'brown', 'fox']
>>>
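
One caveat worth sketching: if the input can start or end with whitespace, re.split returns empty strings at the edges. The sample text and variable names below are made up for illustration.

```python
import re

text = "  The   quick brown\t fox \n"

# re.split treats leading and trailing whitespace as separators too,
# so an empty string appears at each end of the result.
with_empties = re.split(r"\s+", text)

# str.split() with no arguments splits on runs of whitespace and
# drops the empty strings at the ends.
plain = text.split()

# Stripping the text first gives the same result with re.split.
stripped = re.split(r"\s+", text.strip())

print(with_empties)  # ['', 'The', 'quick', 'brown', 'fox', '']
print(plain)         # ['The', 'quick', 'brown', 'fox']
print(stripped)      # ['The', 'quick', 'brown', 'fox']
```

Which variant fits depends on whether the separators are really just whitespace; re.split earns its keep once the delimiter needs a genuine pattern.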
