Python regular expression for retweets
Question
I'm working on a regex that will extract retweet keywords and user names from tweets. Here's an example, with a rather terrible regex to do the job:
import re
tweet = 'foobar RT@one, @two: @three barfoo'
m = re.search(r'(RT|retweet|from|via)\b\W*@(\w+)\b\W*@(\w+)\b\W*@(\w+)\b\W*', tweet)
m.groups()
('RT', 'one', 'two', 'three')
What I'd like is to condense the repeated \b\W*@(\w+)\b\W* patterns and make them variable in number, so that if @four were added after @three, it would also be extracted. I've tried many permutations to repeat this with a +, unsuccessfully.
I'd also like this to work for something like
tweet = 'foobar RT@one, RT @two: RT @three barfoo'
which can be achieved with re.finditer if the patterns don't overlap. (I have a version where the patterns do overlap, so only the first RT gets picked up.)
Any help is greatly appreciated. Thanks.
Recommended answer
Try
(RT|retweet|from|via)(?:\b\W*@(\w+))+
Enclosing the \b\W*@(\w+) in (?:...) allows you to group the terms for repetition without capturing the aggregate.
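One caveat is worth checking before relying on this pattern: in Python's re module, a capturing group that sits inside a repeated group keeps only its last match. The short sketch below runs the suggested pattern against the tweet from the question to show the effect; the variable names are only illustrative.

```python
import re

tweet = 'foobar RT@one, @two: @three barfoo'

# The inner (\w+) is repeated by the outer (?:...)+, so Python's re
# overwrites group 2 on every iteration and keeps only the final handle.
pattern = re.compile(r'(RT|retweet|from|via)(?:\b\W*@(\w+))+')
m = pattern.search(tweet)

print(m.group(0))   # the whole matched span: RT@one, @two: @three
print(m.groups())   # ('RT', 'three')
```

The whole run of handles is matched, but only the last one is captured, so extracting every handle takes a second step.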
I'm not sure I'm following the second part of your question, but I think you may be looking for something involving a construct like:
(?:(?!RT|@).)
which will match any character that isn't an "@" or the start of "RT", again without capturing it.
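As a small illustration of that construct, the sketch below repeats it with * so that it consumes text up to, but not including, the next "@" or "RT". The name filler is hypothetical, and the sample strings are made up for the demonstration.

```python
import re

# Each step consumes one character, but only if the text at that
# position does not start with "RT" or "@" (negative lookahead).
filler = re.compile(r'(?:(?!RT|@).)*')

print(repr(filler.match('foo bar RT@one').group(0)))  # 'foo bar '
print(repr(filler.match(', @two').group(0)))          # ', '
```

Because the lookahead is checked before every character, the match stops right in front of the next keyword or handle instead of running past it.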
In that case, how about:
(RT|retweet|from|via)((?:\b\W*@\w+)+)
and then post-processing with
re.split(r'@(\w+)', m.groups()[1])
to get the individual handles?
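The two-step idea can be sketched as follows. This is one possible reading rather than the answerer's exact code: it swaps re.split for re.findall, because re.split with a capturing group also returns the separator text between handles (for example ', ' and ': '). It also uses re.finditer so that the second tweet from the question is handled. The names RETWEET_RE, HANDLE_RE and extract_retweets are hypothetical.

```python
import re

# Group 1: the retweet keyword. Group 2: the whole run of "@handle" items.
RETWEET_RE = re.compile(r'(RT|retweet|from|via)((?:\b\W*@\w+)+)')
# Second pass: pull the individual handles out of group 2.
HANDLE_RE = re.compile(r'@(\w+)')

def extract_retweets(tweet):
    """Return a list of (keyword, [handles]) pairs, one per match."""
    return [(m.group(1), HANDLE_RE.findall(m.group(2)))
            for m in RETWEET_RE.finditer(tweet)]

print(extract_retweets('foobar RT@one, @two: @three barfoo'))
# [('RT', ['one', 'two', 'three'])]
print(extract_retweets('foobar RT@one, RT @two: RT @three barfoo'))
# [('RT', ['one']), ('RT', ['two']), ('RT', ['three'])]
```

Note that the keyword alternation has no leading \b, so a keyword embedded in a longer word (such as "via" inside another token) would also match if it were followed by a handle; adding \b in front is a possible tightening, depending on the data.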