Python regular expression for retweets

Question

I'm working on a regex that will extract retweet keywords and user names from tweets. Here's an example, with a rather terrible regex to do the job:

import re

tweet = 'foobar RT@one, @two: @three barfoo'
m = re.search(r'(RT|retweet|from|via)\b\W*@(\w+)\b\W*@(\w+)\b\W*@(\w+)\b\W*', tweet)
m.groups()
# ('RT', 'one', 'two', 'three')

What I'd like is to condense the repeated \b\W*@(\w+)\b\W* patterns and allow a variable number of them, so that if @four were added after @three, it would also be extracted. I've tried many permutations to repeat this with a +, without success.

I'd also like this to work for something like

tweet = 'foobar RT@one, RT @two: RT @three barfoo'

which can be achieved with re.finditer if the patterns don't overlap. (I have a version where the patterns do overlap, so only the first RT gets picked up.)

Any help is greatly appreciated. Thanks.

Answer

Try

(RT|retweet|from|via)(?:\b\W*@(\w+))+

Enclosing \b\W*@(\w+) in (?:...) allows you to group the terms for repetition without capturing the aggregate.
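
One caveat worth checking before relying on this: in Python's re module, a capturing group that sits inside a repeated group keeps only the text from its last iteration. A quick sketch using the sample tweet from the question (the variable names are just for illustration) shows what comes back:

```python
import re

tweet = 'foobar RT@one, @two: @three barfoo'

# The non-capturing group (?:...) repeats, but the inner (\w+) is a single
# group, so each iteration overwrites what the previous one captured.
m = re.search(r'(RT|retweet|from|via)(?:\b\W*@(\w+))+', tweet)

groups = m.groups()   # keyword plus only the LAST handle
whole = m.group(0)    # the overall match still spans all three handles

print(groups)
print(whole)
```

So the pattern does match the whole run of handles, but the individual names other than the last one are not available from m.groups().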

I'm not sure I'm following the second part of your question, but I think you may be looking for something involving a construct like:

(?:(?!RT|@).)

which will match any character that isn't an "@" or the start of "RT", again without capturing it.
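
To see what that construct does on its own, here is a small sketch; the sample string and the names filler and prefix are only for illustration. Repeated with *, it consumes text up to (but not including) the next "@" or "RT":

```python
import re

# (?!RT|@) is a negative lookahead: it refuses to step onto a position where
# "RT" or "@" begins. The dot then consumes one character; * repeats that.
filler = re.compile(r'(?:(?!RT|@).)*')

prefix = filler.match('foobar RT@one, @two').group()
print(repr(prefix))
```

This may be useful as the "gap" between handles if \W* turns out to be too permissive or too strict for real tweets.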

In that case, how about:

(RT|retweet|from|via)((?:\b\W*@\w+)+)

and then post-processing with

re.split(r'@(\w+)', m.groups()[1])

to get the individual handles?
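
Putting the two steps together, a minimal sketch might look like the following. Note two things that are my own choices rather than part of the answer above: the helper name extract_retweets is hypothetical, and the post-processing uses re.findall instead of re.split, because re.split with a capturing group also returns the separator text and empty strings between the handles.

```python
import re

# Group 1: the retweet keyword. Group 2: the whole run of "@handle" terms.
RETWEET = re.compile(r'(RT|retweet|from|via)((?:\b\W*@\w+)+)')

def extract_retweets(tweet):
    """Return a list of (keyword, [handles]) pairs found in the tweet."""
    results = []
    for m in RETWEET.finditer(tweet):
        # Second pass: pull each handle out of the captured run.
        handles = re.findall(r'@(\w+)', m.group(2))
        results.append((m.group(1), handles))
    return results

first = extract_retweets('foobar RT@one, @two: @three barfoo')
second = extract_retweets('foobar RT@one, RT @two: RT @three barfoo')
print(first)
print(second)
```

Because finditer only yields non-overlapping matches, the second tweet comes back as three separate keyword/handle pairs rather than one. As in the original pattern, the keywords are not anchored with a leading \b, so "from" or "via" inside a longer word would also match; that may or may not matter for real data.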
