Parsing a tweet to extract hashtags into an array
Question
I am having a heck of a time taking the information in a tweet including hashtags, and pulling each hashtag into an array using Python. I am embarrassed to even put what I have been trying thus far.
For example, "I love #stackoverflow because #people are very #helpful!"
This should pull the 3 hashtags into an array.
Recommended answer
A simple regex should do the job:
>>> import re
>>> s = "I love #stackoverflow because #people are very #helpful!"
>>> re.findall(r"#(\w+)", s)
['stackoverflow', 'people', 'helpful']
Note, though, that as other answers point out, this may also match things that are not hashtags, such as a fragment identifier in a URL:
>>> re.findall(r"#(\w+)", "http://example.org/#comments")
['comments']
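If you want to stay with a regex, one way to avoid that false positive is to require that the `#` sits at the start of the string or directly after whitespace. The sketch below does this with a negative lookbehind. The names `HASHTAG_RE` and `find_hashtags` are made up for illustration, and `\w+` is only an approximation of what Twitter actually accepts as a hashtag.

```python
import re

# Assumption for this sketch: a hashtag is '#' followed by word characters,
# and the '#' must not be preceded by a non-whitespace character.
HASHTAG_RE = re.compile(r"(?<!\S)#(\w+)")

def find_hashtags(text):
    """Return hashtag names in order of appearance, without the leading '#'."""
    return HASHTAG_RE.findall(text)

print(find_hashtags("I love #stackoverflow because #people are very #helpful!"))
# ['stackoverflow', 'people', 'helpful']
print(find_hashtags("http://example.org/#comments"))
# []
```

Because the `#` in the URL follows a `/`, the lookbehind rejects it, while hashtags at the start of the text or after a space still match.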
So another simple solution would be the following (removes duplicates as a bonus):
>>> def extract_hash_tags(s):
...     return set(part[1:] for part in s.split() if part.startswith('#'))
...
>>> extract_hash_tags("#test http://example.org/#comments #test")
{'test'}
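Two caveats with the split-based version: trailing punctuation stays attached (`#helpful!` comes back as `helpful!`), and a set does not keep the order in which the tags appeared. A minimal sketch that addresses both, assuming you want a list in first-seen order, might look like this. The function name `extract_hash_tags_ordered` and the decision to strip every character in `string.punctuation` from the end are illustrative choices, not part of the original answer.

```python
import string

def extract_hash_tags_ordered(s):
    """Return unique hashtags (without '#') in the order they first appear."""
    tags = []
    for part in s.split():
        if part.startswith('#'):
            # Drop the leading '#' and any trailing punctuation such as '!' or ','
            tag = part[1:].rstrip(string.punctuation)
            if tag:  # skip a bare '#' or a token that was only punctuation
                tags.append(tag)
    # dict.fromkeys keeps insertion order (Python 3.7+), which removes duplicates
    return list(dict.fromkeys(tags))

print(extract_hash_tags_ordered("I love #stackoverflow because #people are very #helpful!"))
# ['stackoverflow', 'people', 'helpful']
print(extract_hash_tags_ordered("#test http://example.org/#comments #test"))
# ['test']
```

Returning a list rather than a set also matches the original request for an array.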