Term split by hashtag of multiple words
Question
I am trying to split a term that contains a hashtag made of multiple words, such as "#I-am-great" or "#awesome-dayofmylife".
The output I am looking for is:
I am great
awesome day of my life
What I have been able to achieve so far is:
>>> import re
>>> name = "big #awesome-dayofmylife because #iamgreat"
>>> name = re.sub(r'#([^\s]+)', r'\1', name)
>>> print(name)
big awesome-dayofmylife because iamgreat
If you ask whether I have a list of possible words, the answer is no, so any guidance on that would be great. Any NLP experts?
Recommended answer
All the commenters above are correct, of course: a hashtag without spaces or other clear separators between the words (especially in English) is often ambiguous and cannot be parsed correctly in all cases.
However, the idea of the word list is rather simple to implement and might yield useful (albeit sometimes wrong) results nevertheless, so I implemented a quick version of that:
import re

# Small hand-made vocabulary; alternatives are tried in this order.
wordList = '''awesome day of my life because i am great something some
thing things unclear sun clear'''.split()
wordOr = '|'.join(wordList)  # regex alternation: "one of these words"

def splitHashTag(hashTag):
    # Find every run made up only of known words, e.g. "dayofmylife".
    for wordSequence in re.findall('(?:' + wordOr + ')+', hashTag):
        print(': ' + wordSequence)
        # Break the run into its individual words.
        print(' '.join(re.findall(wordOr, wordSequence)))

for hashTag in '''awesome-dayofmylife iamgreat something
somethingsunclear'''.split():
    print('### ' + hashTag)
    splitHashTag(hashTag)
This prints:
### awesome-dayofmylife
: awesome
awesome
: dayofmylife
day of my life
### iamgreat
: iamgreat
i am great
### something
: something
something
### somethingsunclear
: somethingsunclear
something sun clear
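And as you can see, it falls into the trap qstebom has set for it ;-) It reads "somethingsunclear" as "something sun clear", although the word list would also allow "some things unclear".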
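To make the ambiguity visible rather than silently picking one reading, one option is to enumerate every way the tag can be cut into known words. The following is only a sketch in Python 3, not part of the original answer: the helper name all_splits is made up for illustration, and it reuses the answer's word list as a set.

```python
def all_splits(text, words):
    """Yield every way to write `text` as a concatenation of known words."""
    if not text:
        # Nothing left to split: one valid (empty) continuation.
        yield []
        return
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in words:
            # Try every split of the remainder after this word.
            for rest in all_splits(text[end:], words):
                yield [head] + rest

# Same vocabulary as in the answer, as a set for fast lookups.
words = set('''awesome day of my life because i am great something some
thing things unclear sun clear'''.split())

for split in all_splits("somethingsunclear", words):
    print(" ".join(split))
```

With this word list the tag has three candidate splits, and choosing between them needs extra information (for example word frequencies) that the code does not have. The recursion can take exponential time on long inputs, so it is meant for short hashtags only.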
Some explanation of the code above:
The variable wordOr contains a string of all words, separated by a pipe symbol (|). In regular expressions that means "one of these words".
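One detail worth knowing about such a pattern is that Python's re module tries the alternatives from left to right and takes the first one that matches, so the order of the word list affects the result. A small sketch in Python 3, using a made-up three-word list (the names words, word_or and longest_first are illustrative, not from the answer):

```python
import re

# Hypothetical mini word list, just for illustration.
words = ["some", "something", "thing"]

# re.escape guards against words that contain regex metacharacters.
word_or = "|".join(re.escape(w) for w in words)
print(word_or)

# Alternatives are tried left to right and the first match wins,
# so "some" shadows "something" when it is listed first.
first_match = re.match(word_or, "something").group()
print(first_match)

# Listing longer words first lets the longest word win instead.
longest_first = "|".join(
    re.escape(w) for w in sorted(words, key=len, reverse=True)
)
longest_match = re.match(longest_first, "something").group()
print(longest_match)
```

The answer's word list happens to place "something" before "some", which is why its output starts with "something". Sorting by length is only one possible policy and does not remove the underlying ambiguity.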
The first findall gets a pattern which means "a sequence of one or more of these words", so it matches things like "dayofmylife". The findall finds all these sequences, so I iterate over them (for wordSequence in …). For each word sequence I then search for each single word in the sequence (also using findall) and print that word.
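The same two-step findall idea can be wrapped in a function that returns the words instead of printing them, which is easier to reuse. This is a sketch in Python 3 under a few assumptions: the name split_hashtag and its words parameter are invented here, and re.escape is added so that words containing regex metacharacters are treated literally.

```python
import re

def split_hashtag(hashtag, words):
    """Return one list of words per run of known words in `hashtag`."""
    word_or = "|".join(re.escape(w) for w in words)
    # Step 1: find each maximal run built only from known words.
    runs = re.findall("(?:" + word_or + ")+", hashtag)
    # Step 2: cut each run into its individual words.
    return [re.findall(word_or, run) for run in runs]

words = "awesome day of my life because i am great".split()

print(split_hashtag("awesome-dayofmylife", words))
print(split_hashtag("iamgreat", words))
```

Characters that are not part of any known word, such as the hyphen, simply end the current run, so each inner list corresponds to one ": ..." line in the printed output above.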