Term split by hashtag of multiple words


Problem description

I am trying to split a term which contains a hashtag of multiple words, such as "#I-am-great" or "#awesome-dayofmylife".
The output that I am looking for is:

 I am great
 awesome day of my life

What I have been able to achieve so far is:

 >>> import re
 >>> name = "big #awesome-dayofmylife because #iamgreat"
 >>> name =  re.sub(r'#([^\s]+)', r'\1', name)
 >>> print(name)
 big awesome-dayofmylife because iamgreat

If I am asked whether I have a list of possible words, the answer is "No", so any guidance on that would be great. Any NLP experts?

Recommended answer

All the commenters above are correct, of course: a hashtag without spaces or other clear separators between the words (especially in English) is often ambiguous and cannot be parsed correctly in all cases.

However, the word-list idea is rather simple to implement and might nevertheless yield useful (albeit sometimes wrong) results, so I implemented a quick version of it:

import re

wordList = '''awesome day of my life because i am great something some
thing things unclear sun clear'''.split()

# Alternation of all known words: "one of these words".
wordOr = '|'.join(wordList)

def splitHashTag(hashTag):
  # Find each run of one or more known words, then split that run into words.
  for wordSequence in re.findall('(?:' + wordOr + ')+', hashTag):
    print(':', wordSequence)
    print(' '.join(re.findall(wordOr, wordSequence)))

for hashTag in '''awesome-dayofmylife iamgreat something
somethingsunclear'''.split():
  print('###', hashTag)
  splitHashTag(hashTag)

This prints:

### awesome-dayofmylife
: awesome
awesome
: dayofmylife
day of my life
### iamgreat
: iamgreat
i am great
### something
: something
something
### somethingsunclear
: somethingsunclear
something sun clear

And as you can see, it falls into the trap qstebom has set for it ;-)
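To make the ambiguity concrete, here is a minimal sketch that lists every way a string can be written as a concatenation of words from the list. It is not part of the original answer; the function name `all_splits` is made up for illustration, and the word list is simply copied from the code above. The recursion can blow up on long inputs with large word lists, so treat it as a demonstration rather than a production approach.

```python
def all_splits(text, words):
    """Return every way to write text as a concatenation of entries in words."""
    if not text:
        return [[]]  # exactly one way to split the empty string: no words
    results = []
    for word in words:
        if text.startswith(word):
            # Keep this word and recurse on whatever remains after it.
            for rest in all_splits(text[len(word):], words):
                results.append([word] + rest)
    return results

# Same word list as in the answer's code.
word_list = '''awesome day of my life because i am great something some
thing things unclear sun clear'''.split()

for split in all_splits('somethingsunclear', word_list):
    print(' '.join(split))
```

With this word list the sketch finds three readings of "somethingsunclear", one of which is "some things unclear". Choosing between them needs information the word list does not contain, such as word frequencies.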

Some explanation of the code above:

The variable wordOr contains a string of all the words, separated by a pipe symbol (|). In regular expressions that means "one of these words".
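One detail worth knowing about such an alternation: Python's `re` module tries the alternatives from left to right and takes the first one that matches, not the longest. The sketch below uses a tiny hypothetical word list (not the one from the answer) to show how the order of the words changes the result, and how sorting longest-first is one possible way to prefer longer words.

```python
import re

# Hypothetical mini word list, chosen only to show the effect of ordering.
words = ['some', 'something', 'thing']

# Alternatives are tried left to right, so 'some' wins before
# 'something' is ever considered.
naive = '|'.join(words)
first_match = re.findall(naive, 'something')

# Sorting longest-first makes the regex prefer the longest word
# that matches at each position.
longest_first = '|'.join(sorted(words, key=len, reverse=True))
longest_match = re.findall(longest_first, 'something')

print(first_match)    # ['some', 'thing']
print(longest_match)  # ['something']
```

If the word list could contain characters that are special in regular expressions, passing each word through `re.escape` before joining would be a sensible precaution.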

The first findall gets a pattern which means "a sequence of one or more of these words", so it matches things like "dayofmylife". That findall finds all such sequences, so I iterate over them (for wordSequence in …). For each word sequence, I then search for each single word in the sequence (also using findall) and print that word.
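The same idea can be applied directly to the sentence from the question. The following is a minimal sketch, not the answer's own code: the helper names `split_hashtag` and `expand_hashtags` are made up here, the two findall passes are collapsed into one, and the word list is sorted longest-first as an extra assumption. Any characters that are not covered by the word list are silently dropped.

```python
import re

# Same word list as in the answer's code.
word_list = '''awesome day of my life because i am great something some
thing things unclear sun clear'''.split()

# Longest-first ordering; re.escape guards against regex metacharacters.
word_or = '|'.join(re.escape(w)
                   for w in sorted(word_list, key=len, reverse=True))

def split_hashtag(tag):
    """Return the known words found in tag, in order, joined by spaces."""
    return ' '.join(re.findall(word_or, tag, flags=re.IGNORECASE))

def expand_hashtags(text):
    """Replace each '#tag' in text with its split form."""
    return re.sub(r'#(\S+)', lambda m: split_hashtag(m.group(1)), text)

print(expand_hashtags('big #awesome-dayofmylife because #iamgreat'))
# big awesome day of my life because i am great
```

Passing a function as the replacement argument of `re.sub` lets each hashtag be rewritten individually. A hashtag containing no known words would be replaced by an empty string, so a real implementation would probably want a fallback that keeps the original text.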
