Find subject in incomplete sentence with NLTK


Problem description


I have a list of products that I am trying to classify into categories. They will be described with incomplete sentences like:

"Solid State Drive Housing"

"Hard Drive Cable"

"1TB Hard Drive"

"500GB Hard Drive, Refurbished from Manufacturer"


How can I use python and NLP to get an output like "Housing, Cable, Drive, Drive", or a tree that describes which word is modifying which? Thank you in advance

Answer


NLP techniques are relatively ill equipped to deal with this kind of text.


Phrased differently: it is quite possible to build a solution which includes NLP processes to implement the desired classifier, but the added complexity doesn't necessarily pay off in terms of development speed or classifier precision.
If one really insists on using NLP techniques, POS tagging and its ability to identify nouns is the most obvious idea, but chunking and access to WordNet or other lexical sources are other plausible uses of NLTK.
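For what it's worth, a minimal sketch of that POS-tagging idea on the question's own examples might look like the following. The "last noun is the head word" rule is my assumption, and, per the caveats above, the tagger may well stumble on fragments like these:

```python
import nltk

# One-time model downloads on a fresh NLTK install:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

phrases = [
    "Solid State Drive Housing",
    "Hard Drive Cable",
    "1TB Hard Drive",
    "500GB Hard Drive, Refurbished from Manufacturer",
]

for phrase in phrases:
    tagged = nltk.pos_tag(nltk.word_tokenize(phrase))
    # Keep tokens tagged as nouns (NN, NNS, NNP, NNPS); heuristically,
    # the last noun of such a fragment is often its head word.
    nouns = [word for word, tag in tagged if tag.startswith("NN")]
    print(phrase, "->", nouns[-1] if nouns else None)
```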


Instead, an ad-hoc solution based on simple regular expressions and a few heuristics, such as those suggested by NoBugs, is probably an appropriate approach to the problem; a rough sketch follows the list below. Certainly, such solutions bear two main risks:

  • over-fitting to the portion of the text reviewed/considered in building the rules
  • possible messiness/complexity of the solution if too many rules and sub-rules are introduced.
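As promised above, here is a minimal sketch of such a regex-plus-heuristics classifier. The HEADS vocabulary and the "last match wins" rule are illustrative assumptions, not a vetted rule set:

```python
import re

# Hypothetical head-noun vocabulary; a real list would be distilled from
# reviewing a large sample of the actual product descriptions.
HEADS = ["housing", "enclosure", "cable", "drive", "adapter"]
HEAD_RE = re.compile(r"\b(" + "|".join(HEADS) + r")\b", re.IGNORECASE)

def classify(description):
    # Heuristic: the last known head noun in the phrase names the product.
    matches = HEAD_RE.findall(description)
    return matches[-1].capitalize() if matches else "Unknown"

print(classify("Solid State Drive Housing"))                        # Housing
print(classify("500GB Hard Drive, Refurbished from Manufacturer"))  # Drive
```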


Running some plain statistical analysis on the complete set (or a very big sample) of the texts to be considered should help guide the selection of a few heuristics and also avoid the over-fitting concerns. I'm quite sure that a relatively small number of rules, combined with a custom dictionary, should be sufficient to produce a classifier with appropriate precision as well as speed/resource performance.
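Such a frequency pass needs nothing beyond the standard library. A minimal sketch, using the question's four examples as a stand-in for the real corpus:

```python
from collections import Counter

# Stand-in corpus; in practice this would be the full list of descriptions.
descriptions = [
    "Solid State Drive Housing",
    "Hard Drive Cable",
    "1TB Hard Drive",
    "500GB Hard Drive, Refurbished from Manufacturer",
]

unigrams, bigrams = Counter(), Counter()
for text in descriptions:
    tokens = text.lower().replace(",", "").split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

# The most common words and pairs show where rules will pay off most.
print(unigrams.most_common(5))
print(bigrams.most_common(5))
```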

A few ideas:

  • count all the words (and possibly all the bi-grams and tri-grams) in a sizable portion of the corpus at hand. This info can drive the design of the classifier by allowing the most effort and the most rigid rules to be allocated to the most common patterns.
  • manually introduce a short dictionary which associates the most popular words with the following (a sketch of such a dictionary follows this list):
    • their POS function (mostly a binary matter here: i.e. nouns vs. modifiers and other non-nouns)
    • their synonym root [if applicable]
    • their class [if applicable]
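A minimal sketch of such a hand-built dictionary; the entries, synonym roots, and class names are purely illustrative assumptions:

```python
# Hypothetical hand-built lexicon: word -> (POS role, synonym root, class).
LEXICON = {
    "housing":   ("noun",     "enclosure", "Enclosure"),
    "enclosure": ("noun",     "enclosure", "Enclosure"),
    "cable":     ("noun",     "cable",     "Cable"),
    "drive":     ("noun",     "drive",     "Drive"),
    "hard":      ("modifier", None,        None),
    "solid":     ("modifier", None,        None),
    "state":     ("modifier", None,        None),
}

def head_class(description):
    # Heuristic: the last token acting as a noun determines the class.
    tokens = description.lower().replace(",", "").split()
    nouns = [LEXICON[t] for t in tokens
             if t in LEXICON and LEXICON[t][0] == "noun"]
    return nouns[-1][2] if nouns else None

print(head_class("Solid State Drive Housing"))  # Enclosure
print(head_class("1TB Hard Drive"))             # Drive
```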


I'm afraid this answer falls short of providing polished Python/NLTK snippets as a primer towards a solution (the sketches above are only rough starting points), but frankly such simple NLTK-based approaches are likely to be disappointing at best. Also, a much bigger sample set of the input text would be needed to guide the selection of plausible approaches, including ones based on NLTK or on NLP techniques at large.

