Some NLP stuff to do with grammar, tagging, stemming, and word sense disambiguation in Python

Question

Seeking advice on an optimal solution to an odd requirement. I'm a (literature) student in my fourth year of college with only my own guidance in programming. I'm competent enough with Python that I won't have trouble implementing solutions I find (most of the time) and developing upon them, but because of my newbness, I'm seeking advice on the best ways I might tackle this peculiar problem.

Already using NLTK, but differently from the examples in the NLTK book. I'm already utilizing a lot of stuff from NLTK, particularly WordNet, so that material is not foreign to me. I've read most of the NLTK book. I'd know better how to proceed if I were trying to analyze existing texts or if the target texts were prose-like -- but my application is focused on poetry, particularly on building poetic texts on-the-fly, based on unforeseeable inputs from users.

I'm working with fragmentary, atomic language. My application moves word-by-word: each round, several users put in words (one word per user). My program seeks to unify or combine these input words to produce a single output word. I've developed the word-selection algorithm already -- it utilizes various features of WordNet to come up with its single-word result. The result is in the form of a WordNet synset -- an uninflected word (stripped of plurality and tense). The result gets appended to the text of the "poem" (after some whitespace). The addition of the resulting word influences the users' choices of what word to throw into the pot next, and that's how this game/program moves along, adding one machine-morphed word to the poem at a time.

The problem: How to inflect the result in a grammatically sensible way? Without any kind of grammatical processing, the results are just a list of dictionary-searchable words, without agreement between words. First step is for my application to stem/pluralize/conjugate/inflect root-words according to context. (The "root words" I'm speaking of are synsets from WordNet and/or their human-readable equivalents.) Imagining that there were already some grammatically sensible text in the poem to start off with, my application needs to inflect a new result-word to agree with the existing sequence. It's fine if this is only working on like a 3-word window or something, but I'm looking for advice on an optimal order of operations. I'm hoping that somebody can give me some pointers (I expect it to be difficult to implement, but I want to make sure I'm starting off with the right ideas).

Let's assume we already have a chunk of a poem, to which users are adding new inputs. The new results need to be inflected in a grammatically sensible way.

The river bears no empty bottles, sandwich papers,   
Silk handkerchiefs, cardboard boxes, cigarette ends  
Or other testimony of summer nights. The nymphs

Let's say my algorithm has taken a batch of inputs from users, and now needs to print 1 of the 4 possible next words/synsets (informally represented): ['departure', 'to have', 'blue', 'quick']. It seems to me that 'blue' should be discarded; 'The nymphs blue' seems grammatically odd/unlikely. From there it could use either of these verbs.

If it picks 'to have' the result could be sensibly inflected as 'had', 'have', 'having', 'will have', 'would have', etc. (but not 'has'). (The resulting line would be something like 'The nymphs have' and the sensibly-inflected result will provide better context for future results ...)

I'd like for 'departure' to be a valid possibility in this case; while 'The nymphs departure' doesn't make sense (it's not "nymphs'"), 'The nymphs departed' (or other verb conjugations) would.

Seemingly 'The nymphs quick' wouldn't make sense, but something like 'The nymphs quickly [...]' or 'The nymphs quicken' could, so 'quick' is also a possibility for sensible inflection.

  1. Tag part of speech, plurality, tense, etc. of the original inputs. Taking note of this could help to select from the several possibilities (e.g. choosing between had/have/having could be more directed than random if a user had inputted 'having' rather than some other tense). I've heard the Stanford POS tagger is good, and NLTK has an interface to it. I am not sure how to handle tense detection here.
  2. Consider context in order to rule out grammatically peculiar possibilities. Consider the last couple words and their part-of-speech tags (and tense?), as well as sentence boundaries if any, and from that, drop things that wouldn't make sense. After 'The nymphs' we don't want another article (or determiner, as far as I can tell), nor an adjective, but an adverb or verb could work. Comparison of the current stuff with sequences in tagged corpora (and/or Markov chains?) -- or consultation of grammar-checking functions -- could provide a solution for this.
  3. Select a word from the remaining possibilities (those that could be inflected sensibly). This isn't something I need an answer for -- I've got my methods for this. Let's say it's randomly selected.
  4. Transform the selected word as needed. If the information from #1 can be folded in (for example, perhaps the "pluralize" flag was set to True), do so. If there are several possibilities (e.g. the picked word is a verb, but a few tenses are possible), select randomly. Regardless, I'm going to need to morph the word before inserting it into the "poem".
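
To make step 2 concrete, here is a minimal sketch of a context filter in plain Python. It is only an illustration: the coarse tag names, the hand-written `ALLOWED_NEXT` table, and the `filter_candidates` helper are all hypothetical, and a real version would presumably learn the allowed tag sequences from a tagged corpus rather than hard-code them.

```python
# Toy illustration of step 2: keep only the candidate forms whose part of
# speech can plausibly follow the tag of the previous word.
# ALLOWED_NEXT is a hypothetical, hand-written table, not learned from data.
ALLOWED_NEXT = {
    "DET": {"NOUN", "ADJ"},
    "ADJ": {"NOUN", "ADJ"},
    "NOUN": {"VERB", "ADV", "CONJ"},
    "VERB": {"DET", "NOUN", "ADV", "ADJ"},
    "ADV": {"VERB", "ADJ", "ADV"},
}


def filter_candidates(prev_tag, candidates):
    """Return the (form, tag) pairs whose tag may follow prev_tag."""
    allowed = ALLOWED_NEXT.get(prev_tag, set())
    return [(form, tag) for form, tag in candidates if tag in allowed]


# Each root word expanded by hand into a few surface forms with assumed tags.
candidates = [
    ("departure", "NOUN"), ("departed", "VERB"),
    ("have", "VERB"), ("had", "VERB"),
    ("blue", "ADJ"),
    ("quick", "ADJ"), ("quickly", "ADV"), ("quicken", "VERB"),
]

# 'The nymphs' ends in a noun, so filter on the tag "NOUN".
survivors = filter_candidates("NOUN", candidates)
print([form for form, _ in survivors])
```

With this table, 'blue', 'quick' and 'departure' are dropped after 'The nymphs', while 'departed', 'have', 'had', 'quickly' and 'quicken' survive, which matches the intuitions described above.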

I'm looking for advice on the soundness of this routine, as well as suggestions for steps to add. Ways to break down these steps further would also be helpful. Finally I'm looking for suggestions on what tool might best accomplish each task.

I've tried to be as concise as possible, while providing enough information. Please don't hesitate to ask me for clarification! I'll appreciate any information I get, and I'll accept the clearest / most illuminating answer :) Thanks!

Recommended answer

I think that the comment above about n-gram language models fits your requirements better than parsing and tagging. Parsers and taggers (unless modified) will suffer from the lack of right context for the target word (i.e., you don't have the rest of the sentence available at query time). On the other hand, language models consider the past (left context) efficiently, especially for windows of up to 5 words. The problem with n-grams is that they don't model long-distance dependencies (more than n words).

NLTK has a language model: http://nltk.googlecode.com/svn/trunk/doc/api/nltk.model.ngram-pysrc.html . A tag lexicon may help you smooth the model more.

The steps as I see them: 1. Get a set of words from the users. 2. Create a larger set of all possible inflections of the words. 3. Ask the model which inflected word is most probable.
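
As a rough sketch of step 3, the snippet below ranks candidate inflections with a tiny bigram model written in plain Python. It does not use NLTK's model. The `CORPUS` sentences, the helper names, and the add-one smoothing are all assumptions made for illustration, and a usable model would need far more training text and probably a longer window.

```python
from collections import Counter

# Tiny hypothetical training corpus; a real model needs far more text.
CORPUS = [
    "the nymphs have departed",
    "the nymphs departed",
    "the river has departed",
    "the nymphs have gone",
    "the river bears no bottles",
]


def train_bigrams(sentences):
    """Count unigrams and bigrams over sentences padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams):
    """Estimate P(word | prev) with add-one (Laplace) smoothing.

    Good enough for ranking candidates; words outside the training
    vocabulary simply receive the smallest score.
    """
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def best_inflection(context, forms, unigrams, bigrams):
    """Pick the form that the model finds most probable after the context."""
    prev = context[-1].lower()
    return max(forms, key=lambda form: bigram_prob(prev, form, unigrams, bigrams))


unigrams, bigrams = train_bigrams(CORPUS)
forms = ["have", "has", "had", "having"]  # step 2: inflections of 'to have'
print(best_inflection(["The", "nymphs"], forms, unigrams, bigrams))  # prints: have
```

Because 'nymphs have' occurs in the toy corpus and 'nymphs has' does not, the model prefers 'have', which is the agreement behaviour the question asks for. It looks only one word back, so it shares the long-distance limitation mentioned above.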
