How to parse sentences based on lexical content (phrases) with Python-NLTK


Question

Can Python-NLTK recognize an input string and parse it based not only on white space but also on its content? Say, "computer system" becomes a phrase in this situation. Can anyone provide sample code?

Input string: "A survey of user opinion of computer system response time"

Expected output: ["A", "survey", "of", "user", "opinion", "of", "computer system", "response", "time"]

Answer

The technology you're looking for is known by multiple names, from multiple sub-fields or sub-sub-fields of linguistics and computing:

  • Keyphrase Extraction
    • From Information Retrieval, mainly used for improving indexing/querying for search
    • Read this recent survey paper: http://www.hlt.utdallas.edu/~saidul/acl14.pdf
    • (I personally) strongly recommend: https://code.google.com/p/jatetoolkit/ and of course the famous https://code.google.com/p/kea-algorithm/ (from the people who brought you WEKA, http://www.cs.waikato.ac.nz/ml/weka/)
    • For python, possibly https://github.com/aneesha/RAKE (a minimal sketch follows this list)
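For example, here is a minimal sketch of RAKE on the OP's sentence. This uses the pip-installable rake-nltk re-implementation (an assumption on my part, as the repo linked above exposes a slightly different API), and the exact ranking may vary:

    >>> from rake_nltk import Rake  # pip install rake-nltk; also needs NLTK's stopwords and punkt data
    >>> r = Rake()  # defaults to NLTK's English stopwords
    >>> r.extract_keywords_from_text("A survey of user opinion of computer system response time")
    >>> r.get_ranked_phrases()  # stopwords 'a' and 'of' delimit the candidate phrases
    ['computer system response time', 'user opinion', 'survey']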
  • Chunking
    • From Natural Language Processing, it's also called shallow parsing
    • Read Steve Abney's work on how it came about: http://www.vinartus.net/spa/90e.pdf
    • Major NLP frameworks and toolkits should have it (e.g. OpenNLP, GATE, NLTK; do note that NLTK's default chunker only works for named entities)
    • Stanford NLP has one too: http://nlp.stanford.edu/projects/shallow-parsing.shtml

I'll give an example of the NE chunker in NLTK:

          >>> from nltk import word_tokenize, ne_chunk, pos_tag
          >>> sent = "A survey of user opinion of computer system response time"
          >>> chunked = ne_chunk(pos_tag(word_tokenize(sent)))
          >>> for i in chunked:
          ...     print(i)
          ... 
          ('A', 'DT')
          ('survey', 'NN')
          ('of', 'IN')
          ('user', 'NN')
          ('opinion', 'NN')
          ('of', 'IN')
          ('computer', 'NN')
          ('system', 'NN')
          ('response', 'NN')
          ('time', 'NN')
          

And with named entities:

          >>> sent2 = "Barack Obama meets Michael Jackson in Nihonbashi"
          >>> chunked = ne_chunk(pos_tag(word_tokenize(sent2)))
          >>> for i in chunked:
          ...     print(i)
          ... 
          (PERSON Barack/NNP)
          (ORGANIZATION Obama/NNP)
          ('meets', 'NNS')
          (PERSON Michael/NNP Jackson/NNP)
          ('in', 'IN')
          (GPE Nihonbashi/NNP)
          

You can see it's pretty much flawed (Obama gets tagged as an ORGANIZATION); better something than nothing, I guess.
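If you need phrase chunks rather than named entities, NLTK also lets you build your own chunker from a hand-written tag grammar with RegexpParser. A minimal sketch (the grammar below is my assumption, simply treating any run of nouns as an NP):

    >>> from nltk import word_tokenize, pos_tag, RegexpParser
    >>> # Hand-written grammar: one or more consecutive noun tags form an NP chunk.
    >>> chunker = RegexpParser("NP: {<NN.*>+}")
    >>> sent = "A survey of user opinion of computer system response time"
    >>> tree = chunker.parse(pos_tag(word_tokenize(sent)))
    >>> for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
    ...     print(subtree)
    ... 
    (NP survey/NN)
    (NP user/NN opinion/NN)
    (NP computer/NN system/NN response/NN time/NN)

The root node (labelled S) is also a tree, so the filter keeps only the NP chunks.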

  • Multi-Word Expression extraction
    • Hot topic in NLP, everyone wants to extract them for one reason or another
    • Most notable work by Ivan Sag: http://lingo.stanford.edu/pubs/WP-2001-03.pdf and a miasma of all sorts of extraction algorithms and extracted usage from ACL papers
    • As mysterious as MWEs are (we don't know how to classify them automatically or extract them properly), there are no proper tools for this; strangely, the output that MWE researchers want can often be obtained with Keyphrase Extraction or chunking... (a minimal sketch follows this list)
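If the MWEs are already known (say, from a hand-built lexicon), NLTK's MWETokenizer can re-merge them at tokenization time. A minimal sketch, with ('computer', 'system') hard-coded as the only known MWE, which produces exactly the output the OP asked for:

    >>> from nltk.tokenize import MWETokenizer
    >>> # The MWE lexicon is hard-coded here for illustration.
    >>> tokenizer = MWETokenizer([('computer', 'system')], separator=' ')
    >>> tokenizer.tokenize("A survey of user opinion of computer system response time".split())
    ['A', 'survey', 'of', 'user', 'opinion', 'of', 'computer system', 'response', 'time']

Of course this just pushes the problem into building the MWE lexicon, which is what the extraction research above is about.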
  • Terminology extraction
    • This comes from translation studies: they want translators to use the correct technical term when translating a document.
    • Note that terminology comes with a cornucopia of ISO standards, which should be followed because the convoluted translation industry generates billions in revenue.
    • Linguistically, I've no idea what differentiates these tools from keyphrase extractors: same algorithms, different interface... I guess the only thing about some term extractors is the ability to work bilingually and produce a dictionary automatically.

    Here are some tools:

    • https://github.com/srijiths/jtopia
    • http://fivefilters.org/term-extraction/
    • https://github.com/turian/topia.termextract
    • https://www.airpair.com/nlp/keyword-extraction-tutorial
    • http://termcoord.wordpress.com/about/testing-of-term-extraction-tools/free-term-extractors/
    Note on tools: there's still no one tool that stands out for term extraction. And because of the big money involved, it's always some API calls and most code is "semi-open"... mostly closed. Then again, SEO is also big money; possibly it's just a culture thing in the translation industry to be super secretive.

Now back to OP's question.

Q: Can NLTK extract "computer system" as a phrase?

A: Not really.

As shown above, NLTK has a pre-trained chunker, but it works on named entities, and even so, not all named entities are well recognized.

Possibly the OP could try out a more radical idea: let's assume that a sequence of nouns always forms a phrase:

    >>> from nltk import word_tokenize, pos_tag
    >>> sent = "A survey of user opinion of computer system response time"
    >>> tagged = pos_tag(word_tokenize(sent))
    >>> chunks = []
    >>> current_chunk = []
    >>> for word, pos in tagged:
    ...     if pos.startswith('N'):
    ...             current_chunk.append((word, pos))
    ...     else:
    ...             if current_chunk:
    ...                     chunks.append(current_chunk)
    ...             current_chunk = []
    ... 
    >>> if current_chunk:  # flush a chunk that runs to the end of the sentence
    ...     chunks.append(current_chunk)
    ... 
    >>> for chunk in chunks:
    ...     print(chunk)
    ... 
    [('survey', 'NN')]
    [('user', 'NN'), ('opinion', 'NN')]
    [('computer', 'NN'), ('system', 'NN'), ('response', 'NN'), ('time', 'NN')]

So even with that solution, it seems like trying to get 'computer system' alone is hard. But if you think about it for a bit, 'computer system response time' seems like a more valid phrase than 'computer system'.

Note that all of these interpretations of 'computer system response time' seem valid:

  • [computer system response time]
  • [computer [system [response [time]]]]
  • [computer system] [response time]
  • [computer [system response time]]

And many, many more possible interpretations. So you've got to ask what you are using the extracted phrases for, and then see how to proceed with cutting long phrases like 'computer system response time'.
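For instance, one simple way (my assumption, not part of the original answer) to enumerate the candidate sub-phrases of a long chunk is to take all of its contiguous n-grams with nltk.ngrams, then score or filter them with whatever criterion your task needs:

    >>> from nltk import ngrams
    >>> phrase = "computer system response time".split()
    >>> # Every contiguous sub-span of the chunk is a candidate phrase.
    >>> for n in range(2, len(phrase) + 1):
    ...     for gram in ngrams(phrase, n):
    ...             print(' '.join(gram))
    ... 
    computer system
    system response
    response time
    computer system response
    system response time
    computer system response time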

