Can someone give a simple explanation of the elements of Natural Language Processing?


Question

I'm new to Natural Language Processing and I'm confused about the terms used.

What is tokenization? POS tagging? Entity identification?

Does tokenization only split the text into parts that can carry meaning, or does it also assign a meaning to those parts? And what is it called when I determine that something is a noun, verb, or adjective? What about when I want to sort things into dates, names, and currency?

I need a simple explanation of the areas/terms used in NLP.

Recommended answer

To add to dmn's explanation:

In general, there are two themes you should care about in NLP:

  1. Statistical vs Rule-Based Analysis
  2. Lightweight vs Heavyweight Analysis

Statistical analysis uses statistical machine learning techniques to classify text and in general has good precision and good recall. Rule-based analysis techniques basically use hand-built rules and have very good precision but terrible recall (basically, they identify the cases covered by your rules, but nothing else).
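
To make the precision/recall point concrete, here is a minimal sketch of a rule-based extractor scored against hand-labeled examples. The regular expression, the sample sentences, and the gold labels are all made up for illustration; they are not from any real dataset.

```python
import re

# Hypothetical hand-built rule: only dates written as YYYY-MM-DD.
DATE_RULE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

# Toy sentences with hand-assigned gold labels (invented for this sketch).
SAMPLES = [
    ("The meeting is on 2024-03-15.", {"2024-03-15"}),
    ("She was born on March 3, 1990.", {"March 3, 1990"}),
    ("Invoice dated 12/01/2023 is overdue.", {"12/01/2023"}),
    ("Release 2023-11-02 fixed the bug.", {"2023-11-02"}),
    ("No dates in this sentence.", set()),
]

def precision_recall(samples):
    """Score the rule: precision = correct / found, recall = correct / expected."""
    true_pos = predicted = actual = 0
    for text, gold in samples:
        found = set(DATE_RULE.findall(text))
        true_pos += len(found & gold)
        predicted += len(found)
        actual += len(gold)
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall

precision, recall = precision_recall(SAMPLES)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

On this toy data everything the rule finds is correct, but it misses every date written in a format the rule does not cover, which is the pattern described above.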

Lightweight vs heavyweight analysis are the two approaches you'll see in the field. In general, academic work is heavyweight, featuring parsers, fancy classifiers, and lots of very high-tech NLP. In industry, by and large, the focus is on data: a lot of the academic work scales poorly, and going beyond standard statistical or machine learning techniques doesn't gain you much. For example, parsing is largely useless (and slow), whereas keyword and n-gram analysis is actually pretty useful, especially when you have a lot of data. Google Translate, for instance, apparently isn't that fancy behind the scenes; they just have so much data that they can crush everybody else, no matter how refined the competing translation software is.

The upshot is that in industry there's a lot of machine learning and math, but the NLP that is used is not very sophisticated, because the sophisticated stuff really doesn't work well. Far preferred is using user data, like clicks on related subjects, and Mechanical Turk. This works very well, as people are far better at understanding natural language than computers.

Parsing breaks a sentence down into phrases (verb phrase, noun phrase, prepositional phrase, etc.) and produces a grammatical tree. You can use the online version of the Stanford Parser to play with examples and get a feel for what a parser does. For example, let's say we have the sentence

My cat's name is Pat.

Then we do POS tagging:

My/PRP$ cat/NN 's/POS name/NN is/VBZ Pat/NNP ./.
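
As a rough sketch of how you might work with tagger output in this word/TAG form, the snippet below splits the line above into (word, tag) pairs and picks out the nouns. It only reads the printed format; it does not do any tagging itself, and the function name is just an illustrative choice.

```python
tagged = "My/PRP$ cat/NN 's/POS name/NN is/VBZ Pat/NNP ./."

def split_tagged(text):
    """Split 'word/TAG' items into (word, tag) pairs."""
    pairs = []
    for item in text.split():
        # Split on the last slash so a token such as "./." still works.
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs

pairs = split_tagged(tagged)

# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS).
nouns = [word for word, tag in pairs if tag.startswith("NN")]

print(pairs)
print(nouns)
```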

Using the POS tags and a trained statistical parser, we get a parse tree:

(ROOT
  (S
    (NP
      (NP (PRP$ My) (NN cat) (POS 's))
      (NN name))
    (VP (VBZ is)
      (NP (NNP Pat)))
    (. .)))
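
If you want to process a bracketed tree like this one yourself, a small reader is enough. The sketch below is one possible way to do it, not how the Stanford Parser represents trees internally: it turns the text into nested Python lists of the form [label, child, ...] and then reads the words back off the leaves.

```python
import re

TREE = """(ROOT
  (S
    (NP
      (NP (PRP$ My) (NN cat) (POS 's))
      (NN name))
    (VP (VBZ is)
      (NP (NNP Pat)))
    (. .)))"""

def read_tree(text):
    """Parse a bracketed tree into nested lists: [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse():
        nonlocal pos
        pos += 1                  # skip "("
        node = [tokens[pos]]      # the label follows the bracket
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse())
            else:
                node.append(tokens[pos])  # a leaf word
                pos += 1
        pos += 1                  # skip ")"
        return node

    return parse()

def leaves(node):
    """Collect the words at the bottom of the tree, left to right."""
    words = []
    for child in node[1:]:
        words.extend(leaves(child) if isinstance(child, list) else [child])
    return words

tree = read_tree(TREE)
print(leaves(tree))
```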

We can also do a slightly different type of parse called a dependency parse:

poss(cat-2, My-1)
poss(name-4, cat-2)
possessive(cat-2, 's-3)
nsubj(Pat-6, name-4)
cop(Pat-6, is-5)
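
Each line above reads relation(head-position, dependent-position). As a small sketch, assuming the output always has exactly this printed shape, you can turn it into (relation, head, dependent) triples and ask which words hang off a given head:

```python
import re

DEPS = """poss(cat-2, My-1)
poss(name-4, cat-2)
possessive(cat-2, 's-3)
nsubj(Pat-6, name-4)
cop(Pat-6, is-5)"""

# relation(head-index, dependent-index)
LINE = re.compile(r"(\w+)\((\S+)-(\d+), (\S+)-(\d+)\)")

def read_dependencies(text):
    """Return (relation, head, dependent) triples from the printed format."""
    triples = []
    for line in text.splitlines():
        rel, head, _, dep, _ = LINE.fullmatch(line.strip()).groups()
        triples.append((rel, head, dep))
    return triples

triples = read_dependencies(DEPS)

# Words that depend directly on "Pat", the head of the sentence here.
dependents_of_pat = [dep for rel, head, dep in triples if head == "Pat"]

print(triples)
print(dependents_of_pat)
```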

N-grams are basically sequences of n adjacent words. You can look at n-grams in Google's data. You can also do character n-grams, which are used heavily for spelling correction.
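
A minimal sketch of n-gram extraction, using plain whitespace splitting as a stand-in for a real tokenizer. The same helper works for words and for characters:

```python
def ngrams(items, n):
    """Return every run of n adjacent items as a tuple."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

# Word bigrams (whitespace splitting is a simplification).
words = "my cat's name is Pat".split()
word_bigrams = ngrams(words, 2)

# Character trigrams of a single word.
char_trigrams = ["".join(g) for g in ngrams("name", 3)]

print(word_bigrams)
print(char_trigrams)
```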

Sentiment analysis is analyzing text to extract how people feel about something, or in what light things (such as brands) are mentioned. This involves a lot of looking at words that denote emotion.
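
The simplest version of "looking at words that denote emotion" is a word list with scores. The sketch below uses a tiny, made-up lexicon purely for illustration; real sentiment lexicons are far larger, and this is only one of many ways to do sentiment analysis.

```python
# Hypothetical mini-lexicon (invented for this sketch).
LEXICON = {
    "love": 1, "great": 1, "happy": 1,
    "hate": -1, "terrible": -1, "slow": -1,
}

def sentiment_score(text):
    """Sum the lexicon scores of the words in the text."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(LEXICON.get(w, 0) for w in words)

positive = sentiment_score("I love this phone, the camera is great!")
negative = sentiment_score("Terrible battery and slow updates.")
neutral = sentiment_score("The phone arrived on Tuesday.")

print(positive, negative, neutral)
```

A word list like this ignores context: "not great" still scores as positive, which is one reason sentiment work in practice goes beyond counting words.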

Semantic analysis is analyzing the meaning of text. Often this takes the form of taxonomies and ontologies, where you group concepts together (dog and cat both belong to animal and pet), but it is a very undeveloped field. Resources like WordNet and FrameNet are useful here.
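
To illustrate the idea of grouping concepts, here is a toy taxonomy as a plain dictionary. The entries are invented for this sketch and are not taken from WordNet or FrameNet, which are far richer.

```python
# Hypothetical toy taxonomy: each concept maps to its broader concepts.
BROADER = {
    "dog": ["animal", "pet"],
    "cat": ["animal", "pet"],
    "animal": ["living thing"],
}

def ancestors(concept):
    """Return every broader concept reachable from the given one."""
    seen = []
    queue = list(BROADER.get(concept, []))
    while queue:
        current = queue.pop(0)
        if current not in seen:
            seen.append(current)
            queue.extend(BROADER.get(current, []))
    return seen

def shared_categories(a, b):
    """Broader concepts that two concepts have in common."""
    return sorted(set(ancestors(a)) & set(ancestors(b)))

print(ancestors("dog"))
print(shared_categories("dog", "cat"))
```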
