What are co-occurrence matrices and how are they used in NLP?

Question

The PyPI docs for a Google ngram downloader say that "sometimes you need an aggregate data over the dataset. For example to build a co-occurrence matrix."

The Wikipedia article on co-occurrence matrices is about image processing, and googling the term mostly turns up what looks like an SEO trick.

So what are co-occurrence matrices (in computational linguistics/NLP)? How are they used in NLP?

Recommended answer

What is a co-occurrence matrix?

Generally speaking, a co-occurrence matrix has specific entities in its rows (ER) and columns (EC). The purpose of the matrix is to present the number of times each ER appears in the same context as each EC. Consequently, to use a co-occurrence matrix, you have to define your entities and the context in which they co-occur.

In NLP, the most classic approach is to define each entity (i.e., each row and column) as a word present in a text, and the context as a sentence.

Consider the following text:

Roses are red. Sky is blue.

With the classic approach described above, we get the following matrix:

      |  Roses | are | red | Sky | is | blue
Roses |    1   |  1  |  1  |  0  |  0 |   0
are   |    1   |  1  |  1  |  0  |  0 |   0
red   |    1   |  1  |  1  |  0  |  0 |   0
Sky   |    0   |  0  |  0  |  1  |  1 |   1
is    |    0   |  0  |  0  |  1  |  1 |   1
blue  |    0   |  0  |  0  |  1  |  1 |   1

Here, each cell indicates whether the two items co-occur or not. You may replace this indicator with the number of times the pair co-occurs, or with a more sophisticated measure. You may also change the entities themselves, for example by putting nouns in columns and adjectives in rows instead of using every word.
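As an illustration, here is a minimal Python sketch of the sentence-level approach. The function name `cooccurrence_matrix`, the naive sentence split on `.`, `!` and `?`, and the whitespace tokenization are all assumptions made for this example, not part of any particular library. With `binary=False`, each cell holds the number of sentences in which both words appear.

```python
import re
from itertools import product


def cooccurrence_matrix(text, binary=True):
    """Build a word-by-word co-occurrence matrix with sentences as context.

    Hypothetical helper for illustration: sentences are split naively on
    '.', '!' and '?', and words are split on whitespace.
    """
    sentences = [s for s in re.split(r"[.!?]", text) if s.strip()]
    tokenized = [s.split() for s in sentences]

    # Vocabulary in order of first appearance, so rows match the text order.
    vocab = []
    for tokens in tokenized:
        for tok in tokens:
            if tok not in vocab:
                vocab.append(tok)
    index = {word: i for i, word in enumerate(vocab)}

    matrix = [[0] * len(vocab) for _ in vocab]
    for tokens in tokenized:
        unique = set(tokens)  # count each word once per sentence
        for a, b in product(unique, repeat=2):
            if binary:
                matrix[index[a]][index[b]] = 1   # co-occur or not
            else:
                matrix[index[a]][index[b]] += 1  # number of shared sentences
    return vocab, matrix


vocab, matrix = cooccurrence_matrix("Roses are red. Sky is blue.")
for word, row in zip(vocab, matrix):
    print(f"{word:<6}", row)
```

Note that this sketch, like the matrix above, treats a word as co-occurring with itself, so the diagonal is non-zero.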

The most evident use of these matrices is their ability to provide links between notions. Suppose you're working on product reviews, and suppose for simplicity that each review is composed only of short sentences. You'll have something like this:

ProductX is amazing.

I hate productY.

Representing these reviews as one co-occurrence matrix will enable you to associate products with appreciations.
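A rough Python sketch of that idea, with products as rows and opinion words as columns, could look like the following. The `PRODUCTS` and `OPINIONS` sets and the `product_opinion_counts` helper are hypothetical names chosen for this example. It also assumes each review is one context and that lowercasing and stripping punctuation is enough to match words.

```python
from collections import Counter

# Hypothetical entity lists: in practice these would come from a product
# catalogue and an opinion lexicon or a part-of-speech tagger.
PRODUCTS = {"productx", "producty"}
OPINIONS = {"amazing", "hate"}


def product_opinion_counts(reviews):
    """Count how often each product shares a review with each opinion word."""
    counts = Counter()
    for review in reviews:
        # Naive normalization: lowercase and strip surrounding punctuation.
        tokens = {tok.strip(".,!?").lower() for tok in review.split()}
        for prod in tokens & PRODUCTS:
            for opinion in tokens & OPINIONS:
                counts[(prod, opinion)] += 1
    return counts


reviews = ["ProductX is amazing.", "I hate productY."]
counts = product_opinion_counts(reviews)
for (prod, opinion), n in sorted(counts.items()):
    print(prod, opinion, n)
```

Storing only the non-zero cells in a `Counter` keeps the sketch short; pairs that never co-occur simply read as 0.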
