R: constructing a document-term matrix that matches dictionaries whose values consist of white-space separated phrases

Problem Description

When doing text mining in R, after preprocessing the text data we need to create a document-term matrix for further exploration. But much like Chinese, English also has certain fixed phrases, such as "semantic distance" and "machine learning": segmented into single words, they mean something completely different. I want to know how to match pre-defined dictionaries whose values consist of white-space separated terms, for example a dictionary containing "semantic distance" and "machine learning". If a document is "we could use machine learning method to calculate the words semantic distance", then applying this document to the dictionary ["semantic distance", "machine learning"] should return a 1x2 matrix: [semantic distance, 1; machine learning, 1].

Recommended Answer

It's possible to do this with quanteda, although it requires the construction of a dictionary for each phrase, and then pre-processing the text to convert the phrases into tokens. To become a "token", the phrases need to be joined by something other than whitespace -- here, the "_" character.
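
As a rough aside, the same underscore-joining idea can be sketched in base R with gsub; the helper name and phrase list below are my own for illustration, not part of quanteda:

# Hypothetical helper: rewrite each known multi-word phrase so that a
# whitespace tokenizer keeps it as a single token.
join_phrases <- function(x, phrases) {
  for (p in phrases) {
    x <- gsub(p, gsub(" ", "_", p, fixed = TRUE), x, ignore.case = TRUE)
  }
  x
}

join_phrases("We could use machine learning to measure semantic distance.",
             c("machine learning", "semantic distance"))
## [1] "We could use machine_learning to measure semantic_distance."

Note that gsub here treats each phrase as a regular expression, so phrases containing regex metacharacters would need escaping.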

require(quanteda)
packageVersion("quanteda")
## [1] '0.9.5.19'

Here are some example texts, including the phrase in the OP. I added two additional texts for the illustration -- below, the first row of the document-feature matrix produces the requested answer.

txt <- c("We could use machine learning method to calculate the words semantic distance.",
         "Machine learning is the best sort of learning.",
         "The distance between semantic distance and machine learning is machine driven.")

The current signature for phrasetotoken requires the phrases argument to be a dictionary or a collocations object. Here we will make it a dictionary:

mydict <- dictionary(list(machine_learning = "machine learning", 
                          semantic_distance = "semantic distance"))

Then we pre-process the text to convert the dictionary phrases to their keys:

txtPhrases <- phrasetotoken(txt, mydict)
txtPhrases
## [1] "We could use machine_learning method to calculate the words semantic_distance."
## [2] "Machine_learning is the best sort of learning."                                
## [3] "The distance between semantic_distance and machine_learning is machine driven."

Finally, we can construct the document-feature matrix, keeping all phrases using the default "glob" pattern match for any feature that includes the underscore character:

mydfm <- dfm(txtPhrases, keptFeatures = "*_*")
## Creating a dfm from a character vector ...
##   ... lowercasing
##   ... tokenizing
##   ... indexing documents: 3 documents
##   ... indexing features: 20 feature types
##   ... kept 2 features, from 1 supplied (glob) feature types
##   ... created a 3 x 2 sparse dfm
##   ... complete. 
## Elapsed time: 0.012 seconds.

mydfm
## Document-feature matrix of: 3 documents, 2 features.
## 3 x 2 sparse Matrix of class "dfmSparse"
##        features
## docs    machine_learning semantic_distance
##   text1                1                 1
##   text2                1                 0
##   text3                1                 1

This is clunky, but as of quanteda 0.9.5.19 that's the simplest way. Once I have added multiple-token phrase entries to dictionary matching (soon!), this will become much easier.
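
For reference, this did become easier in later quanteda releases. The following is a sketch assuming quanteda 1.x or newer is installed, where tokens_compound() joins the phrases in a single step (these calls reflect the newer API, not the 0.9.5.19 one used above):

toks <- tokens(txt)
toks <- tokens_compound(toks, pattern = phrase(c("machine learning",
                                                 "semantic distance")))
dfm_select(dfm(toks), pattern = "*_*")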
