R: How do I keep punctuation with TermDocumentMatrix()?

Problem description

I have a large data frame in which I identify patterns in strings and then extract them. I have provided a small subset to illustrate my task. I generate my patterns by creating a TermDocumentMatrix with multi-word terms. I then use these patterns with stri_extract and str_replace from the stringi and stringr packages to search within the 'punct_prob' data frame.

My problem is that I need to keep the punctuation intact within 'punct_prob$description' to preserve the literal meaning of each string. For example, I can't have 2.35 mm becoming 235mm. However, the TermDocumentMatrix procedure I am using removes punctuation (or at least the periods), so my pattern-matching functions can't match the terms against the original strings.

In short: how do I keep the punctuation when generating the TDM? I have tried including removePunctuation=FALSE in the TermDocumentMatrix control argument, but with no success.

library(tm)
punct_prob = data.frame(description = tolower(c("CONTRA ANGLE HEAD 2:1 FOR 2.35mm BUR",
                                    "TITANIUM LINE MINI P.B F.O. TRIP SPRAY",
                                    "TITANIUM LINE POWER P. B F.O. TRIP SPR",
                                    "MEDESY SPECIAL ITEM")))

punct_prob$description = as.character(punct_prob$description)

# a control for the number of words in phrases
max_ngram = max(sapply(strsplit(punct_prob$description, " "), length))

#set up ngrams and tdm
BigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = max_ngram, max = max_ngram))}
punct_prob_corpus = Corpus(VectorSource(punct_prob$description))
punct_prob_tdm <- TermDocumentMatrix(punct_prob_corpus, control = list(tokenize = BigramTokenizer, removePunctuation=FALSE))
inspect(punct_prob_tdm)

Inspect results, with no punctuation:

                                   Docs
Terms                              1 2 3 4
  angle head 2 1 for 2 35mm bur    1 0 0 0
  contra angle head 2 1 for 2 35mm 1 0 0 0
  line mini p b f o trip spray     0 1 0 0
  line power p b f o trip spr      0 0 1 0
  titanium line mini p b f o trip  0 1 0 0
  titanium line power p b f o trip 0 0 1 0

Thanks in advance for any help :)

Recommended answer

The issue is not so much TermDocumentMatrix as the n-gram tokenizer based on RWeka. RWeka removes punctuation when tokenizing.
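If staying with RWeka is preferred, one possible workaround is to override the tokenizer's delimiter set so that periods and colons no longer split tokens. The sketch below assumes that Weka's NGramTokenizer accepts a `delimiters` option through `Weka_control()` and that its default delimiter set includes punctuation; both are worth checking with `RWeka::WOW("NGramTokenizer")`. The helper name is hypothetical, and RWeka needs a working Java installation.

```r
library(RWeka)

# Hypothetical helper: treat only whitespace as a delimiter, so that tokens
# such as "2.35mm" and "2:1" should stay in one piece (assumption: the
# installed Weka version honours the "delimiters" option).
WhitespaceBigramTokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2, delimiters = " \\r\\n\\t"))
}

bigrams <- WhitespaceBigramTokenizer("for 2.35mm bur")
print(bigrams)
```

If the printed bigrams still lack the period, the option is not being applied and the NLP-based approach is the safer route.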

If you use the NLP tokenizer instead, the punctuation is kept. See the code below.

P.S. I removed one space in your third text line, so that "P. B" becomes "P.B", as it is on line 2.
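As a rough illustration of what an NLP-based tokenizer does to a single string, the sketch below calls `ngrams()` from the NLP package directly. Splitting with `strsplit()` on spaces is an assumption standing in for `words()`, which works on tm documents rather than bare strings; the bigram size of 2 is chosen only to keep the output short.

```r
library(NLP)

x <- "contra angle head 2:1 for 2.35mm bur"

# Split on single spaces only; punctuation inside a token is left untouched.
tokens <- strsplit(x, " ", fixed = TRUE)[[1]]

# ngrams() returns a list of character vectors; paste each one back together.
bigrams <- vapply(ngrams(tokens, 2L), paste, character(1), collapse = " ")
print(bigrams)
```

Because nothing in this pipeline treats "." or ":" as a separator, terms such as "for 2.35mm" come out with their punctuation in place.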

library(tm)
punct_prob = data.frame(description = tolower(c("CONTRA ANGLE HEAD 2:1 FOR 2.35mm BUR",
                                                "TITANIUM LINE MINI P.B F.O. TRIP SPRAY",
                                                "TITANIUM LINE POWER P.B F.O. TRIP SPR",
                                                "MEDESY SPECIAL ITEM")))
punct_prob$description = as.character(punct_prob$description)

max_ngram = max(sapply(strsplit(punct_prob$description, " "), length))

punct_prob_corpus = Corpus(VectorSource(punct_prob$description))

# n-gram tokenizer built on ngrams() and words() (NLP/tm) instead of RWeka
NLPBigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), max_ngram), paste, collapse = " "), use.names = FALSE)
}


punct_prob_tdm <- TermDocumentMatrix(punct_prob_corpus, control = list(tokenize = NLPBigramTokenizer))
inspect(punct_prob_tdm)

<<TermDocumentMatrix (terms: 3, documents: 4)>>
Non-/sparse entries: 3/9
Sparsity           : 75%
Maximal term length: 38
Weighting          : term frequency (tf)

                                        Docs
Terms                                    1 2 3 4
  contra angle head 2:1 for 2.35mm bur   1 0 0 0
  titanium line mini p.b f.o. trip spray 0 1 0 0
  titanium line power p.b f.o. trip spr  0 0 1 0
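Once the terms keep their punctuation, it may matter how they are used as search patterns: in a regular expression, "." matches any character, so a term like "2.35mm" can match more than intended. The sketch below contrasts regex and fixed (literal) matching with stringi; the second test string is made up for illustration. With stringr, wrapping the pattern in `fixed()` should have a similar effect.

```r
library(stringi)

term  <- "2.35mm"
texts <- c("contra angle head 2:1 for 2.35mm bur",
           "head for 2x35mm bur")  # made-up string to show a false match

# Regex matching: "." matches any character, so "2x35mm" is picked up too.
regex_hits <- stri_extract_first_regex(texts, term)

# Fixed matching: the term is compared literally, so only "2.35mm" matches.
fixed_hits <- stri_extract_first_fixed(texts, term)

print(regex_hits)
print(fixed_hits)
```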
