How to properly encode UTF-8 txt files for R topic model


Question


Similar issues have been discussed on this forum (e.g. here and here), but I have not found one that solves my problem, so I apologize for a seemingly similar question.


I have a set of .txt files with UTF-8 encoding. I am trying to run a topic model in R using the tm package. However, despite using encoding = "UTF-8" when creating the corpus, I get obvious problems with encoding. For instance, I get <U+FB01>scal instead of fiscal and in<U+FB02>uence instead of influence, not all punctuation is removed, and some characters are unrecognizable (e.g. quotation marks remain in some cases, as in view” or plan’ or ændring, orphaned quotation marks like “ and ”, or zit, or years— with a dash that should have been removed). These terms also show up in the topic distribution over terms. I had problems with encoding before, but using encoding = "UTF-8" when creating the corpus used to solve them; it seems it does not help this time.


I am on Windows 10 x64, R version 3.6.0 (2019-04-26), with version 0.7-7 of the tm package (all up to date). I would greatly appreciate any advice on how to address the problem.
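For reference, <U+FB01> and <U+FB02> are the Unicode code points of the "fi" and "fl" ligatures that pdftotext carries over from the PDF. A quick way to see this in R (the string literal below is illustrative, not taken from the sample files):

x <- "\ufb01scal"                  # "fiscal" as extracted, starting with the ligature
stringi::stri_escape_unicode(x)    # "\\ufb01scal"
utf8ToInt(substr(x, 1, 1))         # 64257, i.e. 0xFB01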

library(tm)
library(beepr)
library(ggplot2)
library(topicmodels)
library(wordcloud)
library(reshape2)
library(dplyr)
library(tidytext)
library(scales)
library(ggthemes)
library(ggrepel)
library(tidyr)


inputdir<-"c:/txtfiles/"
docs<- VCorpus(DirSource(directory = inputdir, encoding ="UTF-8"))

#Preprocessing
docs <- tm_map(docs, content_transformer(tolower))

removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
docs <- tm_map(docs, content_transformer(removeURL))

toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "-")
docs <- tm_map(docs, toSpace, "\\.")
docs <- tm_map(docs, toSpace, "\\-")  # redundant: "-" above already replaces plain hyphens


docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, stemDocument)

dtm <- DocumentTermMatrix(docs)
freq <- colSums(as.matrix(dtm))
ord <- order(freq, decreasing=TRUE)
write.csv(freq[ord], file = "word_freq.csv")

#Topic model
# note: k and the Gibbs control parameters are not defined in the excerpt above;
# the values below are placeholders so the snippet runs end to end
k      <- 10
nstart <- 5
seed   <- list(1, 12, 123, 1234, 12345)  # one seed per restart
best   <- TRUE
burnin <- 4000
iter   <- 2000
thin   <- 500
ldaOut <- LDA(dtm, k, method = "Gibbs",
              control = list(nstart = nstart, seed = seed, best = best,
                             burnin = burnin, iter = iter, thin = thin))
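The ligature tokens can be confirmed directly in the vocabulary of the document-term matrix (a hypothetical diagnostic, not part of the original post):

# list every term that still contains an "fi" or "fl" ligature code point
grep("[\ufb01\ufb02]", Terms(dtm), value = TRUE)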


I should add, in case it turns out to be relevant, that the txt files were created from PDFs using the following R code:

inputdir <-"c:/pdf/"
myfiles <- list.files(path = inputdir, pattern = "pdf",  full.names = TRUE)
lapply(myfiles, function(i) system(paste('"C:/Users/Delt/AppData/Local/Programs/MiKTeX 2.9/miktex/bin/x64/pdftotext.exe"',
                                         paste0('"', i, '"')), wait = FALSE) )


Two sample txt files can be downloaded here.

Answer


I found a workaround that seems to work correctly on the two example files you supplied. The first thing you need to do is NFKD (compatibility decomposition). This splits the "fi" orthographic ligature into f and i. Luckily, the stringi package can handle this. So before doing all the special text cleaning, you need to apply the function stringi::stri_trans_nfkd. You can do this in the preprocessing step just after (or before) the tolower step.


Do read the documentation for this function and the references.
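As a quick illustration of what the workaround does (made-up strings; stri_trans_nfkd is the stringi function relied on below), NFKD decomposes the ligature code points back into plain letters:

library(stringi)
stri_trans_nfkd("\ufb01scal")     # "fiscal"    (U+FB01 -> "f" + "i")
stri_trans_nfkd("in\ufb02uence")  # "influence" (U+FB02 -> "f" + "l")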

library(tm)
docs <- VCorpus(DirSource(directory = inputdir, encoding = "UTF-8"))

#Preprocessing
docs <- tm_map(docs, content_transformer(tolower))

# use stringi to fix all the orthographic ligature issues
docs <- tm_map(docs, content_transformer(stringi::stri_trans_nfkd))

toSpace <- content_transformer(function(x, pattern) (gsub(pattern, " ", x)))

# add following line as well to remove special quotes. 
# this uses a replace from textclean to replace the weird quotes 
# which later get removed with removePunctuation
docs <- tm_map(docs, content_transformer(textclean::replace_curly_quote))

# ... rest of the preprocessing and topic-model steps as in the question ...
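And what the textclean step contributes (illustrative string): replace_curly_quote turns the curly quotes into straight ASCII quotes, which the later removePunctuation call can then strip.

textclean::replace_curly_quote("\u2018plan\u2019 and \u201cview\u201d")
# "'plan' and \"view\""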
