Set encoding for reading text files into tm Corpora

Views: 79 | Published: 2020/10/29 6:52:39 | Tags: text encoding text-mining tm corpus

Question

I am loading a set of documents with tm's Corpus and need to specify the encoding.

All documents are UTF-8 encoded. Opened in a text editor, the content looks fine, but the corpus contents are full of strange symbols (indicioâ., ‘sœs....). The source text is in Spanish (es_ES).

library(tm)
cname <- file.path("C:", "Users", "john", "Documents", "texts")
docs <- Corpus(DirSource(cname), encoding ="UTF-8")

> Error in Corpus(DirSource(cname), encoding = "UTF-8") : 
  unused argument (encoding = "UTF-8")

Edit:

Running str(documents[1]) on the corpus, I noticed:

.. ..$ language : chr "en"

How can I specify an encoding, for instance "UTF-8" or "Latin1", so that the strange symbols go away?

Regards

Recommended answer

From the "C:" in your path it is clear you are on Windows, which on most systems assumes a Windows-1252 encoding rather than UTF-8. You could try reading the files in as character vectors and then setting Encoding(myCharVector) <- "UTF-8". If the input really is UTF-8, this should make your system recognise and display the UTF-8 characters properly.
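The read-then-mark approach can be sketched in base R as follows. This is a minimal illustration, not a definitive recipe: the demo folder, file name and sample sentence are hypothetical stand-ins so that the snippet runs anywhere, and myCharVector is simply the name used above.

```r
# Hypothetical demo folder with one UTF-8 file, so the sketch runs anywhere.
demo_dir <- file.path(tempdir(), "utf8_demo")
dir.create(demo_dir, showWarnings = FALSE)
con <- file(file.path(demo_dir, "doc1.txt"), open = "wb")
writeLines("El indicio m\u00e1s claro", con, useBytes = TRUE)
close(con)

# Read each file into a single string (one element per document).
files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
myCharVector <- vapply(
  files,
  function(f) paste(readLines(f, warn = FALSE), collapse = "\n"),
  character(1)
)

# Declare the bytes to be UTF-8. This only marks the strings;
# it does not convert them, so the input must really be UTF-8.
Encoding(myCharVector) <- "UTF-8"

Encoding(myCharVector)
cat(myCharVector, sep = "\n")
```

Because Encoding<- only sets a mark, it helps only when the bytes on disk are already UTF-8. For files in another encoding, iconv() would be the tool to convert them. The marked vector can then be wrapped with VectorSource() to build a tm corpus.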

Alternatively, this will work, although it also makes tm unnecessary:

library(quanteda)
library(readtext)  # textfile() has since moved out of quanteda; readtext() replaces it
docs <- corpus(readtext("C:/Users/john/Documents/texts/*.txt", encoding = "UTF-8"))

You can then inspect the texts with, for example:

cat(as.character(docs)[1:2])  # texts(docs) on quanteda releases before 3.0

They should have the encoding bit set and display properly. If you prefer, you can then get them into tm using:

docsTM <- Corpus(VectorSource(as.character(docs)))  # texts(docs) on quanteda releases before 3.0
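It may also be possible to stay within tm. The error in the question arises because encoding was passed to Corpus(), whereas in tm it is DirSource() that accepts an encoding argument. The sketch below assumes tm is installed. The demo folder and sample file are hypothetical, and the readerControl language setting is an optional extra that addresses the "en" default seen in the str() output.

```r
library(tm)

# Hypothetical demo folder with one UTF-8 file, so the sketch runs anywhere.
demo_dir <- file.path(tempdir(), "tm_utf8_demo")
dir.create(demo_dir, showWarnings = FALSE)
con <- file(file.path(demo_dir, "doc1.txt"), open = "wb")
writeLines("El indicio m\u00e1s claro", con, useBytes = TRUE)
close(con)

# encoding belongs to DirSource(), not to Corpus()/VCorpus().
# The language is set separately through readerControl.
docs <- VCorpus(
  DirSource(demo_dir, encoding = "UTF-8"),
  readerControl = list(language = "es")
)

content(docs[[1]])           # text of the first document
meta(docs[[1]], "language")  # language recorded in the document metadata
```

With the asker's data, demo_dir would be replaced by the cname path from the question.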
