Read an MSWord file into R


Question

Is it possible to read an MSWord 2010 file into R? I have Windows 7 and a Dell PC.

I am using this line:

my.data <- readLines('c:/users/mark w miller/simple R programs/test_for_r.docx')

to try to read an MSWord file containing the following text:

A   20  1000    AA
B   30  1001    BB
C   10  1500    CC

I get a warning message that says:

Warning message:
In readLines("c:/users/mark w miller/simple R programs/test_for_r.docx") :
  incomplete final line found on 'c:/users/mark w miller/simple R programs/test_for_r.docx'

and my.data appears to be gibberish:

# [1] "PK030424" "¤l"             "ÈFÃË‹Átí"

I know that with this simple example I could easily convert the MSWord file to a different format. However, my actual data files consist of complex tables that were typed decades ago and then scanned into pdf documents later. The age of the original paper documents, and perhaps imperfections in the original paper, typing and/or scanning process, have resulted in some letters and numbers not being very clear. So far, converting the pdf files to MSWord seems to be the most successful way of correctly translating the tables. Converting the MSWord files to Excel or rich text, etc., has not been very successful. Even after conversion to MSWord the resulting files are very complex and contain numerous errors. I thought that if I could read the MSWord files into R, that might be the most efficient way to edit and correct them.

I am aware of the tm package, which I guess can read MSWord files into R, but I am a little concerned about using it because it seems to require installing third-party software.

Thank you for any suggestions.

Recommended answer

First, readLines() is not the correct solution, since a Word file is not a text file (that is, a plain ASCII text file).
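The "PK" at the start of the gibberish above is the signature of a ZIP archive, which is how .docx files are packaged. As a minimal sketch of how one might check for that before reaching for readLines(), the helper below reads the first four bytes of a file. The function name looks_like_zip and the two temporary files are illustrative assumptions; the files are stand-ins, not real Word documents.

```r
# Return TRUE when the file starts with the ZIP signature "PK\003\004".
# looks_like_zip is a hypothetical helper name, not part of any package.
looks_like_zip <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  header <- readBin(con, what = "raw", n = 4L)
  identical(header, as.raw(c(0x50, 0x4b, 0x03, 0x04)))
}

# Two throwaway files for demonstration (stand-ins, not real documents):
# one that merely begins with the ZIP signature, and one plain text file.
zip_like <- tempfile(fileext = ".docx")
writeBin(as.raw(c(0x50, 0x4b, 0x03, 0x04, 0x14, 0x00)), zip_like)

plain <- tempfile(fileext = ".txt")
writeLines("A   20  1000    AA", plain)

looks_like_zip(zip_like)
looks_like_zip(plain)
```

A file that passes this check is binary as far as readLines() is concerned, so it needs a different reader or a conversion step first.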

The Word-related function in the tm package is called readDOC(), but both it and the required third-party tool (Antiword) are for older Word files (up to Word 2003) and will not work with newer .docx files.

The best I can suggest is that you try readPDF(), also found in the tm package. Note: it requires that the tool pdftotext is installed on your system. Easy for Linux, no idea about Windows. Alternatively, find a Windows tool which converts PDF to plain ASCII text files (not Word files), which should open and display correctly using Notepad on Windows, and then try readLines() again. However, given that your PDF files are old and come from a scanner, conversion to text might be difficult.
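Once a plain text file exists, the readLines() step could look roughly like the sketch below. It assumes the conversion produced whitespace-separated columns like the sample in the question. The temporary file stands in for the converter's output, and the column names are made up for illustration.

```r
# Stand-in for a text file produced by a PDF-to-text converter.
txt_file <- tempfile(fileext = ".txt")
writeLines(c("A   20  1000    AA",
             "B   30  1001    BB",
             "C   10  1500    CC"), txt_file)

# Plain text reads cleanly, one element per line.
raw_lines <- readLines(txt_file)

# Split each line on runs of whitespace (the read.table default).
# The column names here are hypothetical.
my.data <- read.table(text = raw_lines, header = FALSE,
                      col.names = c("id", "count", "value", "code"),
                      stringsAsFactors = FALSE)
str(my.data)
```

Keeping the raw lines around as a character vector also makes it possible to correct scanning errors by hand (for example with sub() or gsub()) before parsing them into a table.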

Finally: I realise that you did not make the original decision in this instance, but for anybody else: Word and PDF are not appropriate formats for storing data that you want to parse.
