Reading data from PDF files into R

Question

Is that even possible!?!

I have a bunch of legacy reports that I need to import into a database. However, they're all in PDF format. Are there any R packages that can read PDF files? Or should I leave that to a command-line tool?

The reports were made in Excel and then exported to PDF, so they have a regular structure but many blank "cells".

Recommended answer

Just a warning to others who may be hoping to extract data: PDF is a container, not a format. If the original document does not contain actual text, as opposed to bitmapped images of text or possibly even uglier things than I can imagine, nothing other than OCR can help you.
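One way to find out which kind of PDF you have is to try extracting text and see whether anything comes back. The following is a minimal sketch, not a definitive method: it assumes the `pdftools` package is installed (its `pdf_text()` function returns one string per page), `has_text_layer` is a hypothetical helper name, and `report.pdf` is a placeholder path.

```r
# Hypothetical helper: TRUE if any page yielded non-whitespace text.
# `pages` is a character vector with one element per page, which is
# the shape that pdftools::pdf_text() returns.
has_text_layer <- function(pages) {
  any(nzchar(trimws(pages)))
}

# "report.pdf" is a placeholder path. This part is skipped unless the
# pdftools package and the file are both available.
if (requireNamespace("pdftools", quietly = TRUE) && file.exists("report.pdf")) {
  pages <- pdftools::pdf_text("report.pdf")
  if (has_text_layer(pages)) {
    cat(pages[1])  # raw text of the first page
  } else {
    message("No text layer found; this file would need OCR.")
  }
}
```

If every page comes back empty, the file is most likely a set of scanned images, and only OCR will recover the data.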

On top of that, in my sad experience there's no guarantee that the applications which create PDF documents all behave the same, so the data in your table may or may not be read out in the desired order (as a result of the way the document was built). Be cautious.
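If the extracted words do come out in an unexpected order, one possible workaround is to rebuild the rows from word positions rather than trusting the stream order. The sketch below assumes a data frame with `x`, `y`, and `text` columns, which is the shape of each per-page element returned by `pdftools::pdf_data()`. The helper name `rows_from_words` and the sample words are made up for illustration, and the code assumes that words on the same row share exactly the same `y` value; real files may need `y` rounded or grouped with a tolerance first.

```r
# Hypothetical helper: rebuild rows of text from word positions.
# `words` is a data frame with columns x, y (word position) and text.
# Assumption: words sharing the same y value belong to the same row.
rows_from_words <- function(words) {
  # Sort top to bottom, then left to right.
  ordered <- words[order(words$y, words$x), ]
  # Group the sorted words by row and join each group with spaces.
  rows <- split(ordered$text, ordered$y)
  unname(vapply(rows, paste, character(1), collapse = " "))
}

# Made-up words, deliberately listed out of reading order.
words <- data.frame(
  x = c(120, 10, 10, 120),
  y = c(50, 20, 50, 20),
  text = c("17", "Region", "North", "Sales"),
  stringsAsFactors = FALSE
)
rows_from_words(words)
```

Because the `x` positions are kept until the final join, the same idea could be extended to assign words to columns by position, which may help with reports that have many blank cells.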

It's probably better to have a couple of grad students transcribe the data for you. They're cheap :-)
