R - iterate over pages in PDF


Question

I have a series of PDF files that contain various tables of data. I am only looking for one specific table in each file, and my goal is to find which page it is on in each file.

My planned approach is to iterate over each page, read its text, and determine whether it is the page I'm looking for. If it is, return that page number; otherwise continue to the next page. I've been looking into pdftools, but it doesn't look like there is a way to loop through the pages.

Does anyone know of an R package that will help me achieve this, or is there a better way to do this with pdftools?

Any help would be greatly appreciated!

Recommended answer

I think pdftools can extract the text page by page, giving you one string per page. So the code may look like this:

library(pdftools)
txt <- pdf_text("something.pdf")

Now:

# first page text
txt[1]
# second page text, and so on for later pages
txt[2]
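Because pdf_text() returns one string per page, looping over the pages amounts to looping over that character vector. The following is a minimal sketch of that idea, not a definitive implementation: the vector below is a hypothetical stand-in for the result of pdf_text(), and the table title is made up for illustration.

```r
# Stand-in for txt <- pdftools::pdf_text("something.pdf");
# the page contents here are hypothetical.
txt <- c("Introduction", "Table 3: Annual revenue\n2019  100", "Appendix")

target <- "Table 3: Annual revenue"  # hypothetical table title
page_number <- NA_integer_           # stays NA if no page matches

for (i in seq_along(txt)) {
  # fixed = TRUE matches the title literally rather than as a regex
  if (grepl(target, txt[i], fixed = TRUE)) {
    page_number <- i
    break  # stop at the first matching page
  }
}

page_number
```

With a real file, replacing the stand-in vector with the output of pdf_text() should be the only change needed, assuming the table title appears as extractable text on the page.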

To extract the words from each string, use strsplit() to build a vector of words for each page. Then search page by page and, within each page, word by word. Once a word matches the one you are looking for, record the index of the outer loop as the page number.
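The word-by-word search described above might be sketched as follows. The function name find_word_page and the sample pages are hypothetical, and %in% is used here in place of an explicit inner loop over the words.

```r
# Return the first page whose words include `word`, or NA if none does.
# `pages` is a character vector with one element per page.
find_word_page <- function(pages, word) {
  for (i in seq_along(pages)) {
    # split the page text on runs of whitespace to get its words
    words <- strsplit(pages[i], "\\s+")[[1]]
    if (word %in% words) {
      return(i)  # the outer loop index is the page number
    }
  }
  NA_integer_
}

# Hypothetical page texts standing in for pdf_text() output
pages <- c("Summary of results", "Revenue by region 2019", "Appendix A")
find_word_page(pages, "Revenue")
```

Note that this matches whole whitespace-separated tokens exactly, so a word with attached punctuation such as "Revenue:" would not match "Revenue". For a multi-word table title, matching the whole page string with grepl() is likely simpler.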

Let me know if this helps.
