How to read PDF metadata from R
Question
Out of curiosity, is there a way to read PDF metadata, such as the information shown below, from R?
I could not find anything about this by searching for [r] pdf metadata in the current question base. Any pointers are very welcome!
Recommended answer
I can't think of a pure R way to do this, but you can probably install your favorite PDF command-line tool (for example, the PDF toolkit, PDFtk) and use that to get at least some of the data you are looking for.
The following is a basic example using PDFtk. It assumes that pdftk is accessible in your path.
x <- getwd() ## I'll run this example in a tempdir to keep things clean
setwd(tempdir())
list.files(pattern = "\\.(txt|pdf)$") ## `pattern` is a regular expression, not a glob
# character(0)
pdf(file = "SomeOutputFile.pdf")
plot(rnorm(100))
dev.off()
system("pdftk SomeOutputFile.pdf data_dump output SomeOutputFile.txt")
list.files(pattern = "\\.(txt|pdf)$")
# [1] "SomeOutputFile.pdf" "SomeOutputFile.txt"
readLines("SomeOutputFile.txt")
# [1] "InfoBegin" "InfoKey: Creator"
# [3] "InfoValue: R" "InfoBegin"
# [5] "InfoKey: Title" "InfoValue: R Graphics Output"
# [7] "InfoBegin" "InfoKey: Producer"
# [9] "InfoValue: R 3.0.1" "InfoBegin"
# [11] "InfoKey: ModDate" "InfoValue: D:20131102170720"
# [13] "InfoBegin" "InfoKey: CreationDate"
# [15] "InfoValue: D:20131102170720" "NumberOfPages: 1"
# [17] "PageMediaBegin" "PageMediaNumber: 1"
# [19] "PageMediaRotation: 0" "PageMediaRect: 0 0 504 504"
# [21] "PageMediaDimensions: 504 504"
setwd(x)
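A variation on the example above, as a rough sketch: when no `output` file is given, pdftk prints its report to standard output, so the text can be captured directly without an intermediate .txt file or a change of working directory. `dump_pdf_data` is a hypothetical helper name, not something from the original answer, and the sketch uses `dump_data`, the spelling of the operation given in the pdftk documentation.

```r
# Hypothetical helper (not part of the original answer): run
# `pdftk <file> dump_data` and return the report as a character vector,
# one element per line of output.
dump_pdf_data <- function(path, tool = "pdftk") {
  # Sys.which() returns "" when the executable is not found on the PATH
  if (!nzchar(Sys.which(tool))) {
    stop("'", tool, "' was not found on the PATH")
  }
  # With no `output` argument, pdftk writes the report to stdout,
  # which system2(stdout = TRUE) captures.
  system2(tool, args = c(shQuote(path), "dump_data"), stdout = TRUE)
}

# Usage, assuming pdftk is installed and the file exists:
# info_lines <- dump_pdf_data("SomeOutputFile.pdf")
```

Checking for the executable first gives a clear error message instead of a shell failure when pdftk is missing.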
I'd look into what other options there are to specify what metadata gets extracted, and see if there's a convenient way to parse this information into a form that is more useful for you.
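One possible way to do that parsing is sketched below. `parse_pdftk_info` is a hypothetical helper name, and the sketch assumes that every `InfoKey:` line has a matching `InfoValue:` line in the same order, as in the output shown above. The sample lines are copied from that output so the block runs without pdftk installed.

```r
# Hypothetical helper: turn the Info entries of a pdftk dump into a
# named character vector (names = InfoKey, values = InfoValue).
parse_pdftk_info <- function(lines) {
  keys   <- sub("^InfoKey: ",   "", grep("^InfoKey: ",   lines, value = TRUE))
  values <- sub("^InfoValue: ", "", grep("^InfoValue: ", lines, value = TRUE))
  # Assumption: keys and values come in matching pairs
  stopifnot(length(keys) == length(values))
  names(values) <- keys
  values
}

# Sample input, copied from the readLines() output in the example
dump <- c(
  "InfoBegin", "InfoKey: Creator",      "InfoValue: R",
  "InfoBegin", "InfoKey: Title",        "InfoValue: R Graphics Output",
  "InfoBegin", "InfoKey: Producer",     "InfoValue: R 3.0.1",
  "InfoBegin", "InfoKey: ModDate",      "InfoValue: D:20131102170720",
  "InfoBegin", "InfoKey: CreationDate", "InfoValue: D:20131102170720",
  "NumberOfPages: 1",
  "PageMediaBegin", "PageMediaNumber: 1", "PageMediaRotation: 0",
  "PageMediaRect: 0 0 504 504", "PageMediaDimensions: 504 504"
)

info <- parse_pdftk_info(dump)
info[["Title"]]
# [1] "R Graphics Output"

# Top-level fields such as the page count can be pulled out the same way
n_pages <- as.integer(
  sub("^NumberOfPages: ", "", grep("^NumberOfPages: ", dump, value = TRUE))
)
```

A named vector keeps lookups simple (`info[["Producer"]]`); a data frame would be a reasonable alternative when comparing metadata across many files.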