Ruby/Rails: Traverse folders and parse metadata to seed DB

Question

I have a bunch of documents that I'd like to index in a Rails application. I'd like to use a rake task of sorts to comb a directory hierarchy looking for files and capturing the metadata from those files to index in Rails.

I'm not really sure how to do this in Ruby. I have found a utility called pdftk, which can extract the metadata from PDF files (most of what I'm indexing is PDFs), but I'm not sure how to capture the individual pieces of that data.

For example, I'd like to grab the ModDate, or each BookmarkTitle and BookmarkPageNumber, from the output below.

Specifically, I want to traverse a file hierarchy, execute the pdftk $filename dump_data command for each .pdf I find, and then capture the important parts of that output into one or more Rails models.

Output from pdftk:

$ pdftk BoringDocument883c2.pdf dump_data
InfoKey: Creator
InfoValue: Adobe Acrobat 9.3.4
InfoKey: Producer
InfoValue: Adobe Acrobat 9.34 Paper Capture Plug-in
InfoKey: ModDate
InfoValue: D:20110312194536-04'00'
InfoKey: CreationDate
InfoValue: D:20110214174733-05'00'
PdfID0: 2f28dcb8474c6849ae8628bc4157df43
PdfID1: 3e13c82c73a9f44bad90eeed137e7a1a
NumberOfPages: 126
BookmarkTitle: Alternative Maintenance Techniques
BookmarkLevel: 1
BookmarkPageNumber: 3
BookmarkTitle: CONTENTS
BookmarkLevel: 1
BookmarkPageNumber: 4
BookmarkTitle: EXHIBITS
BookmarkLevel: 1
BookmarkPageNumber: 6
BookmarkTitle: I - INTRODUCTION
BookmarkLevel: 1
BookmarkPageNumber: 8
BookmarkTitle: II - EXECUTIVE SUMMARY
BookmarkLevel: 1
BookmarkPageNumber: 13
BookmarkTitle: III - REMOTE DIAGNOSTICS - A STATUS REPORT
BookmarkLevel: 1
BookmarkPageNumber: 30
BookmarkTitle: IV - ALTERNATIVE TECHNIQUES
BookmarkLevel: 1
BookmarkPageNumber: 55
BookmarkTitle: V - COMPANYA - A SERVICE PHILOSOPHY
BookmarkLevel: 1
BookmarkPageNumber: 66
BookmarkTitle: VI - COMPANYB - REDUNDANT HARDWARE ARCHITECTURE
BookmarkLevel: 1
BookmarkPageNumber: 77
...shortened for brevity...
PageLabelNewIndex: 1
PageLabelStart: 1
PageLabelPrefix: F-E12_0001.jpg
PageLabelNumStyle: NoNumber
PageLabelNewIndex: 2
PageLabelStart: 1
PageLabelPrefix: F-E12_0002.jpg
PageLabelNumStyle: NoNumber
PageLabelNewIndex: 3
PageLabelStart: 1
PageLabelPrefix: F-E12_0003.jpg
PageLabelNumStyle: NoNumber
...
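
The dump_data output above is line-oriented "Key: Value" text, so capturing individual pieces could be sketched roughly as follows. This is only a sketch under assumptions: the method names pdftk_dump_data and parse_dump_data and the shape of the returned hash are made up for illustration, and pdftk is assumed to be on the PATH. Only the key names (InfoKey, InfoValue, NumberOfPages, BookmarkTitle, BookmarkLevel, BookmarkPageNumber) come from the output shown.

```ruby
require "open3"

# Hypothetical helper: runs pdftk and returns its dump_data output as a string.
# Assumes the pdftk binary is installed and on the PATH.
def pdftk_dump_data(pdf_path)
  output, status = Open3.capture2("pdftk", pdf_path, "dump_data")
  raise "pdftk failed for #{pdf_path}" unless status.success?
  output
end

# Hypothetical helper: turns dump_data text into
#   { info: { "ModDate" => "..." }, page_count: 126, bookmarks: [{ title:, level:, page: }] }
def parse_dump_data(text)
  result = { info: {}, page_count: nil, bookmarks: [] }
  info_key = nil

  text.each_line do |line|
    key, value = line.chomp.split(": ", 2)
    next if value.nil? # skip anything that is not "Key: Value"

    case key
    when "InfoKey"
      info_key = value
    when "InfoValue"
      # An InfoValue line belongs to the InfoKey line just before it.
      result[:info][info_key] = value if info_key
    when "NumberOfPages"
      result[:page_count] = value.to_i
    when "BookmarkTitle"
      # A title starts a new bookmark; level and page number follow it.
      result[:bookmarks] << { title: value }
    when "BookmarkLevel"
      result[:bookmarks].last[:level] = value.to_i unless result[:bookmarks].empty?
    when "BookmarkPageNumber"
      result[:bookmarks].last[:page] = value.to_i unless result[:bookmarks].empty?
    end
  end

  result
end
```

With something like this, parse_dump_data(pdftk_dump_data(path))[:info]["ModDate"] would give the raw ModDate string, and the :bookmarks array would hold one hash per bookmark. Keys that are not handled here, such as the PageLabel entries, are simply ignored.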

I've recently found the pdf-reader gem, which looks promising and may obviate the need to trigger pdftk in the shell.

Recommended answer

First off, let me say that my knowledge of Rake isn't that good, so there might be some mistakes. Let me know if something doesn't work and I would be happy to try and fix the problem.

To solve this, I am going to use 2 rake tasks. One of the rake tasks will be a recursive directory traversal task, and the other will be a task which kicks off the recursion.

desc "Populate the database with PDF metadata from the default PDF path"
task :populate_all_pdf_metadata do
  pdf_path = "/path/to/pdfs"

  Rake::Task[:populate_pdf_metadata].invoke(pdf_path)
end

desc "Recursively traverse a path looking for PDF metadata"
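task :populate_pdf_metadata, :pdf_path do |t, args|
  excluded_dir_names = [".", ".."] # Do not look in dirs with these names.

  pdf_path = args[:pdf_path]

  Dir.entries(pdf_path).each do |file|
    # Dir.entries returns bare names, so build the full path before testing it.
    full_path = File.join(pdf_path, file)

    if File.directory?(full_path)
      next if excluded_dir_names.include?(file)

      # A Rake task only runs once per process unless it is re-enabled.
      Rake::Task[:populate_pdf_metadata].reenable
      Rake::Task[:populate_pdf_metadata].invoke(full_path)
    elsif File.extname(file) == ".pdf"
      reader = PDF::Reader.new(full_path)

      # Populate the database here
    end
  end
end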
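
As an alternative to having the task invoke itself for every subdirectory, the traversal could probably be handed to Dir.glob, which already recurses with the "**" pattern. The following is a minimal sketch of that idea, not part of the answer above; pdf_files_under is a made-up method name, and matching the extension case-insensitively is an added assumption.

```ruby
# Hypothetical helper: returns the path of every PDF below root, at any depth.
def pdf_files_under(root)
  # base: keeps the pattern separate from root, so special characters in
  # root are not interpreted as glob syntax.
  Dir.glob("**/*", base: root)
     .map { |relative| File.join(root, relative) }
     .select { |path| File.file?(path) && File.extname(path).casecmp?(".pdf") }
     .sort
end

# Inside a single rake task this could then be used as:
#
#   pdf_files_under("/path/to/pdfs").each do |path|
#     reader = PDF::Reader.new(path)
#     # Populate the database here
#   end
```

Note that with the default flags Dir.glob skips hidden files and directories, which may or may not be what you want for your document tree.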

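I believe the code above is similar to what you want to do. To access the database, you will need to add the :environment dependency to your tasks, for example task :populate_pdf_metadata, [:pdf_path] => :environment, so that Rails and your ActiveRecord models are loaded before the task body runs. You can search for how to access ActiveRecord models from a rake task for more detail. I hope this helps.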
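
To give an idea of what the "Populate the database here" step might look like, here is a rough sketch that turns PDF metadata into a hash of model attributes. Several things here are assumptions rather than facts from the question: the Document model and its column names are hypothetical, the info argument is assumed to be a hash keyed by symbols such as :Creator and :ModDate (which is how I understand pdf-reader's reader.info to behave), and dates without a timezone are treated as UTC. The date pattern only handles full dates like the ones in the pdftk output shown in the question.

```ruby
# Matches PDF date strings such as D:20110312194536-04'00'
PDF_DATE = /\AD:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})(?:([+-])(\d{2})'?(\d{2})?'?|Z)?\z/

# Hypothetical helper: converts a PDF date string into a Time, or nil if it
# does not match. Assumption: a missing timezone is treated as UTC.
def parse_pdf_date(value)
  match = PDF_DATE.match(value.to_s)
  return nil unless match

  year, month, day, hour, min, sec = match.captures.first(6).map(&:to_i)
  sign, offset_hours, offset_minutes = match.captures[6..8]
  offset = sign ? "#{sign}#{offset_hours}:#{offset_minutes || '00'}" : "+00:00"

  Time.new(year, month, day, hour, min, sec, offset)
end

# Hypothetical helper: builds attributes for a made-up Document model.
def document_attributes(path, info, page_count)
  info ||= {}
  {
    path: path,
    creator: info[:Creator],
    producer: info[:Producer],
    modified_at: parse_pdf_date(info[:ModDate]),
    page_count: page_count
  }
end

# Inside the task (with the :environment dependency) this might be used as:
#
#   reader = PDF::Reader.new(full_path)
#   Document.create!(document_attributes(full_path, reader.info, reader.page_count))
```

Keeping the attribute mapping in a plain method like this means it can be checked without a database or a real PDF, and the rake task itself stays short.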
