How to convert PDF to Excel or CSV in Rails 4

Problem description

I have searched a lot and have no choice but to ask here. Does anyone know of an online converter with an API, or a gem, that can convert a PDF to an Excel or CSV file?

I am not sure whether this is the best place to ask this.

My application is in Rails 4.2. The PDF file contains a header and a large table with about 10 columns.

More info: the user uploads the PDF via a form, then I need to grab the PDF, parse it to CSV and read the content. I tried to read the content with the PDF Reader gem, but the result wasn't really promising.

I have used freepdfconvert.com/pdf-excel. Unfortunately, they don't supply an API (I have contacted them).

Sample PDF

This piece of code converts the PDF into text, which is handy. Gem: pdf-reader

require 'pdf/reader'

# Print the extracted text of every page in the uploaded PDF.
def self.parse
  reader = PDF::Reader.new("pdf_uploaded_by_user.pdf")
  reader.pages.each do |page|
    puts page.text
  end
end

Now, if you check the attached sample PDF, you will see that some fields might be empty. This means I can't simply split each text line on spaces and put it in an array, because I won't be able to map the array to the correct fields.
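To make the problem concrete, here is a minimal sketch in plain Ruby. The two text lines and the `HEADERS` names are hypothetical stand-ins for what text extraction might return; they are not taken from the sample PDF.

```ruby
# Hypothetical text lines, as PDF text extraction might return them.
# The second row has an empty middle column.
full_row  = "1001   Widget   4.50"
empty_mid = "1002            7.25"

# Hypothetical column names for the table.
HEADERS = %w[id name price]

# Splitting on whitespace collapses the empty column, so the
# remaining values shift left and no longer line up with HEADERS.
full_fields   = full_row.split
sparse_fields = empty_mid.split

puts HEADERS.zip(full_fields).inspect
puts HEADERS.zip(sparse_fields).inspect
```

In the second row the price ends up paired with the `name` header, which is exactly the mapping problem described above.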

Thank you.

Solution

OK, after a lot of research I couldn't find an API or even proper software that does this. Here is how I did it.

I first extract the table from the PDF into an HTML table using the pdftables API. It is cheap.

Then I convert the HTML table to CSV.

(This is not ideal but it works)

Here is the code:

require 'csv'
require 'nokogiri'
require 'httmultiparty'

class PageTextReceiver
  include HTTMultiParty
  base_uri 'http://localhost:3000'

  # Upload the PDF to pdftables and save the returned HTML table to disk.
  def run
    response = PageTextReceiver.post('https://pdftables.com/api?key=myapikey', :query => { f: File.new("/path/to/pdf/uploaded_pdf.pdf", "r") })

    File.open('/path/to/save/as/html/response.html', 'w') do |f|
      f.puts response.body
    end
  end

  # Parse the saved HTML and write one CSV row per table row.
  def convert
    # The block form closes the HTML file once it has been parsed.
    doc = File.open("/path/to/saved/html/response.html") { |f| Nokogiri::HTML(f) }
    # The block form of CSV.open closes the CSV file automatically.
    CSV.open("path/to/csv/t.csv", 'w', col_sep: ",", quote_char: '\'', force_quotes: true) do |csv|
      doc.xpath('//table/tr').each do |row|
        csv << row.xpath('td').map(&:text)
      end
    end
  end
end

Now run it like this:

#> page = PageTextReceiver.new
#> page.run
#> page.convert
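As a side note on the CSV options passed in `convert`, the following sketch shows what `col_sep`, `quote_char` and `force_quotes` do to a single row. It uses only Ruby's standard `csv` library; the sample row is hypothetical and stands in for the cell texts of one HTML table row.

```ruby
require 'csv'

# Hypothetical row, as it might come out of the HTML table cells.
# It includes an empty cell and a cell containing the quote character.
row = ["1002", "", "O'Brien, Ltd"]

# Same options as in the convert method above.
options = { col_sep: ",", quote_char: "'", force_quotes: true }

# force_quotes wraps every field in the quote character, and any
# quote character inside a field is doubled to escape it.
line = CSV.generate_line(row, **options)
puts line

# Reading the line back with the same separator and quote character
# should recover the original cells, including the empty one.
parsed = CSV.parse_line(line, col_sep: ",", quote_char: "'")
puts parsed.inspect
```

Because every field is quoted, an empty cell stays in its own column instead of disappearing, which keeps the columns aligned when the CSV is read back.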

It is not refactored; it is just a proof of concept. You need to consider performance.

I might use Sidekiq to run it in the background and pass the result back to the main thread.
