How could I parse a PDF document to either Excel or XML? Which solution would be best for a large number of documents?


Problem description

I've been researching the best way to parse (or extract) a PDF file into Excel or XML. I've looked at iText and ByteScout, and they may be the best fit for what I need to do, but I'm also considering coding it myself in VB.NET or VBScript and need to be pointed in the right direction to get started. Any help would be greatly appreciated.

KM

What I have tried:

I have tried both ByteScout and Aspose.PDF. They may work, but I don't fully understand them. I've looked at iText also.

Recommended answer

That could be a rather broad question for the 'Quick Answers' section. I suspect there are many tools that would do the job for you, but it's what you haven't mentioned that would determine the overall approach, and possibly the tools and language.

Some questions that might be asked (i.e., forming 'requirements'):

- "large amount of documents": how many?
- how much text is in each document?
- what is the source of the documents: file system, web server, email, database, etc.?
- why Excel or XML for output? What do you need to do with the extracted text, e.g., search it, reformat it?
- are you envisaging a batch process or a real-time/on-demand process?
- do you have a budget, i.e., can you pay for tools?
- what are the time constraints for delivering your 'project', as I'll call it?
- how are you going to track/trace which documents have been extracted?
(and probably lots more)

You say (of ByteScout and Aspose.PDF) "but I don't fully understand them". We don't know your background or how much experience you have. If you're going to have to write and support something, you may be better off with a product, so that you can turn to the product's supplier for help and support. Any decent SDK should also come with a number of examples/samples and support. This is a 'buy vs build' question.

Answers to, or thoughts on, some of the questions above might also suggest VB.NET over VBScript, for example on grounds of robustness, level of automation, ...

So, I'm sorry, there's no 'best way' based on the information you have shown. There could be lots of good ways and even more bad ways; the extraction 'tool' is only a small part of the solution.

[edit : Added]
You could also 'outsource' the extraction to a bureau/service, of course: you send them the PDFs and they send you back the data in the format you require, with no coding required on your part!
[/edit]

[edit 2]

OK, I would 'start' with a solution along the following lines, recognising that you may evolve some parts later on. Basically, it plays to your strengths in (for example) VB.NET and VBScript, and to what I believe are the strengths of those languages, by developing a set of 'modules', each with a single, simple purpose.

Input Modules
a) Write a set of 'input' modules, one for each type of input you have, for example:
- extract from email -> disk folder (may be VB.NET)
- copy from website folder -> disk folder (may be a VBScript module)
- (manual) from mail -> scan

Each input module needs to accept (command-line) parameters specific to how it gets its input, e.g. SMTP/email parameters, plus the directory in which to place the PDFs.
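As a rough illustration of one such input module, here is a minimal sketch of the "copy from folder -> disk folder" case. It is written in Python purely for brevity (the same shape would apply in VBScript or VB.NET), and the names `collect_pdfs`, `--source` and `--inbox` are made up for this example.

```python
import argparse
import shutil
from pathlib import Path


def collect_pdfs(source: Path, inbox: Path) -> list[Path]:
    """Copy every *.pdf in `source` into `inbox` and return the new paths."""
    inbox.mkdir(parents=True, exist_ok=True)
    copied = []
    for pdf in sorted(source.glob("*.pdf")):
        target = inbox / pdf.name
        if target.exists():
            continue  # already collected on an earlier run
        shutil.copy2(pdf, target)
        copied.append(target)
    return copied


def main(argv: list[str]) -> int:
    # Hypothetical command-line interface: both flags are assumptions.
    parser = argparse.ArgumentParser(description="Folder input module (sketch)")
    parser.add_argument("--source", type=Path, required=True)
    parser.add_argument("--inbox", type=Path, required=True)
    args = parser.parse_args(argv)
    copied = collect_pdfs(args.source, args.inbox)
    print(f"copied {len(copied)} file(s) to {args.inbox}")
    return 0
```

To use it as a console program you would call `main(sys.argv[1:])` from the usual `if __name__ == "__main__":` guard. Skipping files that already exist in the inbox keeps repeated runs harmless, which matters if the module is scheduled.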

Processing Modules
b) Write a 'core' PDF extractor. I'm suggesting VB.NET for this rather than VBScript, as I think you'll find its power, flexibility and expressiveness suit the task. It would be a console program that reads PDFs from disk, extracts the text, and stores the XML as a file on disk.

The processing module needs to accept (command-line) parameters saying where to read the PDFs from and where to put (for example) the XML output from the extraction.
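To make the shape of that console program concrete, here is a minimal sketch in Python (chosen for brevity; the structure would carry over to VB.NET). The actual text extraction is deliberately left as a hypothetical hook, `extract_text_placeholder`, because it depends on which PDF library you pick. The XML layout (a `document` element with a `source` attribute and a `text` child) and the `--in-dir`/`--out-dir` flags are assumptions for illustration only.

```python
import argparse
import sys
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Callable


def extract_text_placeholder(pdf_path: Path) -> str:
    """Hypothetical hook: call whichever PDF library you settle on here."""
    raise NotImplementedError("no PDF library wired in yet")


def pdf_to_xml(pdf_path: Path, out_dir: Path,
               extract_text: Callable[[Path], str]) -> Path:
    """Extract one PDF and write it as <document source=...><text>...</text></document>."""
    text = extract_text(pdf_path)
    root = ET.Element("document", {"source": pdf_path.name})
    ET.SubElement(root, "text").text = text
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / (pdf_path.stem + ".xml")
    ET.ElementTree(root).write(str(out_path), encoding="utf-8", xml_declaration=True)
    return out_path


def main(argv: list[str],
         extract_text: Callable[[Path], str] = extract_text_placeholder) -> int:
    parser = argparse.ArgumentParser(description="Core PDF extractor (sketch)")
    parser.add_argument("--in-dir", type=Path, required=True)
    parser.add_argument("--out-dir", type=Path, required=True)
    args = parser.parse_args(argv)

    failures = 0
    for pdf_path in sorted(args.in_dir.glob("*.pdf")):
        try:
            print(f"wrote {pdf_to_xml(pdf_path, args.out_dir, extract_text)}")
        except Exception as exc:  # one bad file should not stop the batch
            failures += 1
            print(f"failed {pdf_path.name}: {exc}", file=sys.stderr)
    return 1 if failures else 0
```

Returning a non-zero exit code when any file fails lets a calling script notice problems without parsing the output.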

c) Write a database loader module (or use SSIS or ...) that reads an XML file produced by (b) from disk and uploads it into the database.

The database loader will need to accept (command-line) parameters indicating where the XML files are and how to connect to the DB.
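A loader along those lines might look like the sketch below. It is in Python, and it uses SQLite only as a stand-in for whatever database you actually have. The assumed XML layout (a root element with a `source` attribute and a `text` child), the `documents` table, and the `--xml-dir`/`--db` flags are all illustrative assumptions.

```python
import argparse
import sqlite3
import xml.etree.ElementTree as ET
from pathlib import Path


def load_xml_dir(xml_dir: Path, conn: sqlite3.Connection) -> int:
    """Insert (or refresh) one row per XML file; return the number of files loaded."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "source TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    loaded = 0
    for xml_path in sorted(xml_dir.glob("*.xml")):
        root = ET.parse(str(xml_path)).getroot()
        source = root.get("source", xml_path.stem)
        body = root.findtext("text", default="")
        # Re-running the loader replaces the row instead of duplicating it.
        conn.execute(
            "INSERT OR REPLACE INTO documents (source, body) VALUES (?, ?)",
            (source, body),
        )
        loaded += 1
    conn.commit()
    return loaded


def main(argv: list[str]) -> int:
    parser = argparse.ArgumentParser(description="XML-to-database loader (sketch)")
    parser.add_argument("--xml-dir", type=Path, required=True)
    parser.add_argument("--db", required=True, help="path to the SQLite file")
    args = parser.parse_args(argv)
    conn = sqlite3.connect(args.db)
    try:
        count = load_xml_dir(args.xml_dir, conn)
    finally:
        conn.close()
    print(f"loaded {count} document(s)")
    return 0
```

With a server database, the `--db` argument would become a connection string and the SQL dialect would change, but the module boundary stays the same.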

VBScript is used like a 'DOS batch' language, as the 'glue' that binds everything together. It:
- runs each of the input modules
- for each PDF file on disk, runs the PDF extractor
- for each XML file, runs the upload-to-DB module
- runs any audit steps
- can be scheduled or run manually
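The glue steps listed above can be sketched as follows. This is Python rather than VBScript, purely so the example is short and easy to try out. It assumes a made-up calling convention in which the extractor takes a PDF path and an output folder as its last two arguments and the loader takes an XML path as its last argument; the audit step shown (PDFs with no matching XML file) is just one possible check.

```python
import subprocess
import sys
from pathlib import Path


def run_step(name: str, command: list[str]) -> bool:
    """Run one module as a child process and report success or failure."""
    result = subprocess.run(command, capture_output=True, text=True)
    ok = result.returncode == 0
    print(f"[{'ok' if ok else 'FAILED'}] {name}")
    if not ok and result.stderr:
        print(result.stderr.strip(), file=sys.stderr)
    return ok


def audit(inbox: Path, xml_dir: Path) -> list[str]:
    """Audit step: list PDFs that have no matching XML file."""
    done = {p.stem for p in xml_dir.glob("*.xml")}
    return sorted(p.name for p in inbox.glob("*.pdf") if p.stem not in done)


def run_pipeline(input_cmds: list[list[str]], extractor_cmd: list[str],
                 loader_cmd: list[str], inbox: Path, xml_dir: Path) -> list[str]:
    """Run inputs, then extraction, then loading; return the audit result."""
    inbox.mkdir(parents=True, exist_ok=True)
    xml_dir.mkdir(parents=True, exist_ok=True)

    for number, command in enumerate(input_cmds, start=1):
        run_step(f"input module {number}", command)

    for pdf in sorted(inbox.glob("*.pdf")):
        run_step(f"extract {pdf.name}", extractor_cmd + [str(pdf), str(xml_dir)])

    for xml_file in sorted(xml_dir.glob("*.xml")):
        run_step(f"load {xml_file.name}", loader_cmd + [str(xml_file)])

    return audit(inbox, xml_dir)
```

Because each module is just a command line, any of them can be swapped for a different executable without touching the glue, and the whole script can be launched by a scheduler or by hand.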

Keeping things as separate modules means, for example, that something written in VBScript can be upgraded or replaced later with something written in VB.NET, C# or even C++. Obviously, some inputs to the modules can come from the command line, and some you may wish to read from config-type files.
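One common way to mix the two kinds of input is to let a config file supply defaults and let the command line override them. Here is a minimal Python sketch of that idea; the `[paths]` section, the `inbox` and `xml_dir` keys, and the flag names are all hypothetical.

```python
import argparse
import configparser


def load_settings(argv: list[str]) -> dict[str, str]:
    """Merge built-in defaults, an optional INI file, and command-line flags."""
    # First pass: look only for --config so the file can be read before the rest.
    pre = argparse.ArgumentParser(add_help=False)
    pre.add_argument("--config")
    known, remaining = pre.parse_known_args(argv)

    defaults = {"inbox": "inbox", "xml_dir": "xml"}
    if known.config:
        cfg = configparser.ConfigParser()
        with open(known.config, encoding="utf-8") as handle:
            cfg.read_file(handle)
        if cfg.has_section("paths"):
            defaults.update(dict(cfg.items("paths")))

    # Second pass: flags given on the command line win over the file.
    parser = argparse.ArgumentParser(parents=[pre])
    parser.add_argument("--inbox")
    parser.add_argument("--xml-dir", dest="xml_dir")
    parser.set_defaults(**defaults)
    args = parser.parse_args(remaining)
    return {"inbox": args.inbox, "xml_dir": args.xml_dir}
```

Settings that rarely change (folders, connection details) can then live in the file, while a one-off run can still override a single value.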

[/edit 2]


Your process looks complicated.
Email => convert to PDF => extract data from PDF => feed to Excel
I would try something simpler.
Extract from email => feed to Excel
Since email is text, it should be simpler to extract the data.
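As a rough sketch of that shorter route, the Python below reads a raw email message, pulls out a few fields, and writes them as CSV, which Excel can open directly. The choice of fields and of CSV as the hand-off format are assumptions for illustration; a real mailbox would also need code to fetch the messages.

```python
import csv
import io
from email import message_from_string, policy

FIELDS = ["from", "subject", "body"]


def email_to_row(raw_email: str) -> dict[str, str]:
    """Parse one raw message and keep the sender, subject and plain-text body."""
    msg = message_from_string(raw_email, policy=policy.default)
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content() if body is not None else ""
    return {
        "from": str(msg["From"] or ""),
        "subject": str(msg["Subject"] or ""),
        "body": text.strip(),
    }


def rows_to_csv(rows: list[dict[str, str]]) -> str:
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

For example, `rows_to_csv([email_to_row(raw)])` gives text you could save with a `.csv` extension and open in Excel.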

