Extracting article contents from PDF magazines

Question

First of all, I am not looking for a specific development answer, but rather for a development approach.

The problem is this: I have a client with an enormous number of articles in PDFs, about 150 articles in fifty PDFs per year for the last 20 years. All of these PDFs are compiled from Quark XPress by people on Macs (if that matters). Every time a new PDF magazine is created, the web development team copies and pastes (!) each article into a form on the internet (!), including title, content, keywords, references, author name, etc. It usually takes one person about 3 full days to finish the job.

When I was working there (I no longer do; this was nearly seven years ago), I sped the process up threefold using a clipboard monitoring app and some simple XML-based PHP scripts that interact with the server. All you needed to do then was select text, press CTRL+C, select some more text, press CTRL+C, switch to the app (ALT+TAB), press 'next article', and repeat. But we, or mostly I, still spent about fifty days per year processing PDF magazines.

Now, seven years down the line, I am about to speak to my old boss again on a friendly visit. I know they are still using my apps (!). Perhaps it would be a good idea to look into their problem again and see whether I can suggest a coding project that could help them?

I have never used Quark XPress; I only know that it is something similar to MS Word, and that is as far as my knowledge of the software goes. I am not very familiar with unencrypted, extracted PDF code/syntax.

In short: Does Quark XPress have specific compilation patterns that scripts can rely on to extract articles from the PDFs? What 'intelligent' tools are there that can 'learn' from similarly structured PDF pages where the article contents are? Are there tools out there, such as Quark XPress modules of some sort, that can 'encapsulate' or 'mark' an article with an invisible reference tag to make extraction a lot simpler for scripts?

The people creating these PDFs have been doing their job for the past 20 years and are unwilling to change their workflow, except for software updates. Any additional tool must not interfere with their workflow, or they will simply refuse it.

I don't want code, merely some descriptions of what you or other people have done with regard to other PDF extraction problems. The best answer would be a description of perhaps several methods, or some references to external links with case descriptions.

Recommended answer

Broad question, but at first sight my answer would be that if you let them go as far as the PDF, you're already making things very difficult. If they are still using Quark XPress, there are far better ways to do this kind of thing, and similar approaches are actually used by quite a few publishers out there.

1) Look into generating both PDF and XML out of Quark XPress. It's fine that they don't want to change their ways, but they have to create PDF out of Quark anyway; also generating XML is not a really big additional step. In fact (warning - affiliation!) there are tools that can make all of this into one step. You could write AppleScript, for example, to steer the process, but something like axaio MadeToPrint will automatically generate both the (correct) PDF and an XML file after people click "export".

2) Once you have the PDF and the XML of the same content, use the PDF for print (just as now) and then write some code to convert the XML into whatever you need on the web site. If the coding is done on the web site itself, you might not even need to tweak the XML coming out of Quark; simply make the site smart enough to pick up whatever bits and pieces are necessary.
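To give a rough idea of what the conversion code in step 2 might look like, here is a minimal Python sketch. It assumes the export produces XML with `<article>`, `<title>`, `<author>`, `<keywords>/<keyword>` and `<body>/<p>` elements. Those element names, the `Article` class and the `parse_issue` and `to_form_fields` helpers are all invented for illustration; the real structure depends on how the Quark export is configured, so treat this as a starting point rather than a description of what Quark or MadeToPrint actually emits.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical export: the element names below are assumptions, not Quark's real schema.
SAMPLE_XML = """\
<issue>
  <article>
    <title>First article</title>
    <author>A. Writer</author>
    <keywords>
      <keyword>pdf</keyword>
      <keyword>xml</keyword>
    </keywords>
    <body>
      <p>Opening paragraph.</p>
      <p>Second paragraph.</p>
    </body>
  </article>
  <article>
    <title>Second article</title>
    <author>B. Editor</author>
    <body><p>Only paragraph.</p></body>
  </article>
</issue>
"""


@dataclass
class Article:
    title: str
    author: str
    keywords: List[str] = field(default_factory=list)
    paragraphs: List[str] = field(default_factory=list)


def parse_issue(xml_text: str) -> List[Article]:
    """Collect every <article> element of one magazine issue."""
    root = ET.fromstring(xml_text)
    articles = []
    for node in root.iter("article"):
        articles.append(
            Article(
                # findtext returns None for a missing element, hence the "or ''".
                title=(node.findtext("title") or "").strip(),
                author=(node.findtext("author") or "").strip(),
                keywords=[(k.text or "").strip()
                          for k in node.iterfind("keywords/keyword")],
                paragraphs=[(p.text or "").strip()
                            for p in node.iterfind("body/p")],
            )
        )
    return articles


def to_form_fields(article: Article) -> Dict[str, str]:
    """Flatten an article into the fields the web form currently expects."""
    return {
        "title": article.title,
        "author": article.author,
        "keywords": ", ".join(article.keywords),
        "content": "\n\n".join(article.paragraphs),
    }


if __name__ == "__main__":
    for art in parse_issue(SAMPLE_XML):
        print(to_form_fields(art))
```

The dictionaries returned by `to_form_fields` could then be posted to the same form the team fills in by hand today, so nothing changes for the people doing the layout.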

Broad answer to a broad question; hope that is what you are looking for...
