Extracting information from PDFs of research papers

Question

I need a mechanism for extracting bibliographic metadata from PDF documents, to save people entering it by hand or cut-and-pasting it.

At the very least, the title and abstract. The list of authors and their affiliations would be good. Extracting out the references would be amazing.

Ideally this would be an open source solution.

The problem is that not all PDFs encode the text, and many that do fail to preserve its logical order, so just running pdf2text gives you line 1 of column 1, line 1 of column 2, line 2 of column 1, etc.
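To make the column problem concrete, here is a minimal sketch of restoring reading order on a two-column page. It assumes a layout-aware extractor has already supplied each line's position; the `Line` type, the `reading_order` function, and the split at half the page width are all hypothetical simplifications, not the behaviour of any particular tool.

```python
from typing import List, NamedTuple


class Line(NamedTuple):
    x: float   # left edge of the line (assumed to come from the extractor)
    y: float   # distance from the top of the page
    text: str


def reading_order(lines: List[Line], page_width: float) -> List[str]:
    """Return the left column top to bottom, then the right column."""
    middle = page_width / 2  # assumption: two equal-width columns
    left = sorted((l for l in lines if l.x < middle), key=lambda l: l.y)
    right = sorted((l for l in lines if l.x >= middle), key=lambda l: l.y)
    return [l.text for l in left + right]


# Lines in the interleaved order a naive text dump might produce.
interleaved = [
    Line(50, 100, "col1 line1"),
    Line(320, 100, "col2 line1"),
    Line(50, 115, "col1 line2"),
    Line(320, 115, "col2 line2"),
]
ordered = reading_order(interleaved, page_width=600)
```

Real pages also carry full-width titles, footnotes, and figures, so a fixed midpoint split is only a starting point.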

I know there are a lot of libraries. It's identifying the abstract, title, authors, etc. in the document that I need to solve. This is never going to be possible every time, but 80% would save a lot of human effort.
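One common starting heuristic for this kind of identification, sketched here under stated assumptions, is to take the largest-font text on the first page as the title and the text after an "Abstract" heading as the abstract. The `Span` type, the `guess_title_and_abstract` function, and the `STOP_HEADINGS` set are hypothetical, and the spans are assumed to come from an extractor that reports font sizes.

```python
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    text: str
    font_size: float  # assumed to be reported by the extractor


# Hypothetical headings that mark the end of the abstract.
STOP_HEADINGS = {"keywords", "introduction", "1 introduction", "1. introduction"}


def guess_title_and_abstract(spans: List[Span]) -> Tuple[str, str]:
    """Guess the title (largest font) and abstract (text after 'Abstract')."""
    if not spans:
        return "", ""
    largest = max(s.font_size for s in spans)
    title = " ".join(s.text for s in spans if s.font_size == largest)

    abstract_parts = []
    in_abstract = False
    for s in spans:
        label = s.text.strip().lower().rstrip(".:")
        if label == "abstract":
            in_abstract = True
            continue
        if in_abstract:
            if label in STOP_HEADINGS:
                break
            abstract_parts.append(s.text)
    return title, " ".join(abstract_parts)


first_page = [
    Span("Journal of Examples", 9),
    Span("A Study of", 18),
    Span("Metadata Extraction", 18),
    Span("A. Author", 11),
    Span("Abstract", 10),
    Span("We extract metadata.", 10),
    Span("It mostly works.", 10),
    Span("1. Introduction", 12),
    Span("Body text", 10),
]
title, abstract = guess_title_and_abstract(first_page)
```

Rules like these fail on papers with unusual layouts, which is consistent with aiming for roughly 80% rather than every document.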

Recommended answer

We ran a contest to solve this problem at Dev8D in London in February 2010, and we got a nice little GPL tool created as a result. We've not yet integrated it into our systems, but it's out there in the world.

https://code.google.com/p/pdfssa4met/
