Extracting information from PDFs of research papers

Views: 88 Published: 2020/5/9 1:48:05 pdf metadata extraction

Question

I need a mechanism for extracting bibliographic metadata from PDF documents, to save people entering it by hand or cut-and-pasting it.

At the very least, the title and abstract. The list of authors and their affiliations would be good. Extracting out the references would be amazing.

Ideally this would be an open source solution.

The problem is that not all PDFs encode the text, and many that do fail to preserve its logical order, so just running pdf2text gives you line 1 of column 1, line 1 of column 2, line 2 of column 1, etc.
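To make the column problem concrete, here is a minimal sketch of the interleaving and a naive way to undo it. It assumes the extractor can report a position for each text line, and that the page is split into equal-width columns. The `Line` type, the `reading_order` function, and the coordinates are hypothetical, made up for illustration, and do not come from any particular PDF library.

```python
from typing import NamedTuple


class Line(NamedTuple):
    """One extracted text line with its position on the page (hypothetical type)."""
    x: float   # left edge of the line, measured from the left of the page
    y: float   # top edge of the line, measured from the top of the page
    text: str


def reading_order(lines, page_width, columns=2):
    """Sort lines column by column, then top to bottom within each column.

    Assumes equal-width columns; real layouts need a more careful split.
    """
    col_width = page_width / columns

    def key(line):
        # Column index from the left edge, clamped to the last column.
        col = min(int(line.x // col_width), columns - 1)
        return (col, line.y)

    return [line.text for line in sorted(lines, key=key)]


# The order a naive extractor might emit: row by row across both columns.
raw = [
    Line(50, 100, "col 1, line 1"),
    Line(320, 100, "col 2, line 1"),
    Line(50, 115, "col 1, line 2"),
    Line(320, 115, "col 2, line 2"),
]

ordered = reading_order(raw, page_width=600)
print(ordered)
```

This only handles the simplest case. Full-width titles, abstracts that span both columns, footnotes, and figures would all break the equal-width assumption.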

I know there are a lot of libraries. It's identifying the abstract, title, authors, etc. on the document that I need to solve. This is never going to be possible every time, but 80% would save a lot of human effort.
