How can I get all text from a PDF in Swift?

Question

I have a PDF document and would like to extract all its text. I tried the following:

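import PDFKit  // PDFDocument is part of PDFKit (on macOS it is also available via Quartz)

// Load test.pdf from the app bundle and print all text PDFKit extracts from it.
if let url = Bundle.main.url(forResource: "test", withExtension: "pdf"),
   let pdf = PDFDocument(url: url) {
    print(pdf.string ?? "")
}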

It does get the text; however, the order of the extracted lines is completely mixed up compared to opening the PDF in Adobe and using Edit > Select All, Copy, Paste!

How can I get the same outcome in Swift as opening the PDF and doing Select All, Copy/Paste?

Recommended answer

That is unfortunately not possible.
At least not without some major work on your part. And it is certainly not possible in a general manner for all PDFs.

PDFs are (generally) a one-way street.
They were created to display text in the same way on every system without any difference and for printers to print a document without the printer having to know all fonts and stuff.

Extracting text is non-trivial, and it is only possible for PDFs in which the rendered content is accompanied by actual text data (which it does not have to be; a scanned document, for example, may contain only images). All text information present in the PDF is coupled with location information that determines where it is shown.

If you have a table shown in the PDF where the left column contains the names of the entries and the right column contains their contents, both of those columns can be represented as completely separate blocks of text, which only appear to be linked because they are placed next to each other.
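
To make the table example concrete, here is a minimal sketch in plain Swift. It does not use PDFKit: `TextFragment`, `readingOrder`, the coordinates, and the `lineTolerance` value are all hypothetical and chosen only for illustration. It shows how text stored column by column reads differently once it is regrouped by position.

```swift
import Foundation

/// A piece of text together with where it is drawn on the page.
/// (Hypothetical type for illustration; not part of PDFKit.)
struct TextFragment {
    let text: String
    let x: Double   // left edge, in page units
    let y: Double   // baseline, measured upward from the bottom of the page
}

/// Groups fragments into visual lines (top to bottom), then orders each line
/// left to right. `lineTolerance` is an assumed slack for uneven baselines.
func readingOrder(_ fragments: [TextFragment], lineTolerance: Double = 2.0) -> [[String]] {
    // Highest y first, because PDF coordinates grow upward.
    let topToBottom = fragments.sorted { $0.y > $1.y }
    var lines: [[TextFragment]] = []
    for fragment in topToBottom {
        // Compare against the first fragment of the current line.
        if let first = lines.last?.first, abs(first.y - fragment.y) <= lineTolerance {
            lines[lines.count - 1].append(fragment)
        } else {
            lines.append([fragment])
        }
    }
    return lines.map { line in line.sorted { $0.x < $1.x }.map { $0.text } }
}

// A two-column table stored column by column, as a PDF producer might emit it.
let stored = [
    TextFragment(text: "Name", x: 50, y: 700),
    TextFragment(text: "Color", x: 50, y: 680),
    TextFragment(text: "Apple", x: 200, y: 700),
    TextFragment(text: "Red", x: 200, y: 681),
]

// Stored order: the whole left column, then the whole right column.
print(stored.map { $0.text }.joined(separator: " "))

// Positional order: one table row per line.
for line in readingOrder(stored) {
    print(line.joined(separator: " "))
}
```

Sorting by position recovers the rows here only because the layout is a simple grid. With multi-column prose, rotated text, or footnotes, the same heuristic would interleave unrelated text, which is exactly the problem described above.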

What the framework / your code would have to do is determine what parts of text that are visually linked are also logically linked and belong together. That is not (yet) possible. The reason you and I can read and understand and group the PDF is that in some fields our brain is still far better than computers.
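
If a purely positional ordering is good enough for a particular document, one thing to try is to ask PDFKit for line-level selections and sort them yourself. The following is only a sketch under that assumption: `linesInVisualOrder(of:)` is a hypothetical helper name, and sorting top to bottom, then left to right, is a guess at what "Select All, Copy" produces, not a description of what Adobe or Apple actually do.

```swift
import Foundation
#if canImport(PDFKit)
import PDFKit

/// Returns the text of every page as lines sorted by their position on the
/// page rather than by the order in which PDFKit reports them.
/// A heuristic sketch, not a general solution.
func linesInVisualOrder(of document: PDFDocument) -> [String] {
    var result: [String] = []
    for index in 0..<document.pageCount {
        // Select everything inside the page's media box.
        guard let page = document.page(at: index),
              let everything = page.selection(for: page.bounds(for: .mediaBox)) else { continue }
        // Split the selection into lines and remember where each line sits.
        let lines = everything.selectionsByLine().map { line in
            (text: line.string ?? "", box: line.bounds(for: page))
        }
        // PDF page coordinates grow upward, so a larger y is higher on the page.
        let sorted = lines.sorted { a, b in
            if a.box.midY != b.box.midY { return a.box.midY > b.box.midY }
            return a.box.minX < b.box.minX
        }
        result.append(contentsOf: sorted.map { $0.text })
    }
    return result
}
#endif
```

A call such as `linesInVisualOrder(of: pdf).joined(separator: "\n")` would then stand in for `pdf.string`. Whether the result is any closer to the copy/paste order depends entirely on the document's layout.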

Final note, because it might cause confusion: it is certainly possible that Adobe, and Apple as well, already do some of this grouping and achieve good results, but it is still not perfect. The PDF I just tested was pretty mangled after extracting the text via Preview on the Mac.
