PDF Parsing with Swift

Question

I want to parse a PDF that contains only text, no images, and find specific pieces of text in it. For example, I want to search for the string "Name:" and read the characters that follow the ":".

I can already open a PDF, get the number of pages, and loop over them. The problem comes when I want to use functions like CGPDFDictionaryGetStream or CGPDFStreamCopyData, because they use pointers. I have not found much information online for Swift programmers.

Maybe the easiest way would be to extract all the content into an NSString. Then I could do the rest myself.

Here is my code:

// Get existing Pdf reference
let pdf = CGPDFDocumentCreateWithURL(NSURL(fileURLWithPath: path))
let pageCount = CGPDFDocumentGetNumberOfPages(pdf);
for index in 1...pageCount {
    let myPage = CGPDFDocumentGetPage(pdf, index)
    // Somehow search for the string "Name:" and read what is written after it
}

Recommended answer

You can use PDFKit to do this. It is available on both macOS and iOS: on macOS it ships as part of the Quartz umbrella framework, and on iOS (11 and later) it is imported directly with import PDFKit. It is also pretty fast; I was able to search through a PDF with over 15000 characters in just 0.07s.

Here is an example:

import Quartz

let pdf = PDFDocument(url: URL(fileURLWithPath: "/Users/...some path.../test.pdf"))

guard let contents = pdf?.string else {
    print("could not get string from pdf: \(String(describing: pdf))")
    exit(1)
}

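// Split on the marker; parts[1] only exists if the marker was found
let parts = contents.components(separatedBy: "FOOT NOTE: ")
guard parts.count > 1 else {
    print("no foot note found")
    exit(1)
}
let footNote = parts[1] // all the text after the first foot note

print(footNote.components(separatedBy: "\n")[0]) // print the first line of that text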

// Output: "The operating system being written in C resulted in a more portable software."
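
The same idea can be applied to the "Name:" label from the question. The following is only a minimal sketch, assuming the label and its value sit on the same line of the extracted text. valueAfter(label:in:) is a hypothetical helper written for illustration, not part of PDFKit, and the sample string stands in for the text returned by pdf?.string:

```swift
import Foundation

/// Hypothetical helper (not part of PDFKit): returns the text that follows
/// the first occurrence of `label`, up to the end of that line, trimmed.
/// Returns nil when the label does not occur in `text`.
func valueAfter(label: String, in text: String) -> String? {
    guard let labelRange = text.range(of: label) else { return nil }
    let rest = text[labelRange.upperBound...]
    // Stop at the first line break, or take everything if there is none.
    let line = rest.prefix(while: { !$0.isNewline })
    return line.trimmingCharacters(in: .whitespaces)
}

// Sample text standing in for the contents extracted from a PDF.
let sample = "Invoice 42\nName: Jane Doe\nCity: Paris"
print(valueAfter(label: "Name:", in: sample) ?? "not found")
```

Returning an optional instead of indexing into the split result avoids a crash when the label is missing from the document.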

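You can also still access most (if not all) of the properties you had before, such as pdf.pageCount for the number of pages and pdf.page(at: <Int>) to get a specific page. Because pdf is an optional in the example above, these need optional chaining or unwrapping first. Note that page(at:) is zero-based, unlike CGPDFDocumentGetPage, which numbers pages starting from 1.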
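
If you need to know which page a piece of text is on, you can work page by page instead of on the whole document string. The following is a rough sketch under a few assumptions: pageTexts(of:) and pagesContaining(_:in:) are hypothetical helper names, the PDFKit part is only compiled where PDFKit is available, and the sample array stands in for real page text:

```swift
import Foundation
#if canImport(PDFKit)
import PDFKit

/// Hypothetical helper: collects the extracted text of every page.
/// Pages that yield no text are represented by an empty string.
func pageTexts(of document: PDFDocument) -> [String] {
    // PDFKit page indices start at 0.
    (0..<document.pageCount).map { document.page(at: $0)?.string ?? "" }
}
#endif

/// Hypothetical helper: zero-based indices of the pages whose text contains `needle`.
func pagesContaining(_ needle: String, in texts: [String]) -> [Int] {
    texts.indices.filter { texts[$0].contains(needle) }
}

// Sample page texts standing in for the result of pageTexts(of:).
let samplePages = ["Cover page", "Name: Jane Doe", "Appendix"]
print(pagesContaining("Name:", in: samplePages))
```

Keeping the search itself independent of PDFKit makes it easy to try out on plain strings before pointing it at a real document.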