Parse whole pdf-page to NSString on an iPhone


Problem description

I've been trying to parse a PDF page of text into an NSString for a while now, and the only things I can find are methods that search for specific string values.

What I'd like to do is parse a single PDF page without using any external libraries such as PDFKitten, PDFKit, etc.

I'd like to have the data in an NSArray, NSString, or NSDictionary if possible.

Thanks :D!

Here is a piece of what I've tried so far.

CGPDFDocumentRef MyGetPDFDocumentRef (const char *filename) {
    CFStringRef path;
    CFURLRef url;
    CGPDFDocumentRef document;
    path = CFStringCreateWithCString (NULL, filename, kCFStringEncodingUTF8);
    url = CFURLCreateWithFileSystemPath (NULL, path, kCFURLPOSIXPathStyle, 0);
    CFRelease (path);
    document = CGPDFDocumentCreateWithURL (url);
    CFRelease (url);
    if (document == NULL) {  // file missing or not a valid PDF
        printf("`%s' could not be opened as a PDF!\n", filename);
        return NULL;
    }
    size_t count = CGPDFDocumentGetNumberOfPages (document);
    if (count == 0) {
        printf("`%s' needs at least one page!\n", filename);
        CGPDFDocumentRelease (document);  // don't leak the empty document
        return NULL;
    }
    return document;  // the caller owns the document and must release it
}

// table methods to parse pdf
static void op_MP (CGPDFScannerRef s, void *info) {
    const char *name;
    if (!CGPDFScannerPopName(s, &name))
        return;
    printf("MP /%s\n", name);
}

static void op_DP (CGPDFScannerRef s, void *info) {
    const char *name;
    if (!CGPDFScannerPopName(s, &name))
        return;
    printf("DP /%s\n", name);
}

static void op_BMC (CGPDFScannerRef s, void *info) {
    const char *name;
    if (!CGPDFScannerPopName(s, &name))
        return;
    printf("BMC /%s\n", name);
}

static void op_BDC (CGPDFScannerRef s, void *info) {
    const char *name;
    if (!CGPDFScannerPopName(s, &name))
        return;
    printf("BDC /%s\n", name);
}

static void op_EMC (CGPDFScannerRef s, void *info) {
    // EMC takes no operands, so there is no name to pop.
    printf("EMC\n");
}

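void MyDisplayPDFPage (CGContextRef myContext, size_t pageNumber, const char *filename) {
    CGPDFDocumentRef document = MyGetPDFDocumentRef (filename);
    if (document == NULL)
        return;

    // Page numbers are 1-based; use the requested page instead of a hard-coded 1.
    CGPDFPageRef page = CGPDFDocumentGetPage (document, pageNumber);
    if (page == NULL) {
        CGPDFDocumentRelease (document);
        return;
    }

    // Register one callback per marked-content operator.
    CGPDFOperatorTableRef myTable = CGPDFOperatorTableCreate();
    CGPDFOperatorTableSetCallback (myTable, "MP", &op_MP);
    CGPDFOperatorTableSetCallback (myTable, "DP", &op_DP);
    CGPDFOperatorTableSetCallback (myTable, "BMC", &op_BMC);
    CGPDFOperatorTableSetCallback (myTable, "BDC", &op_BDC);
    CGPDFOperatorTableSetCallback (myTable, "EMC", &op_EMC);

    // Scan the page's content stream; the callbacks fire in stream order.
    CGPDFContentStreamRef myContentStream = CGPDFContentStreamCreateWithPage (page);
    CGPDFScannerRef myScanner = CGPDFScannerCreate (myContentStream, myTable, NULL);
    CGPDFScannerScan (myScanner);

    // The page dictionary holds page attributes, not the text drawn on the
    // page, so this lookup only succeeds if the dictionary has a "Lorem" entry.
    CGPDFDictionaryRef d = CGPDFPageGetDictionary(page);
    CGPDFStringRef str;
    if (CGPDFDictionaryGetString(d, "Lorem", &str)) {
        CFStringRef s = CGPDFStringCopyTextString(str);
        if (s != NULL) {
            NSLog(@"%@ testing it", s);
            CFRelease(s);  // release only a non-NULL string
        }
    }

    // Release everything created above; the page is owned by the document.
    CGPDFScannerRelease (myScanner);
    CGPDFContentStreamRelease (myContentStream);
    CGPDFOperatorTableRelease (myTable);
    CGPDFDocumentRelease (document);
}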

- (void)viewDidLoad {
    [super viewDidLoad];


    MyDisplayPDFPage(UIGraphicsGetCurrentContext(), 1, [[[NSBundle mainBundle] pathForResource:@"TestPage" ofType:@"pdf"] UTF8String]);

}


Recommended answer

Quartz provides functions that let you inspect the PDF document structure and the content stream. Inspecting the document structure lets you read the entries in the document catalog and the contents associated with each entry. By recursively traversing the catalog, you can inspect the entire document.

A PDF content stream is just what its name suggests: a sequential stream of data such as 'BT 12 /F71 Tf (draw this text) Tj . . . ' where PDF operators and their descriptors are mixed with the actual PDF content. Inspecting the content stream requires that you access it sequentially.
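
To make the "sequential access" point concrete, here is a minimal sketch in plain C that walks a content-stream string like the one quoted above and collects the operand of each Tj (show text) operator. The function name `extract_tj_text` is hypothetical and is not part of Quartz; the sketch assumes the stream is already decompressed, handles only literal `( ... )` strings, and ignores hex strings, TJ arrays, and font encodings, all of which a real page would need.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a Quartz function): scan a content stream from
   start to end and append the literal-string operand of every Tj operator
   to `out`. Escapes such as \n are not decoded; the character after a
   backslash is simply copied. */
static void extract_tj_text(const char *stream, char *out, size_t out_size) {
    char pending[256] = "";   /* most recent string operand seen */
    size_t out_len = 0;
    const char *p = stream;

    assert(stream != NULL && out != NULL);
    if (out_size == 0)
        return;
    out[0] = '\0';

    while (*p) {
        if (isspace((unsigned char)*p)) { p++; continue; }

        if (*p == '(') {                      /* literal string operand */
            size_t n = 0;
            int depth = 1;                    /* parentheses may nest */
            p++;
            while (*p && depth > 0) {
                if (*p == '\\' && p[1]) {
                    p++;                      /* copy the escaped character */
                } else if (*p == '(') {
                    depth++;
                } else if (*p == ')') {
                    if (--depth == 0) break;  /* closing paren of the operand */
                }
                if (n + 1 < sizeof pending) pending[n++] = *p;
                p++;
            }
            pending[n] = '\0';
            if (*p == ')') p++;
            continue;
        }

        /* Any other token (number, name, operator) ends at whitespace or '('. */
        const char *start = p;
        while (*p && !isspace((unsigned char)*p) && *p != '(') p++;

        if ((size_t)(p - start) == 2 && strncmp(start, "Tj", 2) == 0) {
            size_t n = strlen(pending);       /* Tj consumes the pending string */
            if (out_len + n < out_size) {
                memcpy(out + out_len, pending, n);
                out_len += n;
                out[out_len] = '\0';
            }
        }
    }
}
```

Calling `extract_tj_text("BT 12 /F71 Tf (draw this text) Tj ET", buf, sizeof buf)` with a 256-byte `buf` leaves `draw this text` in the buffer. With Quartz you would not tokenize by hand: the scanner does this walk for you, and the equivalent step is registering callbacks for the text-showing operators instead of only the marked-content ones.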

The Apple developer documentation shows how to examine the structure of a PDF document and parse the contents of a PDF document.
