iText: how to check if a giant string is present on a PDF page


Problem description

- I am using the iText library to create and read PDFs in my Java project.
- I am reading multiple text files with any extension (pdf, doc, word, etc.) and writing their content to a new PDF (the content of all the files joined together).
- To separate the content of each file in the giant PDF, I always start a new page, write the exact path of the file in red at the top of the new page, and then write the content of the file.

Problem:


  • I want to write how many pages each file takes up in this PDF.
  • How do I check if a string is present on a PDF page? I have all the file paths, so I would like to check whether any of the paths is written on the page.
  • I was following this tutorial to extract the text of any of my pages: http://www.quicklyjava.com/read-pdf-file-in-java-using-itext/
  • But when I extract the whole page and check whether one of my file paths is present on the page (using string.contains(...)), the system doesn't find my file path on the PDF page! I looked into why this happens, and when I printed one page's string, it looked like this:

1.
PdfGeneratorForSoftwareRegistration/PdfGeneratorForSoftwareRegistration/
src/br/ufrn/pairg/pdfgenerator/LeitorArquivoTexto.java
package br.ufrn.pairg.pdfgenerator;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.Scanner;

public...

When I checked whether the file path "PdfGeneratorForSoftwareRegistration/PdfGeneratorForSoftwareRegistration/src/br/ufrn/pairg/pdfgenerator/LeitorArquivoTexto.java" was present in this giant string, the system didn't find it. Can you see the problem? My path is so long that it occupies 2 lines, so the extracted text contains a line break in the middle of it. That's the problem!

So, my question is: is there a way to check whether a giant string is present in the text of a PDF using iText?
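The mismatch described in the question can be reproduced in plain Java, without iText. The sketch below is not part of the original thread: it assumes the extracted page text contains a line break in the middle of the path, and it compares both strings after removing all whitespace. The names `PathOnPageCheck`, `stripWhitespace` and `containsIgnoringWhitespace` are hypothetical helpers introduced only for illustration.

```java
public class PathOnPageCheck {

    // Removes every whitespace character (spaces, tabs, line breaks).
    static String stripWhitespace(String s) {
        return s.replaceAll("\\s+", "");
    }

    // True if 'path' occurs in 'pageText' once whitespace is ignored on both sides.
    static boolean containsIgnoringWhitespace(String pageText, String path) {
        return stripWhitespace(pageText).contains(stripWhitespace(path));
    }

    public static void main(String[] args) {
        // Assumed shape of the extracted text: the long path wraps onto a second line.
        String pageText = "1.\n"
                + "PdfGeneratorForSoftwareRegistration/PdfGeneratorForSoftwareRegistration/\n"
                + "src/br/ufrn/pairg/pdfgenerator/LeitorArquivoTexto.java\n"
                + "package br.ufrn.pairg.pdfgenerator;\n";
        String path = "PdfGeneratorForSoftwareRegistration/PdfGeneratorForSoftwareRegistration/"
                + "src/br/ufrn/pairg/pdfgenerator/LeitorArquivoTexto.java";

        System.out.println(pageText.contains(path));                    // false
        System.out.println(containsIgnoringWhitespace(pageText, path)); // true
    }
}
```

Ignoring whitespace is a blunt comparison: two paths that differ only by a space would be treated as equal, and the path could also match text in the body of a file rather than in the red header line.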

Recommended answer

Pages in a PDF file are organized using a page tree. Each leaf of the page tree is a page dictionary with keys and values. You could add a custom entry to the page dictionary like this:

public void createPdf(String dest) throws IOException, DocumentException {
    Document document = new Document();
    PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(dest));
    document.open();
    document.add(new Paragraph("Page 1"));
    document.newPage();
    document.add(new Paragraph("Page 2"));
    document.newPage();
    document.add(new Paragraph("Page 3"));
    document.newPage();
    document.add(new Paragraph("Page 4"));
    writer.addPageDictEntry(new PdfName("ITXT_PageMarker"), new PdfString("Marker for page 4"));
    document.newPage();
    document.add(new Paragraph("Page 5"));
    document.newPage();
    document.add(new Paragraph("Page 6"));
    writer.addPageDictEntry(new PdfName("ITXT_PageMarker"), new PdfName("PageMarker"));
    document.newPage();
    document.add(new Paragraph("Page 7"));
    writer.addPageDictEntry(new PdfName("ITXT_PageMarker"), new PdfNumber(7));
    document.newPage();
    document.add(new Paragraph("Page 8"));
    document.close();
}

If you look inside the PDF, you will find the custom entry in the page dictionaries of the marked pages. For the sake of this example, I added a PDF string for page 4, a PDF name for page 6 and a PDF number for page 7.

You can check for the presence of this custom key like this:

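public void check(String filename) throws IOException {
    PdfReader reader = new PdfReader(filename);
    PdfDictionary pagedict;
    // Pages are numbered from 1, so the loop must include getNumberOfPages().
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        pagedict = reader.getPageN(i);
        // Prints the value of the custom entry, or null if the page has none.
        System.out.println(pagedict.get(new PdfName("ITXT_PageMarker")));
    }
    reader.close();
}

For the PDF created above, check() prints:

null
null
null
Marker for page 4
null
/PageMarker
7
null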
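To connect this back to the question of how many pages each file occupies, here is a minimal sketch in plain Java. It assumes that a marker holding the file path is written on the first page of every file, and that the per-page values returned by a loop like the one in check() have been collected into a list, with null for pages that carry no marker. The class `PagesPerFile` and the method `countPages` are hypothetical names, not part of iText.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PagesPerFile {

    // markers.get(i) is the marker read from page i + 1, or null if that page has none.
    // A page without a marker is counted for the most recent marked page before it.
    static Map<String, Integer> countPages(List<String> markers) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        String current = null;
        for (String marker : markers) {
            if (marker != null) {
                current = marker; // a new file starts on this page
            }
            if (current != null) {
                counts.merge(current, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical markers for a 5-page PDF built from two files.
        List<String> markers = Arrays.asList(
                "a/First.java", null, null, "b/Second.java", null);
        System.out.println(countPages(markers)); // {a/First.java=3, b/Second.java=2}
    }
}
```

Because the count relies on markers in the page dictionary rather than on extracted text, a path that wraps over two lines on the page does not affect it.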

Important: You can't just invent new keys for the PDF syntax apart from those defined in ISO 32000. However, you can create your own custom keys if you register a 4-character prefix with ISO. For instance: Adobe registered ADBE, iText registered ITXT,... If you introduce new custom keys, you should use the registered prefix at the start of the key name. For instance: at iText, we can use ITXT_PageMarker, or ITXT_custom, or ITXT_Whatever,... This rule prevents two different companies from introducing the same key with different meanings.
