iText: how to check if a giant string is present on a PDF page
Question
- I am using the iText library to create and read PDFs in my Java project.
- I am reading multiple text files of various types (pdf, doc, word, etc.) and writing their content to a new PDF (the content of all the files joined together).
- To separate the files within the giant PDF, I always start a new page, write the exact path to the file in red at the top of that page, and then write the content of the file.
Problem:
- I want to report how many pages each file occupies in this PDF.
- How do I check if a string is present on a PDF page? I have all the file paths, so I would like to check whether any of the paths is written on the page.
- I was following this tutorial to extract the text of any of my pages: http://www.quicklyjava.com/read-pdf-file-in-java-using-itext/
But when I extract the text of each page and check whether one of my file paths is present on the page (using string.contains(...)), the system doesn't find my file path on the PDF page! I looked into why this happens, and when I printed one page's text, it looked like this:
1. PdfGeneratorForSoftwareRegistration/PdfGeneratorForSoftwareRegistration/
src/br/ufrn/pairg/pdfgenerator/LeitorArquivoTexto.java
package br.ufrn.pairg.pdfgenerator;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.Scanner;
public...
When I checked whether the file path "PdfGeneratorForSoftwareRegistration/PdfGeneratorForSoftwareRegistration/src/br/ufrn/pairg/pdfgenerator/LeitorArquivoTexto.java" was present in this giant string, the system didn't find it. Can you see the problem? My path is so long that it occupies two lines on the page, so the extracted text has a line break in the middle of it. That's the problem!
So, my question is: is there a way to check whether a giant string is present in the text of a PDF using iText?
Recommended answer
Pages in a PDF file are organized using a page tree. Each leaf of the page tree is a page dictionary with keys and values. Instead of searching the page text, you could add a custom entry to the page dictionary like this:
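// Imports assume iText 5 (com.itextpdf.text); adjust the package for other versions.
import java.io.FileOutputStream;
import java.io.IOException;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfNumber;
import com.itextpdf.text.pdf.PdfString;
import com.itextpdf.text.pdf.PdfWriter;

public void createPdf(String dest) throws IOException, DocumentException {
    Document document = new Document();
    PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(dest));
    document.open();
    document.add(new Paragraph("Page 1"));
    document.newPage();
    document.add(new Paragraph("Page 2"));
    document.newPage();
    document.add(new Paragraph("Page 3"));
    document.newPage();
    document.add(new Paragraph("Page 4"));
    // Custom entry with a PDF string value, stored in the dictionary of page 4.
    writer.addPageDictEntry(new PdfName("ITXT_PageMarker"), new PdfString("Marker for page 4"));
    document.newPage();
    document.add(new Paragraph("Page 5"));
    document.newPage();
    document.add(new Paragraph("Page 6"));
    // Custom entry with a PDF name value, stored in the dictionary of page 6.
    writer.addPageDictEntry(new PdfName("ITXT_PageMarker"), new PdfName("PageMarker"));
    document.newPage();
    document.add(new Paragraph("Page 7"));
    // Custom entry with a PDF number value, stored in the dictionary of page 7.
    writer.addPageDictEntry(new PdfName("ITXT_PageMarker"), new PdfNumber(7));
    document.newPage();
    document.add(new Paragraph("Page 8"));
    document.close();
}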
For the sake of this example, I added a PDF string for page 4, a PDF name for page 6, and a PDF number for page 7. If you look inside the PDF, these entries appear in the page dictionaries of those pages.
You can check for the presence of this custom key like this:
public void check(String filename) throws IOException {
    PdfReader reader = new PdfReader(filename);
    PdfDictionary pagedict;
    // Page numbers are 1-based, so the loop has to include getNumberOfPages().
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        pagedict = reader.getPageN(i);
        // Prints null for pages that do not carry the custom entry.
        System.out.println(pagedict.get(new PdfName("ITXT_PageMarker")));
    }
    reader.close();
}
For the eight-page document created above, the output of check() is:
null
null
null
Marker for page 4
null
/PageMarker
7
null
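The question also asks how many pages each file occupies. A minimal sketch of one way to derive that from the marker values, assuming (this is not part of the original answer) that the marker stored on the first page of each file is that file's path and that the remaining pages of the file carry no marker. The class PageCounter, the method countPagesPerFile, and the sample paths are hypothetical names used only for illustration; the input list stands for the values read from each page dictionary in page order.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PageCounter {

    // markers.get(i) is the marker read from page i + 1, or null if that page has none.
    // Assumption: a non-null marker is the path of the file that starts on that page.
    static Map<String, Integer> countPagesPerFile(List<String> markers) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        String currentFile = null;
        for (String marker : markers) {
            if (marker != null) {
                currentFile = marker; // a new file starts on this page
            }
            if (currentFile != null) {
                counts.merge(currentFile, 1, Integer::sum); // count this page for the current file
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> markers = Arrays.asList("a.java", null, null, "b.java", null);
        System.out.println(countPagesPerFile(markers)); // prints {a.java=3, b.java=2}
    }
}
```

Pages that come before the first marker are not attributed to any file in this sketch.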
Important: You can't just invent new keys for the PDF syntax apart from those defined in ISO 32000. However, you can create your own custom keys if you register a four-character code with ISO. For instance: Adobe registered ADBE, iText registered ITXT, and so on. If you introduce new custom keys, you should use the code registered with ISO as a prefix. For instance, at iText we can use ITXT_PageMarker, ITXT_custom, ITXT_Whatever, and so on. This rule prevents two different companies from introducing the same key with different meanings.
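Separately from the page-dictionary approach, the failure described in the question comes from the line break that text extraction inserts in the middle of the long path. A minimal sketch of a plain-Java workaround, assuming it is acceptable to ignore all whitespace when comparing: strip whitespace from both the extracted page text and the path before calling contains. The class WrappedPathSearch and its methods are hypothetical names, not part of iText.

```java
class WrappedPathSearch {

    // Removes every whitespace character (spaces, tabs, line breaks).
    static String stripWhitespace(String s) {
        return s.replaceAll("\\s+", "");
    }

    // True if needle occurs in pageText once whitespace is ignored on both sides.
    static boolean containsIgnoringWhitespace(String pageText, String needle) {
        return stripWhitespace(pageText).contains(stripWhitespace(needle));
    }

    public static void main(String[] args) {
        // Page text as extracted in the question: the path wraps onto a second line.
        String pageText = "1. PdfGeneratorForSoftwareRegistration/PdfGeneratorForSoftwareRegistration/\n"
                + "src/br/ufrn/pairg/pdfgenerator/LeitorArquivoTexto.java\n"
                + "package br.ufrn.pairg.pdfgenerator;\n";
        String path = "PdfGeneratorForSoftwareRegistration/PdfGeneratorForSoftwareRegistration/"
                + "src/br/ufrn/pairg/pdfgenerator/LeitorArquivoTexto.java";

        System.out.println(pageText.contains(path));                    // false: the line break splits the path
        System.out.println(containsIgnoringWhitespace(pageText, path)); // true
    }
}
```

Because all whitespace is dropped, two paths that differ only in spaces would be treated as equal, and the match also succeeds if the path merely appears inside the file's content rather than in the page header. The page-dictionary marker avoids both ambiguities.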