How to split a PDF file by result in Java PDFBox


Question

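I have a PDF file that contains 60 pages. Each page carries an invoice number; some numbers appear on only one page and others repeat across several pages. I am using Apache PDFBox.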

import java.io.File;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class PDFtest1 {
    public static void main(String[] args) {
        PDDocument pd = null;
        try {
            File input = new File("G:\\Sales.pdf");
            pd = PDDocument.load(input);
            PDFTextStripper stripper = new PDFTextStripper();

            // Extract the text of the whole document
            String text = stripper.getText(pd);

            // The dot after "No" is escaped so that it matches a literal dot
            Pattern p = Pattern.compile("Invoice No\\.\\s\\w\\d{10}");

            // The matcher scans the extracted text for the pattern
            Matcher m = p.matcher(text);

            while (m.find()) {
                // group() returns the text matched by the most recent find()
                System.out.println(m.group());
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Close the document even if extraction failed
            if (pd != null) {
                try {
                    pd.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
}

I am able to read all the invoice numbers using a Java regex.
The final result is as follows:

run:
Invoice No. D0000003010
Invoice No. D0000003011
Invoice No. D0000003011
Invoice No. D0000003011
Invoice No. D0000003011
Invoice No. D0000003012
Invoice No. D0000003012
Invoice No. D0000003012
Invoice No. D0000003013
Invoice No. D0000003013
Invoice No. D0000003014
Invoice No. D0000003014
Invoice No. D0000003015
Invoice No. D0000003016

I need to split the PDF according to the invoice numbers. For example, all pages with Invoice No. D0000003011 should be merged into a single PDF, and so on for every other number. How can I achieve this?

Recommended answer

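import java.io.File;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.util.PDFTextStripper;

// The class name is chosen for illustration only; the original answer showed just the two methods.
public class SplitByInvoiceNo
{
    // Compiled once. The dot is escaped, and the whitespace sits outside the
    // capturing group so that group(1) holds only the invoice number.
    private static final Pattern INVOICE_NO = Pattern.compile("Invoice No\\.\\s(\\w\\d{10})");

    public static void main(String[] args) throws IOException, COSVisitorException
    {
        File input = new File("G:\\Sales.pdf");

        PDDocument outputDocument = null;
        PDDocument inputDocument = PDDocument.loadNonSeq(input, null);
        PDFTextStripper stripper = new PDFTextStripper();
        String currentNo = null;
        for (int page = 1; page <= inputDocument.getNumberOfPages(); ++page)
        {
            // extract the text of this page only
            stripper.setStartPage(page);
            stripper.setEndPage(page);
            String text = stripper.getText(inputDocument);

            Matcher m = INVOICE_NO.matcher(text);
            String no = null;
            if (m.find())
            {
                no = m.group(1);
            }
            System.out.println("page: " + page + ", value: " + no);

            PDPage pdPage = (PDPage) inputDocument.getDocumentCatalog().getAllPages().get(page - 1);

            if (no != null && !no.equals(currentNo))
            {
                // a new invoice number starts: save the previous document, if any
                saveCloseCurrent(currentNo, outputDocument);
                // create new document
                outputDocument = new PDDocument();
                currentNo = no;
            }
            if (no == null && currentNo == null)
            {
                // no invoice number has been seen yet, so there is nothing to append to
                System.out.println("header page ??? " + page + " skipped");
                continue;
            }
            // append page to current document; a page without a number joins the current invoice
            outputDocument.importPage(pdPage);
        }
        saveCloseCurrent(currentNo, outputDocument);
        // close the input only after all output documents have been saved
        inputDocument.close();
    }

    private static void saveCloseCurrent(String currentNo, PDDocument outputDocument)
            throws IOException, COSVisitorException
    {
        // save to new output file
        if (currentNo != null)
        {
            File f = new File(currentNo + ".pdf");
            if (f.exists())
            {
                // refuse to overwrite: this also triggers if a number reappears later in the file
                System.err.println("File " + f + " exists?!");
                System.exit(-1);
            }
            outputDocument.save(f);
            outputDocument.close();
        }
    }
}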
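
The per-page matching step can be tried on its own, without PDFBox, to check the regular expression against sample text. The following is a minimal sketch, not part of the original answer: the class name InvoiceNumberExtractor and the method extract are hypothetical, and the sample strings are invented. It assumes the number format seen in the question's output (one word character followed by ten digits) and keeps the whitespace outside the capturing group so the captured value can be used directly as a file name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of PDFBox or the original answer.
final class InvoiceNumberExtractor {
    // Compiled once. "\\." matches a literal dot; the group captures only the number.
    private static final Pattern INVOICE_NO =
            Pattern.compile("Invoice No\\.\\s(\\w\\d{10})");

    /** Returns the first invoice number found in the text, or null if there is none. */
    static String extract(String pageText) {
        Matcher m = INVOICE_NO.matcher(pageText);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Invented sample page texts
        System.out.println(extract("Sales\nInvoice No. D0000003011\nTotal: 42")); // D0000003011
        System.out.println(extract("Terms and conditions"));                      // null
    }
}
```

Returning null for a page without a match mirrors how the answer's loop treats such pages.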

Caveats:

  • this has not been tested with your file (because I don't have it);
  • the code assumes that pages with the same invoice number are always adjacent;
  • your regular expression has been changed slightly;
  • make sure that the first and the last PDF files are correct, check a few others at random, and open them in different viewers if available;
  • verify that the total number of files is as expected;
  • the combined size of all output files will be larger than the source file; this is because of the font resources;
  • use PDFBox 1.8.10. Do not use PDFBox 0.7.3.jar at the same time!
  • error handling is very basic, you need to change it;
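
If pages with the same invoice number are not guaranteed to be adjacent, one option is to collect the page numbers per invoice first and build the output documents afterwards. The following is a minimal sketch of that planning step only, under stated assumptions: the class InvoicePageGrouper and its method group are hypothetical, the input list stands for the per-page results of the text extraction (null where no number was found), and Java 8 or later is assumed for computeIfAbsent. Pages without a number are attached to the most recently seen invoice, and pages before the first number are skipped, as in the answer's loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; it plans the split but does not touch any PDF.
final class InvoicePageGrouper {
    /**
     * perPage.get(i) is the invoice number found on page i + 1, or null if none.
     * Returns invoice number -> 1-based page numbers, in first-seen order.
     */
    static Map<String, List<Integer>> group(List<String> perPage) {
        Map<String, List<Integer>> pagesByInvoice = new LinkedHashMap<>();
        String current = null;
        for (int i = 0; i < perPage.size(); i++) {
            String no = perPage.get(i);
            if (no != null) {
                current = no;
            }
            if (current == null) {
                continue; // leading pages with no number are skipped
            }
            pagesByInvoice.computeIfAbsent(current, k -> new ArrayList<>()).add(i + 1);
        }
        return pagesByInvoice;
    }

    public static void main(String[] args) {
        // Invented example: D1 appears on pages 2 and 5, separated by D2
        List<String> perPage = Arrays.asList(null, "D1", "D2", null, "D1");
        System.out.println(group(perPage)); // {D1=[2, 5], D2=[3, 4]}
    }
}
```

Each map entry could then drive one output document by importing the listed pages in order. The entry count also gives the expected number of output files, which helps with the file-count check in the list above.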

Update 19.8.2015:

  • it now supports pages with no invoice number; these are appended to the current document.
