Reading the text of a PDF using PDFBox occasionally returns \r\n


Problem description

I’m currently using PDFBox to read the text of a set of PDFs that I’ve inherited.

I’m only interested in reading the text, not in making any changes to the files.

The code that works for most of the files is:

   File pdfFile = myPath.toFile();
   PDDocument document = PDDocument.load(pdfFile);
   Writer sw = new StringWriter();
   PDFTextStripper stripper = new PDFTextStripper();
   stripper.setStartPage(1);
   stripper.writeText(document, sw);
   String documentText = sw.toString();

For most files, I wind up with the text in the documentText variable.

But for 3 of the 24 files, documentText contains only line breaks: "\r\n" for the first file, "\r\n\r\n" for the second, and "\r\n\r\n\r\n" for the third. The three files are not consecutive; several good files sit between each of them.
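
A quick way to flag such files in a batch is to test whether the extracted string holds anything besides line breaks and other whitespace. The sketch below uses only the standard library; `ExtractedTextCheck`, `isEffectivelyEmpty`, and `countLineBreaks` are hypothetical helper names, not part of PDFBox. One plausible reading of the 1, 2, and 3 line breaks, which the question does not confirm, is that the stripper wrote one separator per page and found no extractable text on any page, as can happen with scanned, image-only PDFs.

```java
class ExtractedTextCheck {

    /** Returns true when the text is null or holds only whitespace such as "\r\n". */
    static boolean isEffectivelyEmpty(String text) {
        return text == null || text.trim().isEmpty();
    }

    /** Counts line breaks, treating "\r\n" as one break and a lone "\r" or "\n" as one each. */
    static int countLineBreaks(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\r') {
                count++;
                // Skip the '\n' of a Windows-style "\r\n" pair.
                if (i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    i++;
                }
            } else if (c == '\n') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Stand-in for the documentText of one of the problem files.
        String suspicious = "\r\n\r\n";
        System.out.println(isEffectivelyEmpty(suspicious)); // prints true
        System.out.println(countLineBreaks(suspicious));    // prints 2
    }
}
```

If the break count matches the page count of a flagged file, that would support the empty-pages reading; it would not prove it.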

The File is derived from a java.nio.file.Path. The WindowsFileAttribute that is part of the Path reports a size of 279K, so the file is not empty on disk.
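
The same on-disk check can be done in code before a file is handed to the parser, so that a whitespace-only result can be attributed to the PDF content and not to the file system. This is a minimal sketch using `java.nio.file.Files`; `PdfFileSanityCheck` and `looksReadable` are hypothetical names, and the temporary file stands in for a real PDF.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class PdfFileSanityCheck {

    /** Returns true when the path is a readable regular file with at least one byte. */
    static boolean looksReadable(Path path) {
        try {
            return Files.isRegularFile(path)
                    && Files.isReadable(path)
                    && Files.size(path) > 0;
        } catch (IOException e) {
            // Treat any I/O failure while probing the file as "not readable".
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stand-in for a real PDF: a temporary file holding a few bytes.
        Path sample = Files.createTempFile("sample", ".pdf");
        try {
            Files.write(sample, "%PDF-1.4".getBytes(StandardCharsets.US_ASCII));
            System.out.println(Files.size(sample) + " bytes, readable: " + looksReadable(sample));
        } finally {
            Files.deleteIfExists(sample);
        }
    }
}
```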

I can open the file and view the data, and it looks like the other files that my code reads successfully.

I’m using Java 8.0.121 and PDFBox 2.0.4 (the latest version, I believe).

Any suggestions? Is there a better way to read the text? (I’m not interested in the formatting or the fonts used, just the text.)

Thanks.

Recommended answer

Reading multiple PDF documents with PDFBox in Java

package readwordfile;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

/**
 * This is an example on how to extract words from PDF document
 *
 * @author saravanan
 */
public class GetWordsFromPDF extends PDFTextStripper {

    static List<String> words = new ArrayList<String>();

    public GetWordsFromPDF() throws IOException {
    }

    /**
     * @param args
     * @throws IOException If there is an error parsing the document.
     */
    public static void main(String[] args) throws IOException {
        String files;
//        FileWriter fs = new FileWriter("C:\\Users\\saravanan\\Desktop\\New Text Document (2).txt");  
//        FileInputStream fstream1 = new FileInputStream("C:\\Users\\saravanan\\Desktop\\New Text Document (2).txt");
//        DataInputStream in1 = new DataInputStream(fstream1);
//        BufferedReader br1 = new BufferedReader(new InputStreamReader(in1));
        String path = "C:\\Users\\saravanan\\Desktop\\New folder\\";  //local folder path name
        File folder = new File(path);

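        File[] listOfFiles = folder.listFiles();
        if (listOfFiles == null) {
            // listFiles() returns null when the path is not a readable directory
            throw new IOException("Cannot list files in " + path);
        }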

        for (int i = 0; i < listOfFiles.length; i++) {
            if (listOfFiles[i].isFile()) {
                files = listOfFiles[i].getName();
                if (files.endsWith(".pdf") || files.endsWith(".PDF")) {

                    // Reuse the folder path declared above instead of repeating the literal
                    String fileName1 = path + files;
                    System.out.print("\n\n" + files+"\n");
                    PDDocument document = null;
                    try {
                        document = PDDocument.load(new File(fileName1));
                        PDFTextStripper stripper = new GetWordsFromPDF();
                        stripper.setSortByPosition(true);
                        stripper.setStartPage(1); // page numbers are 1-based
                        stripper.setEndPage(document.getNumberOfPages());

                        Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
                        stripper.writeText(document, dummy);
                        int x = 0;

                        System.out.println("");
                        for (String word : words) {
                            if (word.startsWith("xxxxxx")) { //here you can give your pdf doc starting word 
                                x = 1;
                            }
                            if (x == 1) {
                                if (!(word.endsWith("YYYYYY"))) { //here you can give your pdf doc ending word 
                                    System.out.print(word + " ");
                                    // fs.write(word);                                   
                                } else {
                                    x = 0;
                                    break;
                                }
                            }
                        }
                    } finally {
                        if (document != null) {
                            document.close();
                            words.clear();
                        }
                    }
                }
            }
        }
    }

    /**
     * Override the default functionality of PDFTextStripper.writeString()
     *
     * @param str
     * @param textPositions
     * @throws java.io.IOException
     */
    @Override
    protected void writeString(String str, List<TextPosition> textPositions) throws IOException {
        String[] wordsInStream = str.split(getWordSeparator());
        if (wordsInStream != null) {
            for (String word : wordsInStream) {
                words.add(word);    //store the pdf content into the List
            }
        }
    }
}
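
The start-word/end-word loop in main can be read on its own as a small filter over the collected word list. The sketch below restates that logic with the standard library only, so it can be tried without a PDF; `WordWindowFilter`, `between`, and the sample markers "BEGIN" and "END" are hypothetical stand-ins for the "xxxxxx" and "YYYYYY" placeholders in the answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WordWindowFilter {

    /**
     * Returns the words from the first one starting with startPrefix up to,
     * but not including, the first later word ending with endSuffix.
     */
    static List<String> between(List<String> words, String startPrefix, String endSuffix) {
        List<String> selected = new ArrayList<>();
        boolean inside = false;
        for (String word : words) {
            if (word.startsWith(startPrefix)) {
                inside = true; // corresponds to x = 1 in the answer's loop
            }
            if (inside) {
                if (word.endsWith(endSuffix)) {
                    break; // the end marker itself is not kept
                }
                selected.add(word);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Header", "BEGIN", "alpha", "beta", "END", "Footer");
        System.out.println(between(words, "BEGIN", "END")); // prints [BEGIN, alpha, beta]
    }
}
```

Returning a list instead of printing inside the loop keeps the filter separate from the extraction, which makes it easier to check against known input.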
