Reading text of a PDF using PDFBox occasionally returns \r\n
Question
I'm currently using PDFBox to read the text of a set of PDFs that I've inherited.
I'm only interested in reading the text, not in making any changes to the files.
The code that works for most of the files is:
File pdfFile = myPath.toFile();
PDDocument document = PDDocument.load(pdfFile);
Writer sw = new StringWriter();
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage(1);
stripper.writeText(document, sw);
String documentText = sw.toString();
For most files, I wind up with the text in the documentText variable.
But for 3 of the 24 files, documentText contains only line breaks: "\r\n" for the first file, "\r\n\r\n" for the second, and "\r\n\r\n\r\n" for the third. The three files are not consecutive; several good files come between each of them.
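To find which files in a batch come back with nothing but line breaks, a check along these lines might help. It is only a sketch: the class name ExtractionCheck and both method names are made up for illustration and are not part of PDFBox, and the sample string stands in for the documentText value described above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: flags extraction results that contain nothing but line breaks. */
final class ExtractionCheck {
    // Matches one line break, treating "\r\n" as a single break
    private static final Pattern LINE_BREAK = Pattern.compile("\r\n|\r|\n");

    /** True when the text is empty or consists only of whitespace such as "\r\n". */
    static boolean isWhitespaceOnly(String text) {
        return text.trim().isEmpty();
    }

    /** Counts the line breaks in the text. */
    static int countLineBreaks(String text) {
        Matcher matcher = LINE_BREAK.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Stand-in for the output of the second problem file (an assumption for this demo)
        String documentText = "\r\n\r\n";
        if (isWhitespaceOnly(documentText)) {
            System.out.println("No text extracted; line breaks: " + countLineBreaks(documentText));
        }
    }
}
```

Comparing the line-break count with the document's page count could be a useful next step. If the two happen to match, that may hint that the pages were visited but no text was found on them, though nothing in the question confirms this.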
The File is derived from a java.nio.file.Path. The WindowsFileAttribute that is part of the Path reports a size of 279K, so the file is not empty on disk.
I can open the file and view the data, and it looks like the other files that my code reads.
I'm using Java 8.0.121 and PDFBox 2.0.4 (the latest version, I believe).
Any suggestions? Is there a better way to read the text? (I'm not interested in the formatting or the fonts used, just the text.)
Thanks.
Recommended answer
Reading multiple PDF documents with PDFBox in Java
package readwordfile;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

/**
 * This is an example of how to extract words from a PDF document.
 *
 * @author saravanan
 */
public class GetWordsFromPDF extends PDFTextStripper {

    // Words collected by writeString() for the document currently being processed
    static List<String> words = new ArrayList<String>();

    public GetWordsFromPDF() throws IOException {
    }

    /**
     * @param args not used
     * @throws IOException If there is an error parsing the document.
     */
    public static void main(String[] args) throws IOException {
        String path = "C:\\Users\\saravanan\\Desktop\\New folder\\"; // local folder path name
        File folder = new File(path);
        File[] listOfFiles = folder.listFiles();
        if (listOfFiles == null) {
            return; // the path does not exist or is not a directory
        }
        for (File file : listOfFiles) {
            String name = file.getName();
            if (file.isFile() && (name.endsWith(".pdf") || name.endsWith(".PDF"))) {
                System.out.print("\n\n" + name + "\n");
                // try-with-resources closes the document even if extraction fails
                try (PDDocument document = PDDocument.load(file)) {
                    PDFTextStripper stripper = new GetWordsFromPDF();
                    stripper.setSortByPosition(true);
                    stripper.setStartPage(1); // page numbers are 1-based
                    stripper.setEndPage(document.getNumberOfPages());
                    // The stripper's own output is discarded; writeString() fills 'words' instead
                    Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
                    stripper.writeText(document, dummy);
                    boolean printing = false;
                    System.out.println("");
                    for (String word : words) {
                        if (word.startsWith("xxxxxx")) { // here you can give your pdf doc starting word
                            printing = true;
                        }
                        if (printing) {
                            if (word.endsWith("YYYYYY")) { // here you can give your pdf doc ending word
                                break;
                            }
                            System.out.print(word + " ");
                        }
                    }
                } finally {
                    words.clear(); // start the next document with an empty list
                }
            }
        }
    }

    /**
     * Override the default functionality of PDFTextStripper.writeString()
     *
     * @param str the text to write
     * @param textPositions the positions of the characters in str
     * @throws java.io.IOException
     */
    @Override
    protected void writeString(String str, List<TextPosition> textPositions) throws IOException {
        // Split each chunk of text on the word separator and store the pieces in the list
        for (String word : str.split(getWordSeparator())) {
            words.add(word);
        }
    }
}
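The start-word/end-word loop in main() can be tried without any PDF by pulling it into a small function. This is a sketch only: the class name WordRangeFilter, the method between, and the marker strings "BEGIN" and "END" are hypothetical and do not appear in the answer's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper mirroring the start/end marker loop in main(). */
final class WordRangeFilter {

    /**
     * Returns the words from the first one that starts with startPrefix up to,
     * but not including, the first later word that ends with endSuffix.
     */
    static List<String> between(List<String> words, String startPrefix, String endSuffix) {
        List<String> selected = new ArrayList<>();
        boolean inside = false;
        for (String word : words) {
            if (!inside && word.startsWith(startPrefix)) {
                inside = true; // the start marker itself is kept, as in the loop above
            }
            if (inside) {
                if (word.endsWith(endSuffix)) {
                    break; // stop at the end marker without keeping it
                }
                selected.add(word);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "BEGIN", "b", "c", "END", "d");
        System.out.println(between(words, "BEGIN", "END"));
    }
}
```

Returning a list instead of printing directly makes the selection easy to check on sample input before running it against real documents.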