English text extracted using itextpdf is not understandable

Problem description

I'm trying to extract English text from a PDF and print it to the console. The extraction is done with the PdfTextExtractor class from the itextpdf API. The text I get is unintelligible, possibly because of some language issue. My goal is to find a particular piece of text within a PDF and replace it with another string, so I started by parsing the file to find the string. The following code snippet is my string extractor:

Document document = new Document();

PdfWriter writer = PdfWriter.getInstance(document,
    new FileOutputStream(OUTPUTFILE));
document.open();
PdfReader reader = new PdfReader(input);
int n = reader.getNumberOfPages();
PdfImportedPage page;
// Go through all pages
for (int i = 1; i <= n; i++) {

    String str=PdfTextExtractor.getTextFromPage(reader, i); 
    System.out.println(str);  

}
document.close();

However, the output I get on the console is unintelligible, even though the text in the PDF is in English.

Output:

t cotenn dna o mntoafinir yales r ni et h layhcsip Amgteu end y Retila m eysts w tih eth erss p wlli e erefcern emsyst o f et h se. ru I n tioi, dnda etseh orpvedi eddda e ulav o t taw h s i oelbssip hwti se vdcie ollaw na s tiouquibu cacess o t latoutenxc e rpap dna t ilagid ottennc olae n ewnh ey th krwo tofoi. nmirna ni soitaoli n mor f chea e. roth s iTh s i a cel ra csea ewerh " eth lweoh is ermo nath eth ms u fo sti

rtasp.

Can anybody suggest a way to extract the text in readable English, exactly as it appears in the source PDF? Any help will be highly appreciated.

Recommended answer

If you want the text to be ordered by its position on the page, you need to pass a specific extraction strategy, such as the LocationTextExtractionStrategy:

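for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    // Use a fresh strategy for each page so the pages' text chunks are not mixed
    String str = PdfTextExtractor.getTextFromPage(reader, i, new LocationTextExtractionStrategy());
    System.out.println(str);
}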

The LocationTextExtractionStrategy sometimes produces odd sentences, specifically when the letters "dance" on the page (that is, when glyphs on the same visual line have different baselines). In that case, you can try the SimpleTextExtractionStrategy, which returns the text in the order in which it appears in the PDF content stream.
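To illustrate why the order matters, here is a minimal, self-contained sketch of the two ordering ideas. It is a toy model and not iText's implementation: the TextChunk and ChunkOrderDemo classes, their fields, and the sample coordinates are all hypothetical names invented for this illustration. It only shows that text drawn in scrambled order in the content stream becomes readable once the chunks are sorted by position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical toy type: a piece of text plus the point where the PDF draws it.
final class TextChunk {
    final String text;
    final double x; // horizontal start position
    final double y; // baseline; a larger y is higher on the page

    TextChunk(String text, double x, double y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

final class ChunkOrderDemo {
    // Concatenate chunks exactly as they occur in the content stream
    // (the idea behind a "simple" strategy).
    static String inStreamOrder(List<TextChunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (TextChunk c : chunks) {
            sb.append(c.text);
        }
        return sb.toString();
    }

    // Sort top-to-bottom, then left-to-right, and start a new line whenever
    // the baseline changes (the idea behind a location-based strategy).
    static String inPageOrder(List<TextChunk> chunks) {
        List<TextChunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator
                .comparingDouble((TextChunk c) -> -c.y)
                .thenComparingDouble(c -> c.x));
        StringBuilder sb = new StringBuilder();
        TextChunk previous = null;
        for (TextChunk c : sorted) {
            if (previous != null && previous.y != c.y) {
                sb.append('\n');
            }
            sb.append(c.text);
            previous = c;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed sample data: two lines whose chunks are drawn out of order.
        List<TextChunk> chunks = Arrays.asList(
                new TextChunk("world", 50, 700),
                new TextChunk("Hello ", 10, 700),
                new TextChunk("line", 60, 680),
                new TextChunk("Second ", 10, 680));
        System.out.println(inStreamOrder(chunks)); // worldHello lineSecond
        System.out.println(inPageOrder(chunks));   // Hello world, then Second line
    }
}
```

In this toy model, two chunks belong to the same line only if their baselines are exactly equal, so a chunk at y = 700.5 would land on a line of its own. That is a simplified picture of the "dancing letters" problem described above, and of why falling back to content-stream order can sometimes give better results.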
