English text extracted using itextpdf is not understandable
Problem description
I'm trying to extract English text from a PDF and print it to the console. Extraction is done through the itextpdf API using the PdfTextExtractor class. The text I'm getting is not understandable. Maybe I'm facing some language issue. My intent is to find a particular text within a PDF and replace it with some other string. I started by parsing the file to find the string. The following code snippet represents my string extractor:
    Document document = new Document();
    PdfWriter writer = PdfWriter.getInstance(document,
            new FileOutputStream(OUTPUTFILE));
    document.open();
    PdfReader reader = new PdfReader(input);
    int n = reader.getNumberOfPages();
    PdfImportedPage page;
    // Go through all pages
    for (int i = 1; i <= n; i++) {
        String str = PdfTextExtractor.getTextFromPage(reader, i);
        System.out.println(str);
    }
    document.close();
However, the output I get on the console is not understandable, even though the text in the PDF is in English.
Output:
t cotenn dna o mntoafinir yales r ni et h layhcsip Amgteu end y Retila m eysts w tih eth erss p wlli e erefcern emsyst o f et h se. ru I n tioi, dnda etseh orpvedi eddda e ulav o t taw h s i oelbssip hwti se vdcie ollaw na s tiouquibu cacess o t latoutenxc e rpap dna t ilagid ottennc olae n ewnh ey th krwo tofoi. nmirna ni soitaoli n mor f chea e. roth s iTh s i a cel ra csea ewerh " eth lweoh is ermo nath eth ms u fo sti
rtasp.
Can anybody suggest a possible solution for getting the text in English exactly as it appears in the source PDF? Any sort of help will be highly appreciated.
Recommended answer
If you want the text to be ordered based on its position on the page, you need to introduce a specific strategy, such as the LocationTextExtractionStrategy:
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        String str = PdfTextExtractor.getTextFromPage(reader, i, new LocationTextExtractionStrategy());
    }
The LocationTextExtractionStrategy sometimes results in odd sentences, more specifically if the letters 'dance' on the page (the baseline of the glyphs differs for text on the same line). In that case, you can try the SimpleTextExtractionStrategy, which will return the text in the order in which it appears in the PDF syntax content stream.
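To make the difference between the two orderings concrete, here is a minimal, library-free sketch. It is not iText's implementation: the ReadingOrderSketch class, its Chunk type, and the coordinates are all hypothetical, and the sort simply compares exact baseline positions, which is a deliberate simplification.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; none of these types come from iText.
class ReadingOrderSketch {

    // A text fragment and the start of its baseline on the page.
    static final class Chunk {
        final String text;
        final double x;
        final double y; // PDF coordinates: y grows upward

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Content-stream order: concatenate the chunks as they are stored.
    static String inStreamOrder(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            sb.append(c.text);
        }
        return sb.toString();
    }

    // Position order: highest baseline first, then left to right.
    static String inPositionOrder(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> -c.y)
                .thenComparingDouble(c -> c.x));
        return inStreamOrder(sorted);
    }

    public static void main(String[] args) {
        // Assumed input: the PDF producer stored fragments out of reading order.
        List<Chunk> page = List.of(
                new Chunk("world ", 60, 700),
                new Chunk("line", 60, 680),
                new Chunk("Hello ", 10, 700),
                new Chunk("second ", 10, 680));
        System.out.println(inStreamOrder(page));
        System.out.println(inPositionOrder(page));
    }
}
```

With these assumed chunks, the stream order yields scrambled text while the position order reads naturally. The same sketch also hints at the 'dancing' problem: if "world" sat on a baseline of 700.5 instead of 700, the exact comparison would sort it ahead of "Hello ", which is the kind of case where falling back to content-stream order may give the better result.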