Read localized PDF file using iTextSharp
Problem description
I am trying to read a PDF file using iTextSharp. The issue is that when I read a PDF file in a language other than English (Hindi or Arabic, for example), the extracted words are not correct.
I am wondering: should I install the Hindi or Arabic font on my system, or do I need to do something with the encoding?
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
text.Append(currentText);
Edit:
Sample PDF as Image:
Extracted Text:
uxj ikfydk ifj"kn fuokZpd ukekoyh& 2011 i`"B la[;k % 1 1 1 1& & & & ftys dk uke ftys dk uke ftys dk uke ftys dk uke % % % % 0701-ò¶âã£ûæ– 2 2 2 2& & & & fudk fudk fudk fudk; ; ; ; dk uke dk uke dk uke dk uke % % % % 1-¢âî™ 3 3 3 3& & & & okMZ la okMZ la okMZ la okMZ la[ [ [ [; ; ; ;k o uke k o uke k o uke k o uke % % % % 1-¯â"¯â™®â£û¶âû §âîºâã®â£û¶âû Õô¯âû®â£û¶âû 4 4 4 4& & & & Hkkx la Hkkx la Hkkx la Hkkx la[ [ [ [; ; ; ;k k k k % % % %
Solution
Do not apply any encoding conversion to the extracted text, because you do not know which encoding the PDF file uses.
I think this will work:
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
text = text + currentText; // do what you want with the text
MessageBox.Show(text);
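For context, the snippet above could sit inside a page loop roughly like the one below. This is only a sketch, assuming iTextSharp 5.x; the class name `PdfTextReader`, the method `ExtractAllText`, and the `path` parameter are hypothetical names introduced for illustration and do not come from the question.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfTextReader
{
    // Hypothetical helper: returns the extracted text of every page, one page after another.
    public static string ExtractAllText(string path)
    {
        var text = new StringBuilder();
        var pdfReader = new PdfReader(path);
        try
        {
            // iTextSharp page numbers start at 1.
            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                // Use a fresh strategy per page so earlier pages are not repeated in the result.
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

                // The result is already a .NET string; append it without any encoding conversion.
                text.AppendLine(currentText);
            }
        }
        finally
        {
            pdfReader.Close();
        }
        return text.ToString();
    }
}
```

Calling `PdfTextReader.ExtractAllText("sample.pdf")` (the file name is a placeholder) would then return the text of the whole document as a single string.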
If it still does not work, then you have to install the specific font.
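To illustrate why the conversion in the question can only hurt, here is a small sketch that needs nothing but `System.Text`. It assumes `Encoding.ASCII` as a stand-in for `Encoding.Default`, whose actual value depends on the runtime and system settings; `EncodingRoundTrip` and `RoundTrip` are hypothetical names. Any character the legacy encoding cannot represent is replaced during `GetBytes`, so the round trip loses it.

```csharp
using System;
using System.Text;

public static class EncodingRoundTrip
{
    // Same shape as the conversion in the question, with the source encoding passed in.
    public static string RoundTrip(string s, Encoding legacy)
    {
        // Characters that the legacy encoding cannot represent are replaced with '?' here.
        byte[] legacyBytes = legacy.GetBytes(s);
        byte[] utf8Bytes = Encoding.Convert(legacy, Encoding.UTF8, legacyBytes);
        return Encoding.UTF8.GetString(utf8Bytes);
    }

    public static void Main()
    {
        string hindi = "\u0928\u093E\u092E"; // three Devanagari characters
        string result = RoundTrip(hindi, Encoding.ASCII);

        Console.WriteLine(result == hindi); // False
        Console.WriteLine(result);          // ???
    }
}
```

Text made only of characters the legacy encoding supports comes back unchanged, so at best the conversion does nothing and at worst it destroys the non-Latin text.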