How do I read text from a PDF with embedded fonts?


Problem description



Hi,
I need to extract text from a PDF that uses custom fonts, but these fonts prevent me from copying/pasting, searching, or extracting the text in a readable form with the iText library. The resulting text is either spaces or non-human-readable characters.

The PDF info is:

Author: User
Creator: Compart Docponent API
Producer: Compart MFFPDF I/O Filter 2013-03-09 00:51:11
CreationDate: 04/21/16 11:26:59
ModDate: 06/09/16 10:02:16
Tagged: no
Form: none
Pages: 6
Encrypted: no
Page size: 595.2 x 841.92 pts (A4) (rotated 0 degrees)
File size: 312703 bytes
Optimized: yes
PDF version: 1.4

The PDF font info (from running the pdffonts command-line tool, the same for each font) is: name: [none]; type: [Type 3]; emb: [yes]; sub: [no]; uni: [yes]

So the PDF seems to have a ToUnicode map, but that is not enough, even with the following code.

How can I read the text in a readable form?

Thanks in advance.

G.G.

What I have tried:

pdftext.Text = null;
StringBuilder text = new StringBuilder();
PdfReader pdfReader = new PdfReader(filename);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
    text.Append(System.Environment.NewLine);
    text.Append("\n Page Number:" + page);
    text.Append(System.Environment.NewLine);
    currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
    text.Append(currentText);
}
pdftext.Text += text.ToString();
pdfReader.Close();

Solution

One apparent problem is:

currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));


It makes no sense at all.

Try removing it.

—SA
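
To illustrate why that line cannot help, here is a minimal sketch that uses only System.Text and no iText. The string returned by GetTextFromPage is already a .NET Unicode string, so pushing it through a legacy code page and back is at best a no-op and at worst destroys characters. The names EncodingRoundTripDemo and RoundTrip are hypothetical, and ISO-8859-1 is only an assumed stand-in for Encoding.Default: on .NET Framework Encoding.Default is the machine's ANSI code page, while on .NET Core and later it is UTF-8, in which case the round trip changes nothing.

```csharp
using System;
using System.Text;

public static class EncodingRoundTripDemo
{
    // Same shape as the conversion in the question, with the legacy
    // encoding passed in so the result does not depend on the machine.
    public static string RoundTrip(string s, Encoding legacy)
    {
        // Lossy step: characters the legacy code page cannot represent are replaced.
        byte[] legacyBytes = legacy.GetBytes(s);
        byte[] utf8Bytes = Encoding.Convert(legacy, Encoding.UTF8, legacyBytes);
        return Encoding.UTF8.GetString(utf8Bytes);
    }

    public static void Main()
    {
        // Assumption: ISO-8859-1 stands in for an ANSI Encoding.Default.
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");

        string plain = "Page 1";
        string mixed = "\u03A9 10 \u20AC"; // hypothetical text outside ISO-8859-1

        // Plain ASCII survives, so the conversion adds nothing.
        Console.WriteLine(RoundTrip(plain, latin1) == plain);
        // Characters outside the code page do not come back unchanged.
        Console.WriteLine(RoundTrip(mixed, latin1) == mixed);
    }
}
```

In the loop from the question, this corresponds to appending currentText directly after the GetTextFromPage call. Whether that alone yields readable text for these Type 3 fonts is not established here; it only removes one step that can make the output worse.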

