How do I read text from a PDF with embedded fonts?

Views: 186 Published: 2019/6/11 9:43:24 Tag: itextsharp


Problem description


Hi,
I need to extract text from a PDF that uses custom fonts, but those fonts prevent copying/pasting, searching, or extracting the text in a readable form with the iText library. The resulting text is spaces or non-human-readable characters.

The PDF properties are:

Author: User
Creator: Compart Docponent API
Producer: Compart MFFPDF I/O Filter 2013-03-09 00:51:11
CreationDate: 04/21/16 11:26:59
ModDate: 06/09/16 10:02:16
Tagged: no
Form: none
Pages: 6
Encrypted: no
Page size: 595.2 x 841.92 pts (A4) (rotated 0 degrees)
File size: 312703 bytes
Optimized: yes
PDF version: 1.4

The PDF font info (from running the pdffonts command line for each font) is: name: [none]; type: [Type 3]; emb: [yes]; sub: [no]; uni: [yes]

So the PDF seems to have a ToUnicode map, but that is not enough, even with the code below.

How can I read the text in a readable form?

Thanks in advance

G.G.

What I have tried:

pdftext.Text = null;
StringBuilder text = new StringBuilder();
PdfReader pdfReader = new PdfReader(filename);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
    text.Append(System.Environment.NewLine);
    text.Append("\n Page Number:" + page);
    text.Append(System.Environment.NewLine);
    currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
    text.Append(currentText);
}
pdftext.Text += text.ToString();
pdfReader.Close();

Solution

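One apparent problem is this line:
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
It makes no sense at all. currentText is already a .NET string, that is, Unicode text. Encoding it with Encoding.Default and decoding it again can at best return the same string, and at worst replaces every character the default code page cannot represent.

Try removing it.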

—SA
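
To illustrate why that line cannot help, here is a minimal sketch that runs the same conversion chain on plain strings, without iTextSharp. The class and method names (EncodingRoundTripDemo, RoundTrip) are hypothetical, and Encoding.ASCII is an assumption standing in for Encoding.Default so that the result is the same on every machine; the actual Encoding.Default depends on the runtime and system settings (the ANSI code page on .NET Framework, UTF-8 on .NET Core and later).

```csharp
using System;
using System.Text;

public static class EncodingRoundTripDemo
{
    // Same conversion chain as in the question, with the legacy
    // encoding passed in instead of hard-coded as Encoding.Default.
    public static string RoundTrip(string text, Encoding legacy)
    {
        // string -> bytes in the legacy encoding (unmappable characters are replaced here)
        byte[] legacyBytes = legacy.GetBytes(text);
        // legacy bytes -> UTF-8 bytes
        byte[] utf8Bytes = Encoding.Convert(legacy, Encoding.UTF8, legacyBytes);
        // UTF-8 bytes -> string again
        return Encoding.UTF8.GetString(utf8Bytes);
    }

    public static void Main()
    {
        // Assumption: Encoding.ASCII stands in for Encoding.Default.
        Console.WriteLine(RoundTrip("Page 1", Encoding.ASCII));          // prints "Page 1"
        Console.WriteLine(RoundTrip("Prix: 10 \u20AC", Encoding.ASCII)); // prints "Prix: 10 ?"
    }
}
```

Text that fits the legacy encoding comes back unchanged, and anything else is replaced, so the round trip never recovers characters that the extraction step did not produce. With the line removed, the loop would append the string returned by PdfTextExtractor.GetTextFromPage directly.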
