Extracting text from a garbled PDF

Views: 40 Published: 2021/12/14 16:07:16 Tags: pdf file-format text-analysis


Problem description

I have a PDF file with valuable textual information.

The problem is that I cannot extract the text; all I get is a bunch of garbled symbols. The same happens if I copy the text from the PDF reader and paste it into a text file. Even File -> Save as text in Acrobat Reader fails.

I have tried every tool I could get my hands on, and the result is always the same. I believe this has something to do with font embedding, but I don't know what exactly.

My questions:

What is the culprit behind this weird text garbling?
How can I extract the text content from the PDF (programmatically, with a tool, by manipulating the bits directly, etc.)?
How can I fix the PDF so that it does not garble on copy?

Solution

I asked a lot of people for help, and OCR turned out to be the only solution to this problem.
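For illustration, here is a minimal Python sketch of the OCR route. It is not taken from the original answer. It assumes the third-party packages pdf2image (which needs Poppler installed) and pytesseract (which needs the Tesseract binary) are available. The function names `render_pages`, `recognize`, and `ocr_pdf`, and the file name `garbled.pdf`, are made up for this example. The idea is that OCR reads the rendered page images rather than the PDF's text layer, so whatever is wrong with that layer does not matter.

```python
from typing import Callable, Iterable


def render_pages(pdf_path: str, dpi: int = 300) -> Iterable:
    """Rasterize every page of the PDF to an image.

    Assumption: pdf2image and Poppler are installed. The import is done
    lazily so the rest of the module loads without them.
    """
    from pdf2image import convert_from_path
    return convert_from_path(pdf_path, dpi=dpi)


def recognize(image, lang: str = "eng") -> str:
    """Run OCR on a single page image.

    Assumption: pytesseract and the Tesseract binary are installed.
    """
    import pytesseract
    return pytesseract.image_to_string(image, lang=lang)


def ocr_pdf(
    pdf_path: str,
    render: Callable[[str], Iterable] = render_pages,
    recognize_page: Callable[[object], str] = recognize,
) -> str:
    """Return the OCR text of all pages, separated by blank lines."""
    pages = [recognize_page(image).strip() for image in render(pdf_path)]
    return "\n\n".join(pages)


# Hypothetical usage:
# text = ocr_pdf("garbled.pdf")
```

The rendering and recognition steps are passed in as parameters so that either one can be swapped for another renderer or OCR engine. Recognition quality depends on the rendering resolution and on picking the right Tesseract language, so the output should be proofread.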
