Where can I find a mapping of Identity-H encoded characters to ASCII or Unicode characters?


Question

I have a PDF generated by a third party. I am trying to get the text out of it, but neither pdf2text nor copying and pasting produces readable text. After a little digging in the output of both, I found that each character on the screen is represented by three bytes. For example, "A" is the bytes ef, 81, and 81. The PDF's metadata says it is encoded in Identity-H, so I assume what I am seeing is a set of characters encoded in Identity-H. I have a partial mapping based on the documents I already have, but I want to build a more complete one. To do that I need something like an ASCII table for Identity-H.

Recommended answer

It is not always possible to extract text from a PDF, especially when the /ToUnicode map is missing, as mkl pointed out.

If you cannot copy and paste the correct text from Acrobat, you have very little chance of extracting the text yourself. If Acrobat cannot extract it, it is very unlikely that any other tool can extract it correctly.

If you create an encoding table by hand, you could use it to remap the extracted characters to their correct values, but this will most likely work only for this one document.
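As a rough illustration of such a hand-built table, here is a minimal Python sketch. It relies on one observation: the three bytes ef 81 81 from the question are valid UTF-8 for the single code point U+F041, which lies in the Unicode Private Use Area. The table name `MANUAL_TABLE` and the function `remap` are hypothetical, and only the entry for "A" comes from the question; every other entry would have to be filled in by inspecting the document.

```python
# Hypothetical hand-built table: extracted code point -> intended character.
# Only the "A" entry is taken from the question; add more by inspection.
MANUAL_TABLE = {
    0xF041: "A",
}

def remap(extracted, table):
    """Apply a manual table to text extracted from the PDF.

    Characters found in the table are replaced. Private Use Area
    characters (U+E000..U+F8FF) that are not in the table become
    U+FFFD so that gaps in the table stay visible. Everything else
    passes through unchanged.
    """
    out = []
    for ch in extracted:
        cp = ord(ch)
        if cp in table:
            out.append(table[cp])
        elif 0xE000 <= cp <= 0xF8FF:
            out.append("\ufffd")
        else:
            out.append(ch)
    return "".join(out)

# The bytes reported in the question, decoded as UTF-8.
raw = bytes.fromhex("ef8181")
text = raw.decode("utf-8")
print(hex(ord(text)))             # 0xf041
print(remap(text, MANUAL_TABLE))  # A
```

A table like this is tied to the fonts of the document it was built from, so it should not be expected to work on other files.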

Often this is done on purpose. I have seen documents that randomly remap characters differently for each font in the document. It is used as a form of obfuscation, and the only real way to extract text from these PDFs is to resort to OCR. Many financial reports use this kind of trick to stop people from extracting their data.

Also, Identity-H is just a 1:1 mapping for all character codes from 0x0000 to 0xFFFF; that is, Identity is an identity mapping.
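To make the identity mapping concrete, the sketch below splits a hex string, such as one might find as a text-showing operand in a content stream, into two-byte big-endian codes. Under Identity-H each code is used unchanged as the character identifier (CID). The function name `identity_h_cids` and the sample string are made up for illustration.

```python
def identity_h_cids(hex_string):
    """Split a hex string into 2-byte big-endian codes.

    With Identity-H the code and the CID are the same number, so no
    lookup table is involved. The result identifies glyphs in the
    font; it says nothing about which Unicode characters they are.
    """
    data = bytes.fromhex(hex_string)
    if len(data) % 2 != 0:
        raise ValueError("Identity-H codes are two bytes each")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

# Hypothetical operand: three 2-byte codes.
print(identity_h_cids("002400250026"))  # [36, 37, 38]
```

This is why there is no ASCII-style table for Identity-H: the numbers only gain meaning through the font and, when present, the /ToUnicode map.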

Your real problem is the missing /ToUnicode entry in this PDF. I suspect there is also an embedded CMap in your PDF, which would explain why there could be 3 bytes per character.
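For comparison, when a font does carry a /ToUnicode CMap, it typically contains `beginbfchar` ... `endbfchar` sections that pair a source code with a UTF-16BE destination. The following sketch parses only those sections from an already decompressed CMap stream. The sample text and the name `parse_bfchar` are invented for illustration, and `bfrange` sections are deliberately not handled.

```python
import re

# Invented sample of what a decompressed /ToUnicode stream might contain.
SAMPLE_CMAP = """
2 beginbfchar
<0024> <0041>
<0025> <0042>
endbfchar
"""

def parse_bfchar(cmap_text):
    """Return {source code: Unicode string} from bfchar sections only."""
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block):
            # Destination values are UTF-16BE encoded.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

print(parse_bfchar(SAMPLE_CMAP))  # {36: 'A', 37: 'B'}
```

Without such a map in the file, the information needed to get from codes to characters has to come from elsewhere, such as a manual table or OCR.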
