Read Japanese characters in a PDF file


Problem description



    I have the following command:

    [<0e0f0a52030d030e0ce5030f0744030f>10<030d>10<0cd4>]TJ

    I know that it hides Japanese in the hex sections, because that is the only text in the PDF, and this line is in the only content stream of the lone page in the PDF file.

    The problem is that no matter how I try to decode these hex strings I end up with gibberish. I've decoded the hex strings to bytes and have literally tried applying every charset I could find, and still I get gibberish.

    (Perhaps I was desperate, because I knew it would probably not work either.) I've also tried it the other way around: testing this on Android, I'm able to import the PDF's Japanese text (loading it from a resource), and while debugging I can see the REAL Japanese text in the value of the String instance. Again I tried applying every charset, only to find 4-6 hex chars that match anything in the entire file, but again... nothing.

    I actually don't need the glyphs, I would settle for the correct text...

    Could it be that the text itself is encoded by something other than a charset encoding? Can anyone point me in the right direction?

    === UPDATE ===

    OK, so I figured out that there is an extra "encryption", Identity-H, and I've read that you need a /ToUnicode map, which I cannot seem to find in the file.
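
    As far as I can tell, Identity-H is not really encryption: it is a predefined CMap under which every 2 bytes of such a hex string are read directly as a character ID (CID) in the descendant font's character collection (Adobe-Japan1 in this file). A minimal Java sketch that splits the string from the TJ operator above into CIDs; getting from CIDs to actual text still needs a CMap, see the answer below:

    public class CidSplitter {
        public static void main(String[] args) {
            // Hex string copied from the TJ operator quoted above.
            String hex = "0e0f0a52030d030e0ce5030f0744030f";
            // Under Identity-H every 2 bytes (4 hex digits) form one CID.
            for (int i = 0; i + 4 <= hex.length(); i += 4) {
                int cid = Integer.parseInt(hex.substring(i, i + 4), 16);
                System.out.println("CID " + cid);
            }
        }
    }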

    What drives me nuts is that other PDF viewers can show the PDF, and I cannot figure out how!

    Again, any bone would be nice... hell I'll go for scraps :)

    Thanks,

    Adam.

    For some file context:

    ...
    10 0 obj
        << 
        /Type /Page 
        /Parent 7 0 R 
        /Resources 11 0 R 
        /Contents 16 0 R 
        /MediaBox [ 0 0 595 842 ] 
        /CropBox [ 0 0 595 842 ] 
        /Rotate 0 
        >> 
    endobj
    11 0 obj
        << 
        /ProcSet [ /PDF /Text ] 
        /Font << /TT2 13 0 R /G1 12 0 R >> 
        /ExtGState << /GS1 19 0 R >> 
        /ColorSpace << /Cs6 15 0 R >> 
        >> 
    endobj
    12 0 obj
        << 
        /Type /Font 
        /Subtype /Type0 
        /BaseFont /Ryumin-Light-Identity-H 
        /Encoding /Identity-H 
        /DescendantFonts [ 18 0 R ] 
        >> 
    endobj
    13 0 obj
        << 
        /Type /Font 
        /Subtype /TrueType 
        /FirstChar 32 
        /LastChar 32 
        /Widths [ 278 ] 
        /Encoding /WinAnsiEncoding 
        /BaseFont /Century 
        /FontDescriptor 14 0 R 
        >> 
    endobj
    14 0 obj
        << 
        /Type /FontDescriptor 
        /Ascent 985 
        /CapHeight 0 
        /Descent -216 
        /Flags 34 
        /FontBBox [ -165 -307 1246 1201 ] 
        /FontName /Century 
        /ItalicAngle 0 
        /StemV 0 
        >> 
    endobj
    15 0 obj
        [ 
        /ICCBased 20 0 R 
        ]
    endobj
    16 0 obj
        << /Length 2221 /Filter /FlateDecode >> 
            stream
            ...
                    [<0e0f0a52030d030e0ce5030f0744030f>10<030d>10<0cd4>]TJ
            ...
                    <00e700e700e700e700e700e700e700e700e700e700e700e700e700e700e700e700e700e700e700e700e700e700e700e7>Tj
            ...
                    <030e030d0a48064403740353035a039408030ebd074807c1036e0358039304e10c8802a2074807c10cd40e8a030e030d02a303770a2a0a100374036d034d036f00e7>Tj
            ...
        endstream
    endobj
    17 0 obj
        << 
        /Type /FontDescriptor 
        /Ascent 723 
        /CapHeight 709 
        /Descent -241 
        /Flags 6 
        /FontBBox [ -170 -331 1024 903 ] 
        /FontName /Ryumin-Light 
        /ItalicAngle 0 
        /StemV 69 
        /XHeight 450 
        /Style << /Panose <010502020300000000000000>>> 
        >> 
    endobj
    18 0 obj
        << 
        /Type /Font 
        /Subtype /CIDFontType0 
        /BaseFont /Ryumin-Light 
        /FontDescriptor 17 0 R 
        /CIDSystemInfo << /Registry (Adobe)/Ordering (Japan1)/Supplement 2 >> 
        /DW 1000 
        /W [ 231 [ 500 ] ] 
        >> 
    endobj
    19 0 obj
        << 
        /Type /ExtGState 
        /SA false 
        /SM 0.02 
        /TR2 /Default 
        >> 
    endobj
    20 0 obj
        << /N 3 /Alternate /DeviceRGB /Length 2572 /Filter /FlateDecode >> 
        stream
        ...
        endstream
    endobj
    ...
    

    Solution

    While most of the thoughts here are fundamentally correct, they are not complete or exact, so:

    • The /ToUnicode MAY be present in the PDF file, but is not a must!!!
    • There are external, predefined CMaps for multiple languages, published by Adobe for the standard character collections (Adobe-Japan1 in this case); see the sketch below.
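
    To see exactly what you are dealing with, you can inspect the fonts programmatically. A rough sketch using Apache PDFBox 2.x (the file name "japanese.pdf" is just a placeholder): for each font on the first page it reports whether an embedded /ToUnicode map is present and, for Type0 fonts, which Registry-Ordering the descendant font declares. That name (Adobe-Japan1 here) tells you which externally published CMaps apply, e.g. Adobe-Japan1-UCS2 for mapping CIDs to Unicode.

    import java.io.File;
    import org.apache.pdfbox.cos.COSName;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDResources;
    import org.apache.pdfbox.pdmodel.font.PDCIDFont;
    import org.apache.pdfbox.pdmodel.font.PDFont;
    import org.apache.pdfbox.pdmodel.font.PDType0Font;

    public class FontInfo {
        public static void main(String[] args) throws Exception {
            try (PDDocument doc = PDDocument.load(new File("japanese.pdf"))) {
                PDResources res = doc.getPage(0).getResources();
                for (COSName name : res.getFontNames()) {
                    PDFont font = res.getFont(name);
                    // Is an embedded ToUnicode CMap present at all?
                    boolean hasToUnicode =
                            font.getCOSObject().containsKey(COSName.TO_UNICODE);
                    System.out.println(name.getName() + " (" + font.getName()
                            + "), /ToUnicode present: " + hasToUnicode);
                    if (font instanceof PDType0Font) {
                        // The descendant CIDFont declares the character collection.
                        PDCIDFont cid = ((PDType0Font) font).getDescendantFont();
                        System.out.println("  CIDSystemInfo: "
                                + cid.getCIDSystemInfo().getRegistry() + "-"
                                + cid.getCIDSystemInfo().getOrdering() + "-"
                                + cid.getCIDSystemInfo().getSupplement());
                    }
                }
            }
        }
    }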

    It was pretty frustrating to dig for so long in the wrong place. I had torn the PDF into tiny pieces and gone through all the streams in the file trying to find this map, without luck, because it WAS NOT IN THE FILE!

    I hope this saves someone else the hassle...
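
    For completeness: a library that bundles Adobe's predefined CMaps should be able to resolve these CIDs even though the file itself carries no /ToUnicode. A minimal sketch with Apache PDFBox 2.x, whose fontbox dependency ships the predefined CMap resources (again, the file name is a placeholder):

    import java.io.File;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public class ExtractJapanese {
        public static void main(String[] args) throws Exception {
            try (PDDocument doc = PDDocument.load(new File("japanese.pdf"))) {
                // Assumption: with no /ToUnicode, PDFBox falls back to the predefined
                // <Registry>-<Ordering>-UCS2 CMap shipped with its fontbox dependency.
                String text = new PDFTextStripper().getText(doc);
                System.out.println(text);
            }
        }
    }

    If that prints real Japanese, the only remaining question is where your own code obtains the Adobe-Japan1-UCS2 CMap and how it applies it to the CIDs.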
