Insert a NULL character with PDFBox


Problem description


Let us consider this code:

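import java.io.IOException;

import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.pdmodel.edit.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType1Font;

public class Test1 {

    // The post did not show these declarations; fields are assumed here (PDFBox 1.8.x API).
    private static PDPageContentStream content;
    private static PDFont font;

    public static void CreatePdf(String src) throws IOException, COSVisitorException {
        PDRectangle rec = new PDRectangle(400, 400);
        PDDocument document = new PDDocument();
        PDPage page = new PDPage(rec);
        document.addPage(page);
        PDDocumentInformation info = document.getDocumentInformation();
        PDStream stream = new PDStream(document); // not used further in this snippet
        info.setAuthor("PdfBox");
        info.setCreator("Pdf");
        info.setSubject("Stéganographie");
        info.setTitle("Stéganographie dans les documents PDF");
        info.setKeywords("Stéganographie, pdf");
        content = new PDPageContentStream(document, page, true, false);
        font = PDType1Font.HELVETICA;

        String hex = "4C0061f";  // shows "La"
        // Notice that we have 00 between 4C and 61, where 00 = null character

        // Decode the hex string two digits at a time into character codes.
        StringBuilder sb = new StringBuilder();
        for (int count = 0; count < hex.length() - 1; count += 2) {
            String output = hex.substring(count, count + 2);
            int decimal = Integer.parseInt(output, 16);
            sb.append((char) decimal);
        }
        String tt = sb.toString();

        // Write the decoded codes as a literal string operand of the Tj operator.
        content.beginText();
        content.setFont(font, 12);
        content.appendRawCommands("15 385 Td\n");
        content.appendRawCommands("(" + tt + ")" + "Tj\n");
        content.endText();
        content.close();
        document.save("doc.pdf");
        document.close();
    }
}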

My problem is: why is the "00" shown as a space in the PDF document rather than as a null character? Note that I got a width of 0.0 for this null character, yet it shows as a space in the PDF document. I therefore get "L a" instead of "La". I need your help, please.

Best regards,

Liszt.

Solution

> why the "00" is replaced by a space in the PDF document not as a null character?

If you look into your PDF you'll find that the font used for your text is defined as:

9 0 obj
<<
/Type /Font
/Subtype /Type1
/BaseFont /Helvetica
/Encoding /WinAnsiEncoding
>>
endobj 

So you use a font with WinAnsiEncoding. If you look at the definition of that encoding in Annex D of the PDF specification, you see that no code below 32 (decimal) is mapped to anything. You are therefore trying to use a character that is undefined in the encoding at hand, so the behavior is not defined; Acrobat Reader seems to use a positive width for those undefined code points.
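To check which codes of a string fall into that unmapped range before writing them to a content stream, something like the following could be used. It is a minimal stdlib-only sketch; the class `LowCodeCheck` and the method `positionsBelow32` are made-up names, not PDFBox API, and the check covers only the "below 32" rule stated above, not the whole encoding table.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of PDFBox. */
final class LowCodeCheck {

    /** Returns the positions of all character codes below 32 (decimal) in the text. */
    static List<Integer> positionsBelow32(String text) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) < 32) {
                positions.add(i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        // The three codes 4C, 00, 61 from the question.
        String shown = "L\u0000a";
        System.out.println(positionsBelow32(shown)); // prints [1]
    }
}
```

Every position reported here is a code for which the viewer has no mapping in WinAnsiEncoding, so its rendering is up to the viewer.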

If you want to make sure your hidden characters don't cause any displacement at all, you should add an explicit array of widths to your font dictionary, cf. section 9.6.2 in the PDF specification, and make sure your invisible characters get a width of 0. (By the way, that section also shows that omitting the widths array, as PDFBox does, was deprecated years ago.)
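How a widths array determines the displacement can be modelled without any PDF library. The sketch below assumes the usual simple-font rule: the width for a code is `Widths[code - FirstChar]`, expressed in thousandths of a text space unit and scaled by the font size. `WidthsLookup` and its methods are hypothetical names, and character spacing, word spacing and horizontal scaling are ignored.

```java
/** Hypothetical model of a simple-font widths lookup; not PDFBox API. */
final class WidthsLookup {

    /** Glyph width in 1/1000 text space units for a character code. */
    static float glyphWidth(int code, int firstChar, float[] widths) {
        int index = code - firstChar;
        if (index < 0 || index >= widths.length) {
            // Assumption for this sketch: a real font would use the
            // font descriptor's MissingWidth here, which defaults to 0.
            return 0f;
        }
        return widths[index];
    }

    /** Horizontal advance of the text for the given font size. */
    static float advance(String text, float fontSize, int firstChar, float[] widths) {
        float total = 0f;
        for (int i = 0; i < text.length(); i++) {
            total += glyphWidth(text.charAt(i), firstChar, widths) / 1000f * fontSize;
        }
        return total;
    }

    public static void main(String[] args) {
        float[] widths = new float[128]; // FirstChar 0, LastChar 127, all zero initially
        widths['L'] = 556f;              // Helvetica AFM width of "L"
        widths['a'] = 556f;              // Helvetica AFM width of "a"
        float plain = advance("La", 12f, 0, widths);
        float hidden = advance("L\u0000a", 12f, 0, widths);
        System.out.println(plain == hidden); // prints true
    }
}
```

With a zero entry for code 0, the hidden character adds nothing to the advance, which is the effect the explicit widths array is meant to achieve. Whether a given viewer honors it is a separate matter.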

> Notice that i got the width 0.0 for this null character

As soon as you are in undefined ranges, anything might happen and different programs have different assumptions.

PS: Some code. Between your lines

font= PDType1Font.HELVETICA;

and

String hex = "4C0061f";  // shows "La"

I added the following code:

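// Load the Helvetica font metrics bundled with PDFBox and parse them.
InputStream afmStream = ResourceLoader.loadResource("org/apache/pdfbox/resources/afm/Helvetica.afm");
AFMParser afmParser = new AFMParser(afmStream);
afmParser.parse();
FontMetric afmMetrics = afmParser.getResult();

// Build a widths list indexed by character code; codes without metrics keep width 0.
List<Float> newWidths = new ArrayList<Float>();
for (CharMetric charMetric : afmMetrics.getCharMetrics())
{
    // Skip glyphs that have no code in the font's built-in encoding.
    if (charMetric.getCharacterCode() < 0)
        continue;
    // Grow the list with zero widths up to this code.
    while (charMetric.getCharacterCode() >= newWidths.size())
        newWidths.add(0f);
    newWidths.set(charMetric.getCharacterCode(), charMetric.getWx());
}

// Write FirstChar, LastChar, and Widths into the font dictionary.
font.setFirstChar(0);
font.setLastChar(newWidths.size() - 1);
font.setWidths(newWidths);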

This code should read the Helvetica.afm font metrics resource included in PDFBox and create FirstChar, LastChar, and Widths entries from it. It works fine here, but if it doesn't in your installation, simply extract the afm file from the PDFBox jars and read it using a FileInputStream.
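The list-building logic can also be tried without PDFBox on a few character metric lines written in AFM style (`C code ; WX width ; N name ;`, with `C -1` marking a glyph that has no code). This is only a sketch: `AfmWidthsSketch` and `buildWidths` are made-up names, the sample lines are a hand-picked excerpt, and a real AFM file contains many more fields than are parsed here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for the AFM-based loop; not the PDFBox/FontBox parser. */
final class AfmWidthsSketch {

    /** Builds a widths list indexed by character code; codes without metrics get 0. */
    static List<Float> buildWidths(List<String> charMetricLines) {
        List<Float> widths = new ArrayList<>();
        for (String line : charMetricLines) {
            int code = -1;
            float wx = 0f;
            for (String field : line.split(";")) {
                String[] parts = field.trim().split("\\s+");
                if (parts.length < 2) {
                    continue; // empty or malformed field
                }
                if (parts[0].equals("C")) {
                    code = Integer.parseInt(parts[1]);
                } else if (parts[0].equals("WX")) {
                    wx = Float.parseFloat(parts[1]);
                }
            }
            if (code < 0) {
                continue; // glyph has no code in the built-in encoding
            }
            while (code >= widths.size()) {
                widths.add(0f); // pad with zero widths up to this code
            }
            widths.set(code, wx);
        }
        return widths;
    }

    public static void main(String[] args) {
        List<Float> widths = buildWidths(Arrays.asList(
                "C 32 ; WX 278 ; N space ;",
                "C 76 ; WX 556 ; N L ;",
                "C 97 ; WX 556 ; N a ;",
                "C -1 ; WX 556 ; N eacute ;"));
        // FirstChar would be 0 and LastChar widths.size() - 1.
        System.out.println(widths.size() - 1); // prints 97
    }
}
```

All codes below 32 end up with width 0 simply because no metric line assigns them anything.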

For some reason the 00 character still seems to have some width, but other characters below 32 (decimal) can be used without problems, e.g.

String hex = "4C0461f";

shows "La" without a gap. If I interpret your former (now deleted) question concerning 1C and 1D correctly, this should already help you continue.
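To see exactly which character codes such a hex string puts into the Tj operand, the decoding loop from the question can be run on its own. The sketch below repeats that loop in a stdlib-only class; `HexDecodeDemo` and `decode` are made-up names.

```java
/** Stand-alone copy of the question's hex decoding loop, for illustration. */
final class HexDecodeDemo {

    /** Decodes two hex digits at a time; an unpaired trailing digit is ignored. */
    static String decode(String hex) {
        StringBuilder sb = new StringBuilder();
        for (int count = 0; count < hex.length() - 1; count += 2) {
            String pair = hex.substring(count, count + 2);
            sb.append((char) Integer.parseInt(pair, 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (String hex : new String[] {"4C0061f", "4C0461f"}) {
            StringBuilder codes = new StringBuilder();
            for (char c : decode(hex).toCharArray()) {
                codes.append((int) c).append(' ');
            }
            System.out.println(hex + " -> " + codes.toString().trim());
        }
        // prints:
        // 4C0061f -> 76 0 97
        // 4C0461f -> 76 4 97
    }
}
```

Note that the loop condition `count < hex.length() - 1` means the trailing `f` of both strings is never decoded, so each string yields exactly three codes.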

PPS: Concerning the question in the comments:

> can you tell me the all disadvantages of this method ? and why this method does not match with accent characters, for example (Lé), your code match only with characters without accent , but when we have accent, we get L é instead of Le..I want to know only what are the disadvantages of your code :)

I cannot name all of them (I am really not that deep into font matters), but in essence the approach described above is somewhat incomplete.

As mentioned at the start, you use a font with WinAnsiEncoding in which no code below 32 (decimal) is mapped to anything. By adding FirstChar, LastChar, and Widths entries, we tried to define a zero width for those characters with codes below 32.

In spite of all that, though, we neither cared about encoding information for those codes (the encoding remained a pure WinAnsiEncoding) nor did we consider whether the font actually contained any information for those codes. Furthermore, making things still less controllable, we are talking about Helvetica, i.e. one of the standard 14 fonts, for which PDF viewers have to bring along their own information anyway. Wherever the explicitly given information and the information the viewer brings along contradict each other, PDF viewers might prefer their own information.

Why is there trouble especially with accented characters? I'm not sure. I would guess, though, that it is related to the fact that fonts usually don't bring along accented characters as separate entities but instead combine an accent and an unaccented character. Maybe the font the viewer uses internally has some information for such combined characters mapped at those code points below 32, and the display therefore becomes quirky when your explicit codes below 32 and the font's implicit use of such codes happen side by side.

In general, I would advise against doing things like this. For normal PDF documents it is not necessary at all.

In your case though, as you have titled your document Stéganographie dans les documents PDF, you obviously do want to somehow hide information in PDFs. Using invisible, unprintable characters seems one approach for that; thus, it is ok that you experiment in that direction. But PDF does offer many more ways to put any amount of information into a PDF without it being directly visible.

Depending on your specific aim, therefore, I would think that other approaches might hide the information more securely, e.g. private PieceInfo sections or custom tags in some other dictionaries...
