Can't get Czech characters while generating a PDF

Question

I have a problem when adding characters such as "Č" or "Ć" while generating a PDF. I'm mostly using paragraphs for inserting some static text into my PDF report. Here is some sample code I used:

var document = new Document();
document.Open();
Paragraph p1 = new Paragraph("Testing of letters Č,Ć,Š,Ž,Đ", new Font(Font.FontFamily.HELVETICA, 10));
document.Add(p1);

The output I get when the PDF file is generated looks like this: "Testing of letters ,,Š,Ž,Đ"

For some reason iTextSharp doesn't seem to recognize these letters such as "Č" and "Ć".

Recommended answer

Problems:

First of all, you don't seem to be talking about Cyrillic characters, but about central and eastern European languages that use Latin script. Take a look at the difference between code page 1250 and code page 1251 to understand what I mean. [NOTE: I have updated the question so that it talks about Czech characters instead of Cyrillic.]

Second observation. You are writing code that contains special characters:

"Testing of letters Č,Ć,Š,Ž,Đ"

That is a bad practice. Code files are stored as plain text and can be saved using different encodings. An accidental change of encoding (for instance, by uploading the file to a versioning system that uses a different encoding) can seriously damage the content of your file.

You should write code that doesn't contain special characters, but that uses a different notation. For instance:

"Testing of letters \u010c,\u0106,\u0160,\u017d,\u0110"

This will also make sure that the content doesn't get altered when compiling the code using a compiler that expects a different encoding.
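As a rough illustration of the escaped notation, the following sketch uses only the Java standard library to confirm that the escapes denote the intended code points while the source file itself stays pure ASCII. The class name UnicodeEscapeDemo and its helper method are made up for this example.

```java
final class UnicodeEscapeDemo {
    // The sample text written with Unicode escapes only (hypothetical constant name).
    static final String TEXT = "Testing of letters \u010c,\u0106,\u0160,\u017d,\u0110";

    /** Returns the non-ASCII code points of s as space-separated "U+XXXX" values. */
    static String nonAsciiCodePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints()
         .filter(cp -> cp > 0x7F)                       // skip plain ASCII characters
         .forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Prints the code points that the compiler produced from the escapes.
        System.out.println(nonAsciiCodePoints(TEXT));
    }
}
```

Running main should list U+010C, U+0106, U+0160, U+017D and U+0110, regardless of the encoding in which the source file was saved.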

Your third mistake is that you assume that Helvetica is a font that knows how to draw these glyphs. That is a false assumption. You should use a font file such as Arial.ttf (or pick any other font that knows how to draw those glyphs).

Your fourth mistake is that you do not embed the font. Suppose that you use a font that you have on your local machine and that is able to draw the special glyphs; then you will be able to read the text on your local machine. However, somebody who receives your file but doesn't have that font on their own machine may not be able to read the document correctly.

Your fifth mistake is that you didn't define an encoding when using the font (this is related to your second mistake, but it's different).
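One way to see why an unspecified encoding matters is to check which of the letters a given code page can represent at all. The sketch below is standard-library Java only. It assumes that the font in the question ends up with the WinAnsi encoding (code page 1252) when no encoding is given. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

final class CodePageCheck {
    // The five letters from the question, written as escapes.
    static final String LETTERS = "\u010c\u0106\u0160\u017d\u0110";

    /** Returns the characters of text that the named charset cannot encode. */
    static String unencodable(String text, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        StringBuilder missing = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (!encoder.canEncode(c)) {
                missing.append(c);
            }
        }
        return missing.toString();
    }

    public static void main(String[] args) {
        // Assumption: code page 1252 stands in for the default encoding.
        System.out.println("Not in windows-1252: " + unencodable(LETTERS, "windows-1252"));
        System.out.println("Not in windows-1250: " + unencodable(LETTERS, "windows-1250"));
    }
}
```

Under that assumption, Č and Ć have no slot in code page 1252, which would be consistent with them going missing, whereas code page 1250 covers all five letters. Đ (U+0110) is also reported as absent from code page 1252; this sketch does not explain why it still appeared in the question's output.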

Solution:

I have written a small example called CzechExample that results in the following PDF: czech.pdf

I have added the same text twice, each time using a different encoding:

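// Imports assume iText 5 (com.itextpdf.text), judging by the API calls used.
import java.io.FileOutputStream;
import java.io.IOException;
import com.itextpdf.text.*;
import com.itextpdf.text.pdf.BaseFont;
import com.itextpdf.text.pdf.PdfWriter;

public class CzechExample {
    public static final String FONT = "resources/fonts/FreeSans.ttf";

    public void createPdf(String dest) throws IOException, DocumentException {
        Document document = new Document();
        PdfWriter.getInstance(document, new FileOutputStream(dest));
        document.open();
        // First paragraph: simple font, code page 1250, embedded.
        Font f1 = FontFactory.getFont(FONT, "Cp1250", true);
        Paragraph p1 = new Paragraph("Testing of letters \u010c,\u0106,\u0160,\u017d,\u0110", f1);
        document.add(p1);
        // Second paragraph: composite font, Identity-H, embedded.
        Font f2 = FontFactory.getFont(FONT, BaseFont.IDENTITY_H, true);
        Paragraph p2 = new Paragraph("Testing of letters \u010c,\u0106,\u0160,\u017d,\u0110", f2);
        document.add(p2);
        document.close();
    }
}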

To avoid your third mistake, I used the font FreeSans.ttf instead of Helvetica. You can choose any other font as long as it supports the characters you want to use. To avoid your fourth mistake, I have set the embedded parameter to true.

As for your fifth mistake, I introduced two different approaches.

In the first case, I told iText to use code page 1250.

Font f1 = FontFactory.getFont(FONT, "Cp1250", true);

This will embed the font as a simple font into the PDF, meaning that each character in your String will be represented using a single byte. The advantage of this approach is simplicity; the disadvantage is that you shouldn't start mixing code pages. For instance: this won't work for Cyrillic glyphs.
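The one-byte-per-character behaviour, and its limitation, can be reproduced without iText. The following is a minimal sketch that uses the JDK's windows-1250 charset as a stand-in for the "Cp1250" encoding above. SingleByteDemo and hexBytes are hypothetical names, and the bytes shown are what the JDK encoder produces, not a dump of the PDF.

```java
import java.nio.charset.Charset;

final class SingleByteDemo {
    /** Encodes text in the named charset and returns the bytes as space-separated hex. */
    static String hexBytes(String text, String charsetName) {
        byte[] bytes = text.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));  // unsigned two-digit hex
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Five Czech/Croatian letters: one byte each in code page 1250.
        System.out.println(hexBytes("\u010c\u0106\u0160\u017d\u0110", "windows-1250"));
        // A Cyrillic letter (U+0416) has no slot in code page 1250;
        // String.getBytes substitutes the replacement byte '?' (3F).
        System.out.println(hexBytes("\u0416", "windows-1250"));
    }
}
```

The second call shows why mixing code pages does not work: a character outside the chosen code page is silently replaced rather than encoded.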

In the second case, I told iText to use Unicode for horizontal writing:

Font f2 = FontFactory.getFont(FONT, BaseFont.IDENTITY_H, true);

This will embed the font as a composite font into the PDF, meaning that each character in your String will be represented using more than one byte. The advantage of this approach is that it is the recommended approach in the newer PDF standards (e.g. PDF/A, PDF/UA), and that you can mix Cyrillic with Latin, Chinese with Japanese, etc... The disadvantage is that you create more bytes, but that effect is limited by the fact that content streams are compressed anyway.
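To get a feel for the size difference, the sketch below compares byte counts using only the Java standard library. UTF-16BE is used purely as a stand-in for "two bytes per character"; that is an assumption made for illustration. The bytes that end up in the PDF with Identity-H are font-specific glyph identifiers, not UTF-16 values. ByteCountDemo and byteCount are hypothetical names.

```java
import java.nio.charset.Charset;

final class ByteCountDemo {
    // The five letters from the example, written as escapes.
    static final String LETTERS = "\u010c\u0106\u0160\u017d\u0110";

    /** Returns how many bytes text occupies when encoded in the named charset. */
    static int byteCount(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        // Single-byte code page: one byte per letter.
        System.out.println("windows-1250: " + byteCount(LETTERS, "windows-1250") + " bytes");
        // Two-byte stand-in (UTF-16BE, no byte order mark): two bytes per letter.
        System.out.println("UTF-16BE:     " + byteCount(LETTERS, "UTF-16BE") + " bytes");
    }
}
```

For these five letters the counts are 5 and 10 bytes, which is the doubling described above before compression is taken into account.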

When I decompress the content stream for the text in the sample PDF, I see the following PDF syntax:

As I explained, single bytes are used to store the text of the first line. Double bytes are used to store the text of the second line.

You may be surprised that these characters look OK on the outside (when looking at the text in Adobe Reader), but don't correspond with what you see on the inside (when looking at the second screen shot), but that's how it works.

Conclusion:

Many people think that creating PDF is trivial, and that tools for creating PDF should be a commodity. In reality, it's not always that simple ;-)
