Can't get Czech characters while generating a PDF

THE PROBLEM:

First of all, you don't seem to be talking about Cyrillic characters, but about central and eastern European languages that use Latin script. Take a look at the difference between code page 1250 and code page 1251 to understand what I mean. [NOTE: I have updated the question so that it talks about Czech characters instead of Cyrillic.]

Second observation. You are writing code that contains special characters:

"Testing of letters Č,Ć,Š,Ž,Đ"

That is a bad practice. Code files are stored as plain text and can be saved using different encodings. An accidental switch from encoding (for instance: by uploading it to a versioning system that uses a different encoding), can seriously damage the content of your file.

You should write code that doesn't contain special characters, but that use a different notations. For instance:

"Testing of letters \u010c,\u0106,\u0160,\u017d,\u0110"

This will also make sure that the content doesn't get altered when compiling the code using a compiler that expects a different encoding.

Your third mistake is that you assume that Helvetica is a font that knows how to draw these glyphs. That is a false assumption. You should use a font file such as Arial.ttf (or pick any other font that knows how to draw those glyphs).

Your fourth mistake is that you do not embed the font. Suppose that you use a font you have on your local machine and that is able to draw the special glyphs, then you will be able to read the text on your local machine. However, somebody who receives your file, but doesn't have the font you used on his local machine may not be able to read the document correctly.

Your fifth mistake is that you didn't define an encoding when using the font (this is related to your second mistake, but it's different).

THE SOLUTION:

I have written a small example called CzechExample that results in the following PDF: czech.pdf

enter image description here

I have added the same text twice, but using a different encoding:

public static final String FONT = "resources/fonts/FreeSans.ttf";
public void createPdf(String dest) throws IOException, DocumentException {
    Document document = new Document();
    PdfWriter.getInstance(document, new FileOutputStream(DEST));
    document.open();
    Font f1 = FontFactory.getFont(FONT, "Cp1250", true);
    Paragraph p1 = new Paragraph("Testing of letters \u010c,\u0106,\u0160,\u017d,\u0110", f1);
    document.add(p1);
    Font f2 = FontFactory.getFont(FONT, BaseFont.IDENTITY_H, true);
    Paragraph p2 = new Paragraph("Testing of letters \u010c,\u0106,\u0160,\u017d,\u0110", f2);
    document.add(p2);
    document.close();
}

To avoid your third mistake, I used the font FreeSans.ttf instead of Helvetica. You can choose any other font as long as it supports the characters you want to use. To avoid your fourth mistake, I have set the embedded parameter to true.

As for your fifth mistake, I introduced two different approaches.

In the first case, I told iText to use code page 1250.

Font f1 = FontFactory.getFont(FONT, "Cp1250", true);

This will embed the font as a simple font into the PDF, meaning that each character in your String will be represented using a single byte. The advantage of this approach is simplicity; the disadvantage is that you shouldn't start mixing code pages. For instance: this won't work for Cyrillic glyphs.

In the second case, I told iText to use Unicode for horizontal writing:

Font f2 = FontFactory.getFont(FONT, BaseFont.IDENTITY_H, true);

This will embed the font as a composite font into the PDF, meaning that each character in your String will be represented using more than one byte. The advantage of this approach is that it is the recommended approach in the newer PDF standards (e.g. PDF/A, PDF/UA), and that you can mix Cyrillic with Latin, Chinese with Japanese, etc... The disadvantage is that you create more bytes, but that effect is limited by the fact that content streams are compressed anyway.

When I decompress the content stream for the text in the sample PDF, I see the following PDF syntax:

enter image description here

As I explained, single bytes are used to store the text of the first line. Double bytes are used to store the text of the second line.

You may be surprised that these characters look OK on the outside (when looking at the text in Adobe Reader), but don't correspond with what you see on the inside (when looking at the second screen shot), but that's how it works.

CONCLUSION:

Many people think that creating PDF is trivial, and that tools for creating PDF should be a commodity. In reality, it's not always that simple ;-)

Can't get Czech characters while generating a PDF

Related

Recent Posts