Monthly Archives: December 2012

Making iText work with Indic scripts

Why does iText not work properly for Indic Scripts?

There are a number of threads floating around asking why iText does not render Indian languages properly. The reason is that iText does not handle Ligature Substitution.

What is Ligature Substitution?

In Indian languages like Bangla and Hindi, two or more characters sometimes merge to form a single Glyph.

Bangla Example:

ক + ্ + ষ = ক্ষ

ক + ্ + ষ + ্ + ম = ক্ষ্ম

ল + ্ + ল = ল্ল

Hindi Example:

क + ् + ष = क्ष

क + ् + ष + ् + म = क्ष्म

ल + ् + ल = ल्ल

This essentially means that whenever the text contains one of these composite character sequences, we need to replace the whole sequence with a single glyph.
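To make the idea concrete, here is a small sketch that lists the Unicode code points behind one conjunct from each script. It uses only the standard JDK; the class and method names are made up for illustration. The text stores the conjunct as three separate code points, and it is the font's job to map that sequence to one glyph.

```java
import java.util.ArrayList;
import java.util.List;

public class LigatureCodePoints {

    // Returns the code points of a string, each formatted as "U+XXXX".
    public static List<String> codePoints(String text) {
        List<String> result = new ArrayList<>();
        text.codePoints().forEach(cp -> result.add(String.format("U+%04X", cp)));
        return result;
    }

    public static void main(String[] args) {
        // Bangla conjunct: KA + HASANTA (virama) + SSA
        System.out.println(codePoints("\u0995\u09CD\u09B7")); // prints [U+0995, U+09CD, U+09B7]
        // Hindi conjunct: KA + VIRAMA + SSA
        System.out.println(codePoints("\u0915\u094D\u0937")); // prints [U+0915, U+094D, U+0937]
    }
}
```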

If you see only boxes above, your browser cannot display these scripts. Upgrade to a browser that can handle Unicode.

Where do we get the information about which Glyphs are to be substituted?

This information is available in the OpenType font file (note that OpenType fonts can have the extension .ttf, which is also used for TrueType fonts). An OpenType font has a table called the Glyph Substitution Table (GSUB). It's pretty cryptic, and decoding it is basically a wild goose chase. But at the end of it, you get a list of the Glyph sequences that should each be replaced by a single Glyph. The specification can be found here: http://www.microsoft.com/typography/otspec/gsub.htm
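Before the GSUB table can be decoded, it has to be located. The sketch below is not iText code; it is a minimal illustration, assuming a single-font file (not a .ttc collection), of how the table directory at the start of an OpenType file records where each table lives. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class TableDirectory {

    // Returns {offset, length} of the table with the given 4-character tag,
    // or null if the font has no such table.
    public static long[] findTable(byte[] font, String tag) {
        if (font.length < 12) {
            return null; // too short to hold the 12-byte header
        }
        ByteBuffer buf = ByteBuffer.wrap(font); // big-endian by default, as OpenType requires
        int numTables = buf.getShort(4) & 0xFFFF; // uint16 right after the 4-byte version
        for (int i = 0; i < numTables; i++) {
            int record = 12 + 16 * i; // each record: tag, checksum, offset, length
            if (record + 16 > font.length) {
                break; // truncated directory
            }
            String recordTag = new String(font, record, 4, StandardCharsets.US_ASCII);
            if (recordTag.equals(tag)) {
                long offset = buf.getInt(record + 8) & 0xFFFFFFFFL;
                long length = buf.getInt(record + 12) & 0xFFFFFFFFL;
                return new long[] { offset, length };
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java TableDirectory <font file>");
            return;
        }
        long[] gsub = findTable(Files.readAllBytes(Paths.get(args[0])), "GSUB");
        System.out.println(gsub == null
                ? "No GSUB table in this font"
                : "GSUB at offset " + gsub[0] + ", length " + gsub[1]);
    }
}
```

A font without a GSUB record carries no substitution data at all, which is worth checking first when a script renders as separate characters.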

Inner workings of iText

The best part about iText is that it's Open Source. This is the svn repository: svn://svn.code.sf.net/p/itext/code/trunk

At the heart of converting text to PDF is the TrueTypeFont class. This parses the actual FontFile and reads various information like the Character to Glyph mappings (cmap), the Glyph metrics, etc. Then, we have the convertToBytes() method in the FontDetails class, which actually converts each character into the Glyph code and writes it to PDF.
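As a rough illustration of the last step, the sketch below turns a list of glyph codes into bytes. It assumes the font is embedded with the Identity-H encoding, where each glyph code is written as two bytes, high byte first. This is a simplified stand-in, not the actual body of convertToBytes(), and the class name is hypothetical.

```java
public class GlyphBytes {

    // Encodes each glyph code as two bytes, high byte first.
    public static byte[] toBytes(int[] glyphIds) {
        byte[] out = new byte[glyphIds.length * 2];
        for (int i = 0; i < glyphIds.length; i++) {
            out[2 * i] = (byte) (glyphIds[i] >> 8); // high byte
            out[2 * i + 1] = (byte) glyphIds[i];    // low byte
        }
        return out;
    }

    public static void main(String[] args) {
        // Glyph codes here are invented; real ones come from the font's cmap.
        byte[] bytes = toBytes(new int[] { 0x01F4, 0x000A });
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02X", b & 0xFF));
        }
        System.out.println(hex);
    }
}
```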

Integration of the GlyphSubstitutionTable data with iText

  1. The GlyphSubstitutionTableReader class parses the FontFile and gleans the Glyph substitution information, and returns a Map<String, Glyph>, where the key is the String of composite characters and value is the Glyph object.
  2. Then, in the FontDetails::convertToBytes() method, tokenise the input String based on the composite characters.
  3. Replace the composite characters by their respective Glyphs.
  4. For characters that do not need substitution, proceed normally and replace them with their corresponding Glyph.
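The steps above can be sketched as a longest-match scan over the input. This is a simplified illustration rather than the actual patch: it uses plain Integer glyph codes instead of the Glyph object, the class and method names are hypothetical, and the glyph codes in main are invented.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LigatureTokenizer {

    // Converts text to glyph codes. At each position the longest matching
    // composite sequence wins; otherwise the single character is looked up.
    public static List<Integer> toGlyphIds(String text,
                                           Map<String, Integer> ligatures,
                                           Map<Character, Integer> cmap) {
        int maxLen = 0;
        for (String key : ligatures.keySet()) {
            maxLen = Math.max(maxLen, key.length());
        }
        List<Integer> glyphs = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            boolean matched = false;
            // Steps 2 and 3: try composite sequences, longest first.
            for (int len = Math.min(maxLen, text.length() - i); len >= 2; len--) {
                Integer glyph = ligatures.get(text.substring(i, i + len));
                if (glyph != null) {
                    glyphs.add(glyph);
                    i += len;
                    matched = true;
                    break;
                }
            }
            // Step 4: no substitution applies, so map the character directly.
            if (!matched) {
                Integer glyph = cmap.get(text.charAt(i));
                glyphs.add(glyph != null ? glyph : 0); // 0 is the "missing glyph" code
                i++;
            }
        }
        return glyphs;
    }

    public static void main(String[] args) {
        Map<String, Integer> ligatures = new HashMap<>();
        ligatures.put("\u0995\u09CD\u09B7", 500);             // KA + HASANTA + SSA
        ligatures.put("\u0995\u09CD\u09B7\u09CD\u09AE", 501); // ... + HASANTA + MA
        Map<Character, Integer> cmap = new HashMap<>();
        cmap.put('\u0995', 10); // KA
        cmap.put('\u09B2', 11); // LA
        System.out.println(toGlyphIds("\u09B2\u0995\u09CD\u09B7\u09CD\u09AE\u0995", ligatures, cmap));
    }
}
```

Matching the longest sequence first matters because shorter conjuncts are often prefixes of longer ones, as in the two entries above.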

Test Harness

The following is the test harness for testing out my fix.

Before Fix

[Image: BeforeFix]

After Fix

[Image: AfterFix]

Source

The changes were made on itextpdf-5.4.0-SNAPSHOT, revision 5638. Please note that the jar below will not work in most cases, as it is only half-baked.

Next Steps

You may notice that the i-kar, e-kar and o-kar are still not displayed in their proper positions. I am convinced that this is because we need to read the positioning data from the GPOS, the Glyph Positioning Table. That is my next task. Stay tuned!

Update: Why is the latest iText still not working?

My code is commented out in the latest iText, as it seems to interfere with some of its core functionality.

How do I make it work?

Download the iText source from SourceForge:

http://sourceforge.net/p/itext/code/HEAD/tree/trunk/itext/

After getting the source, just uncomment the line below in TrueTypeFontUnicode.java:

 

Building it with Maven should be pretty straightforward. Cheers!