CMap

CMaps1) (Character Maps) define a unidirectional mapping from one code to another. (They should not be confused with the cmap table2) of an OpenType font.)

CMaps provide a very general mechanism that can describe any mapping, including Unicode, which was developed later. Input codes of variable length (1, 2, 3 or more bytes) can be mapped to characters.

They are part of Type 0 fonts, where they define the mapping from input codes to glyphs in the font. This is used mainly for Asian fonts (Japanese, Chinese, Korean) with thousands of characters. But since CMaps are so general, some PDF applications use them as the default for encoding. Therefore, for PDF text extraction, it is necessary to understand and use CMaps.

A CMap is a PostScript program using operators from the /CIDInit ProcSet.

CMaps are used in two ways in PDF (and PostScript): mapping codes in text operators

  1. to the glyph to be displayed and
  2. to Unicode (in the ToUnicode3) attribute of a font)

The official standard CMaps are now hosted on GitHub as an open source project4). The mappings from the standard character collections to Unicode are also available5). An interesting blog post about how the CMap names were chosen can be found here6).

Example

The source of a typical CMap looks like: CMap source

The derived CMap is displayed like this: CMap object

Components

A CMap PostScript program creates a dictionary with all information in the CMap resource category. It can be accessed by its name with

((aPostScript.Interpreter resources at: #CMap) at: aCMapNameSymbol)

General Info

The following keys can be defined:

/CIDSystemInfo

A character collection defined by a dictionary with 3 keys: /Registry, /Ordering and /Supplement.

Example:

/CIDSystemInfo <</Registry (Adobe) /Ordering (GB1) /Supplement 5>> def

/Registry is almost always (Adobe). In particular, the standard CMaps of PDF all come from that registry.

/Ordering is a specific ordering of characters. Besides (Identity), there are only 5 supported ones: (CNS1), (GB1), (Japan1), (Korea1) and (KR).

/Supplement is a version number. A higher number adds more characters to the collection at the end.

Codespace

The codespace defines the range of possible mappings and the number of bytes used for the mapping.

The UTF-8 encoding codespace as example:

4 begincodespacerange
	<00> <7F>
	<C080> <DFBF>
	<E08080> <EFBFBF>
	<F0808080> <F7BFBFBF>
endcodespacerange

The byte ranges are dimensions, one per byte position: the bytes at each position of the two bounds define the range of possible bytes in that position. The second codespace range <C080>..<DFBF>, for example, should be read as two ranges: <C0>..<DF> for the first byte and <80>..<BF> for the second. The code <C785> is in that space while <C77F> is not, because its second byte <7F> lies below <80>.
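The per-byte reading of a codespace can be sketched in Python. This is a minimal illustration and not the implementation described on this page: the names in_range, in_codespace and split_codes are made up for the example, and raising an error for bytes that match no range is a simplifying assumption.

```python
# Hypothetical helper names; a sketch of the per-byte codespace check.
UTF8_CODESPACE = [
    (bytes.fromhex("00"), bytes.fromhex("7F")),
    (bytes.fromhex("C080"), bytes.fromhex("DFBF")),
    (bytes.fromhex("E08080"), bytes.fromhex("EFBFBF")),
    (bytes.fromhex("F0808080"), bytes.fromhex("F7BFBFBF")),
]


def in_range(code, low, high):
    """True if code has the length of the range and every byte lies
    within the bounds given for its position."""
    return len(code) == len(low) and all(
        lo <= b <= hi for b, lo, hi in zip(code, low, high)
    )


def in_codespace(code, codespace):
    """True if code falls into at least one codespace range."""
    return any(in_range(code, low, high) for low, high in codespace)


def split_codes(data, codespace):
    """Split a byte string into codes by trying each range at the
    current position. Assumption: unmatched bytes raise an error."""
    codes = []
    pos = 0
    while pos < len(data):
        for low, high in codespace:
            candidate = data[pos:pos + len(low)]
            if in_range(candidate, low, high):
                codes.append(candidate)
                pos += len(low)
                break
        else:
            raise ValueError(f"no codespace range matches at offset {pos}")
    return codes
```

With these definitions, in_codespace(bytes.fromhex("C785"), UTF8_CODESPACE) is true and the same call with "C77F" is false, matching the example above.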

A CMap can be defined on the basis of another with the operator usecmap. usecmap takes the codespace and all the mappings from the referenced CMap, and the new CMap may add more mappings. In this case, the CMap cannot have a codespace definition of its own. This means that codespaces cannot be enlarged or altered when reusing another CMap.
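The relationship between a CMap and the CMap it is based on might be modelled as follows. This is only a sketch in Python: the class SimpleCMap and its attributes are hypothetical, and letting a CMap's own mappings take precedence over inherited ones is an assumption of the example.

```python
class SimpleCMap:
    """Toy CMap: a dict of mappings plus either a codespace or a base CMap.
    Hypothetical class, not the object model described on this page."""

    def __init__(self, mappings, codespace=None, base=None):
        if base is not None and codespace is not None:
            raise ValueError("a CMap based on another cannot define a codespace")
        if base is None and codespace is None:
            raise ValueError("a CMap needs a codespace or a base CMap")
        self.base = base
        self.own_mappings = dict(mappings)
        self._codespace = codespace

    @property
    def codespace(self):
        # The codespace is always the one of the base CMap, if there is one.
        return self.base.codespace if self.base is not None else self._codespace

    def lookup(self, code):
        # Assumption: own mappings win, then the lookup falls back to the base.
        if code in self.own_mappings:
            return self.own_mappings[code]
        if self.base is not None:
            return self.base.lookup(code)
        return None
```

A derived CMap built with SimpleCMap({b"\x42": 2}, base=base) answers lookups for its own codes and for those of base, while reporting the codespace of base.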

Mappings

The mapping information is provided by char and range mappings.

There are bf, cid and notdef mappings. bf (base font) maps codes to characters. cid and notdef map codes to CIDs (Character IDs), which are used as glyph indices in a font.

Char mappings map one code to another and are written as 2 byte strings.

<A63F> <32>
<37> 7346456

The source (the first element) should be a bytestring written in hex notation, while the destination (second element) can also be given as integer.

Range mappings consist of 3 elements, where the first 2 define a range of source codes and the third element is the first destination code.

<A63A> <A63F> <32>
<37> <3B> 7346456

The first mapping maps a range of 6 codes (<A63A>..<A63F>) to the destination range <32>..<37>.
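How a range mapping unfolds into individual char mappings can be sketched in Python. The function name expand_range is made up for this example, and treating codes as big-endian integers is an assumption that fits the examples here, where only the last byte varies.

```python
def expand_range(first, last, destination):
    """Expand a range mapping into (code, destination) pairs.

    first and last are byte strings of equal length; destination is the
    first destination code, given as a byte string or as an integer.
    Assumption: a byte string destination keeps its length.
    """
    if len(first) != len(last):
        raise ValueError("range bounds must have the same length")
    start = int.from_bytes(first, "big")
    count = int.from_bytes(last, "big") - start + 1
    pairs = []
    for offset in range(count):
        code = (start + offset).to_bytes(len(first), "big")
        if isinstance(destination, int):
            target = destination + offset
        else:
            value = int.from_bytes(destination, "big") + offset
            target = value.to_bytes(len(destination), "big")
        pairs.append((code, target))
    return pairs
```

Applied to the first range above, this yields 6 pairs running from (<A63A>, <32>) to (<A63F>, <37>).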

For bf mappings (mapping to characters), the destination can also be a PostScript character name or an array of names for ranges.

3 beginbfchar
<A63F> <32>
<37> 7346456
<84> /epsilon
endbfchar
3 beginbfrange
<A63A> <A63F> <32>
<37> <3B> 7346456
<84> <86> [/a /c /mu]
endbfrange
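For a range whose destination is an array of names, each code in the range is paired with the name at the same position. A small Python sketch of that pairing follows; the function name expand_name_range is hypothetical, and requiring exactly one name per code is an assumption of the example.

```python
def expand_name_range(first, last, names):
    """Pair each code in first..last with the name at the same position."""
    start = int.from_bytes(first, "big")
    count = int.from_bytes(last, "big") - start + 1
    if count != len(names):
        # Assumption: the array must provide exactly one name per code.
        raise ValueError("array must provide one name per code in the range")
    return {
        (start + i).to_bytes(len(first), "big"): name
        for i, name in enumerate(names)
    }
```

For the last range above, the result maps <84> to a, <85> to c and <86> to mu.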

Implementation notes

Canonical representation

When constructing a CMap object, great care has been taken to derive a canonical form of the CMap. This means that no matter how the original CMap is written, it will always end up with the same minimal CMap.

The following modifications are applied:

Monster from the wild

CMaps are not well defined. Therefore, there are some interesting variations of them in the wild. Here is a small selection of some issues.

Codespace problems

Wrong code length

%...
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
27 beginbfchar
<20> <0020>
<2E> <002E>
<43> <0043>
<44> <0044>
<45> <0045>
%...

Here, single-byte codes are mapped in a double-byte codespace, which is not correct according to the documentation.

This can be seen often. These illegal mappings are collected into the #unmapped variable of a Mappings object.
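Separating such mappings from the valid ones could look like the following Python sketch. It is loosely modelled on the idea of an unmapped collection and is not the actual code of the Mappings object; the function name and the dictionary representation are assumptions.

```python
def partition_by_code_length(mappings, codespace):
    """Split mappings into those whose source code length occurs in the
    codespace and those whose length does not (the "unmapped" ones)."""
    valid_lengths = {len(low) for low, _high in codespace}
    mapped, unmapped = {}, {}
    for code, destination in mappings.items():
        target = mapped if len(code) in valid_lengths else unmapped
        target[code] = destination
    return mapped, unmapped


# The example above: a double-byte codespace with single-byte source codes.
codespace = [(bytes.fromhex("0000"), bytes.fromhex("FFFF"))]
mappings = {
    bytes.fromhex("20"): bytes.fromhex("0020"),
    bytes.fromhex("2E"): bytes.fromhex("002E"),
    bytes.fromhex("43"): bytes.fromhex("0043"),
}
mapped, unmapped = partition_by_code_length(mappings, codespace)
```

In this example every mapping ends up in unmapped, since no codespace range has a length of one byte.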

Mappings outside the codespace

%...
1 begincodespacerange
<0001> <1004>
endcodespacerange
11 beginbfchar
<0003> <00A0>
<0005> <0022>
<0008> <0025>
<000F> <002C>
<0010> <00AD>
%...

Here, only the first mapping matches the codespace. All others fall outside of it, because the second byte has to be between <01> and <04>.

Wrong PostScript

On one occasion, I saw a CMap where the PostScript used a nonexistent operator (/find instead of /findresource). See the exception_handling_example on the PostScript page.

Prevent copying

%...
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
100 beginbfchar
<0000> <001A>
<0100> <001A>
<0200> <001A>
<0300> <001A>
<0400> <001A>
%...
<4900> <001A>
<4A00> <001A>
<0001> <001A>
<0101> <001A>
<0201> <001A>
<0301> <001A>
<0401> <001A>
%...

Here, all codes map to the same character (the substitute character, Ctrl-Z) to prevent text extraction. Also interesting is the ordering by the second byte, which forced me to redesign the object structure to avoid exponential processing time.
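A text extractor may want to recognize such a CMap before using it. The following Python sketch is a simple heuristic and not part of the implementation described here; the function name is made up, and it assumes that destinations are well-formed UTF-16BE byte strings.

```python
SUBSTITUTE = "\u001a"  # the substitute character, Ctrl-Z


def maps_everything_to(mappings, character=SUBSTITUTE):
    """True if the mapping is non-empty and every destination decodes
    to the given character. Assumes UTF-16BE encoded destinations."""
    return bool(mappings) and all(
        destination.decode("utf-16-be") == character
        for destination in mappings.values()
    )
```

For a dictionary built from the bfchar entries above, maps_everything_to returns True, which signals that the ToUnicode information is useless for extraction.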

Seen in The Adobe-CNS1-7 Character Collection.

Char to string mapping

%...
/CMapType 2 def
1 begincodespacerange
<00><FF>
endcodespacerange
1 beginbfchar
<24><0009 000d 0020 00a0>
endbfchar
1 beginbfchar
<50><002d 00ad 2010>
endbfchar
50 beginbfrange
<21><21><0050>
%...

Two codes (<24> and <50>) are each mapped to a string of several 2-byte characters. This is defined by the PDF spec7) in section 9.10.3 "ToUnicode CMaps". This has not been implemented yet.

Seen in a PDF with the Producer "Mac OS X 10.7.1 Quartz PDFContext".
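Since the destination strings of a ToUnicode CMap are encoded as UTF-16BE, a destination with several characters can be decoded in one step. The Python sketch below only illustrates this decoding and is not the missing implementation mentioned above; the function name is hypothetical.

```python
def decode_bf_destination(hex_string):
    """Decode the hex content of a bf destination as UTF-16BE text.
    Whitespace between the hex digits is ignored by bytes.fromhex."""
    return bytes.fromhex(hex_string).decode("utf-16-be")
```

For the first mapping above, decode_bf_destination("0009 000d 0020 00a0") returns a string of four characters: tab, carriage return, space and no-break space.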

1) 5014.CIDFont_Spec.pdf: Adobe CMap and CIDFont Files Specification
3) 5411.ToUnicode.pdf: ToUnicode Mapping File Tutorial
4) cmap-resources: Standard CMaps from Adobe at GitHub
5) mapping-resources-pdf: Mapping character collections to Unicode at GitHub