Creating Attributes via LDIF
01 May 2003
Q.
Can you tell me what the difference is between caseIgnoreMatch and caseIgnoreIA5Match when creating attributes via LDIF in the schema? I did some digging and found the following about the two matching rules:
caseIgnoreMatch
  Rule type: equality
  Matching: case insensitive, space insensitive
  Syntax: Directory String (1.3.6.1.4.1.1466.115.121.1.15), a UTF-8 string

This is compared to:

caseIgnoreIA5Match
  Rule type: equality
  Matching: case insensitive, space insensitive
  Syntax: IA5 String (1.3.6.1.4.1.1466.115.121.1.26), an ASCII string
What I can interpret is that the two rules match the same way but apply to different data types: caseIgnoreIA5Match is for ASCII (IA5) string data and caseIgnoreMatch is for UTF-8 (Directory String) data. So I searched on ASCII and UTF-8, and the differences did not get any clearer. It appears that these encodings are also similar, as I found from this description on the Linux man page:
UTF-8 - an ASCII-compatible multibyte Unicode encoding
I cannot find the documentation I am looking for--one that explains the differences between these two strings and when you would use one over the other. If I get that information, I may be able to answer the client's questions.
Chancy All Charactered-out
A.
Dear Chancy: ASCII is a character encoding that uses a single byte (in fact, only seven bits) to store each character. To cover all the kinds of characters used in different parts of the world, many different "code pages" exist, but that causes trouble when storing multilingual information. Hence a new character encoding was needed to overcome that problem.
In UTF-8 (more on that below), most Western European languages need fewer than two bytes per character: text in Latin-based scripts averages about 1.1 bytes per character, while Greek, Arabic, Hebrew, and Russian average about 1.7 bytes. Japanese, Korean, and Chinese typically require three bytes per character.
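For example, the French word "café" takes five bytes in UTF-8: one byte each for c, a, and f, and two bytes (0xC3 0xA9) for é--about 1.25 bytes per character.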
The big idea of Unicode (the name suggests one unified, universal code) was to assign every character a single numeric code point, up to 32 bits wide, in one character set instead of many code pages; the UTF encodings then define how those code points are stored as bytes. But convincing the world to suddenly switch to a 32-bit-wide character format is not easy because of all the legacy software and data. So you need a compromise: enter UTF-8.
UTF-8 is a multi-byte encoding scheme in which each character is encoded in as little as one byte and as many as four bytes. Characters in the range U+0000 - U+007F are encoded as a single byte, which means the ASCII character set is represented unchanged, one byte of storage per character. The range U+0800 - U+FFFF, which takes three bytes per character, includes (among many other scripts) most Chinese, Japanese, and Korean characters.
Here's a table showing UTF-8 Bit encoding of a Unicode code point:
Character Range        Bit Encoding
U+0000 - U+007F        0xxxxxxx
U+0080 - U+07FF        110xxxxx 10xxxxxx
U+0800 - U+FFFF        1110xxxx 10xxxxxx 10xxxxxx
U+10000 - U+10FFFF     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
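For example, é (U+00E9, bits 000 1110 1001) falls in the second row: its eleven bits fill the x positions to give the two bytes 0xC3 0xA9. A CJK character such as 中 (U+4E2D) falls in the third row and becomes the three bytes 0xE4 0xB8 0xAD.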
So the power of UTF-8 is that it bridges between 32-bit Unicode and 8-bit ASCII. The other UTF encodings, UTF-16 and UTF-32, store characters as 16- or 32-bit units and hence are not byte-compatible with ASCII. So my guess is that the two matching rules behave identically as long as the data being matched is plain ASCII. But once the data is UTF-8 encoded and some characters are multi-byte, an ASCII-only rule such as caseIgnoreIA5Match can no longer match it correctly, and you need caseIgnoreMatch.
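To make this concrete, here is a sketch of how the two rules might appear in attribute definitions added via LDIF. The attribute names and OIDs below are made up purely for illustration, and the exact schema entry DN and changetype handling vary from one directory server to another:

dn: cn=schema
changetype: modify
add: attributeTypes
attributeTypes: ( 1.2.3.4.5.6.1 NAME 'exampleDisplayLabel'
  DESC 'Illustrative attribute for multilingual text'
  EQUALITY caseIgnoreMatch
  SYNTAX 1.3.6.1.4.1.1466.115.121.1.15 )
attributeTypes: ( 1.2.3.4.5.6.2 NAME 'exampleMailAlias'
  DESC 'Illustrative attribute for ASCII-only values'
  EQUALITY caseIgnoreIA5Match
  SYNTAX 1.3.6.1.4.1.1466.115.121.1.26 )
-

The rule of thumb follows from the syntaxes: reserve IA5 String with caseIgnoreIA5Match for values that are guaranteed to stay ASCII (e-mail addresses or DNS host names, for example), and use Directory String with caseIgnoreMatch whenever a value might contain characters outside the ASCII range.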
* Originally published in Novell AppNotes
Disclaimer
The origin of this information may be internal or external to Novell. While Novell makes all reasonable efforts to verify this information, Novell does not make explicit or implied claims to its validity.