Creating Attributes via LDIF
01 May 2003
Q.
Can you tell me what the difference is between caseIgnoreMatch and caseIgnoreIA5Match when creating attributes via LDIF in the schema? I did some digging and found the following about the two matching rules:
caseIgnoreMatch
  Rule type: equality
  Matching: case insensitive, space insensitive
  Syntax: Directory String (1.3.6.1.4.1.1466.115.121.1.15), a UTF-8 string

This is compared to:

caseIgnoreIA5Match
  Rule type: equality
  Matching: case insensitive, space insensitive
  Syntax: IA5 String (1.3.6.1.4.1.1466.115.121.1.26), an ASCII string
What I can interpret is that the two rules match the same way but apply to different data types: caseIgnoreIA5Match is for ASCII (IA5) string data and caseIgnoreMatch is for UTF-8 (Directory String) data. So I searched on ASCII and UTF-8, and the differences did not get any clearer. It appears that these encodings are also similar, as I found from this description on the Linux man page:
UTF-8 - an ASCII-compatible multibyte Unicode encoding
I cannot find the documentation I am looking for--one that explains the differences between these two strings and when you would use one over the other. If I get that information, I may be able to answer the client's questions.
Chancy All Charactered-out
A.
Dear Chancy: ASCII is a character encoding that uses a single byte (in fact, only seven bits) to store each character. To cover all the kinds of characters used in different parts of the world, many different "code pages" exist, but that causes trouble when storing multilingual information. Hence a new character encoding was needed to overcome that problem.
In UTF-8 (more on that below), most Western European languages need fewer than two bytes per character: text in Latin-based scripts averages about 1.1 bytes per character, while Greek, Arabic, Hebrew, and Russian average about 1.7 bytes. Japanese, Korean, and Chinese typically require three bytes per character.
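For example, the French word "café" takes five bytes in UTF-8: one byte each for c, a, and f, and two bytes (0xC3 0xA9) for é--about 1.25 bytes per character.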
The big idea of Unicode (the name suggests one unified, universal code) was to assign every character a single numeric code point, up to 32 bits wide, in one character set instead of many code pages; the UTF encodings then define how those code points are stored as bytes. But convincing the world to suddenly switch to a 32-bit-wide character format is not easy because of all the legacy software and data. So you need a compromise: enter UTF-8.
UTF-8 is a multi-byte encoding scheme in which each character is encoded in as little as one byte and as many as four bytes. Characters in the range U+0000 - U+007F are encoded as a single byte, which means the ASCII character set is represented unchanged, one byte of storage per character. The range U+0800 - U+FFFF, which takes three bytes per character, includes (among many other scripts) most Chinese, Japanese, and Korean characters.
Here's a table showing UTF-8 Bit encoding of a Unicode code point:
Character Range        Bit Encoding
U+0000 - U+007F        0xxxxxxx
U+0080 - U+07FF        110xxxxx 10xxxxxx
U+0800 - U+FFFF        1110xxxx 10xxxxxx 10xxxxxx
U+10000 - U+10FFFF     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
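For example, é (U+00E9, bits 000 1110 1001) falls in the second row: its eleven bits fill the x positions to give the two bytes 0xC3 0xA9. A CJK character such as 中 (U+4E2D) falls in the third row and becomes the three bytes 0xE4 0xB8 0xAD.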
So the power of UTF-8 is that it bridges between 32-bit Unicode and 8-bit ASCII. The other UTF encodings, UTF-16 and UTF-32, store characters as 16- or 32-bit units and hence are not byte-compatible with ASCII. So my guess is that the two matching rules behave identically as long as the data being matched is plain ASCII. But once the data is UTF-8 encoded and some characters are multi-byte, an ASCII-only rule such as caseIgnoreIA5Match can no longer match it correctly, and you need caseIgnoreMatch.
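To make this concrete, here is a sketch of how the two rules might appear in attribute definitions added via LDIF. The attribute names and OIDs below are made up purely for illustration, and the exact schema entry DN and changetype handling vary from one directory server to another:

dn: cn=schema
changetype: modify
add: attributeTypes
attributeTypes: ( 1.2.3.4.5.6.1 NAME 'exampleDisplayLabel'
  DESC 'Illustrative attribute for multilingual text'
  EQUALITY caseIgnoreMatch
  SYNTAX 1.3.6.1.4.1.1466.115.121.1.15 )
attributeTypes: ( 1.2.3.4.5.6.2 NAME 'exampleMailAlias'
  DESC 'Illustrative attribute for ASCII-only values'
  EQUALITY caseIgnoreIA5Match
  SYNTAX 1.3.6.1.4.1.1466.115.121.1.26 )
-

The rule of thumb follows from the syntaxes: reserve IA5 String with caseIgnoreIA5Match for values that are guaranteed to stay ASCII (e-mail addresses or DNS host names, for example), and use Directory String with caseIgnoreMatch whenever a value might contain characters outside the ASCII range.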
* Originally published in Novell AppNotes
Disclaimer
The origin of this information may be internal or external to Novell. While Novell makes all reasonable efforts to verify this information, Novell does not make explicit or implied claims to its validity.