Tuesday 02 September 2003
More codecvt facet renaming. I, like many people, use the terms UTF16 and UCS2 interchangeably when talking about slinging Unicode text around as wide characters, even though I know there is a difference between the two. Here's what the Unicode glossary says:
- UCS-2. ISO/IEC 10646 encoding form: Universal Character Set coded in 2 octets. (See Appendix C, Relationship to ISO/IEC 10646.)
- UTF-16 Encoding Form. The Unicode encoding form which assigns each Unicode scalar value in the ranges U+0000 .. U+D7FF and U+E000 .. U+FFFF to a single unsigned 16-bit code unit with the same numeric value as the Unicode scalar value, and which assigns each Unicode scalar value in the range U+10000 .. U+10FFFF to a surrogate pair, according to the Table 3-4, UTF-16 Bit Distribution.
- UTF-16 Encoding Scheme. The UTF-16 encoding scheme that serializes a UTF-16 code unit sequence as a byte sequence in either big-endian or little-endian formats.
All clear? No, I didn't think so either. However, with Arabica (and XML processing more generally) it's sometimes helpful to distinguish between Unicode text held as wide characters and a byte sequence encoding those wide characters. Therefore, I'm going to use UTF16 to mean a byte sequence, and UCS2 to mean a character sequence. This is still probably not quite right, but I'll take my chances.
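To make the glossary's "encoding form" definition a bit more concrete, here is a minimal sketch of how a single Unicode scalar value maps to 16-bit code units. The function name `to_utf16_code_units` is made up for illustration and is not part of Arabica; the arithmetic follows the surrogate pair rule quoted above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of Arabica: encode one Unicode scalar
// value as UTF-16 code units (the "encoding form" in glossary terms).
std::vector<std::uint16_t> to_utf16_code_units(std::uint32_t scalar)
{
  // Scalar values exclude the surrogate range and stop at U+10FFFF.
  assert(scalar <= 0x10FFFF && !(scalar >= 0xD800 && scalar <= 0xDFFF));

  // U+0000..U+D7FF and U+E000..U+FFFF: one code unit, same numeric value.
  if (scalar < 0x10000)
    return { static_cast<std::uint16_t>(scalar) };

  // U+10000..U+10FFFF: split the remaining 20 bits across a surrogate pair.
  const std::uint32_t v = scalar - 0x10000;
  return { static_cast<std::uint16_t>(0xD800 + (v >> 10)),     // high surrogate
           static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)) }; // low surrogate
}
```

For example, U+0041 stays as the single unit 0x0041, while U+1D11E becomes the pair 0xD834 0xDD1E. Nothing here says anything about byte order yet; that only matters once the code units are written out as bytes.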
Upshot of this is I've renamed
. I've also committed two new codecvts,
, which perform UCS2 to big-endian and little-endian UTF16 conversion.
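As a rough sketch of what a UCS2 to UTF16 conversion boils down to, assuming the character sequence is held as 16-bit code units: each unit is written out as two bytes, high byte first for big-endian and low byte first for little-endian. The names `ByteOrder` and `to_utf16_bytes` are hypothetical, chosen for this example; the real facets do their work behind the `std::codecvt` interface rather than through a free function like this.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical names for illustration only, not the Arabica facets.
enum class ByteOrder { BigEndian, LittleEndian };

// Serialise 16-bit code units (the "character sequence") into a byte
// sequence in the requested byte order.
std::vector<std::uint8_t> to_utf16_bytes(const std::vector<std::uint16_t>& units,
                                         ByteOrder order)
{
  std::vector<std::uint8_t> bytes;
  bytes.reserve(units.size() * 2);

  for (std::uint16_t u : units) {
    const std::uint8_t hi = static_cast<std::uint8_t>(u >> 8);
    const std::uint8_t lo = static_cast<std::uint8_t>(u & 0xFF);
    if (order == ByteOrder::BigEndian) {
      bytes.push_back(hi);  // most significant byte first
      bytes.push_back(lo);
    } else {
      bytes.push_back(lo);  // least significant byte first
      bytes.push_back(hi);
    }
  }
  return bytes;
}
```

Given the units 0x0041 0x20AC, the big-endian output is the bytes 00 41 20 AC and the little-endian output is 41 00 AC 20.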