Tuesday 02 September 2003
Arabica
More codecvt facet renaming. I, like many people, use the terms UTF16 and UCS2 interchangeably when talking about slinging Unicode text around as wide characters, even though I know there is a difference between the two. Here's what the Unicode glossary says:
- UCS-2. ISO/IEC 10646 encoding form: Universal Character Set coded in 2 octets. (See Appendix C, Relationship to ISO/IEC 10646.)
- UTF-16 Encoding Form. The Unicode encoding form which assigns each Unicode scalar value in the ranges U+0000 .. U+D7FF and U+E000 .. U+FFFF to a single unsigned 16-bit code unit with the same numeric value as the Unicode scalar value, and which assigns each Unicode scalar value in the range U+10000 .. U+10FFFF to a surrogate pair, according to the Table 3-4, UTF-16 Bit Distribution.
- UTF-16 Encoding Scheme. The UTF-16 encoding scheme that serializes a UTF-16 code unit sequence as a byte sequence in either big-endian or little-endian formats.
All clear? No, I didn't think so either. However, with Arabica (and XML processing more generally) it's sometimes helpful to distinguish between Unicode text held as wide characters, and a byte sequence encoding those wide characters. Therefore, I'm going to use UTF16 to mean a byte sequence, and UCS2 to mean a character sequence. This is still probably not quite right, but I'll take my chances.
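To make the glossary's "UTF-16 encoding form" a little more concrete, here is a minimal sketch of the rule it describes: scalar values up to U+FFFF become one 16-bit code unit, and anything from U+10000 up becomes a surrogate pair. The function name encode_utf16 is made up for illustration and is not part of Arabica.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not Arabica code): encode one Unicode scalar value
// as a sequence of UTF-16 code units.
std::vector<std::uint16_t> encode_utf16(std::uint32_t scalar)
{
  // Surrogate code points are not scalar values, so they cannot be encoded.
  if (scalar >= 0xD800 && scalar <= 0xDFFF)
    throw std::invalid_argument("surrogate code point is not a scalar value");
  if (scalar > 0x10FFFF)
    throw std::invalid_argument("value is beyond U+10FFFF");

  // U+0000..U+D7FF and U+E000..U+FFFF: one code unit, same numeric value.
  if (scalar <= 0xFFFF)
    return { static_cast<std::uint16_t>(scalar) };

  // U+10000..U+10FFFF: subtract 0x10000, then split the remaining 20 bits
  // into a high surrogate (top 10 bits) and a low surrogate (bottom 10 bits).
  const std::uint32_t v = scalar - 0x10000;
  return { static_cast<std::uint16_t>(0xD800 | (v >> 10)),
           static_cast<std::uint16_t>(0xDC00 | (v & 0x3FF)) };
}
```

The second branch is the part that a fixed two-octet encoding has no room for, which is why the two terms aren't really interchangeable.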
Upshot of this is I've renamed utf8utf16codecvt to utf8ucs2codecvt. I've also committed two new codecvts, utf16beucs2codecvt and utf16leucs2codecvt, which perform UCS2 to big-endian and little-endian UTF16 conversion.
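For the curious, the character-to-byte direction amounts to something like the sketch below. This is not the Arabica implementation (the real thing is a std::codecvt facet); the ByteOrder enum and the serialize_utf16 function are hypothetical names used only to show how each 16-bit character turns into two bytes in one order or the other.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical names for illustration only, not Arabica's API.
enum class ByteOrder { BigEndian, LittleEndian };

// Serialize 16-bit characters into a byte sequence, two bytes per character.
std::vector<std::uint8_t> serialize_utf16(const std::vector<std::uint16_t>& chars,
                                          ByteOrder order)
{
  std::vector<std::uint8_t> bytes;
  bytes.reserve(chars.size() * 2);

  for (std::uint16_t c : chars)
  {
    const std::uint8_t high = static_cast<std::uint8_t>(c >> 8);
    const std::uint8_t low  = static_cast<std::uint8_t>(c & 0xFF);

    if (order == ByteOrder::BigEndian)
    {
      bytes.push_back(high);  // most significant byte first
      bytes.push_back(low);
    }
    else
    {
      bytes.push_back(low);   // least significant byte first
      bytes.push_back(high);
    }
  }
  return bytes;
}
```

So the character sequence 0x0041 0x20AC ("A" followed by the euro sign) comes out as the bytes 00 41 20 AC in big-endian order and 41 00 AC 20 in little-endian order.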
Tagged code, arabica, xml, and c++