Tuesday 02 September, 2003
More codecvt facet renaming

I, like many people, use the terms UTF16 and UCS2 interchangeably when talking about slinging Unicode text around as wide characters, even though I know there is a difference between the two. Here's what the Unicode glossary says:
- UCS-2. ISO/IEC 10646 encoding form: Universal Character Set coded in 2 octets. (See Appendix C, Relationship to ISO/IEC 10646.)
- UTF-16 Encoding Form. The Unicode encoding form which assigns each Unicode scalar value in the ranges U+0000 .. U+D7FF and U+E000 .. U+FFFF to a single unsigned 16-bit code unit with the same numeric value as the Unicode scalar value, and which assigns each Unicode scalar value in the range U+10000 .. U+10FFFF to a surrogate pair, according to the Table 3-4, UTF-16 Bit Distribution.
- UTF-16 Encoding Scheme. The UTF-16 encoding scheme that serializes a UTF-16 code unit sequence as a byte sequence in either big-endian or little-endian formats.
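To make the difference concrete, here is a minimal sketch of the UTF-16 encoding form as the glossary describes it. The function name encode_utf16 is made up for illustration and is not part of any of the codecvt facets. Anything below U+10000 comes out as a single 16-bit code unit, which is the part UCS2 and UTF16 have in common; anything above needs a surrogate pair, which UCS2 cannot express.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper for illustration only.
// Encodes one Unicode scalar value as one or two UTF-16 code units.
std::vector<std::uint16_t> encode_utf16(std::uint32_t scalar)
{
  // Surrogate code points are not scalar values, so they cannot be encoded.
  if (scalar >= 0xD800 && scalar <= 0xDFFF)
    throw std::invalid_argument("surrogate code point is not a scalar value");
  if (scalar > 0x10FFFF)
    throw std::invalid_argument("value is beyond U+10FFFF");

  // U+0000..U+D7FF and U+E000..U+FFFF: one code unit with the same value.
  if (scalar < 0x10000)
    return { static_cast<std::uint16_t>(scalar) };

  // U+10000..U+10FFFF: split the 20-bit offset across a surrogate pair.
  const std::uint32_t offset = scalar - 0x10000;
  return { static_cast<std::uint16_t>(0xD800 + (offset >> 10)),     // high surrogate
           static_cast<std::uint16_t>(0xDC00 + (offset & 0x3FF)) }; // low surrogate
}
```

For example, U+0041 encodes as the single unit 0x0041, while U+1F600 encodes as the pair 0xD83D 0xDE00.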
The upshot of this is that I've renamed utf8utf16codecvt to utf8ucs2codecvt. I've also committed two new codecvts, utf16beucs2codecvt and utf16leucs2codecvt, which perform UCS2 to big-endian and little-endian UTF16 conversion.
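As a rough picture of what the byte-order half of that conversion involves, here is a sketch of serialising 16-bit code units in either byte order. It is not the code in the facets themselves; serialize_utf16 and ByteOrder are names invented for this example, and a real codecvt facet would do the same work through its do_out and do_in members.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical names for illustration only.
enum class ByteOrder { Big, Little };

// Writes each 16-bit code unit as two bytes in the requested byte order.
std::vector<std::uint8_t> serialize_utf16(const std::vector<std::uint16_t>& units,
                                          ByteOrder order)
{
  std::vector<std::uint8_t> bytes;
  bytes.reserve(units.size() * 2);

  for (std::uint16_t unit : units)
  {
    const std::uint8_t high = static_cast<std::uint8_t>(unit >> 8);
    const std::uint8_t low = static_cast<std::uint8_t>(unit & 0xFF);

    if (order == ByteOrder::Big)
    {
      bytes.push_back(high); // most significant byte first
      bytes.push_back(low);
    }
    else
    {
      bytes.push_back(low);  // least significant byte first
      bytes.push_back(high);
    }
  }
  return bytes;
}
```

So the code unit 0x20AC becomes the bytes 0x20 0xAC in big-endian order and 0xAC 0x20 in little-endian order.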
