Convert UTF-8 character data into a std::wstring without using a 32-bit integer

  8-bit, c++, unicode, utf-8

This routine puts UTF-8 data into a std::wstring, decoding as it goes.
It assumes the encoding is valid, which is okay for my purposes.

It’s not originally my code; I got it from another SO question, Mark Ransom’s wonderful answer here: UTF8 to/from wide char conversion in STL, and I intend to adapt it. The problem with this
code is that it is incompatible with the 8-bit processors I am targeting.

I don’t care about the wstring return value; I’ll be replacing it with a 4-byte buffer once I get the thing working. I just need the bit-twiddling part.

Specifically, I need the codepoint split into two 16-bit numbers, because 8-bit C++ will not use a 32-bit int. See the comments in the code.

I just don’t have the head for this conversion; it’s a little beyond me. I could probably manage it with enough time and trial and error, but I was hoping to find someone with a knack for this stuff who could retool the code to use no ints larger than 16 bits. I’ve put my own rough, untested attempt after the code below.

#include <string>

std::wstring UTF8_to_wchar(const char * in)
{
    std::wstring out;
    // this code assumes a 32-bit int, but int will be 16-bit on an 8-bit processor, I think.
    // it needs to be split into two uint16_t values, like:
    // uint16_t codepointHi = 0;
    // uint16_t codepointLo;
    // something like that
    unsigned int codepoint = 0; // initialized so a stray leading continuation byte can't read garbage
    while (*in != 0)
    {
        unsigned char ch = static_cast<unsigned char>(*in);
        if (ch <= 0x7f)
            codepoint = ch;
        else if (ch <= 0xbf)
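            // continuation byte 10xxxxxx: shift the accumulated value left 6 and merge in 6 more payload bits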
            // this is rough for me to convert:
            codepoint = (codepoint << 6) | (ch & 0x3f);
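        // lead bytes: a 2-byte sequence keeps 5 payload bits from its lead byte,
        // a 3-byte sequence keeps 4, and a 4-byte sequence keeps 3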
        else if (ch <= 0xdf)
            codepoint = ch & 0x1f;
        else if (ch <= 0xef)
            codepoint = ch & 0x0f;
        else
            codepoint = ch & 0x07;
        ++in;
        // this is where things get particularly confusing
        // and I start getting really lost:
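        // lookahead: the next byte is not a continuation byte (10xxxxxx),
        // so the codepoint is complete; emit it if it's in range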
        if (((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff))
        {
            // yay an easy part to convert.
            if (codepoint > 0xffff)
            {
                // i'm crying a little
                codepoint -= 0x10000; // subtract the surrogate bias before splitting the pair
                out.append(1, static_cast<wchar_t>(0xd800 + (codepoint >> 10)));
                out.append(1, static_cast<wchar_t>(0xdc00 + (codepoint & 0x03ff)));
            }
            else if (codepoint < 0xd800 || codepoint >= 0xe000) // not so bad
                out.append(1, static_cast<wchar_t>(codepoint));
        }
    }
    return out;
}
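
For reference, here’s the rough shape of what I’m imagining, with the codepoint carried in two uint16_t halves (codepointHi holds bits 16-20, codepointLo holds bits 0-15, so conceptually codepoint == (codepointHi << 16) | codepointLo). The UTF8_to_wchar16 name is just mine, and none of this is tested on the 8-bit target, so treat it as a sketch of what I’m after, not a working answer:

#include <cstdint>
#include <string>

std::wstring UTF8_to_wchar16(const char * in)
{
    std::wstring out;
    uint16_t codepointHi = 0; // bits 16-20 of the codepoint (at most 0x10)
    uint16_t codepointLo = 0; // bits 0-15 of the codepoint
    while (*in != 0)
    {
        unsigned char ch = static_cast<unsigned char>(*in);
        if (ch <= 0x7f)
        {
            codepointHi = 0;
            codepointLo = ch;
        }
        else if (ch <= 0xbf)
        {
            // (codepoint << 6) | (ch & 0x3f), done in two 16-bit halves:
            // the 6 bits shifted off the top of the low half move into
            // the high half, so the high half must be updated first
            codepointHi = static_cast<uint16_t>((codepointHi << 6) | (codepointLo >> 10));
            codepointLo = static_cast<uint16_t>((codepointLo << 6) | (ch & 0x3f));
        }
        else if (ch <= 0xdf) { codepointHi = 0; codepointLo = ch & 0x1f; }
        else if (ch <= 0xef) { codepointHi = 0; codepointLo = ch & 0x0f; }
        else                 { codepointHi = 0; codepointLo = ch & 0x07; }
        ++in;
        // codepoint <= 0x10ffff reduces to codepointHi <= 0x10, because
        // the low half can never exceed 0xffff
        if (((*in & 0xc0) != 0x80) && codepointHi <= 0x10)
        {
            if (codepointHi != 0) // codepoint > 0xffff
            {
                // the "codepoint -= 0x10000" bias only touches the high half
                --codepointHi;
                out.append(1, static_cast<wchar_t>(
                    0xd800 + ((codepointHi << 6) | (codepointLo >> 10))));
                out.append(1, static_cast<wchar_t>(0xdc00 + (codepointLo & 0x03ff)));
            }
            else if (codepointLo < 0xd800 || codepointLo >= 0xe000)
                out.append(1, static_cast<wchar_t>(codepointLo));
        }
    }
    return out;
}

If that’s even close, the part I’m least sure of is whether the shifts behave on a target where int is 16 bits and uint16_t promotes to unsigned int.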

Thanks in advance!
