How to use ANTLR4 to parse a string containing 3-byte UTF-8 characters

  antlr, antlr4, c++, utf-8

Below is my grammar file.

grammar My;

tokens {
    DELIMITER
}

string:SINGLE_QUOTED_TEXT;

SINGLE_QUOTED_TEXT: (
        '\'' (.)*? '\''
    )+
;

I’m trying to use this to accept any string (it’s actually part of MySQL’s g4 grammar).
Then I use this code to test it:

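#include "antlr4-runtime.h"
#include "MyLexer.h"
#include "MyParser.h"
#include <string>
using namespace My;

int main()
{
    // Single-quoted string containing one 3-byte UTF-8 character.
    std::string s = "'中'";

    // Feed the string to the generated lexer.
    antlr4::ANTLRInputStream input(s);
    MyLexer lexer(&input);

    // Run the parser on the resulting token stream.
    antlr4::CommonTokenStream tokens(&lexer);
    MyParser parser(&tokens);

    // Entry rule from the grammar: string : SINGLE_QUOTED_TEXT ;
    parser.string();

    return 0;
}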

Result is:
[screenshot of the program output]

The UTF-8 encoding of the Chinese character 中 is 3 bytes: \xe4 \xb8 \xad

Both the grammar file and the C++ source file are encoded in UTF-8.
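One way to confirm which bytes actually end up in the std::string (and therefore reach the lexer) is to dump them as hex. This is only a minimal sketch: hex_bytes and check_quoted_zhong are hypothetical helper names, not part of ANTLR, and the expected value assumes the literal really is stored as UTF-8.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of ANTLR): returns the bytes of s as
// space-separated lowercase hex.
std::string hex_bytes(const std::string& s) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (unsigned char c : s) {
        if (!out.empty()) out += ' ';
        out += digits[c >> 4];    // high nibble
        out += digits[c & 0x0f];  // low nibble
    }
    return out;
}

// Hypothetical check: asserts that s is a single-quoted 中 encoded as UTF-8,
// i.e. the bytes 27 e4 b8 ad 27. The assert is compiled out if NDEBUG is defined.
void check_quoted_zhong(const std::string& s) {
    assert(hex_bytes(s) == "27 e4 b8 ad 27");
}
```

Calling hex_bytes(s) on the s from the test program, before constructing the ANTLRInputStream, shows whether the compiler kept the literal as UTF-8. If the character comes out as something other than e4 b8 ad, the literal was probably re-encoded at compile time. With MSVC, compiling with the /utf-8 option or writing the literal with explicit escapes ("'\xe4\xb8\xad'") may be worth trying.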
What can I do to make this work?

Source: Windows Questions C++
