    Handle universal character names and Unicode characters outside of literals. · c7629d94
    Jordan Rose authored
    This is a missing piece for C99 conformance.
    
    This patch handles UCNs by adding a '\\' case to LexTokenInternal and
    LexIdentifier -- if we see a backslash, we tentatively try to read in a UCN.
    If the UCN is not syntactically well-formed, we fall back to the old
    treatment: a backslash followed by an identifier beginning with 'u' (or 'U').
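
As a rough illustration of the tentative read described above, here is a minimal standalone sketch. It is not the code from this patch: `UCN` and `tryReadUCN` are hypothetical names, and the sketch checks only the syntactic shape (`\u` plus four hex digits, or `\U` plus eight). When the function returns `std::nullopt`, the caller would fall back to treating the backslash as its own token.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical result type: the decoded code point and how many source
// characters the UCN occupied (6 for \uXXXX, 10 for \UXXXXXXXX).
struct UCN {
  std::uint32_t codePoint;
  std::size_t length;
};

// Tentatively read a UCN starting at src[pos]. Returns std::nullopt if the
// text there is not a syntactically well-formed UCN, so the caller can fall
// back to lexing a lone backslash followed by an ordinary identifier.
std::optional<UCN> tryReadUCN(std::string_view src, std::size_t pos) {
  if (pos + 1 >= src.size() || src[pos] != '\\')
    return std::nullopt;

  std::size_t digits = 0;
  if (src[pos + 1] == 'u')
    digits = 4;
  else if (src[pos + 1] == 'U')
    digits = 8;
  else
    return std::nullopt;

  if (pos + 2 + digits > src.size())
    return std::nullopt;

  std::uint32_t value = 0;
  for (std::size_t i = 0; i < digits; ++i) {
    char c = src[pos + 2 + i];
    std::uint32_t d = 0;
    if (c >= '0' && c <= '9')
      d = static_cast<std::uint32_t>(c - '0');
    else if (c >= 'a' && c <= 'f')
      d = static_cast<std::uint32_t>(c - 'a' + 10);
    else if (c >= 'A' && c <= 'F')
      d = static_cast<std::uint32_t>(c - 'A' + 10);
    else
      return std::nullopt; // not a hex digit: not a well-formed UCN
    value = (value << 4) | d;
  }
  return UCN{value, 2 + digits};
}
```

With this sketch, `\unit` is rejected because `n` is not a hex digit, which corresponds to the fallback case of a backslash followed by an identifier beginning with 'u'.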
    
    Because the spelling of an identifier with UCNs still has the UCN in it, we
    need to convert that to UTF-8 in Preprocessor::LookUpIdentifierInfo.
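
The conversion step can be pictured with the following sketch, which rewrites each well-formed UCN in a spelling as UTF-8. The names `appendUTF8`, `hexValue`, and `expandUCNs` are hypothetical and are not taken from Preprocessor::LookUpIdentifierInfo. The sketch also assumes the code point is a valid scalar value and does no surrogate or range checking.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Append the UTF-8 encoding of cp to out. Assumes cp is a valid Unicode
// scalar value (no surrogate or > 0x10FFFF checks in this sketch).
void appendUTF8(std::string &out, std::uint32_t cp) {
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Return the value of a hex digit, or -1 if c is not one.
int hexValue(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

// Replace every well-formed \uXXXX or \UXXXXXXXX in an identifier spelling
// with its UTF-8 encoding; all other characters are copied unchanged.
std::string expandUCNs(std::string_view spelling) {
  std::string out;
  std::size_t i = 0;
  while (i < spelling.size()) {
    std::size_t digits = 0;
    if (spelling[i] == '\\' && i + 1 < spelling.size()) {
      if (spelling[i + 1] == 'u') digits = 4;
      else if (spelling[i + 1] == 'U') digits = 8;
    }
    bool ok = digits != 0 && i + 2 + digits <= spelling.size();
    std::uint32_t cp = 0;
    for (std::size_t k = 0; ok && k < digits; ++k) {
      int v = hexValue(spelling[i + 2 + k]);
      if (v < 0) ok = false;
      else cp = (cp << 4) | static_cast<std::uint32_t>(v);
    }
    if (ok) {
      appendUTF8(out, cp);
      i += 2 + digits;
    } else {
      out += spelling[i];
      ++i;
    }
  }
  return out;
}
```

Under these assumptions, the spelling `caf\u00E9` and the same identifier written directly in UTF-8 map to the same byte sequence, which is what lets both forms resolve to a single identifier entry.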
    
    Of course, valid code that does *not* use UCNs will see only a very minimal
    performance hit (checks after each identifier for non-ASCII characters,
    checks when converting raw_identifiers to identifiers that they do not
    contain UCNs, and checks when getting the spelling of an identifier that it
    does not contain a UCN).
    
    This patch also adds basic support for actual UTF-8 in the source. This is
    treated almost exactly the same as UCNs except that we consider stray
    Unicode characters to be mistakes and offer a fixit to remove them.
    
    git-svn-id: https://llvm.org/svn/llvm-project/cfe/trunk@173369 91177308-0d34-0410-b5e6-96231b3b80d8