CppStd:lex

From emmtrix Wiki
Jump to navigation Jump to search
[edit]

Source: https://timsong-cpp.github.io/cppwp/n3337/lex

List of Tables [tab]
List of Figures [fig]
1 General [intro]
2 Lexical conventions [lex]
2.1 Separate translation [lex.separate]
2.2 Phases of translation [lex.phases]
2.3 Character sets [lex.charset]
2.4 Trigraph sequences [lex.trigraph]
2.5 Preprocessing tokens [lex.pptoken]
2.6 Alternative tokens [lex.digraph]
2.7 Tokens [lex.token]
2.8 Comments [lex.comment]
2.9 Header names [lex.header]
2.10 Preprocessing numbers [lex.ppnumber]
2.11 Identifiers [lex.name]
2.12 Keywords [lex.key]
2.13 Operators and punctuators [lex.operators]
2.14 Literals [lex.literal]
3 Basic concepts [basic]
4 Standard conversions [conv]
5 Expressions [expr]
6 Statements [stmt.stmt]
7 Declarations [dcl.dcl]
8 Declarators [dcl.decl]
9 Classes [class]
10 Derived classes [class.derived]
11 Member access control [class.access]
12 Special member functions [special]
13 Overloading [over]
14 Templates [temp]
15 Exception handling [except]
16 Preprocessing directives [cpp]
17 Library introduction [library]
18 Language support library [language.support]
19 Diagnostics library [diagnostics]
20 General utilities library [utilities]
21 Strings library [strings]
22 Localization library [localization]
23 Containers library [containers]
24 Iterators library [iterators]
25 Algorithms library [algorithms]
26 Numerics library [numerics]
27 Input/output library [input.output]
28 Regular expressions library [re]
29 Atomic operations library [atomics]
30 Thread support library [thread]
Annex A (informative) Grammar summary [gram]
Annex B (informative) Implementation quantities [implimits]
Annex C (informative) Compatibility [diff]
Annex D (normative) Compatibility features [depr]
Annex E (normative) Universal character names for identifier characters [charname]
Index [generalindex]
Index of library names [libraryindex]
Index of implementation-defined behavior [impldefindex]

2.1 Separate translation [lex.separate]

1 The text of the program is kept in units called source files in this International Standard. A source file together with all the headers ([headers]) and source files included ([cpp.include]) via the preprocessing directive #include, less any source lines skipped by any of the conditional inclusion ([cpp.cond]) preprocessing directives, is called a translation unit. NoteA C++ program need not all be translated at the same time.
2 NotePreviously translated translation units and instantiation units can be preserved individually or in libraries. The separate translation units of a program communicate ([basic.link]) by (for example) calls to functions whose identifiers have external linkage, manipulation of objects whose identifiers have external linkage, or manipulation of data files. Translation units can be separately translated and then later linked to produce an executable program ([basic.link]).

2.2 Phases of translation [lex.phases]

1 The precedence among the syntax rules of translation is specified by the following phases.[cpp11 1]
  1. Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. Trigraph sequences ([lex.trigraph]) are replaced by corresponding single-character internal representations. Any source file character not in the basic source character set ([lex.charset]) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)
  2. Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines. Only the last backslash on any physical source line shall be eligible for being part of such a splice. If, as a result, a character sequence that matches the syntax of a universal-character-name is produced, the behavior is undefined. A source file that is not empty and that does not end in a new-line character, or that ends in a new-line character immediately preceded by a backslash character before any such splicing takes place, shall be processed as if an additional new-line character were appended to the file.
  3. The source file is decomposed into preprocessing tokens ([lex.pptoken]) and sequences of white-space characters (including comments). A source file shall not end in a partial preprocessing token or in a partial comment.[cpp11 2] Each comment is replaced by one space character. New-line characters are retained. Whether each nonempty sequence of white-space characters other than new-line is retained or replaced by one space character is unspecified. The process of dividing a source file's characters into preprocessing tokens is context-dependent.
    Examplesee the handling of < within a #include preprocessing directive.
  4. Preprocessing directives are executed, macro invocations are expanded, and _Pragma unary operator expressions are executed. If a character sequence that matches the syntax of a universal-character-name is produced by token concatenation ([cpp.concat]), the behavior is undefined. A #include preprocessing directive causes the named header or source file to be processed from phase 1 through phase 4, recursively. All preprocessing directives are then deleted.
  5. Each source character set member in a character literal or a string literal, as well as each escape sequence and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set ([lex.ccon], [lex.string]); if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.[cpp11 3]
  6. Adjacent string literal tokens are concatenated.
  7. White-space characters separating tokens are no longer significant. Each preprocessing token is converted into a token. ([lex.token]). The resulting tokens are syntactically and semantically analyzed and translated as a translation unit. NoteThe process of analyzing and translating the tokens may occasionally result in one token being replaced by a sequence of other tokens ([temp.names]). NoteSource files, translation units and translated translation units need not necessarily be stored as files, nor need there be any one-to-one correspondence between these entities and any external representation. The description is conceptual only, and does not specify any particular implementation.
  8. Translated translation units and instantiation units are combined as follows: NoteSome or all of these may be supplied from a library. Each translated translation unit is examined to produce a list of required instantiations. NoteThis may include instantiations which have been explicitly requested ([temp.explicit]). The definitions of the required templates are located. It is implementation-defined whether the source of the translation units containing these definitions is required to be available. NoteAn implementation could encode sufficient information into the translated translation unit so as to ensure the source is not required here. All the required instantiations are performed to produce instantiation units. NoteThese are similar to translated translation units, but contain no references to uninstantiated templates and no template definitions. The program is ill-formed if any instantiation fails.
  9. All external entity references are resolved. Library components are linked to satisfy external references to entities not defined in the current translation. All such translator output is collected into a program image which contains information needed for execution in its execution environment.

2.3 Character sets [lex.charset]

1 The basic source character set consists of 96 characters: the space character, the control characters representing horizontal tab, vertical tab, form feed, and new-line, plus the following 91 graphical characters:[cpp11 4]
2 The universal-character-name construct provides a way to name other characters.

universal-character-name:

\u hex-quad
\U hex-quad hex-quad

The character designated by the universal-character-name \UNNNNNNNN is that character whose character short name in ISO/IEC 10646 is NNNNNNNN; the character designated by the universal-character-name \uNNNN is that character whose character short name in ISO/IEC 10646 is 0000NNNN. If the hexadecimal value for a universal-character-name corresponds to a surrogate code point (in the range 0xD800–0xDFFF, inclusive), the program is ill-formed. Additionally, if the hexadecimal value for a universal-character-name outside the c-char-sequence, s-char-sequence, or r-char-sequence of a character or string literal corresponds to a control character (in either of the ranges 0x00–0x1F or 0x7F–0x9F, both inclusive) or to a character in the basic source character set, the program is ill-formed.[cpp11 5]

3 The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose representation has all zero bits. For each basic execution character set, the values of the members shall be non-negative and distinct from one another. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous. The execution character set and the execution wide-character set are implementation-defined supersets of the basic execution character set and the basic execution wide-character set, respectively. The values of the members of the execution character sets and the sets of additional members are locale-specific.

2.4 Trigraph sequences [lex.trigraph]

1 Before any other processing takes place, each occurrence of one of the following sequences of three characters (“trigraph sequences”) is replaced by the single character indicated in Table [tab:trigraph.sequences].
Table 1 — Trigraph sequences
Trigraph Replacement Trigraph Replacement Trigraph Replacement
??= # ??( [ ??< {
??/ \ ??) ] ??> }
??' ^ ??! | ??- ~
2
Example
??=define arraycheck(a,b) a??(b??) ??!??! b??(a??)

becomes

#define arraycheck(a,b) a[b] || b[a]
3 No other trigraph sequence exists. Each ? that does not begin one of the trigraphs listed above is not changed.

2.5 Preprocessing tokens [lex.pptoken]

1 Each preprocessing token that is converted to a token ([lex.token]) shall have the lexical form of a keyword, an identifier, a literal, an operator, or a punctuator.
2 A preprocessing token is the minimal lexical element of the language in translation phases 3 through 6. The categories of preprocessing token are: header names, identifiers, preprocessing numbers, character literals (including user-defined character literals), string literals (including user-defined string literals), preprocessing operators and punctuators, and single non-white-space characters that do not lexically match the other preprocessing token categories. If a ' or a " character matches the last category, the behavior is undefined. Preprocessing tokens can be separated by white space; this consists of comments ([lex.comment]), or white-space characters (space, horizontal tab, new-line, vertical tab, and form-feed), or both. As described in Clause [cpp], in certain circumstances during translation phase 4, white space (or the absence thereof) serves as more than preprocessing token separation. White space can appear within a preprocessing token only as part of a header name or between the quotation characters in a character literal or string literal.
3 If the input stream has been parsed into preprocessing tokens up to a given character:
(3.1)
  • If the next character begins a sequence of characters that could be the prefix and initial double quote of a raw string literal, such as R", the next preprocessing token shall be a raw string literal. Between the initial and final double quote characters of the raw string, any transformations performed in phases 1 and 2 (trigraphs, universal-character-names, and line splicing) are reverted; this reversion shall apply before any d-char, r-char, or delimiting parenthesis is identified. The raw string literal is defined as the shortest sequence of characters that matches the raw-string pattern

(3.2)
  • Otherwise, if the next three characters are <:: and the subsequent character is neither : nor >, the < is treated as a preprocessor token by itself and not as the first character of the alternative token <:.

(3.3)
  • Otherwise, the next preprocessing token is the longest sequence of characters that could constitute a preprocessing token, even if that would cause further lexical analysis to fail.

    Example
    #define R "x"
    const char* s = R"y";           // ill-formed raw string, not "x" "y"
    
4
ExampleThe program fragment 1Ex is parsed as a preprocessing number token (one that is not a valid floating or integer literal token), even though a parse as the pair of preprocessing tokens 1 and Ex might produce a valid expression (for example, if Ex were a macro defined as +1). Similarly, the program fragment 1E1 is parsed as a preprocessing number (one that is a valid floating literal token), whether or not E is a macro name.
5
ExampleThe program fragment x+++++y is parsed as x ++ ++ + y, which, if x and y have integral types, violates a constraint on increment operators, even though the parse x ++ + ++ y might yield a correct expression.

2.6 Alternative tokens [lex.digraph]

1 Alternative token representations are provided for some operators and punctuators.[cpp11 6]
2 In all respects of the language, each alternative token behaves the same, respectively, as its primary token, except for its spelling.[cpp11 7] The set of alternative tokens is defined in Table [tab:alternative.tokens].
Table 2 — Alternative tokens
Alternative Primary Alternative Primary Alternative Primary
<% { and && and_eq &=
%> } bitor | or_eq |=
<: [ or || xor_eq ^=
:> ] xor ^ not !
%: # compl ~ not_eq !=
%:%: ## bitand &

2.7 Tokens [lex.token]

token:

identifier keyword literal
operator punctuator
1 There are five kinds of tokens: identifiers, keywords, literals,[cpp11 8] operators, and other separators. Blanks, horizontal and vertical tabs, newlines, formfeeds, and comments (collectively, “white space”), as described below, are ignored except as they serve to separate tokens. NoteSome white space is required to separate otherwise adjacent identifiers, keywords, numeric literals, and alternative tokens containing alphabetic characters.

2.8 Comments [lex.comment]

1 The characters /* start a comment, which terminates with the characters */. These comments do not nest. The characters // start a comment, which terminates with the next new-line character. If there is a form-feed or a vertical-tab character in such a comment, only white-space characters shall appear between it and the new-line that terminates the comment; no diagnostic is required. NoteThe comment characters //, /*, and */ have no special meaning within a // comment and are treated just like other characters. Similarly, the comment characters // and /* have no special meaning within a /* comment.

2.9 Header names [lex.header]

header-name:

< h-char-sequence >
" q-char-sequence "

h-char-sequence:

h-char
h-char-sequence h-char

h-char:

any member of the source character set except new-line and >

q-char-sequence:

q-char
q-char-sequence q-char

q-char:

any member of the source character set except new-line and "
1 Header name preprocessing tokens shall only appear within a #include preprocessing directive ([cpp.include]). The sequences in both forms of header-names are mapped in an implementation-defined manner to headers or to external source file names as specified in [cpp.include].
2 The appearance of either of the characters ' or \ or of either of the character sequences /* or // in a q-char-sequence or an h-char-sequence is conditionally supported with implementation-defined semantics, as is the appearance of the character " in an h-char-sequence.[cpp11 9]

2.10 Preprocessing numbers [lex.ppnumber]

1 Preprocessing number tokens lexically include all integral literal tokens ([lex.icon]) and all floating literal tokens ([lex.fcon]).
2 A preprocessing number does not have a type or a value; it acquires both after a successful conversion to an integral literal token or a floating literal token.

2.11 Identifiers [lex.name]

identifier-nondigit:

nondigit
universal-character-name
other implementation-defined characters

nondigit: one of

a b c d e f g h i j k l m
n o p q r s t u v w x y z
A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z _

digit: one of

0 1 2 3 4 5 6 7 8 9
1 An identifier is an arbitrarily long sequence of letters and digits. Each universal-character-name in an identifier shall designate a character whose encoding in ISO 10646 falls into one of the ranges specified in [charname.allowed]. The initial element shall not be a universal-character-name designating a character whose encoding falls into one of the ranges specified in [charname.disallowed]. Upper- and lower-case letters are different. All characters are significant.[cpp11 10]
2 The identifiers in Table [tab:identifiers.special] have a special meaning when appearing in a certain context. When referred to in the grammar, these identifiers are used explicitly rather than using the identifier grammar production. any ambiguity as to whether a given identifier has a special meaning is resolved to interpret the token as a regular identifier.
Table 3 — Identifiers with special meaning
override final
3 In addition, some identifiers are reserved for use by C++ implementations and standard libraries ([global.names]) and shall not be used otherwise; no diagnostic is required.

2.12 Keywords [lex.key]

1 The identifiers shown in Table [tab:keywords] are reserved for use as keywords (that is, they are unconditionally treated as keywords in phase 7) except in an attribute-token ([dcl.attr.grammar]) NoteThe export keyword is unused but is reserved for future use.:
Table 4 — Keywords
alignas continue friend register true
alignof decltype goto reinterpret_cast try
asm default if return typedef
auto delete inline short typeid
bool do int signed typename
break double long sizeof union
case dynamic_cast mutable static unsigned
catch else namespace static_assert using
char enum new static_cast virtual
char16_t explicit noexcept struct void
char32_t export nullptr switch volatile
class extern operator template wchar_t
const false private this while
constexpr float protected thread_local
const_cast for public throw
2 Furthermore, the alternative representations shown in Table [tab:alternative.representations] for certain operators and punctuators ([lex.digraph]) are reserved and shall not be used otherwise:
Table 5 — Alternative representations
and and_eq bitand bitor compl not
not_eq or or_eq xor xor_eq

2.13 Operators and punctuators [lex.operators]

1 The lexical representation of C++ programs includes a number of preprocessing tokens which are used in the syntax of the preprocessor or are converted into tokens for operators and punctuators:

preprocessing-op-or-punc: one of { } [ ] # ## ( ) <: :> <% %> %: %:%: ; : ... new delete ? :: . .* + - * / % ^ & | ~ ! = < > += -= *= /= %= ^= &= |= << >> >>= <<= == != <= >= && || ++ -- , ->* -> and and_eq bitand bitor compl not not_eq or or_eq xor xor_eq

Each preprocessing-op-or-punc is converted to a single token in translation phase 7 ([lex.phases]).

2.14.1 Kinds of literals [lex.literal.kinds]

1 There are several kinds of literals.[cpp11 11]

2.14.2 Integer literals [lex.icon]

octal-literal:

0
octal-literal octal-digit

nonzero-digit: one of

1 2 3 4 5 6 7 8 9

octal-digit: one of

0 1 2 3 4 5 6 7

hexadecimal-digit: one of

0 1 2 3 4 5 6 7 8 9
a b c d e f
A B C D E F

unsigned-suffix: one of

u U

long-suffix: one of

l L

long-long-suffix: one of

ll LL
1 An integer literal is a sequence of digits that has no period or exponent part. An integer literal may have a prefix that specifies its base and a suffix that specifies its type. The lexically first digit of the sequence of digits is the most significant. A decimal integer literal (base ten) begins with a digit other than 0 and consists of a sequence of decimal digits. An octal integer literal (base eight) begins with the digit 0 and consists of a sequence of octal digits.[cpp11 12] A hexadecimal integer literal (base sixteen) begins with 0x or 0X and consists of a sequence of hexadecimal digits, which include the decimal digits and the letters a through f and A through F with decimal values ten through fifteen.
Examplethe number twelve can be written 12, 014, or 0XC.
2 The type of an integer literal is the first of the corresponding list in Table [tab:lex.type.integer.constant] in which its value can be represented.
Table 6 — Types of integer constants
Suffix Decimal constants Octal or hexadecimal constant
none int int
long int unsigned int
long long int long int
unsigned long int
long long int
unsigned long long int
u or U unsigned int unsigned int
unsigned long int unsigned long int
unsigned long long int unsigned long long int
l or L long int long int
long long int unsigned long int
long long int
unsigned long long int
Both u or U unsigned long int unsigned long int
and l or L unsigned long long int unsigned long long int
ll or LL long long int long long int
unsigned long long int
Both u or U unsigned long long int unsigned long long int
and ll or LL
3 If an integer literal cannot be represented by any type in its list and an extended integer type ([basic.fundamental]) can represent its value, it may have that extended integer type. If all of the types in the list for the literal are signed, the extended integer type shall be signed. If all of the types in the list for the literal are unsigned, the extended integer type shall be unsigned. If the list contains both signed and unsigned types, the extended integer type may be signed or unsigned. A program is ill-formed if one of its translation units contains an integer literal that cannot be represented by any of the allowed types.

2.14.3 Character literals [lex.ccon]

character-literal:

' c-char-sequence ' u' c-char-sequence ' U' c-char-sequence ' L' c-char-sequence '

c-char-sequence:

c-char
c-char-sequence c-char

c-char:

any member of the source character set except
the single-quote ', backslash \, or new-line character
escape-sequence
universal-character-name

simple-escape-sequence: one of

\'  \"  \?  \\
\a  \b  \f  \n  \r  \t  \v
1 A character literal is one or more characters enclosed in single quotes, as in 'x', optionally preceded by one of the letters u, U, or L, as in u'y', U'z', or L'x', respectively. A character literal that does not begin with u, U, or L is an ordinary character literal, also referred to as a narrow-character literal. An ordinary character literal that contains a single c-char has type char, with value equal to the numerical value of the encoding of the c-char in the execution character set. An ordinary character literal that contains more than one c-char is a multicharacter literal. A multicharacter literal has type int

and implementation-defined value.

2 A character literal that begins with the letter u, such as u'y', is a character literal of type char16_t. The value of a char16_t literal containing a single c-char is equal to its ISO 10646 code point value, provided that the code point is representable with a single 16-bit code unit. (That is, provided it is a basic multi-lingual plane code point.) If the value is not representable within 16 bits, the program is ill-formed. A char16_t literal containing multiple c-chars is ill-formed. A character literal that begins with the letter U, such as U'z', is a character literal of type char32_t. The value of a char32_t literal containing a single c-char is equal to its ISO 10646 code point value. A char32_t literal containing multiple c-chars is ill-formed. A character literal that begins with the letter L, such as L'x', is a wide-character literal. A wide-character literal has type wchar_t.[cpp11 13] The value of a wide-character literal containing a single c-char has value equal to the numerical value of the encoding of the c-char in the execution wide-character set, unless the c-char has no representation in the execution wide-character set, in which case the value is implementation-defined. NoteThe type wchar_t is able to represent all members of the execution wide-character set (see [basic.fundamental]).. The value of a wide-character literal containing multiple c-chars is implementation-defined.
3 Certain nongraphic characters, the single quote ', the double quote ", the question mark ?,[cpp11 14] and the backslash \, can be represented according to Table [tab:escape.sequences]. The double quote " and the question mark ?, can be represented as themselves or by the escape sequences \" and \? respectively, but the single quote ' and the backslash \ shall be represented by the escape sequences \' and \\ respectively. Escape sequences in which the character following the backslash is not listed in Table [tab:escape.sequences] are conditionally-supported, with implementation-defined semantics. An escape sequence specifies a single character.
Table 7 — Escape sequences
new-line NL(LF) \n
horizontal tab HT \t
vertical tab VT \v
backspace BS \b
carriage return CR \r
form feed FF \f
alert BEL \a
backslash \ \\
question mark ? \?
single quote ' \'
double quote " \"
octal number ooo \ooo
hex number hhh \xhhh
4 The escape \ooo consists of the backslash followed by one, two, or three octal digits that are taken to specify the value of the desired character. The escape \xhhh consists of the backslash followed by x followed by one or more hexadecimal digits that are taken to specify the value of the desired character. There is no limit to the number of digits in a hexadecimal sequence. A sequence of octal or hexadecimal digits is terminated by the first character that is not an octal digit or a hexadecimal digit, respectively. The value of a character literal is implementation-defined if it falls outside of the implementation-defined range defined for char (for literals with no prefix), char16_t (for literals prefixed by 'u'), char32_t (for literals prefixed by 'U'), or wchar_t (for literals prefixed by 'L').
5 A universal-character-name is translated to the encoding, in the appropriate execution character set, of the character named. If there is no such encoding, the universal-character-name is translated to an implementation-defined encoding. NoteIn translation phase 1, a universal-character-name is introduced whenever an actual extended character is encountered in the source text. Therefore, all extended characters are described in terms of universal-character-names. However, the actual compiler implementation may use its own native character set, so long as the same results are obtained.

2.14.4 Floating literals [lex.fcon]

fractional-constant:

digit-sequenceopt . digit-sequence
digit-sequence .

exponent-part:

e signopt digit-sequence
E signopt digit-sequence

sign: one of

+ -

digit-sequence:

digit
digit-sequence digit

floating-suffix: one of

f l F L
1 A floating literal consists of an integer part, a decimal point, a fraction part, an e or E, an optionally signed integer exponent, and an optional type suffix. The integer and fraction parts both consist of a sequence of decimal (base ten) digits. Either the integer part or the fraction part (not both) can be omitted; either the decimal point or the letter e (or E ) and the exponent (not both) can be omitted. The integer part, the optional decimal point and the optional fraction part form the significant part of the floating literal. The exponent, if present, indicates the power of 10 by which the significant part is to be scaled. If the scaled value is in the range of representable values for its type, the result is the scaled value if representable, else the larger or smaller representable value nearest the scaled value, chosen in an implementation-defined manner. The type of a floating literal is double

unless explicitly specified by a suffix. The suffixes f and F specify float, the suffixes l and L specify long double. If the scaled value is not in the range of representable values for its type, the program is ill-formed.

2.14.5 String literals [lex.string]

encoding-prefix: u8 u U L

s-char-sequence:

s-char
s-char-sequence s-char

s-char:

any member of the source character set except
the double-quote ", backslash \, or new-line character
escape-sequence
universal-character-name

raw-string:

" d-char-sequenceopt ( r-char-sequenceopt ) d-char-sequenceopt "

r-char-sequence:

r-char
r-char-sequence r-char

r-char:

any member of the source character set, except
a right parenthesis ) followed by the initial d-char-sequence
(which may be empty) followed by a double quote ".

d-char-sequence:

d-char
d-char-sequence d-char

d-char:

any member of the basic source character set except:
space, the left parenthesis (, the right parenthesis ), the backslash \,
and the control characters representing horizontal tab,
vertical tab, form feed, and newline.
1 A string literal is a sequence of characters (as defined in [lex.ccon]) surrounded by double quotes, optionally prefixed by R, u8, u8R, u, uR, U, UR, L, or LR, as in "...", R"(...)", u8"...", u8R"**(...)**", u"...", uR"*~(...)*~", U"...", UR"zzz(...)zzz", L"...", or LR"(...)", respectively.
2 A string literal that has an R in the prefix is a raw string literal. The d-char-sequence serves as a delimiter. The terminating d-char-sequence of a raw-string is the same sequence of characters as the initial d-char-sequence. A d-char-sequence shall consist of at most 16 characters.
3 NoteThe characters '(' and ')' are permitted in a raw-string. Thus, R"delimiter((a|b))delimiter" is equivalent to "(a|b)".
4
NoteA source-file new-line in a raw string literal results in a new-line in the resulting execution string-literal. Assuming no whitespace at the beginning of lines in the following example, the assert will succeed:
const char *p = R"(a\
b
c)";
assert(std::strcmp(p, "a\\\nb\nc") == 0);
5
ExampleThe raw string
R"a(
)\
a"
)a"

is equivalent to "\n)\\\na\"\n". The raw string

R"(??)"

is equivalent to "\?\?". The raw string

R"#(
)??="
)#"
is equivalent to "\n)\?\?=\"\n".
6 After translation phase 6, a string literal that does not begin with an encoding-prefix is an ordinary string literal, and is initialized with the given characters.
7 A string literal that begins with u8, such as u8"asdf", is a UTF-8 string literal and is initialized with the given characters as encoded in UTF-8.
8 Ordinary string literals and UTF-8 string literals are also referred to as narrow string literals. A narrow string literal has type “array of n const char”, where n is the size of the string as defined below, and has static storage duration ([basic.stc]).
9 A string literal that begins with u, such as u"asdf", is a char16_t string literal. A char16_t string literal has type “array of n const char16_t”, where n is the size of the string as defined below; it has static storage duration and is initialized with the given characters. A single c-char may produce more than one char16_t character in the form of surrogate pairs.
10 A string literal that begins with U, such as U"asdf", is a char32_t string literal. A char32_t string literal has type “array of n const char32_t”, where n is the size of the string as defined below; it has static storage duration and is initialized with the given characters.
11 A string literal that begins with L, such as L"asdf", is a wide string literal. A wide string literal has type “array of n const wchar_t”, where n is the size of the string as defined below; it has static storage duration and is initialized with the given characters.
12 Whether all string literals are distinct (that is, are stored in nonoverlapping objects) is implementation-defined. The effect of attempting to modify a string literal is undefined.
13 In translation phase 6 ([lex.phases]), adjacent string literals are concatenated. If both string literals have the same encoding-prefix, the resulting concatenated string literal has that encoding-prefix. If one string literal has no encoding-prefix, it is treated as a string literal of the same encoding-prefix as the other operand. If a UTF-8 string literal token is adjacent to a wide string literal token, the program is ill-formed. Any other concatenations are conditionally supported with implementation-defined behavior. NoteThis concatenation is an interpretation, not a conversion. Because the interpretation happens in translation phase 6 (after each character from a literal has been translated into a value from the appropriate character set), a string literal's initial rawness has no effect on the interpretation or well-formedness of the concatenation. Table [tab:lex.string.concat] has some examples of valid concatenations.
Table 8 — String literal concatenations
Source Means Source Means Source Means
u"a" u"b" u"ab" U"a" U"b" U"ab" L"a" L"b" L"ab"
u"a" "b" u"ab" U"a" "b" U"ab" L"a" "b" L"ab"
"a" u"b" u"ab" "a" U"b" U"ab" "a" L"b" L"ab"

Characters in concatenated strings are kept distinct.

Example
"\xA" "B"
contains the two characters '\xA' and 'B' after concatenation (and not the single hexadecimal character '\xAB').
14 After any necessary concatenation, in translation phase 7 ([lex.phases]), '\0' is appended to every string literal so that programs that scan a string can find its end.
15 Escape sequences and universal-character-names in non-raw string literals have the same meaning as in character literals ([lex.ccon]), except that the single quote ' is representable either by itself or by the escape sequence \', and the double quote " shall be preceded by a \. In a narrow string literal, a universal-character-name may map to more than one char element due to multibyte encoding. The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U'\0' or L'\0'. The size of a char16_t string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for each character requiring a surrogate pair, plus one for the terminating u'\0'. NoteThe size of a char16_t string literal is the number of code units, not the number of characters. Within char32_t and char16_t literals, any universal-character-names shall be within the range 0x0 to 0x10FFFF. The size of a narrow string literal is the total number of escape sequences and other characters, plus at least one for the multibyte encoding of each universal-character-name, plus one for the terminating '\0'.

2.14.6 Boolean literals [lex.bool]

boolean-literal:

false
true
1 The Boolean literals are the keywords false and true. Such literals are prvalues and have type bool.

2.14.7 Pointer literals [lex.nullptr]

pointer-literal:

nullptr
1 The pointer literal is the keyword nullptr. It is a prvalue of type std::nullptr_t.
Notestd::nullptr_t is a distinct type that is neither a pointer type nor a pointer to member type; rather, a prvalue of this type is a null pointer constant and can be converted to a null pointer value or null member pointer value. See [conv.ptr] and [conv.mem].

2.14.8 User-defined literals [lex.ext]

pointer-literal:

nullptr
1 The pointer literal is the keyword nullptr. It is a prvalue of type std::nullptr_t.
Notestd::nullptr_t is a distinct type that is neither a pointer type nor a pointer to member type; rather, a prvalue of this type is a null pointer constant and can be converted to a null pointer value or null member pointer value. See [conv.ptr] and [conv.mem].

  1. Implementations must behave as if these separate phases occur, although in practice different phases might be folded together.
  2. A partial preprocessing token would arise from a source file ending in the first portion of a multi-character token that requires a terminating sequence of characters, such as a header-name that is missing the closing " or >. A partial comment would arise from a source file ending with an unclosed /* comment.
  3. An implementation need not convert all non-corresponding source characters to the same execution character.
  4. The glyphs for the members of the basic source character set are intended to identify characters from the subset of ISO/IEC 10646 which corresponds to the ASCII character set. However, because the mapping from source file characters to the source character set (described in translation phase 1) is specified as implementation-defined, an implementation is required to document how the basic source characters are represented in source files.
  5. A sequence of characters resembling a universal-character-name in an r-char-sequence ([lex.string]) does not form a universal-character-name.
  6. These include “digraphs” and additional reserved words. The term “digraph” (token consisting of two characters) is not perfectly descriptive, since one of the alternative preprocessing-tokens is %:%: and of course several primary tokens contain two characters. Nonetheless, those alternative tokens that aren't lexical keywords are colloquially known as “digraphs”.
  7. Thus the “stringized” values ([cpp.stringize]) of [ and <: will be different, maintaining the source spelling, but the tokens can otherwise be freely interchanged.
  8. Literals include strings and character and numeric literals.
  9. Thus, a sequence of characters that resembles an escape sequence might result in an error, be interpreted as the character corresponding to the escape sequence, or have a completely different meaning, depending on the implementation.
  10. On systems in which linkers cannot accept extended characters, an encoding of the universal-character-name may be used in forming valid external identifiers. For example, some otherwise unused character or sequence of characters may be used to encode the \u in a universal-character-name. Extended characters may produce a long external identifier, but C++ does not place a translation limit on significant characters for external identifiers. In C++, upper- and lower-case letters are considered different for all identifiers, including external identifiers.
  11. The term “literal” generally designates, in this International Standard, those tokens that are called “constants” in ISO C.
  12. The digits 8 and 9 are not octal digits.
  13. They are intended for character sets where a character does not fit into a single byte.
  14. Using an escape sequence for a question mark can avoid accidentally creating a trigraph.


[edit]

Source: https://timsong-cpp.github.io/cppwp/n4140/lex

List of Tables [tab]
List of Figures [fig]
1 General [intro]
2 Lexical conventions [lex]
2.1 Separate translation [lex.separate]
2.2 Phases of translation [lex.phases]
2.3 Character sets [lex.charset]
2.4 Trigraph sequences [lex.trigraph]
2.5 Preprocessing tokens [lex.pptoken]
2.6 Alternative tokens [lex.digraph]
2.7 Tokens [lex.token]
2.8 Comments [lex.comment]
2.9 Header names [lex.header]
2.10 Preprocessing numbers [lex.ppnumber]
2.11 Identifiers [lex.name]
2.12 Keywords [lex.key]
2.13 Operators and punctuators [lex.operators]
2.14 Literals [lex.literal]
3 Basic concepts [basic]
4 Standard conversions [conv]
5 Expressions [expr]
6 Statements [stmt.stmt]
7 Declarations [dcl.dcl]
8 Declarators [dcl.decl]
9 Classes [class]
10 Derived classes [class.derived]
11 Member access control [class.access]
12 Special member functions [special]
13 Overloading [over]
14 Templates [temp]
15 Exception handling [except]
16 Preprocessing directives [cpp]
17 Library introduction [library]
18 Language support library [language.support]
19 Diagnostics library [diagnostics]
20 General utilities library [utilities]
21 Strings library [strings]
22 Localization library [localization]
23 Containers library [containers]
24 Iterators library [iterators]
25 Algorithms library [algorithms]
26 Numerics library [numerics]
27 Input/output library [input.output]
28 Regular expressions library [re]
29 Atomic operations library [atomics]
30 Thread support library [thread]
Annex A (informative) Grammar summary [gram]
Annex B (informative) Implementation quantities [implimits]
Annex C (informative) Compatibility [diff]
Annex D (normative) Compatibility features [depr]
Annex E (normative) Universal character names for identifier characters [charname]
Index [generalindex]
Index of library names [libraryindex]
Index of implementation-defined behavior [impldefindex]

2.1 Separate translation [lex.separate]

1 The text of the program is kept in units called source files in this International Standard. A source file together with all the headers ([headers]) and source files included ([cpp.include]) via the preprocessing directive #include, less any source lines skipped by any of the conditional inclusion ([cpp.cond]) preprocessing directives, is called a translation unit. NoteA C++ program need not all be translated at the same time.
2 NotePreviously translated translation units and instantiation units can be preserved individually or in libraries. The separate translation units of a program communicate ([basic.link]) by (for example) calls to functions whose identifiers have external linkage, manipulation of objects whose identifiers have external linkage, or manipulation of data files. Translation units can be separately translated and then later linked to produce an executable program ([basic.link]).

2.2 Phases of translation [lex.phases]

1 The precedence among the syntax rules of translation is specified by the following phases.[cpp14 1]
  1. Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. Trigraph sequences ([lex.trigraph]) are replaced by corresponding single-character internal representations. Any source file character not in the basic source character set ([lex.charset]) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)
  2. Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines. Only the last backslash on any physical source line shall be eligible for being part of such a splice. Except for splices reverted in a raw string literal, if a splice results in a character sequence that matches the syntax of a universal-character-name, the behavior is undefined. A source file that is not empty and that does not end in a new-line character, or that ends in a new-line character immediately preceded by a backslash character before any such splicing takes place, shall be processed as if an additional new-line character were appended to the file.
  3. The source file is decomposed into preprocessing tokens ([lex.pptoken]) and sequences of white-space characters (including comments). A source file shall not end in a partial preprocessing token or in a partial comment.[cpp14 2] Each comment is replaced by one space character. New-line characters are retained. Whether each nonempty sequence of white-space characters other than new-line is retained or replaced by one space character is unspecified. The process of dividing a source file's characters into preprocessing tokens is context-dependent.
    Examplesee the handling of < within a #include preprocessing directive.
  4. Preprocessing directives are executed, macro invocations are expanded, and _Pragma unary operator expressions are executed. If a character sequence that matches the syntax of a universal-character-name is produced by token concatenation ([cpp.concat]), the behavior is undefined. A #include preprocessing directive causes the named header or source file to be processed from phase 1 through phase 4, recursively. All preprocessing directives are then deleted.
  5. Each source character set member in a character literal or a string literal, as well as each escape sequence and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set ([lex.ccon], [lex.string]); if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.[cpp14 3]
  6. Adjacent string literal tokens are concatenated.
  7. White-space characters separating tokens are no longer significant. Each preprocessing token is converted into a token. ([lex.token]). The resulting tokens are syntactically and semantically analyzed and translated as a translation unit. NoteThe process of analyzing and translating the tokens may occasionally result in one token being replaced by a sequence of other tokens ([temp.names]). NoteSource files, translation units and translated translation units need not necessarily be stored as files, nor need there be any one-to-one correspondence between these entities and any external representation. The description is conceptual only, and does not specify any particular implementation.
  8. Translated translation units and instantiation units are combined as follows: NoteSome or all of these may be supplied from a library. Each translated translation unit is examined to produce a list of required instantiations. NoteThis may include instantiations which have been explicitly requested ([temp.explicit]). The definitions of the required templates are located. It is implementation-defined whether the source of the translation units containing these definitions is required to be available. NoteAn implementation could encode sufficient information into the translated translation unit so as to ensure the source is not required here. All the required instantiations are performed to produce instantiation units. NoteThese are similar to translated translation units, but contain no references to uninstantiated templates and no template definitions. The program is ill-formed if any instantiation fails.
  9. All external entity references are resolved. Library components are linked to satisfy external references to entities not defined in the current translation. All such translator output is collected into a program image which contains information needed for execution in its execution environment.

2.3 Character sets [lex.charset]

1 The basic source character set consists of 96 characters: the space character, the control characters representing horizontal tab, vertical tab, form feed, and new-line, plus the following 91 graphical characters:[cpp14 4]
2 The universal-character-name construct provides a way to name other characters.

universal-character-name:

\u hex-quad
\U hex-quad hex-quad

The character designated by the universal-character-name \UNNNNNNNN is that character whose character short name in ISO/IEC 10646 is NNNNNNNN; the character designated by the universal-character-name \uNNNN is that character whose character short name in ISO/IEC 10646 is 0000NNNN. If the hexadecimal value for a universal-character-name corresponds to a surrogate code point (in the range 0xD800–0xDFFF, inclusive), the program is ill-formed. Additionally, if the hexadecimal value for a universal-character-name outside the c-char-sequence, s-char-sequence, or r-char-sequence of a character or string literal corresponds to a control character (in either of the ranges 0x00–0x1F or 0x7F–0x9F, both inclusive) or to a character in the basic source character set, the program is ill-formed.[cpp14 5]

3 The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose representation has all zero bits. For each basic execution character set, the values of the members shall be non-negative and distinct from one another. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous. The execution character set and the execution wide-character set are implementation-defined supersets of the basic execution character set and the basic execution wide-character set, respectively. The values of the members of the execution character sets and the sets of additional members are locale-specific.

2.4 Trigraph sequences [lex.trigraph]

1 Before any other processing takes place, each occurrence of one of the following sequences of three characters (“trigraph sequences”) is replaced by the single character indicated in Table [tab:trigraph.sequences].
Table 1 — Trigraph sequences
Trigraph Replacement Trigraph Replacement Trigraph Replacement
??= # ??( [ ??< {
??/ \ ??) ] ??> }
??' ^ ??! | ??- ~
2
Example
??=define arraycheck(a,b) a??(b??) ??!??! b??(a??)

becomes

#define arraycheck(a,b) a[b] || b[a]
3 No other trigraph sequence exists. Each ? that does not begin one of the trigraphs listed above is not changed.

2.5 Preprocessing tokens [lex.pptoken]

1 Each preprocessing token that is converted to a token ([lex.token]) shall have the lexical form of a keyword, an identifier, a literal, an operator, or a punctuator.
2 A preprocessing token is the minimal lexical element of the language in translation phases 3 through 6. The categories of preprocessing token are: header names, identifiers, preprocessing numbers, character literals (including user-defined character literals), string literals (including user-defined string literals), preprocessing operators and punctuators, and single non-white-space characters that do not lexically match the other preprocessing token categories. If a ' or a " character matches the last category, the behavior is undefined. Preprocessing tokens can be separated by white space; this consists of comments ([lex.comment]), or white-space characters (space, horizontal tab, new-line, vertical tab, and form-feed), or both. As described in Clause [cpp], in certain circumstances during translation phase 4, white space (or the absence thereof) serves as more than preprocessing token separation. White space can appear within a preprocessing token only as part of a header name or between the quotation characters in a character literal or string literal.
3 If the input stream has been parsed into preprocessing tokens up to a given character:
(3.1)
  • If the next character begins a sequence of characters that could be the prefix and initial double quote of a raw string literal, such as R", the next preprocessing token shall be a raw string literal. Between the initial and final double quote characters of the raw string, any transformations performed in phases 1 and 2 (trigraphs, universal-character-names, and line splicing) are reverted; this reversion shall apply before any d-char, r-char, or delimiting parenthesis is identified. The raw string literal is defined as the shortest sequence of characters that matches the raw-string pattern

(3.2)
  • Otherwise, if the next three characters are <:: and the subsequent character is neither : nor >, the < is treated as a preprocessor token by itself and not as the first character of the alternative token <:.

(3.3)
  • Otherwise, the next preprocessing token is the longest sequence of characters that could constitute a preprocessing token, even if that would cause further lexical analysis to fail.

    Example
    #define R "x"
    const char* s = R"y";           // ill-formed raw string, not "x" "y"
    
4
ExampleThe program fragment 1Ex is parsed as a preprocessing number token (one that is not a valid floating or integer literal token), even though a parse as the pair of preprocessing tokens 1 and Ex might produce a valid expression (for example, if Ex were a macro defined as +1). Similarly, the program fragment 1E1 is parsed as a preprocessing number (one that is a valid floating literal token), whether or not E is a macro name.
5
ExampleThe program fragment x+++++y is parsed as x ++ ++ + y, which, if x and y have integral types, violates a constraint on increment operators, even though the parse x ++ + ++ y might yield a correct expression.

2.6 Alternative tokens [lex.digraph]

1 Alternative token representations are provided for some operators and punctuators.[cpp14 6]
2 In all respects of the language, each alternative token behaves the same, respectively, as its primary token, except for its spelling.[cpp14 7] The set of alternative tokens is defined in Table [tab:alternative.tokens].
Table 2 — Alternative tokens
Alternative Primary Alternative Primary Alternative Primary
<% { and && and_eq &=
%> } bitor | or_eq |=
<: [ or || xor_eq ^=
:> ] xor ^ not !
%: # compl ~ not_eq !=
%:%: ## bitand &

2.7 Tokens [lex.token]

token:

identifier keyword literal
operator punctuator
1 There are five kinds of tokens: identifiers, keywords, literals,[cpp14 8] operators, and other separators. Blanks, horizontal and vertical tabs, newlines, formfeeds, and comments (collectively, “white space”), as described below, are ignored except as they serve to separate tokens. NoteSome white space is required to separate otherwise adjacent identifiers, keywords, numeric literals, and alternative tokens containing alphabetic characters.

2.8 Comments [lex.comment]

1 The characters /* start a comment, which terminates with the characters */. These comments do not nest. The characters // start a comment, which terminates immediately before the next new-line character. If there is a form-feed or a vertical-tab character in such a comment, only white-space characters shall appear between it and the new-line that terminates the comment; no diagnostic is required. NoteThe comment characters //, /*, and */ have no special meaning within a // comment and are treated just like other characters. Similarly, the comment characters // and /* have no special meaning within a /* comment.

2.9 Header names [lex.header]

header-name:

< h-char-sequence >
" q-char-sequence "

h-char-sequence:

h-char
h-char-sequence h-char

h-char:

any member of the source character set except new-line and >

q-char-sequence:

q-char
q-char-sequence q-char

q-char:

any member of the source character set except new-line and "
1 Header name preprocessing tokens shall only appear within a #include preprocessing directive ([cpp.include]). The sequences in both forms of header-names are mapped in an implementation-defined manner to headers or to external source file names as specified in [cpp.include].
2 The appearance of either of the characters ' or \ or of either of the character sequences /* or // in a q-char-sequence or an h-char-sequence is conditionally-supported with implementation-defined semantics, as is the appearance of the character " in an h-char-sequence.[cpp14 9]

2.10 Preprocessing numbers [lex.ppnumber]

1 Preprocessing number tokens lexically include all integer literal tokens ([lex.icon]) and all floating literal tokens ([lex.fcon]).
2 A preprocessing number does not have a type or a value; it acquires both after a successful conversion to an integer literal token or a floating literal token.

2.11 Identifiers [lex.name]

identifier-nondigit:

nondigit
universal-character-name
other implementation-defined characters

nondigit: one of

a b c d e f g h i j k l m
n o p q r s t u v w x y z
A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z _

digit: one of

0 1 2 3 4 5 6 7 8 9
1 An identifier is an arbitrarily long sequence of letters and digits. Each universal-character-name in an identifier shall designate a character whose encoding in ISO 10646 falls into one of the ranges specified in [charname.allowed]. The initial element shall not be a universal-character-name designating a character whose encoding falls into one of the ranges specified in [charname.disallowed]. Upper- and lower-case letters are different. All characters are significant.[cpp14 10]
2 The identifiers in Table [tab:identifiers.special] have a special meaning when appearing in a certain context. When referred to in the grammar, these identifiers are used explicitly rather than using the identifier grammar production. Unless otherwise specified, any ambiguity as to whether a given identifier has a special meaning is resolved to interpret the token as a regular identifier.
Table 3 — Identifiers with special meaning
override final
3 In addition, some identifiers are reserved for use by C++ implementations and standard libraries ([global.names]) and shall not be used otherwise; no diagnostic is required.

2.12 Keywords [lex.key]

1 The identifiers shown in Table [tab:keywords] are reserved for use as keywords (that is, they are unconditionally treated as keywords in phase 7) except in an attribute-token ([dcl.attr.grammar]) NoteThe export keyword is unused but is reserved for future use.:
Table 4 — Keywords
alignas continue friend register true
alignof decltype goto reinterpret_cast try
asm default if return typedef
auto delete inline short typeid
bool do int signed typename
break double long sizeof union
case dynamic_cast mutable static unsigned
catch else namespace static_assert using
char enum new static_cast virtual
char16_t explicit noexcept struct void
char32_t export nullptr switch volatile
class extern operator template wchar_t
const false private this while
constexpr float protected thread_local
const_cast for public throw
2 Furthermore, the alternative representations shown in Table [tab:alternative.representations] for certain operators and punctuators ([lex.digraph]) are reserved and shall not be used otherwise:
Table 5 — Alternative representations
and and_eq bitand bitor compl not
not_eq or or_eq xor xor_eq

2.13 Operators and punctuators [lex.operators]

1 The lexical representation of C++ programs includes a number of preprocessing tokens which are used in the syntax of the preprocessor or are converted into tokens for operators and punctuators:

preprocessing-op-or-punc: one of { } [ ] # ## ( ) <: :> <% %> %: %:%: ; : ... new delete ? :: . .* + - * / % ^ & | ~ ! = < > += -= *= /= %= ^= &= |= << >> >>= <<= == != <= >= && || ++ -- , ->* -> and and_eq bitand bitor compl not not_eq or or_eq xor xor_eq

Each preprocessing-op-or-punc is converted to a single token in translation phase 7 ([lex.phases]).

2.14.1 Kinds of literals [lex.literal.kinds]

1 There are several kinds of literals.[cpp14 11]

2.14.2 Integer literals [lex.icon]

octal-literal:

0
octal-literal 'opt octal-digit

decimal-literal:

nonzero-digit
decimal-literal 'opt digit

binary-digit:

0
1

octal-digit: one of

0 1 2 3 4 5 6 7

nonzero-digit: one of

1 2 3 4 5 6 7 8 9

hexadecimal-digit: one of

0 1 2 3 4 5 6 7 8 9
a b c d e f
A B C D E F

unsigned-suffix: one of

u U

long-suffix: one of

l L

long-long-suffix: one of

ll LL
1 An integer literal is a sequence of digits that has no period or exponent part, with optional separating single quotes that are ignored when determining its value. An integer literal may have a prefix that specifies its base and a suffix that specifies its type. The lexically first digit of the sequence of digits is the most significant. A binary integer literal (base two) begins with 0b or 0B and consists of a sequence of binary digits. An octal integer literal (base eight) begins with the digit 0 and consists of a sequence of octal digits.[cpp14 12] A decimal integer literal (base ten) begins with a digit other than 0 and consists of a sequence of decimal digits. A hexadecimal integer literal (base sixteen) begins with 0x or 0X and consists of a sequence of hexadecimal digits, which include the decimal digits and the letters a through f and A through F with decimal values ten through fifteen.
ExampleThe number twelve can be written 12, 014, 0XC, or 0b1100. The literals 1048576, 1'048'576, 0X100000, 0x10'0000, and 0'004'000'000 all have the same value.
2 The type of an integer literal is the first of the corresponding list in Table [tab:lex.type.integer.literal] in which its value can be represented.
Table 6 — Types of integer literals
Suffix Decimal literal Binary, octal, or hexadecimal literal
none int int
long int unsigned int
long long int long int
unsigned long int
long long int
unsigned long long int
u or U unsigned int unsigned int
unsigned long int unsigned long int
unsigned long long int unsigned long long int
l or L long int long int
long long int unsigned long int
long long int
unsigned long long int
Both u or U unsigned long int unsigned long int
and l or L unsigned long long int unsigned long long int
ll or LL long long int long long int
unsigned long long int
Both u or U unsigned long long int unsigned long long int
and ll or LL
3 If an integer literal cannot be represented by any type in its list and an extended integer type ([basic.fundamental]) can represent its value, it may have that extended integer type. If all of the types in the list for the literal are signed, the extended integer type shall be signed. If all of the types in the list for the literal are unsigned, the extended integer type shall be unsigned. If the list contains both signed and unsigned types, the extended integer type may be signed or unsigned. A program is ill-formed if one of its translation units contains an integer literal that cannot be represented by any of the allowed types.

2.14.3 Character literals [lex.ccon]

character-literal:

' c-char-sequence ' u' c-char-sequence ' U' c-char-sequence ' L' c-char-sequence '

c-char-sequence:

c-char
c-char-sequence c-char

c-char:

any member of the source character set except
the single-quote ', backslash \, or new-line character
escape-sequence
universal-character-name

simple-escape-sequence: one of

\'  \"  \?  \\
\a  \b  \f  \n  \r  \t  \v
1 A character literal is one or more characters enclosed in single quotes, as in 'x', optionally preceded by one of the letters u, U, or L, as in u'y', U'z', or L'x', respectively. A character literal that does not begin with u, U, or L is an ordinary character literal, also referred to as a narrow-character literal. An ordinary character literal that contains a single c-char representable in the execution character set has type char, with value equal to the numerical value of the encoding of the c-char in the execution character set. An ordinary character literal that contains more than one c-char is a multicharacter literal. A multicharacter literal, or an ordinary character literal containing a single c-char not representable in the execution character set, is conditionally-supported, has type int, and has an implementation-defined value.
2 A character literal that begins with the letter u, such as u'y', is a character literal of type char16_t. The value of a char16_t literal containing a single c-char is equal to its ISO 10646 code point value, provided that the code point is representable with a single 16-bit code unit. (That is, provided it is a basic multi-lingual plane code point.) If the value is not representable within 16 bits, the program is ill-formed. A char16_t literal containing multiple c-chars is ill-formed. A character literal that begins with the letter U, such as U'z', is a character literal of type char32_t. The value of a char32_t literal containing a single c-char is equal to its ISO 10646 code point value. A char32_t literal containing multiple c-chars is ill-formed. A character literal that begins with the letter L, such as L'x', is a wide-character literal. A wide-character literal has type wchar_t.[cpp14 13] The value of a wide-character literal containing a single c-char has value equal to the numerical value of the encoding of the c-char in the execution wide-character set, unless the c-char has no representation in the execution wide-character set, in which case the value is implementation-defined. NoteThe type wchar_t is able to represent all members of the execution wide-character set (see [basic.fundamental]).. The value of a wide-character literal containing multiple c-chars is implementation-defined.
3 Certain nongraphic characters, the single quote ', the double quote ", the question mark ?,[cpp14 14] and the backslash \, can be represented according to Table [tab:escape.sequences]. The double quote " and the question mark ?, can be represented as themselves or by the escape sequences \" and \? respectively, but the single quote ' and the backslash \ shall be represented by the escape sequences \' and \\ respectively. Escape sequences in which the character following the backslash is not listed in Table [tab:escape.sequences] are conditionally-supported, with implementation-defined semantics. An escape sequence specifies a single character.
Table 7 — Escape sequences
new-line NL(LF) \n
horizontal tab HT \t
vertical tab VT \v
backspace BS \b
carriage return CR \r
form feed FF \f
alert BEL \a
backslash \ \\
question mark ? \?
single quote ' \'
double quote " \"
octal number ooo \ooo
hex number hhh \xhhh
4 The escape \ooo consists of the backslash followed by one, two, or three octal digits that are taken to specify the value of the desired character. The escape \xhhh consists of the backslash followed by x followed by one or more hexadecimal digits that are taken to specify the value of the desired character. There is no limit to the number of digits in a hexadecimal sequence. A sequence of octal or hexadecimal digits is terminated by the first character that is not an octal digit or a hexadecimal digit, respectively. The value of a character literal is implementation-defined if it falls outside of the implementation-defined range defined for char (for literals with no prefix), char16_t (for literals prefixed by 'u'), char32_t (for literals prefixed by 'U'), or wchar_t (for literals prefixed by 'L').
5 A universal-character-name is translated to the encoding, in the appropriate execution character set, of the character named. If there is no such encoding, the universal-character-name is translated to an implementation-defined encoding. NoteIn translation phase 1, a universal-character-name is introduced whenever an actual extended character is encountered in the source text. Therefore, all extended characters are described in terms of universal-character-names. However, the actual compiler implementation may use its own native character set, so long as the same results are obtained.

2.14.4 Floating literals [lex.fcon]

fractional-constant:

digit-sequenceopt . digit-sequence
digit-sequence .

exponent-part:

e signopt digit-sequence
E signopt digit-sequence

sign: one of

+ -

digit-sequence:

digit
digit-sequence 'opt digit

floating-suffix: one of

f l F L
1 A floating literal consists of an integer part, a decimal point, a fraction part, an e or E, an optionally signed integer exponent, and an optional type suffix. The integer and fraction parts both consist of a sequence of decimal (base ten) digits. Optional separating single quotes in a digit-sequence are ignored when determining its value.
ExampleThe literals 1.602'176'565e-19 and 1.602176565e-19 have the same value.
Either the integer part or the fraction part (not both) can be omitted; either the decimal point or the letter e (or E ) and the exponent (not both) can be omitted. The integer part, the optional decimal point and the optional fraction part form the significant part of the floating literal. The exponent, if present, indicates the power of 10 by which the significant part is to be scaled. If the scaled value is in the range of representable values for its type, the result is the scaled value if representable, else the larger or smaller representable value nearest the scaled value, chosen in an implementation-defined manner. The type of a floating literal is double

unless explicitly specified by a suffix. The suffixes f and F specify float, the suffixes l and L specify long double. If the scaled value is not in the range of representable values for its type, the program is ill-formed.

2.14.5 String literals [lex.string]

encoding-prefix: u8 u U L

s-char-sequence:

s-char
s-char-sequence s-char

s-char:

any member of the source character set except
the double-quote ", backslash \, or new-line character
escape-sequence
universal-character-name

raw-string:

" d-char-sequenceopt ( r-char-sequenceopt ) d-char-sequenceopt "

r-char-sequence:

r-char
r-char-sequence r-char

r-char:

any member of the source character set, except
a right parenthesis ) followed by the initial d-char-sequence
(which may be empty) followed by a double quote ".

d-char-sequence:

d-char
d-char-sequence d-char

d-char:

any member of the basic source character set except:
space, the left parenthesis (, the right parenthesis ), the backslash \,
and the control characters representing horizontal tab,
vertical tab, form feed, and newline.
1 A string literal is a sequence of characters (as defined in [lex.ccon]) surrounded by double quotes, optionally prefixed by R, u8, u8R, u, uR, U, UR, L, or LR, as in "...", R"(...)", u8"...", u8R"**(...)**", u"...", uR"*~(...)*~", U"...", UR"zzz(...)zzz", L"...", or LR"(...)", respectively.
2 A string literal that has an R in the prefix is a raw string literal. The d-char-sequence serves as a delimiter. The terminating d-char-sequence of a raw-string is the same sequence of characters as the initial d-char-sequence. A d-char-sequence shall consist of at most 16 characters.
3 NoteThe characters '(' and ')' are permitted in a raw-string. Thus, R"delimiter((a|b))delimiter" is equivalent to "(a|b)".
4
NoteA source-file new-line in a raw string literal results in a new-line in the resulting execution string-literal. Assuming no whitespace at the beginning of lines in the following example, the assert will succeed:
const char* p = R"(a\
b
c)";
assert(std::strcmp(p, "a\\\nb\nc") == 0);
5
ExampleThe raw string
R"a(
)\
a"
)a"

is equivalent to "\n)\\\na\"\n". The raw string

R"(??)"

is equivalent to "\?\?". The raw string

R"#(
)??="
)#"
is equivalent to "\n)\?\?=\"\n".
6 After translation phase 6, a string literal that does not begin with an encoding-prefix is an ordinary string literal, and is initialized with the given characters.
7 A string literal that begins with u8, such as u8"asdf", is a UTF-8 string literal.
8 Ordinary string literals and UTF-8 string literals are also referred to as narrow string literals. A narrow string literal has type “array of n const char”, where n is the size of the string as defined below, and has static storage duration ([basic.stc]).
9 For a UTF-8 string literal, each successive element of the object representation ([basic.types]) has the value of the corresponding code unit of the UTF-8 encoding of the string.
10 A string literal that begins with u, such as u"asdf", is a char16_t string literal. A char16_t string literal has type “array of n const char16_t”, where n is the size of the string as defined below; it has static storage duration and is initialized with the given characters. A single c-char may produce more than one char16_t character in the form of surrogate pairs.
11 A string literal that begins with U, such as U"asdf", is a char32_t string literal. A char32_t string literal has type “array of n const char32_t”, where n is the size of the string as defined below; it has static storage duration and is initialized with the given characters.
12 A string literal that begins with L, such as L"asdf", is a wide string literal. A wide string literal has type “array of n const wchar_t”, where n is the size of the string as defined below; it has static storage duration and is initialized with the given characters.
13 Whether all string literals are distinct (that is, are stored in nonoverlapping objects) is implementation-defined. The effect of attempting to modify a string literal is undefined.
14 In translation phase 6 ([lex.phases]), adjacent string literals are concatenated. If both string literals have the same encoding-prefix, the resulting concatenated string literal has that encoding-prefix. If one string literal has no encoding-prefix, it is treated as a string literal of the same encoding-prefix as the other operand. If a UTF-8 string literal token is adjacent to a wide string literal token, the program is ill-formed. Any other concatenations are conditionally-supported with implementation-defined behavior. NoteThis concatenation is an interpretation, not a conversion. Because the interpretation happens in translation phase 6 (after each character from a literal has been translated into a value from the appropriate character set), a string literal's initial rawness has no effect on the interpretation or well-formedness of the concatenation. Table [tab:lex.string.concat] has some examples of valid concatenations.
Table 8 — String literal concatenations
Source Means Source Means Source Means
u"a" u"b" u"ab" U"a" U"b" U"ab" L"a" L"b" L"ab"
u"a" "b" u"ab" U"a" "b" U"ab" L"a" "b" L"ab"
"a" u"b" u"ab" "a" U"b" U"ab" "a" L"b" L"ab"

Characters in concatenated strings are kept distinct.

Example
"\xA" "B"
contains the two characters '\xA' and 'B' after concatenation (and not the single hexadecimal character '\xAB').
15 After any necessary concatenation, in translation phase 7 ([lex.phases]), '\0' is appended to every string literal so that programs that scan a string can find its end.
16 Escape sequences and universal-character-names in non-raw string literals have the same meaning as in character literals ([lex.ccon]), except that the single quote ' is representable either by itself or by the escape sequence \', and the double quote " shall be preceded by a \. In a narrow string literal, a universal-character-name may map to more than one char element due to multibyte encoding. The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U'\0' or L'\0'. The size of a char16_t string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for each character requiring a surrogate pair, plus one for the terminating u'\0'. NoteThe size of a char16_t string literal is the number of code units, not the number of characters. Within char32_t and char16_t literals, any universal-character-names shall be within the range 0x0 to 0x10FFFF. The size of a narrow string literal is the total number of escape sequences and other characters, plus at least one for the multibyte encoding of each universal-character-name, plus one for the terminating '\0'.

2.14.6 Boolean literals [lex.bool]

boolean-literal:

false
true
1 The Boolean literals are the keywords false and true. Such literals are prvalues and have type bool.

2.14.7 Pointer literals [lex.nullptr]

pointer-literal:

nullptr
1 The pointer literal is the keyword nullptr. It is a prvalue of type std::nullptr_t.
Notestd::nullptr_t is a distinct type that is neither a pointer type nor a pointer to member type; rather, a prvalue of this type is a null pointer constant and can be converted to a null pointer value or null member pointer value. See [conv.ptr] and [conv.mem].

2.14.8 User-defined literals [lex.ext]

pointer-literal:

nullptr
1 The pointer literal is the keyword nullptr. It is a prvalue of type std::nullptr_t.
Notestd::nullptr_t is a distinct type that is neither a pointer type nor a pointer to member type; rather, a prvalue of this type is a null pointer constant and can be converted to a null pointer value or null member pointer value. See [conv.ptr] and [conv.mem].

  1. Implementations must behave as if these separate phases occur, although in practice different phases might be folded together.
  2. A partial preprocessing token would arise from a source file ending in the first portion of a multi-character token that requires a terminating sequence of characters, such as a header-name that is missing the closing " or >. A partial comment would arise from a source file ending with an unclosed /* comment.
  3. An implementation need not convert all non-corresponding source characters to the same execution character.
  4. The glyphs for the members of the basic source character set are intended to identify characters from the subset of ISO/IEC 10646 which corresponds to the ASCII character set. However, because the mapping from source file characters to the source character set (described in translation phase 1) is specified as implementation-defined, an implementation is required to document how the basic source characters are represented in source files.
  5. A sequence of characters resembling a universal-character-name in an r-char-sequence ([lex.string]) does not form a universal-character-name.
  6. These include “digraphs” and additional reserved words. The term “digraph” (token consisting of two characters) is not perfectly descriptive, since one of the alternative preprocessing-tokens is %:%: and of course several primary tokens contain two characters. Nonetheless, those alternative tokens that aren't lexical keywords are colloquially known as “digraphs”.
  7. Thus the “stringized” values ([cpp.stringize]) of [ and <: will be different, maintaining the source spelling, but the tokens can otherwise be freely interchanged.
  8. Literals include strings and character and numeric literals.
  9. Thus, a sequence of characters that resembles an escape sequence might result in an error, be interpreted as the character corresponding to the escape sequence, or have a completely different meaning, depending on the implementation.
  10. On systems in which linkers cannot accept extended characters, an encoding of the universal-character-name may be used in forming valid external identifiers. For example, some otherwise unused character or sequence of characters may be used to encode the \u in a universal-character-name. Extended characters may produce a long external identifier, but C++ does not place a translation limit on significant characters for external identifiers. In C++, upper- and lower-case letters are considered different for all identifiers, including external identifiers.
  11. The term “literal” generally designates, in this International Standard, those tokens that are called “constants” in ISO C.
  12. The digits 8 and 9 are not octal digits.
  13. They are intended for character sets where a character does not fit into a single byte.
  14. Using an escape sequence for a question mark can avoid accidentally creating a trigraph.


[edit]

Source: https://timsong-cpp.github.io/cppwp/n4659/lex

List of Tables [tab]
List of Figures [fig]
1 Scope [intro.scope]
2 Normative references [intro.refs]
3 Terms and definitions [intro.defs]
4 General principles [intro]
5 Lexical conventions [lex]
5.1 Separate translation [lex.separate]
5.2 Phases of translation [lex.phases]
5.3 Character sets [lex.charset]
5.4 Preprocessing tokens [lex.pptoken]
5.5 Alternative tokens [lex.digraph]
5.6 Tokens [lex.token]
5.7 Comments [lex.comment]
5.8 Header names [lex.header]
5.9 Preprocessing numbers [lex.ppnumber]
5.10 Identifiers [lex.name]
5.11 Keywords [lex.key]
5.12 Operators and punctuators [lex.operators]
5.13 Literals [lex.literal]
6 Basic concepts [basic]
7 Standard conversions [conv]
8 Expressions [expr]
9 Statements [stmt.stmt]
10 Declarations [dcl.dcl]
11 Declarators [dcl.decl]
12 Classes [class]
13 Derived classes [class.derived]
14 Member access control [class.access]
15 Special member functions [special]
16 Overloading [over]
17 Templates [temp]
18 Exception handling [except]
19 Preprocessing directives [cpp]
20 Library introduction [library]
21 Language support library [language.support]
22 Diagnostics library [diagnostics]
23 General utilities library [utilities]
24 Strings library [strings]
25 Localization library [localization]
26 Containers library [containers]
27 Iterators library [iterators]
28 Algorithms library [algorithms]
29 Numerics library [numerics]
30 Input/output library [input.output]
31 Regular expressions library [re]
32 Atomic operations library [atomics]
33 Thread support library [thread]
Annex A (informative) Grammar summary [gram]
Annex B (informative) Implementation quantities [implimits]
Annex C (informative) Compatibility [diff]
Annex D (normative) Compatibility features [depr]
Index [generalindex]
Index of library names [libraryindex]
Index of implementation-defined behavior [impldefindex]

5.1 Separate translation [lex.separate]

1 The text of the program is kept in units called source files in this International Standard. A source file together with all the headers and source files included via the preprocessing directive #include, less any source lines skipped by any of the conditional inclusion preprocessing directives, is called a translation unit. NoteA C++ program need not all be translated at the same time.
2 NotePreviously translated translation units and instantiation units can be preserved individually or in libraries. The separate translation units of a program communicate ([basic.link]) by (for example) calls to functions whose identifiers have external linkage, manipulation of objects whose identifiers have external linkage, or manipulation of data files. Translation units can be separately translated and then later linked to produce an executable program.

5.2 Phases of translation [lex.phases]

1 The precedence among the syntax rules of translation is specified by the following phases.[cpp17 1]
  1. Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. Any source file character not in the basic source character set is replaced by the universal-character-name that designates that character. An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (e.g., using the \uXXXX notation), are handled equivalently except where this replacement is reverted ([lex.pptoken]) in a raw string literal.
  2. Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines. Only the last backslash on any physical source line shall be eligible for being part of such a splice. Except for splices reverted in a raw string literal, if a splice results in a character sequence that matches the syntax of a universal-character-name, the behavior is undefined. A source file that is not empty and that does not end in a new-line character, or that ends in a new-line character immediately preceded by a backslash character before any such splicing takes place, shall be processed as if an additional new-line character were appended to the file.
  3. The source file is decomposed into preprocessing tokens and sequences of white-space characters (including comments). A source file shall not end in a partial preprocessing token or in a partial comment.[cpp17 2] Each comment is replaced by one space character. New-line characters are retained. Whether each nonempty sequence of white-space characters other than new-line is retained or replaced by one space character is unspecified. The process of dividing a source file's characters into preprocessing tokens is context-dependent.
    Examplesee the handling of < within a #include preprocessing directive.
  4. Preprocessing directives are executed, macro invocations are expanded, and _Pragma unary operator expressions are executed. If a character sequence that matches the syntax of a universal-character-name is produced by token concatenation, the behavior is undefined. A #include preprocessing directive causes the named header or source file to be processed from phase 1 through phase 4, recursively. All preprocessing directives are then deleted.
  5. Each source character set member in a character literal or a string literal, as well as each escape sequence and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set ([lex.ccon], [lex.string]); if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.[cpp17 3]
  6. Adjacent string literal tokens are concatenated.
  7. White-space characters separating tokens are no longer significant. Each preprocessing token is converted into a token. The resulting tokens are syntactically and semantically analyzed and translated as a translation unit. NoteThe process of analyzing and translating the tokens may occasionally result in one token being replaced by a sequence of other tokens ([temp.names]). NoteSource files, translation units and translated translation units need not necessarily be stored as files, nor need there be any one-to-one correspondence between these entities and any external representation. The description is conceptual only, and does not specify any particular implementation.
  8. Translated translation units and instantiation units are combined as follows: NoteSome or all of these may be supplied from a library. Each translated translation unit is examined to produce a list of required instantiations. NoteThis may include instantiations which have been explicitly requested. The definitions of the required templates are located. It is implementation-defined whether the source of the translation units containing these definitions is required to be available. NoteAn implementation could encode sufficient information into the translated translation unit so as to ensure the source is not required here. All the required instantiations are performed to produce instantiation units. NoteThese are similar to translated translation units, but contain no references to uninstantiated templates and no template definitions. The program is ill-formed if any instantiation fails.
  9. All external entity references are resolved. Library components are linked to satisfy external references to entities not defined in the current translation. All such translator output is collected into a program image which contains information needed for execution in its execution environment.

5.3 Character sets [lex.charset]

1 The basic source character set consists of 96 characters: the space character, the control characters representing horizontal tab, vertical tab, form feed, and new-line, plus the following 91 graphical characters:[cpp17 4]
2 The universal-character-name construct provides a way to name other characters.

universal-character-name:

\u hex-quad
\U hex-quad hex-quad

The character designated by the universal-character-name \UNNNNNNNN is that character whose character short name in ISO/IEC 10646 is NNNNNNNN; the character designated by the universal-character-name \uNNNN is that character whose character short name in ISO/IEC 10646 is 0000NNNN. If the hexadecimal value for a universal-character-name corresponds to a surrogate code point (in the range 0xD800–0xDFFF, inclusive), the program is ill-formed. Additionally, if the hexadecimal value for a universal-character-name outside the c-char-sequence, s-char-sequence, or r-char-sequence of a character or string literal corresponds to a control character (in either of the ranges 0x00–0x1F or 0x7F–0x9F, both inclusive) or to a character in the basic source character set, the program is ill-formed.[cpp17 5]

3 The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose value is 0. For each basic execution character set, the values of the members shall be non-negative and distinct from one another. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous. The execution character set and the execution wide-character set are implementation-defined supersets of the basic execution character set and the basic execution wide-character set, respectively. The values of the members of the execution character sets and the sets of additional members are locale-specific.

5.4 Preprocessing tokens [lex.pptoken]

1 Each preprocessing token that is converted to a token shall have the lexical form of a keyword, an identifier, a literal, an operator, or a punctuator.
2 A preprocessing token is the minimal lexical element of the language in translation phases 3 through 6. The categories of preprocessing token are: header names, identifiers, preprocessing numbers, character literals (including user-defined character literals), string literals (including user-defined string literals), preprocessing operators and punctuators, and single non-white-space characters that do not lexically match the other preprocessing token categories. If a ' or a " character matches the last category, the behavior is undefined. Preprocessing tokens can be separated by white space; this consists of comments, or white-space characters (space, horizontal tab, new-line, vertical tab, and form-feed), or both. As described in Clause [cpp], in certain circumstances during translation phase 4, white space (or the absence thereof) serves as more than preprocessing token separation. White space can appear within a preprocessing token only as part of a header name or between the quotation characters in a character literal or string literal.
3 If the input stream has been parsed into preprocessing tokens up to a given character:
(3.1)
  • If the next character begins a sequence of characters that could be the prefix and initial double quote of a raw string literal, such as R", the next preprocessing token shall be a raw string literal. Between the initial and final double quote characters of the raw string, any transformations performed in phases 1 and 2 (universal-character-names and line splicing) are reverted; this reversion shall apply before any d-char, r-char, or delimiting parenthesis is identified. The raw string literal is defined as the shortest sequence of characters that matches the raw-string pattern

(3.2)
  • Otherwise, if the next three characters are <​::​ and the subsequent character is neither : nor >, the < is treated as a preprocessing token by itself and not as the first character of the alternative token <:.

(3.3)
  • Otherwise, the next preprocessing token is the longest sequence of characters that could constitute a preprocessing token, even if that would cause further lexical analysis to fail, except that a header-name is only formed within a #include directive.

    Example
    #define R "x"
    const char* s = R"y";           // ill-formed raw string, not "x" "y"
    
4
ExampleThe program fragment 0xe+foo is parsed as a preprocessing number token (one that is not a valid floating or integer literal token), even though a parse as three preprocessing tokens 0xe, +, and foo might produce a valid expression (for example, if foo were a macro defined as 1). Similarly, the program fragment 1E1 is parsed as a preprocessing number (one that is a valid floating literal token), whether or not E is a macro name.
5
ExampleThe program fragment x+++++y is parsed as x ++ ++ + y, which, if x and y have integral types, violates a constraint on increment operators, even though the parse x ++ + ++ y might yield a correct expression.

5.5 Alternative tokens [lex.digraph]

1 Alternative token representations are provided for some operators and punctuators.[cpp17 6]
2 In all respects of the language, each alternative token behaves the same, respectively, as its primary token, except for its spelling.[cpp17 7] The set of alternative tokens is defined in Table 1.
Table 1 — Alternative tokens
Alternative Primary Alternative Primary Alternative Primary
<% { and && and_eq &=
%> } bitor | or_eq |=
<: [ or || xor_eq ^=
:> ] xor ^ not !
%: # compl ~ not_eq !=
%:%: ## bitand &

5.6 Tokens [lex.token]

token:

identifier keyword literal
operator punctuator
1 There are five kinds of tokens: identifiers, keywords, literals,[cpp17 8] operators, and other separators. Blanks, horizontal and vertical tabs, newlines, formfeeds, and comments (collectively, “white space”), as described below, are ignored except as they serve to separate tokens. NoteSome white space is required to separate otherwise adjacent identifiers, keywords, numeric literals, and alternative tokens containing alphabetic characters.

5.7 Comments [lex.comment]

1 The characters /* start a comment, which terminates with the characters */. These comments do not nest. The characters // start a comment, which terminates immediately before the next new-line character. If there is a form-feed or a vertical-tab character in such a comment, only white-space characters shall appear between it and the new-line that terminates the comment; no diagnostic is required. NoteThe comment characters //, /*, and */ have no special meaning within a // comment and are treated just like other characters. Similarly, the comment characters // and /* have no special meaning within a /* comment.

5.8 Header names [lex.header]

header-name:

< h-char-sequence >
" q-char-sequence "

h-char-sequence:

h-char
h-char-sequence h-char

h-char:

any member of the source character set except new-line and >

q-char-sequence:

q-char
q-char-sequence q-char

q-char:

any member of the source character set except new-line and "
1 NoteHeader name preprocessing tokens only appear within a #include preprocessing directive (see [lex.pptoken]). The sequences in both forms of header-names are mapped in an implementation-defined manner to headers or to external source file names as specified in [cpp.include].
2 The appearance of either of the characters ' or \ or of either of the character sequences /* or // in a q-char-sequence or an h-char-sequence is conditionally-supported with implementation-defined semantics, as is the appearance of the character " in an h-char-sequence.[cpp17 9]

5.9 Preprocessing numbers [lex.ppnumber]

1 Preprocessing number tokens lexically include all integer literal tokens and all floating literal tokens.
2 A preprocessing number does not have a type or a value; it acquires both after a successful conversion to an integer literal token or a floating literal token.

5.10 Identifiers [lex.name]

identifier-nondigit:

nondigit
universal-character-name

nondigit: one of

a b c d e f g h i j k l m
n o p q r s t u v w x y z
A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z _

digit: one of

0 1 2 3 4 5 6 7 8 9
1 An identifier is an arbitrarily long sequence of letters and digits. Each universal-character-name in an identifier shall designate a character whose encoding in ISO 10646 falls into one of the ranges specified in Table 2. The initial element shall not be a universal-character-name designating a character whose encoding falls into one of the ranges specified in Table 3. Upper- and lower-case letters are different. All characters are significant.[cpp17 10]
Table 2 — Ranges of characters allowed
00A8 00AA 00AD 00AF 00B2-00B5
00B7-00BA 00BC-00BE 00C0-00D6 00D8-00F6 00F8-00FF
0100-167F 1681-180D 180F-1FFF
200B-200D 202A-202E 203F-2040 2054 2060-206F
2070-218F 2460-24FF 2776-2793 2C00-2DFF 2E80-2FFF
3004-3007 3021-302F 3031-D7FF
F900-FD3D FD40-FDCF FDF0-FE44 FE47-FFFD
10000-1FFFD 20000-2FFFD 30000-3FFFD 40000-4FFFD 50000-5FFFD
60000-6FFFD 70000-7FFFD 80000-8FFFD 90000-9FFFD A0000-AFFFD
B0000-BFFFD C0000-CFFFD D0000-DFFFD E0000-EFFFD
Table 3 — Ranges of characters disallowed initially (combining characters)
0300-036F 1DC0-1DFF 20D0-20FF FE20-FE2F
2 The identifiers in Table 4 have a special meaning when appearing in a certain context. When referred to in the grammar, these identifiers are used explicitly rather than using the identifier grammar production. Unless otherwise specified, any ambiguity as to whether a given identifier has a special meaning is resolved to interpret the token as a regular identifier.
Table 4 — Identifiers with special meaning
override final
3 In addition, some identifiers are reserved for use by C++ implementations and shall not be used otherwise; no diagnostic is required.
(3.1)
  • Each identifier that contains a double underscore __ or begins with an underscore followed by an uppercase letter is reserved to the implementation for any use.

(3.2)
  • Each identifier that begins with an underscore is reserved to the implementation for use as a name in the global namespace.

5.11 Keywords [lex.key]

1 The identifiers shown in Table 5 are reserved for use as keywords (that is, they are unconditionally treated as keywords in phase 7) except in an attribute-token:
Table 5 — Keywords
alignas continue friend register true
alignof decltype goto reinterpret_cast try
asm default if return typedef
auto delete inline short typeid
bool do int signed typename
break double long sizeof union
case dynamic_cast mutable static unsigned
catch else namespace static_assert using
char enum new static_cast virtual
char16_t explicit noexcept struct void
char32_t export nullptr switch volatile
class extern operator template wchar_t
const false private this while
constexpr float protected thread_local
const_cast for public throw

NoteThe export and register keywords are unused but are reserved for future use.

2 Furthermore, the alternative representations shown in Table 6 for certain operators and punctuators ([lex.digraph]) are reserved and shall not be used otherwise:
Table 6 — Alternative representations
and and_eq bitand bitor compl not
not_eq or or_eq xor xor_eq

5.12 Operators and punctuators [lex.operators]

1 The lexical representation of C++ programs includes a number of preprocessing tokens which are used in the syntax of the preprocessor or are converted into tokens for operators and punctuators:
<span id="nt:preprocessing-op-or-punc">preprocessing-op-or-punc:</span> one of { } [ ] # ## ( ) <: :> <% %> %: %:%: ; : ... new delete ? :: . .{{*}} + - {{*}} / % ^ & {{or}} ~ ! {{=}} < > +{{=}} -{{=}} {{*}}{{=}} /{{=}} %{{=}} ^{{=}} &{{=}} {{or}}{{=}} << >> >>{{=}} <<{{=}} {{=}}{{=}} !{{=}} <{{=}} >{{=}} && {{or}}{{or}} ++ -- , ->{{*}} -> and and_eq bitand bitor compl not not_eq or or_eq xor xor_eq 

Each preprocessing-op-or-punc is converted to a single token in translation phase 7.

5.13.1 Kinds of literals [lex.literal.kinds]

1 There are several kinds of literals.[cpp17 11]

5.13.2 Integer literals [lex.icon]

octal-literal:

0
octal-literal 'opt octal-digit

decimal-literal:

nonzero-digit
decimal-literal 'opt digit

binary-digit:

0
1

octal-digit: one of

0 1 2 3 4 5 6 7

nonzero-digit: one of

1 2 3 4 5 6 7 8 9

hexadecimal-prefix: one of

0x 0X

hexadecimal-digit: one of

0 1 2 3 4 5 6 7 8 9
a b c d e f
A B C D E F

unsigned-suffix: one of

u U

long-suffix: one of

l L

long-long-suffix: one of

ll LL
1 An integer literal is a sequence of digits that has no period or exponent part, with optional separating single quotes that are ignored when determining its value. An integer literal may have a prefix that specifies its base and a suffix that specifies its type. The lexically first digit of the sequence of digits is the most significant. A binary integer literal (base two) begins with 0b or 0B and consists of a sequence of binary digits. An octal integer literal (base eight) begins with the digit 0 and consists of a sequence of octal digits.[cpp17 12] A decimal integer literal (base ten) begins with a digit other than 0 and consists of a sequence of decimal digits. A hexadecimal integer literal (base sixteen) begins with 0x or 0X and consists of a sequence of hexadecimal digits, which include the decimal digits and the letters a through f and A through F with decimal values ten through fifteen.
ExampleThe number twelve can be written 12, 014, 0XC, or 0b1100. The integer literals 1048576, 1'048'576, 0X100000, 0x10'0000, and 0'004'000'000 all have the same value.
2 The type of an integer literal is the first of the corresponding list in Table 7 in which its value can be represented.
Table 7 — Types of integer literals
Suffix Decimal literal Binary, octal, or hexadecimal literal
none int int
long int unsigned int
long long int long int
unsigned long int
long long int
unsigned long long int
u or U unsigned int unsigned int
unsigned long int unsigned long int
unsigned long long int unsigned long long int
l or L long int long int
long long int unsigned long int
long long int
unsigned long long int
Both u or U unsigned long int unsigned long int
and l or L unsigned long long int unsigned long long int
ll or LL long long int long long int
unsigned long long int
Both u or U unsigned long long int unsigned long long int
and ll or LL
3 If an integer literal cannot be represented by any type in its list and an extended integer type can represent its value, it may have that extended integer type. If all of the types in the list for the integer literal are signed, the extended integer type shall be signed. If all of the types in the list for the integer literal are unsigned, the extended integer type shall be unsigned. If the list contains both signed and unsigned types, the extended integer type may be signed or unsigned. A program is ill-formed if one of its translation units contains an integer literal that cannot be represented by any of the allowed types.

5.13.3 Character literals [lex.ccon]

character-literal:

encoding-prefixopt ' c-char-sequence '

encoding-prefix: one of

u8  u  U  L

c-char-sequence:

c-char
c-char-sequence c-char

c-char:

any member of the source character set except
the single-quote ', backslash \, or new-line character
escape-sequence
universal-character-name

simple-escape-sequence: one of

\'  \"  \?  \\
\a  \b  \f  \n  \r  \t  \v
1 A character literal is one or more characters enclosed in single quotes, as in 'x', optionally preceded by u8, u, U, or L, as in u8'w', u'x', U'y', or L'z', respectively.
2 A character literal that does not begin with u8, u, U, or L is an ordinary character literal. An ordinary character literal that contains a single c-char representable in the execution character set has type char, with value equal to the numerical value of the encoding of the c-char in the execution character set. An ordinary character literal that contains more than one c-char is a multicharacter literal. A multicharacter literal, or an ordinary character literal containing a single c-char not representable in the execution character set, is conditionally-supported, has type int, and has an implementation-defined value.
3 A character literal that begins with u8, such as u8'w', is a character literal of type char, known as a UTF-8 character literal. The value of a UTF-8 character literal is equal to its ISO 10646 code point value, provided that the code point value is representable with a single UTF-8 code unit (that is, provided it is in the C0 Controls and Basic Latin Unicode block). If the value is not representable with a single UTF-8 code unit, the program is ill-formed. A UTF-8 character literal containing multiple c-chars is ill-formed.
4 A character literal that begins with the letter u, such as u'x', is a character literal of type char16_t. The value of a char16_t character literal containing a single c-char is equal to its ISO 10646 code point value, provided that the code point is representable with a single 16-bit code unit. (That is, provided it is a basic multi-lingual plane code point.) If the value is not representable within 16 bits, the program is ill-formed. A char16_t character literal containing multiple c-chars is ill-formed.
5 A character literal that begins with the letter U, such as U'y', is a character literal of type char32_t. The value of a char32_t character literal containing a single c-char is equal to its ISO 10646 code point value. A char32_t character literal containing multiple c-chars is ill-formed.
6 A character literal that begins with the letter L, such as L'z', is a wide-character literal. A wide-character literal has type wchar_t.[cpp17 13] The value of a wide-character literal containing a single c-char has value equal to the numerical value of the encoding of the c-char in the execution wide-character set, unless the c-char has no representation in the execution wide-character set, in which case the value is implementation-defined. NoteThe type wchar_t is able to represent all members of the execution wide-character set (see [basic.fundamental]). The value of a wide-character literal containing multiple c-chars is implementation-defined.
7 Certain non-graphic characters, the single quote ', the double quote ", the question mark ?,[cpp17 14] and the backslash \, can be represented according to Table 8. The double quote " and the question mark ?, can be represented as themselves or by the escape sequences \" and \? respectively, but the single quote ' and the backslash \ shall be represented by the escape sequences \' and \\ respectively. Escape sequences in which the character following the backslash is not listed in Table 8 are conditionally-supported, with implementation-defined semantics. An escape sequence specifies a single character.
Table 8 — Escape sequences
new-line NL(LF) \n
horizontal tab HT \t
vertical tab VT \v
backspace BS \b
carriage return CR \r
form feed FF \f
alert BEL \a
backslash \ \\
question mark ? \?
single quote ' \'
double quote " \"
octal number ooo \ooo
hex number hhh \xhhh
8 The escape \ooo consists of the backslash followed by one, two, or three octal digits that are taken to specify the value of the desired character. The escape \xhhh consists of the backslash followed by x followed by one or more hexadecimal digits that are taken to specify the value of the desired character. There is no limit to the number of digits in a hexadecimal sequence. A sequence of octal or hexadecimal digits is terminated by the first character that is not an octal digit or a hexadecimal digit, respectively. The value of a character literal is implementation-defined if it falls outside of the implementation-defined range defined for char (for character literals with no prefix) or wchar_t (for character literals prefixed by L). NoteIf the value of a character literal prefixed by u, u8, or U is outside the range defined for its type, the program is ill-formed.
9 A universal-character-name is translated to the encoding, in the appropriate execution character set, of the character named. If there is no such encoding, the universal-character-name is translated to an implementation-defined encoding. NoteIn translation phase 1, a universal-character-name is introduced whenever an actual extended character is encountered in the source text. Therefore, all extended characters are described in terms of universal-character-names. However, the actual compiler implementation may use its own native character set, so long as the same results are obtained.

5.13.4 Floating literals [lex.fcon]

fractional-constant:

digit-sequenceopt . digit-sequence
digit-sequence .

exponent-part:

e signopt digit-sequence
E signopt digit-sequence

binary-exponent-part:

p signopt digit-sequence
P signopt digit-sequence

sign: one of

+ -

digit-sequence:

digit
digit-sequence 'opt digit

floating-suffix: one of

f l F L
1 A floating literal consists of an optional prefix specifying a base, an integer part, a radix point, a fraction part, an e, E, p or P, an optionally signed integer exponent, and an optional type suffix. The integer and fraction parts both consist of a sequence of decimal (base ten) digits if there is no prefix, or hexadecimal (base sixteen) digits if the prefix is 0x or 0X. The floating literal is a decimal floating literal in the former case and a hexadecimal floating literal in the latter case. Optional separating single quotes in a digit-sequence or hexadecimal-digit-sequence are ignored when determining its value.
ExampleThe floating literals 1.602'176'565e-19 and 1.602176565e-19 have the same value.
Either the integer part or the fraction part (not both) can be omitted. Either the radix point or the letter e or E and the exponent (not both) can be omitted from a decimal floating literal. The radix point (but not the exponent) can be omitted from a hexadecimal floating literal. The integer part, the optional radix point, and the optional fraction part, form the significand of the floating literal. In a decimal floating literal, the exponent, if present, indicates the power of 10 by which the significand is to be scaled. In a hexadecimal floating literal, the exponent indicates the power of 2 by which the significand is to be scaled.
ExampleThe floating literals 49.625 and 0xC.68p+2 have the same value.
If the scaled value is in the range of representable values for its type, the result is the scaled value if representable, else the larger or smaller representable value nearest the scaled value, chosen in an implementation-defined manner. The type of a floating literal is double

unless explicitly specified by a suffix. The suffixes f and F specify float, the suffixes l and L specify long double. If the scaled value is not in the range of representable values for its type, the program is ill-formed.

5.13.5 String literals [lex.string]

s-char-sequence:

s-char
s-char-sequence s-char

s-char:

any member of the source character set except
the double-quote ", backslash \, or new-line character
escape-sequence
universal-character-name

raw-string:

" d-char-sequenceopt ( r-char-sequenceopt ) d-char-sequenceopt "

r-char-sequence:

r-char
r-char-sequence r-char

r-char:

any member of the source character set, except
a right parenthesis ) followed by the initial d-char-sequence
(which may be empty) followed by a double quote ".

d-char-sequence:

d-char
d-char-sequence d-char

d-char:

any member of the basic source character set except:
space, the left parenthesis (, the right parenthesis ), the backslash \,
and the control characters representing horizontal tab,
vertical tab, form feed, and newline.
1 A string-literal is a sequence of characters (as defined in [lex.ccon]) surrounded by double quotes, optionally prefixed by R, u8, u8R, u, uR, U, UR, L, or LR, as in "...", R"(...)", u8"...", u8R"**(...)**", u"...", uR"*~(...)*~", U"...", UR"zzz(...)zzz", L"...", or LR"(...)", respectively.
2 A string-literal that has an R

in the prefix is a raw string literal. The d-char-sequence serves as a delimiter. The terminating d-char-sequence of a raw-string is the same sequence of characters as the initial d-char-sequence. A d-char-sequence shall consist of at most 16 characters.

3 NoteThe characters '(' and ')' are permitted in a raw-string. Thus, R"delimiter((a|b))delimiter" is equivalent to "(a|b)".
4
NoteA source-file new-line in a raw string literal results in a new-line in the resulting execution string literal. Assuming no whitespace at the beginning of lines in the following example, the assert will succeed:
const char* p = R"(a\
b
c)";
assert(std::strcmp(p, "a\\\nb\nc") == 0);
5
ExampleThe raw string
R"a(
)\
a"
)a"

is equivalent to "\n)\\\na\"\n". The raw string

R"(??)"

is equivalent to "\?\?". The raw string

R"#(
)??="
)#"
is equivalent to "\n)\?\?=\"\n".
6 After translation phase 6, a string-literal that does not begin with an encoding-prefix is an ordinary string literal, and is initialized with the given characters.
7 A string-literal that begins with u8, such as u8"asdf", is a UTF-8 string literal.
8 Ordinary string literals and UTF-8 string literals are also referred to as narrow string literals. A narrow string literal has type “array of n const char”, where n is the size of the string as defined below, and has static storage duration.
9 For a UTF-8 string literal, each successive element of the object representation has the value of the corresponding code unit of the UTF-8 encoding of the string.
10 A string-literal that begins with u, such as u"asdf", is a char16_t string literal. A char16_t string literal has type “array of n const char16_t”, where n is the size of the string as defined below; it is initialized with the given characters. A single c-char may produce more than one char16_t character in the form of surrogate pairs.
11 A string-literal that begins with U, such as U"asdf", is a char32_t string literal. A char32_t string literal has type “array of n const char32_t”, where n is the size of the string as defined below; it is initialized with the given characters.
12 A string-literal that begins with L, such as L"asdf", is a wide string literal. A wide string literal has type “array of n const wchar_t”, where n is the size of the string as defined below; it is initialized with the given characters.
13 In translation phase 6, adjacent string-literals are concatenated. If both string-literals have the same encoding-prefix, the resulting concatenated string literal has that encoding-prefix. If one string-literal has no encoding-prefix, it is treated as a string-literal of the same encoding-prefix as the other operand. If a UTF-8 string literal token is adjacent to a wide string literal token, the program is ill-formed. Any other concatenations are conditionally-supported with implementation-defined behavior. NoteThis concatenation is an interpretation, not a conversion. Because the interpretation happens in translation phase 6 (after each character from a string literal has been translated into a value from the appropriate character set), a string-literal's initial rawness has no effect on the interpretation or well-formedness of the concatenation. Table 9 has some examples of valid concatenations.
Table 9 — String literal concatenations
Source Means Source Means Source Means
u"a" u"b" u"ab" U"a" U"b" U"ab" L"a" L"b" L"ab"
u"a" "b" u"ab" U"a" "b" U"ab" L"a" "b" L"ab"
"a" u"b" u"ab" "a" U"b" U"ab" "a" L"b" L"ab"

Characters in concatenated strings are kept distinct.

Example
"\xA" "B"
contains the two characters '\xA' and 'B' after concatenation (and not the single hexadecimal character '\xAB').
14 After any necessary concatenation, in translation phase 7, '\0' is appended to every string literal so that programs that scan a string can find its end.
15 Escape sequences and universal-character-names in non-raw string literals have the same meaning as in character literals, except that the single quote ' is representable either by itself or by the escape sequence \', and the double quote " shall be preceded by a \, and except that a universal-character-name in a char16_t string literal may yield a surrogate pair. In a narrow string literal, a universal-character-name may map to more than one char element due to multibyte encoding. The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U'\0' or L'\0'. The size of a char16_t string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for each character requiring a surrogate pair, plus one for the terminating u'\0'. NoteThe size of a char16_t string literal is the number of code units, not the number of characters. Within char32_t and char16_t string literals, any universal-character-names shall be within the range 0x0 to 0x10FFFF. The size of a narrow string literal is the total number of escape sequences and other characters, plus at least one for the multibyte encoding of each universal-character-name, plus one for the terminating '\0'.
16 Evaluating a string-literal results in a string literal object with static storage duration, initialized from the given characters as specified above. Whether all string literals are distinct (that is, are stored in nonoverlapping objects) and whether successive evaluations of a string-literal yield the same or a different object is unspecified.
NoteThe effect of attempting to modify a string literal is undefined.

5.13.6 Boolean literals [lex.bool]

boolean-literal:

false
true
1 The Boolean literals are the keywords false and true. Such literals are prvalues and have type bool.

5.13.7 Pointer literals [lex.nullptr]

pointer-literal:

nullptr
1 The pointer literal is the keyword nullptr. It is a prvalue of type std​::​nullptr_t.
Notestd​::​nullptr_t is a distinct type that is neither a pointer type nor a pointer to member type; rather, a prvalue of this type is a null pointer constant and can be converted to a null pointer value or null member pointer value. See [conv.ptr] and [conv.mem].

5.13.8 User-defined literals [lex.ext]

pointer-literal:

nullptr
1 The pointer literal is the keyword nullptr. It is a prvalue of type std​::​nullptr_t.
Notestd​::​nullptr_t is a distinct type that is neither a pointer type nor a pointer to member type; rather, a prvalue of this type is a null pointer constant and can be converted to a null pointer value or null member pointer value. See [conv.ptr] and [conv.mem].

  1. Implementations must behave as if these separate phases occur, although in practice different phases might be folded together.
  2. A partial preprocessing token would arise from a source file ending in the first portion of a multi-character token that requires a terminating sequence of characters, such as a header-name that is missing the closing " or >. A partial comment would arise from a source file ending with an unclosed /* comment.
  3. An implementation need not convert all non-corresponding source characters to the same execution character.
  4. The glyphs for the members of the basic source character set are intended to identify characters from the subset of ISO/IEC 10646 which corresponds to the ASCII character set. However, because the mapping from source file characters to the source character set (described in translation phase 1) is specified as implementation-defined, an implementation is required to document how the basic source characters are represented in source files.
  5. A sequence of characters resembling a universal-character-name in an r-char-sequence does not form a universal-character-name.
  6. These include “digraphs” and additional reserved words. The term “digraph” (token consisting of two characters) is not perfectly descriptive, since one of the alternative preprocessing-tokens is %:%: and of course several primary tokens contain two characters. Nonetheless, those alternative tokens that aren't lexical keywords are colloquially known as “digraphs”.
  7. Thus the “stringized” values of [ and <: will be different, maintaining the source spelling, but the tokens can otherwise be freely interchanged.
  8. Literals include strings and character and numeric literals.
  9. Thus, a sequence of characters that resembles an escape sequence might result in an error, be interpreted as the character corresponding to the escape sequence, or have a completely different meaning, depending on the implementation.
  10. On systems in which linkers cannot accept extended characters, an encoding of the universal-character-name may be used in forming valid external identifiers. For example, some otherwise unused character or sequence of characters may be used to encode the \u in a universal-character-name. Extended characters may produce a long external identifier, but C++ does not place a translation limit on significant characters for external identifiers. In C++, upper- and lower-case letters are considered different for all identifiers, including external identifiers.
  11. The term “literal” generally designates, in this International Standard, those tokens that are called “constants” in ISO C.
  12. The digits 8 and 9 are not octal digits.
  13. They are intended for character sets where a character does not fit into a single byte.
  14. Using an escape sequence for a question mark is supported for compatibility with ISO C++ 2014 and ISO C.


[edit]

Source: https://timsong-cpp.github.io/cppwp/n4868/lex

1 Scope [intro.scope]
2 Normative references [intro.refs]
3 Terms and definitions [intro.defs]
4 General principles [intro]
5 Lexical conventions [lex]
5.1 Separate translation [lex.separate]
5.2 Phases of translation [lex.phases]
5.3 Character sets [lex.charset]
5.4 Preprocessing tokens [lex.pptoken]
5.5 Alternative tokens [lex.digraph]
5.6 Tokens [lex.token]
5.7 Comments [lex.comment]
5.8 Header names [lex.header]
5.9 Preprocessing numbers [lex.ppnumber]
5.10 Identifiers [lex.name]
5.11 Keywords [lex.key]
5.12 Operators and punctuators [lex.operators]
5.13 Literals [lex.literal]
6 Basics [basic]
7 Expressions [expr]
8 Statements [stmt.stmt]
9 Declarations [dcl.dcl]
10 Modules [module]
11 Classes [class]
12 Overloading [over]
13 Templates [temp]
14 Exception handling [except]
15 Preprocessing directives [cpp]
16 Library introduction [library]
17 Language support library [support]
18 Concepts library [concepts]
19 Diagnostics library [diagnostics]
20 General utilities library [utilities]
21 Strings library [strings]
22 Containers library [containers]
23 Iterators library [iterators]
24 Ranges library [ranges]
25 Algorithms library [algorithms]
26 Numerics library [numerics]
27 Time library [time]
28 Localization library [localization]
29 Input/output library [input.output]
30 Regular expressions library [re]
31 Atomic operations library [atomics]
32 Thread support library [thread]
Annex A (informative) Grammar summary [gram]
Annex B (normative) Implementation quantities [implimits]
Annex C (informative) Compatibility [diff]
Annex D (normative) Compatibility features [depr]
Index [generalindex]
Index of grammar productions [grammarindex]
Index of library headers [headerindex]
Index of library names [libraryindex]
Index of library concepts [conceptindex]
Index of implementation-defined behavior [impldefindex]

5.1 Separate translation [lex.separate]

1 The text of the program is kept in units called source files in this document. A source file together with all the headers and source files included via the preprocessing directive #include, less any source lines skipped by any of the conditional inclusion preprocessing directives, is called a translation unit.

NoteA C++ program need not all be translated at the same time.

2 NotePreviously translated translation units and instantiation units can be preserved individually or in libraries. The separate translation units of a program communicate ([basic.link]) by (for example) calls to functions whose identifiers have external or module linkage, manipulation of objects whose identifiers have external or module linkage, or manipulation of data files. Translation units can be separately translated and then later linked to produce an executable program.

5.2 Phases of translation [lex.phases]

1 The precedence among the syntax rules of translation is specified by the following phases.[cpp20 1]
  1. Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. Any source file character not in the basic source character set is replaced by the universal-character-name that designates that character. An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (e.g., using the \uXXXX notation), are handled equivalently except where this replacement is reverted ([lex.pptoken]) in a raw string literal.
  2. Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines. Only the last backslash on any physical source line shall be eligible for being part of such a splice. Except for splices reverted in a raw string literal, if a splice results in a character sequence that matches the syntax of a universal-character-name, the behavior is undefined. A source file that is not empty and that does not end in a new-line character, or that ends in a new-line character immediately preceded by a backslash character before any such splicing takes place, shall be processed as if an additional new-line character were appended to the file.
  3. The source file is decomposed into preprocessing tokens ([lex.pptoken]) and sequences of whitespace characters (including comments). A source file shall not end in a partial preprocessing token or in a partial comment.[cpp20 2] Each comment is replaced by one space character. New-line characters are retained. Whether each nonempty sequence of whitespace characters other than new-line is retained or replaced by one space character is unspecified. The process of dividing a source file's characters into preprocessing tokens is context-dependent.
    ExampleSee the handling of < within a #include preprocessing directive.
  4. Preprocessing directives are executed, macro invocations are expanded, and _Pragma unary operator expressions are executed. If a character sequence that matches the syntax of a universal-character-name is produced by token concatenation, the behavior is undefined. A #include preprocessing directive causes the named header or source file to be processed from phase 1 through phase 4, recursively. All preprocessing directives are then deleted.
  5. Each basic source character set member in a character-literal or a string-literal, as well as each escape sequence and universal-character-name in a character-literal or a non-raw string literal, is converted to the corresponding member of the execution character set ([lex.ccon], [lex.string]); if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.[cpp20 3]
  6. Adjacent string literal tokens are concatenated.
  7. White-space characters separating tokens are no longer significant. Each preprocessing token is converted into a token ([lex.token]). The resulting tokens are syntactically and semantically analyzed and translated as a translation unit. NoteThe process of analyzing and translating the tokens can occasionally result in one token being replaced by a sequence of other tokens ([temp.names]). It is implementation-defined whether the sources for module units and header units on which the current translation unit has an interface dependency ([module.unit], [module.import]) are required to be available. NoteSource files, translation units and translated translation units need not necessarily be stored as files, nor need there be any one-to-one correspondence between these entities and any external representation. The description is conceptual only, and does not specify any particular implementation.
  8. Translated translation units and instantiation units are combined as follows: NoteSome or all of these can be supplied from a library. Each translated translation unit is examined to produce a list of required instantiations. NoteThis can include instantiations which have been explicitly requested ([temp.explicit]). The definitions of the required templates are located. It is implementation-defined whether the source of the translation units containing these definitions is required to be available. NoteAn implementation can choose to encode sufficient information into the translated translation unit so as to ensure the source is not required here. All the required instantiations are performed to produce instantiation units. NoteThese are similar to translated translation units, but contain no references to uninstantiated templates and no template definitions. The program is ill-formed if any instantiation fails.
  9. All external entity references are resolved. Library components are linked to satisfy external references to entities not defined in the current translation. All such translator output is collected into a program image which contains information needed for execution in its execution environment.

5.3 Character sets [lex.charset]

1 The basic source character set consists of 96 characters: the space character, the control characters representing horizontal tab, vertical tab, form feed, and new-line, plus the following 91 graphical characters:[cpp20 4]
 a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 0 1 2 3 4 5 6 7 8 9 _ { } [ ] # ( ) < > % : ; . ? {{*}} + - / ^ & {{or}} ~ ! {{=}} , \ " {{'}} 
2 The universal-character-name construct provides a way to name other characters.

A universal-character-name designates the character in ISO/IEC 10646 (if any) whose code point is the hexadecimal number represented by the sequence of hexadecimal-digits in the universal-character-name. The program is ill-formed if that number is not a code point or if it is a surrogate code point. Noncharacter code points and reserved code points are considered to designate separate characters distinct from any ISO/IEC 10646 character. If a universal-character-name outside the c-char-sequence, s-char-sequence, or r-char-sequence of a character-literal or string-literal (in either case, including within a user-defined-literal) corresponds to a control character or to a character in the basic source character set, the program is ill-formed.[cpp20 5] NoteISO/IEC 10646 code points are integers in the range [0,10FFFF] (hexadecimal). A surrogate code point is a value in the range [D800,DFFF] (hexadecimal). A control character is a character whose code point is in either of the ranges [0,1F] or [7F,9F] (hexadecimal).

3 The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose value is 0. For each basic execution character set, the values of the members shall be non-negative and distinct from one another. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous. The execution character set and the execution wide-character set are implementation-defined supersets of the basic execution character set and the basic execution wide-character set, respectively. The values of the members of the execution character sets and the sets of additional members are locale-specific.

5.4 Preprocessing tokens [lex.pptoken]

preprocessing-token:

header-name
import-keyword
module-keyword
export-keyword
identifier
pp-number
character-literal
user-defined-character-literal
string-literal
user-defined-string-literal
preprocessing-op-or-punc
each non-whitespace character that cannot be one of the above
1 Each preprocessing token that is converted to a token shall have the lexical form of a keyword, an identifier, a literal, or an operator or punctuator.
2 A preprocessing token is the minimal lexical element of the language in translation phases 3 through 6. The categories of preprocessing token are: header names, placeholder tokens produced by preprocessing import and module directives (import-keyword, module-keyword, and export-keyword), identifiers, preprocessing numbers, character literals (including user-defined character literals), string literals (including user-defined string literals), preprocessing operators and punctuators, and single non-whitespace characters that do not lexically match the other preprocessing token categories. If a ' or a " character matches the last category, the behavior is undefined. Preprocessing tokens can be separated by whitespace; this consists of comments ([lex.comment]), or whitespace characters (space, horizontal tab, new-line, vertical tab, and form-feed), or both. As described in [cpp], in certain circumstances during translation phase 4, whitespace (or the absence thereof) serves as more than preprocessing token separation. White space can appear within a preprocessing token only as part of a header name or between the quotation characters in a character literal or string literal.
3 If the input stream has been parsed into preprocessing tokens up to a given character:
(3.1)
  • If the next character begins a sequence of characters that could be the prefix and initial double quote of a raw string literal, such as R", the next preprocessing token shall be a raw string literal. Between the initial and final double quote characters of the raw string, any transformations performed in phases 1 and 2 (universal-character-names and line splicing) are reverted; this reversion shall apply before any d-char, r-char, or delimiting parenthesis is identified. The raw string literal is defined as the shortest sequence of characters that matches the raw-string pattern

    encoding-prefixopt R raw-string

(3.2)
  • Otherwise, if the next three characters are <​::​ and the subsequent character is neither : nor >, the < is treated as a preprocessing token by itself and not as the first character of the alternative token <:.

(3.3)
  • Otherwise, the next preprocessing token is the longest sequence of characters that could constitute a preprocessing token, even if that would cause further lexical analysis to fail, except that a header-name ([lex.header]) is only formed

(3.3.1)
(3.3.2)
  • within a has-include-expression.

    Example
    #define R "x"
    const char* s = R"y";           // ill-formed raw string, not "x" "y"
    
4 The import-keyword is produced by processing an import directive ([cpp.import]), the module-keyword is produced by preprocessing a module directive ([cpp.module]), and the export-keyword is produced by preprocessing either of the previous two directives.

NoteNone has any observable spelling.

5
ExampleThe program fragment 0xe+foo is parsed as a preprocessing number token (one that is not a valid integer-literal or floating-point-literal token), even though a parse as three preprocessing tokens 0xe, +, and foo might produce a valid expression (for example, if foo were a macro defined as 1). Similarly, the program fragment 1E1 is parsed as a preprocessing number (one that is a valid floating-point-literal token), whether or not E is a macro name.
6
ExampleThe program fragment x+++++y is parsed as x ++ ++ + y, which, if x and y have integral types, violates a constraint on increment operators, even though the parse x ++ + ++ y might yield a correct expression.

5.5 Alternative tokens [lex.digraph]

1 Alternative token representations are provided for some operators and punctuators.[cpp20 6]
2 In all respects of the language, each alternative token behaves the same, respectively, as its primary token, except for its spelling.[cpp20 7] The set of alternative tokens is defined in Table 1.
Table 1: Alternative tokens [tab:lex.digraph]
Alternative Primary Alternative Primary Alternative Primary
<% { and && and_eq &=
%> } bitor | or_eq |=
<: [ or || xor_eq ^=
:> ] xor ^ not !
%: # compl ~ not_eq !=
%:%: ## bitand &

5.6 Tokens [lex.token]

1 There are five kinds of tokens: identifiers, keywords, literals,[cpp20 8] operators, and other separators. Blanks, horizontal and vertical tabs, newlines, formfeeds, and comments (collectively, “whitespace”), as described below, are ignored except as they serve to separate tokens.

NoteSome whitespace is required to separate otherwise adjacent identifiers, keywords, numeric literals, and alternative tokens containing alphabetic characters.

5.7 Comments [lex.comment]

1 The characters /* start a comment, which terminates with the characters */. These comments do not nest. The characters // start a comment, which terminates immediately before the next new-line character. If there is a form-feed or a vertical-tab character in such a comment, only whitespace characters shall appear between it and the new-line that terminates the comment; no diagnostic is required.

NoteThe comment characters //, /*, and */ have no special meaning within a // comment and are treated just like other characters. Similarly, the comment characters // and /* have no special meaning within a /* comment.

5.8 Header names [lex.header]

h-char:

any member of the source character set except new-line and >

q-char:

any member of the source character set except new-line and "
1 NoteHeader name preprocessing tokens only appear within a #include preprocessing directive, a __has_include preprocessing expression, or after certain occurrences of an import token (see [lex.pptoken]).

The sequences in both forms of header-names are mapped in an implementation-defined manner to headers or to external source file names as specified in [cpp.include].

2 The appearance of either of the characters ' or \ or of either of the character sequences /* or // in a q-char-sequence or an h-char-sequence is conditionally-supported with implementation-defined semantics, as is the appearance of the character " in an h-char-sequence.[cpp20 9]

5.9 Preprocessing numbers [lex.ppnumber]

1 Preprocessing number tokens lexically include all integer-literal tokens ([lex.icon]) and all floating-point-literal tokens ([lex.fcon]).
2 A preprocessing number does not have a type or a value; it acquires both after a successful conversion to an integer-literal token or a floating-point-literal token.

5.10 Identifiers [lex.name]

nondigit: one of

a b c d e f g h i j k l m
n o p q r s t u v w x y z
A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z _

digit: one of

0 1 2 3 4 5 6 7 8 9
1 An identifier is an arbitrarily long sequence of letters and digits. Each universal-character-name in an identifier shall designate a character whose encoding in ISO/IEC 10646 falls into one of the ranges specified in Table 2. The initial element shall not be a universal-character-name designating a character whose encoding falls into one of the ranges specified in Table 3. Upper- and lower-case letters are different. All characters are significant.[cpp20 10]
Table 2: Ranges of characters allowed [tab:lex.name.allowed]
00A8 00AA 00AD 00AF 00B2-00B5
00B7-00BA 00BC-00BE 00C0-00D6 00D8-00F6 00F8-00FF
0100-167F 1681-180D 180F-1FFF
200B-200D 202A-202E 203F-2040 2054 2060-206F
2070-218F 2460-24FF 2776-2793 2C00-2DFF 2E80-2FFF
3004-3007 3021-302F 3031-D7FF
F900-FD3D FD40-FDCF FDF0-FE44 FE47-FFFD
10000-1FFFD 20000-2FFFD 30000-3FFFD 40000-4FFFD 50000-5FFFD
60000-6FFFD 70000-7FFFD 80000-8FFFD 90000-9FFFD A0000-AFFFD
B0000-BFFFD C0000-CFFFD D0000-DFFFD E0000-EFFFD
Table 3: Ranges of characters disallowed initially (combining characters) [tab:lex.name.disallowed]
0300-036F 1DC0-1DFF 20D0-20FF FE20-FE2F
2 The identifiers in Table 4 have a special meaning when appearing in a certain context. When referred to in the grammar, these identifiers are used explicitly rather than using the identifier grammar production. Unless otherwise specified, any ambiguity as to whether a given identifier has a special meaning is resolved to interpret the token as a regular identifier.
Table 4: Identifiers with special meaning [tab:lex.name.special]
final import module override
3 In addition, some identifiers are reserved for use by C++ implementations and shall not be used otherwise; no diagnostic is required.
(3.1)
  • Each identifier that contains a double underscore __ or begins with an underscore followed by an uppercase letter is reserved to the implementation for any use.

(3.2)
  • Each identifier that begins with an underscore is reserved to the implementation for use as a name in the global namespace.

5.11 Keywords [lex.key]

keyword:

any identifier listed in Table 5
import-keyword
module-keyword
export-keyword
1 The identifiers shown in Table 5 are reserved for use as keywords (that is, they are unconditionally treated as keywords in phase 7) except in an attribute-token ([dcl.attr.grammar]).

NoteThe register keyword is unused but is reserved for future use.

Table 5: Keywords [tab:lex.key]
alignas constinit false public true
alignof const_cast float register try
asm continue for reinterpret_cast typedef
auto co_await friend requires typeid
bool co_return goto return typename
break co_yield if short union
case decltype inline signed unsigned
catch default int sizeof using
char delete long static virtual
char8_t do mutable static_assert void
char16_t double namespace static_cast volatile
char32_t dynamic_cast new struct wchar_t
class else noexcept switch while
concept enum nullptr template
const explicit operator this
consteval export private thread_local
constexpr extern protected throw
2 Furthermore, the alternative representations shown in Table 6 for certain operators and punctuators ([lex.digraph]) are reserved and shall not be used otherwise.
Table 6: Alternative representations [tab:lex.key.digraph]
and and_eq bitand bitor compl not
not_eq or or_eq xor xor_eq

5.12 Operators and punctuators [lex.operators]

1 The lexical representation of C++ programs includes a number of preprocessing tokens that are used in the syntax of the preprocessor or are converted into tokens for operators and punctuators:

preprocessing-operator: one of

# ##  %:  %:%:

operator-or-punctuator: one of

{ } [ ] ( )
<:  :> <%  %> ; : ...
?  :: . .* -> ->* ~
! + - * /  % ^ & |
= += -= *= /=  %= ^= &= |=
==  != < > <= >= <=> && ||
<< >> <<= >>= ++ -- ,
and or xor not bitand bitor compl
and_eq or_eq xor_eq not_eq

Each operator-or-punctuator is converted to a single token in translation phase 7.

5.13.1 Kinds of literals [lex.literal.kinds]

1 There are several kinds of literals.[cpp20 11]

5.13.2 Integer literals [lex.icon]

binary-digit: one of

0 1

octal-digit: one of

0 1 2 3 4 5 6 7

nonzero-digit: one of

1 2 3 4 5 6 7 8 9

hexadecimal-prefix: one of

0x 0X

hexadecimal-digit: one of

0 1 2 3 4 5 6 7 8 9
a b c d e f
A B C D E F

unsigned-suffix: one of

u U

long-suffix: one of

l L

long-long-suffix: one of

ll LL
1 In an integer-literal, the sequence of binary-digits, octal-digits, digits, or hexadecimal-digits is interpreted as a base N integer as shown in table Table 7; the lexically first digit of the sequence of digits is the most significant.

NoteThe prefix and any optional separating single quotes are ignored when determining the value.

Table 7: Base of integer-literals[tab:lex.icon.base]
Kind of integer-literal base N
binary-literal 2
octal-literal 8
decimal-literal 10
hexadecimal-literal 16
2 The hexadecimal-digits

a through f and A through F have decimal values ten through fifteen.

ExampleThe number twelve can be written 12, 014, 0XC, or 0b1100. The integer-literals 1048576, 1'048'576, 0X100000, 0x10'0000, and 0'004'000'000 all have the same value.
3 The type of an integer-literal is the first type in the list in Table 8 corresponding to its optional integer-suffix in which its value can be represented. An integer-literal is a prvalue.
Table 8: Types of integer-literals[tab:lex.icon.type]
integer-suffix decimal-literal integer-literal other than decimal-literal
none int int
long int unsigned int
long long int long int
unsigned long int
long long int
unsigned long long int
u or U unsigned int unsigned int
unsigned long int unsigned long int
unsigned long long int unsigned long long int
l or L long int long int
long long int unsigned long int
long long int
unsigned long long int
Both u or U unsigned long int unsigned long int
and l or L unsigned long long int unsigned long long int
ll or LL long long int long long int
unsigned long long int
Both u or U unsigned long long int unsigned long long int
and ll or LL
4 If an integer-literal cannot be represented by any type in its list and an extended integer type ([basic.fundamental]) can represent its value, it may have that extended integer type. If all of the types in the list for the integer-literal are signed, the extended integer type shall be signed. If all of the types in the list for the integer-literal are unsigned, the extended integer type shall be unsigned. If the list contains both signed and unsigned types, the extended integer type may be signed or unsigned. A program is ill-formed if one of its translation units contains an integer-literal that cannot be represented by any of the allowed types.

5.13.3 Character literals [lex.ccon]

encoding-prefix: one of

u8  u  U  L

c-char:

any member of the basic source character set except the single-quote ', backslash \, or new-line character
escape-sequence
universal-character-name

simple-escape-sequence: one of

\'  \"  \?  \\
\a  \b  \f  \n  \r  \t  \v
1 A character-literal that does not begin with u8, u, U, or L is an ordinary character literal. An ordinary character literal that contains a single c-char representable in the execution character set has type char, with value equal to the numerical value of the encoding of the c-char in the execution character set. An ordinary character literal that contains more than one c-char is a multicharacter literal. A multicharacter literal, or an ordinary character literal containing a single c-char not representable in the execution character set, is conditionally-supported, has type int, and has an implementation-defined value.
2 A character-literal that begins with u8, such as u8'w', is a character-literal of type char8_t, known as a UTF-8 character literal. The value of a UTF-8 character literal is equal to its ISO/IEC 10646 code point value, provided that the code point value can be encoded as a single UTF-8 code unit.

NoteThat is, provided the code point value is in the range [0,7F] (hexadecimal). If the value is not representable with a single UTF-8 code unit, the program is ill-formed. A UTF-8 character literal containing multiple c-chars is ill-formed.

3 A character-literal that begins with the letter u, such as u'x', is a character-literal of type char16_t, known as a UTF-16 character literal. The value of a UTF-16 character literal is equal to its ISO/IEC 10646 code point value, provided that the code point value is representable with a single 16-bit code unit.

NoteThat is, provided the code point value is in the range [0,FFFF] (hexadecimal). If the value is not representable with a single 16-bit code unit, the program is ill-formed. A UTF-16 character literal containing multiple c-chars is ill-formed.

4 A character-literal that begins with the letter U, such as U'y', is a character-literal of type char32_t, known as a UTF-32 character literal. The value of a UTF-32 character literal containing a single c-char is equal to its ISO/IEC 10646 code point value. A UTF-32 character literal containing multiple c-chars is ill-formed.
5 A character-literal that begins with the letter L, such as L'z', is a wide-character literal. A wide-character literal has type wchar_t.[cpp20 12] The value of a wide-character literal containing a single c-char has value equal to the numerical value of the encoding of the c-char in the execution wide-character set, unless the c-char has no representation in the execution wide-character set, in which case the value is implementation-defined.

NoteThe type wchar_t is able to represent all members of the execution wide-character set (see [basic.fundamental]). The value of a wide-character literal containing multiple c-chars is implementation-defined.

6 Certain non-graphic characters, the single quote ', the double quote ", the question mark ?,[cpp20 13] and the backslash \, can be represented according to Table 9. The double quote " and the question mark ?, can be represented as themselves or by the escape sequences \" and \? respectively, but the single quote ' and the backslash \ shall be represented by the escape sequences \' and \\ respectively. Escape sequences in which the character following the backslash is not listed in Table 9 are conditionally-supported, with implementation-defined semantics. An escape sequence specifies a single character.
Table 9: Escape sequences [tab:lex.ccon.esc]
new-line NL(LF) \n
horizontal tab HT \t
vertical tab VT \v
backspace BS \b
carriage return CR \r
form feed FF \f
alert BEL \a
backslash \ \\
question mark ? \?
single quote ' \'
double quote " \"
octal number ooo \ooo
hex number hhh \xhhh
7 The escape \ooo consists of the backslash followed by one, two, or three octal digits that are taken to specify the value of the desired character. The escape \xhhh consists of the backslash followed by x followed by one or more hexadecimal digits that are taken to specify the value of the desired character. There is no limit to the number of digits in a hexadecimal sequence. A sequence of octal or hexadecimal digits is terminated by the first character that is not an octal digit or a hexadecimal digit, respectively. The value of a character-literal is implementation-defined if it falls outside of the implementation-defined range defined for char (for character-literals with no prefix) or wchar_t (for character-literals prefixed by L).

NoteIf the value of a character-literal prefixed by u, u8, or U is outside the range defined for its type, the program is ill-formed.

8 A universal-character-name is translated to the encoding, in the appropriate execution character set, of the character named. If there is no such encoding, the universal-character-name is translated to an implementation-defined encoding.

NoteIn translation phase 1, a universal-character-name is introduced whenever an actual extended character is encountered in the source text. Therefore, all extended characters are described in terms of universal-character-names. However, the actual compiler implementation can use its own native character set, so long as the same results are obtained.

5.13.4 Floating-point literals [lex.fcon]

sign: one of

+ -

floating-point-suffix: one of

f l F L
1 The type of a floating-point-literal is determined by its floating-point-suffix as specified in Table 10.
Table 10: Types of floating-point-literals[tab:lex.fcon.type]
floating-point-suffix type
none double
f or F float
l or L long double
2 The significand of a floating-point-literal is the fractional-constant or digit-sequence of a decimal-floating-point-literal or the hexadecimal-fractional-constant or hexadecimal-digit-sequence of a hexadecimal-floating-point-literal. In the significand, the sequence of digits or hexadecimal-digits and optional period are interpreted as a base N real number s, where N is 10 for a decimal-floating-point-literal and 16 for a hexadecimal-floating-point-literal.

NoteAny optional separating single quotes are ignored when determining the value. If an exponent-part or binary-exponent-part is present, the exponent e of the floating-point-literal is the result of interpreting the sequence of an optional sign and the digits as a base 10 integer. Otherwise, the exponent e is 0. The scaled value of the literal is s×10e for a decimal-floating-point-literal and s×2e for a hexadecimal-floating-point-literal.

ExampleThe floating-point-literals

49.625 and 0xC.68p+2 have the same value. The floating-point-literals

1.602'176'565e-19 and 1.602176565e-19 have the same value.
3 If the scaled value is not in the range of representable values for its type, the program is ill-formed. Otherwise, the value of a floating-point-literal is the scaled value if representable, else the larger or smaller representable value nearest the scaled value, chosen in an implementation-defined manner.

5.13.5 String literals [lex.string]

s-char:

any member of the basic source character set except the double-quote ", backslash \, or new-line character
escape-sequence
universal-character-name

r-char:

any member of the source character set, except a right parenthesis ) followed by
   the initial d-char-sequence (which may be empty) followed by a double quote ".

d-char:

any member of the basic source character set except:
   space, the left parenthesis (, the right parenthesis ), the backslash \, and the control characters
   representing horizontal tab, vertical tab, form feed, and newline.
1 A string-literal that has an R

in the prefix is a raw string literal. The d-char-sequence serves as a delimiter. The terminating d-char-sequence of a raw-string is the same sequence of characters as the initial d-char-sequence. A d-char-sequence shall consist of at most 16 characters.

2 NoteThe characters '(' and ')' are permitted in a raw-string. Thus, R"delimiter((a|b))delimiter" is equivalent to "(a|b)".
3
NoteA source-file new-line in a raw string literal results in a new-line in the resulting execution string literal. Assuming no whitespace at the beginning of lines in the following example, the assert will succeed:
const char* p = R"(a\
b
c)";
assert(std::strcmp(p, "a\\\nb\nc") == 0);
4
ExampleThe raw string
R"a(
)\
a"
)a"
is equivalent to "\n)\\\na\"\n". The raw string
R"(x = "\"y\"")"
is equivalent to "x = \"\\\"y\\\"\"".
5 After translation phase 6, a string-literal that does not begin with an encoding-prefix is an ordinary string literal. An ordinary string literal has type “array of n const char” where n is the size of the string as defined below, has static storage duration ([basic.stc]), and is initialized with the given characters.
6 A string-literal that begins with u8, such as u8"asdf", is a UTF-8 string literal. A UTF-8 string literal has type “array of n const char8_t”, where n is the size of the string as defined below; each successive element of the object representation ([basic.types]) has the value of the corresponding code unit of the UTF-8 encoding of the string.
7 Ordinary string literals and UTF-8 string literals are also referred to as narrow string literals.
8 A string-literal that begins with u, such as u"asdf", is a UTF-16 string literal. A UTF-16 string literal has type “array of n const char16_t”, where n is the size of the string as defined below; each successive element of the array has the value of the corresponding code unit of the UTF-16 encoding of the string.

NoteA single c-char may produce more than one char16_t character in the form of surrogate pairs. A surrogate pair is a representation for a single code point as a sequence of two 16-bit code units.

9 A string-literal that begins with U, such as U"asdf", is a UTF-32 string literal. A UTF-32 string literal has type “array of n const char32_t”, where n is the size of the string as defined below; each successive element of the array has the value of the corresponding code unit of the UTF-32 encoding of the string.
10 A string-literal that begins with L, such as L"asdf", is a wide string literal. A wide string literal has type “array of n const

wchar_t”, where n is the size of the string as defined below; it is initialized with the given characters.

11 In translation phase 6 ([lex.phases]), adjacent string-literals are concatenated. If both string-literals have the same encoding-prefix, the resulting concatenated string-literal has that encoding-prefix. If one string-literal has no encoding-prefix, it is treated as a string-literal of the same encoding-prefix as the other operand. If a UTF-8 string literal token is adjacent to a wide string literal token, the program is ill-formed. Any other concatenations are conditionally-supported with implementation-defined behavior.

NoteThis concatenation is an interpretation, not a conversion. Because the interpretation happens in translation phase 6 (after each character from a string-literal has been translated into a value from the appropriate character set), a string-literal's initial rawness has no effect on the interpretation or well-formedness of the concatenation. Table 11 has some examples of valid concatenations.

Table 11: String literal concatenations [tab:lex.string.concat]
Source Means Source Means Source Means
u"a" u"b" u"ab" U"a" U"b" U"ab" L"a" L"b" L"ab"
u"a" "b" u"ab" U"a" "b" U"ab" L"a" "b" L"ab"
"a" u"b" u"ab" "a" U"b" U"ab" "a" L"b" L"ab"

Characters in concatenated strings are kept distinct.

Example
"\xA" "B"
contains the two characters '\xA' and 'B' after concatenation (and not the single hexadecimal character '\xAB').
12 After any necessary concatenation, in translation phase 7 ([lex.phases]), '\0' is appended to every string-literal so that programs that scan a string can find its end.
13 Escape sequences and universal-character-names in non-raw string literals have the same meaning as in character-literals ([lex.ccon]), except that the single quote ' is representable either by itself or by the escape sequence \', and the double quote " shall be preceded by a \, and except that a universal-character-name in a UTF-16 string literal may yield a surrogate pair. In a narrow string literal, a universal-character-name may map to more than one char or char8_t element due to multibyte encoding. The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U'\0' or L'\0'. The size of a UTF-16 string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for each character requiring a surrogate pair, plus one for the terminating u'\0'.

NoteThe size of a char16_t string literal is the number of code units, not the number of characters.

NoteAny universal-character-names are required to correspond to a code point in the range [0, D800) or [E000, 10FFFF] (hexadecimal) ([lex.charset]). The size of a narrow string literal is the total number of escape sequences and other characters, plus at least one for the multibyte encoding of each universal-character-name, plus one for the terminating '\0'.

14 Evaluating a string-literal results in a string literal object with static storage duration, initialized from the given characters as specified above. Whether all string-literals are distinct (that is, are stored in nonoverlapping objects) and whether successive evaluations of a string-literal yield the same or a different object is unspecified.

NoteThe effect of attempting to modify a string-literal is undefined.

5.13.6 Boolean literals [lex.bool]

boolean-literal:

false
true
1 The Boolean literals are the keywords false and true. Such literals are prvalues and have type bool.

5.13.7 Pointer literals [lex.nullptr]

1 The pointer literal is the keyword nullptr. It is a prvalue of type std​::​nullptr_t.

Notestd​::​nullptr_t is a distinct type that is neither a pointer type nor a pointer-to-member type; rather, a prvalue of this type is a null pointer constant and can be converted to a null pointer value or null member pointer value. See [conv.ptr] and [conv.mem].

5.13.8 User-defined literals [lex.ext]

1 The pointer literal is the keyword nullptr. It is a prvalue of type std​::​nullptr_t.

Notestd​::​nullptr_t is a distinct type that is neither a pointer type nor a pointer-to-member type; rather, a prvalue of this type is a null pointer constant and can be converted to a null pointer value or null member pointer value. See [conv.ptr] and [conv.mem].


  1. Implementations behave as if these separate phases occur, although in practice different phases can be folded together. 
  2. A partial preprocessing token would arise from a source file ending in the first portion of a multi-character token that requires a terminating sequence of characters, such as a header-name that is missing the closing " or >. A partial comment would arise from a source file ending with an unclosed /* comment. 
  3. An implementation need not convert all non-corresponding source characters to the same execution character. 
  4. The glyphs for the members of the basic source character set are intended to identify characters from the subset of ISO/IEC 10646 which corresponds to the ASCII character set. However, the mapping from source file characters to the source character set (described in translation phase 1) is specified as implementation-defined, and therefore implementations must document how the basic source characters are represented in source files. 
  5. A sequence of characters resembling a universal-character-name in an r-char-sequence ([lex.string]) does not form a universal-character-name. 
  6. These include “digraphs” and additional reserved words. The term “digraph” (token consisting of two characters) is not perfectly descriptive, since one of the alternative preprocessing-tokens is %:%: and of course several primary tokens contain two characters. Nonetheless, those alternative tokens that aren't lexical keywords are colloquially known as “digraphs”. 
  7. Thus the “stringized” values ([cpp.stringize]) of [ and <: will be different, maintaining the source spelling, but the tokens can otherwise be freely interchanged. 
  8. Literals include strings and character and numeric literals. 
  9. Thus, a sequence of characters that resembles an escape sequence might result in an error, be interpreted as the character corresponding to the escape sequence, or have a completely different meaning, depending on the implementation. 
  10. On systems in which linkers cannot accept extended characters, an encoding of the universal-character-name can be used in forming valid external identifiers. For example, some otherwise unused character or sequence of characters can be used to encode the \u in a universal-character-name. Extended characters can produce a long external identifier, but C++ does not place a translation limit on significant characters for external identifiers. In C++, upper- and lower-case letters are considered different for all identifiers, including external identifiers. 
  11. The term “literal” generally designates, in this document, those tokens that are called “constants” in ISO C. 
  12. They are intended for character sets where a character does not fit into a single byte. 
  13. Using an escape sequence for a question mark is supported for compatibility with ISO C++ 2014 and ISO C. 


[edit]

Source: https://timsong-cpp.github.io/cppwp/n4950/lex

1 Scope [intro.scope]
2 Normative references [intro.refs]
3 Terms and definitions [intro.defs]
4 General principles [intro]
5 Lexical conventions [lex]
5.1 Separate translation [lex.separate]
5.2 Phases of translation [lex.phases]
5.3 Character sets [lex.charset]
5.4 Preprocessing tokens [lex.pptoken]
5.5 Alternative tokens [lex.digraph]
5.6 Tokens [lex.token]
5.7 Comments [lex.comment]
5.8 Header names [lex.header]
5.9 Preprocessing numbers [lex.ppnumber]
5.10 Identifiers [lex.name]
5.11 Keywords [lex.key]
5.12 Operators and punctuators [lex.operators]
5.13 Literals [lex.literal]
6 Basics [basic]
7 Expressions [expr]
8 Statements [stmt.stmt]
9 Declarations [dcl.dcl]
10 Modules [module]
11 Classes [class]
12 Overloading [over]
13 Templates [temp]
14 Exception handling [except]
15 Preprocessing directives [cpp]
16 Library introduction [library]
17 Language support library [support]
18 Concepts library [concepts]
19 Diagnostics library [diagnostics]
20 Memory management library [mem]
21 Metaprogramming library [meta]
22 General utilities library [utilities]
23 Strings library [strings]
24 Containers library [containers]
25 Iterators library [iterators]
26 Ranges library [ranges]
27 Algorithms library [algorithms]
28 Numerics library [numerics]
29 Time library [time]
30 Localization library [localization]
31 Input/output library [input.output]
32 Regular expressions library [re]
33 Concurrency support library [thread]
Annex A (informative) Grammar summary [gram]
Annex B (normative) Implementation quantities [implimits]
Annex C (informative) Compatibility [diff]
Annex D (normative) Compatibility features [depr]
Annex E (informative) Conformance with UAX #31 [uaxid]
Bibliography [bibliography]
Index [generalindex]
Index of grammar productions [grammarindex]
Index of library headers [headerindex]
Index of library names [libraryindex]
Index of library concepts [conceptindex]
Index of implementation-defined behavior [impldefindex]

5.1 Separate translation [lex.separate]

1 The text of the program is kept in units called source files in this document. A source file together with all the headers and source files included via the preprocessing directive #include, less any source lines skipped by any of the conditional inclusion preprocessing directives, is called a preprocessing translation unit.

NoteA C++ program need not all be translated at the same time.

2 NotePreviously translated translation units and instantiation units can be preserved individually or in libraries. The separate translation units of a program communicate ([basic.link]) by (for example) calls to functions whose identifiers have external or module linkage, manipulation of objects whose identifiers have external or module linkage, or manipulation of data files. Translation units can be separately translated and then later linked to produce an executable program.

5.2 Phases of translation [lex.phases]

1 The precedence among the syntax rules of translation is specified by the following phases.[cpp23 1]
  1. An implementation shall support input files that are a sequence of UTF-8 code units (UTF-8 files). It may also support an implementation-defined set of other kinds of input files, and, if so, the kind of an input file is determined in an implementation-defined manner that includes a means of designating input files as UTF-8 files, independent of their content. NoteIn other words, recognizing the U+feff byte order mark is not sufficient. If an input file is determined to be a UTF-8 file, then it shall be a well-formed UTF-8 code unit sequence and it is decoded to produce a sequence of Unicode scalar values. A sequence of translation character set elements is then formed by mapping each Unicode scalar value to the corresponding translation character set element. In the resulting sequence, each pair of characters in the input sequence consisting of U+000d carriage return followed by U+000a line feed, as well as each U+000d carriage return not immediately followed by a U+000a line feed, is replaced by a single new-line character. For any other kind of input file supported by the implementation, characters are mapped, in an implementation-defined manner, to a sequence of translation character set elements ([lex.charset]), representing end-of-line indicators as new-line characters.
  2. If the first translation character is U+feff byte order mark, it is deleted. Each sequence of a backslash character (\) immediately followed by zero or more whitespace characters other than new-line followed by a new-line character is deleted, splicing physical source lines to form logical source lines. Only the last backslash on any physical source line shall be eligible for being part of such a splice. Except for splices reverted in a raw string literal, if a splice results in a character sequence that matches the syntax of a universal-character-name, the behavior is undefined. A source file that is not empty and that does not end in a new-line character, or that ends in a splice, shall be processed as if an additional new-line character were appended to the file.
  3. The source file is decomposed into preprocessing tokens ([lex.pptoken]) and sequences of whitespace characters (including comments). A source file shall not end in a partial preprocessing token or in a partial comment.[cpp23 2] Each comment is replaced by one space character. New-line characters are retained. Whether each nonempty sequence of whitespace characters other than new-line is retained or replaced by one space character is unspecified. As characters from the source file are consumed to form the next preprocessing token (i.e., not being consumed as part of a comment or other forms of whitespace), except when matching a c-char-sequence, s-char-sequence, r-char-sequence, h-char-sequence, or q-char-sequence, universal-character-names are recognized and replaced by the designated element of the translation character set. The process of dividing a source file's characters into preprocessing tokens is context-dependent.
    Example See the handling of < within a #include preprocessing directive.
  4. Preprocessing directives are executed, macro invocations are expanded, and _Pragma unary operator expressions are executed. A #include preprocessing directive causes the named header or source file to be processed from phase 1 through phase 4, recursively. All preprocessing directives are then deleted.
  5. For a sequence of two or more adjacent string-literal tokens, a common encoding-prefix is determined as specified in [lex.string]. Each such string-literal token is then considered to have that common encoding-prefix.
  6. Adjacent string-literal tokens are concatenated ([lex.string]).
  7. Whitespace characters separating tokens are no longer significant. Each preprocessing token is converted into a token ([lex.token]). The resulting tokens constitute a translation unit and are syntactically and semantically analyzed and translated. NoteThe process of analyzing and translating the tokens can occasionally result in one token being replaced by a sequence of other tokens ([temp.names]). It is implementation-defined whether the sources for module units and header units on which the current translation unit has an interface dependency ([module.unit], [module.import]) are required to be available. NoteSource files, translation units and translated translation units need not necessarily be stored as files, nor need there be any one-to-one correspondence between these entities and any external representation. The description is conceptual only, and does not specify any particular implementation.
  8. Translated translation units and instantiation units are combined as follows: NoteSome or all of these can be supplied from a library. Each translated translation unit is examined to produce a list of required instantiations. NoteThis can include instantiations which have been explicitly requested ([temp.explicit]). The definitions of the required templates are located. It is implementation-defined whether the source of the translation units containing these definitions is required to be available. NoteAn implementation can choose to encode sufficient information into the translated translation unit so as to ensure the source is not required here. All the required instantiations are performed to produce instantiation units. NoteThese are similar to translated translation units, but contain no references to uninstantiated templates and no template definitions. The program is ill-formed if any instantiation fails.
  9. All external entity references are resolved. Library components are linked to satisfy external references to entities not defined in the current translation. All such translator output is collected into a program image which contains information needed for execution in its execution environment.

5.3 Character sets [lex.charset]

1 The translation character set consists of the following elements:
(1.1)
  • each abstract character assigned a code point in the Unicode codespace as specified in the Unicode Standard, and

(1.2)
  • a distinct character for each Unicode scalar value not assigned to an abstract character.

    NoteUnicode code points are integers in the range [0, 10FFFF] (hexadecimal). A surrogate code point is a value in the range [D800, DFFF] (hexadecimal). A Unicode scalar value is any code point that is not a surrogate code point.

2 The basic character set is a subset of the translation character set, consisting of 96 characters as specified in Table 1.

NoteUnicode short names are given only as a means to identifying the character; the numerical value has no other meaning in this context.

Table 1: Basic character set [tab:lex.charset.basic]
character glyph
U+0009 character tabulation
U+000b line tabulation
U+000c form feed
U+0020 space
U+000a line feed new-line
U+0021 exclamation mark !
U+0022 quotation mark "
U+0023 number sign #
U+0025 percent sign %
U+0026 ampersand &
U+0027 apostrophe '
U+0028 left parenthesis (
U+0029 right parenthesis )
U+002a asterisk *
U+002b plus sign +
U+002c comma ,
U+002d hyphen-minus -
U+002e full stop .
U+002f solidus /
U+0030 .. U+0039 digit zero .. nine 0 1 2 3 4 5 6 7 8 9
U+003a colon :
U+003b semicolon ;
U+003c less-than sign <
U+003d equals sign =
U+003e greater-than sign >
U+003f question mark ?
U+0041 .. U+005a latin capital letter a .. z A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z
U+005b left square bracket [
U+005c reverse solidus \
U+005d right square bracket ]
U+005e circumflex accent ^
U+005f low line _
U+0061 .. U+007a latin small letter a .. z a b c d e f g h i j k l m
n o p q r s t u v w x y z
U+007b left curly bracket {
U+007c vertical line |
U+007d right curly bracket }
U+007e tilde ~
3 The universal-character-name construct provides a way to name other characters.
Syntax (BNF)

n-char: one of

any member of the translation character set except the U+007d right curly bracket or new-line character
4 A universal-character-name of the form \u hex-quad, \U hex-quad hex-quad, or \u{simple-hexadecimal-digit-sequence} designates the character in the translation character set whose Unicode scalar value is the hexadecimal number represented by the sequence of hexadecimal-digits in the universal-character-name. The program is ill-formed if that number is not a Unicode scalar value.
5 A universal-character-name that is a named-universal-character designates the corresponding character in the Unicode Standard (chapter 4.8 Name) if the n-char-sequence is equal to its character name or to one of its character name aliases of type “control”, “correction”, or “alternate”; otherwise, the program is ill-formed.

NoteThese aliases are listed in the Unicode Character Database's NameAliases.txt. None of these names or aliases have leading or trailing spaces.

6 If a universal-character-name outside the c-char-sequence, s-char-sequence, or r-char-sequence of a character-literal or string-literal (in either case, including within a user-defined-literal) corresponds to a control character or to a character in the basic character set, the program is ill-formed.

NoteA sequence of characters resembling a universal-character-name in an r-char-sequence ([lex.string]) does not form a universal-character-name.

7 The basic literal character set consists of all characters of the basic character set, plus the control characters specified in Table 2.
Table 2: Additional control characters in the basic literal character set [tab:lex.charset.literal]
character
U+0000 null
U+0007 alert
U+0008 backspace
U+000d carriage return
8 A code unit is an integer value of character type ([basic.fundamental]). Characters in a character-literal other than a multicharacter or non-encodable character literal or in a string-literal are encoded as a sequence of one or more code units, as determined by the encoding-prefix ([lex.ccon], [lex.string]); this is termed the respective literal encoding. The ordinary literal encoding is the encoding applied to an ordinary character or string literal. The wide literal encoding is the encoding applied to a wide character or string literal.
9 A literal encoding or a locale-specific encoding of one of the execution character sets ([character.seq]) encodes each element of the basic literal character set as a single code unit with non-negative value, distinct from the code unit for any other such element.

NoteA character not in the basic literal character set can be encoded with more than one code unit; the value of such a code unit can be the same as that of a code unit for an element of the basic literal character set. The U+0000 null character is encoded as the value 0. No other element of the translation character set is encoded with a code unit of value 0. The code unit value of each decimal digit character after the digit 0 (U+0030) shall be one greater than the value of the previous. The ordinary and wide literal encodings are otherwise implementation-defined. For a UTF-8, UTF-16, or UTF-32 literal, the implementation shall encode the Unicode scalar value corresponding to each character of the translation character set as specified in the Unicode Standard for the respective Unicode encoding form.

5.4 Preprocessing tokens [lex.pptoken]

preprocessing-token:

header-name
import-keyword
module-keyword
export-keyword
identifier
pp-number
character-literal
user-defined-character-literal
string-literal
user-defined-string-literal
preprocessing-op-or-punc
each non-whitespace character that cannot be one of the above
1 Each preprocessing token that is converted to a token shall have the lexical form of a keyword, an identifier, a literal, or an operator or punctuator.
2 A preprocessing token is the minimal lexical element of the language in translation phases 3 through 6. In this document, glyphs are used to identify elements of the basic character set ([lex.charset]). The categories of preprocessing token are: header names, placeholder tokens produced by preprocessing import and module directives (import-keyword, module-keyword, and export-keyword), identifiers, preprocessing numbers, character literals (including user-defined character literals), string literals (including user-defined string literals), preprocessing operators and punctuators, and single non-whitespace characters that do not lexically match the other preprocessing token categories. If a U+0027 apostrophe or a U+0022 quotation mark character matches the last category, the behavior is undefined. If any character not in the basic character set matches the last category, the program is ill-formed. Preprocessing tokens can be separated by whitespace; this consists of comments ([lex.comment]), or whitespace characters (U+0020 space, U+0009 character tabulation, new-line, U+000b line tabulation, and U+000c form feed), or both. As described in [cpp], in certain circumstances during translation phase 4, whitespace (or the absence thereof) serves as more than preprocessing token separation. Whitespace can appear within a preprocessing token only as part of a header name or between the quotation characters in a character literal or string literal.
3 If the input stream has been parsed into preprocessing tokens up to a given character:
(3.1)
  • If the next character begins a sequence of characters that could be the prefix and initial double quote of a raw string literal, such as R", the next preprocessing token shall be a raw string literal. Between the initial and final double quote characters of the raw string, any transformations performed in phase 2 (line splicing) are reverted; this reversion shall apply before any d-char, r-char, or delimiting parenthesis is identified. The raw string literal is defined as the shortest sequence of characters that matches the raw-string pattern
(3.2)
  • Otherwise, if the next three characters are <​::​ and the subsequent character is neither : nor >, the < is treated as a preprocessing token by itself and not as the first character of the alternative token <:.

(3.3)
  • Otherwise, the next preprocessing token is the longest sequence of characters that could constitute a preprocessing token, even if that would cause further lexical analysis to fail, except that a header-name ([lex.header]) is only formed

(3.3.1)
(3.3.2)
  • within a has-include-expression.

    Example
    #define R "x"
    const char* s = R"y";           // ill-formed raw string, not "x" "y"
    
4 The import-keyword is produced by processing an import directive ([cpp.import]), the module-keyword is produced by preprocessing a module directive ([cpp.module]), and the export-keyword is produced by preprocessing either of the previous two directives.

NoteNone has any observable spelling.

5
Example The program fragment 0xe+foo is parsed as a preprocessing number token (one that is not a valid integer-literal or floating-point-literal token), even though a parse as three preprocessing tokens 0xe, +, and foo can produce a valid expression (for example, if foo is a macro defined as 1). Similarly, the program fragment 1E1 is parsed as a preprocessing number (one that is a valid floating-point-literal token), whether or not E is a macro name.
6
Example The program fragment x+++++y is parsed as x ++ ++ + y, which, if x and y have integral types, violates a constraint on increment operators, even though the parse x ++ + ++ y can yield a correct expression.

5.5 Alternative tokens [lex.digraph]

1 Alternative token representations are provided for some operators and punctuators.[cpp23 3]
2 In all respects of the language, each alternative token behaves the same, respectively, as its primary token, except for its spelling.[cpp23 4] The set of alternative tokens is defined in Table 3.
Table 3: Alternative tokens [tab:lex.digraph]
Alternative Primary Alternative Primary Alternative Primary
<% { and && and_eq &=
%> } bitor | or_eq |=
<: [ or || xor_eq ^=
:> ] xor ^ not !
%: # compl ~ not_eq !=
%:%: ## bitand &

5.6 Tokens [lex.token]

1 There are five kinds of tokens: identifiers, keywords, literals,[cpp23 5] operators, and other separators. Blanks, horizontal and vertical tabs, newlines, formfeeds, and comments (collectively, “whitespace”), as described below, are ignored except as they serve to separate tokens.

NoteSome whitespace is required to separate otherwise adjacent identifiers, keywords, numeric literals, and alternative tokens containing alphabetic characters.

5.7 Comments [lex.comment]

1 The characters /* start a comment, which terminates with the characters */. These comments do not nest. The characters // start a comment, which terminates immediately before the next new-line character. If there is a form-feed or a vertical-tab character in such a comment, only whitespace characters shall appear between it and the new-line that terminates the comment; no diagnostic is required.

NoteThe comment characters //, /*, and */ have no special meaning within a // comment and are treated just like other characters. Similarly, the comment characters // and /* have no special meaning within a /* comment.

5.8 Header names [lex.header]

h-char:

any member of the translation character set except new-line and U+003e greater-than sign

q-char:

any member of the translation character set except new-line and U+0022 quotation mark
1 NoteHeader name preprocessing tokens only appear within a #include preprocessing directive, a __has_include preprocessing expression, or after certain occurrences of an import token (see [lex.pptoken]).

The sequences in both forms of header-names are mapped in an implementation-defined manner to headers or to external source file names as specified in [cpp.include].

2 The appearance of either of the characters ' or \ or of either of the character sequences /* or // in a q-char-sequence or an h-char-sequence is conditionally-supported with implementation-defined semantics, as is the appearance of the character " in an h-char-sequence.[cpp23 6]

5.9 Preprocessing numbers [lex.ppnumber]

1 Preprocessing number tokens lexically include all integer-literal tokens ([lex.icon]) and all floating-point-literal tokens ([lex.fcon]).
2 A preprocessing number does not have a type or a value; it acquires both after a successful conversion to an integer-literal token or a floating-point-literal token.

5.10 Identifiers [lex.name]

identifier-start:

nondigit
an element of the translation character set with the Unicode property XID_Start

identifier-continue:

digit
nondigit
an element of the translation character set with the Unicode property XID_Continue

nondigit: one of

a b c d e f g h i j k l m
n o p q r s t u v w x y z
A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z _

digit: one of

0 1 2 3 4 5 6 7 8 9
1
NoteThe character properties XID_Start and XID_Continue are Derived Core Properties as described by UAX #44 of the Unicode Standard.[cpp23 7]

The program is ill-formed if an identifier does not conform to Normalization Form C as specified in the Unicode Standard. NoteIdentifiers are case-sensitive.

Note[uaxid] compares the requirements of UAX #31 of the Unicode Standard with the C++ rules for identifiers.

NoteIn translation phase 4, identifier also includes those preprocessing-tokens ([lex.pptoken]) differentiated as keywords ([lex.key]) in the later translation phase 7 ([lex.token]).

2 The identifiers in Table 4 have a special meaning when appearing in a certain context. When referred to in the grammar, these identifiers are used explicitly rather than using the identifier grammar production. Unless otherwise specified, any ambiguity as to whether a given identifier has a special meaning is resolved to interpret the token as a regular identifier.
Table 4: Identifiers with special meaning [tab:lex.name.special]
final import module override
3 In addition, some identifiers appearing as a token or preprocessing-token are reserved for use by C++ implementations and shall not be used otherwise; no diagnostic is required.
(3.1)
  • Each identifier that contains a double underscore __ or begins with an underscore followed by an uppercase letter is reserved to the implementation for any use.

(3.2)
  • Each identifier that begins with an underscore is reserved to the implementation for use as a name in the global namespace.

5.11 Keywords [lex.key]

keyword:

any identifier listed in Table 5
import-keyword
module-keyword
export-keyword
1 The identifiers shown in Table 5 are reserved for use as keywords (that is, they are unconditionally treated as keywords in phase 7) except in an attribute-token ([dcl.attr.grammar]).

NoteThe register keyword is unused but is reserved for future use.

Table 5: Keywords [tab:lex.key]
alignas constinit false public true
alignof const_cast float register try
asm continue for reinterpret_cast typedef
auto co_await friend requires typeid
bool co_return goto return typename
break co_yield if short union
case decltype inline signed unsigned
catch default int sizeof using
char delete long static virtual
char8_t do mutable static_assert void
char16_t double namespace static_cast volatile
char32_t dynamic_cast new struct wchar_t
class else noexcept switch while
concept enum nullptr template
const explicit operator this
consteval export private thread_local
constexpr extern protected throw
2 Furthermore, the alternative representations shown in Table 6 for certain operators and punctuators ([lex.digraph]) are reserved and shall not be used otherwise.
Table 6: Alternative representations [tab:lex.key.digraph]
and and_eq bitand bitor compl not
not_eq or or_eq xor xor_eq

5.12 Operators and punctuators [lex.operators]

1 The lexical representation of C++ programs includes a number of preprocessing tokens that are used in the syntax of the preprocessor or are converted into tokens for operators and punctuators:

preprocessing-operator: one of

# ##  %:  %:%:

operator-or-punctuator: one of

{ } [ ] ( )
<:  :> <%  %> ; : ...
?  :: . .* -> ->* ~
! + - * /  % ^ & |
= += -= *= /=  %= ^= &= |=
==  != < > <= >= <=> && ||
<< >> <<= >>= ++ -- ,
and or xor not bitand bitor compl
and_eq or_eq xor_eq not_eq

Each operator-or-punctuator is converted to a single token in translation phase 7.

5.13.1 Kinds of literals [lex.literal.kinds]

1 There are several kinds of literals.[cpp23 8]

NoteWhen appearing as an expression, a literal has a type and a value category ([expr.prim.literal]).

5.13.2 Integer literals [lex.icon]

binary-digit: one of

0 1

octal-digit: one of

0 1 2 3 4 5 6 7

nonzero-digit: one of

1 2 3 4 5 6 7 8 9

hexadecimal-prefix: one of

0x 0X

hexadecimal-digit: one of

0 1 2 3 4 5 6 7 8 9
a b c d e f
A B C D E F

unsigned-suffix: one of

u U

long-suffix: one of

l L

long-long-suffix: one of

ll LL

size-suffix: one of

z Z
1 In an integer-literal, the sequence of binary-digits, octal-digits, digits, or hexadecimal-digits is interpreted as a base N integer as shown in table Table 7; the lexically first digit of the sequence of digits is the most significant.

NoteThe prefix and any optional separating single quotes are ignored when determining the value.

Table 7: Base of integer-literals[tab:lex.icon.base]
Kind of integer-literal base N
binary-literal 2
octal-literal 8
decimal-literal 10
hexadecimal-literal 16
2 The hexadecimal-digits

a through f and A through F have decimal values ten through fifteen.

Example The number twelve can be written 12, 014, 0XC, or 0b1100. The integer-literals 1048576, 1'048'576, 0X100000, 0x10'0000, and 0'004'000'000 all have the same value.
3 The type of an integer-literal is the first type in the list in Table 8 corresponding to its optional integer-suffix in which its value can be represented.
Table 8: Types of integer-literals[tab:lex.icon.type]
integer-suffix decimal-literal integer-literal other than decimal-literal
none int int
long int unsigned int
long long int long int
unsigned long int
long long int
unsigned long long int
u or U unsigned int unsigned int
unsigned long int unsigned long int
unsigned long long int unsigned long long int
l or L long int long int
long long int unsigned long int
long long int
unsigned long long int
Both u or U unsigned long int unsigned long int
and l or L unsigned long long int unsigned long long int
ll or LL long long int long long int
unsigned long long int
Both u or U unsigned long long int unsigned long long int
and ll or LL
z or Z the signed integer type corresponding the signed integer type
  to std​::​size_t ([support.types.layout])   corresponding to std​::​size_t
std​::​size_t
Both u or U std​::​size_t std​::​size_t
and z or Z
4 If an integer-literal cannot be represented by any type in its list and an extended integer type ([basic.fundamental]) can represent its value, it may have that extended integer type. If all of the types in the list for the integer-literal are signed, the extended integer type shall be signed. If all of the types in the list for the integer-literal are unsigned, the extended integer type shall be unsigned. If the list contains both signed and unsigned types, the extended integer type may be signed or unsigned. A program is ill-formed if one of its translation units contains an integer-literal that cannot be represented by any of the allowed types.

5.13.3 Character literals [lex.ccon]

encoding-prefix: one of

u8  u  U  L

basic-c-char:

any member of the translation character set except the U+0027 apostrophe,
   U+005c reverse solidus, or new-line character

simple-escape-sequence-char: one of

' " ? \ a b f n r t v

conditional-escape-sequence-char:

any member of the basic character set that is not an octal-digit, a simple-escape-sequence-char, or the characters N, o, u, U, or x
1 A non-encodable character literal is a character-literal whose c-char-sequence consists of a single c-char that is not a numeric-escape-sequence and that specifies a character that either lacks representation in the literal's associated character encoding or that cannot be encoded as a single code unit. A multicharacter literal is a character-literal whose c-char-sequence consists of more than one c-char. The encoding-prefix of a non-encodable character literal or a multicharacter literal shall be absent. Such character-literals are conditionally-supported.
2 The kind of a character-literal, its type, and its associated character encoding ([lex.charset]) are determined by its encoding-prefix and its c-char-sequence as defined by Table 9. The special cases for non-encodable character literals and multicharacter literals take precedence over the base kind.

NoteThe associated character encoding for ordinary character literals determines encodability, but does not determine the value of non-encodable ordinary character literals or ordinary multicharacter literals. The examples in Table 9 for non-encodable ordinary character literals assume that the specified character lacks representation in the ordinary literal encoding or that encoding the character would require more than one code unit.

Table 9: Character literals [tab:lex.ccon.literal]
Encoding Kind Type Associated char- Example
prefix acter encoding
none ordinary character literal char ordinary 'v'
non-encodable ordinary character literal int literal '\U0001F525'
ordinary multicharacter literal int encoding 'abcd'
L wide character literal wchar_t wide literal L'w'
encoding
u8 UTF-8 character literal char8_t UTF-8 u8'x'
u UTF-16 character literal char16_t UTF-16 u'y'
U UTF-32 character literal char32_t UTF-32 U'z'
3 In translation phase 4, the value of a character-literal is determined using the range of representable values of the character-literal's type in translation phase 7. A non-encodable character literal or a multicharacter literal has an implementation-defined value. The value of any other kind of character-literal is determined as follows:
(3.1)
(3.2)
(3.2.1)
(3.2.2)
  • If v does not exceed the range of representable values of the character-literal's type, then the value is v.

(3.2.3)
  • Otherwise, if the character-literal's encoding-prefix is absent or L, and v does not exceed the range of representable values of the corresponding unsigned type for the underlying type of the character-literal's type, then the value is the unique value of the character-literal's type T that is congruent to v modulo 2N, where N is the width of T.

(3.2.4)
(3.3)
4 The character specified by a simple-escape-sequence is specified in Table 10.

NoteUsing an escape sequence for a question mark is supported for compatibility with ISO C++ 2014 and ISO C.

Table 10: Simple escape sequences [tab:lex.ccon.esc]
character simple-escape-sequence
U+000a line feed \n
U+0009 character tabulation \t
U+000b line tabulation \v
U+0008 backspace \b
U+000d carriage return \r
U+000c form feed \f
U+0007 alert \a
U+005c reverse solidus \\
U+003f question mark \?
U+0027 apostrophe \'
U+0022 quotation mark \"

5.13.4 Floating-point literals [lex.fcon]

sign: one of

+ -

floating-point-suffix: one of

f l f16 f32 f64 f128 bf16 F L F16 F32 F64 F128 BF16
1 The type of a floating-point-literal ([basic.fundamental], [basic.extended.fp]) is determined by its floating-point-suffix as specified in Table 11.

NoteThe floating-point suffixes f16, f32, f64, f128, bf16, F16, F32, F64, F128, and BF16 are conditionally-supported. See [basic.extended.fp].

Table 11: Types of floating-point-literals[tab:lex.fcon.type]
floating-point-suffix type
none double
f or F float
l or L long double
f16 or F16 std::float16_t
f32 or F32 std::float32_t
f64 or F64 std::float64_t
f128 or F128 std::float128_t
bf16 or BF16 std::bfloat16_t
2 The significand of a floating-point-literal is the fractional-constant or digit-sequence of a decimal-floating-point-literal or the hexadecimal-fractional-constant or hexadecimal-digit-sequence of a hexadecimal-floating-point-literal. In the significand, the sequence of digits or hexadecimal-digits and optional period are interpreted as a base N real number s, where N is 10 for a decimal-floating-point-literal and 16 for a hexadecimal-floating-point-literal.

NoteAny optional separating single quotes are ignored when determining the value. If an exponent-part or binary-exponent-part is present, the exponent e of the floating-point-literal is the result of interpreting the sequence of an optional sign and the digits as a base 10 integer. Otherwise, the exponent e is 0. The scaled value of the literal is s×10e for a decimal-floating-point-literal and s×2e for a hexadecimal-floating-point-literal.

Example The floating-point-literals

49.625 and 0xC.68p+2 have the same value. The floating-point-literals

1.602'176'565e-19 and 1.602176565e-19 have the same value.
3 If the scaled value is not in the range of representable values for its type, the program is ill-formed. Otherwise, the value of a floating-point-literal is the scaled value if representable, else the larger or smaller representable value nearest the scaled value, chosen in an implementation-defined manner.

5.13.5 String literals [lex.string]

basic-s-char:

any member of the translation character set except the U+0022 quotation mark,
   U+005c reverse solidus, or new-line character

r-char:

any member of the translation character set, except a U+0029 right parenthesis followed by
   the initial d-char-sequence (which may be empty) followed by a U+0022 quotation mark

d-char:

any member of the basic character set except:
   U+0020 space, U+0028 left parenthesis, U+0029 right parenthesis, U+005c reverse solidus,
   U+0009 character tabulation, U+000b line tabulation, U+000c form feed, and new-line
1 The kind of a string-literal, its type, and its associated character encoding ([lex.charset]) are determined by its encoding prefix and sequence of s-chars or r-chars as defined by Table 12 where n is the number of encoded code units as described below.
Table 12: String literals [tab:lex.string.literal]
Encoding Kind Type Associated Examples
prefix character
encoding
none ordinary string literal array of n
const char
ordinary literal encoding "ordinary string"
R"(ordinary raw string)"
L wide string literal array of n
const wchar_t
wide literal
encoding
L"wide string"
LR"w(wide raw string)w"
u8 UTF-8 string literal array of n
const char8_t
UTF-8 u8"UTF-8 string"
u8R"x(UTF-8 raw string)x"
u UTF-16 string literal array of n
const char16_t
UTF-16 u"UTF-16 string"
uR"y(UTF-16 raw string)y"
U UTF-32 string literal array of n
const char32_t
UTF-32 U"UTF-32 string"
UR"z(UTF-32 raw string)z"
2 A string-literal that has an R

in the prefix is a raw string literal. The d-char-sequence serves as a delimiter. The terminating d-char-sequence of a raw-string is the same sequence of characters as the initial d-char-sequence. A d-char-sequence shall consist of at most 16 characters.

3 NoteThe characters '(' and ')' are permitted in a raw-string. Thus, R"delimiter((a|b))delimiter" is equivalent to "(a|b)".
4
NoteA source-file new-line in a raw string literal results in a new-line in the resulting execution string literal. Assuming no whitespace at the beginning of lines in the following example, the assert will succeed:
const char* p = R"(a\
b
c)";
assert(std::strcmp(p, "a\\\nb\nc") == 0);
5
Example The raw string
R"a(
)\
a"
)a"
is equivalent to "\n)\\\na\"\n". The raw string
R"(x = "\"y\"")"
is equivalent to "x = \"\\\"y\\\"\"".
6 Ordinary string literals and UTF-8 string literals are also referred to as narrow string literals.
7 The common encoding-prefix for a sequence of adjacent string-literals is determined pairwise as follows: If two string-literals have the same encoding-prefix, the common encoding-prefix is that encoding-prefix. If one string-literal has no encoding-prefix, the common encoding-prefix is that of the other string-literal. Any other combinations are ill-formed.

NoteA string-literal's rawness has no effect on the determination of the common encoding-prefix.

8 In translation phase 6 ([lex.phases]), adjacent string-literals are concatenated. The lexical structure and grouping of the contents of the individual string-literals is retained.
Example
"\xA" "B"
represents the code unit '\xA' and the character 'B' after concatenation (and not the single code unit '\xAB'). Similarly,
R"(\u00)" "41"
represents six characters, starting with a backslash and ending with the digit 1 (and not the single character 'A' specified by a universal-character-name). Table 13 has some examples of valid concatenations.
Table 13: String literal concatenations [tab:lex.string.concat]
Source Means Source Means Source Means
u"a" u"b" u"ab" U"a" U"b" U"ab" L"a" L"b" L"ab"
u"a" "b" u"ab" U"a" "b" U"ab" L"a" "b" L"ab"
"a" u"b" u"ab" "a" U"b" U"ab" "a" L"b" L"ab"
9 Evaluating a string-literal results in a string literal object with static storage duration ([basic.stc]). Whether all string-literals are distinct (that is, are stored in nonoverlapping objects) and whether successive evaluations of a string-literal yield the same or a different object is unspecified.

NoteThe effect of attempting to modify a string literal object is undefined.

10 String literal objects are initialized with the sequence of code unit values corresponding to the string-literal's sequence of s-chars (originally from non-raw string literals) and r-chars (originally from raw string literals), plus a terminating U+0000 null character, in order as follows:
(10.1)
  • The sequence of characters denoted by each contiguous sequence of basic-s-chars, r-chars, simple-escape-sequences ([lex.ccon]), and universal-character-names ([lex.charset]) is encoded to a code unit sequence using the string-literal's associated character encoding. If a character lacks representation in the associated character encoding, then the string-literal is conditionally-supported and an implementation-defined code unit sequence is encoded. NoteNo character lacks representation in any Unicode encoding form. When encoding a stateful character encoding, implementations should encode the first such sequence beginning with the initial encoding state and encode subsequent sequences beginning with the final encoding state of the prior sequence. NoteThe encoded code unit sequence can differ from the sequence of code units that would be obtained by encoding each character independently.

(10.2)
(10.2.1)
(10.2.2)
  • If v does not exceed the range of representable values of the string-literal's array element type, then the value is v.

(10.2.3)
  • Otherwise, if the string-literal's encoding-prefix is absent or L, and v does not exceed the range of representable values of the corresponding unsigned type for the underlying type of the string-literal's array element type, then the value is the unique value of the string-literal's array element type T that is congruent to v modulo 2N, where N is the width of T.

(10.2.4)
  • Otherwise, the string-literal is ill-formed.

    When encoding a stateful character encoding, these sequences should have no effect on encoding state.

(10.3)
  • Each conditional-escape-sequence ([lex.ccon]) contributes an implementation-defined code unit sequence. When encoding a stateful character encoding, it is implementation-defined what effect these sequences have on encoding state.

5.13.6 Boolean literals [lex.bool]

boolean-literal:

false
true
1 The Boolean literals are the keywords false and true. Such literals have type bool.

5.13.7 Pointer literals [lex.nullptr]

1 The pointer literal is the keyword nullptr. It has type std​::​nullptr_t.

Notestd​::​nullptr_t is a distinct type that is neither a pointer type nor a pointer-to-member type; rather, a prvalue of this type is a null pointer constant and can be converted to a null pointer value or null member pointer value. See [conv.ptr] and [conv.mem].

5.13.8 User-defined literals [lex.ext]

1 The pointer literal is the keyword nullptr. It has type std​::​nullptr_t.

Notestd​::​nullptr_t is a distinct type that is neither a pointer type nor a pointer-to-member type; rather, a prvalue of this type is a null pointer constant and can be converted to a null pointer value or null member pointer value. See [conv.ptr] and [conv.mem].


  1. Implementations behave as if these separate phases occur, although in practice different phases can be folded together.
  2. A partial preprocessing token would arise from a source file ending in the first portion of a multi-character token that requires a terminating sequence of characters, such as a header-name that is missing the closing " or >. A partial comment would arise from a source file ending with an unclosed /* comment.
  3. These include “digraphs” and additional reserved words. The term “digraph” (token consisting of two characters) is not perfectly descriptive, since one of the alternative preprocessing-tokens is %:%: and of course several primary tokens contain two characters. Nonetheless, those alternative tokens that aren't lexical keywords are colloquially known as “digraphs”.
  4. Thus the “stringized” values ([cpp.stringize]) of [ and <: will be different, maintaining the source spelling, but the tokens can otherwise be freely interchanged.
  5. Literals include strings and character and numeric literals.
  6. Thus, a sequence of characters that resembles an escape sequence can result in an error, be interpreted as the character corresponding to the escape sequence, or have a completely different meaning, depending on the implementation.
  7. On systems in which linkers cannot accept extended characters, an encoding of the universal-character-name can be used in forming valid external identifiers. For example, some otherwise unused character or sequence of characters can be used to encode the \u in a universal-character-name. Extended characters can produce a long external identifier, but C++ does not place a translation limit on significant characters for external identifiers.
  8. The term “literal” generally designates, in this document, those tokens that are called “constants” in ISO C.