For the purpose of this section, POSIX Extended Regular Expression (ERE) syntax is used to describe the production of lexical elements. POSIX regular expressions are chosen because they are vendor neutral. POSIX regular expression semantics differ from PCRE semantics; the latter are widely used in many programming languages even on POSIX platforms, most notably Perl, Python, and PHP, and have been adopted by JavaScript. Care has been taken to ensure the expressions used in this chapter are interpreted identically under both semantics.
Comments in the language begin with 2 forward slashes: //, or 1 hash
sign: #, and span to the end of the line. Another form of comment
begins with /* and ends with */; this form of comment
can span multiple lines.
Comments in the following explanatory code blocks use the same notation as in the actual language.
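The three comment forms can be recognized with a single pattern. The following is a minimal Python sketch, not part of the language definition. The names COMMENT and strip_comments are hypothetical, block comments are assumed not to nest, and string literals are ignored here (a real lexer must not treat // or # inside a string as the start of a comment).

```python
import re

# Line comments (// or #) run to the end of the line; block comments
# (/* ... */) may span lines. Assumption: block comments do not nest.
COMMENT = re.compile(r"//[^\n]*|#[^\n]*|/\*.*?\*/", re.DOTALL)


def strip_comments(source: str) -> str:
    """Replace every comment with one space so adjacent tokens stay apart."""
    return COMMENT.sub(" ", source)
```

Replacing a comment with a space rather than deleting it keeps the tokens on either side of a block comment from fusing into one.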
An identifier has the following production: [_[:alpha:]][_[:alnum:]]*.
A keyword is an identifier that matches one of the following:
// Special Values:
true false null
// Phrases:
return break continue and or _Fallback
// Statements and Declarations:
decl
// Control Flows:
if else elif while do for
// Functions:
subr method this
// Translation Unit Interface:
_Include extern const
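As an illustration of how the identifier production and the keyword list interact, here is a small Python sketch. It assumes the C/POSIX locale, so [:alpha:] is written out as A-Za-z and [:alnum:] as A-Za-z0-9; the helper name classify_word is hypothetical.

```python
import re

# [_[:alpha:]][_[:alnum:]]* with the POSIX classes spelled out.
# Assumption: C/POSIX locale, i.e. ASCII letters and digits only.
IDENTIFIER = re.compile(r"[_A-Za-z][_A-Za-z0-9]*")

# The keyword list, copied from the groups above.
KEYWORDS = frozenset("""
    true false null
    return break continue and or _Fallback
    decl
    if else elif while do for
    subr method this
    _Include extern const
""".split())


def classify_word(text: str) -> str:
    """Return 'keyword', 'identifier', or 'invalid' for a candidate word."""
    if IDENTIFIER.fullmatch(text) is None:
        return "invalid"
    return "keyword" if text in KEYWORDS else "identifier"
```

Matching the identifier production first and then looking the result up in the keyword set mirrors the definition of a keyword as an identifier that appears in the list.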
Decimal integer literals have the following production: [1-9][0-9]*[uU]?.
When the literal has the "u" or "U" suffix, the literal has type ulong; otherwise,
the literal has type long.
Octal integer literals have the following production: 0o?[0-7]*. An octal
literal always has type ulong.
Note: Because newcomers commonly zero-pad a decimal number only to find
that it has become an octal literal, it is recommended that implementations
issue a warning when a number is zero-padded and advise the user to prefix
the literal with 0o when octal is intended. Conversely,
for some functions (e.g. chmod in POSIX), users may intend octal
but forget the zero prefix, producing a decimal literal instead; in
these cases, it is recommended that semantic analysis be performed using syntax
information (if possible) and appropriate warnings be given.
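The first recommendation in the note could be implemented roughly as follows. This is only a sketch in Python: the function name zero_padding_warning and the wording of the message are hypothetical, and the second recommendation (detecting a decimal literal where octal was probably meant) needs semantic context that is not modeled here.

```python
import re


def zero_padding_warning(literal: str):
    """Return a warning string for a zero-padded literal, else None."""
    # A leading 0 followed directly by octal digits is an octal literal
    # written without the explicit 0o prefix.
    if re.fullmatch(r"0[0-7]+", literal):
        return (f"'{literal}' is read as octal; "
                f"write '0o{literal[1:]}' if octal is intended")
    return None
```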
Hexadecimal integer literals have
the following production: 0[xX][0-9a-fA-F]+.
A hexadecimal literal always has type ulong.
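The three integer forms can be told apart by their productions alone. The Python sketch below shows one way to obtain the value and the type name; parse_integer is a hypothetical helper, and no range check is made because the widths of long and ulong are not stated in this section.

```python
import re

DECIMAL = re.compile(r"[1-9][0-9]*[uU]?")
OCTAL = re.compile(r"0o?[0-7]*")
HEXADECIMAL = re.compile(r"0[xX][0-9a-fA-F]+")


def parse_integer(literal: str):
    """Return (value, type_name) for an integer literal."""
    if DECIMAL.fullmatch(literal):
        unsigned = literal[-1] in "uU"
        digits = literal[:-1] if unsigned else literal
        return int(digits, 10), "ulong" if unsigned else "long"
    if OCTAL.fullmatch(literal):
        # Drop the leading 0 and the optional o; a bare "0" leaves no digits.
        digits = literal[2:] if literal[1:2] == "o" else literal[1:]
        return (int(digits, 8) if digits else 0), "ulong"
    if HEXADECIMAL.fullmatch(literal):
        return int(literal[2:], 16), "ulong"
    raise ValueError(f"not an integer literal: {literal!r}")
```

Note that a lone 0 matches the octal production, so under these rules it has type ulong rather than long.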
Radix-64 literals have the following production: 0\\[A-Za-z0-9._]+.
The primary use of radix-64 literals is as option flags to functions, as
bitwise compositions are obscure, and symbolic constants need verbose prefixes
to avoid polluting the global name space. A radix-64 literal always has type ulong.
The characters following the backslash have the same numerical values as those
in the Base 64 Encoding with URL and Filename Safe Alphabet,
except that the minus sign (-) is replaced with a period (.) due to possible
ambiguity with the subtraction operator, and that there are no
padding characters.
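To make the digit values concrete, the following Python sketch decodes a radix-64 literal. The text above fixes the value of each character but not the digit order, so most-significant-digit-first is an assumption of this sketch; parse_radix64 and ALPHABET are hypothetical names.

```python
import re
import string

# URL-safe Base64 alphabet with '-' (value 62) replaced by '.'.
ALPHABET = (string.ascii_uppercase + string.ascii_lowercase
            + string.digits + "._")
RADIX64 = re.compile(r"0\\[A-Za-z0-9._]+")


def parse_radix64(literal: str) -> int:
    """Decode a radix-64 literal such as 0\\BA into an integer."""
    if RADIX64.fullmatch(literal) is None:
        raise ValueError(f"not a radix-64 literal: {literal!r}")
    value = 0
    # Assumption: the first character after the backslash is the most
    # significant digit, and each character contributes 6 bits.
    for ch in literal[2:]:
        value = value * 64 + ALPHABET.index(ch)
    return value
```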
Fraction literals have the following production: [0-9]+\.[0-9]*|\.[0-9]+.
The literal always has type double.
A decimal scientific literal is a fraction literal further suffixed by
a decimal exponent literal with the production: [eE][-+]?[0-9]+. The digits of the
exponent give the power of 10 by which the fraction part is multiplied.
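A short Python sketch of the two decimal floating forms follows. It composes the fraction and exponent productions into one pattern and relies on Python's float() for the conversion, which is a convenience of the sketch and not something the language definition prescribes; parse_decimal_float is a hypothetical name.

```python
import re

FRACTION = r"(?:[0-9]+\.[0-9]*|\.[0-9]+)"
EXPONENT = r"[eE][-+]?[0-9]+"
# A fraction literal, optionally followed by a decimal exponent.
DECIMAL_FLOAT = re.compile(FRACTION + "(?:" + EXPONENT + ")?")


def parse_decimal_float(literal: str) -> float:
    """Convert a fraction or decimal scientific literal to a float."""
    if DECIMAL_FLOAT.fullmatch(literal) is None:
        raise ValueError(f"not a decimal floating literal: {literal!r}")
    return float(literal)
```

Because the exponent is a suffix of a fraction literal, a form such as 1e3 (no radix point) does not match under these productions.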
A hexadecimal fraction literal has the following production:
0[xX]([0-9a-fA-F]+\.[0-9a-fA-F]*|\.[0-9a-fA-F]+). This production on its own is
NOT a valid lexical element in the language,
but a hexadecimal scientific literal is: it is defined as a
hexadecimal fraction literal followed by a hexadecimal exponent literal with the
production: [pP][-+]?[0-9]+. The digits of the exponent give the power
of 2 by which the fraction part is multiplied.
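The value of a hexadecimal scientific literal can be computed as sketched below in Python. The sketch reads the dot in the production as a literal radix point, treats the exponent digits as decimal, and uses the hypothetical name parse_hex_scientific; very long mantissas would lose precision when converted to a float, which is not handled here.

```python
import math
import re

HEX_SCIENTIFIC = re.compile(
    r"0[xX](?:([0-9a-fA-F]+)\.([0-9a-fA-F]*)|\.([0-9a-fA-F]+))"
    r"[pP]([-+]?[0-9]+)"
)


def parse_hex_scientific(literal: str) -> float:
    """Convert a hexadecimal scientific literal such as 0x1.8p1."""
    m = HEX_SCIENTIFIC.fullmatch(literal)
    if m is None:
        raise ValueError(f"not a hexadecimal scientific literal: {literal!r}")
    whole = m.group(1) or ""
    frac = m.group(2) or m.group(3) or ""
    exponent = int(m.group(4))
    # Read all digits as one integer, then shift the radix point back:
    # each hexadecimal fraction digit is worth 4 binary places.
    mantissa = int(whole + frac, 16)
    return math.ldexp(mantissa, exponent - 4 * len(frac))
```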
Character and string literals have the following production:
['"]([^\]|\\(["'abefnrtv]|x[0-9a-fA-F]{2,2}|[0-7]{1,3}))*['"]
In the 2nd subexpression, the single-letter alternatives have the following meanings:
a: BEL, ASCII 'bell' control character
b: BS, ASCII backspace character
e: ESC, ASCII escape character
f: FF, ASCII form-feed character
n: LF, ASCII line-feed character
r: CR, ASCII carriage return character
t: HT, ASCII horizontal tab character
v: VT, ASCII vertical tab character
When single-quoted, the literal is a character literal having the value of the
first character as type long; the behavior is implementation-defined if there
are multiple characters.
When double-quoted, the literal is a string literal having type str.
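The escape alternatives can be decoded as in the following Python sketch. It follows the production as written, so a backslash followed by any other character is left untouched; mapping each escape to a single code point is an assumption, as are the helper names decode_escapes and evaluate_literal. The character-literal branch assumes a non-empty body.

```python
import re

# Values of the single-character escapes from the production.
SIMPLE = {"a": 7, "b": 8, "e": 27, "f": 12, "n": 10, "r": 13,
          "t": 9, "v": 11, '"': 34, "'": 39}
ESCAPE = re.compile(
    r"""\\(?:(["'abefnrtv])|x([0-9a-fA-F]{2})|([0-7]{1,3}))"""
)


def decode_escapes(body: str) -> str:
    """Replace each escape sequence in a literal body with its character."""
    def replace(m):
        simple, hex_digits, octal_digits = m.groups()
        if simple is not None:
            return chr(SIMPLE[simple])
        if hex_digits is not None:
            return chr(int(hex_digits, 16))
        return chr(int(octal_digits, 8))
    return ESCAPE.sub(replace, body)


def evaluate_literal(literal: str):
    """Return an int for a character literal, a str for a string literal."""
    quote, body = literal[0], decode_escapes(literal[1:-1])
    if quote == "'":
        return ord(body[0])  # value of the first character (type long)
    return body  # type str
```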
Raw string literals have the following production:
\\("[^"]*"|'[^']*')
In a raw string literal, there are no escape sequences. Single quotes cannot appear in single-quoted raw string literals, and double quotes cannot appear in double-quoted raw string literals.
Raw string literals are primarily intended for writing regular expressions.
Any number of raw strings and double-quoted strings may be concatenated into one string object by placing them adjacent to one another with nothing in between other than whitespace. The set of whitespace characters is defined to be exactly the following: U+0020 (space), U+000D (carriage return), U+000B (vertical tab), U+000A (line-feed), U+0009 (horizontal tab).
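The adjacency rule can be illustrated with a Python sketch that joins a run of raw string literals. To stay short it handles raw strings only, with a body of any length between the quotes; double-quoted pieces would additionally need their escape sequences decoded. The name concat_raw_strings is hypothetical.

```python
import re

# Exactly the five whitespace characters listed above.
WHITESPACE = " \r\v\n\t"
RAW_STRING = re.compile(r"""\\(?:"([^"]*)"|'([^']*)')""")


def concat_raw_strings(source: str) -> str:
    """Join adjacent raw string literals separated only by whitespace."""
    parts = []
    pos = 0
    while True:
        while pos < len(source) and source[pos] in WHITESPACE:
            pos += 1
        if pos == len(source):
            break
        m = RAW_STRING.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        # Exactly one of the two groups is set, depending on the quote used.
        parts.append(m.group(1) if m.group(1) is not None else m.group(2))
        pos = m.end()
    return "".join(parts)
```

Since raw strings have no escape sequences, a regular expression such as [0-9]+\.[0-9]* can be written between the quotes exactly as it should reach the regex engine.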
A punctuation is one of the following:
( ) [ ] =? . ++ -- + - ~ ! * / %
<< >> >>> < > & ^ |
<= >= == != === !== && || ?? ? :
= *= /= %= += -= <<= >>= >>>= &= ^= |= ,
; { }
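Several entries in this list are prefixes of longer ones (for example >, >>, >>>, and >>>=), so a lexer has to decide how much to consume. The Python sketch below assumes the longest-match rule that is conventional for C-like languages; this section does not state that rule, and the names PUNCTUATION and split_punctuation are hypothetical.

```python
import re

# The list above, copied verbatim and split on whitespace.
PUNCTUATION = """
( ) [ ] =? . ++ -- + - ~ ! * / %
<< >> >>> < > & ^ |
<= >= == != === !== && || ?? ? :
= *= /= %= += -= <<= >>= >>>= &= ^= |= ,
; { }
""".split()

# Longest alternatives first, so that >>>= wins over >>> and >>.
PATTERN = re.compile("|".join(
    re.escape(p) for p in sorted(PUNCTUATION, key=len, reverse=True)
))


def split_punctuation(text: str) -> list:
    """Split a run of punctuation (and whitespace) into tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = PATTERN.match(text, pos)
        if m is None:
            raise ValueError(f"not punctuation at offset {pos}")
        tokens.append(m.group())
        pos = m.end()
    return tokens
```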