Lexical structure (Revised⁷ Report on the Algorithmic Language Scheme)

7.1.1 Lexical structure

This section describes how individual tokens (identifiers, numbers, etc.) are formed from sequences of characters. The following sections describe how expressions and programs are formed from sequences of tokens.

⟨Intertoken space⟩ can occur on either side of any token, but not within a token.

Identifiers that do not begin with a vertical line are terminated by a ⟨delimiter⟩ or by the end of the input. So are dot, numbers, characters, and booleans. Identifiers that begin with a vertical line are terminated by another vertical line.
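As an illustration of the termination rule, the following Python sketch scans a lexeme that does not begin with a vertical line up to the next ⟨delimiter⟩ or the end of the input. The names `DELIMITERS` and `scan_until_delimiter` are hypothetical and not part of the report, and the sketch does not classify the lexeme it returns.

```python
# Delimiter characters per the <delimiter> rule: whitespace, |, (, ), " and ;
# (whitespace here means space, tab, newline and return).
DELIMITERS = set(' \t\n\r|()";')


def scan_until_delimiter(text, start):
    """Return (lexeme, end) for the lexeme beginning at index `start`.

    The lexeme stops at the first delimiter or at the end of the input.
    The delimiter itself is not consumed.
    """
    end = start
    while end < len(text) and text[end] not in DELIMITERS:
        end += 1
    return text[start:end], end
```

For example, scanning `foo(bar)` from index 0 yields the lexeme `foo` and stops at the opening parenthesis.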

The following four characters from the ASCII repertoire are reserved for future extensions to the language: [ ] { }

In addition to the identifier characters of the ASCII repertoire specified below, Scheme implementations may permit any additional repertoire of Unicode characters to be employed in identifiers, provided that each such character has a Unicode general category of Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me, Nd, Nl, No, Pd, Pc, Po, Sc, Sm, Sk, So, or Co, or is U+200C or U+200D (the zero-width non-joiner and joiner, respectively, which are needed for correct spelling in Persian, Hindi, and other languages). However, it is an error for the first character to have a general category of Nd, Mc, or Me. It is also an error to use a non-Unicode character in symbols or identifiers.
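The category test described above can be sketched in Python with the standard `unicodedata` module. The function name `extended_identifier_char_ok` and its `first` parameter are hypothetical. The sketch covers only non-ASCII characters, since ASCII characters are governed by the grammar itself, and its answers depend on the Unicode version shipped with the Python interpreter.

```python
import unicodedata

# General categories that an implementation may permit in identifiers.
ALLOWED_CATEGORIES = {
    "Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Me", "Nd", "Nl", "No",
    "Pd", "Pc", "Po", "Sc", "Sm", "Sk", "So", "Co",
}
# Categories that are an error in the first position of an identifier.
FORBIDDEN_FIRST = {"Nd", "Mc", "Me"}
# Zero-width non-joiner (U+200C) and zero-width joiner (U+200D).
JOINERS = {"\u200c", "\u200d"}


def extended_identifier_char_ok(ch, first=False):
    """Check one non-ASCII character against the category rule."""
    if ord(ch) < 0x80:
        raise ValueError("ASCII characters are governed by the grammar")
    if ch in JOINERS:
        return True
    category = unicodedata.category(ch)
    if first and category in FORBIDDEN_FIRST:
        return False
    return category in ALLOWED_CATEGORIES
```

Under this sketch, `λ` (category Ll) is acceptable in any position, while an Arabic-Indic digit (category Nd) is acceptable only after the first character.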

All Scheme implementations must permit the escape sequence \x⟨hexdigits⟩; to appear in Scheme identifiers that are enclosed in vertical lines. If the character with the given Unicode scalar value is supported by the implementation, identifiers containing such a sequence are equivalent to identifiers containing the corresponding character.
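A minimal Python sketch of the equivalence: each `\x⟨hexdigits⟩;` sequence in the body of a vertical-line identifier is replaced by the character with that scalar value. The name `expand_inline_hex` is hypothetical. Accepting upper-case hex digits is an assumption of the sketch, not something the grammar states, and the mnemonic escapes and `\|` are deliberately left untouched.

```python
import re

# \x, one or more hex digits, then a terminating semicolon.
# Upper-case hex digits are accepted here as a leniency (an assumption).
INLINE_HEX = re.compile(r"\\x([0-9a-fA-F]+);")


def expand_inline_hex(body):
    """Replace every inline hex escape in `body` by its character."""
    def replace(match):
        value = int(match.group(1), 16)
        # Unicode scalar values exclude the surrogate range and stop at 10FFFF.
        if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
            raise ValueError("not a Unicode scalar value: %x" % value)
        return chr(value)
    return INLINE_HEX.sub(replace, body)
```

With this sketch, the bodies of `|H\x65;llo|` and `|Hello|` expand to the same character sequence, so the two would denote the same identifier.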

⟨token⟩ → ⟨identifier⟩ | ⟨boolean⟩ | ⟨number⟩ | ⟨character⟩ | ⟨string⟩ | ( | ) | #( | #u8( | ' | ` | , | ,@ | .

⟨delimiter⟩ → ⟨whitespace⟩ | ⟨vertical line⟩ | ( | ) | " | ;

⟨intraline whitespace⟩ → ⟨space or tab⟩

⟨whitespace⟩ → ⟨intraline whitespace⟩ | ⟨line ending⟩

⟨vertical line⟩ → |

⟨line ending⟩ → ⟨newline⟩ | ⟨return⟩ ⟨newline⟩ | ⟨return⟩

⟨comment⟩ → ; ⟨all subsequent characters up to a line ending⟩ | ⟨nested comment⟩ | #; ⟨intertoken space⟩ ⟨datum⟩

⟨nested comment⟩ → #| ⟨comment text⟩ ⟨comment cont⟩^* |#

⟨comment text⟩ → ⟨character sequence not containing #| or |#⟩

⟨comment cont⟩ → ⟨nested comment⟩ ⟨comment text⟩
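The three nested-comment productions amount to matching `#|` and `|#` as balanced brackets. The following Python sketch, with the hypothetical name `skip_nested_comment`, tracks the nesting depth with a counter instead of recursing through the grammar. It is one possible reading of the rules, not a reference implementation.

```python
def skip_nested_comment(text, start):
    """Return the index just past the nested comment that opens at `start`.

    `text[start:start + 2]` must be '#|'. Inner '#| ... |#' pairs are
    counted, so the comment ends only at the matching closing '|#'.
    """
    if not text.startswith("#|", start):
        raise ValueError("no nested comment at index %d" % start)
    depth = 0
    i = start
    while i < len(text):
        if text.startswith("#|", i):
            depth += 1
            i += 2
        elif text.startswith("|#", i):
            depth -= 1
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    raise ValueError("unterminated nested comment")
```

Applied to `#| a #| b |# c |# rest`, the sketch skips the whole outer comment and leaves ` rest` unread.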

⟨directive⟩ → #!fold-case | #!no-fold-case

Note that it is ungrammatical to follow a ⟨directive⟩ with anything but a ⟨delimiter⟩ or the end of file.

⟨atmosphere⟩ → ⟨whitespace⟩ | ⟨comment⟩ | ⟨directive⟩

⟨intertoken space⟩ → ⟨atmosphere⟩^*

Note that +i, -i and ⟨infnan⟩ below are exceptions to the ⟨peculiar identifier⟩ rule; they are parsed as numbers, not identifiers.

⟨identifier⟩ → ⟨initial⟩ ⟨subsequent⟩^* | ⟨vertical line⟩ ⟨symbol element⟩^* ⟨vertical line⟩ | ⟨peculiar identifier⟩

⟨initial⟩ → ⟨letter⟩ | ⟨special initial⟩

⟨letter⟩ → a | b | c | … | z | A | B | C | … | Z

⟨special initial⟩ → ! | $ | % | & | * | / | : | < | = | > | ? | ^ | _ | ~

⟨subsequent⟩ → ⟨initial⟩ | ⟨digit⟩ | ⟨special subsequent⟩

⟨digit⟩ → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

⟨hex digit⟩ → ⟨digit⟩ | a | b | c | d | e | f

⟨explicit sign⟩ → + | -

⟨special subsequent⟩ → ⟨explicit sign⟩ | . | @

⟨inline hex escape⟩ → \x⟨hex scalar value⟩;

⟨hex scalar value⟩ → ⟨hex digit⟩⁺

⟨mnemonic escape⟩ → \a | \b | \t | \n | \r

⟨peculiar identifier⟩ → ⟨explicit sign⟩ | ⟨explicit sign⟩ ⟨sign subsequent⟩ ⟨subsequent⟩^* | ⟨explicit sign⟩ . ⟨dot subsequent⟩ ⟨subsequent⟩^* | . ⟨dot subsequent⟩ ⟨subsequent⟩^*

⟨dot subsequent⟩ → ⟨sign subsequent⟩ | .

⟨sign subsequent⟩ → ⟨initial⟩ | ⟨explicit sign⟩ | @

⟨symbol element⟩ → ⟨any character other than ⟨vertical line⟩ or \⟩ | ⟨inline hex escape⟩ | ⟨mnemonic escape⟩ | \|
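The identifier productions that do not involve vertical lines can be transcribed into a regular expression. The Python sketch below does so for the ASCII repertoire only. The names `IDENTIFIER` and `is_bare_identifier` are hypothetical. It excludes exactly the lexemes named in the note on ⟨peculiar identifier⟩ (+i, -i and the ⟨infnan⟩ forms), compared case-insensitively on the assumption that the case rule for numbers applies to them.

```python
import re

LETTERS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
SPECIAL_INITIAL = "!$%&*/:<=>?^_~"


def char_class(chars):
    """Build a regex character class matching any one of `chars`."""
    return "[" + re.escape(chars) + "]"


INITIAL = char_class(LETTERS + SPECIAL_INITIAL)
SUBSEQUENT = char_class(LETTERS + SPECIAL_INITIAL + "0123456789" + "+-.@")
SIGN_SUBSEQUENT = char_class(LETTERS + SPECIAL_INITIAL + "+-@")
DOT_SUBSEQUENT = char_class(LETTERS + SPECIAL_INITIAL + "+-@.")

# One alternative per production; the last four are <peculiar identifier>.
IDENTIFIER = re.compile(
    rf"{INITIAL}{SUBSEQUENT}*"
    rf"|[+-]"
    rf"|[+-]{SIGN_SUBSEQUENT}{SUBSEQUENT}*"
    rf"|[+-]\.{DOT_SUBSEQUENT}{SUBSEQUENT}*"
    rf"|\.{DOT_SUBSEQUENT}{SUBSEQUENT}*"
)

# Lexemes that match <peculiar identifier> but are parsed as numbers.
NUMERIC_EXCEPTIONS = {"+i", "-i", "+inf.0", "-inf.0", "+nan.0", "-nan.0"}


def is_bare_identifier(lexeme):
    """True if `lexeme` is an identifier written without vertical lines."""
    if lexeme.lower() in NUMERIC_EXCEPTIONS:
        return False
    return IDENTIFIER.fullmatch(lexeme) is not None
```

Under this sketch `list->vector`, `...`, `+` and `->x` are identifiers, whereas `+i`, `-inf.0`, `1+` and a lone `.` are not.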

⟨boolean⟩ → #t | #f | #true | #false

⟨character⟩ → #\ ⟨any character⟩ | #\ ⟨character name⟩ | #\x⟨hex scalar value⟩

⟨character name⟩ → alarm | backspace | delete | escape | newline | null | return | space | tab
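The three alternatives of ⟨character⟩ can be told apart by what follows `#\`. In the Python sketch below, the function name `decode_character` is hypothetical, and the code points attached to the character names are taken from the report's description of characters rather than from this section. A single character after `#\` is always taken literally, so `#\x` denotes the letter x.

```python
# Character names and the characters they denote.
CHARACTER_NAMES = {
    "alarm": "\a", "backspace": "\b", "delete": "\x7f", "escape": "\x1b",
    "newline": "\n", "null": "\0", "return": "\r", "space": " ", "tab": "\t",
}
HEX_DIGITS = set("0123456789abcdefABCDEF")


def decode_character(lexeme):
    """Return the character denoted by a complete character lexeme."""
    if not lexeme.startswith("#\\") or len(lexeme) < 3:
        raise ValueError("not a character literal: %r" % lexeme)
    rest = lexeme[2:]
    if len(rest) == 1:
        return rest                      # #\<any character>
    if rest in CHARACTER_NAMES:
        return CHARACTER_NAMES[rest]     # #\<character name>
    if rest[0] == "x" and all(c in HEX_DIGITS for c in rest[1:]):
        return chr(int(rest[1:], 16))    # #\x<hex scalar value>
    raise ValueError("unknown character literal: %r" % lexeme)
```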

⟨string⟩ → " ⟨string element⟩^* "

⟨string element⟩ → ⟨any character other than " or \⟩ | ⟨mnemonic escape⟩ | \" | \\ | \⟨intraline whitespace⟩^*⟨line ending⟩ ⟨intraline whitespace⟩^* | ⟨inline hex escape⟩
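To show how the ⟨string element⟩ alternatives combine, the following Python sketch decodes the text between the double quotes of a string literal. The names `STRING_ELEMENT` and `decode_string_body` are hypothetical, and upper-case hex digits are accepted as an assumption. A backslash followed by optional intraline whitespace, a line ending, and more intraline whitespace contributes nothing to the result.

```python
import re

MNEMONIC = {"a": "\a", "b": "\b", "t": "\t", "n": "\n", "r": "\r"}

# One alternative per <string element> form.
STRING_ELEMENT = re.compile(
    r'\\x([0-9a-fA-F]+);'                 # inline hex escape
    r'|\\([abtnr"\\])'                    # mnemonic escape, \" or \\
    r'|\\[ \t]*(?:\r\n|\n|\r)[ \t]*'      # escaped line ending
    r'|([^"\\])'                          # any other character
)


def decode_string_body(body):
    """Decode the text between the quotes of a string literal."""
    out = []
    pos = 0
    while pos < len(body):
        match = STRING_ELEMENT.match(body, pos)
        if match is None:
            raise ValueError("invalid string element at index %d" % pos)
        hex_digits, simple, plain = match.groups()
        if hex_digits is not None:
            out.append(chr(int(hex_digits, 16)))
        elif simple is not None:
            # \" and \\ stand for themselves; the rest are looked up.
            out.append(MNEMONIC.get(simple, simple))
        elif plain is not None:
            out.append(plain)
        # Otherwise the match was an escaped line ending: emit nothing.
        pos = match.end()
    return "".join(out)
```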

⟨bytevector⟩ → #u8(⟨byte⟩^*)

⟨byte⟩ → ⟨any exact integer between 0 and 255⟩

The following rules for ⟨num R⟩, ⟨complex R⟩, ⟨real R⟩, ⟨ureal R⟩, ⟨uinteger R⟩, and ⟨prefix R⟩ are implicitly replicated for R = 2, 8, 10, and 16. There are no rules for ⟨decimal 2⟩, ⟨decimal 8⟩, and ⟨decimal 16⟩, which means that numbers containing decimal points or exponents are always in decimal radix. Although not shown below, all alphabetic characters used in the grammar of numbers can appear in either upper or lower case.

⟨num R⟩ → ⟨prefix R⟩ ⟨complex R⟩

⟨complex R⟩ → ⟨real R⟩ | ⟨real R⟩ @ ⟨real R⟩ | ⟨real R⟩ + ⟨ureal R⟩ i | ⟨real R⟩ - ⟨ureal R⟩ i | ⟨real R⟩ + i | ⟨real R⟩ - i | ⟨real R⟩ ⟨infnan⟩ i | + ⟨ureal R⟩ i | - ⟨ureal R⟩ i | ⟨infnan⟩ i | + i | - i

⟨real R⟩ → ⟨sign⟩ ⟨ureal R⟩ | ⟨infnan⟩

⟨ureal R⟩ → ⟨uinteger R⟩ | ⟨uinteger R⟩ / ⟨uinteger R⟩ | ⟨decimal R⟩

⟨decimal 10⟩ → ⟨uinteger 10⟩ ⟨suffix⟩ | . ⟨digit 10⟩⁺ ⟨suffix⟩ | ⟨digit 10⟩⁺ . ⟨digit 10⟩^* ⟨suffix⟩

⟨uinteger R⟩ → ⟨digit R⟩⁺

⟨prefix R⟩ → ⟨radix R⟩ ⟨exactness⟩ | ⟨exactness⟩ ⟨radix R⟩

⟨infnan⟩ → +inf.0 | -inf.0 | +nan.0 | -nan.0

⟨suffix⟩ → ⟨empty⟩ | ⟨exponent marker⟩ ⟨sign⟩ ⟨digit 10⟩⁺

⟨exponent marker⟩ → e

⟨sign⟩ → ⟨empty⟩ | + | -

⟨exactness⟩ → ⟨empty⟩ | #i | #e

⟨radix 2⟩ → #b

⟨radix 8⟩ → #o

⟨radix 10⟩ → ⟨empty⟩ | #d

⟨radix 16⟩ → #x

⟨digit 2⟩ → 0 | 1

⟨digit 8⟩ → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7

⟨digit 10⟩ → ⟨digit⟩

⟨digit 16⟩ → ⟨digit 10⟩ | a | b | c | d | e | f
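The replication of the number rules over R = 2, 8, 10, and 16 can be sketched by generating one regular expression per radix. In the Python sketch below, the names `number_pattern`, `NUMBER_PATTERNS` and `is_number` are hypothetical. The ⟨decimal R⟩ alternative is emitted only for R = 10, and case-insensitive matching stands in for the rule that alphabetic characters may appear in either case. It is a transcription for illustration, not a validated lexer.

```python
import re

DIGITS = {2: "01", 8: "0-7", 10: "0-9", 16: "0-9a-f"}
RADIX = {2: "#b", 8: "#o", 10: "(?:#d)?", 16: "#x"}
INFNAN = r"[+-](?:inf|nan)\.0"


def number_pattern(radix):
    """Compile a pattern for <num R> with R = `radix`."""
    digit = f"[{DIGITS[radix]}]"
    uinteger = f"{digit}+"
    if radix == 10:
        suffix = r"(?:e[+-]?[0-9]+)?"
        decimal = (rf"(?:[0-9]+\.[0-9]*{suffix}"
                   rf"|\.[0-9]+{suffix}"
                   rf"|[0-9]+{suffix})")
        ureal = f"(?:{uinteger}/{uinteger}|{decimal})"
    else:
        # No <decimal R> rule exists for R = 2, 8 or 16.
        ureal = f"(?:{uinteger}/{uinteger}|{uinteger})"
    real = f"(?:[+-]?{ureal}|{INFNAN})"
    # Imaginary part: signed <ureal R>, a bare sign, or <infnan>, then i.
    imag = f"(?:[+-]{ureal}|[+-]|{INFNAN})i"
    complex_ = f"(?:{real}@{real}|{real}{imag}|{imag}|{real})"
    exactness = "(?:#[ie])?"
    prefix = f"(?:{RADIX[radix]}{exactness}|{exactness}{RADIX[radix]})"
    return re.compile(prefix + complex_, re.IGNORECASE)


NUMBER_PATTERNS = [number_pattern(r) for r in (2, 8, 10, 16)]


def is_number(lexeme):
    """True if `lexeme` matches <num R> for some radix R."""
    return any(p.fullmatch(lexeme) for p in NUMBER_PATTERNS)
```

Under this sketch `#x1F`, `-3/4`, `1e10`, `1+2i` and `+inf.0` are accepted, while `#x1.5` is rejected because decimal points are available only in radix 10.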