stdx.lexer
Summary
This module contains a range-based compile-time lexer generator.
Overview
The lexer generator consists of a template mixin,
Lexer, along with
several helper templates for generating such things as token identifiers.
To write a lexer using this API:
- Create the string array constants for your language.
- Create aliases for the various token and token identifier types
specific to your language.
- Create a struct that mixes in the Lexer template mixin and
implements the necessary functions.
Examples:
- A lexer for D is available here.
- A lexer for Lua is available here.
Template Parameter Definitions
- defaultTokenFunction
- A function that serves as the default token lexing function. For most
languages this will be the identifier lexing function.
- tokenSeparatingFunction
- A function that is able to determine if an identifier/keyword has come
to an end. This function must return bool and take a single size_t
argument representing the number of bytes to skip over before looking for
a separating character. A sketch of such a function for a typical
identifier-based language follows this list.
- staticTokens
- A listing of the tokens whose exact value never changes and which cannot
possibly be a token handled by the default token lexing function. The
most common example of this kind of token is an operator such as
"*", or "-" in a programming language.
- dynamicTokens
- A listing of tokens whose value is variable, such as whitespace,
identifiers, number literals, and string literals.
- possibleDefaultTokens
- A listing of tokens that could possibly be one of the tokens handled by
the default token handling function. A common example of this is
a keyword such as "for", which looks like the beginning of
the identifier "fortunate". tokenSeparatingFunction is
called to determine if the character after the 'r' separates
the identifier, indicating that the token is "for", or if
lexing should be turned over to the defaultTokenFunction.
- tokenHandlers
- A mapping of prefixes to custom token handling function names. The
generated lexer will search for the even-index elements of this array,
and then call the function whose name is the element immediately after the
even-indexed element. This is used for lexing complex tokens whose prefix
is fixed.
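For a language that has identifiers and keywords, the tokenSeparatingFunction is typically a short predicate over the byte at the given offset. The following is only a sketch, assuming ASCII identifier characters; it uses the range field and its canPeek/peekAt helpers, both documented later in this module:
bool isSeparating(size_t offset) pure nothrow @safe
{
    if (!range.canPeek(offset))
        return true; // end of input always separates
    immutable ubyte c = range.peekAt(offset);
    // Any byte that cannot continue an identifier/keyword separates it.
    return !(c == '_'
        || (c >= 'a' && c <= 'z')
        || (c >= 'A' && c <= 'Z')
        || (c >= '0' && c <= '9'));
}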
Here are some example constants for a simple calculator lexer:
enum string[] dynamicTokens = ["numberLiteral", "whitespace"];
enum string[] staticTokens = ["-", "+", "*", "/"];
enum string[] possibleDefaultTokens = [];
enum string[] tokenHandlers = [
"0", "lexNumber",
"1", "lexNumber",
"2", "lexNumber",
"3", "lexNumber",
"4", "lexNumber",
"5", "lexNumber",
"6", "lexNumber",
"7", "lexNumber",
"8", "lexNumber",
"9", "lexNumber",
" ", "lexWhitespace",
"\n", "lexWhitespace",
"\t", "lexWhitespace",
"\r", "lexWhitespace"
];
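Given constants like these, the alias step from the overview amounts to a few lines; the names IdType, Token, and tok below simply mirror the examples that appear later in this documentation:
alias IdType = TokenIdType!(staticTokens, dynamicTokens, possibleDefaultTokens);
alias Token = TokenStructure!(IdType, "");
template tok(string symbol)
{
    alias tok = TokenId!(IdType, staticTokens, dynamicTokens,
        possibleDefaultTokens, symbol);
}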
License:Boost License 1.0
Authors:Brian Schott, with ideas shamelessly stolen from Andrei Alexandrescu
Source:
std/lexer.d
- template TokenIdType(alias staticTokens, alias dynamicTokens, alias possibleDefaultTokens)
- Template for determining the type used for a token type. Selects the smallest
unsigned integral type that is able to hold the value
staticTokens.length + dynamicTokens.length + possibleDefaultTokens.length.
For example if there are 20 static tokens, 30 dynamic tokens,
and 10 possible default tokens, this template will alias itself to ubyte,
as 20 + 30 + 10 < ubyte.max.
Examples:
alias IdType = TokenIdType!(staticTokens, dynamicTokens, possibleDefaultTokens);
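For instance, under the 20/30/10 split described above the alias resolves to ubyte, which could be checked at compile time:
static assert (is(IdType == ubyte)); // assuming 20 + 30 + 10 token symbols in total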
- @property string tokenStringRepresentation(IdType, alias staticTokens, alias dynamicTokens, alias possibleDefaultTokens)(IdType type);
- Looks up the string representation of the given token type. This is the
inverse of the TokenId template.
Parameters:
IdType type |
the token type identifier |
Examples:
alias str = tokenStringRepresentation!(IdType, staticTokens, dynamicTokens, possibleDefaultTokens);
assert (str(tok!"*") == "*");
See Also:
TokenId
- template TokenId(IdType, alias staticTokens, alias dynamicTokens, alias possibleDefaultTokens, string symbol)
- Generates the token type identifier for the given symbol. There are two
special cases:
- If symbol is "", then the token identifier will be 0
- If symbol is "\0", then the token identifier will be the maximum
valid token type identifier
In all cases this template will alias itself to a constant of type IdType.
This template will fail at compile time if symbol is not one of
the staticTokens, dynamicTokens, or possibleDefaultTokens.
Examples:
template tok(string symbol)
{
alias tok = TokenId!(IdType, staticTokens, dynamicTokens,
possibleDefaultTokens, symbol);
}
IdType plus = tok!"+";
IdType num = tok!"numberLiteral";
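The two special cases described above can be spelled with the same tok template; the variable names are illustrative only:
IdType unrecognized = tok!""; // always 0; the example defaultTokenFunction returns this for unrecognized input
IdType endMarker = tok!"\0"; // the maximum valid token type identifier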
- struct TokenStructure(IdType, string extraFields = "");
- The token that is returned by the lexer.
Parameters:
IdType |
The D type used for the token's "type" field. |
extraFields |
A string containing D code for any extra fields that should
be included in the token structure body. This string is passed
directly to a mixin statement. |
Examples:
alias Token = TokenStructure!(IdType, "");
Token minusToken = Token(tok!"-");
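Because extraFields is mixed in verbatim, extra per-token data can be declared through it. This is a sketch only; the comment field is illustrative and not part of this module:
alias CommentedToken = TokenStructure!(IdType, q{string comment;});
Each CommentedToken then carries a comment field alongside text, line, column, index, and type.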
- const pure nothrow @safe bool opEquals(IdType type);
- == overload for the token type.
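For example, reusing minusToken from the TokenStructure example above:
assert (minusToken == tok!"-");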
- this(IdType type);
- Constructs a token from a token type.
Parameters:
IdType type |
the token type |
- this(IdType type, string text, size_t line, size_t column, size_t index);
- Constructs a token.
Parameters:
IdType type |
the token type |
string text |
the text of the token, which may be null |
size_t line |
the line number at which this token occurs |
size_t column |
the column number at which this token occurs |
size_t index |
the byte offset from the beginning of the input at which this
token occurs |
- string text;
- The text of the token.
- size_t line;
- The line number at which this token occurs.
- size_t column;
- The column number at which this token occurs.
- size_t index;
- The byte offset from the beginning of the input at which this token
occurs.
- IdType type;
- The token type.
- template Lexer(Token, alias defaultTokenFunction, alias tokenSeparatingFunction, alias staticTokens, alias dynamicTokens, alias possibleDefaultTokens, alias tokenHandlers)
- The implementation of the lexer is contained within this mixin template.
To use it, this template should be mixed in to a struct that represents the
lexer for your language. This struct should implement the following methods:
- popFront, which should call this mixin's popFront() and
additionally perform any token filtering or shuffling you deem
necessary. For example, you can implement popFront to skip comment
tokens (see the usage sketch after the example below).
- A function that serves as the default token lexing function. For
most languages this will be the identifier lexing function.
- A function that is able to determine if an identifier/keyword has
come to an end. This function must return bool and take
a single size_t argument representing the number of
bytes to skip over before looking for a separating character.
- Any functions referred to in the tokenHandlers template parameter.
These functions must be marked pure nothrow, take no
arguments, and return a token.
- A constructor that initializes the range field as well as calls
popFront() exactly once (to initialize the front field).
Examples:
struct CalculatorLexer
{
    mixin Lexer!(Token, defaultTokenFunction, isSeparating,
        staticTokens, dynamicTokens, possibleDefaultTokens, tokenHandlers);
    this (ubyte[] bytes)
    {
        this.range = LexerRange(bytes);
        popFront();
    }
    void popFront() pure
    {
        _popFront();
    }
    Token lexNumber() pure nothrow @safe
    {
        // Sketch: record the start position, consume decimal digits, and
        // build a "numberLiteral" token. A real lexer would also capture
        // the token text, e.g. through a StringCache.
        size_t index = range.index;
        size_t line = range.line;
        size_t column = range.column;
        while (!range.empty && range.front >= '0' && range.front <= '9')
            range.popFront();
        return Token(tok!"numberLiteral", null, line, column, index);
    }
    Token lexWhitespace() pure nothrow @safe
    {
        // Sketch: consume spaces, tabs, and line endings, bumping the line
        // counter via range.incrementLine() after each newline.
        size_t index = range.index;
        size_t line = range.line;
        size_t column = range.column;
        while (!range.empty && (range.front == ' ' || range.front == '\t'
                || range.front == '\r' || range.front == '\n'))
        {
            immutable bool newline = range.front == '\n';
            range.popFront();
            if (newline)
                range.incrementLine();
        }
        return Token(tok!"whitespace", null, line, column, index);
    }
    Token defaultTokenFunction() pure nothrow @safe
    {
        // Nothing else matched, so consume one byte and return the error token.
        range.popFront();
        return Token(tok!"");
    }
    bool isSeparating(size_t offset) pure nothrow @safe
    {
        // This language has no keywords or identifiers, so anything separates.
        return true;
    }
}
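As a usage sketch (a hypothetical helper, not part of this module), a lexer built this way can be consumed like any input range; the same loop also shows where popFront-level filtering of "whitespace" tokens would happen:
size_t countNonWhitespace(ubyte[] source)
{
    auto lexer = CalculatorLexer(source);
    size_t count;
    while (!lexer.empty)
    {
        if (lexer.front.type != tok!"whitespace")
            ++count;
        lexer.popFront();
    }
    return count;
}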
- const pure nothrow @property const(Token) front();
- Implements the range primitive front.
- const pure nothrow @property bool empty();
- Implements the range primitive empty.
- LexerRange range;
- The lexer input.
- Token _front;
- The token that is currently at the front of the range.
- struct LexerRange;
- Range structure that wraps the lexer's input.
- pure nothrow @safe this(const(ubyte)[] bytes, size_t index = 0, size_t column = 1, size_t line = 1);
- Parameters:
const(ubyte)[] bytes |
the lexer input |
size_t index |
the initial offset from the beginning of bytes |
size_t column |
the initial column number |
size_t line |
the initial line number |
- const pure nothrow @safe size_t mark();
- Returns:
a mark at the current position that can then be used with slice.
- pure nothrow @safe void seek(size_t m);
- Sets the range to the given position.
Parameters:
size_t m |
the position to seek to |
- const pure nothrow @safe const(ubyte)[] slice(size_t m);
- Returns a slice of the input byte array between the given mark and the
current position (a usage sketch follows this struct's member list).
Parameters:
size_t m |
the beginning index of the slice to return |
- const pure nothrow @safe bool empty();
- Implements the range primitive empty.
- const pure nothrow @safe ubyte front();
- Implements the range primitive front.
- const pure nothrow @safe const(ubyte)[] peek(size_t p);
- Returns:
the current item as well as the p items following it.
- const pure nothrow @safe ubyte peekAt(size_t offset);
- Returns:
the byte that is offset bytes ahead of the current position.
- const pure nothrow @safe bool canPeek(size_t p);
- Returns:
true if it is possible to peek p bytes ahead.
- pure nothrow @safe void popFront();
- Implements the range primitive popFront.
- pure nothrow @safe void popFrontN(size_t n);
- Implements the algorithm popFrontN more efficiently.
- pure nothrow @safe void incrementLine();
- Increments the range's line number and resets the column counter.
- const(ubyte)[] bytes;
- The input bytes.
- size_t index;
- The range's current position.
- size_t column;
- The current column number.
- size_t line;
- The current line number.
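As a sketch of how a token handler typically combines these primitives (a hypothetical free function, not part of this module): mark the start, consume bytes, then slice; seek can rewind a speculative scan:
const(ubyte)[] takeDigits(ref LexerRange range) pure nothrow @safe
{
    immutable m = range.mark(); // remember the start position
    while (!range.empty && range.front >= '0' && range.front <= '9')
        range.popFront(); // consume digits one byte at a time
    if (range.index == m) // nothing matched
    {
        range.seek(m); // rewind to the mark
        return null;
    }
    return range.slice(m); // bytes between the mark and the current position
}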
- struct StringCache;
- The string cache implements a map/set for strings. Placing a string in the
cache returns an identifier that can be used to instantly access the stored
string. It is then possible to simply compare these indexes instead of
performing full string comparisons when comparing the string content of
dynamic tokens. The string cache also manages its own memory, so that lexers
given mutable ubyte[] input can still have immutable string fields in their tokens.
Because the string cache also performs de-duplication it is possible to
drastically reduce the memory usage of a lexer.
- this(size_t bucketCount);
- Parameters:
size_t bucketCount |
the initial number of buckets. |
- pure nothrow @safe string cacheGet(const(ubyte[]) bytes);
- Equivalent to calling cache() and get().
Examples:
StringCache cache;
ubyte[] str = ['a', 'b', 'c'];
string s = cache.get(cache.cache(str));
assert(s == "abc");
- pure nothrow @safe string cacheGet(const(ubyte[]) bytes, uint hash);
- Equivalent to calling cache() and get().
- pure nothrow @safe size_t cache(const(ubyte)[] bytes);
- Caches a string.
Parameters:
const(ubyte)[] bytes |
the string to cache |
Returns:
A key that can be used to retrieve the cached string
Examples:
StringCache cache;
ubyte[] bytes = ['a', 'b', 'c'];
size_t first = cache.cache(bytes);
size_t second = cache.cache(bytes);
assert (first == second);
- pure nothrow @safe size_t cache(const(ubyte)[] bytes, uint hash);
- Caches a string as above, but uses the given hash code instead of
calculating one itself. Using this alongside hashStep() can reduce the
amount of work necessary when lexing dynamic tokens.
- const pure nothrow @safe string get(size_t index);
- Gets a cached string based on its key.
Parameters:
size_t index |
the key returned by cache() |
Returns:
the cached string
- static pure nothrow @safe uint hashStep(ubyte b, uint h);
- Incremental hashing.
Parameters:
ubyte b |
the byte to add to the hash |
uint h |
the hash that has been calculated so far |
Returns:
the new hash code for the string.
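As a sketch of the incremental workflow this enables (the starting hash value of 0 is an assumption, not specified here):
StringCache cache = StringCache(StringCache.defaultBucketCount);
ubyte[] bytes = ['a', 'b', 'c'];
uint hash = 0; // assumed starting value
foreach (b; bytes)
    hash = StringCache.hashStep(b, hash); // fold in one byte at a time
string s = cache.cacheGet(bytes, hash); // reuse the precomputed hash
assert (s == "abc");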
- static int defaultBucketCount;
- The default bucket count for the string cache.