curly.lexer

The lexer module provides one of Curly's main functions, tokenize().

The main idea of lexing is to split raw text into a structured list of well-defined parts called tokens. Each token has a class and some contents (for example, from {% if something %} we get the following contents for a StartBlockToken: if as the function name and ["something"] as the block expression).

Here is an example:

>>> from curly.lexer import tokenize
>>> text = '''\
...     Hello! My name is {{ name }}.\
... {% if likes %}And I like these things: {% loop likes %}\
... {{ item }},{% /loop %}{% /if %}'''
>>> for token in tokenize(text):
...     print(repr(token))
...
<LiteralToken(raw='    Hello! My name is ', contents={'text': '    Hello! My name is '})>
<PrintToken(raw='{{ name }}', contents={'expression': ['name']})>
<LiteralToken(raw='.', contents={'text': '.'})>
<StartBlockToken(raw='{% if likes %}', contents={'function': 'if', 'expression': ['likes']})>
<LiteralToken(raw='And I like these things: ', contents={'text': 'And I like these things: '})>
<StartBlockToken(raw='{% loop likes %}', contents={'function': 'loop', 'expression': ['likes']})>
<PrintToken(raw='{{ item }}', contents={'expression': ['item']})>
<LiteralToken(raw=',', contents={'text': ','})>
<EndBlockToken(raw='{% /loop %}', contents={'function': 'loop'})>
<EndBlockToken(raw='{% /if %}', contents={'function': 'if'})>
>>>

Some terminology:

function
The function is the name of the function to call within a block. For example, in the block tag {% if something %}, the function is if.
expression
The expression is something to print or to pass to a function. For example, in the block tag {% if lala | blabla | valueof "qq pp" %}, the expression is lala | blabla | valueof "qq pp". Usually, the expression is parsed according to POSIX shell lexing: ["lala", "|", "blabla", "|", "valueof", "qq pp"].
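
For example, the standard shlex module performs exactly this kind of POSIX shell lexing (whether Curly uses shlex internally is an implementation detail):

>>> import shlex
>>> shlex.split('lala | blabla | valueof "qq pp"')
['lala', '|', 'blabla', '|', 'valueof', 'qq pp']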

How to evaluate an expression is out of Curly's scope. By default, Curly tries to look the expression up in the context literally, but if you want, feel free to implement your own Jinja2-style DSL, or even call ast.parse() with compile().
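
For illustration only, here are two toy evaluators; neither function is part of Curly's API, and the names are made up:

    # Illustrative only: these helpers are not part of Curly.
    import ast

    def evaluate_literal(expression, context):
        # "Literal" lookup: treat the joined expression as a single context key.
        return context[" ".join(expression)]

    def evaluate_python(expression, context):
        # Treat the expression as a Python expression evaluated against the context.
        tree = ast.parse(" ".join(expression), mode="eval")
        return eval(compile(tree, "<expression>", "eval"), {}, dict(context))

    evaluate_literal(["name"], {"name": "Curly"})  # -> 'Curly'
    evaluate_python(["2", "+", "2"], {})           # -> 4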

For details on lexing, please check the tokenize() function.

class curly.lexer.EndBlockToken(raw_string)[source]

Bases: curly.lexer.Token

Responsible for matching the closing block tag of a function call.

In other words, it matches {% /function %}.

The contents of the token is the function name (its regular expression is REGEXP_FUNCTION).

REGEXP = re.compile('\n{%\\s* # open block tag\n/\\s* # / character\n([a-zA-Z0-9_-]+) # function name\n\\s*%} # closing block tag\n', re.MULTILINE|re.DOTALL|re.VERBOSE)

Regular expression of the token.
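
For example, assuming the package is importable, the documented pattern can be exercised directly (an illustrative check, not part of the official examples):

>>> from curly.lexer import EndBlockToken
>>> EndBlockToken.REGEXP.match('{% /loop %}').group(1)
'loop'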

class curly.lexer.LiteralToken(text)[source]

Bases: curly.lexer.Token

Responsible for the parts of the text which are literal.

Literal parts of the text should be printed as is; they are context-independent and not enclosed in any tag. In other words, they are placed outside of any tag.

For example, in the template {{ first_name }} - {{ last_name }}, the literal token is " - " (yes, with the spaces).
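
Tokenizing that template shows the literal token explicitly (the output mirrors the example at the top of this page):

>>> from curly.lexer import tokenize
>>> for token in tokenize('{{ first_name }} - {{ last_name }}'):
...     print(repr(token))
...
<PrintToken(raw='{{ first_name }}', contents={'expression': ['first_name']})>
<LiteralToken(raw=' - ', contents={'text': ' - '})>
<PrintToken(raw='{{ last_name }}', contents={'expression': ['last_name']})>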

extract_contents(matcher)

Extract more detailed token information from the regular expression match.

Parameters:matcher (re.match) – Regular expression matcher.
Returns:Details on the token.
Return type:dict[str, str]

class curly.lexer.PrintToken(raw_string)[source]

Bases: curly.lexer.Token

Responsible for matching the print tag {{ var }}.

The contents of the token is the expression which should be printed. In {{ var }} it is ["var"]. The regular expression for the expression is REGEXP_EXPRESSION.
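
For a quick illustrative check (the repr format mirrors the tokenize() example above):

>>> from curly.lexer import PrintToken
>>> PrintToken('{{ var }}')
<PrintToken(raw='{{ var }}', contents={'expression': ['var']})>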

REGEXP = re.compile("\n{{\\s* # open {{\n((?:\\\\.|[^\\{\\}%])+) # expression 'var' in {{ var }}\n\\s*}} # closing }}\n", re.MULTILINE|re.DOTALL|re.VERBOSE)

Regular expression of the token.

curly.lexer.REGEXP_EXPRESSION = '(?:\\\\.|[^\\{\\}%])+'

Regular expression for ‘expression’ definition.

curly.lexer.REGEXP_FUNCTION = '[a-zA-Z0-9_-]+'

Regular expression for function definition.
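
Both constants are plain pattern strings, so they can be tried out directly with the standard re module (illustrative only):

>>> import re
>>> from curly.lexer import REGEXP_FUNCTION, REGEXP_EXPRESSION
>>> bool(re.fullmatch(REGEXP_FUNCTION, 'loop'))
True
>>> bool(re.fullmatch(REGEXP_EXPRESSION, 'lala | blabla | valueof "qq pp"'))
True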

class curly.lexer.StartBlockToken(raw_string)[source]

Bases: curly.lexer.Token

Responsible for matching the opening block tag of a function call.

In other words, it matches {% function expr1 expr2 expr3... %}.

The contents of the token are the function name and the expression. The regular expression for the function is REGEXP_FUNCTION; for the expression, REGEXP_EXPRESSION.
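
For a quick illustrative check (this repr also appears in the tokenize() example above):

>>> from curly.lexer import StartBlockToken
>>> StartBlockToken('{% loop likes %}')
<StartBlockToken(raw='{% loop likes %}', contents={'function': 'loop', 'expression': ['likes']})>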

REGEXP = re.compile('\n{%\\s* # open block tag\n([a-zA-Z0-9_-]+) # function name\n((?:\\\\.|[^\\{\\}%])+)? # expression for function\n\\s*%} # closing block tag\n', re.MULTILINE|re.DOTALL|re.VERBOSE)

Regular expression of the token.

class curly.lexer.Token(raw_string)[source]

Bases: collections.UserString

Base class for every token to parse.

A token is recognized by tokenize() only if its class defines the REGEXP attribute.

Parameters:raw_string (str) – Text which was recognized as a token.
Raises:curly.exceptions.CurlyLexerStringDoesNotMatchError: if the string does not match the regular expression.
extract_contents(matcher)[source]

Extract more detailed token information from the regular expression match.

Parameters:matcher (re.match) – Regular expression matcher.
Returns:Details on the token.
Return type:dict[str, str]
maketrans()

Return a translation table usable for str.translate().

If there is only one argument, it must be a dictionary mapping Unicode ordinals (integers) or characters to Unicode ordinals, strings or None. Character keys will be then converted to ordinals. If there are two arguments, they must be strings of equal length, and in the resulting dictionary, each character in x will be mapped to the character at the same position in y. If there is a third argument, it must be a string, whose characters will be mapped to None in the result.

curly.lexer.get_token_patterns[source]

Mapping of pattern names to their token classes.

Returns:Mapping of the known tokens that define regular expressions.
Return type:dict[str, Token]
curly.lexer.make_tokenizer_regexp[source]

Create regular expression for tokenize().

This small wrapper takes a list of known tokens and their regular expressions and concatenates them into one big expression.

Returns:Regular expression for tokenize() function.
Return type:re.regex
curly.lexer.tokenize(text)[source]

Lexical analysis of the given text.

Main lexing function: it takes text and returns an iterator over the produced tokens. There are several facts you should know about this function:

  1. It does not raise exceptions. If something goes fishy, the tokenizer falls back to LiteralToken.

  2. It uses one big regular expression, taken from make_tokenizer_regexp(). This regular expression looks like this:

    (?P<SomeToken>{%\s*(\S+)\s*%})|(?P<AnotherToken>{{\s*(\w+)\s*}})
    
  3. Actually, the function searches only for template tokens; emitting LiteralToken is a side effect.

The logic of the function is quite simple:

  1. It gets the regular expression to match from make_tokenizer_regexp().

  2. The function traverses the text using the re.regex.finditer() method, which yields non-overlapping matches of the regular expression.

  3. When a match is found, we check whether a LiteralToken has to be emitted for the text before the match. Let's say we have a text like this:

    'Hello, {{ var }}'
    

    The first match yielded by re.regex.finditer() will be for "{{ var }}", so we have skipped over the "Hello, " substring. To emit a LiteralToken for it, we need to remember the position where the last match ended (re.match.end(); it is safe to start with 0) and where the new one starts (re.match.start()).

    So text[previous_end:matcher.start(0)] is our text "Hello, ", which goes into a LiteralToken.

  4. When the iteration stops, we need to check whether there are any leftovers after the last match. This is done by emitting a LiteralToken with the text[previous_end:] text (if it is non-empty, obviously); see the sketch below.
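
Putting these steps together, here is a simplified sketch of that logic. It is purely illustrative: simple_tokenize is a made-up name, and the real tokenize() builds its combined pattern via make_tokenizer_regexp() and get_token_patterns() instead of hardcoding the three tag token classes as done here.

    # Illustrative sketch, not Curly's actual implementation.
    import re

    from curly.lexer import (EndBlockToken, LiteralToken, PrintToken,
                             StartBlockToken)

    TOKEN_CLASSES = {
        "print": PrintToken,
        "start_block": StartBlockToken,
        "end_block": EndBlockToken,
    }
    # One big alternation of named groups, as described in fact 2 above.
    PATTERN = re.compile(
        "|".join(
            "(?P<{0}>{1})".format(name, cls.REGEXP.pattern)
            for name, cls in TOKEN_CLASSES.items()),
        re.MULTILINE | re.DOTALL | re.VERBOSE)

    def simple_tokenize(text):
        previous_end = 0
        for matcher in PATTERN.finditer(text):
            if matcher.start(0) > previous_end:
                # Literal text between the previous match and this one.
                yield LiteralToken(text[previous_end:matcher.start(0)])
            # matcher.lastgroup is the name of the named group that matched.
            yield TOKEN_CLASSES[matcher.lastgroup](matcher.group(0))
            previous_end = matcher.end(0)
        if text[previous_end:]:
            # Leftovers after the last match.
            yield LiteralToken(text[previous_end:])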

Parameters:text (str or bytes) – Text to lex into tokens.
Returns:Generator with Token instances.
Return type:Generator[Token]