About tokens

Warning

The module token_utils has been made into its own library. It should be installed automatically when installing ideas. You can get it on its own using python -m pip install token-utils

While ideas aims to provide support for all kinds of transformations, including those that affect the Abstract Syntax Tree or the bytecode, most transformations deal with exploring alternative syntax that is not compatible with Python’s current grammar. Such alternative syntax cannot be parsed by Python without generating a SyntaxError, thus preventing the execution of the code. For this reason, almost all of our examples transform the code before letting Python parse it. We do this using a set of tools built upon Python’s tokenize module.

In our description of these tools below, we assume that you are somewhat familiar with the concept of token objects generated by Python’s tokenize module. If you are not, we suggest that you read at least once through the documentation of Python’s tokenize module mentioned above.

The main points to understand:

  • Using the tokenize function, a source can be broken down into tokens which, as generated by Python, are 5-tuples carrying information about their type, their string content, their position in the source (identified by starting and ending row, aka line number, and column), as well as the content of the line where they are found.

  • From a list of tokens, the original source can essentially be recreated by using the untokenize function. However, as stated in the documentation:

    The result is guaranteed to tokenize back to match the input so that the conversion is lossless and round-trips are assured. The guarantee applies only to the token type and token string as the spacing between tokens (column positions) may change.

  • To untokenize using the function from the Python standard library, one can use either a list of 5-tuple tokens, or a list of 2-tuple tokens that include only the type and string information; a short example using the standard library follows this list.
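
To make these points concrete, here is a small sketch using only the standard library; the exact sequence of tokens printed may vary slightly between Python versions.

    import io
    import tokenize

    source = "answer = 42  # the answer\n"

    # Break the source down into tokens; each token is a 5-tuple
    # (type, string, start, end, line), also available as named attributes.
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    for tok in tokens:
        print(tokenize.tok_name[tok.type], repr(tok.string), tok.start, tok.end)

    # Recreate the source: token types and strings are guaranteed to match,
    # but the spacing between tokens may differ from the original.
    recreated = tokenize.untokenize(tokens)
    print(recreated)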

About Ideas’ tokens

Recently (Feb. 21, 2020), on the Python-ideas mailing list, Andrew Barnert wrote:

Unfortunately, the boilerplate to write an import hook is more complicated than you’d like (and pretty hard to figure out the first time), and the support for filtering on the token stream (the most obvious way to do this one) rather than the text stream, AST, or bytecode is pretty minimal and clumsy. [emphasis added]

Ideas uses token-utils which defines its own Token class built from Python’s tokens. While they carry the same information, they are much easier to use and manipulate.

Below is the API from the token_utils module.

Tip

While we show below the full API of the token_utils module, you might want to first go to the next page to see a demonstration of its usage, done in an actual programming session using a Jupyter notebook.

token_utils.py API extracted by Sphinx

token_utils.py

A collection of useful functions and methods to deal with tokenizing source code.

class token_utils.Token(token)[source]

Token as generated by Python’s tokenize.generate_tokens, written here in a more convenient form and with some custom methods.

The various parameters are:

type: token type
string: the token written as a string
start = (start_row, start_col)
end = (end_row, end_col)
line: entire line of code where the token is found.

Token instances are mutable objects. Therefore, given a list of tokens, we can change the value of any token’s attribute, untokenize the list and automatically obtain a transformed source.
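
A minimal sketch of that workflow (the invented plus keyword is only for illustration; the exact spacing of the result may differ slightly):

    from token_utils import tokenize, untokenize

    # 'plus' is not valid Python, but it tokenizes just fine as a NAME token.
    source = "total = price plus tax\n"
    tokens = tokenize(source)

    # Mutate a token in place, then rebuild the source from the modified list.
    for token in tokens:
        if token.string == "plus":
            token.string = "+"

    print(untokenize(tokens))  # expected: total = price + tax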

is_comment()[source]

Returns True if the token is a comment.

is_complex()[source]

Returns True if the token represents a complex number.

is_float()[source]

Returns True if the token represents a float.

is_identifier()[source]

Returns True if the token represents a valid Python identifier excluding Python keywords.

Note: this is different from Python’s string method isidentifier which also returns True if the string is a keyword.

is_in(iterable)[source]

Returns True if the string attribute is found as an item of iterable.

is_integer()[source]

Returns True if the token represents an integer.

is_keyword()[source]

Returns True if the token represents a Python keyword.

is_name()[source]

Returns True if the token is of type NAME.

is_not_in(iterable)[source]

Returns True if the string attribute is not found as an item of iterable.

is_number()[source]

Returns True if the token represents a number.

is_space()[source]

Returns True if the token indicates a change in indentation, the end of a line, or the end of the source (INDENT, DEDENT, NEWLINE, NL, and ENDMARKER).

Note that spaces, including tab characters \t, between tokens on a given line are not considered to be tokens themselves.

is_string()[source]

Returns True if the token is a string.
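
To illustrate a few of these predicates, here is a small sketch; tokens that match none of the tested categories are simply skipped:

    from token_utils import tokenize

    for token in tokenize("total = 2.5  # price\n"):
        if token.is_identifier():
            print("identifier:", token.string)
        elif token.is_number():
            print("number:", token.string)
        elif token.is_comment():
            print("comment:", token.string)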

token_utils.dedent(tokens, nb)[source]

Given a list of tokens, produces an equivalent list corresponding to a line of code with the first nb characters removed.
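
A sketch of how this might be combined with get_lines and untokenize (both documented below); this assumes that recombining the per-line token lists and untokenizing them reproduces the modified source:

    from token_utils import get_lines, dedent, untokenize

    source = "if True:\n        x = 1\n"
    lines = get_lines(source)

    # Shift the over-indented second line 4 characters to the left.
    lines[1] = dedent(lines[1], 4)

    new_tokens = []
    for line_tokens in lines:
        new_tokens.extend(line_tokens)

    print(untokenize(new_tokens))
    # expected:
    # if True:
    #     x = 1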

token_utils.find_token_by_position(tokens, row, column)[source]

Given a list of tokens, a specific row (line number) and column, a two-tuple is returned that includes the token found at that position as well as its list index.

If no such token can be found, None, None is returned.
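
A brief sketch, assuming the same row and column conventions as Python’s tokenize module (rows start at 1, columns at 0):

    from token_utils import tokenize, find_token_by_position

    source = "value = alpha + beta\n"
    tokens = tokenize(source)

    # Column 8 of row 1 falls inside the identifier 'alpha'.
    token, index = find_token_by_position(tokens, 1, 8)
    if token is not None:
        print(token.string, index)  # expected: 'alpha' and its index in the list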

token_utils.fix_empty_line(source, tokens)[source]

Python’s tokenizer entirely drops a last line if it consists only of space and/or tab characters. To ensure that we can always have:

untokenize(tokenize(source)) == source

we correct the last token content if needed.

token_utils.get_first(tokens, exclude_comment=True)[source]

Given a list of tokens, find the first token which is not a space token (such as a NEWLINE, INDENT, DEDENT, etc.) and, by default, also not a COMMENT.

COMMENT tokens can be included by setting exclude_comment to False.

Returns None if none is found.
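
This is useful when a transformation only cares about how a line begins; a sketch, using an invented repeat keyword:

    from token_utils import get_lines, get_first

    source = "# setup\nrepeat 3:\n    pass\n"

    for line_tokens in get_lines(source):
        first = get_first(line_tokens)  # skips space tokens and comments
        if first is not None and first.string == "repeat":
            print("Found 'repeat' on line", first.start[0])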

token_utils.get_first_index(tokens, exclude_comment=True)[source]

Given a list of tokens, find the index of the first token which is not a space token (such as a NEWLINE, INDENT, DEDENT, etc.) and, by default, also not a COMMENT. To include COMMENT tokens, set exclude_comment to False.

Returns None if none is found.

token_utils.get_last(tokens, exclude_comment=True)[source]

Given a list of tokens, find the last token which is not a space token (such as a NEWLINE, INDENT, DEDENT, etc.) and, by default, also not a COMMENT.

COMMENT tokens can be included by setting exclude_comment to False.

Returns None if none is found.

token_utils.get_last_index(tokens, exclude_comment=True)[source]

Given a list of tokens, find the index of the last token which is not a space token (such as a NEWLINE, INDENT, DEDENT, etc.) and, by default, also not a COMMENT. To include COMMENT tokens, set exclude_comment to False.

Returns None if none is found.

token_utils.get_lines(source, warning=True)[source]

Transforms a source (string) into a list of lists of Tokens, with each inner list containing all the tokens found on a given line of code.

token_utils.get_number(tokens, exclude_comment=True)[source]

Given a list of tokens, returns a count of the tokens which are not space tokens (such as NEWLINE, INDENT, DEDENT, etc.).

By default, COMMENT tokens are not included in the count. If you wish to include them, set exclude_comment to False.
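
For example, in the following sketch only the name, the operator and the number should be counted by default, with the comment adding one more when it is included:

    from token_utils import tokenize, get_number

    tokens = tokenize("x = 1  # set x\n")
    print(get_number(tokens))                         # counts x, =, 1
    print(get_number(tokens, exclude_comment=False))  # also counts the comment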

token_utils.indent(tokens, nb, tab=False)[source]

Given a list of tokens, produces an equivalent list corresponding to a line of code with nb space characters inserted at the beginning.

If tab is specified to be True, nb tab characters are inserted instead of spaces.
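
A sketch of indenting a small source line by line, skipping any line made up only of space tokens; as above, this assumes that the recombined per-line token lists untokenize back to the modified source:

    from token_utils import get_lines, indent, untokenize

    source = "a = 1\nb = 2\n"

    new_tokens = []
    for line_tokens in get_lines(source):
        # Only indent lines that contain something other than space tokens.
        if line_tokens and not all(token.is_space() for token in line_tokens):
            line_tokens = indent(line_tokens, 4)
        new_tokens.extend(line_tokens)

    print(untokenize(new_tokens))
    # expected:
    #     a = 1
    #     b = 2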

token_utils.print_tokens(source)[source]

Prints tokens found in source, excluding spaces and comments.

source is either a string to be tokenized, or a list of Token objects.

This is occasionally useful as a debugging tool.
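
For example:

    from token_utils import print_tokens

    # Quickly inspect how a line of code is broken down into tokens.
    print_tokens("x = (1, 2.0, 'three')\n")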

token_utils.strip_comment(line)[source]

Removes comments from a line.

token_utils.tokenize(source, warning=True)[source]

Transforms a source (string) into a list of Tokens.

If an exception is raised by Python’s tokenize module, the list of tokens accumulated up to that point is returned.

token_utils.untokenize(tokens)[source]

Return source code based on tokens.

Adapted from https://github.com/myint/untokenize, Copyright (C) 2013-2018 Steven Myint, MIT License (same as this project).

This is similar to Python’s own tokenize.untokenize(), except that it preserves spacing between tokens by using the line information recorded by Python’s tokenize.generate_tokens. As a result, if the original source code had multiple spaces between some tokens, or if escaped newlines were used, or if tab characters were present in the original source, those will also be present in the source code produced by untokenize.

Thus source == untokenize(tokenize(source)).

Note: if you are modifying tokens from an original source:

Instead of full Token objects, untokenize will accept simple strings; however, it will only insert them as is, without taking them into account when figuring out the spacing between tokens.
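
A short sketch of the round-trip guarantee described above, including some unusual spacing:

    from token_utils import tokenize, untokenize

    # Extra spaces, a tab character and an escaped newline should all be preserved.
    source = "a  =\t1 + \\\n    2\n"
    assert untokenize(tokenize(source)) == source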