Tokenizing notebook
Note
This is from a Jupyter notebook used to demonstrate some usage of the tokenizing utility functions. The content is in two main sections. First, we demonstrate the usage of various functions to get information about the source. Once this is done, we demonstrate how to change the content in a reliable way.
First, the all-important import statement.
[1]:
from ideas import token_utils
Getting information
We start with a very simple example, where we have a repeated token, a.
[2]:
source = "a = a"
tokens = token_utils.tokenize(source)
for token in tokens:
    print(token)
type=1 (NAME) string='a' start=(1, 0) end=(1, 1) line='a = a'
type=53 (OP) string='=' start=(1, 2) end=(1, 3) line='a = a'
type=1 (NAME) string='a' start=(1, 4) end=(1, 5) line='a = a'
type=4 (NEWLINE) string='' start=(1, 5) end=(1, 6) line=''
type=0 (ENDMARKER) string='' start=(2, 0) end=(2, 0) line=''
Notice how the NEWLINE token here, in spite of its name, does not correspond to \n.
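Python's standard tokenize module behaves the same way; the following stdlib-only check (independent of token_utils) confirms that the final NEWLINE token of a source lacking a trailing newline has an empty string.

```python
import io
import tokenize

# Tokenize the same one-line source with Python's standard library.
source = "a = a"
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# The token just before ENDMARKER is a NEWLINE whose string is empty,
# because the source does not actually end with "\n".
newline = tokens[-2]
print(tokenize.tok_name[newline.type], repr(newline.string))
```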
Comparing tokens
Tokens are considered equal if they have the same string attribute. Given this notion of equality, we make things even simpler by allowing a token to be compared directly to a string, as shown below.
[3]:
print(tokens[0] == tokens[2])
print(tokens[0] == tokens[2].string)
print(tokens[0] == 'a') # <-- Our normal choice
True
True
True
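This string-based equality could be sketched roughly as follows. The Token class below is a hypothetical stand-in for illustration, not the actual token_utils implementation:

```python
class Token:
    """Hypothetical sketch of a token wrapper whose equality depends
    only on the string attribute (not on positions or token types)."""

    def __init__(self, string):
        self.string = string

    def __eq__(self, other):
        # Accept either another token-like object or a plain string.
        other_string = getattr(other, "string", other)
        return self.string == other_string


print(Token("a") == Token("a"))  # True
print(Token("a") == "a")         # True
print(Token("a") == "b")         # False
```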
Printing tokens by line of code
If we want to tokenize a source and print the result, or print an existing list of tokens, we can use print_tokens to do it in a single instruction, with the added benefit of separating tokens from different lines of code.
[4]:
source = """
if True:
pass
"""
token_utils.print_tokens(source)
type=56 (NL) string='\n' start=(1, 0) end=(1, 1) line='\n'
type=1 (NAME) string='if' start=(2, 0) end=(2, 2) line='if True:\n'
type=1 (NAME) string='True' start=(2, 3) end=(2, 7) line='if True:\n'
type=53 (OP) string=':' start=(2, 7) end=(2, 8) line='if True:\n'
type=4 (NEWLINE) string='\n' start=(2, 8) end=(2, 9) line='if True:\n'
type=5 (INDENT) string=' ' start=(3, 0) end=(3, 4) line=' pass\n'
type=1 (NAME) string='pass' start=(3, 4) end=(3, 8) line=' pass\n'
type=4 (NEWLINE) string='\n' start=(3, 8) end=(3, 9) line=' pass\n'
type=6 (DEDENT) string='' start=(4, 0) end=(4, 0) line=''
type=0 (ENDMARKER) string='' start=(4, 0) end=(4, 0) line=''
Getting tokens by line of code
Once a source is broken down into tokens, it might be difficult to find particular tokens of interest if we print the entire content. Instead, using get_lines, we can tokenize by line of code and focus on just a few lines of interest.
[5]:
source = """
if True:
if False:
pass
else:
a = 42 # a comment
print('ok')
"""
lines = token_utils.get_lines(source)
for line in lines[4:6]:
    for token in line:
        print(token)
    print()
type=6 (DEDENT) string='' start=(5, 4) end=(5, 4) line=' else:\n'
type=1 (NAME) string='else' start=(5, 4) end=(5, 8) line=' else:\n'
type=53 (OP) string=':' start=(5, 8) end=(5, 9) line=' else:\n'
type=4 (NEWLINE) string='\n' start=(5, 9) end=(5, 10) line=' else:\n'
type=5 (INDENT) string=' ' start=(6, 0) end=(6, 8) line=' a = 42 # a comment\n'
type=1 (NAME) string='a' start=(6, 8) end=(6, 9) line=' a = 42 # a comment\n'
type=53 (OP) string='=' start=(6, 10) end=(6, 11) line=' a = 42 # a comment\n'
type=2 (NUMBER) string='42' start=(6, 12) end=(6, 14) line=' a = 42 # a comment\n'
type=55 (COMMENT) string='# a comment' start=(6, 15) end=(6, 26) line=' a = 42 # a comment\n'
type=4 (NEWLINE) string='\n' start=(6, 26) end=(6, 27) line=' a = 42 # a comment\n'
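The grouping performed by get_lines can be approximated with the standard library alone. The sketch below starts a new group after every NEWLINE or NL token; it is an assumption that token_utils uses a similar rule.

```python
import io
import tokenize

def group_by_line(source):
    """Group tokens into logical lines: close a group after NEWLINE or NL."""
    lines, current = [], []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        current.append(tok)
        if tok.type in (tokenize.NEWLINE, tokenize.NL):
            lines.append(current)
            current = []
    if current:  # trailing DEDENT/ENDMARKER tokens
        lines.append(current)
    return lines

grouped = group_by_line("a = 1\nb = 2\n")
for line in grouped:
    print([tok.string for tok in line])
```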
Getting particular tokens
Let’s focus on the sixth line.
[6]:
line = lines[5]
print( token_utils.untokenize(line) )
a = 42 # a comment
Ignoring the indentation, the first token is a; ignoring newline indicators and comments, the last token is 42. We can get at these tokens using some utility functions.
[7]:
print("The first useful token is:\n ", token_utils.get_first(line))
print("The index of the first token is: ", token_utils.get_first_index(line))
print()
print("The last useful token on that line is:\n ", token_utils.get_last(line))
print("Its index is", token_utils.get_last_index(line))
The first useful token is:
type=1 (NAME) string='a' start=(6, 8) end=(6, 9) line=' a = 42 # a comment\n'
The index of the first token is: 1
The last useful token on that line is:
type=2 (NUMBER) string='42' start=(6, 12) end=(6, 14) line=' a = 42 # a comment\n'
Its index is 3
Note that these four functions, get_first, get_first_index, get_last, and get_last_index, exclude end-of-line comments by default; this can be changed by setting the optional parameter exclude_comment to False.
[8]:
print( token_utils.get_last(line, exclude_comment=False))
type=55 (COMMENT) string='# a comment' start=(6, 15) end=(6, 26) line=' a = 42 # a comment\n'
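A get_last-style lookup can be sketched with the standard library: scan from the end of the line, skipping the non-significant token types. Exactly which types token_utils skips is an assumption here.

```python
import io
import tokenize

def last_useful(tokens, exclude_comment=True):
    """Return the last 'useful' token: skip NEWLINE, NL, ENDMARKER and
    DEDENT tokens, and optionally COMMENT tokens as well."""
    skip = {tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER, tokenize.DEDENT}
    if exclude_comment:
        skip.add(tokenize.COMMENT)
    for tok in reversed(tokens):
        if tok.type not in skip:
            return tok
    return None

toks = list(tokenize.generate_tokens(io.StringIO("a = 42  # a comment\n").readline))
print(last_useful(toks).string)                          # 42
print(last_useful(toks, exclude_comment=False).string)   # # a comment
```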
Getting the indentation of a line
The sixth line starts with an INDENT token. We can get the indentation of that line either from the length of the INDENT token's string, or from the start_col attribute of the first “useful” token. The attribute start_col is part of the two-tuple start = (start_row, start_col).
[9]:
print(len(line[0].string))
first = token_utils.get_first(line)
print(first.start_col)
8
8
In general, the second method is more reliable. For example, if we look at the tokens of the previous line (line 5, index 4), we can see that the first token is a DEDENT whose string is empty: its length tells us nothing about the line's indentation. Furthermore, a given line may start with multiple DEDENT tokens. However, once again, the start_col attribute of the first “useful” token gives us the correct value.
[10]:
for token in lines[4]:
    print(token)
print("-" * 50)
print(token_utils.untokenize(lines[4]))
first = token_utils.get_first(lines[4])
print("indentation = ", first.start_col)
type=6 (DEDENT) string='' start=(5, 4) end=(5, 4) line=' else:\n'
type=1 (NAME) string='else' start=(5, 4) end=(5, 8) line=' else:\n'
type=53 (OP) string=':' start=(5, 8) end=(5, 9) line=' else:\n'
type=4 (NEWLINE) string='\n' start=(5, 9) end=(5, 10) line=' else:\n'
--------------------------------------------------
else:
indentation = 4
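The same idea works with raw stdlib tokens, whose start attribute is the (row, col) tuple; token_utils apparently exposes the column separately as start_col for convenience.

```python
import io
import tokenize

source = "if True:\n    a = 1\n"
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# The first NAME token on row 2 is 'a'; its start column is the indentation.
first_useful = next(t for t in tokens
                    if t.start[0] == 2 and t.type == tokenize.NAME)
print(first_useful.string, first_useful.start[1])  # a 4
```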
Changing information
Suppose we wish to do the following replacement
repeat n: --> for some_variable in range(n):
Here n might be anything that evaluates to an integer. Let’s see a couple of different ways to do this.
First, we simply change the string content of two tokens.
[11]:
source = "repeat 2 * 3 : "
tokens = token_utils.tokenize(source)
repeat = token_utils.get_first(tokens)
colon = token_utils.get_last(tokens)
repeat.string = "for some_variable in range("
colon.string = "):"
print(token_utils.untokenize(tokens))
for some_variable in range( 2 * 3 ):
Let’s revert the change to the colon, to see a different way of doing the same thing.
[12]:
colon.string = ":"
print(token_utils.untokenize(tokens))
for some_variable in range( 2 * 3 :
This time, let’s insert an extra token, written as a simple Python string.
[13]:
colon_index = token_utils.get_last_index(tokens)
tokens.insert(colon_index, ")")
for token in tokens:
    print(token)
type=1 (NAME) string='for some_variable in range(' start=(1, 0) end=(1, 6) line='repeat 2 * 3 : '
type=2 (NUMBER) string='2' start=(1, 7) end=(1, 8) line='repeat 2 * 3 : '
type=53 (OP) string='*' start=(1, 9) end=(1, 10) line='repeat 2 * 3 : '
type=2 (NUMBER) string='3' start=(1, 11) end=(1, 12) line='repeat 2 * 3 : '
)
type=53 (OP) string=':' start=(1, 13) end=(1, 14) line='repeat 2 * 3 : '
type=4 (NEWLINE) string='' start=(1, 15) end=(1, 16) line=''
type=0 (ENDMARKER) string='' start=(2, 0) end=(2, 0) line=''
In spite of ')' being a normal Python string, it can still be processed correctly by the untokenize function.
[14]:
print(token_utils.untokenize(tokens))
for some_variable in range( 2 * 3) :
Thus, unlike Python’s own untokenize function, we do not have to worry about token types when we wish to insert extra tokens.
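A deliberately crude sketch of such a string-tolerant untokenize: each item contributes its string, whether it is a token object or already a plain str. The real token_utils.untokenize also restores the original spacing between tokens, which this sketch ignores; the Tok class is hypothetical.

```python
class Tok:
    """Hypothetical minimal token holder, used only for this illustration."""
    def __init__(self, string):
        self.string = string

def naive_untokenize(items):
    # Plain strings pass through unchanged; token objects contribute .string.
    return "".join(item if isinstance(item, str) else item.string
                   for item in items)

mixed = [Tok("print"), Tok("("), "42", Tok(")")]
print(naive_untokenize(mixed))  # print(42)
```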
Changing indentation
We can easily change the indentation of a given line using either the indent or the dedent function.
[18]:
source = """
if True:
a = 1
b = 2
"""
# First, reducing the indentation of the "b = 2" line
lines = token_utils.get_lines(source)
a_line = lines[2]
a = token_utils.get_first(a_line)
assert a == "a"
b_line = lines[3]
b = token_utils.get_first(b_line)
lines[3] = token_utils.dedent(b_line, b.start_col - a.start_col)
print(token_utils.untokenize(a_line))
print(token_utils.untokenize(lines[3]))
a = 1
b = 2
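On plain text, the dedent operation amounts to stripping up to n leading spaces from a line. Here is a rough text-level sketch; token_utils.dedent works on tokens and adjusts their positions, which this version does not attempt.

```python
def dedent_text_line(line, nspaces):
    """Remove up to nspaces leading spaces from a single line of text."""
    leading = len(line) - len(line.lstrip(" "))
    return line[min(nspaces, leading):]

print(repr(dedent_text_line("        b = 2", 4)))  # '    b = 2'
```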
Alternatively, we can indent the “a = 1” line.
[20]:
lines = token_utils.get_lines(source)
a_line = lines[2]
a = token_utils.get_first(a_line)
assert a == "a"
b_line = lines[3]
b = token_utils.get_first(b_line)
lines[2] = token_utils.indent(a_line, b.start_col - a.start_col)
print(token_utils.untokenize(lines[2]))
print(token_utils.untokenize(b_line))
a = 1
b = 2
Finally, let’s recover the entire source with the fixed indentation.
[21]:
new_tokens = []
for line in lines:
    new_tokens.extend(line)
print(token_utils.untokenize(new_tokens))
if True:
a = 1
b = 2