Tokenizing notebook
Note
This is from a Jupyter notebook used to demonstrate some usage of the tokenizing utility functions. The content is in two main sections. First, we demonstrate the usage of various functions to get information about the source. Once this is done, we demonstrate how to change the content in a reliable way.
First, the all-important import statement.
[1]:
from ideas import token_utils
Getting information
We start with a very simple example, where we have a repeated token, a.
[2]:
source = "a = a"
tokens = token_utils.tokenize(source)
for token in tokens:
    print(token)
type=1 (NAME) string='a' start=(1, 0) end=(1, 1) line='a = a'
type=53 (OP) string='=' start=(1, 2) end=(1, 3) line='a = a'
type=1 (NAME) string='a' start=(1, 4) end=(1, 5) line='a = a'
type=4 (NEWLINE) string='' start=(1, 5) end=(1, 6) line=''
type=0 (ENDMARKER) string='' start=(2, 0) end=(2, 0) line=''
Notice how the NEWLINE token here, in spite of its name, does not correspond to \n.
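Python's standard tokenize module behaves the same way; the following stdlib-only check (independent of token_utils) confirms that the final NEWLINE token of a source lacking a trailing newline has an empty string.

```python
import io
import tokenize

# Tokenize the same one-line source with Python's standard library.
source = "a = a"
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# The token just before ENDMARKER is a NEWLINE whose string is empty,
# because the source does not actually end with "\n".
newline = tokens[-2]
print(tokenize.tok_name[newline.type], repr(newline.string))
```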
Comparing tokens
Tokens are considered equal if they have the same string attribute. Given this notion of equality, we make things even simpler by allowing a token to be compared directly to a string, as shown below.
[3]:
print(tokens[0] == tokens[2])
print(tokens[0] == tokens[2].string)
print(tokens[0] == 'a') # <-- Our normal choice
True
True
True
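This string-based equality could be sketched roughly as follows. The Token class below is a hypothetical stand-in for illustration, not the actual token_utils implementation:

```python
class Token:
    """Hypothetical sketch of a token wrapper whose equality depends
    only on the string attribute (not on positions or token types)."""

    def __init__(self, string):
        self.string = string

    def __eq__(self, other):
        # Accept either another token-like object or a plain string.
        other_string = getattr(other, "string", other)
        return self.string == other_string


print(Token("a") == Token("a"))  # True
print(Token("a") == "a")         # True
print(Token("a") == "b")         # False
```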
Printing tokens by line of code
If we want to tokenize a source and print the result, or print an existing list of tokens, we can use print_tokens to do it in a single instruction, with the added benefit of separating tokens from different lines of code.
[4]:
source = """
if True:
pass
"""
token_utils.print_tokens(source)
type=56 (NL) string='\n' start=(1, 0) end=(1, 1) line='\n'
type=1 (NAME) string='if' start=(2, 0) end=(2, 2) line='if True:\n'
type=1 (NAME) string='True' start=(2, 3) end=(2, 7) line='if True:\n'
type=53 (OP) string=':' start=(2, 7) end=(2, 8) line='if True:\n'
type=4 (NEWLINE) string='\n' start=(2, 8) end=(2, 9) line='if True:\n'
type=5 (INDENT) string=' ' start=(3, 0) end=(3, 4) line=' pass\n'
type=1 (NAME) string='pass' start=(3, 4) end=(3, 8) line=' pass\n'
type=4 (NEWLINE) string='\n' start=(3, 8) end=(3, 9) line=' pass\n'
type=6 (DEDENT) string='' start=(4, 0) end=(4, 0) line=''
type=0 (ENDMARKER) string='' start=(4, 0) end=(4, 0) line=''
Getting tokens by line of code
Once a source is broken down into tokens, it might be difficult to find particular tokens of interest if we print the entire content. Instead, using get_lines, we can tokenize by line of code and focus on just a few lines of interest.
[5]:
source = """
if True:
if False:
pass
else:
a = 42 # a comment
print('ok')
"""
lines = token_utils.get_lines(source)
for line in lines[4:6]:
    for token in line:
        print(token)
    print()
type=6 (DEDENT) string='' start=(5, 4) end=(5, 4) line=' else:\n'
type=1 (NAME) string='else' start=(5, 4) end=(5, 8) line=' else:\n'
type=53 (OP) string=':' start=(5, 8) end=(5, 9) line=' else:\n'
type=4 (NEWLINE) string='\n' start=(5, 9) end=(5, 10) line=' else:\n'
type=5 (INDENT) string=' ' start=(6, 0) end=(6, 8) line=' a = 42 # a comment\n'
type=1 (NAME) string='a' start=(6, 8) end=(6, 9) line=' a = 42 # a comment\n'
type=53 (OP) string='=' start=(6, 10) end=(6, 11) line=' a = 42 # a comment\n'
type=2 (NUMBER) string='42' start=(6, 12) end=(6, 14) line=' a = 42 # a comment\n'
type=55 (COMMENT) string='# a comment' start=(6, 15) end=(6, 26) line=' a = 42 # a comment\n'
type=4 (NEWLINE) string='\n' start=(6, 26) end=(6, 27) line=' a = 42 # a comment\n'
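The grouping performed by get_lines can be approximated with the standard library alone. The sketch below starts a new group after every NEWLINE or NL token; it is an assumption that token_utils uses a similar rule.

```python
import io
import tokenize

def group_by_line(source):
    """Group tokens into logical lines: close a group after NEWLINE or NL."""
    lines, current = [], []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        current.append(tok)
        if tok.type in (tokenize.NEWLINE, tokenize.NL):
            lines.append(current)
            current = []
    if current:  # trailing DEDENT/ENDMARKER tokens
        lines.append(current)
    return lines

grouped = group_by_line("a = 1\nb = 2\n")
for line in grouped:
    print([tok.string for tok in line])
```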
Getting particular tokens
Let’s focus on the sixth line.
[6]:
line = lines[5]
print( token_utils.untokenize(line) )
a = 42 # a comment
Ignoring the indentation, the first token is a; ignoring newline indicators and comments, the last token is 42. We can get at these tokens using some utility functions.
[7]:
print("The first useful token is:\n ", token_utils.get_first(line))
print("The index of the first token is: ", token_utils.get_first_index(line))
print()
print("The last useful token on that line is:\n ", token_utils.get_last(line))
print("Its index is", token_utils.get_last_index(line))
The first useful token is:
type=1 (NAME) string='a' start=(6, 8) end=(6, 9) line=' a = 42 # a comment\n'
The index of the first token is: 1
The last useful token on that line is:
type=2 (NUMBER) string='42' start=(6, 12) end=(6, 14) line=' a = 42 # a comment\n'
Its index is 3
Note that these four functions, get_first, get_first_index, get_last, and get_last_index, exclude end-of-line comments by default; this can be changed by setting the optional parameter exclude_comment to False.
[8]:
print( token_utils.get_last(line, exclude_comment=False))
type=55 (COMMENT) string='# a comment' start=(6, 15) end=(6, 26) line=' a = 42 # a comment\n'
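A get_last-style lookup can be sketched with the standard library: scan from the end of the line, skipping the non-significant token types. Exactly which types token_utils skips is an assumption here.

```python
import io
import tokenize

def last_useful(tokens, exclude_comment=True):
    """Return the last 'useful' token: skip NEWLINE, NL, ENDMARKER and
    DEDENT tokens, and optionally COMMENT tokens as well."""
    skip = {tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER, tokenize.DEDENT}
    if exclude_comment:
        skip.add(tokenize.COMMENT)
    for tok in reversed(tokens):
        if tok.type not in skip:
            return tok
    return None

toks = list(tokenize.generate_tokens(io.StringIO("a = 42  # a comment\n").readline))
print(last_useful(toks).string)                          # 42
print(last_useful(toks, exclude_comment=False).string)   # # a comment
```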
Getting the indentation of a line
The sixth line starts with an INDENT token. We can get the indentation of that line either from the length of the INDENT token's string, or from the start_col attribute of the first “useful” token. The attribute start_col is part of the two-tuple start = (start_row, start_col).
[9]:
print(len(line[0].string))
first = token_utils.get_first(line)
print(first.start_col)
8
8
In general, the second method is more reliable. For example, if we look at the tokens of the previous line (line 5, index 4), we can see that the first token is a DEDENT whose string is empty: its length tells us nothing about the line's indentation. Furthermore, a given line may start with multiple DEDENT tokens. However, once again, the start_col attribute of the first “useful” token gives us the correct value.
[10]:
for token in lines[4]:
    print(token)
print("-" * 50)
print(token_utils.untokenize(lines[4]))
first = token_utils.get_first(lines[4])
print("indentation = ", first.start_col)
type=6 (DEDENT) string='' start=(5, 4) end=(5, 4) line=' else:\n'
type=1 (NAME) string='else' start=(5, 4) end=(5, 8) line=' else:\n'
type=53 (OP) string=':' start=(5, 8) end=(5, 9) line=' else:\n'
type=4 (NEWLINE) string='\n' start=(5, 9) end=(5, 10) line=' else:\n'
--------------------------------------------------
else:
indentation = 4
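The same idea works with raw stdlib tokens, whose start attribute is the (row, col) tuple; token_utils apparently exposes the column separately as start_col for convenience.

```python
import io
import tokenize

source = "if True:\n    a = 1\n"
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# The first NAME token on row 2 is 'a'; its start column is the indentation.
first_useful = next(t for t in tokens
                    if t.start[0] == 2 and t.type == tokenize.NAME)
print(first_useful.string, first_useful.start[1])  # a 4
```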
Changing information
Suppose we wish to do the following replacement
repeat n: --> for some_variable in range(n):
Here n might be anything that evaluates to an integer. Let’s see a couple of different ways to do this.
First, we simply change the string content of two tokens.
[11]:
source = "repeat 2 * 3 : "
tokens = token_utils.tokenize(source)
repeat = token_utils.get_first(tokens)
colon = token_utils.get_last(tokens)
repeat.string = "for some_variable in range("
colon.string = "):"
print(token_utils.untokenize(tokens))
for some_variable in range( 2 * 3 ):
Let’s revert the change to the colon, to see a different way of doing the same thing.
[12]:
colon.string = ":"
print(token_utils.untokenize(tokens))
for some_variable in range( 2 * 3 :
This time, let’s insert an extra token, written as a simple Python string.
[13]:
colon_index = token_utils.get_last_index(tokens)
tokens.insert(colon_index, ")")
for token in tokens:
    print(token)
type=1 (NAME) string='for some_variable in range(' start=(1, 0) end=(1, 6) line='repeat 2 * 3 : '
type=2 (NUMBER) string='2' start=(1, 7) end=(1, 8) line='repeat 2 * 3 : '
type=53 (OP) string='*' start=(1, 9) end=(1, 10) line='repeat 2 * 3 : '
type=2 (NUMBER) string='3' start=(1, 11) end=(1, 12) line='repeat 2 * 3 : '
)
type=53 (OP) string=':' start=(1, 13) end=(1, 14) line='repeat 2 * 3 : '
type=4 (NEWLINE) string='' start=(1, 15) end=(1, 16) line=''
type=0 (ENDMARKER) string='' start=(2, 0) end=(2, 0) line=''
In spite of ')' being a normal Python string, it can still be processed correctly by the untokenize function.
[14]:
print(token_utils.untokenize(tokens))
for some_variable in range( 2 * 3) :
Thus, unlike Python’s own untokenize function, we do not have to worry about token types when we wish to insert extra tokens.
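A deliberately crude sketch of such a string-tolerant untokenize: each item contributes its string, whether it is a token object or already a plain str. The real token_utils.untokenize also restores the original spacing between tokens, which this sketch ignores; the Tok class is hypothetical.

```python
class Tok:
    """Hypothetical minimal token holder, used only for this illustration."""
    def __init__(self, string):
        self.string = string

def naive_untokenize(items):
    # Plain strings pass through unchanged; token objects contribute .string.
    return "".join(item if isinstance(item, str) else item.string
                   for item in items)

mixed = [Tok("print"), Tok("("), "42", Tok(")")]
print(naive_untokenize(mixed))  # print(42)
```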
Changing indentation
We can easily change the indentation of a given line using either the indent or the dedent function.
[18]:
source = """
if True:
a = 1
b = 2
"""
# First, reducing the indentation of the "b = 2" line
lines = token_utils.get_lines(source)
a_line = lines[2]
a = token_utils.get_first(a_line)
assert a == "a"
b_line = lines[3]
b = token_utils.get_first(b_line)
lines[3] = token_utils.dedent(b_line, b.start_col - a.start_col)
print(token_utils.untokenize(a_line))
print(token_utils.untokenize(lines[3]))
a = 1
b = 2
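On plain text, the dedent operation amounts to stripping up to n leading spaces from a line. Here is a rough text-level sketch; token_utils.dedent works on tokens and adjusts their positions, which this version does not attempt.

```python
def dedent_text_line(line, nspaces):
    """Remove up to nspaces leading spaces from a single line of text."""
    leading = len(line) - len(line.lstrip(" "))
    return line[min(nspaces, leading):]

print(repr(dedent_text_line("        b = 2", 4)))  # '    b = 2'
```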
Alternatively, we can indent the “a = 1” line.
[20]:
lines = token_utils.get_lines(source)
a_line = lines[2]
a = token_utils.get_first(a_line)
assert a == "a"
b_line = lines[3]
b = token_utils.get_first(b_line)
lines[2] = token_utils.indent(a_line, b.start_col - a.start_col)
print(token_utils.untokenize(lines[2]))
print(token_utils.untokenize(b_line))
a = 1
b = 2
Finally, let’s recover the entire source with the fixed indentation.
[21]:
new_tokens = []
for line in lines:
    new_tokens.extend(line)
print(token_utils.untokenize(new_tokens))
if True:
a = 1
b = 2