{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tokenizing notebook" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. note::\n", "\n", " This is from a Jupyter notebook used to demonstrate some usage of the tokenizing utility functions.\n", " The content is in two main sections. First, we demonstrate the\n", " usage of various functions to get information about the source. Once this is\n", " done, we demonstrate how to change the content in a reliable way." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, the all important `import` statement." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from ideas import token_utils" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting information\n", "We start with a very simple example, where we have a repeated token, `a`." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "type=1 (NAME) string='a' start=(1, 0) end=(1, 1) line='a = a'\n", "type=53 (OP) string='=' start=(1, 2) end=(1, 3) line='a = a'\n", "type=1 (NAME) string='a' start=(1, 4) end=(1, 5) line='a = a'\n", "type=4 (NEWLINE) string='' start=(1, 5) end=(1, 6) line=''\n", "type=0 (ENDMARKER) string='' start=(2, 0) end=(2, 0) line=''\n" ] } ], "source": [ "source = \"a = a\"\n", "tokens = token_utils.tokenize(source)\n", "for token in tokens:\n", " print(token)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how the `NEWLINE` token here, in spite of its name, does not correpond to `\\n`.\n", "\n", "### Comparing tokens\n", "\n", "Tokens are considered equals if they have the same `string` attribute. Given this notion of equality, we make things even simpler by allowing to compare a token directly to a string as shown below." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True\n", "True\n", "True\n" ] } ], "source": [ "print(tokens[0] == tokens[2])\n", "print(tokens[0] == tokens[2].string)\n", "print(tokens[0] == 'a') # <-- Our normal choice" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Printing tokens by line of code\n", "If we simply want to tokenize a source and print the result, or simply print a list of tokens, we can use `print_tokens` to do it in a single instruction, with the added benefit of separating tokens from different lines of code." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "type=56 (NL) string='\\n' start=(1, 0) end=(1, 1) line='\\n'\n", "\n", "type=1 (NAME) string='if' start=(2, 0) end=(2, 2) line='if True:\\n'\n", "type=1 (NAME) string='True' start=(2, 3) end=(2, 7) line='if True:\\n'\n", "type=53 (OP) string=':' start=(2, 7) end=(2, 8) line='if True:\\n'\n", "type=4 (NEWLINE) string='\\n' start=(2, 8) end=(2, 9) line='if True:\\n'\n", "\n", "type=5 (INDENT) string=' ' start=(3, 0) end=(3, 4) line=' pass\\n'\n", "type=1 (NAME) string='pass' start=(3, 4) end=(3, 8) line=' pass\\n'\n", "type=4 (NEWLINE) string='\\n' start=(3, 8) end=(3, 9) line=' pass\\n'\n", "\n", "type=6 (DEDENT) string='' start=(4, 0) end=(4, 0) line=''\n", "type=0 (ENDMARKER) string='' start=(4, 0) end=(4, 0) line=''\n", "\n" ] } ], "source": [ "source = \"\"\"\n", "if True:\n", " pass\n", "\"\"\"\n", "token_utils.print_tokens(source)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Getting tokens by line of code\n", "Once a source is broken down into token, it might be difficult to find some particular tokens of interest if we print the entire content. Instead, using `get_lines`, we can tokenize by line of code , and just focus on a few lines of interest." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "type=6 (DEDENT) string='' start=(5, 4) end=(5, 4) line=' else:\\n'\n", "type=1 (NAME) string='else' start=(5, 4) end=(5, 8) line=' else:\\n'\n", "type=53 (OP) string=':' start=(5, 8) end=(5, 9) line=' else:\\n'\n", "type=4 (NEWLINE) string='\\n' start=(5, 9) end=(5, 10) line=' else:\\n'\n", "\n", "type=5 (INDENT) string=' ' start=(6, 0) end=(6, 8) line=' a = 42 # a comment\\n'\n", "type=1 (NAME) string='a' start=(6, 8) end=(6, 9) line=' a = 42 # a comment\\n'\n", "type=53 (OP) string='=' start=(6, 10) end=(6, 11) line=' a = 42 # a comment\\n'\n", "type=2 (NUMBER) string='42' start=(6, 12) end=(6, 14) line=' a = 42 # a comment\\n'\n", "type=55 (COMMENT) string='# a comment' start=(6, 15) end=(6, 26) line=' a = 42 # a comment\\n'\n", "type=4 (NEWLINE) string='\\n' start=(6, 26) end=(6, 27) line=' a = 42 # a comment\\n'\n", "\n" ] } ], "source": [ "source = \"\"\"\n", "if True:\n", " if False:\n", " pass\n", " else:\n", " a = 42 # a comment\n", "print('ok')\n", "\"\"\"\n", "lines = token_utils.get_lines(source)\n", "for line in lines[4:6]:\n", " for token in line:\n", " print(token)\n", " print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Getting particular tokens\n", "Let's focus on the sixth line." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a = 42 # a comment\n", "\n" ] } ], "source": [ "line = lines[5]\n", "print( token_utils.untokenize(line) )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ignoring the indentation, the first token is `a`; ignoring newlines indicator and comments, the last token is `42`. We can get at these tokens using some utility functions." 
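, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick aside: since `get_lines` simply partitions the tokens of the source by line, untokenizing every line and joining the results should reproduce the original source. The sketch below assumes this round trip is exact, which is the intended behaviour of `untokenize`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sanity check: rebuilding each line from its tokens and joining the\n", "# results should give back the original source, character for character.\n", "rebuilt = \"\".join(token_utils.untokenize(line) for line in lines)\n", "print(rebuilt == source)  # should print True" ] }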
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The first useful token is:\n", " type=1 (NAME) string='a' start=(6, 8) end=(6, 9) line=' a = 42 # a comment\\n'\n", "The index of the first token is: 1\n", "\n", "The last useful token on that line is:\n", " type=2 (NUMBER) string='42' start=(6, 12) end=(6, 14) line=' a = 42 # a comment\\n'\n", "Its index is 3\n" ] } ], "source": [ "print(\"The first useful token is:\\n \", token_utils.get_first(line))\n", "print(\"The index of the first token is: \", token_utils.get_first_index(line))\n", "print()\n", "print(\"The last useful token on that line is:\\n \", token_utils.get_last(line))\n", "print(\"Its index is\", token_utils.get_last_index(line))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that these four functions, `get_first`, `get_first_index`, `get_last`, `get_last_index` exclude end of line comments by default; but this can be changed by setting the optional parameter `exclude_comment` to `False`." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "type=55 (COMMENT) string='# a comment' start=(6, 15) end=(6, 26) line=' a = 42 # a comment\\n'\n" ] } ], "source": [ "print( token_utils.get_last(line, exclude_comment=False))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Getting the indentation of a line\n", "The sixth line starts with an `INDENT` token. We can get the indentation of that line, either by printing the length of the `INDENT` token string, or by looking at the `start_col` attribute of the first \"useful\" token. The attribute `start_col` is part of the two-tuple `start = (start_row, start_col)`. " ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "8\n", "8\n" ] } ], "source": [ "print(len(line[0].string))\n", "first = token_utils.get_first(line)\n", "print(first.start_col)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In general, **the second method is more reliable**. For example, if we look at tokens the previous line (line 5, index 4), we can see that the length of the string of the first token, `INDENT`, does not give us the information about the line indentation. Furthermore, a given line may start with multiple `INDENT` tokens. However, once again, the `start_col` attribute of the first \"useful\" token can give us this value." 
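, { "cell_type": "markdown", "metadata": {}, "source": [ "To close this section, here is a small sketch that applies the same idea to every line of our sample source. It assumes that `get_first` returns `None` for a line with no useful token, such as the blank first line:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Report the indentation of each line using the start_col attribute of\n", "# its first useful token; we assume get_first returns None for lines\n", "# without any useful token (e.g. a blank line).\n", "for line_tokens in lines:\n", "    first = token_utils.get_first(line_tokens)\n", "    if first is not None:\n", "        print(\"row\", first.start_row, \"starts at column\", first.start_col)" ] }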
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "type=6 (DEDENT) string='' start=(5, 4) end=(5, 4) line=' else:\\n'\n", "type=1 (NAME) string='else' start=(5, 4) end=(5, 8) line=' else:\\n'\n", "type=53 (OP) string=':' start=(5, 8) end=(5, 9) line=' else:\\n'\n", "type=4 (NEWLINE) string='\\n' start=(5, 9) end=(5, 10) line=' else:\\n'\n", "--------------------------------------------------\n", " else:\n", "\n", "indentation = 4\n" ] } ], "source": [ "for token in lines[4]:\n", " print(token)\n", "print(\"-\" * 50)\n", " \n", "print(token_utils.untokenize(lines[4]))\n", "first = token_utils.get_first(lines[4])\n", "print(\"indentation = \", first.start_col)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Changing information\n", "\n", "Suppose we wish to do the following replacement\n", "\n", "```\n", "repeat n: --> for some_variable in range(n):\n", "```\n", "Here `n` might be anything that evaluates as an integer. Let's see a couple of different ways to do this.\n", "\n", "First, we simply change the string content of two tokens." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "for some_variable in range( 2 * 3 ):\n" ] } ], "source": [ "source = \"repeat 2 * 3 : \"\n", "\n", "tokens = token_utils.tokenize(source)\n", "repeat = token_utils.get_first(tokens)\n", "colon = token_utils.get_last(tokens)\n", "\n", "repeat.string = \"for some_variable in range(\"\n", "colon.string = \"):\"\n", "\n", "print(token_utils.untokenize(tokens))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's revert back the change for the colon, to see a different way of doing the same thing." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "for some_variable in range( 2 * 3 :\n" ] } ], "source": [ "colon.string = \":\"\n", "print(token_utils.untokenize(tokens))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This time, let's **insert** an extra token, written as a simple Python string." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "type=1 (NAME) string='for some_variable in range(' start=(1, 0) end=(1, 6) line='repeat 2 * 3 : '\n", "type=2 (NUMBER) string='2' start=(1, 7) end=(1, 8) line='repeat 2 * 3 : '\n", "type=53 (OP) string='*' start=(1, 9) end=(1, 10) line='repeat 2 * 3 : '\n", "type=2 (NUMBER) string='3' start=(1, 11) end=(1, 12) line='repeat 2 * 3 : '\n", ")\n", "type=53 (OP) string=':' start=(1, 13) end=(1, 14) line='repeat 2 * 3 : '\n", "type=4 (NEWLINE) string='' start=(1, 15) end=(1, 16) line=''\n", "type=0 (ENDMARKER) string='' start=(2, 0) end=(2, 0) line=''\n" ] } ], "source": [ "colon_index = token_utils.get_last_index(tokens)\n", "tokens.insert(colon_index, \")\")\n", "for token in tokens:\n", " print(token)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In spite of `')'` being a normal Python string, it can still be processed correctly by the `untokenize` function." 
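, { "cell_type": "markdown", "metadata": {}, "source": [ "Putting these pieces together, here is a minimal sketch of a function that rewrites a single `repeat <expression>:` line, mirroring the steps above. The name `some_variable` is just a placeholder; real code would need to generate a fresh name that cannot clash with names used elsewhere:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A minimal sketch combining get_first, get_last and untokenize to\n", "# rewrite one \"repeat <expression>:\" line into a for statement.\n", "# some_variable is a placeholder; production code would generate a\n", "# fresh name to avoid clashing with user code.\n", "def convert_repeat(source):\n", "    tokens = token_utils.tokenize(source)\n", "    repeat = token_utils.get_first(tokens)\n", "    if repeat != \"repeat\":\n", "        return source  # nothing to change\n", "    colon = token_utils.get_last(tokens)\n", "    repeat.string = \"for some_variable in range(\"\n", "    colon.string = \"):\"\n", "    return token_utils.untokenize(tokens)\n", "\n", "print(convert_repeat(\"repeat 2 * 3 :\"))" ] }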
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "for some_variable in range( 2 * 3) :\n" ] } ], "source": [ "print(token_utils.untokenize(tokens))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Thus, unlike Python's own untokenize function, we do not have to worry about token types when we wish to insert extra tokens.\n", "\n", "## Changing indentation\n", "\n", "We can easily change the indentation of a given line using either the `indent` or `dedent` function." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a = 1\n", "\n", " b = 2\n", "\n" ] } ], "source": [ "source = \"\"\"\n", "if True:\n", " a = 1\n", " b = 2\n", "\"\"\"\n", "\n", "# First, reducing the indentation of the \"b = 2\" line\n", "\n", "lines = token_utils.get_lines(source)\n", "a_line = lines[2]\n", "a = token_utils.get_first(a_line)\n", "assert a == \"a\"\n", "b_line = lines[3]\n", "b = token_utils.get_first(b_line)\n", "lines[3] = token_utils.dedent(b_line, b.start_col - a.start_col)\n", "\n", "print(token_utils.untokenize(a_line))\n", "print(token_utils.untokenize(lines[3]))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, we can indent the \"a = 1\" line" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a = 1\n", "\n", " b = 2\n", "\n" ] } ], "source": [ "lines = token_utils.get_lines(source)\n", "a_line = lines[2]\n", "a = token_utils.get_first(a_line)\n", "assert a == \"a\"\n", "b_line = lines[3]\n", "b = token_utils.get_first(b_line)\n", "lines[2] = token_utils.indent(a_line, b.start_col - a.start_col)\n", "\n", "print(token_utils.untokenize(lines[2]))\n", "print(token_utils.untokenize(b_line))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, let's recover the entire source with the fixed indentation." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "if True:\n", " a = 1\n", " b = 2\n", "\n" ] } ], "source": [ "new_tokens = []\n", "for line in lines:\n", " new_tokens.extend(line)\n", " \n", "print(token_utils.untokenize(new_tokens))" ] } ], "metadata": { "celltoolbar": "Raw Cell Format", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 4 }