{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tokenizing notebook" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. note::\n", "\n", " This is from a Jupyter notebook used to demonstrate some usage of the tokenizing utility functions.\n", " The content is in two main sections. First, we demonstrate the\n", " usage of various functions to get information about the source. Once this is\n", " done, we demonstrate how to change the content in a reliable way." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, the all important `import` statement." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from ideas import token_utils" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting information\n", "We start with a very simple example, where we have a repeated token, `a`." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "type=1 (NAME) string='a' start=(1, 0) end=(1, 1) line='a = a'\n", "type=53 (OP) string='=' start=(1, 2) end=(1, 3) line='a = a'\n", "type=1 (NAME) string='a' start=(1, 4) end=(1, 5) line='a = a'\n", "type=4 (NEWLINE) string='' start=(1, 5) end=(1, 6) line=''\n", "type=0 (ENDMARKER) string='' start=(2, 0) end=(2, 0) line=''\n" ] } ], "source": [ "source = \"a = a\"\n", "tokens = token_utils.tokenize(source)\n", "for token in tokens:\n", " print(token)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how the `NEWLINE` token here, in spite of its name, does not correpond to `\\n`.\n", "\n", "### Comparing tokens\n", "\n", "Tokens are considered equals if they have the same `string` attribute. Given this notion of equality, we make things even simpler by allowing to compare a token directly to a string as shown below." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True\n", "True\n", "True\n" ] } ], "source": [ "print(tokens[0] == tokens[2])\n", "print(tokens[0] == tokens[2].string)\n", "print(tokens[0] == 'a') # <-- Our normal choice" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Printing tokens by line of code\n", "If we simply want to tokenize a source and print the result, or simply print a list of tokens, we can use `print_tokens` to do it in a single instruction, with the added benefit of separating tokens from different lines of code." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "type=56 (NL) string='\\n' start=(1, 0) end=(1, 1) line='\\n'\n", "\n", "type=1 (NAME) string='if' start=(2, 0) end=(2, 2) line='if True:\\n'\n", "type=1 (NAME) string='True' start=(2, 3) end=(2, 7) line='if True:\\n'\n", "type=53 (OP) string=':' start=(2, 7) end=(2, 8) line='if True:\\n'\n", "type=4 (NEWLINE) string='\\n' start=(2, 8) end=(2, 9) line='if True:\\n'\n", "\n", "type=5 (INDENT) string=' ' start=(3, 0) end=(3, 4) line=' pass\\n'\n", "type=1 (NAME) string='pass' start=(3, 4) end=(3, 8) line=' pass\\n'\n", "type=4 (NEWLINE) string='\\n' start=(3, 8) end=(3, 9) line=' pass\\n'\n", "\n", "type=6 (DEDENT) string='' start=(4, 0) end=(4, 0) line=''\n", "type=0 (ENDMARKER) string='' start=(4, 0) end=(4, 0) line=''\n", "\n" ] } ], "source": [ "source = \"\"\"\n", "if True:\n", " pass\n", "\"\"\"\n", "token_utils.print_tokens(source)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Getting tokens by line of code\n", "Once a source is broken down into token, it might be difficult to find some particular tokens of interest if we print the entire content. Instead, using `get_lines`, we can tokenize by line of code , and just focus on a few lines of interest." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "type=6 (DEDENT) string='' start=(5, 4) end=(5, 4) line=' else:\\n'\n", "type=1 (NAME) string='else' start=(5, 4) end=(5, 8) line=' else:\\n'\n", "type=53 (OP) string=':' start=(5, 8) end=(5, 9) line=' else:\\n'\n", "type=4 (NEWLINE) string='\\n' start=(5, 9) end=(5, 10) line=' else:\\n'\n", "\n", "type=5 (INDENT) string=' ' start=(6, 0) end=(6, 8) line=' a = 42 # a comment\\n'\n", "type=1 (NAME) string='a' start=(6, 8) end=(6, 9) line=' a = 42 # a comment\\n'\n", "type=53 (OP) string='=' start=(6, 10) end=(6, 11) line=' a = 42 # a comment\\n'\n", "type=2 (NUMBER) string='42' start=(6, 12) end=(6, 14) line=' a = 42 # a comment\\n'\n", "type=55 (COMMENT) string='# a comment' start=(6, 15) end=(6, 26) line=' a = 42 # a comment\\n'\n", "type=4 (NEWLINE) string='\\n' start=(6, 26) end=(6, 27) line=' a = 42 # a comment\\n'\n", "\n" ] } ], "source": [ "source = \"\"\"\n", "if True:\n", " if False:\n", " pass\n", " else:\n", " a = 42 # a comment\n", "print('ok')\n", "\"\"\"\n", "lines = token_utils.get_lines(source)\n", "for line in lines[4:6]:\n", " for token in line:\n", " print(token)\n", " print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Getting particular tokens\n", "Let's focus on the sixth line." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a = 42 # a comment\n", "\n" ] } ], "source": [ "line = lines[5]\n", "print( token_utils.untokenize(line) )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ignoring the indentation, the first token is `a`; ignoring newlines indicator and comments, the last token is `42`. We can get at these tokens using some utility functions." 
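, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick aside: since `get_lines` simply partitions the tokens of the source by line, untokenizing every line and joining the results should reproduce the original source. The sketch below assumes this round trip is exact, which is the intended behaviour of `untokenize`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sanity check: rebuilding each line from its tokens and joining the\n", "# results should give back the original source, character for character.\n", "rebuilt = \"\".join(token_utils.untokenize(line) for line in lines)\n", "print(rebuilt == source)  # should print True" ] }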
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The first useful token is:\n", " type=1 (NAME) string='a' start=(6, 8) end=(6, 9) line=' a = 42 # a comment\\n'\n", "The index of the first token is: 1\n", "\n", "The last useful token on that line is:\n", " type=2 (NUMBER) string='42' start=(6, 12) end=(6, 14) line=' a = 42 # a comment\\n'\n", "Its index is 3\n" ] } ], "source": [ "print(\"The first useful token is:\\n \", token_utils.get_first(line))\n", "print(\"The index of the first token is: \", token_utils.get_first_index(line))\n", "print()\n", "print(\"The last useful token on that line is:\\n \", token_utils.get_last(line))\n", "print(\"Its index is\", token_utils.get_last_index(line))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that these four functions, `get_first`, `get_first_index`, `get_last`, `get_last_index` exclude end of line comments by default; but this can be changed by setting the optional parameter `exclude_comment` to `False`." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "type=55 (COMMENT) string='# a comment' start=(6, 15) end=(6, 26) line=' a = 42 # a comment\\n'\n" ] } ], "source": [ "print( token_utils.get_last(line, exclude_comment=False))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Getting the indentation of a line\n", "The sixth line starts with an `INDENT` token. We can get the indentation of that line, either by printing the length of the `INDENT` token string, or by looking at the `start_col` attribute of the first \"useful\" token. The attribute `start_col` is part of the two-tuple `start = (start_row, start_col)`. " ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "8\n", "8\n" ] } ], "source": [ "print(len(line[0].string))\n", "first = token_utils.get_first(line)\n", "print(first.start_col)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In general, **the second method is more reliable**. For example, if we look at tokens the previous line (line 5, index 4), we can see that the length of the string of the first token, `INDENT`, does not give us the information about the line indentation. Furthermore, a given line may start with multiple `INDENT` tokens. However, once again, the `start_col` attribute of the first \"useful\" token can give us this value." 
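, { "cell_type": "markdown", "metadata": {}, "source": [ "To close this section, here is a small sketch that applies the same idea to every line of our sample source. It assumes that `get_first` returns `None` for a line with no useful token, such as the blank first line:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Report the indentation of each line using the start_col attribute of\n", "# its first useful token; we assume get_first returns None for lines\n", "# without any useful token (e.g. a blank line).\n", "for line_tokens in lines:\n", "    first = token_utils.get_first(line_tokens)\n", "    if first is not None:\n", "        print(\"row\", first.start_row, \"starts at column\", first.start_col)" ] }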
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "type=6 (DEDENT) string='' start=(5, 4) end=(5, 4) line=' else:\\n'\n", "type=1 (NAME) string='else' start=(5, 4) end=(5, 8) line=' else:\\n'\n", "type=53 (OP) string=':' start=(5, 8) end=(5, 9) line=' else:\\n'\n", "type=4 (NEWLINE) string='\\n' start=(5, 9) end=(5, 10) line=' else:\\n'\n", "--------------------------------------------------\n", " else:\n", "\n", "indentation = 4\n" ] } ], "source": [ "for token in lines[4]:\n", " print(token)\n", "print(\"-\" * 50)\n", " \n", "print(token_utils.untokenize(lines[4]))\n", "first = token_utils.get_first(lines[4])\n", "print(\"indentation = \", first.start_col)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Changing information\n", "\n", "Suppose we wish to do the following replacement\n", "\n", "```\n", "repeat n: --> for some_variable in range(n):\n", "```\n", "Here `n` might be anything that evaluates as an integer. Let's see a couple of different ways to do this.\n", "\n", "First, we simply change the string content of two tokens." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "for some_variable in range( 2 * 3 ):\n" ] } ], "source": [ "source = \"repeat 2 * 3 : \"\n", "\n", "tokens = token_utils.tokenize(source)\n", "repeat = token_utils.get_first(tokens)\n", "colon = token_utils.get_last(tokens)\n", "\n", "repeat.string = \"for some_variable in range(\"\n", "colon.string = \"):\"\n", "\n", "print(token_utils.untokenize(tokens))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's revert back the change for the colon, to see a different way of doing the same thing." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "for some_variable in range( 2 * 3 :\n" ] } ], "source": [ "colon.string = \":\"\n", "print(token_utils.untokenize(tokens))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This time, let's **insert** an extra token, written as a simple Python string." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "type=1 (NAME) string='for some_variable in range(' start=(1, 0) end=(1, 6) line='repeat 2 * 3 : '\n", "type=2 (NUMBER) string='2' start=(1, 7) end=(1, 8) line='repeat 2 * 3 : '\n", "type=53 (OP) string='*' start=(1, 9) end=(1, 10) line='repeat 2 * 3 : '\n", "type=2 (NUMBER) string='3' start=(1, 11) end=(1, 12) line='repeat 2 * 3 : '\n", ")\n", "type=53 (OP) string=':' start=(1, 13) end=(1, 14) line='repeat 2 * 3 : '\n", "type=4 (NEWLINE) string='' start=(1, 15) end=(1, 16) line=''\n", "type=0 (ENDMARKER) string='' start=(2, 0) end=(2, 0) line=''\n" ] } ], "source": [ "colon_index = token_utils.get_last_index(tokens)\n", "tokens.insert(colon_index, \")\")\n", "for token in tokens:\n", " print(token)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In spite of `')'` being a normal Python string, it can still be processed correctly by the `untokenize` function." 
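, { "cell_type": "markdown", "metadata": {}, "source": [ "Putting these pieces together, here is a minimal sketch of a function that rewrites a single `repeat <expression>:` line, mirroring the steps above. The name `some_variable` is just a placeholder; real code would need to generate a fresh name that cannot clash with names used elsewhere:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A minimal sketch combining get_first, get_last and untokenize to\n", "# rewrite one \"repeat <expression>:\" line into a for statement.\n", "# some_variable is a placeholder; production code would generate a\n", "# fresh name to avoid clashing with user code.\n", "def convert_repeat(source):\n", "    tokens = token_utils.tokenize(source)\n", "    repeat = token_utils.get_first(tokens)\n", "    if repeat != \"repeat\":\n", "        return source  # nothing to change\n", "    colon = token_utils.get_last(tokens)\n", "    repeat.string = \"for some_variable in range(\"\n", "    colon.string = \"):\"\n", "    return token_utils.untokenize(tokens)\n", "\n", "print(convert_repeat(\"repeat 2 * 3 :\"))" ] }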
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "for some_variable in range( 2 * 3) :\n" ] } ], "source": [ "print(token_utils.untokenize(tokens))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Thus, unlike Python's own untokenize function, we do not have to worry about token types when we wish to insert extra tokens.\n", "\n", "## Changing indentation\n", "\n", "We can easily change the indentation of a given line using either the `indent` or `dedent` function." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a = 1\n", "\n", " b = 2\n", "\n" ] } ], "source": [ "source = \"\"\"\n", "if True:\n", " a = 1\n", " b = 2\n", "\"\"\"\n", "\n", "# First, reducing the indentation of the \"b = 2\" line\n", "\n", "lines = token_utils.get_lines(source)\n", "a_line = lines[2]\n", "a = token_utils.get_first(a_line)\n", "assert a == \"a\"\n", "b_line = lines[3]\n", "b = token_utils.get_first(b_line)\n", "lines[3] = token_utils.dedent(b_line, b.start_col - a.start_col)\n", "\n", "print(token_utils.untokenize(a_line))\n", "print(token_utils.untokenize(lines[3]))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, we can indent the \"a = 1\" line" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a = 1\n", "\n", " b = 2\n", "\n" ] } ], "source": [ "lines = token_utils.get_lines(source)\n", "a_line = lines[2]\n", "a = token_utils.get_first(a_line)\n", "assert a == \"a\"\n", "b_line = lines[3]\n", "b = token_utils.get_first(b_line)\n", "lines[2] = token_utils.indent(a_line, b.start_col - a.start_col)\n", "\n", "print(token_utils.untokenize(lines[2]))\n", "print(token_utils.untokenize(b_line))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, let's recover the entire source with the fixed indentation." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "if True:\n", " a = 1\n", " b = 2\n", "\n" ] } ], "source": [ "new_tokens = []\n", "for line in lines:\n", " new_tokens.extend(line)\n", " \n", "print(token_utils.untokenize(new_tokens))" ] } ], "metadata": { "celltoolbar": "Raw Cell Format", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 4 }