Update for TorchText 0.3
- reworded a lot of the explanations
- fixed a lot of typos
- updated some images and moved new ones to the assets folder
- fixed deterministic evaluation with `torch.backends.cudnn.deterministic = True`
- `tensor_type` in Fields becomes `dtype`
- changed train, valid, test TorchText dataset objects to be named `train_data`, `valid_data` and `test_data`
- TorchText iterators now have `repeat = False` by default, so removed it
- TorchText iterators now take a `torch.device` instead of an int representing the GPU number (and -1 for CPU)
- `F.sigmoid` is now `torch.sigmoid`
- padding sentences in the CNN model is now fixed (before, spaCy would tokenize the `<pad>` token into three separate tokens)
This commit is contained in:
parent
710ead1daa
commit
61bf7da1d0
@ -6,13 +6,23 @@
|
||||
"source": [
|
||||
"# 1 - Simple Sentiment Analysis\n",
|
||||
"\n",
|
||||
"In this series we'll be building a *machine learning* model to detect sentiment (i.e. detect if a sentence is positive or negative) using PyTorch and TorchText. This will be done on movie reviews using the IMDb dataset.\n",
|
||||
"In this series we'll be building a machine learning model to detect sentiment (i.e. detect if a sentence is positive or negative) using PyTorch and TorchText. This will be done on movie reviews, using the [IMDb dataset](http://ai.stanford.edu/~amaas/data/sentiment/).\n",
|
||||
"\n",
|
||||
"In this first notebook, we'll start very simple to understand the general concepts whilst not really caring about good results. Further notebooks will build on this knowledge, to actually get good results.\n",
|
||||
"In this first notebook, we'll start very simple to understand the general concepts whilst not really caring about good results. Further notebooks will build on this knowledge and we'll actually get good results.\n",
|
||||
"\n",
|
||||
"We'll be using a **recurrent neural network** (RNN) which reads a sequence of words, and for each word (sometimes called a _step_) will output a _hidden state_. We then use the hidden state for subsequent word in the sentence, until the final word has been fed into the RNN. This final hidden state will then be used to predict the sentiment of the sentence.\n",
|
||||
"### Introduction\n",
|
||||
"\n",
|
||||
"![](https://i.imgur.com/VedY9iG.png)"
|
||||
"We'll be using a **recurrent neural network** (RNN) as they are commonly used in analysing sequences. An RNN takes in a sequence of words, $X=\\{x_1, ..., x_T\\}$, one at a time, and produces a _hidden state_, $h$, for each word. We use the RNN _recurrently_ by feeding in the current word $x_t$ as well as the hidden state from the previous word, $h_{t-1}$, to produce the next hidden state, $h_t$. \n",
|
||||
"\n",
|
||||
"$$h_t = \\text{RNN}(x_t, h_{t-1})$$\n",
|
||||
"\n",
|
||||
"Once we have our final hidden state, $h_T$, (from feeding in the last word in the sequence, $x_T$) we feed it through a linear layer, $f$, (also known as a fully connected layer), to receive our predicted sentiment, $\\hat{y} = f(h_T)$.\n",
|
||||
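To make the recurrence and the final prediction concrete, here is a minimal, self-contained sketch (not the model built later in this notebook) that unrolls an `nn.RNNCell` over a toy sequence and feeds the final hidden state through a linear layer; all dimensions are made up for illustration.

```python
import torch
import torch.nn as nn

emb_dim, hid_dim = 4, 6                  # toy dimensions
rnn_cell = nn.RNNCell(emb_dim, hid_dim)  # one step of the RNN
fc = nn.Linear(hid_dim, 1)               # the linear layer f

x = torch.randn(3, emb_dim)              # a "sentence" of T=3 word vectors
h = torch.zeros(1, hid_dim)              # h_0 is a tensor of all zeros

for t in range(x.shape[0]):              # h_t = RNN(x_t, h_{t-1})
    h = rnn_cell(x[t].unsqueeze(0), h)

y_hat = fc(h)                            # y_hat = f(h_T)
print(y_hat.shape)                       # torch.Size([1, 1])
```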
"\n",
|
||||
"Below shows an example sentence, with the RNN predicting zero, which indicates a negative sentiment. The RNN is shown in orange and the linear layer shown in silver. Note that we use the same RNN for every word, i.e. it has the same parameters. The initial hidden state, $h_0$, is a tensor initialized to all zeros. \n",
|
||||
"\n",
|
||||
"![](assets/sentiment1.png)\n",
|
||||
"\n",
|
||||
"**Note:** some layers and steps have been omitted from the diagram, but these will be explained later."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -21,15 +31,15 @@
|
||||
"source": [
|
||||
"## Preparing Data\n",
|
||||
"\n",
|
||||
"One of the main concepts of TorchText is the `Field`. These define how your data should be processed. In our sentiment classification task we have to sources of data, the raw string of the review and the sentiment, either \"pos\" or \"neg\".\n",
|
||||
"\n",
|
||||
"We use the `TEXT` field to handle the review and the `LABEL` field to handle the sentiment. \n",
|
||||
"One of the main concepts of TorchText is the `Field`. These define how your data should be processed. In our sentiment classification task the data consists of both the raw string of the review and the sentiment, either \"pos\" or \"neg\".\n",
|
||||
"\n",
|
||||
"The parameters of a `Field` specify how the data should be processed. \n",
|
||||
"\n",
|
||||
"Our `TEXT` field has `tokenize='spacy'`, which defines that the \"tokenization\" (the act of splitting the string into discrete \"tokens\") should be done using the [spaCy](https://spacy.io) tokenizer. If no `tokenize` argument is passed, the default is simply splitting the string on spaces.\n",
|
||||
"We use the `TEXT` field to define how the review should be processed, and the `LABEL` field to process the sentiment. \n",
|
||||
"\n",
|
||||
"`LABEL` is defined by a `LabelField`, a special subset of the `Field` class specifically for handling labels. We will explain the `tensor_type` argument later.\n",
|
||||
"Our `TEXT` field has `tokenize='spacy'` as an argument. This defines that the \"tokenization\" (the act of splitting the string into discrete \"tokens\") should be done using the [spaCy](https://spacy.io) tokenizer. If no `tokenize` argument is passed, the default is simply splitting the string on spaces.\n",
|
||||
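To see what difference the tokenizer makes, the sketch below compares spaCy tokenization with a plain whitespace split. It assumes spaCy and its English model are installed; the model name to load depends on your spaCy version.

```python
import spacy

nlp = spacy.load('en')  # on newer spaCy versions this may be 'en_core_web_sm'

sentence = "Don't panic; it's only a movie!"
print([token.text for token in nlp(sentence)])  # punctuation and contractions are split off
print(sentence.split())                         # whitespace split keeps them attached
```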
"\n",
|
||||
"`LABEL` is defined by a `LabelField`, a special subset of the `Field` class specifically used for handling labels. We will explain the `dtype` argument later.\n",
|
||||
"\n",
|
||||
"For more on `Fields`, go [here](https://github.com/pytorch/text/blob/master/torchtext/data/field.py).\n",
|
||||
"\n",
|
||||
@ -49,18 +59,19 @@
|
||||
"\n",
|
||||
"torch.manual_seed(SEED)\n",
|
||||
"torch.cuda.manual_seed(SEED)\n",
|
||||
"torch.backends.cudnn.deterministic = True\n",
|
||||
"\n",
|
||||
"TEXT = data.Field(tokenize='spacy')\n",
|
||||
"LABEL = data.LabelField(tensor_type=torch.FloatTensor)"
|
||||
"LABEL = data.LabelField(dtype=torch.float)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Another handy feature of TorchText is that it has support for common datasets used in NLP. \n",
|
||||
"Another handy feature of TorchText is that it has support for common datasets used in natural language processing (NLP). \n",
|
||||
"\n",
|
||||
"The following code automatically downloads the IMDb dataset and splits it into the canonical train/test splits as `torchtext.datasets` objects. It uses the `Fields` we have previously defined. "
|
||||
"The following code automatically downloads the IMDb dataset and splits it into the canonical train/test splits as `torchtext.datasets` objects. It processes the data using the `Fields` we have previously defined. "
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -71,7 +82,7 @@
|
||||
"source": [
|
||||
"from torchtext import datasets\n",
|
||||
"\n",
|
||||
"train, test = datasets.IMDB.splits(TEXT, LABEL)"
|
||||
"train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -90,38 +101,14 @@
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"len(train): 25000\n",
|
||||
"len(test): 25000\n"
|
||||
"Number of training examples: 25000\n",
|
||||
"Number of testing examples: 25000\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print('len(train):', len(train))\n",
|
||||
"print('len(test):', len(test))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can check the fields of the data, hoping that they match the `Fields` given earlier."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"train.fields: {'text': <torchtext.data.field.Field object at 0x7fda01652240>, 'label': <torchtext.data.field.LabelField object at 0x7fda01652518>}\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print('train.fields:', train.fields)"
|
||||
"print(f'Number of training examples: {len(train_data)}')\n",
|
||||
"print(f'Number of testing examples: {len(test_data)}')"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -133,19 +120,19 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"vars(train[0]): {'text': ['elvira', 'mistress', 'of', 'the', 'dark', 'is', 'one', 'of', 'my', 'fav', 'movies', ',', 'it', 'has', 'every', 'thing', 'you', 'would', 'want', 'in', 'a', 'film', ',', 'like', 'great', 'one', 'liners', ',', 'sexy', 'star', 'and', 'a', 'Outrageous', 'story', '!', 'if', 'you', 'have', 'not', 'seen', 'it', ',', 'you', 'are', 'missing', 'out', 'on', 'one', 'of', 'the', 'greatest', 'films', 'made', '.', 'i', 'ca', \"n't\", 'wait', 'till', 'her', 'new', 'movie', 'comes', 'out', '!'], 'label': 'pos'}\n"
|
||||
"{'text': ['elvira', 'mistress', 'of', 'the', 'dark', 'is', 'one', 'of', 'my', 'fav', 'movies', ',', 'it', 'has', 'every', 'thing', 'you', 'would', 'want', 'in', 'a', 'film', ',', 'like', 'great', 'one', 'liners', ',', 'sexy', 'star', 'and', 'a', 'Outrageous', 'story', '!', 'if', 'you', 'have', 'not', 'seen', 'it', ',', 'you', 'are', 'missing', 'out', 'on', 'one', 'of', 'the', 'greatest', 'films', 'made', '.', 'i', 'ca', \"n't\", 'wait', 'till', 'her', 'new', 'movie', 'comes', 'out', '!'], 'label': 'pos'}\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print('vars(train[0]):', vars(train[0]))"
|
||||
"print(vars(train_data.examples[0]))"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -161,13 +148,13 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import random\n",
|
||||
"\n",
|
||||
"train, valid = train.split(random_state=random.seed(SEED))"
|
||||
"train_data, valid_data = train_data.split(random_state=random.seed(SEED))"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -179,50 +166,52 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"len(train): 17500\n",
|
||||
"len(valid): 7500\n",
|
||||
"len(test): 25000\n"
|
||||
"Number of training examples: 17500\n",
|
||||
"Number of validation examples: 7500\n",
|
||||
"Number of testing examples: 25000\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print('len(train):', len(train))\n",
|
||||
"print('len(valid):', len(valid))\n",
|
||||
"print('len(test):', len(test))"
|
||||
"print(f'Number of training examples: {len(train_data)}')\n",
|
||||
"print(f'Number of validation examples: {len(valid_data)}')\n",
|
||||
"print(f'Number of testing examples: {len(test_data)}')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Next, we have to build a _vocabulary_. This is effectively a look up table where every unique word in your _dictionary_ (every word that occurs in all of your examples) has a corresponding _index_ (an integer).\n",
|
||||
"Next, we have to build a _vocabulary_. This is effectively a look up table where every unique word in your data set has a corresponding _index_ (an integer).\n",
|
||||
"\n",
|
||||
"We do this as our machine learning model cannot operate on strings, only numbers. Each _index_ is used to construct a _one-hot_ vector for each word. A one-hot vector is a vector where all of the elements are 0, except one, which is 1, and dimensionality is the total number of unique words in your vocabulary, commonly denoted by $V$.\n",
|
||||
"\n",
|
||||
"![](https://i.imgur.com/0o5Gdar.png)\n",
|
||||
"\n",
|
||||
"We do this as our machine learning model cannot operate on strings, only numbers. Each _index_ is used to construct a _one-hot_ vector for each word. A one-hot vector is a vector where all of the elements are 0, except one, which is 1, and dimensionality is the total number of unique words in your vocabulary.\n",
|
||||
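A tiny, self-contained illustration of a one-hot vector (the vocabulary and index here are made up):

```python
import torch

vocab_size = 5   # pretend vocabulary: ['<unk>', '<pad>', 'film', 'great', 'bad']
index = 2        # index of 'film'

one_hot = torch.zeros(vocab_size)
one_hot[index] = 1
print(one_hot)   # tensor([0., 0., 1., 0., 0.])
```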
"The number of unique words in our training set is over 100,000, which means that our one-hot vectors will have over 100,000 dimensions! This will make training slow and possibly won't fit onto your GPU (if you're using one). \n",
|
||||
"\n",
|
||||
"The number of unique words in our training set is over 100,000, which means that our one-hot vectors will be 100,000 dimensions! This will make training slow and possibly won't fit onto your GPU (if you're using one). \n",
|
||||
"There are two ways to effectively cut down our vocabulary: we can either only take the top $n$ most common words or ignore words that appear less than $m$ times. We'll do the former, only keeping the top 25,000 words.\n",
|
||||
"\n",
|
||||
"There are two ways to effectively cut down our vocabulary: we can either only take the top $n$ most common words or ignore words that appear less than $n$ times. We'll do the former, only keeping the top 25,000 words.\n",
|
||||
"What do we do with words that appear in examples but we have cut from the vocabulary? We replace them with a special _unknown_ or `<unk>` token. For example, if the sentence was \"This film is great and I love it\" but the word \"love\" was not in the vocabulary, it would become \"This film is great and I `<unk>` it\".\n",
|
||||
"\n",
|
||||
"What do we do with words that appear in examples but we have cut from the vocabulary? We replace them with a special _unknown_ or _unk_ token. For example, if the sentence was \"This film is great and I love it\" but the word \"love\" was not in the vocabulary, it would become \"This film is great and I unk it\"."
|
||||
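Concretely, once the vocabulary has been built (as in the cell below), every out-of-vocabulary word maps to the same `<unk>` index via the `stoi` (string-to-index) table. The example words here are only for illustration.

```python
# any word not kept in the vocabulary maps to the <unk> index (0 by default)
print(TEXT.vocab.stoi['film'])        # a common word: some index below 25002
print(TEXT.vocab.stoi['qwertyuiop'])  # an out-of-vocabulary word: the <unk> index
```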
"The following builds the vocabulary, only keeping the most common `max_size` tokens."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"execution_count": 7,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"TEXT.build_vocab(train, max_size=25000)\n",
|
||||
"LABEL.build_vocab(train)"
|
||||
"TEXT.build_vocab(train_data, max_size=25000)\n",
|
||||
"LABEL.build_vocab(train_data)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -234,39 +223,39 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"execution_count": 8,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"len(TEXT.vocab): 25002\n",
|
||||
"len(LABEL.vocab): 2\n"
|
||||
"Unique tokens in TEXT vocabulary: 25002\n",
|
||||
"Unique tokens in LABEL vocabulary: 2\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print('len(TEXT.vocab):', len(TEXT.vocab))\n",
|
||||
"print('len(LABEL.vocab):', len(LABEL.vocab))"
|
||||
"print(f\"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}\")\n",
|
||||
"print(f\"Unique tokens in LABEL vocabulary: {len(LABEL.vocab)}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Why is the vocab size 25002 and not 25000? One of the addition tokens is the _unk_ token and the other is a _pad_ token.\n",
|
||||
"Why is the vocab size 25002 and not 25000? One of the additional tokens is the `<unk>` token and the other is a `<pad>` token.\n",
|
||||
"\n",
|
||||
"When we feed sentences into our model, we feed a _batch_ of them at a time, i.e. more than one at a time, and all sentences in the batch need to be the same size. Thus, to ensure each sentence in the batch is the same size, any shorter than the longest within the batch are padded.\n",
|
||||
"\n",
|
||||
"![](https://i.imgur.com/TZRJAX4.png)\n",
|
||||
"\n",
|
||||
"When we feed sentences into our model, we feed a _batch_ of them at a time, i.e. more than one at a time, and all sentences in the batch need to be the same size. Thus, to ensure each sentence in the batch is the same size, any shorter than the largest within the batch are padded.\n",
|
||||
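The padding token is stored on the field and in the vocabulary; a quick check (run after `build_vocab`; the exact index may differ) might look like:

```python
print(TEXT.pad_token)                   # '<pad>'
print(TEXT.vocab.stoi[TEXT.pad_token])  # its index in the vocabulary, usually 1
```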
"\n",
|
||||
"We can also view the most common words in the vocabulary. "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"execution_count": 9,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
@ -290,7 +279,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"execution_count": 10,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
@ -314,14 +303,14 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 13,
|
||||
"execution_count": 11,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"defaultdict(<function _default_unk_index at 0x7fd9a7d341e0>, {'neg': 0, 'pos': 1})\n"
|
||||
"defaultdict(<function _default_unk_index at 0x7f836744c1e0>, {'neg': 0, 'pos': 1})\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
@ -333,24 +322,27 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The final step of preparing the data is creating the iterators.\n",
|
||||
"The final step of preparing the data is creating the iterators. We iterate over these in the training/evaluation loop, and they return a batch of examples (indexed and converted into tensors) at each iteration.\n",
|
||||
"\n",
|
||||
"`BucketIterator` first sorts the examples using the `sort_key`, here we use the length of the sentences, and then partitions them into _buckets_. When the iterator is called it returns a batch of examples from the same bucket. This will return a batch of examples where each example is a similar length, minimizing the amount of padding."
|
||||
"We'll use a `BucketIterator` which is a special type of iterator that will return a batch of examples where each example is of a similar length, minimizing the amount of padding per example.\n",
|
||||
"\n",
|
||||
"We also want to place the tensors returned by the iterator on the GPU (if you're using one). PyTorch handles this using `torch.device`, we then pass this device to the iterator."
|
||||
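As a quick sanity check of what these iterators yield (a sketch; the sentence length varies from batch to batch):

```python
batch = next(iter(train_iterator))
print(batch.text.shape)   # [sentence length, batch size], e.g. torch.Size([T, 64])
print(batch.label.shape)  # [batch size], e.g. torch.Size([64])
```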
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"execution_count": 12,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"BATCH_SIZE = 64\n",
|
||||
"\n",
|
||||
"device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n",
|
||||
"\n",
|
||||
"train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(\n",
|
||||
" (train, valid, test), \n",
|
||||
" batch_size=BATCH_SIZE, \n",
|
||||
" sort_key=lambda x: len(x.text), \n",
|
||||
" repeat=False)"
|
||||
" (train_data, valid_data, test_data), \n",
|
||||
" batch_size=BATCH_SIZE,\n",
|
||||
" device=device)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -363,13 +355,13 @@
|
||||
"\n",
|
||||
"There is a small amount of boilerplate code when creating models in PyTorch, note how our `RNN` class is a sub-class of `nn.Module` and the use of `super`.\n",
|
||||
"\n",
|
||||
"Within the `__init__` we define the _layers_ of the module. Our three layers are an _embedding_ layer, our RNN, and a _linear_ layer. \n",
|
||||
"Within the `__init__` we define the _layers_ of the module. Our three layers are an _embedding_ layer, our RNN, and a _linear_ layer. All layers have their parameters initialized to random values, unless explicitly specified.\n",
|
||||
"\n",
|
||||
"The embedding layer is used to transform our sparse one-hot vector (sparse as most of the elements are 0) into a dense embedding vector (dense as the dimensionality is a lot smaller). This embedding layer is simply a single fully connected layer. The theory is that words that have similar impact on the sentiment are mapped close together in this dense vector space. For more information about word embeddings, see [here](https://monkeylearn.com/blog/word-embeddings-transform-text-numbers/).\n",
|
||||
"The embedding layer is used to transform our sparse one-hot vector (sparse as most of the elements are 0) into a dense embedding vector (dense as the dimensionality is a lot smaller and all the elements are real numbers). This embedding layer is simply a single fully connected layer. As well as reducing the dimensionality of the input to the RNN, there is the theory that words which have similar impact on the sentiment of the review are mapped close together in this dense vector space. For more information about word embeddings, see [here](https://monkeylearn.com/blog/word-embeddings-transform-text-numbers/).\n",
|
||||
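A minimal standalone example of what an embedding layer does, with toy sizes and randomly initialized weights:

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=10, embedding_dim=3)  # 10 words, 3-dim vectors
tokens = torch.LongTensor([1, 4, 7])                          # three word indices
print(embedding(tokens).shape)                                # torch.Size([3, 3])
```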
"\n",
|
||||
"The RNN layer is our RNN which takes in our dense vector and the previous hidden state $h_{t-1}$, which it uses to calculate the next hidden state, $h_t$.\n",
|
||||
"\n",
|
||||
"Finally, the linear layer takes the final hidden state and feeds it through a fully connected layer, transforming it to the correct output dimension.\n",
|
||||
"Finally, the linear layer takes the final hidden state and feeds it through a fully connected layer, $f(h_T)$, transforming it to the correct output dimension.\n",
|
||||
"\n",
|
||||
"![](https://i.imgur.com/GIov3zF.png)\n",
|
||||
"\n",
|
||||
@ -377,20 +369,20 @@
|
||||
"\n",
|
||||
"Each batch, `x`, is a tensor of size _**[sentence length, batch size]**_. That is a batch of sentences, each having each word converted into a one-hot vector. \n",
|
||||
"\n",
|
||||
"You may notice that this tensor should have another dimension due to the one-hot vectors, however PyTorch conveniently stores a one-hot vector as its index value.\n",
|
||||
"You may notice that this tensor should have another dimension due to the one-hot vectors, however PyTorch conveniently stores a one-hot vector as its index value, i.e. the tensor representing a sentence is just a tensor of the indexes for each token in that sentence.\n",
|
||||
"\n",
|
||||
"The input batch is then passed through the embedding layer to get `embedded`, where now each one-hot vector is converted to a dense vector. `embedded` is a tensor of size _**[sentence length, batch size, embedding dim]**_.\n",
|
||||
"The input batch is then passed through the embedding layer to get `embedded`, which gives us a dense vector representation of our sentences. `embedded` is a tensor of size _**[sentence length, batch size, embedding dim]**_.\n",
|
||||
"\n",
|
||||
"`embedded` is then fed into the RNN. In some frameworks you must feed the initial hidden state, $h_0$, into the RNN, however in PyTorch, if no initial hidden state is passed as an argument it defaults to a tensor of all zeros.\n",
|
||||
"\n",
|
||||
"The RNN returns 2 tensors, `output` of size _**[sentence length, batch size, hidden dim]**_ and `hidden` of size _**[1, batch size, hidden dim]**_. `output` is the concatenation of the hidden state from every time step, whereas `hidden` is simply the final hidden state. We verify this using the `assert` statement. Note the `squeeze` method, which is used to remove a dimension of size 1. \n",
|
||||
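The shapes and the `assert` described here can be reproduced with a toy `nn.RNN`; the dimensions below are arbitrary.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16)  # toy embedding dim 8, hidden dim 16
x = torch.randn(5, 2, 8)                    # [sentence length, batch size, embedding dim]

output, hidden = rnn(x)                     # h_0 defaults to a tensor of zeros
print(output.shape)                         # torch.Size([5, 2, 16])
print(hidden.shape)                         # torch.Size([1, 2, 16])

# the final hidden state equals the last time step of output
assert torch.equal(output[-1, :, :], hidden.squeeze(0))
```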
"\n",
|
||||
"Finally, we feed the last hidden state, `hidden`, through the linear layer to produce a prediction."
|
||||
"Finally, we feed the last hidden state, `hidden`, through the linear layer, `fc`, to produce a prediction."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 15,
|
||||
"execution_count": 13,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
@ -430,16 +422,16 @@
|
||||
"\n",
|
||||
"The input dimension is the dimension of the one-hot vectors, which is equal to the vocabulary size. \n",
|
||||
"\n",
|
||||
"The embedding dimension is the size of the dense word vectors, this is usually around the square root of the vocab size. \n",
|
||||
"The embedding dimension is the size of the dense word vectors. This is usually around 50-250 dimensions, but depends on the size of the vocabulary.\n",
|
||||
"\n",
|
||||
"The hidden dimension is the size of the hidden states, this is usually around 100-500 dimensions, but depends on the vocab size, embedding dimension and the complexity of the task.\n",
|
||||
"The hidden dimension is the size of the hidden states. This is usually around 100-500 dimensions, but also depends on factors such as the vocabulary size, the size of the dense vectors and the complexity of the task.\n",
|
||||
"\n",
|
||||
"The output dimension is usually the number of classes, however in the case of only 2 classes the output value is between 0 and 1 and thus can be 1-dimensional, i.e. a single scalar."
|
||||
"The output dimension is usually the number of classes, however in the case of only 2 classes the output value is between 0 and 1 and thus can be 1-dimensional, i.e. a single scalar real number."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 16,
|
||||
"execution_count": 14,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
@ -464,12 +456,12 @@
|
||||
"source": [
|
||||
"Now we'll set up the training and then train the model.\n",
|
||||
"\n",
|
||||
"First, we'll create an optimizer. This is the algorithm we use to update the parameters of the module. Here, we'll use _stochastic gradient descent_ (SGD). The first argument is the parameters that will be updated by the optimizer, and the second is the learning rate, i.e. how much we'll change the parameters by when we do an update."
|
||||
"First, we'll create an optimizer. This is the algorithm we use to update the parameters of the module. Here, we'll use _stochastic gradient descent_ (SGD). The first argument is the parameters that will be updated by the optimizer, and the second is the learning rate, i.e. how much we'll change the parameters by when we do a parameter update."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 17,
|
||||
"execution_count": 15,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
@ -486,14 +478,16 @@
|
||||
"\n",
|
||||
"The loss function here is _binary cross entropy with logits_. \n",
|
||||
"\n",
|
||||
"The prediction for each sentence is an unbound real number, as our labels are either 0 or 1, we want to restrict the number between 0 and 1, we do this using the _sigmoid function_, see [here](https://en.wikipedia.org/wiki/Sigmoid_function). \n",
|
||||
"The prediction for each sentence is an unbound real number. As our labels are either 0 or 1, we want to restrict the predictions to be between 0 and 1; we do this using the _sigmoid_ function. \n",
|
||||
"\n",
|
||||
"We then calculate this bound scalar using binary cross entropy, see [here](https://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/). "
|
||||
"We then calculate this bound scalar using binary cross entropy. \n",
|
||||
"\n",
|
||||
"The `BCEWithLogitsLoss` criterion carries out both the sigmoid and the binary cross entropy steps."
|
||||
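The equivalence between `BCEWithLogitsLoss` and applying a sigmoid followed by `BCELoss` can be checked on made-up numbers:

```python
import torch
import torch.nn as nn

logits = torch.tensor([1.5, -0.3, 0.2])  # unbound predictions
labels = torch.tensor([1.0, 0.0, 1.0])

combined = nn.BCEWithLogitsLoss()(logits, labels)
manual = nn.BCELoss()(torch.sigmoid(logits), labels)
print(combined, manual)                  # the two values match (up to float error)
```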
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 18,
|
||||
"execution_count": 16,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
@ -504,19 +498,15 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"PyTorch has excellent support for NVIDIA GPUs via CUDA. `torch.cuda.is_available()` returns `True` if PyTorch detects a GPU.\n",
|
||||
"\n",
|
||||
"Using `.to`, we can place the model and the criterion on the GPU. "
|
||||
"Using `.to`, we can place the model and the criterion on the GPU (if we have one). "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 19,
|
||||
"execution_count": 17,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n",
|
||||
"\n",
|
||||
"model = model.to(device)\n",
|
||||
"criterion = criterion.to(device)"
|
||||
]
|
||||
@ -527,26 +517,24 @@
|
||||
"source": [
|
||||
"Our criterion function calculates the loss, however we have to write our function to calculate the accuracy. \n",
|
||||
"\n",
|
||||
"This function first feeds the predictions through a sigmoid layer, squashing the values between 0 and 1, we then round them to the nearest integer. This rounds any value greater than 0.5 to 1 (a positive sentiment). \n",
|
||||
"This function first feeds the predictions through a sigmoid layer, squashing the values between 0 and 1. We then round them to the nearest integer. This rounds any value greater than 0.5 to 1 (a positive sentiment) and the rest to 0 (a negative sentiment).\n",
|
||||
"\n",
|
||||
"We then calculate how many rounded predictions equal the actual labels and average it across the batch."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 20,
|
||||
"execution_count": 18,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import torch.nn.functional as F\n",
|
||||
"\n",
|
||||
"def binary_accuracy(preds, y):\n",
|
||||
" \"\"\"\n",
|
||||
" Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8\n",
|
||||
" \"\"\"\n",
|
||||
"\n",
|
||||
" #round predictions to the closest integer\n",
|
||||
" rounded_preds = torch.round(F.sigmoid(preds))\n",
|
||||
" rounded_preds = torch.round(torch.sigmoid(preds))\n",
|
||||
" correct = (rounded_preds == y).float() #convert into float for division \n",
|
||||
" acc = correct.sum()/len(correct)\n",
|
||||
" return acc"
|
||||
@ -556,13 +544,13 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The `train` function iterates over all examples, a batch at a time. \n",
|
||||
"The `train` function iterates over all examples, one batch at a time. \n",
|
||||
"\n",
|
||||
"`model.train()` is used to put the model in \"training mode\", which turns on _dropout_ and _batch normalization_. Although we aren't using them in this model, it's good practice to include it.\n",
|
||||
"\n",
|
||||
"For each batch, we first zero the gradients. Each parameter in a model has a `grad` attribute which stores the gradient calculated by the `criterion`. PyTorch does not automatically remove (or zero) the gradients calculated from the last gradient calculation so they must be manually cleared.\n",
|
||||
"For each batch, we first zero the gradients. Each parameter in a model has a `grad` attribute which stores the gradient calculated by the `criterion`. PyTorch does not automatically remove (or \"zero\") the gradients calculated from the last gradient calculation, so they must be manually zeroed.\n",
|
||||
"\n",
|
||||
"We then feed the batch of sentences, `batch.text`, into the model. Note, you do not need to do `model.forward(batch.text)`, simply calling the model works. The `squeeze` is needed as the predictions are initially size _**[batch size, 1]**_, and we need to remove the dimension of size 1.\n",
|
||||
"We then feed the batch of sentences, `batch.text`, into the model. Note, you do not need to do `model.forward(batch.text)`, simply calling the model works. The `squeeze` is needed as the predictions are initially size _**[batch size, 1]**_, and we need to remove the dimension of size 1 as PyTorch expects the predictions input to a loss function to simply be of size _**[batch size]**_.\n",
|
||||
"\n",
|
||||
"The loss and accuracy are then calculated using our predictions and the labels, `batch.label`. \n",
|
||||
"\n",
|
||||
@ -570,16 +558,14 @@
|
||||
"\n",
|
||||
"The loss and accuracy is accumulated across the epoch, the `.item()` method is used to extract a scalar from a tensor which only contains a single value.\n",
|
||||
"\n",
|
||||
"Finally, we return the loss and accuracy, averaged across the epoch. The len of an iterator is the number of batches in the iterator.\n",
|
||||
"Finally, we return the loss and accuracy, averaged across the epoch. The `len` of an iterator is the number of batches in the iterator.\n",
|
||||
"\n",
|
||||
"You may recall when initializing the `LABEL` field, we set `tensor_type=torch.FloatTensor`. This is because TorchText sets tensors to be `LongTensor`s by default, however our criterion expects both inputs to be `FloatTensor`s. As we have manually set the `tensor_type` to be `FloatTensor`s, this conversion is done for us.\n",
|
||||
"\n",
|
||||
"Another method would be to do the conversion inside the `train` function by passing `batch.label.float()` instad of `batch.label` to the criterion. "
|
||||
"You may recall when initializing the `LABEL` field, we set `dtype=torch.float`. This is because TorchText sets tensors to be `LongTensor`s by default, however our criterion expects both inputs to be `FloatTensor`s. As we have manually set the `dtype` to be `torch.float`, this is automatically done for us. The alternative method of doing this would be to do the conversion inside the `train` function by passing `batch.label.float()` instead of `batch.label` to the criterion. "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 21,
|
||||
"execution_count": 19,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
@ -593,7 +579,7 @@
|
||||
" for batch in iterator:\n",
|
||||
" \n",
|
||||
" optimizer.zero_grad()\n",
|
||||
" \n",
|
||||
" \n",
|
||||
" predictions = model(batch.text).squeeze(1)\n",
|
||||
" \n",
|
||||
" loss = criterion(predictions, batch.label)\n",
|
||||
@ -616,16 +602,16 @@
|
||||
"source": [
|
||||
"`evaluate` is similar to `train`, with a few modifications as you don't want to update the parameters when evaluating.\n",
|
||||
"\n",
|
||||
"`model.eval()` puts the model in \"evaluation mode\", this turns off _dropout_ and _batch normalization_. Again, we are not using them in this model, but it is good practice to include it.\n",
|
||||
"`model.eval()` puts the model in \"evaluation mode\", this turns off _dropout_ and _batch normalization_. Again, we are not using them in this model, but it is good practice to include them.\n",
|
||||
"\n",
|
||||
"Inside the `no_grad()`, no gradients are calculated which speeds up computation.\n",
|
||||
"No gradients are calculated for PyTorch operations inside the `with torch.no_grad():` block. This causes less memory to be used and speeds up computation.\n",
|
||||
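A small standalone demonstration of the effect of `torch.no_grad()`:

```python
import torch

x = torch.ones(3, requires_grad=True)

with torch.no_grad():
    y = x * 2            # computed without building a computation graph

print(y.requires_grad)   # False: no gradients will flow through y
```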
"\n",
|
||||
"The rest of the function is the same as `train`, with the removal of `optimizer.zero_grad()`, `loss.backward()`, `optimizer.step()`."
|
||||
"The rest of the function is the same as `train`, with the removal of `optimizer.zero_grad()`, `loss.backward()` and `optimizer.step()`, as we do not update the model's parameters when evaluating."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 22,
|
||||
"execution_count": 20,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
@ -661,26 +647,18 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 23,
|
||||
"execution_count": 21,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"/home/ben/.conda/envs/pytorch04/lib/python3.6/site-packages/torchtext/data/field.py:322: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.\n",
|
||||
" return Variable(arr, volatile=not train)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Epoch: 01, Train Loss: 0.694, Train Acc: 50.11%, Val. Loss: 0.696, Val. Acc: 50.28%\n",
|
||||
"Epoch: 02, Train Loss: 0.693, Train Acc: 49.83%, Val. Loss: 0.697, Val. Acc: 51.00%\n",
|
||||
"Epoch: 03, Train Loss: 0.693, Train Acc: 50.23%, Val. Loss: 0.697, Val. Acc: 50.83%\n",
|
||||
"Epoch: 04, Train Loss: 0.693, Train Acc: 49.94%, Val. Loss: 0.696, Val. Acc: 49.97%\n",
|
||||
"Epoch: 05, Train Loss: 0.693, Train Acc: 50.07%, Val. Loss: 0.697, Val. Acc: 51.04%\n"
|
||||
"| Epoch: 01 | Train Loss: 0.694 | Train Acc: 50.11% | Val. Loss: 0.696 | Val. Acc: 50.28% |\n",
|
||||
"| Epoch: 02 | Train Loss: 0.693 | Train Acc: 49.83% | Val. Loss: 0.697 | Val. Acc: 51.00% |\n",
|
||||
"| Epoch: 03 | Train Loss: 0.693 | Train Acc: 50.23% | Val. Loss: 0.697 | Val. Acc: 50.83% |\n",
|
||||
"| Epoch: 04 | Train Loss: 0.693 | Train Acc: 49.94% | Val. Loss: 0.696 | Val. Acc: 49.97% |\n",
|
||||
"| Epoch: 05 | Train Loss: 0.693 | Train Acc: 50.07% | Val. Loss: 0.697 | Val. Acc: 51.04% |\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
@ -692,7 +670,7 @@
|
||||
" train_loss, train_acc = train(model, train_iterator, optimizer, criterion)\n",
|
||||
" valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)\n",
|
||||
" \n",
|
||||
" print(f'Epoch: {epoch+1:02}, Train Loss: {train_loss:.3f}, Train Acc: {train_acc*100:.2f}%, Val. Loss: {valid_loss:.3f}, Val. Acc: {valid_acc*100:.2f}%')"
|
||||
" print(f'| Epoch: {epoch+1:02} | Train Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}% | Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:.2f}% |')"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -701,34 +679,26 @@
|
||||
"source": [
|
||||
"You may have noticed the loss is not really decreasing and the accuracy is poor. This is due to several issues with the model which we'll improve in the next notebook.\n",
|
||||
"\n",
|
||||
"Finally, the metric you actually care about, the test loss and accuracy."
|
||||
"Finally, the metric we actually care about, the test loss and accuracy."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 24,
|
||||
"execution_count": 22,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"/home/ben/.conda/envs/pytorch04/lib/python3.6/site-packages/torchtext/data/field.py:322: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.\n",
|
||||
" return Variable(arr, volatile=not train)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Test Loss: 0.710, Test Acc: 47.84%\n"
|
||||
"| Test Loss: 0.710 | Test Acc: 47.84% |\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"test_loss, test_acc = evaluate(model, test_iterator, criterion)\n",
|
||||
"\n",
|
||||
"print(f'Test Loss: {test_loss:.3f}, Test Acc: {test_acc*100:.2f}%')"
|
||||
"print(f'| Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}% |')"
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -43,22 +43,27 @@
|
||||
"\n",
|
||||
"torch.manual_seed(SEED)\n",
|
||||
"torch.cuda.manual_seed(SEED)\n",
|
||||
"torch.backends.cudnn.deterministic = True\n",
|
||||
"\n",
|
||||
"TEXT = data.Field(tokenize='spacy')\n",
|
||||
"LABEL = data.LabelField(tensor_type=torch.FloatTensor)\n",
|
||||
"LABEL = data.LabelField(dtype=torch.float)\n",
|
||||
"\n",
|
||||
"train, test = datasets.IMDB.splits(TEXT, LABEL)\n",
|
||||
"train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)\n",
|
||||
"\n",
|
||||
"train, valid = train.split(random_state=random.seed(SEED))"
|
||||
"train_data, valid_data = train_data.split(random_state=random.seed(SEED))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The first update, is the addition of pre-trained word embeddings. These vectors have been trained on corpuses of billions of tokens. Now, instead of having our word embeddings initialized randomly, they are initialized with these pre-trained vectors, where words that appear in similar contexts appear nearby in this vector space.\n",
|
||||
"The first addition is the use of pre-trained word embeddings. Now, instead of having our word embeddings initialized randomly, they are initialized with these pre-trained vectors.\n",
|
||||
"\n",
|
||||
"The first step to using these is to specify the vectors and download them, which is passed as an argument to `build_vocab`. The `glove` is the algorithm used to calculate the vectors, go [here](https://nlp.stanford.edu/projects/glove/) for more. `6B` indicates these vectors were trained on 6 billion tokens. `100d` indicates these vectors are 100-dimensional.\n",
|
||||
"We get these vectors simply by specifying which vectors we want, and passing it as an argument to `build_vocab`. Here, we'll be using the `\"glove.6B.100d\"` vectors. `glove` is the algorithm used to calculate the vectors, go [here](https://nlp.stanford.edu/projects/glove/) for more. `6B` indicates these vectors were trained on 6 billion tokens and `100d` indicates these vectors are 100-dimensional.\n",
|
||||
"\n",
|
||||
"You can see the other available vectors [here](https://github.com/pytorch/text/blob/master/torchtext/vocab.py#L113).\n",
|
||||
"\n",
|
||||
"The theory is that these pre-trained vectors already have words with similar semantic meaning close together in vector space, e.g. \"terrible\", \"awful\", \"dreadful\" are nearby. This gives our embedding layer a good initialization as it does not have to learn these relations from scratch.\n",
|
||||
"\n",
|
||||
"**Note**: these vectors are about 862MB, so watch out if you have a limited internet connection."
|
||||
]
|
||||
@ -69,15 +74,15 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"TEXT.build_vocab(train, max_size=25000, vectors=\"glove.6B.100d\")\n",
|
||||
"LABEL.build_vocab(train)"
|
||||
"TEXT.build_vocab(train_data, max_size=25000, vectors=\"glove.6B.100d\")\n",
|
||||
"LABEL.build_vocab(train_data)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"As before, we create the iterators."
|
||||
"As before, we create the iterators, placing the tensors on the GPU if one is available."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -88,11 +93,12 @@
|
||||
"source": [
|
||||
"BATCH_SIZE = 64\n",
|
||||
"\n",
|
||||
"device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n",
|
||||
"\n",
|
||||
"train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(\n",
|
||||
" (train, valid, test), \n",
|
||||
" (train_data, valid_data, test_data), \n",
|
||||
" batch_size=BATCH_SIZE, \n",
|
||||
" sort_key=lambda x: len(x.text), \n",
|
||||
" repeat=False)"
|
||||
" device=device)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -105,33 +111,47 @@
|
||||
"\n",
|
||||
"### Different RNN Architecture\n",
|
||||
"\n",
|
||||
"We use a different RNN architecture called a Long Short-Term Memory (LSTM). Why is an LSTM better than a standard RNN? The hidden state can be thought of as a \"memory\" of the words seen by the model. It is difficult to train a standard RNN as the gradient decays exponentially along the sequence, causing the RNN to \"forget\" what has happened earlier in the sequence. LSTMs have an extra recurrent state called a _cell_, which can be thought of as the \"memory\" of the LSTM and can remember information for many time steps. LSTMs also use multiple _gates_, these control the flow of information into and out of the memory. For more information, go [here](https://colah.github.io/posts/2015-08-Understanding-LSTMs/).\n",
|
||||
"We'll be using a different RNN architecture called a Long Short-Term Memory (LSTM). Why is an LSTM better than a standard RNN? Standard RNNs suffer from the [vanishing gradient problem](https://en.wikipedia.org/wiki/Vanishing_gradient_problem). LSTMs overcome this by having an extra recurrent state called a _cell_, $c$ - which can be thought of as the \"memory\" of the LSTM - and the use of multiple _gates_ which control the flow of information into and out of the memory. For more information, go [here](https://colah.github.io/posts/2015-08-Understanding-LSTMs/). We can simply think of the LSTM as a function of $x_t$, $h_{t-1}$ and $c_{t-1}$, instead of just $x_t$ and $h_{t-1}$.\n",
|
||||
"\n",
|
||||
"$$(h_t, c_t) = \\text{LSTM}(x_t, h_{t-1}, c_{t-1})$$\n",
|
||||
"\n",
|
||||
"Thus, the model using an LSTM looks something like:\n",
|
||||
"\n",
|
||||
"![](assets/sentiment2.png)\n",
|
||||
"\n",
|
||||
"The initial cell state, $c_0$, like the initial hidden state is initialized to a tensor of all zeros. The sentiment prediction is still, however, only made using the final hidden state, not the final cell state, i.e. $\\hat{y}=f(h_T)$.\n",
|
||||
"\n",
|
||||
"### Bidirectional RNN\n",
|
||||
"\n",
|
||||
"The concept behind a bidirectional RNN is simple. As well as having an RNN processing the words in the sentence from the first to the last, we have a second RNN processing the words in the sentence from the **last to the first**. PyTorch simplifies this by concatenating both the forward and backward RNNs together, and thus the returned final hidden state, `hidden`, is the concatenation of the hidden state from the last word of the sentence from the forward RNN with the hidden state of the first word of the sentence from the backward RNN, both of which are the final hidden states from their respective RNNs.\n",
|
||||
"The concept behind a bidirectional RNN is simple. As well as having an RNN processing the words in the sentence from the first to the last (a forward RNN), we have a second RNN processing the words in the sentence from the **last to the first** (a backward RNN). In PyTorch, the hidden state (and cell state) tensors returned by the forward and backward RNNs are stacked on top of each other. We make our sentiment prediction using the last hidden state from the forward RNN (obtained from final word of the sentence), $h_T^\\rightarrow$, and the last hidden state from the backward RNN (obtained from the first word of the sentence), $h_T^\\leftarrow$, i.e. $\\hat{y}=f(h_T^\\rightarrow, h_T^\\leftarrow)$ \n",
|
||||
"\n",
|
||||
"![](https://i.imgur.com/itmIIgx.png)\n",
|
||||
"The image below shows a bi-directional RNN, with the forward RNN in orange, the backward RNN in green and the linear layer in silver. \n",
|
||||
"\n",
|
||||
"![](assets/sentiment3.png)\n",
|
||||
"\n",
|
||||
"### Multi-layer RNN\n",
|
||||
"\n",
|
||||
"Multi-layer RNNs (also called *deep RNNs*) are another simple concept. The idea is that we add additional RNNs on top of the initial standard RNN, where each RNN added is another *layer*. The hidden state output by the first (bottom) RNN at time-step $t$ will be the input to the RNN above it at time step $t$. The prediction is then usually made from the final hidden state of the final (highest) layer. These are easily combined with bi-directional RNNs, where each extra layer adds an additional forward and backward RNN. \n",
|
||||
"Multi-layer RNNs (also called *deep RNNs*) are another simple concept. The idea is that we add additional RNNs on top of the initial standard RNN, where each RNN added is another *layer*. The hidden state output by the first (bottom) RNN at time-step $t$ will be the input to the RNN above it at time step $t$. The prediction is then made from the final hidden state of the final (highest) layer.\n",
|
||||
"\n",
|
||||
"![](https://i.imgur.com/knsIzeh.png)\n",
|
||||
"The image below shows a multi-layer unidirectional RNN, where the layer number is given as a superscript. Also note that each layer needs its own initial hidden state, $h_0^L$.\n",
|
||||
"\n",
|
||||
"![](assets/sentiment4.png)\n",
|
||||
"\n",
|
||||
"### Regularization\n",
|
||||
"\n",
|
||||
"Although we've added improvements to our model, each one adds additional parameters. Without going into overfitting into to much detail, the more parameters you have in in your model, the higher the probability that you'll overfit (have a low train error but high validation/test error). To combat this, we use regularization. More specifically, we use a method of regularization called *dropout*. Dropout works by randomly *dropping out* (setting to 0) neurons during a forward pass. The probability that each neuron is dropped out is set by a hyperparameter and each neuron with dropout applied is considered indepenently. One theory about why dropout works is that a model with parameters dropped out can be seen as a \"weaker\" (less parameters) model, the predictions from all these \"weaker\" models (one for each forward pass) get averaged together in the parameters of the model. Thus, your one model can be thought of as an ensemble of weaker models, none of which are over-parameterized and thus should not overfit.\n",
|
||||
"Although we've added improvements to our model, each one adds additional parameters. Without going into too much detail about overfitting, the more parameters you have in your model, the higher the probability that your model will overfit (memorize the training data, causing a low training error but high validation/testing error, i.e. poor generalization to new, unseen examples). To combat this, we use regularization. More specifically, we use a method of regularization called *dropout*. Dropout works by randomly *dropping out* (setting to 0) neurons in a layer during a forward pass. The probability that each neuron is dropped out is set by a hyperparameter and each neuron with dropout applied is considered independently. One theory about why dropout works is that a model with parameters dropped out can be seen as a \"weaker\" (fewer parameters) model. The predictions from all these \"weaker\" models (one for each forward pass) get averaged together within the parameters of the model. Thus, your one model can be thought of as an ensemble of weaker models, none of which are over-parameterized and thus should not overfit.\n",
|
||||
"\n",
|
||||
"### Implementation Details\n",
|
||||
"\n",
|
||||
"To use an LSTM instead of the standard RNN, we use `nn.LSTM` instead of `nn.RNN` on line 8. Also note on line 20 the LSTM returns the `output` and a tuple of the final `hidden` state and the final `cell` state, whereas the standard RNN only returned the `output` and final `hidden` state. \n",
|
||||
"To use an LSTM instead of the standard RNN, we use `nn.LSTM` instead of `nn.RNN`. Also, note that the LSTM returns the `output` and a tuple of the final `hidden` state and the final `cell` state, whereas the standard RNN only returned the `output` and final `hidden` state. \n",
|
||||
"\n",
|
||||
"As the final hidden state of our LSTM has both a forward and a backward component, which are concatenated together, the size of the input to the `nn.Linear` layer is twice that of the hidden dimension size.\n",
|
||||
"As the final hidden state of our LSTM has both a forward and a backward component, which will be concatenated together, the size of the input to the `nn.Linear` layer is twice that of the hidden dimension size.\n",
|
||||
"\n",
|
||||
"Implementing bidirectionality and adding additional layers are done by passing values for the `num_layers` and `bidirectional` arguments for the RNN/LSTM. \n",
|
||||
"\n",
|
||||
"Dropout is implemented by initializing an `nn.Dropout` layer (the argument is the probability of dropout for each neuron) and using it within the `forward` method after each layer we want to apply dropout to. **Note**: never use dropout on the input or output layers (`x` or `fc` in this case), you only ever want to use dropout on intermediate layers. The LSTM has a `dropout` argument which adds dropout on the connections between hidden states in one layer to hidden states in the next layer. "
|
||||
"Dropout is implemented by initializing an `nn.Dropout` layer (the argument is the probability of dropping out each neuron) and using it within the `forward` method after each layer we want to apply dropout to. **Note**: never use dropout on the input or output layers (`x` or `fc` in this case), you only ever want to use dropout on intermediate layers. The LSTM has a `dropout` argument which adds dropout on the connections between hidden states in one layer to hidden states in the next layer. \n",
|
||||
"\n",
|
||||
"The final hidden state, `hidden`, has a shape of _**[num layers * num directions, batch size, hid dim]**_. These are ordered: **[forward_layer_0, backward_layer_0, forward_layer_1, backward_layer_1, ..., forward_layer_n, backward_layer_n]**. As we want the final (top) layer forward and backward hidden states, we get the top two hidden layers from the first dimension, `hidden[-2,:,:]` and `hidden[-1,:,:]`, and concatenate them together before passing them to the linear layer (after applying dropout). "
|
||||
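The layout of `hidden` and the concatenation of the top layer's two directions can be checked with a toy bidirectional, 2-layer LSTM (dimensions are arbitrary):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, num_layers=2, bidirectional=True)
x = torch.randn(5, 3, 8)  # [sent len, batch size, emb dim]

output, (hidden, cell) = lstm(x)
print(hidden.shape)       # torch.Size([4, 3, 16]) = [num layers * num directions, batch size, hid dim]

# final-layer forward and backward hidden states, concatenated
top = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
print(top.shape)          # torch.Size([3, 32])
```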
]
|
||||
},
|
||||
{
|
||||
@ -162,12 +182,15 @@
|
||||
" output, (hidden, cell) = self.rnn(embedded)\n",
|
||||
" \n",
|
||||
" #output = [sent len, batch size, hid dim * num directions]\n",
|
||||
" #hidden = [num layers * num directions, batch size, hid. dim]\n",
|
||||
" #cell = [num layers * num directions, batch size, hid. dim]\n",
|
||||
" #hidden = [num layers * num directions, batch size, hid dim]\n",
|
||||
" #cell = [num layers * num directions, batch size, hid dim]\n",
|
||||
" \n",
|
||||
" #concat the final forward (hidden[-2,:,:]) and backward (hidden[-1,:,:]) hidden layers\n",
|
||||
" #and apply dropout\n",
|
||||
" \n",
|
||||
" hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1))\n",
|
||||
" \n",
|
||||
" #hidden [batch size, hid. dim * num directions]\n",
|
||||
" #hidden = [batch size, hid dim * num directions]\n",
|
||||
" \n",
|
||||
" return self.fc(hidden.squeeze(0))"
|
||||
]
|
||||
@ -204,7 +227,7 @@
|
||||
"source": [
|
||||
"The final addition is copying the pre-trained word embeddings we loaded earlier into the `embedding` layer of our model.\n",
|
||||
"\n",
|
||||
"We retrieve the embeddings from the field's vocab, and ensure they're the correct size, _**[vocab size, embedding dim]**_ "
|
||||
"We retrieve the embeddings from the field's vocab, and check they're the correct size, _**[vocab size, embedding dim]**_ "
|
||||
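A sketch of that copy step, assuming the model stores its embedding layer as `model.embedding` (as in the class defined above):

```python
pretrained_embeddings = TEXT.vocab.vectors
print(pretrained_embeddings.shape)                         # [vocab size, embedding dim]

model.embedding.weight.data.copy_(pretrained_embeddings)   # overwrite the random initialization
```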
]
|
||||
},
|
||||
{
|
||||
@ -272,9 +295,9 @@
|
||||
"source": [
|
||||
"Now to training the model.\n",
|
||||
"\n",
|
||||
"The only change we'll make here is changing the optimizer from `SGD` to `Adam`. SGD updates all parameters with the same learning rate and choosing this learning rate can be tricky. Adam adapts the learning rate for each parameter, giving parameters that are updated more frequently lower learning rates and parameters that are updated infrequently higher learning rates. More information about Adam (and other optimizers) can be found [here](http://ruder.io/optimizing-gradient-descent/index.html).\n",
|
||||
"The only change we'll make here is changing the optimizer from `SGD` to `Adam`. SGD updates all parameters with the same learning rate and choosing this learning rate can be tricky. `Adam` adapts the learning rate for each parameter, giving parameters that are updated more frequently lower learning rates and parameters that are updated infrequently higher learning rates. More information about `Adam` (and other optimizers) can be found [here](http://ruder.io/optimizing-gradient-descent/index.html).\n",
|
||||
"\n",
|
||||
"To change `SGD` to `Adam`, we simply change `optim.SGD` to `optim.Adam`, also note how we do not have to provide an initial learning rate for Adam as PyTorch specifies a sensibile initial learning rate."
|
||||
"To change `SGD` to `Adam`, we simply change `optim.SGD` to `optim.Adam`, also note how we do not have to provide an initial learning rate for Adam as PyTorch specifies a sensible default initial learning rate."
|
||||
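In code, the change is a single line; Adam's default learning rate is used when none is given:

```python
import torch.optim as optim

optimizer = optim.Adam(model.parameters())  # no explicit learning rate needed
```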
]
|
||||
},
|
||||
{
|
||||
@ -305,8 +328,6 @@
|
||||
"source": [
|
||||
"criterion = nn.BCEWithLogitsLoss()\n",
|
||||
"\n",
|
||||
"device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n",
|
||||
"\n",
|
||||
"model = model.to(device)\n",
|
||||
"criterion = criterion.to(device)"
|
||||
]
|
||||
@ -324,15 +345,13 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import torch.nn.functional as F\n",
|
||||
"\n",
|
||||
"def binary_accuracy(preds, y):\n",
|
||||
" \"\"\"\n",
|
||||
" Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8\n",
|
||||
" \"\"\"\n",
|
||||
"\n",
|
||||
" #round predictions to the closest integer\n",
|
||||
" rounded_preds = torch.round(F.sigmoid(preds))\n",
|
||||
" rounded_preds = torch.round(torch.sigmoid(preds))\n",
|
||||
" correct = (rounded_preds == y).float() #convert into float for division \n",
|
||||
" acc = correct.sum()/len(correct)\n",
|
||||
" return acc"
|
||||
@ -430,23 +449,15 @@
|
||||
"execution_count": 13,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"/home/ben/.conda/envs/pytorch04/lib/python3.6/site-packages/torchtext/data/field.py:322: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.\n",
|
||||
" return Variable(arr, volatile=not train)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Epoch: 01, Train Loss: 0.678, Train Acc: 57.77%, Val. Loss: 0.630, Val. Acc: 66.30%\n",
|
||||
"Epoch: 02, Train Loss: 0.639, Train Acc: 64.19%, Val. Loss: 0.591, Val. Acc: 69.22%\n",
|
||||
"Epoch: 03, Train Loss: 0.512, Train Acc: 75.75%, Val. Loss: 0.503, Val. Acc: 75.02%\n",
|
||||
"Epoch: 04, Train Loss: 0.399, Train Acc: 83.24%, Val. Loss: 0.404, Val. Acc: 83.40%\n",
|
||||
"Epoch: 05, Train Loss: 0.302, Train Acc: 87.86%, Val. Loss: 0.330, Val. Acc: 87.43%\n"
|
||||
"| Epoch: 01 | Train Loss: 0.667 | Train Acc: 59.63% | Val. Loss: 0.594 | Val. Acc: 68.39% |\n",
|
||||
"| Epoch: 02 | Train Loss: 0.610 | Train Acc: 65.96% | Val. Loss: 0.523 | Val. Acc: 74.06% |\n",
|
||||
"| Epoch: 03 | Train Loss: 0.390 | Train Acc: 83.55% | Val. Loss: 0.318 | Val. Acc: 87.27% |\n",
|
||||
"| Epoch: 04 | Train Loss: 0.251 | Train Acc: 90.43% | Val. Loss: 0.370 | Val. Acc: 85.56% |\n",
|
||||
"| Epoch: 05 | Train Loss: 0.183 | Train Acc: 93.18% | Val. Loss: 0.308 | Val. Acc: 88.28% |\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
@ -458,7 +469,7 @@
|
||||
" train_loss, train_acc = train(model, train_iterator, optimizer, criterion)\n",
|
||||
" valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)\n",
|
||||
" \n",
|
||||
" print(f'Epoch: {epoch+1:02}, Train Loss: {train_loss:.3f}, Train Acc: {train_acc*100:.2f}%, Val. Loss: {valid_loss:.3f}, Val. Acc: {valid_acc*100:.2f}%')"
|
||||
" print(f'| Epoch: {epoch+1:02} | Train Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}% | Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:.2f}% |')"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -473,26 +484,18 @@
|
||||
"execution_count": 14,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"/home/ben/.conda/envs/pytorch04/lib/python3.6/site-packages/torchtext/data/field.py:322: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.\n",
|
||||
" return Variable(arr, volatile=not train)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Test Loss: 0.384, Test Acc: 85.34%\n"
|
||||
"| Test Loss: 0.374 | Test Acc: 85.35% |\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"test_loss, test_acc = evaluate(model, test_iterator, criterion)\n",
|
||||
"\n",
|
||||
"print(f'Test Loss: {test_loss:.3f}, Test Acc: {test_acc*100:.2f}%')"
|
||||
"print(f'| Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}% |')"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -528,7 +531,7 @@
|
||||
" indexed = [TEXT.vocab.stoi[t] for t in tokenized]\n",
|
||||
" tensor = torch.LongTensor(indexed).to(device)\n",
|
||||
" tensor = tensor.unsqueeze(1)\n",
|
||||
" prediction = F.sigmoid(model(tensor))\n",
|
||||
" prediction = torch.sigmoid(model(tensor))\n",
|
||||
" return prediction.item()"
|
||||
]
|
||||
},
|
||||
@ -547,7 +550,7 @@
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"0.014928288757801056"
|
||||
"0.00401143217459321"
|
||||
]
|
||||
},
|
||||
"execution_count": 16,
|
||||
@ -574,7 +577,7 @@
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"0.88564133644104"
|
||||
"0.9425999522209167"
|
||||
]
|
||||
},
|
||||
"execution_count": 17,
|
||||
|
@ -6,10 +6,7 @@
|
||||
"source": [
|
||||
"# 3 - Faster Sentiment Analysis\n",
|
||||
"\n",
|
||||
"In the previous notebook, we managed to achieve a decent test accuracy of ~85% using all of the common techniques used for sentiment analysis. In this notebook, we'll implement a model that achieves comparable results a lot faster. More specifically, we'll be implementing the \"FastText\" model from the paper [Bag of Tricks for Efficient Text Classification](https://arxiv.org/abs/1607.01759).\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"This will allow us to achieve the same ~85% test accuracy as the last model, but much faster."
|
||||
"In the previous notebook we managed to achieve a decent test accuracy of ~85% using all of the common techniques used for sentiment analysis. In this notebook, we'll implement a model that gets comparable results whilst training significantly faster. More specifically, we'll be implementing the \"FastText\" model from the paper [Bag of Tricks for Efficient Text Classification](https://arxiv.org/abs/1607.01759)."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -69,7 +66,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"TorchText `Field`s have a `preprocessing` argument. A function passed here will be applied to a sentence after it has been tokenized (transformed from a string into a list of tokens), but before it has been indexed (transformed from a token to an integer). Here, we pass our `generate_bigrams` function."
|
||||
"TorchText `Field`s have a `preprocessing` argument. A function passed here will be applied to a sentence after it has been tokenized (transformed from a string into a list of tokens), but before it has been indexed (transformed from a list of tokens to a list of indexes). This is where we'll pass our `generate_bigrams` function."
|
||||
]
|
||||
},
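For reference, `generate_bigrams` (defined in a code cell above this one) looks roughly like the sketch below: it collects every pair of adjacent tokens and appends each one, joined by a space, to the token list.

```python
def generate_bigrams(x):
    # set of adjacent token pairs, e.g. ('This', 'film'), ('film', 'is'), ...
    n_grams = set(zip(*[x[i:] for i in range(2)]))
    for n_gram in n_grams:
        # append each bi-gram as a single 'word_1 word_2' token
        x.append(' '.join(n_gram))
    return x

generate_bigrams(['This', 'film', 'is', 'terrible'])
# -> ['This', 'film', 'is', 'terrible', 'This film', 'film is', 'is terrible'] (bi-gram order may vary, as a set is used)
```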
|
||||
{
|
||||
@ -86,9 +83,10 @@
|
||||
"\n",
|
||||
"torch.manual_seed(SEED)\n",
|
||||
"torch.cuda.manual_seed(SEED)\n",
|
||||
"torch.backends.cudnn.deterministic = True\n",
|
||||
"\n",
|
||||
"TEXT = data.Field(tokenize='spacy', preprocessing=generate_bigrams)\n",
|
||||
"LABEL = data.LabelField(tensor_type=torch.FloatTensor)"
|
||||
"LABEL = data.LabelField(dtype=torch.float)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -106,9 +104,9 @@
|
||||
"source": [
|
||||
"import random\n",
|
||||
"\n",
|
||||
"train, test = datasets.IMDB.splits(TEXT, LABEL)\n",
|
||||
"train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)\n",
|
||||
"\n",
|
||||
"train, valid = train.split(random_state=random.seed(SEED))"
|
||||
"train_data, valid_data = train_data.split(random_state=random.seed(SEED))"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -124,8 +122,8 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"TEXT.build_vocab(train, max_size=25000, vectors=\"glove.6B.100d\")\n",
|
||||
"LABEL.build_vocab(train)"
|
||||
"TEXT.build_vocab(train_data, max_size=25000, vectors=\"glove.6B.100d\")\n",
|
||||
"LABEL.build_vocab(train_data)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -143,11 +141,12 @@
|
||||
"source": [
|
||||
"BATCH_SIZE = 64\n",
|
||||
"\n",
|
||||
"device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n",
|
||||
"\n",
|
||||
"train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(\n",
|
||||
" (train, valid, test), \n",
|
||||
" (train_data, valid_data, test_data), \n",
|
||||
" batch_size=BATCH_SIZE, \n",
|
||||
" sort_key=lambda x: len(x.text), \n",
|
||||
" repeat=False)"
|
||||
" device=device)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -158,19 +157,19 @@
|
||||
"\n",
|
||||
"This model has far fewer parameters than the previous model as it only has 2 layers that have any parameters, the embedding layer and the linear layer. There is no RNN component in sight!\n",
|
||||
"\n",
|
||||
"Instead, it first calculates the word embedding for each word using the `Embedding` layer, then calculates the average of all of the word embeddings and feeds this through the `Linear` layer, and that's it!\n",
|
||||
"Instead, it first calculates the word embedding for each word using the `Embedding` layer (purple), then calculates the average of all of the word embeddings and feeds this through the `Linear` layer (silver), and that's it!\n",
|
||||
"\n",
|
||||
"![](https://i.imgur.com/e0sWZoZ.png)\n",
|
||||
"\n",
|
||||
"We implement the averaging with the `avg_pool2d` (average pool 2-dimensions) function. Initially, you may think using a 2-dimensional pooling seems strange, surely our sentences are 1-dimensional, not 2-dimensional? However, you can think of the word embeddings as a 2-dimensional grid, where the ones are along one axis and the dimensions of the word embeddings are along another. In the image below is an example sentence after being converted into 5-dimensional word embeddings, with the words along the vertical axis and the embeddings along the horizontal axis.\n",
|
||||
"We implement the averaging with the `avg_pool2d` (average pool 2-dimensions) function. Initially, you may think using a 2-dimensional pooling seems strange, surely our sentences are 1-dimensional, not 2-dimensional? However, you can think of the word embeddings as a 2-dimensional grid, where the words are along one axis and the dimensions of the word embeddings are along the other. The image below is an example sentence after being converted into 5-dimensional word embeddings, with the words along the vertical axis and the embeddings along the horizontal axis. Each element in this [4x5] tensor is represented by a green block.\n",
|
||||
"\n",
|
||||
"![](https://i.imgur.com/SSH25NT.png)\n",
|
||||
"\n",
|
||||
"The `avg_pool2d` passes a filter of size `embedded.shape[1]` (i.e. the length of the sentence) by 1. This is shown in pink in the image below.\n",
|
||||
"The `avg_pool2d` uses a filter of size `embedded.shape[1]` (i.e. the length of the sentence) by 1. This is shown in pink in the image below.\n",
|
||||
"\n",
|
||||
"![](https://i.imgur.com/U7eRnIe.png)\n",
|
||||
"\n",
|
||||
"The average value of all of the dimensions is calculated and concatenated into a 5-dimensional (in our pictoral examples, 100-dimensional in the code) tensor for each sentence. This tensor is then passed through the linear layer to produce our prediction."
|
||||
"The filter then slides to the right, calculating the average of the next column of embedding values for each word in the sentence. After the filter has covered all embedding dimensions, we get a [1x5] tensor. This tensor is then passed through the linear layer to produce our prediction."
|
||||
]
|
||||
},
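A minimal sketch of that pooling step, assuming `embedded` is shaped **[batch size, sentence length, embedding dim]** as it is inside this model's `forward` method:

```python
import torch
import torch.nn.functional as F

embedded = torch.randn(64, 4, 5)  # [batch size, sentence length, embedding dim]

# a [sentence length x 1] filter averages one embedding dimension at a time
pooled = F.avg_pool2d(embedded, (embedded.shape[1], 1)).squeeze(1)

print(pooled.shape)  # torch.Size([64, 5]) -> [batch size, embedding dim]
```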
|
||||
{
|
||||
@ -180,6 +179,7 @@
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import torch.nn as nn\n",
|
||||
"import torch.nn.functional as F\n",
|
||||
"\n",
|
||||
"class FastText(nn.Module):\n",
|
||||
" def __init__(self, vocab_size, embedding_dim, output_dim):\n",
|
||||
@ -304,8 +304,6 @@
|
||||
"source": [
|
||||
"criterion = nn.BCEWithLogitsLoss()\n",
|
||||
"\n",
|
||||
"device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n",
|
||||
"\n",
|
||||
"model = model.to(device)\n",
|
||||
"criterion = criterion.to(device)"
|
||||
]
|
||||
@ -323,15 +321,13 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import torch.nn.functional as F\n",
|
||||
"\n",
|
||||
"def binary_accuracy(preds, y):\n",
|
||||
" \"\"\"\n",
|
||||
" Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8\n",
|
||||
" \"\"\"\n",
|
||||
"\n",
|
||||
" #round predictions to the closest integer\n",
|
||||
" rounded_preds = torch.round(F.sigmoid(preds))\n",
|
||||
" rounded_preds = torch.round(torch.sigmoid(preds))\n",
|
||||
" correct = (rounded_preds == y).float() #convert into float for division \n",
|
||||
" acc = correct.sum()/len(correct)\n",
|
||||
" return acc"
|
||||
@ -429,23 +425,15 @@
|
||||
"execution_count": 15,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"/home/ben/.conda/envs/pytorch04/lib/python3.6/site-packages/torchtext/data/field.py:322: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.\n",
|
||||
" return Variable(arr, volatile=not train)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Epoch: 01, Train Loss: 0.688, Train Acc: 56.85%, Val. Loss: 0.644, Val. Acc: 71.34%\n",
|
||||
"Epoch: 02, Train Loss: 0.656, Train Acc: 70.80%, Val. Loss: 0.523, Val. Acc: 75.57%\n",
|
||||
"Epoch: 03, Train Loss: 0.585, Train Acc: 79.04%, Val. Loss: 0.447, Val. Acc: 79.60%\n",
|
||||
"Epoch: 04, Train Loss: 0.505, Train Acc: 83.24%, Val. Loss: 0.423, Val. Acc: 82.47%\n",
|
||||
"Epoch: 05, Train Loss: 0.439, Train Acc: 85.92%, Val. Loss: 0.395, Val. Acc: 85.01%\n"
|
||||
"| Epoch: 01 | Train Loss: 0.688 | Train Acc: 56.85% | Val. Loss: 0.644 | Val. Acc: 71.34% |\n",
|
||||
"| Epoch: 02 | Train Loss: 0.656 | Train Acc: 70.80% | Val. Loss: 0.523 | Val. Acc: 75.57% |\n",
|
||||
"| Epoch: 03 | Train Loss: 0.585 | Train Acc: 79.04% | Val. Loss: 0.447 | Val. Acc: 79.60% |\n",
|
||||
"| Epoch: 04 | Train Loss: 0.505 | Train Acc: 83.24% | Val. Loss: 0.423 | Val. Acc: 82.47% |\n",
|
||||
"| Epoch: 05 | Train Loss: 0.439 | Train Acc: 85.92% | Val. Loss: 0.395 | Val. Acc: 85.01% |\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
@ -457,7 +445,7 @@
|
||||
" train_loss, train_acc = train(model, train_iterator, optimizer, criterion)\n",
|
||||
" valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)\n",
|
||||
" \n",
|
||||
" print(f'Epoch: {epoch+1:02}, Train Loss: {train_loss:.3f}, Train Acc: {train_acc*100:.2f}%, Val. Loss: {valid_loss:.3f}, Val. Acc: {valid_acc*100:.2f}%')"
|
||||
" print(f'| Epoch: {epoch+1:02} | Train Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}% | Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:.2f}% |')"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -474,26 +462,18 @@
|
||||
"execution_count": 16,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"/home/ben/.conda/envs/pytorch04/lib/python3.6/site-packages/torchtext/data/field.py:322: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.\n",
|
||||
" return Variable(arr, volatile=not train)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Test Loss: 0.387, Test Acc: 85.16%\n"
|
||||
"| Test Loss: 0.387 | Test Acc: 85.16% |\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"test_loss, test_acc = evaluate(model, test_iterator, criterion)\n",
|
||||
"\n",
|
||||
"print(f'Test Loss: {test_loss:.3f}, Test Acc: {test_acc*100:.2f}%')"
|
||||
"print(f'| Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}% |')"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -519,7 +499,7 @@
|
||||
" indexed = [TEXT.vocab.stoi[t] for t in tokenized]\n",
|
||||
" tensor = torch.LongTensor(indexed).to(device)\n",
|
||||
" tensor = tensor.unsqueeze(1)\n",
|
||||
" prediction = F.sigmoid(model(tensor))\n",
|
||||
" prediction = torch.sigmoid(model(tensor))\n",
|
||||
" return prediction.item()"
|
||||
]
|
||||
},
|
||||
@ -538,7 +518,7 @@
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"2.414054449673131e-07"
|
||||
"2.414095661151805e-07"
|
||||
]
|
||||
},
|
||||
"execution_count": 18,
|
||||
|
@ -10,7 +10,7 @@
|
||||
"\n",
|
||||
"**Note**: I am not aiming to give a comprehensive introduction and explanation of CNNs. For a better and more in-depth explanation check out [here](https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/) and [here](https://cs231n.github.io/convolutional-networks/).\n",
|
||||
"\n",
|
||||
"Traditionally, CNNs are used to analyse images and are made up of one or more *convolutional* layers, followed by one or more linear layers. The convolutional layers use filters (also called *kernels* or *receptive fields*) which scan across an image and produce a processed version of the image. This processed version of the image can be fed into another convolutional layer or a linear layer. Each filter has a shape, e.g. a 3x3 filter covers a 3 pixel wide and 3 pixel high area of the image, and each element of the filter has a weight associated with it, the 3x3 filter would have 9 weights. In traditional image processing these weights were specified by hand by engineers, however the main advantage of the convolutional layers is that these weights are learned via backpropagation. \n",
|
||||
"Traditionally, CNNs are used to analyse images and are made up of one or more *convolutional* layers, followed by one or more linear layers. The convolutional layers use filters (also called *kernels* or *receptive fields*) which scan across an image and produce a processed version of the image. This processed version of the image can be fed into another convolutional layer or a linear layer. Each filter has a shape, e.g. a 3x3 filter covers a 3 pixel wide and 3 pixel high area of the image, and each element of the filter has a weight associated with it, the 3x3 filter would have 9 weights. In traditional image processing these weights were specified by hand by engineers, however the main advantage of the convolutional layers in neural networks is that these weights are learned via backpropagation. \n",
|
||||
"\n",
|
||||
"The intuitive idea behind learning the weights is that your convolutional layers act like *feature extractors*, extracting parts of the image that are most important for your CNN's goal, e.g. if using a CNN to detect faces in an image, the CNN may be looking for features such as the existance of a nose, mouth or a pair of eyes in the image.\n",
|
||||
"\n",
|
||||
@ -45,13 +45,14 @@
|
||||
"\n",
|
||||
"torch.manual_seed(SEED)\n",
|
||||
"torch.cuda.manual_seed(SEED)\n",
|
||||
"torch.backends.cudnn.deterministic = True\n",
|
||||
"\n",
|
||||
"TEXT = data.Field(tokenize='spacy')\n",
|
||||
"LABEL = data.LabelField(tensor_type=torch.FloatTensor)\n",
|
||||
"LABEL = data.LabelField(dtype=torch.float)\n",
|
||||
"\n",
|
||||
"train, test = datasets.IMDB.splits(TEXT, LABEL)\n",
|
||||
"train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)\n",
|
||||
"\n",
|
||||
"train, valid = train.split(random_state=random.seed(SEED))"
|
||||
"train_data, valid_data = train_data.split(random_state=random.seed(SEED))"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -67,8 +68,8 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"TEXT.build_vocab(train, max_size=25000, vectors=\"glove.6B.100d\")\n",
|
||||
"LABEL.build_vocab(train)"
|
||||
"TEXT.build_vocab(train_data, max_size=25000, vectors=\"glove.6B.100d\")\n",
|
||||
"LABEL.build_vocab(train_data)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -86,11 +87,12 @@
|
||||
"source": [
|
||||
"BATCH_SIZE = 64\n",
|
||||
"\n",
|
||||
"device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n",
|
||||
"\n",
|
||||
"train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(\n",
|
||||
" (train, valid, test), \n",
|
||||
" (train_data, valid_data, test_data), \n",
|
||||
" batch_size=BATCH_SIZE, \n",
|
||||
" sort_key=lambda x: len(x.text), \n",
|
||||
" repeat=False)"
|
||||
" device=device)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -101,15 +103,15 @@
|
||||
"\n",
|
||||
"Now to build our model.\n",
|
||||
"\n",
|
||||
"The first major hurdle is visualizing how CNNs are used for text. Images are typically 2 dimensional (we'll ignore the fact that there is a third \"colour\" dimension for now) whereas text is 1 dimensional. However, we know that the first step in almost all of our previous tutorials (and pretty much all NLP pipelines) is converting the words into vectors. This is how we can visualize our words in 2 dimensions, each word along one axis and the elements of vectors aross the other dimension. Consider the 2 dimensional representation of the embedded sentence below:\n",
|
||||
"The first major hurdle is visualizing how CNNs are used for text. Images are typically 2 dimensional (we'll ignore the fact that there is a third \"colour\" dimension for now) whereas text is 1 dimensional. However, we know that the first step in almost all of our previous tutorials (and pretty much all NLP pipelines) is converting the words into word embeddings. This is how we can visualize our words in 2 dimensions, each word along one axis and the elements of vectors aross the other dimension. Consider the 2 dimensional representation of the embedded sentence below:\n",
|
||||
"\n",
|
||||
"![](https://i.imgur.com/ci1h9hv.png)\n",
|
||||
"\n",
|
||||
"We can then use a filter that is **[n x emb_dim]**. This will cover $n$ sequential words entirely, as their width will be `emb_dim` dimensions. Consider the image below. Our word vectors are represented in green, here we have 4 words with 5 dimensional embeddings, creating a 4x5 \"image\". A filter that covers two words at a time (i.e. bi-grams) will be **[2x5]**, shown in yellow. The output of this filter (shown in red) will be a single real number that is the weighted sum of all elements covered by the filter.\n",
|
||||
"We can then use a filter that is **[n x emb_dim]**. This will cover $n$ sequential words entirely, as their width will be `emb_dim` dimensions. Consider the image below, with our word vectors are represented in green. Here we have 4 words with 5 dimensional embeddings, creating a [4x5] \"image\" tensor. A filter that covers two words at a time (i.e. bi-grams) will be **[2x5]** filter, shown in yellow, and each element of the filter with have a _weight_ associated with it. The output of this filter (shown in red) will be a single real number that is the weighted sum of all elements covered by the filter.\n",
|
||||
"\n",
|
||||
"![](https://i.imgur.com/QlXduXu.png)\n",
|
||||
"\n",
|
||||
"The filter then moves \"down\" the image (or across the sentence) to cover the next bi-gram and another output is calculated. \n",
|
||||
"The filter then moves \"down\" the image (or across the sentence) to cover the next bi-gram and another output (weighted sum) is calculated. \n",
|
||||
"\n",
|
||||
"![](https://i.imgur.com/wuA330x.png)\n",
|
||||
"\n",
|
||||
@ -119,7 +121,7 @@
|
||||
"\n",
|
||||
"In our case (and in the general case where the width of the filter equals the width of the \"image\"), our output will be a vector with number of elements equal to the height of the image (or lenth of the word) minus the height of the filter plus one, $4-2+1=3$ in this case.\n",
|
||||
"\n",
|
||||
"This example showed how to calculate the output of one filter. Our model (and pretty much all CNNs) will have lots of these filters. The idea is that each filter will learn a different feature to extract. In the scenario of analysing text, we are hoping each of the **[2 x emb_dim]** filters will be looking for a certain bi-gram. \n",
|
||||
"This example showed how to calculate the output of one filter. Our model (and pretty much all CNNs) will have lots of these filters. The idea is that each filter will learn a different feature to extract. In the scenario of analysing text, we are hoping each of the **[2 x emb_dim]** filters will be looking for the occurence of different bi-grams. \n",
|
||||
"\n",
|
||||
"In our model, we will also have different sizes of filters, heights of 3, 4 and 5, with 100 of each of them. The intuition is that we will be looking for the occurence of different tri-grams, 4-grams and 5-grams that are relevant for analysing sentiment of movie reviews.\n",
|
||||
"\n",
|
||||
@ -127,7 +129,7 @@
|
||||
"\n",
|
||||
"![](https://i.imgur.com/gzkS3ze.png)\n",
|
||||
"\n",
|
||||
"The idea here is that the maximum value is the \"most important\" feature for determining the sentiment of the review, this corresponds to the \"most important\" n-gram within the review. How do we know what the \"most important\" n-gram is? Luckily, we don't have to! Through backpropagation, the weights of the filters are changed so that whenever certain n-grams that correspond to a sentiment are seen, the output of the filter is a \"high\" value. This \"high\" value then passes through the max pooling layer if it is the maximum value in the output. \n",
|
||||
"The idea here is that the maximum value is the \"most important\" feature for determining the sentiment of the review, which corresponds to the \"most important\" n-gram within the review. How do we know what the \"most important\" n-gram is? Luckily, we don't have to! Through backpropagation, the weights of the filters are changed so that whenever certain n-grams that are highly indicative of the sentiment are seen, the output of the filter is a \"high\" value. This \"high\" value then passes through the max pooling layer if it is the maximum value in the output. \n",
|
||||
"\n",
|
||||
"As our model has 100 filters of 3 different sizes, that means we have 300 different n-grams the model thinks are important. We concatenate these together into a single vector and pass them through a linear layer to predict the sentiment. We can think of the weights of this linear layer as \"weighting up the evidence\" from each of the 300 n-grams and making a final decision. \n",
|
||||
"\n",
|
||||
@ -135,9 +137,9 @@
|
||||
"\n",
|
||||
"We implement the convolutional layers with `nn.Conv2d`. The `in_channels` argument is the number of \"channels\" in your image going into the convolutional layer. In actual images this is usually 3 (one channel for each of the red, blue and green channels), however when using text we only have a single channel, the text itself. The `out_channels` is the number of filters and the `kernel_size` is the size of the filters. Each of our `kernel_size`s is going to be **[n x emb_dim]** where $n$ is the size of the n-grams.\n",
|
||||
"\n",
|
||||
"In PyTorch, RNNs want the input with the batch dimension second, whereas CNNs want the batch dimension first. Thus, the first thing we do to our input is `permute` it to make it the correct shape. We then pass the sentence through an embedding layer to get our embeddings. The second dimension of the input into a `nn.Conv2d` layer must be the channel dimension, as text technically does have one, we `unsqueeze` our tensor to create a 1 dimensional channel. This matches with our `in_channels=1` in the initialization of our convolutional layers. \n",
|
||||
"In PyTorch, RNNs want the input with the batch dimension second, whereas CNNs want the batch dimension first. Thus, the first thing we do to our input is `permute` it to make it the correct shape. We then pass the sentence through an embedding layer to get our embeddings. The second dimension of the input into a `nn.Conv2d` layer must be the channel dimension. As text technically does not have a channel dimension, we `unsqueeze` our tensor to create one. This matches with our `in_channels=1` in the initialization of our convolutional layers. \n",
|
||||
"\n",
|
||||
"We then pass the tensors through the convolutional and pooling layers, using the `ReLU` activation function after the convolutional layers. Another nice feature of the pooling layers is that they handle sentences of different lengths. The size of the output of the convolutional layer is dependent on the size of the input to it, and different batches contain sentences of different lengths. Without the max pooling layer we would have the linear layer dependent on the size of the input sentence, but with the max pooling layer we always know the input to the linear layer will be the total number of filters. **Note**: there an exception to this if your sentence(s) are shorter than the largest filter used. You will then have to pad your sentences to the length of the largest filter. In the IMDb data there are no reviews only 5 words long so we don't have to worry about that, but you will if you are using your own data.\n",
|
||||
"We then pass the tensors through the convolutional and pooling layers, using the `ReLU` activation function after the convolutional layers. Another nice feature of the pooling layers is that they handle sentences of different lengths. The size of the output of the convolutional layer is dependent on the size of the input to it, and different batches contain sentences of different lengths. Without the max pooling layer the input to our linear layer would depend on the size of the input sentence (not what we want). One option to rectify this would be to trim/pad all sentences to the same length, however with the max pooling layer we always know the input to the linear layer will be the total number of filters. **Note**: there an exception to this if your sentence(s) are shorter than the largest filter used. You will then have to pad your sentences to the length of the largest filter. In the IMDb data there are no reviews only 5 words long so we don't have to worry about that, but you will if you are using your own data.\n",
|
||||
"\n",
|
||||
"Finally, we perform dropout on the concatenated filter outputs and then pass them through a linear layer to make our predictions."
|
||||
]
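Putting those implementation details together, the `forward` method of the `CNN` class defined below looks roughly like this sketch (the attribute names `conv_0`, `conv_1`, `conv_2`, `dropout` and `fc` are assumed to match the model's `__init__`):

```python
def forward(self, x):
    # x: [sent len, batch size] -> [batch size, sent len]
    x = x.permute(1, 0)

    # [batch size, sent len, emb dim]
    embedded = self.embedding(x)

    # add the channel dimension expected by nn.Conv2d: [batch size, 1, sent len, emb dim]
    embedded = embedded.unsqueeze(1)

    # each conv output: [batch size, n_filters, sent len - filter_size + 1]
    conved_0 = F.relu(self.conv_0(embedded).squeeze(3))
    conved_1 = F.relu(self.conv_1(embedded).squeeze(3))
    conved_2 = F.relu(self.conv_2(embedded).squeeze(3))

    # max pool over what is left of the sentence dimension: [batch size, n_filters]
    pooled_0 = F.max_pool1d(conved_0, conved_0.shape[2]).squeeze(2)
    pooled_1 = F.max_pool1d(conved_1, conved_1.shape[2]).squeeze(2)
    pooled_2 = F.max_pool1d(conved_2, conved_2.shape[2]).squeeze(2)

    # concatenate the pooled filter outputs and apply dropout: [batch size, 3 * n_filters]
    cat = self.dropout(torch.cat((pooled_0, pooled_1, pooled_2), dim=1))

    return self.fc(cat)
```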
|
||||
@ -149,6 +151,7 @@
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import torch.nn as nn\n",
|
||||
"import torch.nn.functional as F\n",
|
||||
"\n",
|
||||
"class CNN(nn.Module):\n",
|
||||
" def __init__(self, vocab_size, embedding_dim, n_filters, filter_sizes, output_dim, dropout):\n",
|
||||
@ -200,9 +203,9 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Currently the `CNN` model can only use 3 different filters, but we can actually improve the code of our model to make it more generic and take any number of filters.\n",
|
||||
"Currently the `CNN` model can only use 3 different sized filters, but we can actually improve the code of our model to make it more generic and take any number of filters.\n",
|
||||
"\n",
|
||||
"We do this by using `nn.ModuleList`, a function that takes in a list of PyTorch `nn.Module`s. If we simply used a list without `nn.ModuleList`, the modules within the list cannot be \"seen\" by any modules outside the list which causes errors.\n",
|
||||
"We do this by placing all of our convolutional layers in a `nn.ModuleList`, a function used to hold a list of PyTorch `nn.Module`s. If we simply used a standard Python list, the modules within the list cannot be \"seen\" by any modules outside the list which will cause us some errors.\n",
|
||||
"\n",
|
||||
"We can now pass an arbitrary sized list of filter sizes and the list comprehension will create a convolutional layer for each of them. Then, in the `forward` method we iterate through the list applying each convolutional layer to get a list of convolutional outputs, which we also feed through the max pooling in a list comprehension before concatenating together and passing through the dropout and linear layers."
|
||||
]
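A sketch of the two pieces that change (the names `self.convs`, `n_filters`, `filter_sizes` and `embedding_dim` are assumed to match the constructor arguments of the `CNN` class above):

```python
# in __init__: one Conv2d per filter size, held in a ModuleList so they register as sub-modules
self.convs = nn.ModuleList([
    nn.Conv2d(in_channels=1, out_channels=n_filters, kernel_size=(fs, embedding_dim))
    for fs in filter_sizes
])

# in forward: apply every conv, then max pool each result, via list comprehensions
conved = [F.relu(conv(embedded)).squeeze(3) for conv in self.convs]
pooled = [F.max_pool1d(c, c.shape[2]).squeeze(2) for c in conved]
cat = self.dropout(torch.cat(pooled, dim=1))
```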
|
||||
@ -337,8 +340,6 @@
|
||||
"\n",
|
||||
"criterion = nn.BCEWithLogitsLoss()\n",
|
||||
"\n",
|
||||
"device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n",
|
||||
"\n",
|
||||
"model = model.to(device)\n",
|
||||
"criterion = criterion.to(device)"
|
||||
]
|
||||
@ -356,15 +357,13 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import torch.nn.functional as F\n",
|
||||
"\n",
|
||||
"def binary_accuracy(preds, y):\n",
|
||||
" \"\"\"\n",
|
||||
" Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8\n",
|
||||
" \"\"\"\n",
|
||||
"\n",
|
||||
" #round predictions to the closest integer\n",
|
||||
" rounded_preds = torch.round(F.sigmoid(preds))\n",
|
||||
" rounded_preds = torch.round(torch.sigmoid(preds))\n",
|
||||
" correct = (rounded_preds == y).float() #convert into float for division \n",
|
||||
" acc = correct.sum()/len(correct)\n",
|
||||
" return acc"
|
||||
@ -464,23 +463,15 @@
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"/home/ben/.conda/envs/pytorch04/lib/python3.6/site-packages/torchtext/data/field.py:322: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.\n",
|
||||
" return Variable(arr, volatile=not train)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Epoch: 01, Train Loss: 0.495, Train Acc: 74.84%, Val. Loss: 0.347, Val. Acc: 85.16%\n",
|
||||
"Epoch: 02, Train Loss: 0.302, Train Acc: 87.22%, Val. Loss: 0.293, Val. Acc: 87.70%\n",
|
||||
"Epoch: 03, Train Loss: 0.219, Train Acc: 91.38%, Val. Loss: 0.281, Val. Acc: 88.43%\n",
|
||||
"Epoch: 04, Train Loss: 0.146, Train Acc: 94.66%, Val. Loss: 0.277, Val. Acc: 88.92%\n",
|
||||
"Epoch: 05, Train Loss: 0.090, Train Acc: 97.07%, Val. Loss: 0.292, Val. Acc: 89.16%\n"
|
||||
"| Epoch: 01 | Train Loss: 0.495 | Train Acc: 74.84% | Val. Loss: 0.347 | Val. Acc: 85.16% |\n",
|
||||
"| Epoch: 02 | Train Loss: 0.302 | Train Acc: 87.22% | Val. Loss: 0.293 | Val. Acc: 87.70% |\n",
|
||||
"| Epoch: 03 | Train Loss: 0.219 | Train Acc: 91.47% | Val. Loss: 0.282 | Val. Acc: 88.43% |\n",
|
||||
"| Epoch: 04 | Train Loss: 0.147 | Train Acc: 94.58% | Val. Loss: 0.278 | Val. Acc: 89.00% |\n",
|
||||
"| Epoch: 05 | Train Loss: 0.090 | Train Acc: 97.05% | Val. Loss: 0.292 | Val. Acc: 89.14% |\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
@ -492,7 +483,7 @@
|
||||
" train_loss, train_acc = train(model, train_iterator, optimizer, criterion)\n",
|
||||
" valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)\n",
|
||||
" \n",
|
||||
" print(f'Epoch: {epoch+1:02}, Train Loss: {train_loss:.3f}, Train Acc: {train_acc*100:.2f}%, Val. Loss: {valid_loss:.3f}, Val. Acc: {valid_acc*100:.2f}%')"
|
||||
" print(f'| Epoch: {epoch+1:02} | Train Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}% | Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:.2f}% |')"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -507,26 +498,18 @@
|
||||
"execution_count": 13,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"/home/ben/.conda/envs/pytorch04/lib/python3.6/site-packages/torchtext/data/field.py:322: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.\n",
|
||||
" return Variable(arr, volatile=not train)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Test Loss: 0.306, Test Acc: 88.48%\n"
|
||||
"| Test Loss: 0.307 | Test Acc: 88.43% |\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"test_loss, test_acc = evaluate(model, test_iterator, criterion)\n",
|
||||
"\n",
|
||||
"print(f'Test Loss: {test_loss:.3f}, Test Acc: {test_acc*100:.2f}%')"
|
||||
"print(f'| Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}% |')"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -537,7 +520,7 @@
|
||||
"\n",
|
||||
"And again, as a sanity check we can check some input sentences\n",
|
||||
"\n",
|
||||
"**Note**: As mentioned in the implementation details, the input sentence has to be at least as long as the largest filter height used. In TorchText the default padding token is `<pad>` so we add that to both of our examples below so they are at least 5 tokens long."
|
||||
"**Note**: As mentioned in the implementation details, the input sentence has to be at least as long as the largest filter height used. We modify our `predict_sentiment` function to also accept a minimum length argument. If the tokenized input sentence is less than `min_len` tokens, we append padding tokens (`<pad>`) to make it `min_len` tokens."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -549,12 +532,14 @@
|
||||
"import spacy\n",
|
||||
"nlp = spacy.load('en')\n",
|
||||
"\n",
|
||||
"def predict_sentiment(sentence):\n",
|
||||
"def predict_sentiment(sentence, min_len=5):\n",
|
||||
" tokenized = [tok.text for tok in nlp.tokenizer(sentence)]\n",
|
||||
" if len(tokenized) < min_len:\n",
|
||||
" tokenized += ['<pad>'] * (min_len - len(tokenized))\n",
|
||||
" indexed = [TEXT.vocab.stoi[t] for t in tokenized]\n",
|
||||
" tensor = torch.LongTensor(indexed).to(device)\n",
|
||||
" tensor = tensor.unsqueeze(1)\n",
|
||||
" prediction = F.sigmoid(model(tensor))\n",
|
||||
" prediction = torch.sigmoid(model(tensor))\n",
|
||||
" return prediction.item()"
|
||||
]
|
||||
},
|
||||
@ -573,7 +558,7 @@
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"0.010326345451176167"
|
||||
"0.04627241939306259"
|
||||
]
|
||||
},
|
||||
"execution_count": 15,
|
||||
@ -582,7 +567,7 @@
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"predict_sentiment(\"This film is terrible <pad>\")"
|
||||
"predict_sentiment(\"This film is terrible\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -600,7 +585,7 @@
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"0.9676147103309631"
|
||||
"0.9447714686393738"
|
||||
]
|
||||
},
|
||||
"execution_count": 16,
|
||||
@ -609,7 +594,7 @@
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"predict_sentiment(\"This film is great <pad>\")"
|
||||
"predict_sentiment(\"This film is great\")"
|
||||
]
|
||||
}
|
||||
],
|
||||
|
@ -18,7 +18,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 119,
|
||||
"execution_count": 1,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
@ -57,7 +57,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 88,
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
@ -84,20 +84,20 @@
|
||||
"\n",
|
||||
"* The order of the keys in the `fields` dictionary does not matter, as long as its keys match the `json` data keys.\n",
|
||||
"\n",
|
||||
"- The `Field` name does not have to match the key in the `json` object, i.e. we use `PLACE` for the `\"location\"` field.\n",
|
||||
"- The `Field` name does not have to match the key in the `json` object, e.g. we use `PLACE` for the `\"location\"` field.\n",
|
||||
"\n",
|
||||
"- When dealing with `json` data, not all of the keys have to be used, i.e. we did not use the `\"age\"` field.\n",
|
||||
"- When dealing with `json` data, not all of the keys have to be used, e.g. we did not use the `\"age\"` field.\n",
|
||||
"\n",
|
||||
"- Also, if the values of `json` field are a string then the `Fields` tokenization is applied, however if the values are a list then no tokenization is applied. Usually it is a good idea for the data to already be tokenized into a list, this saves time as you don't have to wait for TorchText to do it.\n",
|
||||
"- Also, if the values of `json` field are a string then the `Fields` tokenization is applied (default is to split the string on spaces), however if the values are a list then no tokenization is applied. Usually it is a good idea for the data to already be tokenized into a list, this saves time as you don't have to wait for TorchText to do it.\n",
|
||||
"\n",
|
||||
"- The value of the `json` fields do not have to be the same type. Some examples can have their `\"quote\"` as a string, and some as a list. The tokenization will only get applied to the ones with their `\"quote\"` as a string.\n",
|
||||
"\n",
|
||||
"- If you are using a `json` field, every single example must have an instance of that field, i.e. in this example all examples must have a name, location and quote. However, as we are not using the age field, it does not matter if an example does not have it."
|
||||
"- If you are using a `json` field, every single example must have an instance of that field, e.g. in this example all examples must have a name, location and quote. However, as we are not using the age field, it does not matter if an example does not have it."
|
||||
]
|
||||
},
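A sketch of what the `fields` dictionary used here looks like (`PLACE` and `SAYING` are the `Field`s mentioned above, `NAME` is an assumed name for the third; the `'n'`, `'p'` and `'s'` attribute names match the batch attributes mentioned below):

```python
from torchtext import data

NAME = data.Field()
PLACE = data.Field()
SAYING = data.Field()

fields = {
    'name': ('n', NAME),        # json key 'name'     -> batch attribute 'n'
    'location': ('p', PLACE),   # json key 'location' -> batch attribute 'p'
    'quote': ('s', SAYING),     # json key 'quote'    -> batch attribute 's'
}                               # the 'age' key is simply not listed, so it is ignored
```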
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 94,
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
@ -108,25 +108,25 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We then create our `train` and `test` datasets with the `TabularDataset.splits` function. \n",
|
||||
"We then create our datasets (`train_data` and `test_data`) with the `TabularDataset.splits` function. \n",
|
||||
"\n",
|
||||
"The `path` argument specifices the top level folder common among both datasets, and the `train` and `test` arguments specify the filename of each dataset, i.e. here the train dataset is located at `data/train.json`.\n",
|
||||
"The `path` argument specifices the top level folder common among both datasets, and the `train` and `test` arguments specify the filename of each dataset, e.g. here the train dataset is located at `data/train.json`.\n",
|
||||
"\n",
|
||||
"We tell the function we are using `json` data, and pass in our `fields` dictionary defined previously."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 95,
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"train, test = data.TabularDataset.splits(\n",
|
||||
" path = 'data',\n",
|
||||
" train = 'train.json',\n",
|
||||
" test = 'test.json',\n",
|
||||
" format = 'json',\n",
|
||||
" fields = fields\n",
|
||||
"train_data, test_data = data.TabularDataset.splits(\n",
|
||||
" path = 'data',\n",
|
||||
" train = 'train.json',\n",
|
||||
" test = 'test.json',\n",
|
||||
" format = 'json',\n",
|
||||
" fields = fields\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
@ -139,17 +139,17 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 96,
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"train, valid, test = data.TabularDataset.splits(\n",
|
||||
" path = 'data',\n",
|
||||
" train = 'train.json',\n",
|
||||
" validation = 'valid.json',\n",
|
||||
" test = 'test.json',\n",
|
||||
" format = 'json',\n",
|
||||
" fields = fields\n",
|
||||
"train_data, valid_data, test_data = data.TabularDataset.splits(\n",
|
||||
" path = 'data',\n",
|
||||
" train = 'train.json',\n",
|
||||
" validation = 'valid.json',\n",
|
||||
" test = 'test.json',\n",
|
||||
" format = 'json',\n",
|
||||
" fields = fields\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
@ -166,30 +166,30 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 97,
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"vars(train[0]): {'n': ['John'], 'p': ['United', 'Kingdom'], 's': ['i', 'love', 'the', 'united kingdom']}\n"
|
||||
"{'text': ['I', 'simply', 'cannot', 'believe', 'the', 'number', 'of', 'people', 'comparing', 'this', 'favourably', 'with', 'the', 'first', 'film.', 'It', 'moved', 'me', 'to', 'leave', 'this', 'comment!', 'This', 'is', 'just', 'an', 'obvious', 'attempt', 'to', 'cash-in', 'on', 'the', 'success', 'of', 'the', 'first', 'film.', 'The', 'dialogue', 'is', 'appalling', 'and', 'nothing', 'like', 'as', 'authentic', 'or', 'compelling', 'as', 'the', 'original', 'film.<br', '/><br', '/>The', 'storyline', 'is', 'ridiculous,', 'the', 'portrayal', 'of', 'the', 'French', 'police', 'laughable', 'and', 'the', 'characterisation', 'of', 'Doyle', 'a', 'mile', 'away', 'from', 'the', 'first', 'film.<br', '/><br', '/>How', 'many', 'drug', 'bosses', 'do', 'you', 'think', 'go', 'down', 'to', 'the', 'docks', 'in', 'person', 'to', 'see', 'a', 'shipment', 'come', 'in?', 'The', 'ease', 'at', 'which', 'Doyle', 'finds', 'his', 'guy', 'is', 'just', 'pathetic.', 'Like', 'all', 'the', 'French', 'Police', 'were', 'just', 'drinking', 'coffee', 'until', 'Doyle', 'turns', 'up', 'from', 'America', 'and', 'does', 'some', 'REAL', 'police', 'work.', 'What', 'a', 'joke.', 'Try', 'going', 'to', 'a', 'foreign', 'city', 'and', 'unearthing', 'the', 'biggest', 'crims', 'in', 'the', 'place', 'with', 'a', 'travel', 'map', 'and', 'some', 'tourist', 'pamphlets.', 'Pathetic.', '<br', '/><br', '/>A', 'truly', 'awful', 'sequel,', 'anyone', 'who', 'thinks', 'otherwise', 'is', 'crazy.'], 'label': 'neg'}\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print('vars(train[0]):', vars(train[0]))"
|
||||
"print(vars(train[0]))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can now use `train`, `test` and `valid` to build a vocabulary and create iterators, as in the other notebooks.\n",
|
||||
"We can now use `train_data`, `test_data` and `valid_data` to build a vocabulary and create iterators, as in the other notebooks. We can access all attributes by using `batch.n`, `batch.p` and `batch.s` for the names, places and sayings, respectively.\n",
|
||||
"\n",
|
||||
"## Reading CSV/TSV\n",
|
||||
"\n",
|
||||
"`csv` and `tsv` are very similar, one has elements separated by commas and one by tabs.\n",
|
||||
"`csv` and `tsv` are very similar, except csv has elements separated by commas and tsv by tabs.\n",
|
||||
"\n",
|
||||
"Using the same example above, our `tsv` data will be in the form of:\n",
|
||||
"\n",
|
||||
@ -199,11 +199,11 @@
|
||||
"Mary\tUnited States\t36\ti want more telescopes\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"That is, on each row the elements are separated by tabs and we have one example by row. The first row is usually a header (i.e. the name of each of the columns), but your data could have no header.\n",
|
||||
"That is, on each row the elements are separated by tabs and we have one example per row. The first row is usually a header (i.e. the name of each of the columns), but your data could have no header.\n",
|
||||
"\n",
|
||||
"You cannot have lists within `tsv` or `csv` data.\n",
|
||||
"\n",
|
||||
"The way the fields are defined is a bit different to `json`. We now use a list of tuples, where the elements are tuples are the same as before, i.e. first element is the batch object's attribute name, second element is the `Field` name. Unlike the `json` formatted data, \n",
|
||||
"The way the fields are defined is a bit different to `json`. We now use a list of tuples, where each element is also a tuple. The first element of these inner tuples will become the batch object's attribute name, second element is the `Field` name.\n",
|
||||
"\n",
|
||||
"Unlike the `json` data, the tuples have to be in the same order that they are within the `tsv` data. Due to this, when skipping a column of data a tuple of `None`s needs to be used, if not then our `SAYING` field will be applied to the `age` column of the `tsv` data and the `quote` column will not be used. \n",
|
||||
"\n",
|
||||
@ -216,7 +216,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 116,
|
||||
"execution_count": 7,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
@ -225,36 +225,36 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 117,
|
||||
"execution_count": 8,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"train, valid, test = data.TabularDataset.splits(\n",
|
||||
" path = 'data',\n",
|
||||
" train = 'train.tsv',\n",
|
||||
" validation = 'valid.tsv',\n",
|
||||
" test = 'test.tsv',\n",
|
||||
" format = 'tsv',\n",
|
||||
" fields = fields,\n",
|
||||
" skip_header = True\n",
|
||||
"train_data, valid_data, test_data = data.TabularDataset.splits(\n",
|
||||
" path = 'data',\n",
|
||||
" train = 'train.tsv',\n",
|
||||
" validation = 'valid.tsv',\n",
|
||||
" test = 'test.tsv',\n",
|
||||
" format = 'tsv',\n",
|
||||
" fields = fields,\n",
|
||||
" skip_header = True\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 118,
|
||||
"execution_count": 9,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"vars(train[0]): {'n': ['John'], 'p': ['United', 'Kingdom'], 's': ['i', 'love', 'the', 'united', 'kingdom']}\n"
|
||||
"{'text': ['I', 'simply', 'cannot', 'believe', 'the', 'number', 'of', 'people', 'comparing', 'this', 'favourably', 'with', 'the', 'first', 'film.', 'It', 'moved', 'me', 'to', 'leave', 'this', 'comment!', 'This', 'is', 'just', 'an', 'obvious', 'attempt', 'to', 'cash-in', 'on', 'the', 'success', 'of', 'the', 'first', 'film.', 'The', 'dialogue', 'is', 'appalling', 'and', 'nothing', 'like', 'as', 'authentic', 'or', 'compelling', 'as', 'the', 'original', 'film.<br', '/><br', '/>The', 'storyline', 'is', 'ridiculous,', 'the', 'portrayal', 'of', 'the', 'French', 'police', 'laughable', 'and', 'the', 'characterisation', 'of', 'Doyle', 'a', 'mile', 'away', 'from', 'the', 'first', 'film.<br', '/><br', '/>How', 'many', 'drug', 'bosses', 'do', 'you', 'think', 'go', 'down', 'to', 'the', 'docks', 'in', 'person', 'to', 'see', 'a', 'shipment', 'come', 'in?', 'The', 'ease', 'at', 'which', 'Doyle', 'finds', 'his', 'guy', 'is', 'just', 'pathetic.', 'Like', 'all', 'the', 'French', 'Police', 'were', 'just', 'drinking', 'coffee', 'until', 'Doyle', 'turns', 'up', 'from', 'America', 'and', 'does', 'some', 'REAL', 'police', 'work.', 'What', 'a', 'joke.', 'Try', 'going', 'to', 'a', 'foreign', 'city', 'and', 'unearthing', 'the', 'biggest', 'crims', 'in', 'the', 'place', 'with', 'a', 'travel', 'map', 'and', 'some', 'tourist', 'pamphlets.', 'Pathetic.', '<br', '/><br', '/>A', 'truly', 'awful', 'sequel,', 'anyone', 'who', 'thinks', 'otherwise', 'is', 'crazy.'], 'label': 'neg'}\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print('vars(train[0]):', vars(train[0]))"
|
||||
"print(vars(train[0]))"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -268,7 +268,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 123,
|
||||
"execution_count": 10,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
@ -277,36 +277,36 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 124,
|
||||
"execution_count": 11,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"train, valid, test = data.TabularDataset.splits(\n",
|
||||
" path = 'data',\n",
|
||||
" train = 'train.csv',\n",
|
||||
" validation = 'valid.csv',\n",
|
||||
" test = 'test.csv',\n",
|
||||
" format = 'csv',\n",
|
||||
" fields = fields,\n",
|
||||
" skip_header = True\n",
|
||||
"train_data, valid_data, test_data = data.TabularDataset.splits(\n",
|
||||
" path = 'data',\n",
|
||||
" train = 'train.csv',\n",
|
||||
" validation = 'valid.csv',\n",
|
||||
" test = 'test.csv',\n",
|
||||
" format = 'csv',\n",
|
||||
" fields = fields,\n",
|
||||
" skip_header = True\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 125,
|
||||
"execution_count": 12,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"vars(train[0]): {'n': ['John'], 'p': ['United', 'Kingdom'], 's': ['i', 'love', 'the']}\n"
|
||||
"{'text': ['I', 'simply', 'cannot', 'believe', 'the', 'number', 'of', 'people', 'comparing', 'this', 'favourably', 'with', 'the', 'first', 'film.', 'It', 'moved', 'me', 'to', 'leave', 'this', 'comment!', 'This', 'is', 'just', 'an', 'obvious', 'attempt', 'to', 'cash-in', 'on', 'the', 'success', 'of', 'the', 'first', 'film.', 'The', 'dialogue', 'is', 'appalling', 'and', 'nothing', 'like', 'as', 'authentic', 'or', 'compelling', 'as', 'the', 'original', 'film.<br', '/><br', '/>The', 'storyline', 'is', 'ridiculous,', 'the', 'portrayal', 'of', 'the', 'French', 'police', 'laughable', 'and', 'the', 'characterisation', 'of', 'Doyle', 'a', 'mile', 'away', 'from', 'the', 'first', 'film.<br', '/><br', '/>How', 'many', 'drug', 'bosses', 'do', 'you', 'think', 'go', 'down', 'to', 'the', 'docks', 'in', 'person', 'to', 'see', 'a', 'shipment', 'come', 'in?', 'The', 'ease', 'at', 'which', 'Doyle', 'finds', 'his', 'guy', 'is', 'just', 'pathetic.', 'Like', 'all', 'the', 'French', 'Police', 'were', 'just', 'drinking', 'coffee', 'until', 'Doyle', 'turns', 'up', 'from', 'America', 'and', 'does', 'some', 'REAL', 'police', 'work.', 'What', 'a', 'joke.', 'Try', 'going', 'to', 'a', 'foreign', 'city', 'and', 'unearthing', 'the', 'biggest', 'crims', 'in', 'the', 'place', 'with', 'a', 'travel', 'map', 'and', 'some', 'tourist', 'pamphlets.', 'Pathetic.', '<br', '/><br', '/>A', 'truly', 'awful', 'sequel,', 'anyone', 'who', 'thinks', 'otherwise', 'is', 'crazy.'], 'label': 'neg'}\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print('vars(train[0]):', vars(train[0]))"
|
||||
"print(vars(train[0]))"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -315,9 +315,9 @@
|
||||
"source": [
|
||||
"## Why JSON over CSV/TSV?\n",
|
||||
"\n",
|
||||
"1. Your `csv` or `tsv` data cannot be lists. This means data cannot be tokenized during the pre-processing step, which means everytime you run your Python script that reads this data via TorchText, it has to be tokenized. Using advanced tokenizers, such as the `spaCy` tokenizer takes a non-negligible amount of time, especially if you are running your script multiple times. Thus, it is better to tokenize your data from a string into a list and use the `json` format.\n",
|
||||
"1. Your `csv` or `tsv` data cannot be stored lists. This means data cannot be already be tokenized, thus everytime you run your Python script that reads this data via TorchText, it has to be tokenized. Using advanced tokenizers, such as the `spaCy` tokenizer, takes a non-negligible amount of time. Thus, it is better to tokenize your datasets and store them in the `json lines` format.\n",
|
||||
"\n",
|
||||
"2. If tabs appear in your `tsv` data, or commas appear in your `csv` data, TorchText will think they are delimiters between columns. This will cause your data to be parsed incorrectly, and worst of all TorchText will not alert you to this as it cannot tell the difference between a tab/comma in a field and a tab/comma as a delimiter. As `json` data is essential a dictionary, you access the data within the fields via its key, so do not have to worry about surprise delimiters."
|
||||
"2. If tabs appear in your `tsv` data, or commas appear in your `csv` data, TorchText will think they are delimiters between columns. This will cause your data to be parsed incorrectly. Worst of all TorchText will not alert you to this as it cannot tell the difference between a tab/comma in a field and a tab/comma as a delimiter. As `json` data is essentially a dictionary, you access the data within the fields via its key, so do not have to worry about \"surprise\" delimiters."
|
||||
]
|
||||
}
|
||||
],
|
||||
|
@ -10,13 +10,13 @@
|
||||
"\n",
|
||||
"Embeddings transform a one-hot encoded vector (a vector that is 0 in elements except one, which is 1) into a much smaller dimension vector of real numbers. The one-hot encoded vector is also known as a *sparse vector*, whilst the real valued vector is known as a *dense vector*. \n",
|
||||
"\n",
|
||||
"The key concept in these word embeddings is that words that appear in similar _contexts_ appear nearby in the vector space, i.e. the Euclidean distance between these two word vectors is small. By context here, we mean the surrounding words, for example in the sentences \"I purchased some items at the shop\" and \"I purchased some items at the store\" the words 'shop' and 'store' appear in the same context and thus should be close together in vector space.\n",
|
||||
"The key concept in these word embeddings is that words that appear in similar _contexts_ appear nearby in the vector space, i.e. the Euclidean distance between these two word vectors is small. By context here, we mean the surrounding words. For example in the sentences \"I purchased some items at the shop\" and \"I purchased some items at the store\" the words 'shop' and 'store' appear in the same context and thus should be close together in vector space.\n",
|
||||
"\n",
|
||||
"You may have also heard about *word2vec*. *word2vec* is an algorithm (actually a bunch of algorithms) that calculates word vectors from a corpus. In this appendix we use *GloVe* vectors, *GloVe* being another algorithm to calculate word vectors. If you want to know how *word2vec* works, check out a two part series [here](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/) and [here](http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/), and if you want to find out more about *GloVe*, check the website [here](https://nlp.stanford.edu/projects/glove/).\n",
|
||||
"\n",
|
||||
"In PyTorch, we use word vectors with the `nn.Embedding` layer, which takes a _**[sentence length, batch size]**_ tensor and transforms it into a _**[sentence length, batch size, embedding dimensions]**_ tensor.\n",
|
||||
"\n",
|
||||
"In tutorial 2 onwards, we also used pre-trained word embeddings (specifically the GloVe vectors) provided in TorchText. These embeddings have been trained on a gigantic corpus. We can use these pre-trained vectors within any of our models, with the idea that as they have already learned the context of each word it gives us a better initial starting point for our word vectors. This usually leads to faster training time and/or improved accuracy.\n",
|
||||
"In tutorial 2 onwards, we also used pre-trained word embeddings (specifically the GloVe vectors) provided by TorchText. These embeddings have been trained on a gigantic corpus. We can use these pre-trained vectors within any of our models, with the idea that as they have already learned the context of each word they will give us a better starting point for our word vectors. This usually leads to faster training time and/or improved accuracy.\n",
|
||||
"\n",
|
||||
"In this appendix we won't be training any models, instead we'll be looking at the word embeddings and finding a few interesting things about them.\n",
|
||||
"\n",
|
||||
@ -54,7 +54,7 @@
|
||||
"source": [
|
||||
"As shown above, there are 400,000 unique words in the GloVe vocabulary. These are the most common words found in the corpus the vectors were trained on. **In these set of GloVe vectors, every single word is lower-case only.**\n",
|
||||
"\n",
|
||||
"`glove.vectors` are the actual tensors containing the values of the embeddings. Each row is a word with each word having 100 columns."
|
||||
"`glove.vectors` is the actual tensor containing the values of the embeddings."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -81,7 +81,9 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can see what the word associated with each row is by checking the `itos` (int to string) list. Below implies that row 0 is the vector associated with the word 'the', row 1 for ',' (comma), etc."
|
||||
"We can see what word is associated with each row by checking the `itos` (int to string) list. \n",
|
||||
"\n",
|
||||
"Below implies that row 0 is the vector associated with the word 'the', row 1 for ',' (comma), row 2 for '.' (period), etc."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -235,7 +237,7 @@
|
||||
"source": [
|
||||
"Let's try it out with 'korea'. The closest word is the word 'korea' itself (not very interesting), however all of the words are related in some way. Pyongyang is the capital of North Korea, DPRK is the official name of North Korea, etc.\n",
|
||||
"\n",
|
||||
"Interestingly, we also get 'Japan' and 'China'. These are countries, however they are also countries that are geographically near Korea. "
|
||||
"Interestingly, we also get 'Japan' and 'China', implies that Korea, Japan and China are frequently talked about together in similar contexts. This makes sense as they are geographically situated near each other. "
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -271,7 +273,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Looking at another country, India, we also get nearby countries: Thailand, Malaysia and Sri Lanka (as two separate words). Australia is relatively close to India, but Thailand and Malaysia are closer, could it be due to India and Australia appearing in the context of cricket matches together?"
|
||||
"Looking at another country, India, we also get nearby countries: Thailand, Malaysia and Sri Lanka (as two separate words). Australia is relatively close to India (geographically), but Thailand and Malaysia are closer. So why is Australia closer to India in vector space? This is most probably due to India and Australia appearing in the context of [cricket](https://en.wikipedia.org/wiki/Cricket) matches together."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -408,9 +410,9 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"This is the canonical example which shows off this propert of word embeddings. So why does it work? Why does the vector of 'woman' added to the vector of 'king' minus the vector of 'man' give us 'queen'?\n",
|
||||
"This is the canonical example which shows off this property of word embeddings. So why does it work? Why does the vector of 'woman' added to the vector of 'king' minus the vector of 'man' give us 'queen'?\n",
|
||||
"\n",
|
||||
"If we think about it, the vector calculated from 'king' minus 'man' gives us a \"royalty vector\". This is the vector associated with traveling from a man to his royal counterpart, a king. If we add this \"royality vector\" to 'woman', this should travel to her royal equivalent, which is queen!\n",
|
||||
"If we think about it, the vector calculated from 'king' minus 'man' gives us a \"royalty vector\". This is the vector associated with traveling from a man to his royal counterpart, a king. If we add this \"royality vector\" to 'woman', this should travel to her royal equivalent, which is a queen!\n",
|
||||
"\n",
|
||||
"We can do this with other analogies too. For example, this gets an \"acting career vector\":"
|
||||
]
|
||||
@ -563,7 +565,9 @@
|
||||
"\n",
|
||||
"We'll put their findings into code and briefly explain them, but to read more about this, check out the [original thread](http://forums.fast.ai/t/nlp-any-libraries-dictionaries-out-there-for-fixing-common-spelling-errors/16411) and the associated [write-up](https://blog.usejournal.com/a-simple-spell-checker-built-from-word-vectors-9f28452b6f26).\n",
|
||||
"\n",
|
||||
"First, we need to load up the much larger vocabulary GloVe vectors, this is due to the spelling mistakes not appearing in the smaller vocabulary. **Note**: these vectors are very large (~2GB), so watch out if you have a limited internet connection."
|
||||
"First, we need to load up the much larger vocabulary GloVe vectors, this is due to the spelling mistakes not appearing in the smaller vocabulary. \n",
|
||||
"\n",
|
||||
"**Note**: these vectors are very large (~2GB), so watch out if you have a limited internet connection."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -749,7 +753,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"For a misspelling of definitely:"
|
||||
"For a misspelling of \"definitely\":"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -782,7 +786,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"For a misspelling of consistent:"
|
||||
"For a misspelling of \"consistent\":"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -815,7 +819,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"For a misspelling of package:"
|
||||
"For a misspelling of \"package\":"
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -1,6 +1,6 @@
|
||||
# PyTorch Sentiment Analysis
|
||||
|
||||
This repo contains tutorials covering how to do sentiment analysis using [PyTorch](https://github.com/pytorch/pytorch) 0.4 and [TorchText](https://github.com/pytorch/text) 0.2.3 using Python 3.6.
|
||||
This repo contains tutorials covering how to do sentiment analysis using [PyTorch](https://github.com/pytorch/pytorch) 0.4 and [TorchText](https://github.com/pytorch/text) 0.3 using Python 3.6.
|
||||
|
||||
The first 2 tutorials will cover getting started with the de facto approach to sentiment analysis: recurrent neural networks (RNNs). The third notebook covers the [FastText](https://arxiv.org/abs/1607.01759) model and the final one covers a [convolutional neural network](https://arxiv.org/abs/1408.5882) (CNN) model.
|
||||
|
||||
|
BIN assets/sentiment1.png (new file, 5.7 KiB): binary file not shown.
assets/sentiment1.xml (new file, 1 line):
@ -0,0 +1 @@
|
||||
<mxfile userAgent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36" version="9.2.7" editor="www.draw.io" type="device"><diagram id="c468a845-3572-b292-0351-cbae11ed401d" name="Page-1">7Vpbb5swGP01edwEGNzksc26i7RJk/qw9tEBB7wSnDlOQ/brZ8DmZmeNSCASCZUafGxj+M45X7CdCZiv0i8MraMfNMDxxLGCdAI+TRwHep74nwH7AgAzUAAhI0EB2RXwRP5iCVoS3ZIAbxoNOaUxJ+sm6NMkwT5vYIgxums2W9K4OeoahVgDnnwU6+gvEvCoQKcOrPCvmISRGtmGs6JmgfzXkNFtIsebOGCZH0X1CqlryQfdRCiguxoEHidgzijlxdkqneM4C60KW9Hv84Ha8r4ZTvhRHSRRG75Xz44DEQpZpIxHNKQJih8r9CF/PpxdwRKliK9icWqLU5wS/lw7f8mafPSyUsLZ/ln2yAtV3W/M+V5qAG05FVA17ndK1/KKS5pw2cyGWZnE8ZzGlOX3reIMHjac0Vdcq7HyQ9QUz5o94MFwSWhDt8yXrRwpQMRCLFuBkirhAExXWDyQaMJwjDh5a14dSS2GZbuKD3EiKTHTI4d+Q/EWK0G16GqSsYsIx09rlN/7TnizSZAeMsf3TSEL4AJ6sAzZG2Ycp/8Pmh4O1UGJXSaDqSzuKme5EopqplLYKfGz4U3e78kb6PJ2B5I3GIO8Hety8r67yfs9ebu6vL2B5O2OQt7Ty8l7Ooy8u0q4myBPVZ/s+pMSMUhJk+s2abKtFgGFAWSvFgflbRxFizcGWQN4OVnbWrxOkXWZm6tU/SLr+hd2Pa06vQjbbgsbtigovHUOYd8ZhA1jnklxjZIGY/DPNpui5ZH7sMlDdy8a2HCdVpXiLJSf+VUWCvimEHFHC60VU0haDr8tW9m1rjVY71xvl9+9glvSExbiTUU1fZfQBLdMKiEUkzARRV+IAgv8ITMkEZPoe1mxIkGQi9eUAJqCNivwrNMO++44izvnsLieEkdh8VPnvWaLO22m+rP4dCiLR4jj7i53bi7vOPsa1OVgnC4/dfpvdjloM9Wfy2c9uzxrfMZcwSOy6Z4rQIdcUTzALYUYZ7iDphB3nCmkn0kuaDPVXwpRO1H9vykIw6y6u9+9vSl0nPEPafNyu29UK1n2rBeXe+3pQH9LWbbpTeGktax3Vrm97E/zlKiB+SGvUMOL4zwOaK8RDrjmpeJqSKdlzrLMWfBa8lNb9rMB09N530IG3UfqlrpMu56nrl8eHW192zMyfLNbh77Zr8UQ7uX8oG9xmBg6uBZ7LQzZrVcqOCBF8CiKDi6kXQtFDphdjCJ9P8VE0cH1i2uhCDiXc5G+Hm6i6OAk81oocq2hKBLF6tefxYym+oUtePwH</diagram></mxfile>
|
BIN assets/sentiment2.png (new file, 6.9 KiB): binary file not shown.
assets/sentiment2.xml (new file, 1 line):
@ -0,0 +1 @@
|
||||
<mxfile userAgent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36" version="9.2.7" editor="www.draw.io" type="device"><diagram id="c468a845-3572-b292-0351-cbae11ed401d" name="Page-1">7Vtbb5swFP41kbaHTYDBTR7brLtImzSpD+seHXCBleDMcZpkv34GzN0pBIjTplBpxZ+vnPN9R/ZxNwHz5e4LRSvvB3FwMDE0ZzcBnyaGAS2L/xsB+wQAM5AALvWdBNJz4M7/hwWoCXTjO3hdasgICZi/KoM2CUNssxKGKCXbcrMHEpRnXSEX14A7GwV19JfvMC9BpwbM8a/Yd710Zh3OkpoFsh9dSjahmG9igIf4SaqXKB1LfOjaQw7ZFiBwOwFzSghL3pa7OQ4i06ZmS/p9PlCbrZvikLXqIBy1Zvv027HDTSGKhDKPuCREwW2O3sTfh6MRNF7y2DLgrzp/xTuf3Rfef0dNPlpRKWR0fy96xIW87g9mbC84gDaMcCif9zshKzHiAwmZaKbDqOwHwZwEhMbrTu0MbtaMkkdcqNHih9ck3xp94EFzCWhNNtQWrQxBQERdLFqBzFVcAZgsMf8g3oTiADH/qTw6Elx0s3a5P/iLcIncPWLqJxRscEqoirvKzth6PsN3KxSvfcu1WXZQ3WSGbctM5sAFtGBmsidMGd49b7S6OdIOKdlFMJiK4jZXlikgryCqFOtjPx2O9G6iN6jT21REb3AJ9DbOSO+rkd5N9Dbr9LYU0du8BHqDM9J7qobeXSn8nCy6kdWqk1Wf9WSr6PqT+HzizK2WVnarrlUclqxB9Kr4LFtGKzdalyAD84wy0Gv26iODjLR5aP8t6roKoT3Zi8w2TkJs3awQG1ZckOhtCGJfSYgNAxZRcYXCksfg3010pIst92Edm+6aN9DhapdX8jdX/I5HWaTAtxThK1rUWtEU2WXTb7JWeqFrAa53LraLV5/CFepxCbEyo8q6C0mIKyIVEAp8N+RFm5MCc/wmEqTPD93XomLpO05MXlkAKBNazsBBjyn6VTuJG0NIvB4SL0Lifc/Jcokb6iQ+VSVxDzHcXeXGqPKOpzWlKgeXqfK+6QK5yoE6lc9OrPKo8YCxgnn+unusAB1iRfIBYwiRnoiVhhDzMkNI35SMPISY6kJIenN1+p0CF8yyu/rNcafQ8cSvUubZ9eALzXy1F3kxu3WiVBZUl8rSZTuFXrmshqy4Ff3UNMVrYPyIEQp48gyjAOt8Oa/UrpJwmsUsTR4F30p8qtJ+pjA8DbsLUXrv1C10yW5J++YvW1tbdk165MHClG0t3nmSDYJ2aIPA16nZ7Tu8P3h8eCsCBY36tE6lz8MH2sEY03ZTKx3n+ER1bRtapuN5V/OKuW4Ow/XsZCXILtkrwxORHcjuF1842RvytYrJftxqRrIbZyQ7fH1kb0g4Kib7casZyQ7OSPbDF+wvluwN+TXFZD9uNSPZTWVk58X8r/yTTFT+PynA7X8=</diagram></mxfile>
|
BIN assets/sentiment3.png (new file, 8.1 KiB): binary file not shown.
assets/sentiment3.xml (new file, 1 line):
@ -0,0 +1 @@
|
||||
<mxfile userAgent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36" version="9.2.7" editor="www.draw.io" type="device"><diagram id="c468a845-3572-b292-0351-cbae11ed401d" name="Page-1">7Ztdb6M4FIZ/DZezAhvT5LLNdHZXmpVW6ko7c7VywAF2CM6A0yT769cGm0+TRIQPNW0qtXBsY3PO857YhhpwtT3+muBd8Af1SGQA0zsa8LMBgIMQ/y0Mp9wAlzA3+Eno5SarNLyE/xFpNKV1H3okrVVklEYs3NWNLo1j4rKaDScJPdSrbWhU73WHfdIyvLg4alv/Dj0W5NYFcEr7byT0A9Wz5SzzkjV2f/gJ3ceyPwPATfbJi7dYXUv2kAbYo4fclN07fDbgKqGU5Ufb44pEwrXKbbmDvnSUFuNOSMyuaQBtOQ52UvdOPO4KeUoTFlCfxjh6Lq1P2f0RcQWTnwVsG/FDix+SY8i+SbM4/i6Of0HiLGbJ6ZuqJk7Ksn8JYyfJAN4zyk1lv18p3clmHk6DrFdxkg9bjLXzzpWL6T5xZS0IJEw48YmsBq3C7xxnQreEj47XSUiEWfhavz6WYPlFvdK5/ED6t8PXsvNXHO2JwqPh/LprD0HIyMsOZ8M/cKXV3b0Jo2hFI5pkbaGHyMKzuT1lCf1BKiULsIaOU3jtlSSMHM/7re0P2QA+SI2elGTl+aEUii1NQUUjynaTB9H7olWlviqtpj46w9Nq3QOtYDkjrc77otVs0wqWU9Fq3gWt1oy0PoxCa06o4rVKqzkrrWDZptWailbV+dum1YLz0WqNMhOwunPrDbRuaMxkNctpB0uuDzTBMrNPEaxLRGsmC1ptTAH4uHxzfwHX1bnMc9YOcs65rIV8N9+gwTeckO9R5g5vmm/Y5tuei294D3wDc0a+R5ltvGm+7TbfaC6+7bvgezEf33AxBt/nZtMl7t9lq2v51kJ2EVfUxnWy+TO6BzyhM2P6tVoOuwXPASBsJtlOB9e2F4bOj7LpnzTkvZYTQbseqWKZri6R60K2akShGMZVgXnQkO1ETLC4w3EtYs7PvXhKkHnuU5q57pFXsJzdsSzkR778m11lrQy/Kwsf0bpVK1GWY9H9vqhlVZpWzO3G1XrZ6JW5gR4XDKsTVRdeTGPSUKk04Sj0Y37qckoItz8J+YUujh5lwTb0vAxeXQaoA301gf1XEOppzCWJgyEk3s6Jb1Higy9h9RIHzUiNJ/HFVBIPMCP9VQ4+VN5zHTWpyuFdqHzwhbxe5bAZqfFUrtvhHVLlovKAuYIFYdo/V8AeuSK/gY8Uol2qTppChn0FZK4UMvheSUcKaUZqvBSirjz+TIELZttf/fbHTKHnkn9KmcPloDLv3HEdd9fp5i0mvartxgrfaTg8TzSDiHrwR78XdqeR+GlJiJc42UdeoWLPP+MAb0+3xaX8qsmeRYoy9UnvvaSjJvVoumwERnnvdJrnP1dNRybammw+4kCjzUZs3WyER9EO/jEN9GygVSIQyd6+zguuEFJC+FwFr7MKIkI7MdBs6OjJQJ+FhHhgUun8iqIismEaPTERNDE7csPY/yuL4Ce7M4Q3vf2CGp7XfJUXtqHVY1/xfI7E3mMWC55+IpymoXv+e+PcW4Q937eq7eMBvTcr7kIabynbrd/vXdG6oJPWhVDznRBztJmCrXvKVROcEMG70ZvdmEosNXrTEdRDb/y0/A+MPHLlf7nA5/8B</diagram></mxfile>
|
BIN assets/sentiment4.png (new file, 7.7 KiB): binary file not shown.
assets/sentiment4.xml (new file, 1 line):
@ -0,0 +1 @@
|
||||
<mxfile userAgent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36" version="9.2.7" editor="www.draw.io" type="device"><diagram id="c468a845-3572-b292-0351-cbae11ed401d" name="Page-1">7Vtdj5s4FP01eewKMBDy2JlOP6SutNKstO3TygEP0BKcOs4ks79+bbADGKeJiDGdNIw0g68/gHvPOdjXzAzcr/YfCFxnf+IEFTPPSfYz8G7mefMoZL+54aU2gHlQG1KSJ7XJbQyP+X9IGB1h3eYJ2nQaUowLmq+7xhiXJYppxwYJwbtusydcdK+6hinqGR5jWPSt/+QJzWpr5IWN/SPK00xe2Q0Xdc0Sxt9TgreluN7MA0/VUVevoBxLXGGTwQTvalP17OBhBu4JxrQ+W+3vUcFdK91WO+j9kdrDfRNU0rM6iLhs6It8dpQwV4giJjTDKS5h8dBY76rnQ3wEh5UyuirYqctO0T6nX1rnX3mTPwJeKil5+SJ6VIWm7hui9EVgAG4pZqbmup8xXosRn3BJRTM35OW8KO5xgUl139LP4G5DCf6OWjVOdbCavndkGPCWxOL5PYE3SFIkWoHaxD3T6iY8+gHhFWIPxBoQVECaP3dBBAUW00O7Jh7sRIREHx7gjREeERIZoE54mmh9Fb3ODY/WR0O87UWWvCuu/QyLLZJ0Vbzd9eUuyyl6XMPq5ndM+br+7QPSi2MdIJNwGQbhzwD5jAhF+5+CTdYKhjtCat1QlHeNcPnClLU0S9oucaAb3tRDwTPo49mfTD3AtauHxtvewpZ3r0E9PGdC9Zjf1EPBs9/HczCZevjXrh4abwPHknf9q1CPaDr1AKOsXH4leAYaeNoif3AN8AThhC83t+ewS+BpAITqK+yogzuzKdNvH9H1L5yzqzaLGL8bqV4Ial6IXkoUDrdxVmDmGmSHBeVYXMOyE7Hwx5bnYSrPvdlUrnvLGrjhet9UsrNU/K1GWUrDJ2lhd7TstSLSsj9cfnto5ba6tsz9zu121d1LswI9RhjaRVSXeCUukcJSYYJFnpasGDOUIGa/4/TLY1i8FRWrPEkq8OoUoAvosxE4fPXrBedR3DNBcbPJmakobjy5pae4p0ZqPIpHtiieQYqGs9y7sXzgKtUqy80mUaZiufEklJ7lQI3UeCxfjMxy3tigVtAs3wzXCjBAK+oHuEmIdqlqVULMZlKmkhDjmagjEqJGajwJkdvN488UGGFWw9nv32YKA5f8Nmkuxxg7n22c5O2kk7uww/JA3bZ0lAjUymOE5bqZwkXJrBObAQH/6XGK1YTVIUZo2evDDAN8JZUS2ct5Sb9q5PSgWY5eBX8XfVJhv7AoT6Ps59jZbjtrfmIpV6nueUSjTU/AK/q8Quv701//RJow2trA1+aJLnovKO+BKEb6TY5lFPhBsynckw+N387OgFrUe2BpA39kDC76GLS2DazNYrw6DKr5OZsYjK5dI4Gjwae1j0B0K+TXh89oOnwurh6fmqmYa+0jPDkDf9X4VLMWFvHpm/1OYYJVwWm8mv9cYWDy4uRaQh3H3FrC132uylaEfvavMwse3LrQziiemfusko7H0p+XDnMyLUAQGwcuqwYcWWvuqsp5wd0seMcTAgxQGwGaVn6gQE9Ukx2gHGw8+RvnZfp3hbw3/lHoXUT6oBv7eZ/zcw3ngQnO6z4+bLDgDMaCCSD83ihw/aAHg4UZ6WfF5v/KagVp/ncPPPwP</diagram></mxfile>
|