pytorch-sentiment-analysis/1 - Simple Sentiment Analysis.ipynb

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1 - Simple Sentiment Analysis\n",
"\n",
"In this series we'll be building a machine learning model to detect sentiment (i.e. detect if a sentence is positive or negative) using PyTorch and TorchText. This will be done on movie reviews, using the [IMDb dataset](http://ai.stanford.edu/~amaas/data/sentiment/).\n",
2018-06-01 06:55:32 +08:00
"\n",
"In this first notebook, we'll start very simple to understand the general concepts whilst not really caring about good results. Further notebooks will build on this knowledge and we'll actually get good results.\n",
2018-06-01 06:55:32 +08:00
"\n",
"### Introduction\n",
2018-06-01 06:55:32 +08:00
"\n",
"We'll be using a **recurrent neural network** (RNN) as they are commonly used in analysing sequences. An RNN takes in sequence of words, $X=\\{x_1, ..., x_T\\}$, one at a time, and produces a _hidden state_, $h$, for each word. We use the RNN _recurrently_ by feeding in the current word $x_t$ as well as the hidden state from the previous word, $h_{t-1}$, to produce the next hidden state, $h_t$. \n",
"\n",
"$$h_t = \\text{RNN}(x_t, h_{t-1})$$\n",
"\n",
"Once we have our final hidden state, $h_T$, (from feeding in the last word in the sequence, $x_T$) we feed it through a linear layer, $f$, (also known as a fully connected layer), to receive our predicted sentiment, $\\hat{y} = f(h_T)$.\n",
"\n",
"Below shows an example sentence, with the RNN predicting zero, which indicates a negative sentiment. The RNN is shown in orange and the linear layer shown in silver. Note that we use the same RNN for every word, i.e. it has the same parameters. The initial hidden state, $h_0$, is a tensor initialized to all zeros. \n",
"\n",
"![](assets/sentiment1.png)\n",
"\n",
"**Note:** some layers and steps have been omitted from the diagram, but these will be explained later."
2018-06-01 06:55:32 +08:00
]
},
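{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the recurrence concrete, below is a minimal sketch (not part of the tutorial's pipeline, with made-up dimensions) that unrolls $h_t = \\text{RNN}(x_t, h_{t-1})$ by hand using `nn.RNNCell`. The model we build later uses `nn.RNN`, which performs this loop internally."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"import torch.nn as nn\n",
"\n",
"#a sketch of the recurrence with made-up sizes: a sequence of 5 'word vectors',\n",
"#each of dimension 10, a batch of 1 and a hidden state of dimension 20\n",
"rnn_cell = nn.RNNCell(input_size = 10, hidden_size = 20)\n",
"fc = nn.Linear(20, 1)\n",
"\n",
"x = torch.randn(5, 1, 10) #[sequence length, batch size, input dim]\n",
"h = torch.zeros(1, 20)    #h_0, initialized to all zeros\n",
"\n",
"for t in range(x.shape[0]):\n",
"    h = rnn_cell(x[t], h) #h_t from x_t and h_{t-1}\n",
"\n",
"prediction = fc(h)        #y_hat = f(h_T)\n",
"\n",
"print(prediction.shape)   #[batch size, output dim]"
]
},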
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preparing Data\n",
"\n",
"One of the main concepts of TorchText is the `Field`. These define how your data should be processed. In our sentiment classification task the data consists of both the raw string of the review and the sentiment, either \"pos\" or \"neg\".\n",
2018-06-01 06:55:32 +08:00
"\n",
"The parameters of a `Field` specify how the data should be processed. \n",
"\n",
"We use the `TEXT` field to define how the review should be processed, and the `LABEL` field to process the sentiment. \n",
2018-06-01 06:55:32 +08:00
"\n",
"Our `TEXT` field has `tokenize='spacy'` as an argument. This defines that the \"tokenization\" (the act of splitting the string into discrete \"tokens\") should be done using the [spaCy](https://spacy.io) tokenizer. If no `tokenize` argument is passed, the default is simply splitting the string on spaces. We also need to specify a `tokenizer_language` which tells torchtext which spaCy model to use. We use the `en_core_web_sm` model which has to be downloaded with `python -m spacy download en_core_web_sm` before you run this notebook!\n",
"\n",
"`LABEL` is defined by a `LabelField`, a special subset of the `Field` class specifically used for handling labels. We will explain the `dtype` argument later.\n",
2018-06-01 06:55:32 +08:00
"\n",
"For more on `Fields`, go [here](https://github.com/pytorch/text/blob/master/torchtext/data/field.py).\n",
"\n",
"We also set the random seeds for reproducibility. "
2018-06-01 06:55:32 +08:00
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"from torchtext.legacy import data\n",
"\n",
"SEED = 1234\n",
"\n",
"torch.manual_seed(SEED)\n",
"torch.backends.cudnn.deterministic = True\n",
"\n",
"TEXT = data.Field(tokenize = 'spacy',\n",
" tokenizer_language = 'en_core_web_sm')\n",
"LABEL = data.LabelField(dtype = torch.float)"
]
},
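{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick, optional check of the tokenization described above (an illustrative sketch, assuming the spaCy model is installed), `Field.preprocess` applies the field's tokenizer to a single raw string. The sentence below is made up purely for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#optional: see how the spaCy tokenizer splits a raw string into tokens\n",
"print(TEXT.preprocess('This film was not great, but it was watchable!'))"
]
},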
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another handy feature of TorchText is that it has support for common datasets used in natural language processing (NLP). \n",
"\n",
"The following code automatically downloads the IMDb dataset and splits it into the canonical train/test splits as `torchtext.datasets` objects. It processes the data using the `Fields` we have previously defined. The IMDb dataset consists of 50,000 movie reviews, each marked as being a positive or negative review."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from torchtext.legacy import datasets\n",
"\n",
"train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see how many examples are in each split by checking their length."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of training examples: 25000\n",
"Number of testing examples: 25000\n"
]
}
],
"source": [
"print(f'Number of training examples: {len(train_data)}')\n",
"print(f'Number of testing examples: {len(test_data)}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also check an example."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'text': ['elvira', 'mistress', 'of', 'the', 'dark', 'is', 'one', 'of', 'my', 'fav', 'movies', ',', 'it', 'has', 'every', 'thing', 'you', 'would', 'want', 'in', 'a', 'film', ',', 'like', 'great', 'one', 'liners', ',', 'sexy', 'star', 'and', 'a', 'Outrageous', 'story', '!', 'if', 'you', 'have', 'not', 'seen', 'it', ',', 'you', 'are', 'missing', 'out', 'on', 'one', 'of', 'the', 'greatest', 'films', 'made', '.', 'i', 'ca', \"n't\", 'wait', 'till', 'her', 'new', 'movie', 'comes', 'out', '!'], 'label': 'pos'}\n"
]
}
],
"source": [
"print(vars(train_data.examples[0]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The IMDb dataset only has train/test splits, so we need to create a validation set. We can do this with the `.split()` method. \n",
"\n",
"By default this splits 70/30, however by passing a `split_ratio` argument, we can change the ratio of the split, i.e. a `split_ratio` of 0.8 would mean 80% of the examples make up the training set and 20% make up the validation set. \n",
"\n",
"We also pass our random seed to the `random_state` argument, ensuring that we get the same train/validation split each time."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"import random\n",
"\n",
"train_data, valid_data = train_data.split(random_state = random.seed(SEED))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again, we'll view how many examples are in each split."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of training examples: 17500\n",
"Number of validation examples: 7500\n",
"Number of testing examples: 25000\n"
]
}
],
"source": [
"print(f'Number of training examples: {len(train_data)}')\n",
"print(f'Number of validation examples: {len(valid_data)}')\n",
"print(f'Number of testing examples: {len(test_data)}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we have to build a _vocabulary_. This is a effectively a look up table where every unique word in your data set has a corresponding _index_ (an integer).\n",
"\n",
"We do this as our machine learning model cannot operate on strings, only numbers. Each _index_ is used to construct a _one-hot_ vector for each word. A one-hot vector is a vector where all of the elements are 0, except one, which is 1, and dimensionality is the total number of unique words in your vocabulary, commonly denoted by $V$.\n",
2018-06-01 06:55:32 +08:00
"\n",
2019-03-10 22:45:44 +08:00
"![](assets/sentiment5.png)\n",
2018-06-01 06:55:32 +08:00
"\n",
"The number of unique words in our training set is over 100,000, which means that our one-hot vectors will have over 100,000 dimensions! This will make training slow and possibly won't fit onto your GPU (if you're using one). \n",
2018-06-01 06:55:32 +08:00
"\n",
"There are two ways effectively cut down our vocabulary, we can either only take the top $n$ most common words or ignore words that appear less than $m$ times. We'll do the former, only keeping the top 25,000 words.\n",
2018-06-01 06:55:32 +08:00
"\n",
"What do we do with words that appear in examples but we have cut from the vocabulary? We replace them with a special _unknown_ or `<unk>` token. For example, if the sentence was \"This film is great and I love it\" but the word \"love\" was not in the vocabulary, it would become \"This film is great and I `<unk>` it\".\n",
2018-06-01 06:55:32 +08:00
"\n",
"The following builds the vocabulary, only keeping the most common `max_size` tokens."
2018-06-01 06:55:32 +08:00
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"MAX_VOCAB_SIZE = 25_000\n",
"\n",
"TEXT.build_vocab(train_data, max_size = MAX_VOCAB_SIZE)\n",
"LABEL.build_vocab(train_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Why do we only build the vocabulary on the training set? When testing any machine learning system you do not want to look at the test set in any way. We do not include the validation set as we want it to reflect the test set as much as possible."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Unique tokens in TEXT vocabulary: 25002\n",
"Unique tokens in LABEL vocabulary: 2\n"
]
}
],
"source": [
"print(f\"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}\")\n",
"print(f\"Unique tokens in LABEL vocabulary: {len(LABEL.vocab)}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Why is the vocab size 25002 and not 25000? One of the addition tokens is the `<unk>` token and the other is a `<pad>` token.\n",
2018-06-01 06:55:32 +08:00
"\n",
"When we feed sentences into our model, we feed a _batch_ of them at a time, i.e. more than one at a time, and all sentences in the batch need to be the same size. Thus, to ensure each sentence in the batch is the same size, any shorter than the longest within the batch are padded.\n",
2018-06-01 06:55:32 +08:00
"\n",
2019-03-10 22:45:44 +08:00
"![](assets/sentiment6.png)\n",
2018-06-01 06:55:32 +08:00
"\n",
2019-03-10 22:16:18 +08:00
"We can also view the most common words in the vocabulary and their frequencies."
2018-06-01 06:55:32 +08:00
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[('the', 202789), (',', 192769), ('.', 165632), ('and', 109469), ('a', 109242), ('of', 100791), ('to', 93641), ('is', 76253), ('in', 61374), ('I', 54030), ('it', 53487), ('that', 49111), ('\"', 44657), (\"'s\", 43331), ('this', 42385), ('-', 36979), ('/><br', 35822), ('was', 35035), ('as', 30388), ('with', 29940)]\n"
]
}
],
"source": [
"print(TEXT.vocab.freqs.most_common(20))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also see the vocabulary directly using either the `stoi` (**s**tring **to** **i**nt) or `itos` (**i**nt **to** **s**tring) method."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['<unk>', '<pad>', 'the', ',', '.', 'and', 'a', 'of', 'to', 'is']\n"
]
}
],
"source": [
"print(TEXT.vocab.itos[:10])"
]
},
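{
"cell_type": "markdown",
"metadata": {},
"source": [
"For illustration only (this is not a step in the pipeline; the iterators created below do it for us), we can numericalize a made-up sentence by hand: tokenize it with the field's tokenizer and look each token up in `TEXT.vocab.stoi`. Any token that was cut from the vocabulary maps to the `<unk>` index, 0."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#an illustrative example: tokenize a made-up sentence and look each token up in the vocabulary\n",
"tokens = TEXT.preprocess('This film is great and I love it')\n",
"\n",
"print(tokens)\n",
"print([TEXT.vocab.stoi[t] for t in tokens])"
]
},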
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also check the labels, ensuring 0 is for negative and 1 is for positive."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"defaultdict(<function _default_unk_index at 0x7ff0c24dbf28>, {'neg': 0, 'pos': 1})\n"
]
}
],
"source": [
"print(LABEL.vocab.stoi)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The final step of preparing the data is creating the iterators. We iterate over these in the training/evaluation loop, and they return a batch of examples (indexed and converted into tensors) at each iteration.\n",
"\n",
"We'll use a `BucketIterator` which is a special type of iterator that will return a batch of examples where each example is of a similar length, minimizing the amount of padding per example.\n",
2018-06-01 06:55:32 +08:00
"\n",
"We also want to place the tensors returned by the iterator on the GPU (if you're using one). PyTorch handles this using `torch.device`, we then pass this device to the iterator."
2018-06-01 06:55:32 +08:00
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"BATCH_SIZE = 64\n",
"\n",
"device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n",
"\n",
"train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(\n",
" (train_data, valid_data, test_data), \n",
" batch_size = BATCH_SIZE,\n",
" device = device)"
]
},
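{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check (not part of the original tutorial), we can pull a single batch from the training iterator and look at the shapes of the tensors it returns: `batch.text` should be _**[sentence length, batch size]**_ and `batch.label` should be _**[batch size]**_."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#optional: peek at one batch to see the tensor shapes\n",
"batch = next(iter(train_iterator))\n",
"\n",
"print(batch.text.shape)  #[sentence length, batch size]\n",
"print(batch.label.shape) #[batch size]"
]
},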
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Build the Model\n",
"\n",
"The next stage is building the model that we'll eventually train and evaluate. \n",
"\n",
"There is a small amount of boilerplate code when creating models in PyTorch, note how our `RNN` class is a sub-class of `nn.Module` and the use of `super`.\n",
"\n",
"Within the `__init__` we define the _layers_ of the module. Our three layers are an _embedding_ layer, our RNN, and a _linear_ layer. All layers have their parameters initialized to random values, unless explicitly specified.\n",
2018-06-01 06:55:32 +08:00
"\n",
"The embedding layer is used to transform our sparse one-hot vector (sparse as most of the elements are 0) into a dense embedding vector (dense as the dimensionality is a lot smaller and all the elements are real numbers). This embedding layer is simply a single fully connected layer. As well as reducing the dimensionality of the input to the RNN, there is the theory that words which have similar impact on the sentiment of the review are mapped close together in this dense vector space. For more information about word embeddings, see [here](https://monkeylearn.com/blog/word-embeddings-transform-text-numbers/).\n",
2018-06-01 06:55:32 +08:00
"\n",
"The RNN layer is our RNN which takes in our dense vector and the previous hidden state $h_{t-1}$, which it uses to calculate the next hidden state, $h_t$.\n",
"\n",
2019-03-10 22:45:44 +08:00
"![](assets/sentiment7.png)\n",
2018-06-01 06:55:32 +08:00
"\n",
2019-03-10 22:16:18 +08:00
"Finally, the linear layer takes the final hidden state and feeds it through a fully connected layer, $f(h_T)$, transforming it to the correct output dimension.\n",
"\n",
2018-06-01 06:55:32 +08:00
"The `forward` method is called when we feed examples into our model.\n",
"\n",
2019-03-10 22:16:18 +08:00
"Each batch, `text`, is a tensor of size _**[sentence length, batch size]**_. That is a batch of sentences, each having each word converted into a one-hot vector. \n",
2018-06-01 06:55:32 +08:00
"\n",
2019-03-10 22:16:18 +08:00
"You may notice that this tensor should have another dimension due to the one-hot vectors, however PyTorch conveniently stores a one-hot vector as it's index value, i.e. the tensor representing a sentence is just a tensor of the indexes for each token in that sentence. The act of converting a list of tokens into a list of indexes is commonly called *numericalizing*.\n",
2018-06-01 06:55:32 +08:00
"\n",
"The input batch is then passed through the embedding layer to get `embedded`, which gives us a dense vector representation of our sentences. `embedded` is a tensor of size _**[sentence length, batch size, embedding dim]**_.\n",
2018-06-01 06:55:32 +08:00
"\n",
"`embedded` is then fed into the RNN. In some frameworks you must feed the initial hidden state, $h_0$, into the RNN, however in PyTorch, if no initial hidden state is passed as an argument it defaults to a tensor of all zeros.\n",
"\n",
2018-11-23 05:15:45 +08:00
"The RNN returns 2 tensors, `output` of size _**[sentence length, batch size, hidden dim]**_ and `hidden` of size _**[1, batch size, hidden dim]**_. `output` is the concatenation of the hidden state from every time step, whereas `hidden` is simply the final hidden state. We verify this using the `assert` statement. Note the `squeeze` method, which is used to remove a dimension of size 1. \n",
2018-06-01 06:55:32 +08:00
"\n",
"Finally, we feed the last hidden state, `hidden`, through the linear layer, `fc`, to produce a prediction."
2018-06-01 06:55:32 +08:00
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"import torch.nn as nn\n",
"\n",
"class RNN(nn.Module):\n",
" def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):\n",
2019-03-30 00:57:00 +08:00
" \n",
2018-06-01 06:55:32 +08:00
" super().__init__()\n",
" \n",
" self.embedding = nn.Embedding(input_dim, embedding_dim)\n",
2019-03-30 00:57:00 +08:00
" \n",
2018-06-01 06:55:32 +08:00
" self.rnn = nn.RNN(embedding_dim, hidden_dim)\n",
2019-03-30 00:57:00 +08:00
" \n",
2018-06-01 06:55:32 +08:00
" self.fc = nn.Linear(hidden_dim, output_dim)\n",
" \n",
2019-03-10 22:16:18 +08:00
" def forward(self, text):\n",
2018-06-01 06:55:32 +08:00
"\n",
2019-03-10 22:16:18 +08:00
" #text = [sent len, batch size]\n",
2018-06-01 06:55:32 +08:00
" \n",
2019-03-10 22:16:18 +08:00
" embedded = self.embedding(text)\n",
2018-06-01 06:55:32 +08:00
" \n",
" #embedded = [sent len, batch size, emb dim]\n",
" \n",
" output, hidden = self.rnn(embedded)\n",
" \n",
" #output = [sent len, batch size, hid dim]\n",
" #hidden = [1, batch size, hid dim]\n",
2018-06-01 06:55:32 +08:00
" \n",
" assert torch.equal(output[-1,:,:], hidden.squeeze(0))\n",
" \n",
" return self.fc(hidden.squeeze(0))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now create an instance of our RNN class. \n",
"\n",
"The input dimension is the dimension of the one-hot vectors, which is equal to the vocabulary size. \n",
"\n",
"The embedding dimension is the size of the dense word vectors. This is usually around 50-250 dimensions, but depends on the size of the vocabulary.\n",
2018-06-01 06:55:32 +08:00
"\n",
"The hidden dimension is the size of the hidden states. This is usually around 100-500 dimensions, but also depends on factors such as on the vocabulary size, the size of the dense vectors and the complexity of the task.\n",
2018-06-01 06:55:32 +08:00
"\n",
"The output dimension is usually the number of classes, however in the case of only 2 classes the output value is between 0 and 1 and thus can be 1-dimensional, i.e. a single scalar real number."
2018-06-01 06:55:32 +08:00
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"INPUT_DIM = len(TEXT.vocab)\n",
"EMBEDDING_DIM = 100\n",
"HIDDEN_DIM = 256\n",
"OUTPUT_DIM = 1\n",
"\n",
"model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)"
]
},
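{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before training, we can sanity check the model's output shape by passing a dummy batch of token indices through it (a sketch with made-up sizes, not part of the original tutorial). With a sequence length of 7 and a batch size of 2, the prediction should have shape _**[batch size, output dim]**_, i.e. _**[2, 1]**_."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#a dummy 'batch' of random token indices: sequence length 7, batch size 2\n",
"dummy_text = torch.randint(0, INPUT_DIM, (7, 2))\n",
"\n",
"print(model(dummy_text).shape) #[batch size, output dim]"
]
},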
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's also create a function that will tell us how many trainable parameters our model has so we can compare the number of parameters across different models."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The model has 2,592,105 trainable parameters\n"
]
}
],
"source": [
"def count_parameters(model):\n",
" return sum(p.numel() for p in model.parameters() if p.requires_grad)\n",
"\n",
"print(f'The model has {count_parameters(model):,} trainable parameters')"
]
},
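{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can verify this number by hand: the embedding layer has $25002 \\times 100 = 2{,}500{,}200$ parameters, the RNN has $(100 \\times 256) + (256 \\times 256) + 256 + 256 = 91{,}648$ parameters (input-to-hidden weights, hidden-to-hidden weights and their two bias vectors), and the linear layer has $(256 \\times 1) + 1 = 257$ parameters, giving $2{,}500{,}200 + 91{,}648 + 257 = 2{,}592{,}105$ in total."
]
},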
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train the Model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we'll set up the training and then train the model.\n",
"\n",
"First, we'll create an optimizer. This is the algorithm we use to update the parameters of the module. Here, we'll use _stochastic gradient descent_ (SGD). The first argument is the parameters will be updated by the optimizer, the second is the learning rate, i.e. how much we'll change the parameters by when we do a parameter update."
2018-06-01 06:55:32 +08:00
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"import torch.optim as optim\n",
"\n",
"optimizer = optim.SGD(model.parameters(), lr=1e-3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we'll define our loss function. In PyTorch this is commonly called a criterion. \n",
"\n",
"The loss function here is _binary cross entropy with logits_. \n",
"\n",
"Our model currently outputs an unbound real number. As our labels are either 0 or 1, we want to restrict the predictions to a number between 0 and 1. We do this using the _sigmoid_ (also known as the _logistic_) function. \n",
"\n",
"We then use this bound scalar to calculate the loss using binary cross entropy. \n",
"\n",
"The `BCEWithLogitsLoss` criterion carries out both the sigmoid and the binary cross entropy steps."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"criterion = nn.BCEWithLogitsLoss()"
]
},
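{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what \"carries out both the sigmoid and the binary cross entropy steps\" means in practice, the optional check below (with made-up logits and labels, not part of the original tutorial) compares `BCEWithLogitsLoss` against applying a sigmoid followed by `BCELoss`; the two values should match."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#made-up logits and labels, purely for illustration\n",
"example_logits = torch.tensor([0.5, -1.2, 2.0])\n",
"example_labels = torch.tensor([1.0, 0.0, 1.0])\n",
"\n",
"print(nn.BCEWithLogitsLoss()(example_logits, example_labels))\n",
"print(nn.BCELoss()(torch.sigmoid(example_logits), example_labels))"
]
},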
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using `.to`, we can place the model and the criterion on the GPU (if we have one). "
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"model = model.to(device)\n",
"criterion = criterion.to(device)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our criterion function calculates the loss, however we have to write our function to calculate the accuracy. \n",
"\n",
"This function first feeds the predictions through a sigmoid layer, squashing the values between 0 and 1, we then round them to the nearest integer. This rounds any value greater than 0.5 to 1 (a positive sentiment) and the rest to 0 (a negative sentiment).\n",
2018-06-01 06:55:32 +08:00
"\n",
"We then calculate how many rounded predictions equal the actual labels and average it across the batch."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"def binary_accuracy(preds, y):\n",
" \"\"\"\n",
" Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8\n",
" \"\"\"\n",
"\n",
" #round predictions to the closest integer\n",
" rounded_preds = torch.round(torch.sigmoid(preds))\n",
2018-06-01 06:55:32 +08:00
" correct = (rounded_preds == y).float() #convert into float for division \n",
2019-03-30 00:57:00 +08:00
" acc = correct.sum() / len(correct)\n",
2018-06-01 06:55:32 +08:00
" return acc"
]
},
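{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small worked example (with made-up numbers), three predictions of $0.9$, $-0.3$ and $2.1$ pass through the sigmoid and round to $1$, $0$ and $1$; against labels of $1$, $0$ and $0$, two out of three match, so the accuracy is $2/3 \\approx 0.67$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#a worked example with made-up predictions (as logits) and labels\n",
"print(binary_accuracy(torch.tensor([0.9, -0.3, 2.1]), torch.tensor([1.0, 0.0, 0.0])))"
]
},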
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `train` function iterates over all examples, one batch at a time. \n",
2018-06-01 06:55:32 +08:00
"\n",
"`model.train()` is used to put the model in \"training mode\", which turns on _dropout_ and _batch normalization_. Although we aren't using them in this model, it's good practice to include it.\n",
"\n",
"For each batch, we first zero the gradients. Each parameter in a model has a `grad` attribute which stores the gradient calculated by the `criterion`. PyTorch does not automatically remove (or \"zero\") the gradients calculated from the last gradient calculation, so they must be manually zeroed.\n",
2018-06-01 06:55:32 +08:00
"\n",
2019-03-10 22:16:18 +08:00
"We then feed the batch of sentences, `batch.text`, into the model. Note, you do not need to do `model.forward(batch.text)`, simply calling the model works. The `squeeze` is needed as the predictions are initially size _**[batch size, 1]**_, and we need to remove the dimension of size 1 as PyTorch expects the predictions input to our criterion function to be of size _**[batch size]**_.\n",
2018-06-01 06:55:32 +08:00
"\n",
2019-03-10 22:16:18 +08:00
"The loss and accuracy are then calculated using our predictions and the labels, `batch.label`, with the loss being averaged over all examples in the batch.\n",
2018-06-01 06:55:32 +08:00
"\n",
"We calculate the gradient of each parameter with `loss.backward()`, and then update the parameters using the gradients and optimizer algorithm with `optimizer.step()`.\n",
"\n",
"The loss and accuracy is accumulated across the epoch, the `.item()` method is used to extract a scalar from a tensor which only contains a single value.\n",
"\n",
"Finally, we return the loss and accuracy, averaged across the epoch. The `len` of an iterator is the number of batches in the iterator.\n",
"\n",
2019-03-10 22:16:18 +08:00
"You may recall when initializing the `LABEL` field, we set `dtype=torch.float`. This is because TorchText sets tensors to be `LongTensor`s by default, however our criterion expects both inputs to be `FloatTensor`s. Setting the `dtype` to be `torch.float`, did this for us. The alternative method of doing this would be to do the conversion inside the `train` function by passing `batch.label.float()` instad of `batch.label` to the criterion. "
2018-06-01 06:55:32 +08:00
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"def train(model, iterator, optimizer, criterion):\n",
" \n",
" epoch_loss = 0\n",
" epoch_acc = 0\n",
" \n",
" model.train()\n",
" \n",
" for batch in iterator:\n",
" \n",
" optimizer.zero_grad()\n",
" \n",
2018-06-01 06:55:32 +08:00
" predictions = model(batch.text).squeeze(1)\n",
" \n",
" loss = criterion(predictions, batch.label)\n",
" \n",
" acc = binary_accuracy(predictions, batch.label)\n",
" \n",
" loss.backward()\n",
" \n",
" optimizer.step()\n",
" \n",
" epoch_loss += loss.item()\n",
" epoch_acc += acc.item()\n",
" \n",
" return epoch_loss / len(iterator), epoch_acc / len(iterator)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`evaluate` is similar to `train`, with a few modifications as you don't want to update the parameters when evaluating.\n",
"\n",
"`model.eval()` puts the model in \"evaluation mode\", this turns off _dropout_ and _batch normalization_. Again, we are not using them in this model, but it is good practice to include them.\n",
2018-06-01 06:55:32 +08:00
"\n",
"No gradients are calculated on PyTorch operations inside the `with no_grad()` block. This causes less memory to be used and speeds up computation.\n",
2018-06-01 06:55:32 +08:00
"\n",
"The rest of the function is the same as `train`, with the removal of `optimizer.zero_grad()`, `loss.backward()` and `optimizer.step()`, as we do not update the model's parameters when evaluating."
2018-06-01 06:55:32 +08:00
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"def evaluate(model, iterator, criterion):\n",
2018-06-01 06:55:32 +08:00
" \n",
" epoch_loss = 0\n",
" epoch_acc = 0\n",
" \n",
" model.eval()\n",
" \n",
" with torch.no_grad():\n",
" \n",
" for batch in iterator:\n",
"\n",
" predictions = model(batch.text).squeeze(1)\n",
" \n",
" loss = criterion(predictions, batch.label)\n",
" \n",
" acc = binary_accuracy(predictions, batch.label)\n",
"\n",
" epoch_loss += loss.item()\n",
" epoch_acc += acc.item()\n",
" \n",
" return epoch_loss / len(iterator), epoch_acc / len(iterator)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll also create a function to tell us how long an epoch takes to compare training times between models."
2018-06-01 06:55:32 +08:00
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"\n",
"def epoch_time(start_time, end_time):\n",
" elapsed_time = end_time - start_time\n",
" elapsed_mins = int(elapsed_time / 60)\n",
" elapsed_secs = int(elapsed_time - (elapsed_mins * 60))\n",
" return elapsed_mins, elapsed_secs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We then train the model through multiple epochs, an epoch being a complete pass through all examples in the training and validation sets.\n",
"\n",
"At each epoch, if the validation loss is the best we have seen so far, we'll save the parameters of the model and then after training has finished we'll use that model on the test set."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch: 01 | Epoch Time: 0m 17s\n",
"\tTrain Loss: 0.694 | Train Acc: 50.12%\n",
"\t Val. Loss: 0.696 | Val. Acc: 50.17%\n",
"Epoch: 02 | Epoch Time: 0m 16s\n",
"\tTrain Loss: 0.693 | Train Acc: 49.72%\n",
"\t Val. Loss: 0.696 | Val. Acc: 51.01%\n",
"Epoch: 03 | Epoch Time: 0m 16s\n",
"\tTrain Loss: 0.693 | Train Acc: 50.22%\n",
"\t Val. Loss: 0.696 | Val. Acc: 50.87%\n",
"Epoch: 04 | Epoch Time: 0m 16s\n",
"\tTrain Loss: 0.693 | Train Acc: 49.94%\n",
"\t Val. Loss: 0.696 | Val. Acc: 49.91%\n",
"Epoch: 05 | Epoch Time: 0m 17s\n",
"\tTrain Loss: 0.693 | Train Acc: 50.07%\n",
"\t Val. Loss: 0.696 | Val. Acc: 51.00%\n"
]
}
],
"source": [
"N_EPOCHS = 5\n",
2018-06-01 06:55:32 +08:00
"\n",
"best_valid_loss = float('inf')\n",
"\n",
2018-06-01 06:55:32 +08:00
"for epoch in range(N_EPOCHS):\n",
"\n",
" start_time = time.time()\n",
" \n",
" train_loss, train_acc = train(model, train_iterator, optimizer, criterion)\n",
" valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)\n",
2018-06-01 06:55:32 +08:00
" \n",
" end_time = time.time()\n",
"\n",
" epoch_mins, epoch_secs = epoch_time(start_time, end_time)\n",
" \n",
" if valid_loss < best_valid_loss:\n",
" best_valid_loss = valid_loss\n",
" torch.save(model.state_dict(), 'tut1-model.pt')\n",
" \n",
" print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')\n",
" print(f'\\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')\n",
" print(f'\\t Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:.2f}%')"
2018-06-01 06:55:32 +08:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You may have noticed the loss is not really decreasing and the accuracy is poor. This is due to several issues with the model which we'll improve in the next notebook.\n",
"\n",
"Finally, the metric we actually care about, the test loss and accuracy, which we get from our parameters that gave us the best validation loss."
2018-06-01 06:55:32 +08:00
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Test Loss: 0.708 | Test Acc: 47.87%\n"
]
}
],
"source": [
"model.load_state_dict(torch.load('tut1-model.pt'))\n",
"\n",
"test_loss, test_acc = evaluate(model, test_iterator, criterion)\n",
"\n",
"print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Next Steps\n",
"\n",
"In the next notebook, the improvements we will make are:\n",
"- packed padded sequences\n",
"- pre-trained word embeddings\n",
"- different RNN architecture\n",
"- bidirectional RNN\n",
"- multi-layer RNN\n",
"- regularization\n",
"- a different optimizer\n",
"\n",
"This will allow us to achieve ~84% accuracy."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}