# C - Loading, Saving and Freezing Embeddings

This notebook covers how to load custom word embeddings in TorchText, how to save the embeddings we learn during training, and how to freeze and unfreeze embeddings during training.

## Loading Custom Embeddings

First, let's look at loading a custom set of embeddings.

Your embeddings need to be formatted so each line starts with the word followed by the values of the embedding vector, all space separated. All vectors need to have the same number of elements.
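
As a rough illustration of this format, the sketch below parses such text with plain Python and checks that every vector has the same number of elements. The helper `parse_embeddings` and the 3-dimensional sample are hypothetical and are not part of TorchText.

```python
def parse_embeddings(text):
    """Parse 'word v1 v2 ...' lines into a {word: [floats]} dict."""
    embeddings = {}
    dim = None
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        word, *values = line.split()
        vector = [float(v) for v in values]
        if dim is None:
            dim = len(vector)  # the first vector fixes the expected size
        elif len(vector) != dim:
            raise ValueError(f'line {line_no}: expected {dim} values, got {len(vector)}')
        embeddings[word] = vector
    return embeddings

# hypothetical 3-dimensional sample in the same layout as embeddings.txt
sample = 'good 1.0 1.0 1.0\nbad -1.0 -1.0 -1.0\n'
parsed = parse_embeddings(sample)
print(parsed['bad'])
```

A line with a different number of values raises a `ValueError` here, which mirrors the requirement that all vectors share one dimensionality.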

Let's look at the custom embeddings provided by these tutorials. These are 20-dimensional embeddings for 7 words.

In [1]:
with open('custom_embeddings/embeddings.txt', 'r') as f:
    print(f.read())

good 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
great 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
awesome 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
bad -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0
terrible -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0
awful -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0
kwyjibo 0.5 -0.5 0.5 -0.5 0.5 -0.5 0.5 -0.5 0.5 -0.5 0.5 -0.5 0.5 -0.5 0.5 -0.5 0.5 -0.5 0.5 -0.5



Now, let's set up the fields.

In [2]:
import torch
from torchtext.legacy import data

SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

TEXT = data.Field(tokenize = 'spacy')
LABEL = data.LabelField(dtype = torch.float)

Then, we'll load our dataset and create the validation set.

In [3]:
from torchtext.legacy import datasets
import random

train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

train_data, valid_data = train_data.split(random_state = random.seed(SEED))

We can only load our custom embeddings after they have been turned into a `Vectors` object.

We create a `Vectors` object by passing it the location of the embeddings (`name`), a location for the cached embeddings (`cache`) and a function that will later initialize tokens in our dataset that aren't within our embeddings (`unk_init`). As we have done in previous notebooks, we initialize these to $\mathcal{N}(0,1)$.

In [4]:
import torchtext.vocab as vocab

custom_embeddings = vocab.Vectors(name = 'custom_embeddings/embeddings.txt',
                                  cache = 'custom_embeddings',
                                  unk_init = torch.Tensor.normal_)

  0%|          | 0/7 [00:00<?, ?it/s]


To check the embeddings have loaded correctly we can print out the words loaded from our custom embedding.

In [5]:
print(custom_embeddings.stoi)

{'good': 0, 'great': 1, 'awesome': 2, 'bad': 3, 'terrible': 4, 'awful': 5, 'kwyjibo': 6}


We can also directly print out the embedding values.

In [6]:
print(custom_embeddings.vectors)

tensor([[ 1.0000,  1.0000,  1.0000,  1.0000,  1.0000,  1.0000,  1.0000,  1.0000,
          1.0000,  1.0000,  1.0000,  1.0000,  1.0000,  1.0000,  1.0000,  1.0000,
          1.0000,  1.0000,  1.0000,  1.0000],
        [ 1.0000,  1.0000,  1.0000,  1.0000,  1.0000,  1.0000,  1.0000,  1.0000,
          1.0000,  1.0000,  1.0000,  1.0000,  1.0000,  1.0000,  1.0000,  1.0000,
          1.0000,  1.0000,  1.0000,  1.0000],
        [ 1.0000,  1.0000,  1.0000,  1.0000,  1.0000,  1.0000,  1.0000,  1.0000,
          1.0000,  1.0000,  1.0000,  1.0000,  1.0000,  1.0000,  1.0000,  1.0000,
          1.0000,  1.0000,  1.0000,  1.0000],
        [-1.0000, -1.0000, -1.0000, -1.0000, -1.0000, -1.0000, -1.0000, -1.0000,
         -1.0000, -1.0000, -1.0000, -1.0000, -1.0000, -1.0000, -1.0000, -1.0000,
         -1.0000, -1.0000, -1.0000, -1.0000],
        [-1.0000, -1.0000, -1.0000, -1.0000, -1.0000, -1.0000, -1.0000, -1.0000,
         -1.0000, -1.0000, -1.0000, -1.0000, -1.0000, -1.0000, -1.0000, -1.0000,
      

We then build our vocabulary, passing our `Vectors` object.

Note that the `unk_init` should be declared when creating our `Vectors`, and not here!

In [7]:
MAX_VOCAB_SIZE = 25_000

TEXT.build_vocab(train_data, 
                 max_size = MAX_VOCAB_SIZE, 
                 vectors = custom_embeddings)

LABEL.build_vocab(train_data)

Now our vocabulary vectors for the words in our custom embeddings should match what we loaded.

In [8]:
TEXT.vocab.vectors[TEXT.vocab.stoi['good']]

tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1.])

In [9]:
TEXT.vocab.vectors[TEXT.vocab.stoi['bad']]

tensor([-1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
        -1., -1., -1., -1., -1., -1.])

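Words in our dataset vocabulary that are not in our custom embeddings are initialized by the `unk_init` function we passed earlier, $\mathcal{N}(0,1)$. They are also the same size as our custom embeddings (20-dimensional). A word that is not in the vocabulary at all maps to the `<unk>` index, so the lookup below returns the randomly initialized vector of the `<unk>` token.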

In [10]:
TEXT.vocab.vectors[TEXT.vocab.stoi['kwjibo']]

tensor([-0.1117, -0.4966,  0.1631, -0.8817,  0.2891,  0.4899, -0.3853, -0.7120,
         0.6369, -0.7141, -1.0831, -0.5547, -1.3248,  0.6970, -0.6631,  1.2158,
        -2.5273,  1.4778, -0.1696, -0.9919])
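
The lookup above returns a vector rather than raising an error, which suggests `stoi` behaves like a dictionary with a default index. A minimal sketch of that fallback, using a hypothetical four-word vocabulary and made-up 2-dimensional vectors (none of these values come from TorchText):

```python
from collections import defaultdict

UNK_IDX = 0  # assumed index of the <unk> token

# hypothetical miniature vocabulary; unseen words fall back to UNK_IDX
stoi = defaultdict(lambda: UNK_IDX, {'<unk>': 0, '<pad>': 1, 'good': 2, 'bad': 3})

vectors = [
    [0.3, -1.2],   # <unk>: stands in for a random N(0,1) draw
    [0.7, 0.1],    # <pad>
    [1.0, 1.0],    # good
    [-1.0, -1.0],  # bad
]

print(vectors[stoi['good']])     # the vector stored for a known word
print(vectors[stoi['kwyjibo']])  # an unseen word gets the <unk> vector
```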

The rest of the setup is the same as when using the GloVe vectors, with the next step being to set up the iterators.

In [11]:
BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    device = device)

Then, we define our model.

In [12]:
import torch.nn as nn
import torch.nn.functional as F

class CNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, n_filters, filter_sizes, output_dim, 
                 dropout, pad_idx):
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)
        
        self.convs = nn.ModuleList([
                                    nn.Conv2d(in_channels = 1, 
                                              out_channels = n_filters, 
                                              kernel_size = (fs, embedding_dim)) 
                                    for fs in filter_sizes
                                    ])
        
        self.fc = nn.Linear(len(filter_sizes) * n_filters, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text):
        
        #text = [sent len, batch size]
        
        text = text.permute(1, 0)
                
        #text = [batch size, sent len]
        
        embedded = self.embedding(text)
                
        #embedded = [batch size, sent len, emb dim]
        
        embedded = embedded.unsqueeze(1)
        
        #embedded = [batch size, 1, sent len, emb dim]
        
        conved = [F.relu(conv(embedded)).squeeze(3) for conv in self.convs]
            
        #conved_n = [batch size, n_filters, sent len - filter_sizes[n] + 1]
        
        pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in conved]
        
        #pooled_n = [batch size, n_filters]
        
        cat = self.dropout(torch.cat(pooled, dim = 1))

        #cat = [batch size, n_filters * len(filter_sizes)]
            
        return self.fc(cat)
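
To make the shapes in `forward` concrete, here is a small sketch of the length each convolution produces along the sentence dimension, assuming stride 1 and no padding as in the `nn.Conv2d` layers above. The helper `conv_output_len` and the sentence length of 50 are illustrative assumptions.

```python
def conv_output_len(sent_len, filter_size):
    """Length after a stride-1, unpadded convolution over the sentence."""
    return sent_len - filter_size + 1

sent_len = 50  # hypothetical sentence length
lengths = {fs: conv_output_len(sent_len, fs) for fs in [3, 4, 5]}
print(lengths)
```

Max pooling then collapses each of these lengths to a single value per filter, which is why the concatenated tensor has `n_filters * len(filter_sizes)` features regardless of sentence length. A sentence shorter than the largest filter would give a non-positive length, so such inputs need padding.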

We then initialize our model, making sure `EMBEDDING_DIM` is the same as our custom embedding dimensionality, i.e. 20.

In [13]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 20
N_FILTERS = 100
FILTER_SIZES = [3,4,5]
OUTPUT_DIM = 1
DROPOUT = 0.5
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

model = CNN(INPUT_DIM, EMBEDDING_DIM, N_FILTERS, FILTER_SIZES, OUTPUT_DIM, DROPOUT, PAD_IDX)

We have far fewer parameters in this model due to the smaller embedding size used.

In [14]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 524,641 trainable parameters
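
The figure above can be reproduced by hand. The sketch below assumes a vocabulary of 25,002 entries (the 25,000 most common words plus the `<unk>` and `<pad>` tokens) and counts weights and biases layer by layer.

```python
vocab_size = 25_002      # MAX_VOCAB_SIZE plus the <unk> and <pad> tokens
embedding_dim = 20
n_filters = 100
filter_sizes = [3, 4, 5]
output_dim = 1

# embedding: one vector per vocabulary entry
embedding_params = vocab_size * embedding_dim
# each conv layer: n_filters kernels of shape (1, fs, embedding_dim) plus one bias per filter
conv_params = sum(n_filters * fs * embedding_dim + n_filters for fs in filter_sizes)
# linear layer: weights plus bias
fc_params = len(filter_sizes) * n_filters * output_dim + output_dim

total = embedding_params + conv_params + fc_params
print(f'{embedding_params:,} + {conv_params:,} + {fc_params:,} = {total:,}')
# prints: 500,040 + 24,300 + 301 = 524,641
```

The embedding layer accounts for the vast majority of the parameters, which is why the embedding size has such a large effect on the total.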


Next, we initialize our embedding layer to use our vocabulary vectors.

In [15]:
embeddings = TEXT.vocab.vectors

model.embedding.weight.data.copy_(embeddings)

tensor([[-0.1117, -0.4966,  0.1631,  ...,  1.4778, -0.1696, -0.9919],
        [-0.5675, -0.2772, -2.1834,  ...,  0.8504,  1.0534,  0.3692],
        [-0.0552, -0.6125,  0.7500,  ..., -0.1261, -1.6770,  1.2068],
        ...,
        [ 0.5383, -0.1504,  1.6720,  ..., -0.3857, -1.0168,  0.1849],
        [ 2.5640, -0.8564, -0.0219,  ..., -0.3389,  0.2203, -1.6119],
        [ 0.1203,  1.5286,  0.6824,  ...,  0.3330, -0.6704,  0.5883]])

Then, we initialize the unknown and padding token embeddings to all zeros.

In [16]:
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]

model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

Following standard procedure, we create our optimizer.

In [17]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters())

Define our loss function (criterion).

In [18]:
criterion = nn.BCEWithLogitsLoss()

Then place the loss function and the model on the GPU (or the CPU if no GPU is available).

In [19]:
model = model.to(device)
criterion = criterion.to(device)

Create the function to calculate accuracy.

In [20]:
def binary_accuracy(preds, y):
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float()
    acc = correct.sum() / len(correct)
    return acc
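
For intuition, the same calculation can be sketched in plain Python on a handful of made-up logits and labels. The function `binary_accuracy_py` is a hypothetical stand-in for illustration, not a replacement for the tensor version above.

```python
import math

def binary_accuracy_py(logits, labels):
    """Fraction of examples where round(sigmoid(logit)) equals the label."""
    preds = [round(1 / (1 + math.exp(-z))) for z in logits]
    correct = sum(float(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)

# hypothetical logits and labels: the third example is misclassified
acc = binary_accuracy_py([2.0, -1.0, 0.5, -3.0], [1.0, 0.0, 0.0, 0.0])
print(acc)  # prints: 0.75
```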

Then implement our training function...

In [21]:
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        optimizer.zero_grad()
                
        predictions = model(batch.text).squeeze(1)
        
        loss = criterion(predictions, batch.label)
        
        acc = binary_accuracy(predictions, batch.label)
        
        loss.backward()
            
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
                
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

...evaluation function...

In [22]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:
            
            predictions = model(batch.text).squeeze(1)
            
            loss = criterion(predictions, batch.label)
            
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

...and our helpful function that tells us how long an epoch takes.

In [23]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
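
A quick check of this helper on made-up timestamps (the function is repeated so the snippet runs on its own):

```python
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

# 125.7 seconds is 2 whole minutes and 5 whole seconds
result = epoch_time(0.0, 125.7)
print(result)  # prints: (2, 5)
```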

We've finally reached training our model!

## Freezing and Unfreezing Embeddings

We're going to train our model for 10 epochs. During the first 5 epochs we are going to freeze the weights (parameters) of our embedding layer. For the last 5 epochs we'll allow our embeddings to be trained. 

Why would we ever want to do this? Sometimes the pre-trained word embeddings we use will already be good enough and won't need to be fine-tuned with our model. If we keep the embeddings frozen then we don't have to calculate the gradients and update the weights for these parameters, giving us faster training times. This doesn't really apply for the model used here, but we're mainly covering it to show how it's done. Another reason is that a model with a large number of parameters can be difficult to train, so by freezing our pre-trained embeddings we reduce the number of parameters that need to be learned.

To freeze the embedding weights, we set `model.embedding.weight.requires_grad` to `False`. This will cause no gradients to be calculated for the weights in the embedding layer, and thus no parameters will be updated when `optimizer.step()` is called.

Then, during training we check if `FREEZE_FOR` (which we set to 5) epochs have passed. If they have then we set `model.embedding.weight.requires_grad` to `True`, telling PyTorch that it should calculate gradients for the embedding layer so that our optimizer updates its weights.
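
The schedule this produces can be traced without any model. The sketch below is a plain-Python mock of the flag logic only; `frozen_schedule` is a hypothetical helper that records whether the embeddings are frozen while each epoch trains.

```python
def frozen_schedule(n_epochs, freeze_for):
    """Return, per epoch, whether the embeddings are frozen during that epoch."""
    unfrozen = False  # embeddings start frozen
    schedule = []
    for epoch in range(n_epochs):
        schedule.append(not unfrozen)  # state in effect while this epoch trains
        if (epoch + 1) >= freeze_for:
            unfrozen = True            # takes effect from the next epoch
    return schedule

schedule = frozen_schedule(10, 5)
print(schedule)
```

Because the check happens at the end of each epoch, the first 5 epochs train with frozen embeddings and the remaining 5 train with them unfrozen.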

In [24]:
N_EPOCHS = 10
FREEZE_FOR = 5

best_valid_loss = float('inf')

#freeze embeddings
model.embedding.weight.requires_grad = unfrozen = False

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s | Frozen? {not unfrozen}')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tutC-model.pt')
    
    if (epoch + 1) >= FREEZE_FOR:
        #unfreeze embeddings
        model.embedding.weight.requires_grad = unfrozen = True

Epoch: 01 | Epoch Time: 0m 7s | Frozen? True
	Train Loss: 0.724 | Train Acc: 53.68%
	 Val. Loss: 0.658 |  Val. Acc: 62.27%
Epoch: 02 | Epoch Time: 0m 6s | Frozen? True
	Train Loss: 0.670 | Train Acc: 59.36%
	 Val. Loss: 0.626 |  Val. Acc: 67.51%
Epoch: 03 | Epoch Time: 0m 6s | Frozen? True
	Train Loss: 0.636 | Train Acc: 63.62%
	 Val. Loss: 0.592 |  Val. Acc: 70.22%
Epoch: 04 | Epoch Time: 0m 6s | Frozen? True
	Train Loss: 0.613 | Train Acc: 66.22%
	 Val. Loss: 0.573 |  Val. Acc: 71.77%
Epoch: 05 | Epoch Time: 0m 6s | Frozen? True
	Train Loss: 0.599 | Train Acc: 67.40%
	 Val. Loss: 0.569 |  Val. Acc: 70.86%
Epoch: 06 | Epoch Time: 0m 7s | Frozen? False
	Train Loss: 0.577 | Train Acc: 69.53%
	 Val. Loss: 0.520 |  Val. Acc: 76.17%
Epoch: 07 | Epoch Time: 0m 7s | Frozen? False
	Train Loss: 0.544 | Train Acc: 72.21%
	 Val. Loss: 0.487 |  Val. Acc: 78.03%
Epoch: 08 | Epoch Time: 0m 7s | Frozen? False
	Train Loss: 0.507 | Train Acc: 74.96%
	 Val. Loss: 0.450 |  Val. Acc: 80.02%
Epoch: 09 | E

Another option would be to unfreeze the embeddings as soon as the validation loss stops decreasing, using the following code snippet instead of the `FREEZE_FOR` condition:
    
```python
if valid_loss < best_valid_loss:
    best_valid_loss = valid_loss
    torch.save(model.state_dict(), 'tutC-model.pt')
else:
    #unfreeze embeddings
    model.embedding.weight.requires_grad = unfrozen = True
```
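
To see when this condition would trigger, here is a small simulation over a made-up list of validation losses. The helper `unfreeze_epoch` and the loss values are hypothetical and only mirror the `if`/`else` logic of the snippet.

```python
def unfreeze_epoch(valid_losses):
    """Return the 1-based epoch at which the embeddings get unfrozen, or None."""
    best_valid_loss = float('inf')
    for epoch, valid_loss in enumerate(valid_losses, start=1):
        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss  # still improving, stay frozen
        else:
            return epoch                  # first epoch without improvement
    return None

# hypothetical validation losses: epoch 4 is the first that fails to improve
first = unfreeze_epoch([0.66, 0.63, 0.59, 0.60, 0.55])
print(first)  # prints: 4
```

If the validation loss improves every epoch, the embeddings stay frozen for the whole run, so this variant trades a fixed schedule for one driven by the data.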

In [25]:
model.load_state_dict(torch.load('tutC-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.396 | Test Acc: 82.36%


## Saving Embeddings

We might want to re-use the embeddings we have trained here with another model. To do this, we'll write a function that loops through our vocabulary, gets each word and its embedding, and writes them to a text file in the same format as our custom embeddings so they can be used with TorchText again.

Currently, TorchText `Vectors` seem to have issues with loading certain words containing non-ASCII characters, so we skip these by only writing words made up entirely of ASCII characters. **If you know a better solution to this then let me know.**

In [26]:
from tqdm import tqdm

def write_embeddings(path, embeddings, vocab):
    
    with open(path, 'w') as f:
        for i, embedding in enumerate(tqdm(embeddings)):
            word = vocab.itos[i]
            #skip words with non-ASCII characters (their UTF-8 byte length differs from their character count)
            if len(word) != len(word.encode()):
                continue
            vector = ' '.join([str(v) for v in embedding.tolist()])
            f.write(f'{word} {vector}\n')
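
The length comparison used in `write_embeddings` works because every ASCII character takes exactly one byte in UTF-8, while any other character takes two or more. A short sketch on a few hypothetical words:

```python
words = ['good', 'café', 'naïve', 'kwyjibo']  # hypothetical sample words

for word in words:
    n_chars = len(word)
    n_bytes = len(word.encode())  # str.encode() defaults to UTF-8
    print(word, n_chars, n_bytes, word.isascii())

# same filter as in write_embeddings: keep words whose byte and character counts match
kept = [w for w in words if len(w) == len(w.encode())]
print(kept)
```

On Python 3.7 and later, `word.isascii()` expresses the same test more directly.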

We'll write our embeddings to `trained_embeddings.txt`.

In [27]:
write_embeddings('custom_embeddings/trained_embeddings.txt', 
                 model.embedding.weight.data, 
                 TEXT.vocab)

100%|██████████| 25002/25002 [00:00<00:00, 38085.03it/s]


To double check they've written correctly, we can load them as `Vectors`.

In [28]:
trained_embeddings = vocab.Vectors(name = 'custom_embeddings/trained_embeddings.txt',
                                   cache = 'custom_embeddings',
                                   unk_init = torch.Tensor.normal_)

 70%|███████   | 17550/24946 [00:00<00:00, 87559.48it/s]


Finally, let's print out the first 5 rows of our loaded vectors and the same from our model's embeddings weights, checking they are the same values.

In [29]:
print(trained_embeddings.vectors[:5])

tensor([[-0.2573, -0.2088,  0.2413, -0.1549,  0.1940, -0.1466, -0.2195, -0.1011,
         -0.1327,  0.1803,  0.2369, -0.2182,  0.1543, -0.2150, -0.0699, -0.0430,
         -0.1958, -0.0506, -0.0059, -0.0024],
        [ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000],
        [-0.1427, -0.4414,  0.7181, -0.5751, -0.3183,  0.0552, -1.6764, -0.3177,
          0.6592,  1.6143, -0.1920, -0.1881, -0.4321, -0.8578,  0.5266,  0.5243,
         -0.7083, -0.0048, -1.4680,  1.1425],
        [-0.4700, -0.0363,  0.0560, -0.7394, -0.2412, -0.4197, -1.7096,  0.9444,
          0.9633,  0.3703, -0.2243, -1.5279, -1.9086,  0.5718, -0.5721, -0.6015,
          0.3579, -0.3834,  0.8079,  1.0553],
        [-0.7055,  0.0954,  0.4646, -1.6595,  0.1138,  0.2208, -0.0220,  0.7397,
         -0.1153,  0.3586,  0.3040, -0.6414, -0.1579, -0.2738, -0.6942,  0.0083,
      

In [30]:
print(model.embedding.weight.data[:5])

tensor([[-0.2573, -0.2088,  0.2413, -0.1549,  0.1940, -0.1466, -0.2195, -0.1011,
         -0.1327,  0.1803,  0.2369, -0.2182,  0.1543, -0.2150, -0.0699, -0.0430,
         -0.1958, -0.0506, -0.0059, -0.0024],
        [ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000],
        [-0.1427, -0.4414,  0.7181, -0.5751, -0.3183,  0.0552, -1.6764, -0.3177,
          0.6592,  1.6143, -0.1920, -0.1881, -0.4321, -0.8578,  0.5266,  0.5243,
         -0.7083, -0.0048, -1.4680,  1.1425],
        [-0.4700, -0.0363,  0.0560, -0.7394, -0.2412, -0.4197, -1.7096,  0.9444,
          0.9633,  0.3703, -0.2243, -1.5279, -1.9086,  0.5718, -0.5721, -0.6015,
          0.3579, -0.3834,  0.8079,  1.0553],
        [-0.7055,  0.0954,  0.4646, -1.6595,  0.1138,  0.2208, -0.0220,  0.7397,
         -0.1153,  0.3586,  0.3040, -0.6414, -0.1579, -0.2738, -0.6942,  0.0083,
      

All looks good! The only difference between the two is the removal of the ~50 words in the vocabulary that contain non-ASCII characters.
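
One consequence of skipping words is that row numbers in the saved file stop lining up with the model's embedding rows after the first skipped word, so comparisons beyond the first few rows are safer done by word. A plain-Python sketch with a hypothetical five-word vocabulary and made-up 1-dimensional vectors:

```python
itos = ['<unk>', '<pad>', 'good', 'café', 'bad']  # hypothetical vocabulary
weights = [[0.1], [0.0], [1.0], [0.4], [-1.0]]    # made-up embedding rows

# mimic write_embeddings: drop words containing non-ASCII characters
saved = [(w, v) for w, v in zip(itos, weights) if len(w) == len(w.encode())]
saved_stoi = {w: i for i, (w, _) in enumerate(saved)}

same_row_before = saved[2][1] == weights[2]  # row 2 is before the skipped word
same_row_after = saved[3][1] == weights[3]   # row 3 is shifted by the skipped word
same_by_word = saved[saved_stoi['bad']][1] == weights[itos.index('bad')]

print(same_row_before, same_row_after, same_by_word)  # prints: True False True
```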