updated appendix B - formatting and typos

This commit is contained in:
bentrevett 2019-04-01 17:08:38 +01:00
parent 669a9b8436
commit 856d9e5598


@@ -43,7 +43,7 @@
"source": [
"import torchtext.vocab\n",
"\n",
"glove = torchtext.vocab.GloVe(name='6B', dim=100)\n",
"glove = torchtext.vocab.GloVe(name = '6B', dim = 100)\n",
"\n",
"print(f'There are {len(glove.itos)} words in the vocabulary')"
]
@@ -164,7 +164,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll be doing this a lot, so we'll create a function that takes in word embeddings and a word and returns the associated vector. It'll also throw an error if the word doesn't exist in the vocabulary."
"We'll be doing this a lot, so we'll create a function that takes in word embeddings and a word then returns the associated vector. It'll also throw an error if the word doesn't exist in the vocabulary."
]
},
{
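The get_vector function referenced above is not shown in this diff. A minimal sketch of what it might look like, assuming the standard torchtext GloVe attributes stoi and vectors (hypothetical, not necessarily the author's exact code):

def get_vector(embeddings, word):
    # raise an error if the word is not in the vocabulary
    assert word in embeddings.stoi, f'{word} is not in the vocabulary'
    # return the embedding row for this word
    return embeddings.vectors[embeddings.stoi[word]]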
@@ -213,7 +213,7 @@
"\n",
"Now to start looking at the context of different words. \n",
"\n",
"If we want to find the words similar to a certain input word, we first find the vector of this input word, then we scan through our vocabulary finding any vectors similar to this input word vector.\n",
"If we want to find the words similar to a certain input word, we first find the vector of this input word, then we scan through our vocabulary calculating the distance between the vector of each word and our input word vector. We then sort these from closest to furthest away.\n",
"\n",
"The function below returns the closest 10 words to an input word vector:"
]
@@ -226,8 +226,11 @@
"source": [
"import torch\n",
"\n",
"def closest_words(embeddings, vector, n=10):\n",
" distances = [(w, torch.dist(vector, get_vector(embeddings, w)).item()) for w in embeddings.itos]\n",
"def closest_words(embeddings, vector, n = 10):\n",
" \n",
" distances = [(word, torch.dist(vector, get_vector(embeddings, word)).item())\n",
" for word in embeddings.itos]\n",
" \n",
" return sorted(distances, key = lambda w: w[1])[:n]"
]
},
@@ -266,7 +269,9 @@
}
],
"source": [
"closest_words(glove, get_vector(glove, 'korea'))"
"word_vector = get_vector(glove, 'korea')\n",
"\n",
"closest_words(glove, word_vector)"
]
},
{
@@ -302,7 +307,9 @@
}
],
"source": [
"closest_words(glove, get_vector(glove, 'india'))"
"word_vector = get_vector(glove, 'india')\n",
"\n",
"closest_words(glove, word_vector)"
]
},
{
@@ -353,7 +360,9 @@
}
],
"source": [
"print_tuples(closest_words(glove, get_vector(glove, 'sports')))"
"word_vector = get_vector(glove, 'sports')\n",
"\n",
"print_tuples(closest_words(glove, word_vector))"
]
},
{
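print_tuples is used in the cells below but never defined in this diff. A plausible sketch, assuming it simply formats the (word, distance) tuples returned by closest_words (the exact formatting is an assumption):

def print_tuples(tuples):
    # print each (word, distance) pair, distance first for readability
    for word, distance in tuples:
        print(f'({distance:.4f}) {word}')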
@@ -375,9 +384,20 @@
"source": [
"def analogy(embeddings, word1, word2, word3, n=5):\n",
" \n",
" candidate_words = closest_words(embeddings, get_vector(embeddings, word2) - get_vector(embeddings, word1) + get_vector(embeddings, word3), n+3)\n",
" #get vectors for each word\n",
" word1_vector = get_vector(embeddings, word1)\n",
" word2_vector = get_vector(embeddings, word2)\n",
" word3_vector = get_vector(embeddings, word3)\n",
" \n",
" candidate_words = [x for x in candidate_words if x[0] not in [word1, word2, word3]][:n]\n",
" #calculate analogy vector\n",
" analogy_vector = word2_vector - word1_vector + word3_vector\n",
" \n",
" #find closest words to analogy vector\n",
" candidate_words = closest_words(embeddings, analogy_vector, n+3)\n",
" \n",
" #filter out words already in analogy\n",
" candidate_words = [(word, dist) for (word, dist) in candidate_words \n",
" if word not in [word1, word2, word3]][:n]\n",
" \n",
" print(f'{word1} is to {word2} as {word3} is to...')\n",
" \n",
@@ -561,7 +581,7 @@
"source": [
"## Correcting Spelling Mistakes\n",
"\n",
"Very recently, someone has found out that you can actually use word embeddings to correct spelling mistakes! \n",
"Another interesting property of word embeddings is that they can actually be used to correct spelling mistakes! \n",
"\n",
"We'll put their findings into code and briefly explain them, but to read more about this, check out the [original thread](http://forums.fast.ai/t/nlp-any-libraries-dictionaries-out-there-for-fixing-common-spelling-errors/16411) and the associated [write-up](https://blog.usejournal.com/a-simple-spell-checker-built-from-word-vectors-9f28452b6f26).\n",
"\n",
@@ -576,7 +596,7 @@
"metadata": {},
"outputs": [],
"source": [
"glove = torchtext.vocab.GloVe(name='840B', dim=300)"
"glove = torchtext.vocab.GloVe(name = '840B', dim = 300)"
]
},
{
@@ -636,7 +656,9 @@
}
],
"source": [
"print_tuples(closest_words(glove, get_vector(glove, 'korea')))"
"word_vector = get_vector(glove, 'korea')\n",
"\n",
"print_tuples(closest_words(glove, word_vector))"
]
},
{
@@ -669,20 +691,22 @@
}
],
"source": [
"print_tuples(closest_words(glove, get_vector(glove, 'relieable')))"
"word_vector = get_vector(glove, 'relieable')\n",
"\n",
"print_tuples(closest_words(glove, word_vector))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice how the correct spelling of \"reliable\" does not appear in the top 10 closest words. Surely the misspellings of a word should appear next to the correct spelling of the word as they appear in the same context, right? \n",
"Notice how the correct spelling, \"reliable\", does not appear in the top 10 closest words. Surely the misspellings of a word should appear next to the correct spelling of the word as they appear in the same context, right? \n",
"\n",
"The hypothesis is that misspellings of a certain word are all shifted away from the correct spelling. This is because articles of text that contain spelling mistakes are usually written in an informal manner (such as tweets/blog posts), thus spelling errors will appear together as they appear in context of informal articles.\n",
"The hypothesis is that misspellings of words are all equally shifted away from their correct spelling. This is because articles of text that contain spelling mistakes are usually written in an informal manner where correct spelling doesn't matter as much (such as tweets/blog posts), thus spelling errors will appear together as they appear in context of informal articles.\n",
"\n",
"Similar to how we created analogies before, we can create a \"correct spelling\" vector. This time, instead of using a single example to create our vector, we'll use the average of multiple examples. This will hopefully give better accuracy!\n",
"\n",
"We first create a vector for the correct spelling, 'reliable', then calculate the difference between the \"reliable vector\" and each of the 8 misspellings of 'reliable'."
"We first create a vector for the correct spelling, 'reliable', then calculate the difference between the \"reliable vector\" and each of the 8 misspellings of 'reliable'. As we are going to concatenate these 8 misspelling tensors together we need to unsqueeze a \"batch\" dimension to them."
]
},
{
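As a toy illustration (not part of the notebook) of why the batch dimension is needed: each difference vector has shape [300], unsqueeze(0) turns it into [1, 300], and torch.cat can then stack the 8 of them into [8, 300] before the mean is taken:

import torch

v = torch.randn(300)                   # one difference vector, shape [300]
v = v.unsqueeze(0)                     # add a batch dimension, shape [1, 300]
stacked = torch.cat([v] * 8, dim = 0)  # shape [8, 300]
averaged = stacked.mean(dim = 0)       # back to shape [300]
print(stacked.shape, averaged.shape)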
@@ -693,9 +717,11 @@
"source": [
"reliable_vector = get_vector(glove, 'reliable')\n",
"\n",
"reliable_misspellings = ['relieable', 'relyable', 'realible', 'realiable', 'relable', 'relaible', 'reliabe', 'relaiable']\n",
"reliable_misspellings = ['relieable', 'relyable', 'realible', 'realiable', \n",
" 'relable', 'relaible', 'reliabe', 'relaiable']\n",
"\n",
"diff_reliable = [(reliable_vector - get_vector(glove, s)).unsqueeze(0) for s in reliable_misspellings]"
"diff_reliable = [(reliable_vector - get_vector(glove, s)).unsqueeze(0) \n",
" for s in reliable_misspellings]"
]
},
{
@@ -711,7 +737,7 @@
"metadata": {},
"outputs": [],
"source": [
"misspelling_vector = torch.cat(diff_reliable, dim=0).mean(dim=0)"
"misspelling_vector = torch.cat(diff_reliable, dim = 0).mean(dim = 0)"
]
},
{
@@ -746,7 +772,9 @@
}
],
"source": [
"print_tuples(closest_words(glove, get_vector(glove, 'becuase') + misspelling_vector))"
"word_vector = get_vector(glove, 'becuase')\n",
"\n",
"print_tuples(closest_words(glove, word_vector + misspelling_vector))"
]
},
{
@@ -779,7 +807,9 @@
}
],
"source": [
"print_tuples(closest_words(glove, get_vector(glove, 'defintiely') + misspelling_vector))"
"word_vector = get_vector(glove, 'defintiely')\n",
"\n",
"print_tuples(closest_words(glove, word_vector + misspelling_vector))"
]
},
{
@@ -812,7 +842,9 @@
}
],
"source": [
"print_tuples(closest_words(glove, get_vector(glove, 'consistant') + misspelling_vector))"
"word_vector = get_vector(glove, 'consistant')\n",
"\n",
"print_tuples(closest_words(glove, word_vector + misspelling_vector))"
]
},
{
@@ -845,7 +877,9 @@
}
],
"source": [
"print_tuples(closest_words(glove, get_vector(glove, 'pakage') + misspelling_vector))"
"word_vector = get_vector(glove, 'pakage')\n",
"\n",
"print_tuples(closest_words(glove, word_vector + misspelling_vector))"
]
},
{
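Putting these steps together, a hypothetical helper (not in the original notebook) that applies misspelling_vector to any misspelled word could look like:

def correct_spelling(embeddings, misspelled_word, misspelling_vector, n = 10):
    # shift the misspelled word's vector by the average 'correct spelling' offset,
    # then find the closest words to the shifted vector
    word_vector = get_vector(embeddings, misspelled_word)
    return closest_words(embeddings, word_vector + misspelling_vector, n)

print_tuples(correct_spelling(glove, 'becuase', misspelling_vector))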
@@ -872,7 +906,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
"version": "3.7.0"
}
},
"nbformat": 4,