added optional appendix for how to use your own dataset with torchtext
commit f297a84cab (parent 73bc184c46)
1 .gitignore vendored
@@ -102,6 +102,5 @@ ENV/
#data
.data/*
data/*
.vector_cache/*
saves/*
@@ -27,7 +27,7 @@
 "\n",
 "The parameters of a `Field` specify how the data should be processed. \n",
 "\n",
-"Our `TEXT` field has `tokenize='spacy'`, which defines that the \"tokenization\" (the act of splitting the string into discrete \"tokens\") should be done using the [spaCy](https://spacy.io) tokenizer. \n",
+"Our `TEXT` field has `tokenize='spacy'`, which defines that the \"tokenization\" (the act of splitting the string into discrete \"tokens\") should be done using the [spaCy](https://spacy.io) tokenizer. If no `tokenize` argument is passed, the default is simply splitting the string on spaces.\n",
 "\n",
 "`LABEL` is defined by a `LabelField`, a special subset of the `Field` class specifically for handling labels. We will explain the `tensor_type` argument later.\n",
 "\n",
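A minimal sketch of the fields this hunk describes, assuming the legacy torchtext API used throughout the series:

```python
import torch
from torchtext import data

# tokenize='spacy' tokenizes with spaCy; with no tokenize argument,
# a Field defaults to splitting the string on spaces
TEXT = data.Field(tokenize='spacy')

# LabelField handles labels; tensor_type is explained later in the notebook
LABEL = data.LabelField(tensor_type=torch.FloatTensor)
```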
@@ -159,7 +159,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"The IMDb dataset only has train/test splits, so we need to create a validation set. We can do this with the `.split()` method."
+"The IMDb dataset only has train/test splits, so we need to create a validation set. We can do this with the `.split()` method. \n",
+"\n",
+"By default this splits 70/30; however, by passing a `split_ratio` argument, we can change the ratio of the split, i.e. a `split_ratio` of 0.8 would mean 80% of the examples make up the training set and 20% make up the validation set."
 ]
 },
 {
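A minimal sketch of the two split variants described above:

```python
# default split: 70% train / 30% validation
train, valid = train.split()

# or, with an explicit ratio: 80% train / 20% validation
# train, valid = train.split(split_ratio=0.8)
```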
@@ -571,7 +573,11 @@
 "\n",
 "The loss and accuracy are accumulated across the epoch; the `.item()` method is used to extract a scalar from a tensor which only contains a single value.\n",
 "\n",
-"Finally, we return the loss and accuracy, averaged across the epoch. The len of an iterator is the number of batches in the iterator."
+"Finally, we return the loss and accuracy, averaged across the epoch. The `len` of an iterator is the number of batches in the iterator.\n",
+"\n",
+"You may recall that when initializing the `LABEL` field, we set `tensor_type=torch.FloatTensor`. This is because TorchText sets tensors to be `LongTensor`s by default; however, our criterion expects both inputs to be `FloatTensor`s. As we have manually set the `tensor_type` to be `FloatTensor`s, this conversion is done for us.\n",
+"\n",
+"Another method would be to do the conversion inside the `train` function by passing `batch.label.float()` instead of `batch.label` to the criterion. "
 ]
 },
 {
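A hedged sketch of the training loop this hunk documents; the model and criterion are assumptions carried over from the series, and accuracy would be accumulated analogously:

```python
def train(model, iterator, optimizer, criterion):
    epoch_loss = 0

    for batch in iterator:
        optimizer.zero_grad()
        predictions = model(batch.text).squeeze(1)

        # batch.label is already a FloatTensor because LABEL was created
        # with tensor_type=torch.FloatTensor; the alternative would be
        # criterion(predictions, batch.label.float())
        loss = criterion(predictions, batch.label)
        loss.backward()
        optimizer.step()

        # .item() extracts the Python scalar from a one-element tensor
        epoch_loss += loss.item()

    # len(iterator) is the number of batches in the iterator
    return epoch_loss / len(iterator)
```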
@@ -69,7 +69,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"TorchText `Field`s have a `preprocessing` argument. A function passed here will be applied to a sentence after it has been tokenized, but before it has been indexed. Here, we pass our `generate_bigrams` function."
+"TorchText `Field`s have a `preprocessing` argument. A function passed here will be applied to a sentence after it has been tokenized (transformed from a string into a list of tokens), but before it has been indexed (transformed from a token to an integer). Here, we pass our `generate_bigrams` function."
 ]
 },
 {
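A hedged sketch of the mechanism; this `generate_bigrams` is an illustrative stand-in for the function defined in the notebook:

```python
from torchtext import data

def generate_bigrams(x):
    # x is a list of tokens; append each bigram as an extra "token"
    n_grams = set(zip(*[x[i:] for i in range(2)]))
    for n_gram in n_grams:
        x.append(' '.join(n_gram))
    return x

# applied after tokenization (list of tokens), before indexing (integers)
TEXT = data.Field(tokenize='spacy', preprocessing=generate_bigrams)
```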
@@ -110,7 +110,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Load the pre-trained word embeddings."
+"Build the vocab and load the pre-trained word embeddings."
 ]
 },
 {
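A minimal sketch of that step; the choice of `glove.6B.100d` vectors is an assumption (the downloaded vectors are cached in `.vector_cache/`, hence the `.gitignore` entry above):

```python
# build the vocabulary from the training set and attach pre-trained vectors
TEXT.build_vocab(train, vectors="glove.6B.100d")
LABEL.build_vocab(train)
```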
@@ -581,21 +581,6 @@
 "source": [
 "predict_sentiment(\"This film is great\")"
 ]
-},
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"## Next Steps\n",
-"\n"
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": []
 }
 ],
 "metadata": {
345 A - Using TorchText with Your Own Datasets.ipynb Normal file
@@ -0,0 +1,345 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# A - Using TorchText with Your Own Datasets\n",
"\n",
"In this series we have used the IMDb dataset included in TorchText. TorchText has many canonical datasets included for classification, language modelling, sequence tagging, etc. However, you'll frequently want to use your own datasets. Luckily, TorchText has functions to help you do this.\n",
"\n",
"Recall in the series, we:\n",
"- defined the `Field`s\n",
"- loaded the dataset\n",
"- created the splits\n",
"\n",
"As a reminder, the code is shown below:"
]
},
{
"cell_type": "code",
"execution_count": 119,
"metadata": {},
"outputs": [],
"source": [
"from torchtext import data\n",
"from torchtext import datasets\n",
"\n",
"TEXT = data.Field()\n",
"LABEL = data.LabelField()\n",
"\n",
"train, test = datasets.IMDB.splits(TEXT, LABEL)\n",
"\n",
"train, valid = train.split()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are three data formats TorchText can read: `json`, `tsv` (tab separated values) and `csv` (comma separated values).\n",
"\n",
"**In my opinion, the best format for TorchText is `json`, which I'll explain later on.**\n",
"\n",
"## Reading JSON\n",
"\n",
"Starting with `json`, your data must be in the `json lines` format, i.e. it must be something like:\n",
"\n",
"```\n",
"{\"name\": \"John\", \"location\": \"United Kingdom\", \"age\": 42, \"quote\": [\"i\", \"love\", \"the\", \"united kingdom\"]}\n",
"{\"name\": \"Mary\", \"location\": \"United States\", \"age\": 36, \"quote\": [\"i\", \"want\", \"more\", \"telescopes\"]}\n",
"```\n",
"\n",
"That is, each line is a `json` object. See `data/train.json` for an example.\n",
"\n",
"Below is a short sketch of producing such a file; we then define the fields:"
]
},
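A hedged sketch of producing such a json-lines file, using the example data above:

```python
import json

# one dict per example, with the quotes already tokenized into lists
examples = [
    {"name": "John", "location": "United Kingdom", "age": 42,
     "quote": ["i", "love", "the", "united kingdom"]},
    {"name": "Mary", "location": "United States", "age": 36,
     "quote": ["i", "want", "more", "telescopes"]},
]

# write one json object per line -- the "json lines" format
with open('data/train.json', 'w') as f:
    for example in examples:
        f.write(json.dumps(example) + '\n')
```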
{
"cell_type": "code",
"execution_count": 88,
"metadata": {},
"outputs": [],
"source": [
"NAME = data.Field()\n",
"SAYING = data.Field()\n",
"PLACE = data.Field()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we must tell TorchText which fields apply to which elements of the `json` object. \n",
"\n",
"For `json` data, we must create a dictionary where:\n",
"- the key matches the key of the `json` object\n",
"- the value is a tuple where:\n",
" - the first element becomes the batch object's attribute name\n",
" - the second element is the name of the `Field`\n",
" \n",
"What do we mean when we say \"becomes the batch object's attribute name\"? Recall that in the previous exercises we accessed the `TEXT` and `LABEL` fields in the train/evaluation loop by using `batch.text` and `batch.label`. Here, to access the name we use `batch.n`, to access the location we use `batch.p`, etc.\n",
"\n",
"A few notes:\n",
"\n",
"- The order of the keys in the `fields` dictionary does not matter, as long as its keys match the `json` data keys.\n",
"\n",
"- The `Field` name does not have to match the key in the `json` object, i.e. we use `PLACE` for the `\"location\"` field.\n",
"\n",
"- When dealing with `json` data, not all of the keys have to be used, i.e. we did not use the `\"age\"` field.\n",
"\n",
"- If the value of a `json` field is a string then the `Field`'s tokenization is applied; however, if the value is a list then no tokenization is applied. Usually it is a good idea for the data to already be tokenized into a list; this saves time as you don't have to wait for TorchText to do it.\n",
"\n",
"- The values of the `json` fields do not have to be the same type across examples. Some examples can have their `\"quote\"` as a string, and some as a list. The tokenization will only be applied to the ones with their `\"quote\"` as a string.\n",
"\n",
"- If you are using a `json` field, every single example must have an instance of that field, i.e. in this example all examples must have a name, location and quote. However, as we are not using the `\"age\"` field, it does not matter if an example does not have it."
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {},
"outputs": [],
"source": [
"fields = {'name': ('n', NAME), 'location': ('p', PLACE), 'quote': ('s', SAYING)}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We then create our `train` and `test` datasets with the `TabularDataset.splits` function. \n",
"\n",
"The `path` argument specifies the top level folder common to both datasets, and the `train` and `test` arguments specify the filename of each dataset, i.e. here the train dataset is located at `data/train.json`.\n",
"\n",
"We tell the function we are using `json` data, and pass in our `fields` dictionary defined previously."
]
},
{
"cell_type": "code",
"execution_count": 95,
"metadata": {},
"outputs": [],
"source": [
"train, test = data.TabularDataset.splits(\n",
" path = 'data',\n",
" train = 'train.json',\n",
" test = 'test.json',\n",
" format = 'json',\n",
" fields = fields\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you already have a validation dataset, its location can be passed as the `validation` argument."
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {},
"outputs": [],
"source": [
"train, valid, test = data.TabularDataset.splits(\n",
" path = 'data',\n",
" train = 'train.json',\n",
" validation = 'valid.json',\n",
" test = 'test.json',\n",
" format = 'json',\n",
" fields = fields\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can then view an example to make sure it has worked correctly.\n",
"\n",
"Notice how the field names (`n`, `p` and `s`) match up with what was defined in the `fields` dictionary.\n",
"\n",
"Also notice how `\"United Kingdom\"` in `p` has been split by the tokenization, whereas `\"united kingdom\"` in `s` has not. This is due to what was mentioned previously: TorchText assumes that any `json` fields that are lists are already tokenized, so no further tokenization is applied. "
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"vars(train[0]): {'n': ['John'], 'p': ['United', 'Kingdom'], 's': ['i', 'love', 'the', 'united kingdom']}\n"
]
}
],
"source": [
"print('vars(train[0]):', vars(train[0]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now use `train`, `test` and `valid` to build a vocabulary and create iterators, as in the other notebooks (a sketch of this appears at the end of this notebook).\n",
"\n",
"## Reading CSV/TSV\n",
"\n",
"`csv` and `tsv` are very similar: one has elements separated by commas and the other by tabs.\n",
"\n",
"Using the same example above, our `tsv` data will be in the form of:\n",
"\n",
"```\n",
"name\tlocation\tage\tquote\n",
"John\tUnited Kingdom\t42\ti love the united kingdom\n",
"Mary\tUnited States\t36\ti want more telescopes\n",
"```\n",
"\n",
"That is, on each row the elements are separated by tabs and we have one example per row. The first row is usually a header (i.e. the name of each of the columns), but your data could have no header.\n",
"\n",
"You cannot have lists within `tsv` or `csv` data.\n",
"\n",
"The way the fields are defined is a bit different to `json`. We now use a list of tuples, where the elements of each tuple are the same as before, i.e. the first element is the batch object's attribute name and the second element is the `Field` name.\n",
"\n",
"Unlike the `json` data, the tuples have to be in the same order that they are within the `tsv` data. Due to this, when skipping a column of data a tuple of `None`s needs to be used; if not, our `SAYING` field would be applied to the `age` column of the `tsv` data and the `quote` column would not be used. \n",
"\n",
"However, if you only wanted to use the `name` and `location` columns, you could just use two tuples, as they are the first two columns.\n",
"\n",
"We change our `TabularDataset` to read the correct `.tsv` files, and change the `format` argument to `'tsv'`.\n",
"\n",
"If your data has a header, which ours does, it must be skipped by passing `skip_header = True`. If not, TorchText will think the header is an example. By default, `skip_header` is `False`."
]
},
{
"cell_type": "code",
"execution_count": 116,
"metadata": {},
"outputs": [],
"source": [
"fields = [('n', NAME), ('p', PLACE), (None, None), ('s', SAYING)]"
]
},
{
"cell_type": "code",
"execution_count": 117,
"metadata": {},
"outputs": [],
"source": [
"train, valid, test = data.TabularDataset.splits(\n",
" path = 'data',\n",
" train = 'train.tsv',\n",
" validation = 'valid.tsv',\n",
" test = 'test.tsv',\n",
" format = 'tsv',\n",
" fields = fields,\n",
" skip_header = True\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 118,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"vars(train[0]): {'n': ['John'], 'p': ['United', 'Kingdom'], 's': ['i', 'love', 'the', 'united', 'kingdom']}\n"
]
}
],
"source": [
"print('vars(train[0]):', vars(train[0]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we'll cover `csv` files. \n",
"\n",
"This is pretty much exactly the same as the `tsv` files, except with the `format` argument set to `'csv'`."
]
},
{
"cell_type": "code",
"execution_count": 123,
"metadata": {},
"outputs": [],
"source": [
"fields = [('n', NAME), ('p', PLACE), (None, None), ('s', SAYING)]"
]
},
{
"cell_type": "code",
"execution_count": 124,
"metadata": {},
"outputs": [],
"source": [
"train, valid, test = data.TabularDataset.splits(\n",
" path = 'data',\n",
" train = 'train.csv',\n",
" validation = 'valid.csv',\n",
" test = 'test.csv',\n",
" format = 'csv',\n",
" fields = fields,\n",
" skip_header = True\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 125,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"vars(train[0]): {'n': ['John'], 'p': ['United', 'Kingdom'], 's': ['i', 'love', 'the']}\n"
]
}
],
"source": [
"print('vars(train[0]):', vars(train[0]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Why JSON over CSV/TSV?\n",
"\n",
"1. Your `csv` or `tsv` data cannot be lists. This means data cannot be tokenized during the pre-processing step, which means every time you run your Python script that reads this data via TorchText, it has to be tokenized. Using advanced tokenizers, such as the `spaCy` tokenizer, takes a non-negligible amount of time, especially if you are running your script multiple times. Thus, it is better to tokenize your data from a string into a list and use the `json` format.\n",
"\n",
"2. If tabs appear in your `tsv` data, or commas appear in your `csv` data, TorchText will think they are delimiters between columns. This will cause your data to be parsed incorrectly, and worst of all TorchText will not alert you to this as it cannot tell the difference between a tab/comma in a field and a tab/comma as a delimiter. As `json` data is essentially a dictionary, you access the data within the fields via its key, so you do not have to worry about surprise delimiters."
]
}
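As promised above, a hedged sketch of building vocabularies and iterators from the custom dataset; `BucketIterator` and the `sort_key` are assumptions consistent with the rest of the series:

```python
NAME.build_vocab(train)
PLACE.build_vocab(train)
SAYING.build_vocab(train)

# a small batch_size suits the tiny example data; batches are sorted by
# quote length via the 's' attribute named in the fields mapping
train_iter, valid_iter, test_iter = data.BucketIterator.splits(
    (train, valid, test),
    batch_size = 2,
    sort_key = lambda x: len(x.s)
)

for batch in train_iter:
    print(batch.n)  # attribute names come from the fields mapping
```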
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
1983 Untitled.ipynb (file diff suppressed because it is too large)
3 data/test.csv Normal file
@@ -0,0 +1,3 @@
name,location,age,quote
Craig,Finland,29,go baseball team!
Janet,Hong Kong,24,knowledge is great

2 data/test.json Normal file
@@ -0,0 +1,2 @@
{"name": "Craig", "location": "Finland", "age": 29, "quote": ["go", "baseball", "team", "!"]}
{"name": "Janet", "location": "Hong Kong", "age": 24, "quote": ["knowledge", "is", "great"]}

3 data/test.tsv Normal file
@@ -0,0 +1,3 @@
name	location	age	quote
Craig	Finland	29	go baseball team!
Janet	Hong Kong	24	knowledge is great

3 data/train.csv Normal file
@@ -0,0 +1,3 @@
name,location,age,quote
John,United Kingdom,42,i love the united kingdom
Mary,United States,36,i want more telescopes

2 data/train.json Normal file
@@ -0,0 +1,2 @@
{"name": "John", "location": "United Kingdom", "age": 42, "quote": ["i", "love", "the", "united kingdom"]}
{"name": "Mary", "location": "United States", "age": 36, "quote": ["i", "want", "more", "telescopes"]}

3 data/train.tsv Normal file
@@ -0,0 +1,3 @@
name	location	age	quote
John	United Kingdom	42	i love the united kingdom
Mary	United States	36	i want more telescopes

3 data/valid.csv Normal file
@@ -0,0 +1,3 @@
name,location,age,quote
Fred,France,21,what am i doing?
Pauline,Canada,44,hello world

2 data/valid.json Normal file
@@ -0,0 +1,2 @@
{"name": "Fred", "location": "France", "age": 21, "quote": ["what", "am", "i", "doing", "?"]}
{"name": "Pauline", "location": "Canada", "age": 44, "quote": ["hello", "world"]}

3 data/valid.tsv Normal file
@@ -0,0 +1,3 @@
name	location	age	quote
Fred	France	21	what am i doing?
Pauline	Canada	44	hello world