Commit 7ae45fcff7
This commit adds a new output format to the program: JSON. When invoked with the --json flag, the program writes several files with several pages per file (as before), with one JSON object per line representing a single page. The information contained in each JSON object is the same as in the default format, but it is somewhat more straightforward for other tools to parse. The running time is the same as for the default format, as is the compressed output size. The following is a simple benchmark on an i5 machine, using enwiki-20170120-pages-meta-current1.xml-p000000010p000030303.bz2 as input:
$ ./WikiExtractor.py -o xml --compress --no-templates input.bz2
INFO: Finished 3-process extraction of 15084 articles in 335.7s (44.9 art/s)
$ ./WikiExtractor.py -o json --json --compress --no-templates input.bz2
INFO: Finished 3-process extraction of 15084 articles in 336.8s (44.8 art/s)
$ du -sh json xml
69M json
69M xml
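Because each line of the JSON output is a standalone object, downstream tools can stream it line by line. Below is a minimal sketch of such a reader, assuming bzip2-compressed output and an OUTPUT/AA/wiki_00.bz2 file layout; the field names ("id", "title") mirror the default <doc> attributes but are an assumption here, not something the commit message confirms.

import bz2
import json

def iter_pages(path):
    # Yield one dict per page from a compressed JSON-lines output file.
    with bz2.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # one JSON object per non-blank line
                yield json.loads(line)

for page in iter_pages('json/AA/wiki_00.bz2'):
    print(page.get('id'), page.get('title'))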
WikiExtractor
WikiExtractor.py is a Python script that extracts and cleans text from a Wikipedia database dump.
The tool is written in Python and requires Python 2.7 or Python 3.3+, but no additional libraries.
For further information, see the project Home Page or the Wiki.
Wikipedia Cirrus Extractor
cirrus-extract.py
is a version of the script that performs extraction from a Wikipedia Cirrus dump.
Cirrus dumps contain text with already expanded templates.
Cirrus dumps are available at: cirrussearch.
Details
WikiExtractor performs template expansion by preprocessing the whole dump and extracting template definitions.
In order to speed up processing:
- multiprocessing is used for dealing with articles in parallel
- a cache of parsed templates is kept (only useful for repeated extractions); a minimal sketch of the idea follows this list.
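The sketch below illustrates only the caching idea; the function names and the trivial stand-in parser are hypothetical, not WikiExtractor's actual internals.

_template_cache = {}

def parse_template(wikitext):
    # Stand-in for real template parsing; the real script builds a richer structure.
    return wikitext.split('|')

def cached_template(title, wikitext):
    # Parse each template once and reuse the parsed form on later expansions.
    if title not in _template_cache:
        _template_cache[title] = parse_template(wikitext)
    return _template_cache[title]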
Installation
The script may be invoked directly; however, it can also be installed by running:
(sudo) python setup.py install
Usage
The script is invoked with a Wikipedia dump file as an argument. The output is stored in several files of similar size in a given directory. Each file contains several documents in the document format shown below.
usage: WikiExtractor.py [-h] [-o OUTPUT] [-b n[KMG]] [-c] [--html] [-l] [-s]
[--lists] [-ns ns1,ns2] [-xns ns1,ns2]
[--templates TEMPLATES] [--no-templates]
[-r] [--min_text_length MIN_TEXT_LENGTH]
[--filter_disambig_pages] [--processes PROCESSES] [-q]
[--debug] [-a] [-v]
input
Wikipedia Extractor:
Extracts and cleans text from a Wikipedia database dump and stores output in a
number of files of similar size in a given directory.
Each file will contain several documents in the format:
<doc id="" revid="" url="" title="">
...
</doc>
Template expansion requires first preprocessing the whole dump and
collecting template definitions.
positional arguments:
input XML wiki dump file
optional arguments:
-h, --help show this help message and exit
--processes PROCESSES number of processes to use (default: number of CPU cores)
Output:
-o OUTPUT, --output OUTPUT
directory for extracted files (or '-' for dumping to
stdout)
-b n[KMG], --bytes n[KMG]
maximum bytes per output file (default 1M)
-c, --compress compress output files using bzip
Processing:
--html produce HTML output, subsumes --links
-l, --links preserve links
-s, --sections preserve sections
--lists preserve lists
-ns ns1,ns2, --namespaces ns1,ns2
accepted link namespaces
-xns ns1,ns2, --xml_namespaces ns1,ns2
accepted page xml namespaces -- 0 for main/articles
--templates TEMPLATES
use or create file containing templates
--no-templates Do not expand templates
-r, --revision Include the document revision id (default=False)
--min_text_length MIN_TEXT_LENGTH
Minimum expanded text length required to write
document (default=0)
--filter_disambig_pages
Remove pages from output that contain disambiguation
markup (default=False)
-it, --ignored_tags
comma separated list of tags that will be dropped, keeping their content
-de, --discard_elements
comma separated list of elements that will be removed from the article text
--keep_tables
Preserve tables in the output article text (default=False)
Special:
-q, --quiet suppress reporting progress info
--debug print debug info
-a, --article analyze a file containing a single article (debug
option)
-v, --version print program version
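Downstream tools usually need to read this <doc> stream back. Below is a minimal sketch of such a reader, assuming uncompressed output; attributes are collected generically, since revid appears only when -r is given.

import re

_doc_open = re.compile(r'<doc ([^>]*)>')
_attr = re.compile(r'(\w+)="([^"]*)"')

def iter_docs(path):
    # Yield (attributes, text) for each <doc>...</doc> block in one extracted file.
    with open(path, encoding='utf-8') as f:
        attrs, lines = None, []
        for line in f:
            m = _doc_open.match(line)
            if m:
                attrs, lines = dict(_attr.findall(m.group(1))), []
            elif line.startswith('</doc>'):
                yield attrs, ''.join(lines)
                attrs = None
            elif attrs is not None:
                lines.append(line)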
Saving templates to a file will speed up extraction the next time, assuming template definitions have not changed.
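For example, the first invocation below creates the template file and later runs reuse it (templates.txt and the output directory name are arbitrary):
$ ./WikiExtractor.py --templates templates.txt -o extracted input.bz2
$ ./WikiExtractor.py --templates templates.txt -o extracted input.bz2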
Option --no-templates significantly speeds up the extractor, avoiding the cost of expanding MediaWiki templates.
For further information, visit the documentation.