A tool for extracting plain text from Wikipedia dumps
Roman Prokofyev 70ee947a8b multiline tag match fix
need to add re.DOTALL so that multiline tag definitions are also matched
2015-06-12 14:23:14 +02:00
ChangeLog See ChangeLog. 2015-06-03 00:01:45 +02:00
extractPage.py See ChangeLog. 2015-04-15 14:30:55 +02:00
LICENSE Initial commit 2015-03-22 13:03:01 +01:00
README.md Update README.md 2015-04-26 08:57:25 +02:00
WikiExtractor.py multiline tag match fix 2015-06-12 14:23:14 +02:00

wikiextractor

WikiExtractor.py is a Python script that extracts and cleans text from a Wikipedia database dump.

The tool is written in Python and requires no additional libraries.

For further information, see the project Home Page or the Wiki.

This is a beta version that performs template expansion by preprocessing the whole dump and extracting template definitions. The current version keeps a cache of parsed templates, which makes it about twice as fast as the previous version.

Usage

The script is invoked with a Wikipedia dump file as an argument. The output is stored in a number of files of similar size in a chosen directory. Each file contains several documents in this document format.
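Reading the output back could look like the sketch below. The document layout is not spelled out here, so the code assumes each document is wrapped in a `<doc id="..." url="..." title="...">` ... `</doc>` block; `DOC_RE`, `iter_docs` and the sample data are illustrative names and values, not part of the tool.

```python
import re

# Assumed layout of one extracted document (an assumption, see above):
#   <doc id="12" url="..." title="Some title">
#   text...
#   </doc>
DOC_RE = re.compile(
    r'<doc id="(?P<id>[^"]*)" url="(?P<url>[^"]*)" title="(?P<title>[^"]*)">\n'
    r'(?P<text>.*?)\n?</doc>',
    re.DOTALL,  # let the text part span several lines
)


def iter_docs(data):
    """Yield (id, title, text) for every <doc> block in one output file."""
    for m in DOC_RE.finditer(data):
        yield m.group("id"), m.group("title"), m.group("text").strip()


# Made-up sample standing in for the contents of one output file.
sample = (
    '<doc id="1" url="http://example.org/wiki?curid=1" title="First">\n'
    'First\n\nSome text.\n'
    '</doc>\n'
    '<doc id="2" url="http://example.org/wiki?curid=2" title="Second">\n'
    'Second\n\nMore text.\n'
    '</doc>\n'
)
docs = list(iter_docs(sample))
```

For files written with `-c`, the contents would first have to be decompressed, for example with the standard `bz2` module.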

usage: WikiExtractor.py [-h] [-o OUTPUT] [-b n[KMG]] [-c] [--html] [-l]
		    [-ns ns1,ns2] [-s] [--templates TEMPLATES]
		    [--no-templates] [--threads THREADS] [-q] [--debug]
		    [-a] [-v]
		    input

positional arguments:
  input                 XML wiki dump file

optional arguments:
  -h, --help            show this help message and exit
  --threads THREADS     Number of threads to use (default 2)

Output:
  -o OUTPUT, --output OUTPUT
		    output directory
  -b n[KMG], --bytes n[KMG]
		    put specified bytes per output file (default is 1M)
  -c, --compress        compress output files using bzip

Processing:
  --html                produce HTML output, subsumes --links and --sections
  -l, --links           preserve links
  -ns ns1,ns2, --namespaces ns1,ns2
		    accepted namespaces
  -s, --sections        preserve sections
  --templates TEMPLATES
		    use or create file containing templates
  --no-templates        Do not expand templates

Special:
  -q, --quiet           suppress reporting progress info
  --debug               print debug info
  -a, --article         analyze a file containing a single article (debug
		    option)
  -v, --version         print program version
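The `n[KMG]` notation accepted by `-b` can be read as a number followed by an optional unit suffix. A minimal sketch of such a conversion follows; `parse_size` is a hypothetical helper, and the use of binary multiples (1K = 1024 bytes) is an assumption rather than something stated in the help text.

```python
def parse_size(spec):
    """Convert a size such as '500K', '1M' or '2G' to a number of bytes."""
    # Assumption: suffixes are binary multiples (1K = 1024 bytes).
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    suffix = spec[-1].upper()
    if suffix in units:
        return int(spec[:-1]) * units[suffix]
    return int(spec)  # no suffix: plain byte count
```

Under that assumption, the default of `1M` corresponds to 1048576 bytes per output file.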

Saving templates to a file speeds up subsequent extractions, provided the template definitions have not changed in the meantime.

The --no-templates option significantly speeds up the extractor by avoiding the cost of expanding MediaWiki templates.
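The two ways of trading speed against template expansion can be sketched as command lines assembled from Python. Only flags listed in the usage text are used; `build_command`, the `python` interpreter name and the file names `dump.xml`, `extracted` and `templates.txt` are placeholders chosen for illustration.

```python
def build_command(dump, out_dir, templates_file=None, expand_templates=True,
                  bytes_per_file=None, compress=False):
    """Assemble an argument list for WikiExtractor.py (hypothetical helper)."""
    cmd = ["python", "WikiExtractor.py", "-o", out_dir]
    if bytes_per_file:
        cmd += ["-b", bytes_per_file]
    if compress:
        cmd.append("-c")
    if not expand_templates:
        # Fastest option: skip template expansion entirely.
        cmd.append("--no-templates")
    elif templates_file:
        # Reuse (or create) a file of template definitions between runs.
        cmd += ["--templates", templates_file]
    cmd.append(dump)  # the positional input argument goes last
    return cmd


fast = build_command("dump.xml", "extracted", expand_templates=False)
cached = build_command("dump.xml", "extracted", templates_file="templates.txt")
```

Either list could then be passed to `subprocess.check_call` to run the extraction.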

For further information, visit the documentation.