A tool for extracting plain text from Wikipedia dumps

Nathan Davies 15e589e5cf putting the double line spacing back		2017-02-02 05:38:10 -08:00
.gitignore	tidying up some of the code and adding comments.	2017-01-31 16:52:59 -08:00
ChangeLog	See ChanngeLog.	2017-01-15 09:08:35 +01:00
cirrus-extract.py	See ChangeLog.	2016-08-29 23:34:47 +02:00
extractPage.py	See ChangeLog.	2016-02-15 01:22:38 +01:00
LICENSE	Initial commit	2015-03-22 13:03:01 +01:00
README.md	Added new flags	2017-01-23 14:18:21 +00:00
setup.py	Added new flags	2017-01-23 14:18:21 +00:00
tests.py	Add test cases for Python 2	2016-06-18 13:44:45 +09:00
tox.ini	Add tox.ini	2016-06-18 13:44:45 +09:00
WikiExtractor.py	putting the double line spacing back	2017-02-02 05:38:10 -08:00

README.md

WikiExtractor

WikiExtractor.py is a Python script that extracts and cleans text from a Wikipedia database dump.

The tool is written in Python and requires Python 2.7 or Python 3.3+. It needs no additional libraries.

For further information, see the project Home Page or the Wiki.

Wikipedia Cirrus Extractor

cirrus-extract.py is a version of the script that performs extraction from a Wikipedia Cirrus dump. Cirrus dumps contain text in which templates have already been expanded.

Cirrus dumps are available at: cirrussearch.

Details

WikiExtractor performs template expansion by preprocessing the whole dump and extracting template definitions.

In order to speed up processing:

multiprocessing is used for dealing with articles in parallel
a cache is kept of parsed templates (only useful for repeated extractions).
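
The two speed-ups above can be illustrated with a minimal Python 3 sketch. It is not the code used by WikiExtractor.py: the `TEMPLATES` table and the names `expand_template`, `clean_article` and `extract_parallel` are hypothetical, and real template expansion is far more involved than this literal substitution.

```python
import functools
import multiprocessing
import re

# Hypothetical template table; the real tool collects definitions from the dump.
TEMPLATES = {"lang-la": "Latin: "}

# Matches simple parameterless templates such as {{lang-la}} (an assumption
# made to keep the sketch short).
TEMPLATE_RE = re.compile(r"\{\{([^{}|]+)\}\}")


@functools.lru_cache(maxsize=None)
def expand_template(name):
    """Return the expansion of a template; repeated lookups hit the cache."""
    return TEMPLATES.get(name, "")


def clean_article(article):
    """Expand templates in one (title, wikitext) pair."""
    title, text = article
    expanded = TEMPLATE_RE.sub(
        lambda m: expand_template(m.group(1).strip().lower()), text
    )
    return title, expanded


def extract_parallel(articles, processes=None):
    """Process articles in a pool of worker processes.

    With processes=None the pool uses one worker per CPU core.
    """
    with multiprocessing.Pool(processes) as pool:
        return pool.map(clean_article, articles)
```

In a script, `extract_parallel(pages, processes=2)` would be called under an `if __name__ == "__main__":` guard. Note that in this sketch each worker process holds its own in-memory cache.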

Installation

The script may be invoked directly. Alternatively, it can be installed with:

(sudo) python setup.py install

Usage

The script is invoked with a Wikipedia dump file as an argument. The output is stored in several files of similar size in a given directory. Each file contains several documents in the document format shown in the help text below.

usage: WikiExtractor.py [-h] [-o OUTPUT] [-b n[KMG]] [-c] [--html] [-l] [-s]
                        [--lists] [-ns ns1,ns2] [-xns ns1,ns2]
                        [--templates TEMPLATES] [--no-templates]
                        [-r] [--min_text_length MIN_TEXT_LENGTH]
                        [--filter_disambig_pages] [--processes PROCESSES] [-q]
                        [--debug] [-a] [-v]
                        input

Wikipedia Extractor:
Extracts and cleans text from a Wikipedia database dump and stores output in a
number of files of similar size in a given directory.
Each file will contain several documents in the format:

    <doc id="" revid="" url="" title="">
        ...
    </doc>

Template expansion requires preprocessing the whole dump first and
collecting template definitions.

positional arguments:
  input                 XML wiki dump file

optional arguments:
  -h, --help            show this help message and exit
  --processes PROCESSES number of processes to use (default: number of CPU cores)

Output:
  -o OUTPUT, --output OUTPUT
                        directory for extracted files (or '-' for dumping to
                        stdout)
  -b n[KMG], --bytes n[KMG]
                        maximum bytes per output file (default 1M)
  -c, --compress        compress output files using bzip

Processing:
  --html                produce HTML output, subsumes --links
  -l, --links           preserve links
  -s, --sections        preserve sections
  --lists               preserve lists
  -ns ns1,ns2, --namespaces ns1,ns2
                        accepted link namespaces
  -xns ns1,ns2, --xml_namespaces ns1,ns2
                        accepted page xml namespaces -- 0 for main/articles
  --templates TEMPLATES
                        use or create file containing templates
  --no-templates        Do not expand templates
  -r, --revision        Include the document revision id (default=False)
  --min_text_length MIN_TEXT_LENGTH
                        Minimum expanded text length required to write
                        document (default=0)
  --filter_disambig_pages
                        Remove pages from output that contain disambiguation
                        markup (default=False)
  -it, --ignored_tags
                        comma separated list of tags that will be dropped, keeping their content
  -de, --discard_elements
                        comma separated list of elements that will be removed from the article text
  --keep_tables
                        Preserve tables in the output article text (default=False)

Special:
  -q, --quiet           suppress reporting progress info
  --debug               print debug info
  -a, --article         analyze a file containing a single article (debug
                        option)
  -v, --version         print program version
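
The `<doc ...> ... </doc>` format described in the help text can be read back with a few lines of Python. The following is a rough sketch, assuming attribute values contain no double quotes and that the text of an output file has already been loaded into a string. The function name `iter_docs` and the sample data are hypothetical.

```python
import re

# One <doc ...> header, the document text, and the closing </doc>.
DOC_RE = re.compile(r"<doc\s+([^>]*)>\n?(.*?)</doc>", re.DOTALL)
# name="value" pairs inside the header.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def iter_docs(text):
    """Yield (attributes, body) for each <doc ...>...</doc> block in text."""
    for match in DOC_RE.finditer(text):
        attrs = dict(ATTR_RE.findall(match.group(1)))
        yield attrs, match.group(2).strip()


# Made-up sample in the documented format.
sample = (
    '<doc id="12" revid="345" url="https://example.org/wiki?curid=12" title="Example">\n'
    "Example\n\nFirst paragraph.\n"
    "</doc>\n"
    '<doc id="13" revid="346" url="https://example.org/wiki?curid=13" title="Other">\n'
    "Other\n\nSecond page.\n"
    "</doc>\n"
)

docs = list(iter_docs(sample))
for attrs, body in docs:
    print(attrs["id"], attrs["title"], len(body))
```

Files written with `--compress` would first need to be decompressed, for example with the standard `bz2` module.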

Saving templates to a file speeds up subsequent extractions, provided the template definitions have not changed.

Option --no-templates significantly speeds up the extractor, avoiding the cost of expanding MediaWiki templates.
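
As an illustration of the two template options, the sketch below builds command lines for the script from Python. The flags `-o`, `--templates` and `--no-templates` come from the help text above. The helper `build_command` and the file names `dump.xml`, `out` and `templates.txt` are hypothetical.

```python
import sys


def build_command(dump_path, output_dir, templates_file=None, expand_templates=True):
    """Return an argv list for running WikiExtractor.py (hypothetical helper)."""
    cmd = [sys.executable, "WikiExtractor.py", "-o", output_dir]
    if not expand_templates:
        # Skip template expansion entirely for the fastest run.
        cmd.append("--no-templates")
    elif templates_file:
        # Use or create the file containing template definitions.
        cmd += ["--templates", templates_file]
    cmd.append(dump_path)
    return cmd


with_templates = build_command("dump.xml", "out", templates_file="templates.txt")
without_templates = build_command("dump.xml", "out", expand_templates=False)
print(" ".join(with_templates[1:]))
print(" ".join(without_templates[1:]))
```

Either list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains the script.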

For further information, visit the documentation.