# wikiextractor
[WikiExtractor.py](http://medialab.di.unipi.it/wiki/Wikipedia_Extractor) is a Python script that extracts and cleans text from a [Wikipedia database dump](http://download.wikimedia.org/).
The tool is written in Python and requires no additional libraries.
For further information, see the [project Home Page](http://medialab.di.unipi.it/wiki/Wikipedia_Extractor) or the [Wiki](wiki).
The current beta version of WikiExtractor.py is capable of performing template expansion to some extent.
## Usage
The script is invoked with a Wikipedia dump file as an argument.
The output is stored in a number of files of similar size in a chosen directory.
Each file contains several documents in this [document format](http://medialab.di.unipi.it/wiki/Document_Format), sketched below.
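As a rough illustration only (the authoritative description is on the linked page, and the exact attributes may differ), each extracted document is wrapped in a `<doc>` element along these lines:

```xml
<doc id="12" url="http://en.wikipedia.org/wiki?curid=12" title="Article title">
Article title

Plain text of the article, with wiki markup removed...
</doc>
```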
This is a beta version that performs template expansion by preprocessing the whole dump and extracting template definitions.
```
Usage:
  WikiExtractor.py [options] xml-dump-file

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        output directory
  -b n[KM], --bytes n[KM]
                        put specified bytes per output file (default is 1M)
  -B BASE, --base BASE  base URL for the Wikipedia pages
  -c, --compress        compress output files using bzip
  -l, --links           preserve links
  -ns ns1,ns2, --namespaces ns1,ns2
                        accepted namespaces
  -q, --quiet           suppress reporting progress info
  -s, --sections        preserve sections
  -a, --article         analyze a file containing a single article
  --templates TEMPLATES
                        use or create file containing templates
  -v, --version         print program version
```
Saving templates to a file will speed up extraction the next time, provided the template definitions have not changed.
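For example, a typical run might look like the following (the dump file name and output directory are placeholders; adjust them to your setup):

```sh
python WikiExtractor.py -o extracted -b 500K --templates templates.txt enwiki-latest-pages-articles.xml
```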