Update README.md
This commit is contained in:
parent
1f6603b963
commit
0ab1bbb627
38
README.md
@@ -1,2 +1,38 @@
# wikiextractor
A tool for extracting plain text from Wikipedia dumps

[WikiExtractor.py](http://medialab.di.unipi.it/wiki/Wikipedia_Extractor) is a Python script that extracts and cleans text from a [Wikipedia database dump](http://download.wikimedia.org/).

The tool is written in Python and requires no additional libraries.

The current beta version of WikiExtractor.py is capable of performing template expansion to some extent.

## Usage
The script is invoked with a Wikipedia dump file as an argument.
The output is stored in a number of files of similar size in a chosen directory.
Each file contains several documents in this [document format](http://medialab.di.unipi.it/wiki/Document_Format).

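As a rough illustration of consuming the output, the sketch below splits the contents of one output file into individual documents. It assumes each document is wrapped in a `<doc id="..." url="..." title="...">` ... `</doc>` element; the linked Document Format page is the authoritative description. The helper name `iter_documents` and the sample text are made up for this example.

```python
import re

# Assumed layout of one extracted document (check the Document Format page):
#   <doc id="..." url="..." title="...">
#   ...plain text...
#   </doc>
DOC_RE = re.compile(
    r'<doc id="(?P<id>[^"]*)" url="(?P<url>[^"]*)" title="(?P<title>[^"]*)">\n'
    r'(?P<text>.*?)\n?</doc>',
    re.DOTALL,
)


def iter_documents(data):
    """Yield one dict per <doc> element found in the string `data`."""
    for match in DOC_RE.finditer(data):
        yield {
            "id": match.group("id"),
            "url": match.group("url"),
            "title": match.group("title"),
            "text": match.group("text").strip(),
        }


# Made-up sample standing in for the contents of one output file.
SAMPLE = '''<doc id="1" url="http://example.org/wiki?curid=1" title="First page">
First page

Body of the first page.
</doc>
<doc id="2" url="http://example.org/wiki?curid=2" title="Second page">
Second page

Body of the second page.
</doc>
'''

docs = list(iter_documents(SAMPLE))
for doc in docs:
    print(doc["id"], doc["title"])
```

For real output, read each file in the output directory and pass its contents to `iter_documents`. Files written with `-c` would need to be decompressed first.
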
This is a beta version that performs template expansion by preprocessing the whole dump and extracting template definitions.

    Usage:
      WikiExtractor.py [options] xml-dump-file

    optional arguments:
      -h, --help            show this help message and exit
      -o OUTPUT, --output OUTPUT
                            output directory
      -b n[KM], --bytes n[KM]
                            put specified bytes per output file (default is 1M)
      -B BASE, --base BASE  base URL for the Wikipedia pages
      -c, --compress        compress output files using bzip
      -l, --links           preserve links
      -ns ns1,ns2, --namespaces ns1,ns2
                            accepted namespaces
      -q, --quiet           suppress reporting progress info
      -s, --sections        preserve sections
      -a, --article         analyze a file containing a single article
      --templates TEMPLATES
                            use or create file containing templates
      -v, --version         print program version

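The `-b n[KM]` option takes a number with an optional `K` or `M` suffix. The sketch below shows one plausible way such a value maps to a byte count. It assumes binary multiples (1K = 1024 bytes), which the help text does not state, and `parse_size` is a hypothetical helper, not a function taken from the script.

```python
def parse_size(spec):
    """Convert a size such as '500K' or '1M' into a number of bytes.

    Assumption: K and M are binary multiples (1024 and 1024 ** 2).
    A value without a suffix is taken as a plain byte count.
    """
    units = {"K": 1024, "M": 1024 ** 2}
    suffix = spec[-1].upper()
    if suffix in units:
        return int(spec[:-1]) * units[suffix]
    return int(spec)


print(parse_size("1M"))    # the documented default size
print(parse_size("500K"))
```
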
Saving templates to a file speeds up subsequent extractions, provided the template definitions have not changed.
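
A minimal sketch of driving the script from Python with a reusable templates file, using only options listed in the help text above. The file names (`dump.xml`, `extracted`, `templates.txt`) and the helpers `build_command` and `run_extraction` are hypothetical. Launching the script through the current interpreter is also an assumption and may need adjusting for your setup.

```python
import subprocess
import sys


def build_command(dump_file, output_dir, templates_file=None, max_bytes=None,
                  script="WikiExtractor.py"):
    """Build the argument list for one extraction run."""
    # Assumption: the script is started with the current Python interpreter.
    cmd = [sys.executable, script, "-o", output_dir]
    if max_bytes:
        cmd += ["-b", max_bytes]
    if templates_file:
        # The same path is passed on every run, per the help text
        # ("use or create file containing templates").
        cmd += ["--templates", templates_file]
    cmd.append(dump_file)
    return cmd


def run_extraction(cmd):
    """Run the extraction; needs the script and the dump to be present."""
    subprocess.run(cmd, check=True)


cmd = build_command("dump.xml", "extracted",
                    templates_file="templates.txt", max_bytes="500K")
print(" ".join(cmd))
```

Calling `run_extraction(cmd)` once per dump with the same `--templates` path lets later runs reuse the saved definitions, as long as the templates in the dump have not changed.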