There were bugs that caused content to be dropped or erroneously included if a list was the first thing in a section and --lists was specified. This change should fix #117 and #118.
For things to work properly on Windows (and to make the communication between processes clearer in general), the parent process should communicate with the subprocess functions only via their arguments, not via shared global variables. This change takes all the global variables that used to be implicitly shared with the subprocess and puts them into a single "options" object (a SimpleNamespace, whose fields are read as attributes). This object is then passed to the subprocess functions.
The "options" object includes not only the values of command-line arguments (e.g., "--no-templates") but also stuff like the "URL base" and template definitions that are precomputed before the main extraction begins.
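A minimal sketch of the pattern described above, not the actual code from this change: the field names (`expand_templates`, `url_base`, `templates`) and the helpers `make_options` and `extract_page` are hypothetical. It shows a worker function that reads everything from the object it is given, and checks that the object survives pickling, which is how arguments reach a child process when processes are spawned rather than forked.

```python
import pickle
from types import SimpleNamespace


def make_options(no_templates=False, url_base="", templates=None):
    """Bundle CLI flags and precomputed state into one object (hypothetical fields)."""
    return SimpleNamespace(
        expand_templates=not no_templates,
        url_base=url_base,
        templates=dict(templates or {}),
    )


def extract_page(options, page_id, title):
    """Worker: reads its configuration from `options`, never from module globals."""
    url = "%s?curid=%s" % (options.url_base, page_id)
    return {
        "id": page_id,
        "url": url,
        "title": title,
        "templates_expanded": options.expand_templates,
    }


options = make_options(no_templates=True, url_base="https://en.wikipedia.org/wiki")

# Spawned child processes receive their arguments through pickle, so the
# options object must survive a round trip unchanged.
restored = pickle.loads(pickle.dumps(options))
result = extract_page(restored, "12", "Anarchism")
```

Because the worker takes `options` as an explicit argument, the same function can be handed to `multiprocessing.Process(target=extract_page, args=(options, ...))` without relying on state inherited from the parent.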
This commit adds a new output format to the program, namely JSON. When
invoked with the --json flag, the program still writes several files
with several pages per file (as before), but each line of a file is one
JSON object representing a single page.
The information contained in this JSON object is the same as in the
default format, but is somewhat more straightforward for other tools to
parse.
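As a rough illustration of why one object per line is easy to consume, the sketch below parses such lines with the standard `json` module. The field names (`id`, `title`, `text`) and the sample records are assumptions made for the example, not taken from this commit.

```python
import json


def iter_pages(lines):
    """Yield one dict per non-empty line of JSON-lines input."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical sample data standing in for the lines of one output file.
sample = [
    '{"id": "12", "title": "Anarchism", "text": "Anarchism is ..."}\n',
    '{"id": "25", "title": "Autism", "text": "Autism is ..."}\n',
]
titles = [page["title"] for page in iter_pages(sample)]
```

Any iterable of text lines works as input, so an open file object can be passed directly. If the output was written with --compress and the compression is bzip2 (an assumption here), `bz2.open(path, "rt", encoding="utf-8")` provides such an iterable.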
The running time and the compressed output size are the same as for the
default format. The following is a simple benchmark on an i5 machine
using as input
enwiki-20170120-pages-meta-current1.xml-p000000010p000030303.bz2:
$ ./WikiExtractor.py -o xml --compress --no-templates input.bz2
INFO: Finished 3-process extraction of 15084 articles in 335.7s (44.9 art/s)
$ ./WikiExtractor.py -o json --json --compress --no-templates input.bz2
INFO: Finished 3-process extraction of 15084 articles in 336.8s (44.8 art/s)
$ du -sh json xml
69M json
69M xml
--discard_elements - allowing you to customise which elements are discarded
--ignored_tags - allowing you to customise which tags are ignored
--keep_tables - allows the contents of tables in the original article to be retained. This does not render HTML tables
* Use `from __future__ import unicode_literals` and replace `u''` literals
with `''`.
* Use `io.StringIO` instead of `cStringIO.StringIO` for Py2/3 compatibility.
* Define a const `PY2` which is True in Python 2 but False in Python 3.
* Import `quote` and `name2codepoint` from different modules in
Python 2 and 3.
* Use the Python 3 names in Python 2 for `zip`, `zip_longest`, `range` and `chr`.
* Use `text_type` as a type for `unicode` in Python 2 and `str` in Python 3.
* Use `sorted()` to sort dict's `items()`.
* Implement `__next__` in NextFile and call the built-in `next()` function.
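The items above can be sketched as a small compatibility shim. This is an illustration under assumptions, not the code from this change: `NameSequence` is a hypothetical stand-in for NextFile, and the `unicode_literals` import is only mentioned in a comment because it has to be the first statement of a real module.

```python
# In a real module, `from __future__ import unicode_literals` would be the
# first statement, so that '' literals are text in both Python versions.
import sys
from io import StringIO

PY2 = sys.version_info[0] == 2

if PY2:
    from urllib import quote
    from htmlentitydefs import name2codepoint
    from itertools import izip as zip, izip_longest as zip_longest
    range = xrange      # noqa: F821 (Python 2 only)
    chr = unichr        # noqa: F821 (Python 2 only)
    text_type = unicode  # noqa: F821 (Python 2 only)
else:
    from urllib.parse import quote
    from html.entities import name2codepoint
    from itertools import zip_longest
    text_type = str


class NameSequence(object):
    """Hypothetical stand-in for NextFile: yields successive output names."""

    def __init__(self, prefix):
        self.prefix = prefix
        self.index = -1

    def __iter__(self):
        return self

    def __next__(self):
        self.index += 1
        return "%s_%02d" % (self.prefix, self.index)

    next = __next__  # Python 2 looks up .next(); Python 3 uses .__next__()


names = NameSequence("wiki")
first = next(names)    # the built-in works on both versions
second = next(names)

encoded = quote("Albert Einstein")
amp = name2codepoint["amp"]
pairs = list(zip_longest("ab", "c"))
ordered = sorted({"b": 2, "a": 1}.items())  # explicit, stable ordering

buf = StringIO()
buf.write(text_type("page text"))
```

Aliasing `next = __next__` keeps the class usable with the built-in `next()` on Python 2, where the iterator protocol looks for a method named `next`.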