150 Commits

Author SHA1 Message Date
Giuseppe Attardi
e3edc0c352 Merge pull request #108 from BrenBarn/globals-cleanup
Globals cleanup
2017-02-27 02:08:09 +01:00
BrenBarn
e7bb889e0e Removed some old comments 2017-02-26 12:41:58 -08:00
BrenBarn
ff51a19a1d Change to NextFile test so it will pass on Windows (use os.path.sep instead of /) 2017-02-26 12:02:11 -08:00
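The portability issue behind this test change can be sketched as follows. The helper name `expected_path` and the path components are hypothetical, chosen only for illustration; they are not taken from the repository's test.

```python
import os


def expected_path(*parts):
    """Build the path a test should expect, using the platform separator."""
    return os.path.sep.join(parts)


# A hard-coded "/" only matches what os.path.join produces on POSIX systems.
hard_coded = "out/AA/wiki_00"

# Joining with os.path.sep matches os.path.join on Windows as well.
portable = expected_path("out", "AA", "wiki_00")
```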
BrenBarn
19d358eee8 Factor all info that needs to be passed to subprocesses into "options" variable
In order for things to work properly on Windows (and to make the communication between processes more clear in general), the parent process should communicate with the subprocess function only via its arguments, not via shared global variables.  This change takes all the global variables that used to be implicitly shared with the subprocess, and puts them into a single "options" object (a dict-like SimpleNamespace).  This object is then passed to the subprocess functions.

The "options" object includes not only the values of command-line arguments (e.g., "--no-templates") but also stuff like the "URL base" and template definitions that are precomputed before the main extraction begins.
2017-02-26 11:58:44 -08:00
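A minimal sketch of the pattern this commit describes, assuming the standard-library `types.SimpleNamespace` and `multiprocessing.Pool`. The field names `url_base` and `expand_templates` and the functions `extract_page` and `run_parallel` are hypothetical and do not reproduce the project's actual code.

```python
from functools import partial
from multiprocessing import Pool
from types import SimpleNamespace


def extract_page(options, page):
    """Worker: reads its settings only from `options`, never from globals."""
    page_id, title = page
    url = "%s?curid=%s" % (options.url_base, page_id)
    suffix = "" if options.expand_templates else " (templates skipped)"
    return "%s <%s>%s" % (title, url, suffix)


def run_parallel(pages, options, processes=2):
    """Send the same picklable options object to every worker process."""
    with Pool(processes) as pool:
        return pool.map(partial(extract_page, options), pages)
```

Because `options` travels as an explicit argument, the workers behave the same whether the platform forks the parent (and so inherits its globals) or starts fresh interpreters, as Windows does. Call `run_parallel` from under an `if __name__ == "__main__":` guard when running as a script.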
attardi
f6f80e2350 ignoredTags 2017-02-26 01:00:05 +01:00
attardi
5f1fb5c995 Declared global ignoredTags 2017-02-26 00:58:56 +01:00
attardi
25edeebafb Moved ignoredTags to top. 2017-02-26 00:53:48 +01:00
attardi
82196d1156 Define discardedElements 2017-02-26 00:39:26 +01:00
Giuseppe Attardi
db51f0b45c Merge pull request #105 from Cecca/master
Add json output
2017-02-11 18:12:25 +01:00
Matteo Ceccarello
7ae45fcff7 Add json output
This commit adds a new output format to the program, namely json. When
invoked with the --json flag, the program will write several files with
several pages per file (as before) with one json object per line
representing a single page.

The information contained in this json object is the same as in the
default format, but is somewhat more straightforward to parse for other
tools.

The running time is the same as the default format, as well as the
compressed output size. The following is a simple benchmark on an I5
machine using as input
enwiki-20170120-pages-meta-current1.xml-p000000010p000030303.bz2:

$ ./WikiExtractor.py -o xml --compress --no-templates input.bz2
INFO: Finished 3-process extraction of 15084 articles in 335.7s (44.9 art/s)
$ ./WikiExtractor.py -o json --json --compress --no-templates input.bz2
INFO: Finished 3-process extraction of 15084 articles in 336.8s (44.8 art/s)
$ du -sh json xml
69M     json
69M     xml
2017-02-10 20:13:27 +01:00
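The "one json object per line" layout described in this commit can be read and written with the standard `json` module alone. The sketch below is an illustration under assumptions: the keys `id`, `title` and `text` and both function names are hypothetical, not the extractor's confirmed schema.

```python
import io
import json


def write_pages_as_json_lines(pages, out):
    """Write each page dict as a single JSON object on its own line."""
    for page in pages:
        # json.dumps escapes newlines inside strings, so each object
        # stays on exactly one line.
        out.write(json.dumps(page, ensure_ascii=False, sort_keys=True))
        out.write("\n")


def read_json_lines(stream):
    """Parse a stream holding one JSON object per non-empty line."""
    return [json.loads(line) for line in stream if line.strip()]
```

A consumer can process such a file line by line without loading it whole, which is the practical advantage over a single large JSON array.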
attardi
ea9c368e52 Fix to use of this instead of self. 2017-02-06 08:36:00 +01:00
Giuseppe Attardi
c88ae9736d Merge pull request #102 from nathj07/ISSUE-101_discard_and_ignore_flags
Issue 101 discard and ignore flags and keep_tables
2017-02-02 15:31:03 +01:00
Nathan Davies
15e589e5cf putting the double line spacing back 2017-02-02 05:38:10 -08:00
Nathan Davies
b9c29e36bb Merge branch 'master' into ISSUE-101_discard_and_ignore_flags 2017-02-01 05:29:17 -08:00
Nathan Davies
12fb5e587d improved comment around regex 2017-02-01 05:26:44 -08:00
Nathan Davies
663a3dea73 tidying up some of the code and adding comments. 2017-01-31 16:52:59 -08:00
Giuseppe Attardi
ba616e2f85 Merge pull request #103 from xiaoling/lists
closes lists at empty lines and adds list item count
2017-01-28 10:26:12 +01:00
Xiao Ling
e646440185 closes lists at empty lines and adds list item count 2017-01-27 16:20:47 -08:00
Nathan Davies
3701b779b3 added extra replace section to tidy up output when retaining tables 2017-01-23 14:37:24 +00:00
Nathan Davies
e835e8c004 Added new flags
--discard_elements - allowing you to customise which elements are discarded
--ignored_tags - allowing you to customise which tags are ignored
--keep_tables - allows the contents of tables in the original article to be retained. This does not render HTML tables
2017-01-23 14:18:21 +00:00
Giuseppe Attardi
b5d97f64f9 Merge pull request #100 from orangain/fix-oom-in-single-core
Ensure that process_count >= 1
2017-01-15 14:38:19 +01:00
orangain
ddec683957 Ensure that process_count >= 1 2017-01-15 21:20:17 +09:00
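One plausible reading of this fix, sketched under an assumption: if the default number of worker processes is derived as "CPU count minus one", a single-core machine would get zero workers. The function name `default_process_count` and that derivation are hypothetical; the commit message confirms only the `process_count >= 1` guarantee.

```python
def default_process_count(cpu_count):
    """Leave one core free for the parent, but never go below one worker."""
    return max(1, cpu_count - 1)
```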
attardi
7449ac95ba text_type 2017-01-15 10:09:40 +01:00
attardi
6660973646 See ChangeLog. 2017-01-15 09:08:35 +01:00
Giuseppe Attardi
e00eacb372 Update version number. 2017-01-04 13:07:10 -08:00
Giuseppe Attardi
ba74b992da Python 3 decode
Skip decode('utf-8') in Python 3.
2017-01-04 13:05:15 -08:00
Giuseppe Attardi
87dbb62961 Merge pull request #94 from seong889/patch-1
Update README.md
2016-12-10 11:38:45 +01:00
Seongjun Hong
31f848d620 Update README.md
removed option : escapedoc
2016-12-08 15:48:05 +09:00
attardi
499997910c See ChangeLog. 2016-10-29 10:51:08 +02:00
attardi
f1989bcc11 See ChangeLog. 2016-10-29 10:19:28 +02:00
attardi
ce600138f4 See ChangeLog 2016-10-29 10:03:07 +02:00
Giuseppe Attardi
ff12b69e09 Merge pull request #82 from sente/patch-2
Fix typo in README.md
2016-10-29 09:08:58 +02:00
attardi
636c9ea9f4 See ChangeLog. 2016-08-31 09:00:51 +02:00
attardi
e6a051c949 See ChangeLog. 2016-08-30 18:17:14 +02:00
Stuart Powers
fe0ac18fb9 Fix typo in README.md
"preprocesssng" -> "preprocessing"
2016-08-30 00:41:21 -04:00
attardi
d167742d16 See ChangeLog. 2016-08-29 23:34:47 +02:00
attardi
5cb7da320e See ChangeLog. 2016-08-19 14:11:37 +02:00
attardi
2942d1e19d See ChangeLog. 2016-08-11 10:34:21 +02:00
Giuseppe Attardi
419fe97d7a Merge pull request #67 from sethcleveland/remove_python2_string_encoding
Remove python2 extract utf8 encoding and log extract exceptions
2016-06-24 09:03:11 +02:00
Seth Cleveland
eacccbc6eb Remove python2 extract utf8 encoding and log extract exceptions 2016-06-20 10:39:01 -05:00
attardi
0f703c0aae Merged PR from Seth Cleveland. 2016-06-19 13:10:36 +02:00
attardi
f9b8e8ac02 Added support for Python 3. 2016-06-19 12:53:31 +02:00
Giuseppe Attardi
aee387b566 Merge pull request #66 from orangain/support-python3
Support Python 3
2016-06-19 12:46:52 +02:00
orangain
3ccd368aa6 Update README.md about Python 2/3 support 2016-06-18 13:44:45 +09:00
orangain
cb7d42d10f Add tox.ini 2016-06-18 13:44:45 +09:00
orangain
7b21d10ccd Seek StringIO to position 0 on every output
truncate(0) does not guarantee that the stream position is reset to 0.
2016-06-18 13:44:45 +09:00
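The behaviour behind this commit can be illustrated with `io.StringIO`: truncating changes the buffer's size but leaves the stream position where it was, so a reused buffer needs an explicit seek. The helper name `flush_and_reset` is hypothetical; this is a sketch, not the project's code.

```python
import io


def flush_and_reset(buf):
    """Return the buffered text and leave the buffer empty at position 0."""
    text = buf.getvalue()
    buf.seek(0)      # move the position back to the start first
    buf.truncate()   # then cut the buffer at the current position
    return text
```

Without the `seek(0)`, the next write would start at the old position rather than at the beginning of the now-empty buffer.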
orangain
b19e341ce2 Use text type as a page text and encode them when writing to a file 2016-06-18 13:44:45 +09:00
orangain
6851fe4b3f Use // instead of / for integer division
Add `division` to future import.
2016-06-18 13:44:45 +09:00
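The difference between the two operators can be seen in a short sketch. The function `dir_index` and its default of 100 files per directory are hypothetical examples, not values taken from the repository.

```python
# On Python 2, `from __future__ import division` at the top of the module
# makes `/` behave as shown here; on Python 3 this is already the default.


def dir_index(file_index, files_per_dir=100):
    """Map a running file number to a directory number (an integer)."""
    # `//` floors the result, so it stays an int on both Python versions.
    return file_index // files_per_dir


true_quotient = 250 / 100    # true division: a float
floor_quotient = 250 // 100  # floor division: an int
```

Code that uses a quotient as an index or a counter therefore needs `//` once `/` means true division.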
orangain
9322b7ba54 Make imports Python 2/3 compatible
* Use `from __future__ import unicode_literals` and replace `u''` literals
  with `''`.
* Use `io.StringIO` instead of `cStringIO.StringIO` for Py2/3 compatibility.
* Define a const `PY2` which is True in Python 2 but False in Python 3.
* Import `quote` and `name2codepoint` from different modules in
  Python 2 and 3.
* Use Python 3's name in Python 2 for `zip`, `zip_longest`, `range` and `chr`.
* Use `text_type` as a type for `unicode` in Python 2 and `str` in Python 3.
* Use `sorted()` to sort dict's `items()`.
* Implement `__next__` in NextFile and call next() built-in function.
2016-06-18 13:44:45 +09:00
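Several of the items above follow a common compatibility idiom, which might look roughly like the sketch below. The names `PY2`, `text_type`, `quote`, `name2codepoint` and `zip_longest` come from the commit message; the exact layout is an assumption and may differ from the project's code.

```python
import sys

# True on Python 2, False on Python 3.
PY2 = sys.version_info[0] == 2

if PY2:
    # Python 2 locations, rebound to the Python 3 names.
    from urllib import quote
    from htmlentitydefs import name2codepoint
    from itertools import izip as zip, izip_longest as zip_longest
    range = xrange
    chr = unichr
    text_type = unicode
else:
    from urllib.parse import quote
    from html.entities import name2codepoint
    from itertools import zip_longest
    text_type = str
```

The rest of the module can then use the Python 3 spellings everywhere, and the Python 2 branch is never executed on Python 3, so its undefined names cause no error there.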
orangain
8749df0a81 Add .gitignore for python 2016-06-18 13:44:45 +09:00