Commit Graph

110 Commits

Author SHA1 Message Date
Daniel
45e56d4e9e
Update WikiExtractor.py
Fix bug where it fails with an exception when n="1" or n="A"
2017-11-08 14:28:36 +00:00
BrenBarn
674e9a0264 Fix problems that occurred when a list was the first thing in a section.
There were bugs that caused content to be dropped or erroneously included if a list was the first thing in a section and --lists was specified.  This change should fix #117 and #118.
2017-03-08 01:01:31 -08:00
attardi
5414b7fda8 Completed module String 2017-03-04 04:22:30 +01:00
attardi
c9432abcd0 Define #ifexists 2017-03-03 19:44:48 +01:00
attardi
3ea2da809b Fix for empty templates. 2017-03-03 18:52:17 +01:00
attardi
8fd8da77f4 Updated version number. 2017-03-02 05:58:05 +01:00
BrenBarn
e7bb889e0e Removed some old comments 2017-02-26 12:41:58 -08:00
BrenBarn
19d358eee8 Factor all info that needs to be passed to subprocesses into "options" variable
In order for things to work properly on Windows (and to make the communication between processes more clear in general), the parent process should communicate with the subprocess function only via its arguments, not via shared global variables.  This change takes all the global variables that used to be implicitly shared with the subprocess, and puts them into a single "options" object (a dict-like SimpleNamespace).  This object is then passed to the subprocess functions.

The "options" object includes not only the values of command-line arguments (e.g., "--no-templates") but also stuff like the "URL base" and template definitions that are precomputed before the main extraction begins.
2017-02-26 11:58:44 -08:00
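The idea described in this commit message can be sketched as follows. This is a minimal illustration, not the code from the commit: it assumes Python 3's `types.SimpleNamespace`, and the names `extract_page`, `run`, `url_base` and `to_json` are hypothetical.

```python
from multiprocessing import Pool
from types import SimpleNamespace


def extract_page(args):
    """Worker: everything it needs arrives through its arguments."""
    options, title = args
    # Hypothetical fields; a real options object would carry more state.
    if options.to_json:
        return '{"title": "%s"}' % title
    return "%s/%s" % (options.url_base, title)


def run(titles, options, processes=2):
    # The options object is pickled and sent with every task, so the
    # worker never relies on globals inherited from the parent process.
    with Pool(processes) as pool:
        return pool.map(extract_page, [(options, t) for t in titles])


if __name__ == "__main__":
    opts = SimpleNamespace(url_base="https://example.org/wiki", to_json=False)
    print(run(["A", "B"], opts))
```

Because Windows starts subprocesses by spawning a fresh interpreter rather than forking, globals set at runtime in the parent are not visible in the child, which is why passing state explicitly matters there.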
attardi
f6f80e2350 ignoredTags 2017-02-26 01:00:05 +01:00
attardi
5f1fb5c995 Declared global ignoredTags 2017-02-26 00:58:56 +01:00
attardi
25edeebafb Moved ignoredTags to top. 2017-02-26 00:53:48 +01:00
attardi
82196d1156 Define discardedElements 2017-02-26 00:39:26 +01:00
Matteo Ceccarello
7ae45fcff7 Add json output
This commit adds a new output format to the program, namely json. When
invoked with the --json flag, the program will write several files with
several pages per file (as before) with one json object per line
representing a single page.

The information contained in this json object is the same as in the
default format, but is somewhat more straightforward to parse for other
tools.

The running time is the same as the default format, as well as the
compressed output size. The following is a simple benchmark on an I5
machine using as input
enwiki-20170120-pages-meta-current1.xml-p000000010p000030303.bz2:

$ ./WikiExtractor.py -o xml --compress --no-templates input.bz2
INFO: Finished 3-process extraction of 15084 articles in 335.7s (44.9 art/s)
$ ./WikiExtractor.py -o json --json --compress --no-templates input.bz2
INFO: Finished 3-process extraction of 15084 articles in 336.8s (44.8 art/s)
$ du -sh json xml
69M     json
69M     xml
2017-02-10 20:13:27 +01:00
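A downstream tool could read the one-object-per-line format described above roughly like this. It is only a sketch: the field names `id`, `title` and `text` in the sample are assumptions for illustration, and `read_pages` is a hypothetical helper.

```python
import json


def read_pages(lines):
    """Yield one dict per non-empty line of JSON-lines text."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical sample data; the field names are assumed, not taken from the tool.
sample = [
    '{"id": "12", "title": "Anarchism", "text": "First page."}',
    '{"id": "25", "title": "Autism", "text": "Second page."}',
]
titles = [page["title"] for page in read_pages(sample)]
```

The same function works on an open file object, since iterating a text file also yields one line at a time.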
attardi
ea9c368e52 Fix to use of this instead of self. 2017-02-06 08:36:00 +01:00
Nathan Davies
15e589e5cf putting the double line spacing back 2017-02-02 05:38:10 -08:00
Nathan Davies
b9c29e36bb Merge branch 'master' into ISSUE-101_discard_and_ignore_flags 2017-02-01 05:29:17 -08:00
Nathan Davies
12fb5e587d improved comment around regex 2017-02-01 05:26:44 -08:00
Nathan Davies
663a3dea73 tidying up some of the code and adding comments. 2017-01-31 16:52:59 -08:00
Xiao Ling
e646440185 closes lists at empty lines and adds list item count 2017-01-27 16:20:47 -08:00
Nathan Davies
3701b779b3 added extra replace section to tidy up output when retaining tables 2017-01-23 14:37:24 +00:00
Nathan Davies
e835e8c004 Added new flags
--discard_elements - allowing you to customise which elements are discarded
--ignored_tags - allowing you to customise which tags are ignored
--keep_tables - allows the contents of tables in the original article to be retained. This does not render HTML tables
2017-01-23 14:18:21 +00:00
orangain
ddec683957 Ensure that process_count >= 1 2017-01-15 21:20:17 +09:00
attardi
7449ac95ba text_type 2017-01-15 10:09:40 +01:00
attardi
6660973646 See ChangeLog. 2017-01-15 09:08:35 +01:00
Giuseppe Attardi
e00eacb372 Update version number. 2017-01-04 13:07:10 -08:00
Giuseppe Attardi
ba74b992da Python 3 decode
Skip decode('utf-8') in Python 3.
2017-01-04 13:05:15 -08:00
attardi
ce600138f4 See ChangeLog 2016-10-29 10:03:07 +02:00
attardi
636c9ea9f4 See ChangeLog. 2016-08-31 09:00:51 +02:00
attardi
e6a051c949 See ChangeLog. 2016-08-30 18:17:14 +02:00
attardi
d167742d16 See ChangeLog. 2016-08-29 23:34:47 +02:00
attardi
5cb7da320e See ChangeLog. 2016-08-19 14:11:37 +02:00
attardi
2942d1e19d See ChangeLog. 2016-08-11 10:34:21 +02:00
Seth Cleveland
eacccbc6eb Remove python2 extract utf8 encoding and log extract exceptions 2016-06-20 10:39:01 -05:00
attardi
0f703c0aae Merged PR from Seth Cleveland. 2016-06-19 13:10:36 +02:00
attardi
f9b8e8ac02 Added support for Python 3. 2016-06-19 12:53:31 +02:00
orangain
7b21d10ccd Seek StringIO to position 0 on every output
truncate(0) does not guarantee that the position is seeked to 0.
2016-06-18 13:44:45 +09:00
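The behaviour this commit guards against can be reproduced with `io.StringIO` in Python 3. This is a small sketch of the pattern, not the code from the commit.

```python
import io

buf = io.StringIO()
buf.write("first page")

buf.truncate(0)          # discards the contents...
stale_pos = buf.tell()   # ...but the stream position is left where it was

buf.seek(0)              # rewind explicitly before reusing the buffer
buf.truncate(0)
buf.write("second page")
result = buf.getvalue()
```

Without the explicit `seek(0)`, the next write would start at the stale position instead of at the beginning of the buffer.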
orangain
b19e341ce2 Use text type for page text and encode it when writing to a file 2016-06-18 13:44:45 +09:00
orangain
6851fe4b3f Use // instead of / for integer division
Add `division` to future import.
2016-06-18 13:44:45 +09:00
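The difference between the two operators can be illustrated as below. The helper `chunk_count` is hypothetical and only shows a typical place where an integer result is required.

```python
# In Python 2 this behaviour needs `from __future__ import division` as the
# first statement of the module; in Python 3 it is the default.


def chunk_count(total, per_file):
    """Number of files needed to hold `total` items, `per_file` items each."""
    return (total + per_file - 1) // per_file


ratio = 7 / 2    # true division yields a float
whole = 7 // 2   # floor division yields an int
```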
orangain
9322b7ba54 Make imports Python 2/3 compatible
* Use `from __future__ import unicode_literals` and replace `u''` literals
  with `''`.
* Use `io.StringIO` instead of `cStringIO.StringIO` for Py2/3 compatibility.
* Define a const `PY2` which is True in Python 2 but False in Python 3.
* Import `quote` and `name2codepoint` from different modules between
  Python 2 and 3.
* Use Python 3's name in Python 2 for `zip`, `zip_longest`, `range` and `chr`.
* Use `text_type` as a type for `unicode` in Python 2 and `str` in Python 3.
* Use `sorted()` to sort dict's `items()`.
* Implement `__next__` in NextFile and call next() built-in function.
2016-06-18 13:44:45 +09:00
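Several of the points listed in this commit message can be sketched as a single compatibility shim. This is an illustration under assumptions rather than the commit's actual code; the module paths used are the standard-library locations in each Python version.

```python
import sys

# True under Python 2, False under Python 3.
PY2 = sys.version_info[0] == 2

if PY2:
    from urllib import quote
    from htmlentitydefs import name2codepoint
    from itertools import izip as zip, izip_longest as zip_longest
    range = xrange      # use the lazy Python 3 semantics under the same name
    chr = unichr        # chr() returns a text character, as in Python 3
    text_type = unicode
else:
    from urllib.parse import quote
    from html.entities import name2codepoint
    from itertools import zip_longest
    text_type = str
```

After this block the rest of a module can use `quote`, `name2codepoint`, `zip_longest` and `text_type` without checking the interpreter version again.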
attardi
60e4082440 See ChangeLog. 2016-03-23 15:14:13 +01:00
attardi
bcc3d124b4 See ChangeLog. 2016-03-19 11:36:24 +01:00
Sampo Pyysalo
7a5b5e5765 Match internal links in external links
See attardi/wikiextractor/issues/#55
2016-03-12 11:42:06 +00:00
attardi
6af9c283eb See ChangeLog. 2016-03-12 08:15:01 +01:00
attardi
ab0d008512 See ChangeLog. 2016-03-06 17:27:39 +01:00
attardi
3bdcf6a4ad Reduce spool queue to 10%. 2016-02-20 12:13:23 +01:00
attardi
0726948142 Typo. 2016-02-20 10:49:21 +01:00
attardi
730cfc07f9 See ChangeLog. 2016-02-20 10:45:58 +01:00
attardi
ca2a34ccce See ChangeLog. 2016-02-15 09:04:46 +01:00
attardi
6d0577ef10 See ChangeLog. 2016-02-15 01:22:38 +01:00
attardi
834cad6a35 See ChangeLog. 2016-02-12 23:31:21 +01:00