220 Commits

Author SHA1 Message Date
Zhiwei Chen
169eaaf208 remove noisy print 2017-04-29 12:53:19 -04:00
Zhiwei Chen
e249508255 log categories statistics info 2017-04-29 12:50:47 -04:00
Zhiwei Chen
397a92894b filter_categories use depth 4 under Health 2017-04-29 12:44:13 -04:00
Zhiwei Chen
5274829e16 print friendly error msg 2017-04-28 14:57:54 -04:00
Zhiwei Chen
cc04dae71c log save to file; log page statistic info; 2017-04-28 12:36:46 -04:00
root
b8323a8efc encoding fix 2017-04-28 02:04:29 -04:00
Zhiwei Chen
1f76fd9473 encoding fix 2017-04-28 01:53:46 -04:00
Zhiwei Chen
ef0af20178 fix category not utf8 error 2017-04-28 01:42:21 -04:00
Zhiwei Chen
52ed1ef9ae fix category not utf8 error 2017-04-28 01:23:31 -04:00
Zhiwei Chen
7903b739f5 fix category not utf8 error 2017-04-28 01:17:45 -04:00
Zhiwei Chen
8e92f464cf add readme 2017-04-27 20:15:17 -04:00
Zhiwei Chen
9cf2a2a883 add feature filtering by category of wiki 2017-04-27 19:57:41 -04:00
Giuseppe Attardi
2a5e6aebc0 Merge pull request #119 from BrenBarn/compact-lists
Fix problems that occurred when a list was the first thing in a section.
2017-03-08 12:10:04 +01:00
BrenBarn
674e9a0264 Fix problems that occurred when a list was the first thing in a section.
There were bugs that caused content to be dropped or erroneously included if a list was the first thing in a section and --lists was specified.  This change should fix #117 and #118.
2017-03-08 01:01:31 -08:00
Giuseppe Attardi
05cbe1502d Merge pull request #113 from nkruglikov/master
Update README.md
2017-03-04 13:26:26 +01:00
attardi
5414b7fda8 Completed module String 2017-03-04 04:22:30 +01:00
attardi
c9432abcd0 Define #ifexists 2017-03-03 19:44:48 +01:00
attardi
3ea2da809b Fix for empty templates. 2017-03-03 18:52:17 +01:00
Nikolai Kruglikov
aa6f567935 Update README.md 2017-03-03 18:56:20 +03:00
attardi
8fd8da77f4 Updated version number. 2017-03-02 05:58:05 +01:00
Giuseppe Attardi
e3edc0c352 Merge pull request #108 from BrenBarn/globals-cleanup
Globals cleanup
2017-02-27 02:08:09 +01:00
BrenBarn
e7bb889e0e Removed some old comments 2017-02-26 12:41:58 -08:00
BrenBarn
ff51a19a1d Change to NextFile test so it will pass on Windows (use os.path.sep instead of /) 2017-02-26 12:02:11 -08:00
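The portability point in the commit above can be sketched as follows. This is a minimal illustration, not the actual NextFile test from the repository; the helper name expected_path and the directory and file names are hypothetical.

```python
import os


def expected_path(directory, filename):
    """Build a path with the platform separator instead of a hard-coded '/'."""
    # os.path.sep is '/' on POSIX and '\\' on Windows, so the same
    # expectation holds on both platforms.
    return directory + os.path.sep + filename


# Hypothetical output directory and file name, for illustration only.
path = expected_path("out", "wiki_00")
```

Comparing against a string built this way (or with os.path.join) avoids a failure on Windows that a literal "out/wiki_00" would cause.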
BrenBarn
19d358eee8 Factor all info that needs to be passed to subprocesses into "options" variable
In order for things to work properly on Windows (and to make the communication between processes more clear in general), the parent process should communicate with the subprocess function only via its arguments, not via shared global variables.  This change takes all the global variables that used to be implicitly shared with the subprocess, and puts them into a single "options" object (a dict-like SimpleNamespace).  This object is then passed to the subprocess functions.

The "options" object includes not only the values of command-line arguments (e.g., "--no-templates") but also stuff like the "URL base" and template definitions that are precomputed before the main extraction begins.
2017-02-26 11:58:44 -08:00
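The idea described in the commit above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the commit: it uses the standard library's types.SimpleNamespace, the field names (expand_templates, url_base, templates), the helpers build_options and extract_page, and the example.org URL are all hypothetical. The pickle round trip stands in for the hand-off that multiprocessing performs when it starts a child without sharing the parent's memory, as on Windows.

```python
import pickle
from types import SimpleNamespace


def build_options(no_templates, url_base, templates):
    """Collect everything a worker needs into one object (hypothetical fields)."""
    return SimpleNamespace(
        expand_templates=not no_templates,  # derived from a command-line flag
        url_base=url_base,                  # precomputed before extraction starts
        templates=templates,                # precomputed template definitions
    )


def extract_page(options, page_id, title):
    """Worker function: reads only its arguments, never module-level globals."""
    url = "%s?curid=%d" % (options.url_base, page_id)
    return {"id": page_id, "url": url, "title": title,
            "expand": options.expand_templates}


options = build_options(no_templates=True,
                        url_base="https://example.org/wiki",
                        templates={})

# Simulate what a spawned child receives: a pickled copy of the arguments.
received = pickle.loads(pickle.dumps(options))
result = extract_page(received, 12, "Sample page")
```

Because the worker depends only on its arguments, it behaves the same whether the child process inherits the parent's memory or starts from a fresh interpreter.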
attardi
f6f80e2350 ignoredTags 2017-02-26 01:00:05 +01:00
attardi
5f1fb5c995 Declared global ignoredTags 2017-02-26 00:58:56 +01:00
attardi
25edeebafb Moved ignoredTags to top. 2017-02-26 00:53:48 +01:00
attardi
82196d1156 Define discardedElements 2017-02-26 00:39:26 +01:00
Giuseppe Attardi
db51f0b45c Merge pull request #105 from Cecca/master
Add json output
2017-02-11 18:12:25 +01:00
Matteo Ceccarello
7ae45fcff7 Add json output
This commit adds a new output format to the program, namely json. When
invoked with the --json flag, the program will write several files with
several pages per file (as before) with one json object per line
representing a single page.

The information contained in this json object is the same as in the
default format, but is somewhat more straightforward to parse for other
tools.

The running time is the same as the default format, as well as the
compressed output size. The following is a simple benchmark on an I5
machine using as input
enwiki-20170120-pages-meta-current1.xml-p000000010p000030303.bz2:

$ ./WikiExtractor.py -o xml --compress --no-templates input.bz2
INFO: Finished 3-process extraction of 15084 articles in 335.7s (44.9 art/s)
$ ./WikiExtractor.py -o json --json --compress --no-templates input.bz2
INFO: Finished 3-process extraction of 15084 articles in 336.8s (44.8 art/s)
$ du -sh json xml
69M     json
69M     xml
2017-02-10 20:13:27 +01:00
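A minimal sketch of how another tool might consume the output described in the commit above, assuming only what the message states: one JSON object per line, one page per object. The key names "id", "title" and "text" and the helper read_pages are assumptions made for illustration; the commit message does not list the fields. Files written with --compress would need to be decompressed before being read this way.

```python
import io
import json


def read_pages(lines):
    """Yield one dict per non-empty line of a JSON-lines stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Stand-in for one output file; the keys here are assumed, not documented.
sample = io.StringIO(
    '{"id": "1", "title": "First", "text": "Alpha."}\n'
    '{"id": "2", "title": "Second", "text": "Beta."}\n'
)

pages = list(read_pages(sample))
titles = [page["title"] for page in pages]
```

Any open text file object can be passed in place of the StringIO buffer, since read_pages only iterates over lines.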
attardi
ea9c368e52 Fix to use of this instead of self. 2017-02-06 08:36:00 +01:00
Giuseppe Attardi
c88ae9736d Merge pull request #102 from nathj07/ISSUE-101_discard_and_ignore_flags
Issue 101 discard and ignore flags and keep_tables
2017-02-02 15:31:03 +01:00
Nathan Davies
15e589e5cf putting the double line spacing back 2017-02-02 05:38:10 -08:00
Nathan Davies
b9c29e36bb Merge branch 'master' into ISSUE-101_discard_and_ignore_flags 2017-02-01 05:29:17 -08:00
Nathan Davies
12fb5e587d improved comment around regex 2017-02-01 05:26:44 -08:00
Nathan Davies
663a3dea73 tidying up some of the code and adding comments. 2017-01-31 16:52:59 -08:00
Giuseppe Attardi
ba616e2f85 Merge pull request #103 from xiaoling/lists
closes lists at empty lines and adds list item count
2017-01-28 10:26:12 +01:00
Xiao Ling
e646440185 closes lists at empty lines and adds list item count 2017-01-27 16:20:47 -08:00
Nathan Davies
3701b779b3 added extra replace section to tidy up output when retaining tables 2017-01-23 14:37:24 +00:00
Nathan Davies
e835e8c004 Added new flags
--discard_elements - allowing you to customise which elements are discarded
--ignored_tags - allowing you to customise which tags are ignored
--keep_tables - allows the contents of the tables in the original article to be retained. This does not render HTML tables
2017-01-23 14:18:21 +00:00
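A rough sketch of how flags like the three above could be declared with argparse. Only the flag names come from the commit message; the comma-separated value syntax, the helper names build_parser and split_list, and the sample element and tag names are assumptions for illustration and may not match the real command line.

```python
import argparse


def build_parser():
    """Declare the three flags; the value format is an assumption."""
    parser = argparse.ArgumentParser(description="hypothetical sketch")
    parser.add_argument("--discard_elements", default="",
                        help="comma-separated elements to discard (assumed syntax)")
    parser.add_argument("--ignored_tags", default="",
                        help="comma-separated tags to ignore (assumed syntax)")
    parser.add_argument("--keep_tables", action="store_true",
                        help="retain the contents of tables")
    return parser


def split_list(value):
    """Turn 'a, b,c' into ['a', 'b', 'c'], dropping empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


args = build_parser().parse_args(
    ["--discard_elements", "gallery,timeline",
     "--ignored_tags", "abbr, b",
     "--keep_tables"]
)
discard = split_list(args.discard_elements)
ignored = split_list(args.ignored_tags)
```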
Giuseppe Attardi
b5d97f64f9 Merge pull request #100 from orangain/fix-oom-in-single-core
Ensure that process_count >= 1
2017-01-15 14:38:19 +01:00
orangain
ddec683957 Ensure that process_count >= 1 2017-01-15 21:20:17 +09:00
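The guard named in the commit above can be sketched as follows. This is an assumption about the shape of the fix, not the actual change: it supposes the default worker count is derived as "CPUs minus one", which would yield zero on a single-core machine. The function name default_process_count is hypothetical.

```python
from multiprocessing import cpu_count


def default_process_count(cpus=None):
    """Return a worker count that is never below 1."""
    if cpus is None:
        cpus = cpu_count()
    # Assumed policy: leave one core for the parent process, but clamp
    # to 1 so a single-core machine still gets a worker.
    return max(1, cpus - 1)
```

With this clamp, default_process_count(1) yields 1 rather than 0, so extraction always has at least one worker to consume the input.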
attardi
7449ac95ba text_type 2017-01-15 10:09:40 +01:00
attardi
6660973646 See ChangeLog. 2017-01-15 09:08:35 +01:00
Giuseppe Attardi
e00eacb372 Update version number. 2017-01-04 13:07:10 -08:00
Giuseppe Attardi
ba74b992da Python 3 decode
Skip decode('utf-8') in Python 3.
2017-01-04 13:05:15 -08:00
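One way to express the idea in the commit above is to decode only when the value is still a byte string. This is a minimal sketch, not the code from the commit; the helper name to_text is hypothetical. In Python 3, text read from a file opened in text mode is already str and has no decode method, so calling decode unconditionally would fail.

```python
def to_text(value, encoding="utf-8"):
    """Return text, decoding only when given bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    # Already text (the normal case in Python 3): nothing to decode.
    return value


already_text = to_text(u"caf\u00e9")
from_bytes = to_text(u"caf\u00e9".encode("utf-8"))
```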
Giuseppe Attardi
87dbb62961 Merge pull request #94 from seong889/patch-1
Update README.md
2016-12-10 11:38:45 +01:00
Seongjun Hong
31f848d620 Update README.md
removed option : escapedoc
2016-12-08 15:48:05 +09:00
attardi
499997910c See ChangeLog. 2016-10-29 10:51:08 +02:00
attardi
f1989bcc11 See ChangeLog. 2016-10-29 10:19:28 +02:00