190 Commits

Author SHA1 Message Date
Tino Desjardins
e4abb4cbd0
Fix typo 2020-03-29 07:48:13 +02:00
Giuseppe Attardi
16186e290d
Update WikiExtractor.py
cgi.escape was deprecated and then removed in Python 3.8. Using html.escape is recommended.
2020-03-01 16:29:00 +01:00
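The switch from cgi.escape to html.escape described in 16186e290d can be sketched as follows. This is a minimal illustration, not the code from WikiExtractor.py; the helper name escape_text is hypothetical. One detail worth noting: html.escape escapes quotes by default, while cgi.escape did not, so passing quote=False is one way to keep the old behavior.

```python
import html


def escape_text(text):
    # Hypothetical helper, not taken from WikiExtractor.py.
    # quote=False mirrors the old cgi.escape default: only &, < and > are escaped.
    return html.escape(text, quote=False)


if __name__ == "__main__":
    print(escape_text('<a href="x">Tom & Jerry</a>'))
    # With the default quote=True, double and single quotes are escaped as well.
    print(html.escape('say "hi"'))
```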
Giuseppe Attardi
e3dca79742
Update WikiExtractor.py
WikiExtractor takes the contributor ID as revision ID.
2020-03-01 16:23:39 +01:00
Giuseppe Attardi
3162bb6c3c
Merge pull request #137 from AriesLL/master
change argument parser for no_templates
2019-04-13 12:41:15 +02:00
Giuseppe Attardi
29e3a932dd
Merge pull request #134 from dvzubarev/fix-crash
Fix crash on entry without namespace attribute.
2019-04-13 12:40:09 +02:00
Giuseppe Attardi
f859630a20
Merge branch 'master' into fix-crash 2019-04-13 12:39:41 +02:00
attardi
57a75c5f0a Merge branch 'nathj07-add_extra_fields_to_cirrus_output' 2019-04-13 12:37:17 +02:00
attardi
93cbcdb9df Merge branch 'add_extra_fields_to_cirrus_output' of https://github.com/nathj07/wikiextractor into nathj07-add_extra_fields_to_cirrus_output 2019-04-13 12:36:05 +02:00
attardi
baa4794842 Merge branch 'zwChan-master' 2019-04-13 12:22:59 +02:00
attardi
45c2212f64 Merge branch 'master' of https://github.com/zwChan/wikiextractor into zwChan-master 2019-04-13 12:19:36 +02:00
Giuseppe Attardi
5bf4df62fa
Merge pull request #143 from danduma/master
Bug fix for list items
2019-04-13 11:43:09 +02:00
Giuseppe Attardi
275dcc9ac5
Merge pull request #152 from karlstratos/master
minor regex improvement
2019-04-13 11:42:02 +02:00
Nathan Davies
1e4236de42 extract language and revision from cirrus search
This simple push extracts the language and the page revision. These are then added to the XML.
2019-03-25 14:28:43 +00:00
Karl Stratos
f9d57324c2 minimized complexity 2018-03-22 16:10:12 -05:00
Karl Stratos
ecc7cef402 do not include title in text 2018-03-22 12:51:47 -05:00
Karl
e689ef3233 bash scripts for extraction commands 2018-03-22 09:54:34 -05:00
Karl Stratos
4ba4e9f683 Augmented disambig regex to catch disambiguation pages marked by the switch __DISAMBIG__. Augmented key regex to catch plus/minus signs. 2018-03-17 09:10:40 -07:00
Daniel
45e56d4e9e
Update WikiExtractor.py
Fix bug where it fails with an exception when n="1" or n="A"
2017-11-08 14:28:36 +00:00
Peipei Zhou
209e2b422f change argument parser for no_templates 2017-08-10 14:51:54 -07:00
denin
24db54b2c8 Fix crash on entry without namespace attribute.
It occurs on enwiki-20170508-cirrussearch-content.json.gz
for entry with id AVQXnGH_62ewIKYZMTMP
2017-05-23 15:24:10 +03:00
Zhiwei Chen
169eaaf208 remove noisy print 2017-04-29 12:53:19 -04:00
Zhiwei Chen
e249508255 log categories statistics info 2017-04-29 12:50:47 -04:00
Zhiwei Chen
397a92894b filter_categories use depth 4 under Health 2017-04-29 12:44:13 -04:00
Zhiwei Chen
5274829e16 print friendly error msg 2017-04-28 14:57:54 -04:00
Zhiwei Chen
cc04dae71c log save to file; log page statistic info; 2017-04-28 12:36:46 -04:00
root
b8323a8efc encoding fix 2017-04-28 02:04:29 -04:00
Zhiwei Chen
1f76fd9473 encoding fix 2017-04-28 01:53:46 -04:00
Zhiwei Chen
ef0af20178 fix category not utf8 error 2017-04-28 01:42:21 -04:00
Zhiwei Chen
52ed1ef9ae fix category not utf8 error 2017-04-28 01:23:31 -04:00
Zhiwei Chen
7903b739f5 fix category not utf8 error 2017-04-28 01:17:45 -04:00
Zhiwei Chen
8e92f464cf add readme 2017-04-27 20:15:17 -04:00
Zhiwei Chen
9cf2a2a883 add feature filtering by category of wiki 2017-04-27 19:57:41 -04:00
Giuseppe Attardi
2a5e6aebc0 Merge pull request #119 from BrenBarn/compact-lists
Fix problems that occurred when a list was the first thing in a section.
2017-03-08 12:10:04 +01:00
BrenBarn
674e9a0264 Fix problems that occurred when a list was the first thing in a section.
There were bugs that caused content to be dropped or erroneously included if a list was the first thing in a section and --lists was specified.  This change should fix #117 and #118.
2017-03-08 01:01:31 -08:00
Giuseppe Attardi
05cbe1502d Merge pull request #113 from nkruglikov/master
Update README.md
2017-03-04 13:26:26 +01:00
attardi
5414b7fda8 Completed module String 2017-03-04 04:22:30 +01:00
attardi
c9432abcd0 Define #ifexists 2017-03-03 19:44:48 +01:00
attardi
3ea2da809b Fix for empty templates. 2017-03-03 18:52:17 +01:00
Nikolai Kruglikov
aa6f567935 Update README.md 2017-03-03 18:56:20 +03:00
attardi
8fd8da77f4 Updated version number. 2017-03-02 05:58:05 +01:00
Giuseppe Attardi
e3edc0c352 Merge pull request #108 from BrenBarn/globals-cleanup
Globals cleanup
2017-02-27 02:08:09 +01:00
BrenBarn
e7bb889e0e Removed some old comments 2017-02-26 12:41:58 -08:00
BrenBarn
ff51a19a1d Change to NextFile test so it will pass on Windows (use os.path.sep instead of /) 2017-02-26 12:02:11 -08:00
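The portability issue behind ff51a19a1d can be illustrated with a short sketch. It assumes a hypothetical helper, output_path, and a hypothetical file-naming scheme; neither is taken from the NextFile class itself. The point is only that a test comparing against a hard-coded "/" fails on Windows, while one built from os.path.sep passes on both platforms.

```python
import os


def output_path(directory, index):
    # Hypothetical helper: builds a path with the platform's own separator.
    return os.path.join(directory, "wiki_%02d" % index)


if __name__ == "__main__":
    path = output_path("AA", 3)
    # Portable check: build the expected value with os.path.sep, not "/".
    assert path == "AA" + os.path.sep + "wiki_03"
    print(path)
```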
BrenBarn
19d358eee8 Factor all info that needs to be passed to subprocesses into "options" variable
In order for things to work properly on Windows (and to make the communication between processes more clear in general), the parent process should communicate with the subprocess function only via its arguments, not via shared global variables.  This change takes all the global variables that used to be implicitly shared with the subprocess, and puts them into a single "options" object (a dict-like SimpleNamespace).  This object is then passed to the subprocess functions.

The "options" object includes not only the values of command-line arguments (e.g., "--no-templates") but also stuff like the "URL base" and template definitions that are precomputed before the main extraction begins.
2017-02-26 11:58:44 -08:00
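The pattern described in 19d358eee8 can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the field names keep_templates and url_base and the functions make_options and page_url are hypothetical. The worker reads its configuration only from the "options" argument, so it behaves the same whether the platform forks the parent process or starts a fresh interpreter, as Windows does.

```python
from multiprocessing import Pool
from types import SimpleNamespace


def make_options():
    # Hypothetical fields; the real object carries command-line values
    # plus precomputed data such as the URL base and template definitions.
    return SimpleNamespace(keep_templates=False,
                           url_base="https://en.wikipedia.org/wiki")


def page_url(options, title):
    # The worker uses only its arguments, never module-level globals.
    return options.url_base + "/" + title.replace(" ", "_")


if __name__ == "__main__":
    options = make_options()
    titles = ["Anarchism", "Alan Turing"]
    with Pool(2) as pool:
        # The options object is pickled and sent along with each task.
        urls = pool.starmap(page_url, [(options, t) for t in titles])
    print(urls)
```

A SimpleNamespace holding plain values can be pickled, which is what allows it to be sent to worker processes as an ordinary argument.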
attardi
f6f80e2350 ignoredTags 2017-02-26 01:00:05 +01:00
attardi
5f1fb5c995 Declared global ignoredTags 2017-02-26 00:58:56 +01:00
attardi
25edeebafb Moved ignoredTags to top. 2017-02-26 00:53:48 +01:00
attardi
82196d1156 Define discardedElements 2017-02-26 00:39:26 +01:00
Giuseppe Attardi
db51f0b45c Merge pull request #105 from Cecca/master
Add json output
2017-02-11 18:12:25 +01:00
Matteo Ceccarello
7ae45fcff7 Add json output
This commit adds a new output format to the program, namely json. When
invoked with the --json flag, the program will write several files with
several pages per file (as before) with one json object per line
representing a single page.

The information contained in this json object is the same as in the
default format, but is somewhat more straightforward to parse for other
tools.

The running time is the same as the default format, as well as the
compressed output size. The following is a simple benchmark on an I5
machine using as input
enwiki-20170120-pages-meta-current1.xml-p000000010p000030303.bz2:

$ ./WikiExtractor.py -o xml --compress --no-templates input.bz2
INFO: Finished 3-process extraction of 15084 articles in 335.7s (44.9 art/s)
$ ./WikiExtractor.py -o json --json --compress --no-templates input.bz2
INFO: Finished 3-process extraction of 15084 articles in 336.8s (44.8 art/s)
$ du -sh json xml
69M     json
69M     xml
2017-02-10 20:13:27 +01:00
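Output written with the --json flag from 7ae45fcff7 holds one JSON object per line, so a consumer can parse it line by line. The sketch below assumes hypothetical field names (id, title, text) and a hypothetical helper, read_pages; the commit message does not list the actual fields.

```python
import json


def read_pages(lines):
    # Yield one dict per non-empty line; each line is a complete JSON object.
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical sample records; the field names are assumptions.
sample = [
    '{"id": "12", "title": "Anarchism", "text": "..."}',
    '{"id": "25", "title": "Autism", "text": "..."}',
]
titles = [page["title"] for page in read_pages(sample)]

if __name__ == "__main__":
    print(titles)
```

For a real output file, the list of sample strings would be replaced by an open file object, which also iterates line by line.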