Giuseppe Attardi
79f76061e3
Update README.md
2020-07-22 11:39:15 +02:00
Giuseppe Attardi
6e9cac2357
Update README.md
2020-07-22 11:38:18 +02:00
Giuseppe Attardi
f8282ab410
Create LICENSE
2020-07-22 11:34:16 +02:00
Giuseppe Attardi
16186e290d
Update WikiExtractor.py
...
cgi.escape was deprecated and removed in Python 3.8. Using html.escape is recommended.
2020-03-01 16:29:00 +01:00
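The swap described in this commit can be sketched as follows. This is a minimal illustration, not the code from WikiExtractor.py; the helper name `escape_text` is hypothetical. One detail to watch: `html.escape` also escapes quotes by default, whereas the old `cgi.escape` did not, so `quote=False` is passed here to keep the earlier behaviour.

```python
import html


def escape_text(text: str) -> str:
    """Escape &, < and > for XML/HTML output (hypothetical helper).

    quote=False mirrors the old cgi.escape default, which left
    single and double quotes untouched.
    """
    return html.escape(text, quote=False)


# With the default quote=True, html.escape would also rewrite quotes.
sample = 'a < b & "c"'
escaped_no_quotes = escape_text(sample)
escaped_with_quotes = html.escape(sample)
```

Whether the quote-escaping difference matters depends on where the escaped text ends up (element content versus attribute values).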
Giuseppe Attardi
e3dca79742
Update WikiExtractor.py
...
WikiExtractor takes the contributor ID as revision ID.
2020-03-01 16:23:39 +01:00
Giuseppe Attardi
3162bb6c3c
Merge pull request #137 from AriesLL/master
...
change argument parser for no_templates
2019-04-13 12:41:15 +02:00
Giuseppe Attardi
29e3a932dd
Merge pull request #134 from dvzubarev/fix-crash
...
Fix crash on entry without namespace attribute.
2019-04-13 12:40:09 +02:00
Giuseppe Attardi
f859630a20
Merge branch 'master' into fix-crash
2019-04-13 12:39:41 +02:00
attardi
57a75c5f0a
Merge branch 'nathj07-add_extra_fields_to_cirrus_output'
2019-04-13 12:37:17 +02:00
attardi
93cbcdb9df
Merge branch 'add_extra_fields_to_cirrus_output' of https://github.com/nathj07/wikiextractor into nathj07-add_extra_fields_to_cirrus_output
2019-04-13 12:36:05 +02:00
attardi
baa4794842
Merge branch 'zwChan-master'
2019-04-13 12:22:59 +02:00
attardi
45c2212f64
Merge branch 'master' of https://github.com/zwChan/wikiextractor into zwChan-master
2019-04-13 12:19:36 +02:00
Giuseppe Attardi
5bf4df62fa
Merge pull request #143 from danduma/master
...
Bug fix for list items
2019-04-13 11:43:09 +02:00
Giuseppe Attardi
275dcc9ac5
Merge pull request #152 from karlstratos/master
...
minor regex improvement
2019-04-13 11:42:02 +02:00
Nathan Davies
1e4236de42
extract language and revision from cirrus search
...
This simple push extracts the language and the page revision. These are then added to the XML
2019-03-25 14:28:43 +00:00
Karl Stratos
f9d57324c2
minimized complexity
2018-03-22 16:10:12 -05:00
Karl Stratos
ecc7cef402
do not include title in text
2018-03-22 12:51:47 -05:00
Karl
e689ef3233
bash scripts for extraction commands
2018-03-22 09:54:34 -05:00
Karl Stratos
4ba4e9f683
Augmented disambig regex to catch disambiguation pages marked by the switch __DISAMBIG__. Augmented key regex to catch plus/minus signs.
2018-03-17 09:10:40 -07:00
Daniel
45e56d4e9e
Update WikiExtractor.py
...
Fix bug where it fails with an exception when n="1" or n="A"
2017-11-08 14:28:36 +00:00
Peipei Zhou
209e2b422f
change argument parser for no_templates
2017-08-10 14:51:54 -07:00
denin
24db54b2c8
Fix crash on entry without namespace attribute.
...
It occurs on enwiki-20170508-cirrussearch-content.json.gz
for entry with id AVQXnGH_62ewIKYZMTMP
2017-05-23 15:24:10 +03:00
Zhiwei Chen
169eaaf208
remove noisy print
2017-04-29 12:53:19 -04:00
Zhiwei Chen
e249508255
log categories statistics info
2017-04-29 12:50:47 -04:00
Zhiwei Chen
397a92894b
filter_categories use depth 4 under Health
2017-04-29 12:44:13 -04:00
Zhiwei Chen
5274829e16
print friendly error msg
2017-04-28 14:57:54 -04:00
Zhiwei Chen
cc04dae71c
log save to file; log page statistic info;
2017-04-28 12:36:46 -04:00
root
b8323a8efc
encoding fix
2017-04-28 02:04:29 -04:00
Zhiwei Chen
1f76fd9473
encoding fix
2017-04-28 01:53:46 -04:00
Zhiwei Chen
ef0af20178
fix category not utf8 error
2017-04-28 01:42:21 -04:00
Zhiwei Chen
52ed1ef9ae
fix category not utf8 error
2017-04-28 01:23:31 -04:00
Zhiwei Chen
7903b739f5
fix category not utf8 error
2017-04-28 01:17:45 -04:00
Zhiwei Chen
8e92f464cf
add readme
2017-04-27 20:15:17 -04:00
Zhiwei Chen
9cf2a2a883
add feature filtering by category of wiki
2017-04-27 19:57:41 -04:00
Giuseppe Attardi
2a5e6aebc0
Merge pull request #119 from BrenBarn/compact-lists
...
Fix problems that occurred when a list was the first thing in a section.
2017-03-08 12:10:04 +01:00
BrenBarn
674e9a0264
Fix problems that occurred when a list was the first thing in a section.
...
There were bugs that caused content to be dropped or erroneously included if a list was the first thing in a section and --lists was specified. This change should fix #117 and #118.
2017-03-08 01:01:31 -08:00
Giuseppe Attardi
05cbe1502d
Merge pull request #113 from nkruglikov/master
...
Update README.md
2017-03-04 13:26:26 +01:00
attardi
5414b7fda8
Completed module String
2017-03-04 04:22:30 +01:00
attardi
c9432abcd0
Define #ifexists
2017-03-03 19:44:48 +01:00
attardi
3ea2da809b
Fix for empty templates.
2017-03-03 18:52:17 +01:00
Nikolai Kruglikov
aa6f567935
Update README.md
2017-03-03 18:56:20 +03:00
attardi
8fd8da77f4
Updated version number.
2017-03-02 05:58:05 +01:00
Giuseppe Attardi
e3edc0c352
Merge pull request #108 from BrenBarn/globals-cleanup
...
Globals cleanup
2017-02-27 02:08:09 +01:00
BrenBarn
e7bb889e0e
Removed some old comments
2017-02-26 12:41:58 -08:00
BrenBarn
ff51a19a1d
Change to NextFile test so it will pass on Windows (use os.path.sep instead of /)
2017-02-26 12:02:11 -08:00
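A rough sketch of the kind of portability fix this commit describes: a test that hard-codes `/` in an expected path fails on Windows, while one built from `os.path.sep` passes on both. The function `output_path` and the `AA`/`wiki_00` naming are assumptions for illustration, not the actual NextFile code.

```python
import os


def output_path(directory: str, index: int) -> str:
    """Build an output file path (hypothetical stand-in for NextFile)."""
    return os.path.join(directory, f"wiki_{index:02d}")


# Portable expectation: use the platform separator rather than a literal "/".
expected = "AA" + os.path.sep + "wiki_00"
actual = output_path("AA", 0)
```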
BrenBarn
19d358eee8
Factor all info that needs to be passed to subprocesses into "options" variable
...
In order for things to work properly on Windows (and to make the communication between processes more clear in general), the parent process should communicate with the subprocess function only via its arguments, not via shared global variables. This change takes all the global variables that used to be implicitly shared with the subprocess, and puts them into a single "options" object (a dict-like SimpleNamespace). This object is then passed to the subprocess functions.
The "options" object includes not only the values of command-line arguments (e.g., "--no-templates") but also stuff like the "URL base" and template definitions that are precomputed before the main extraction begins.
2017-02-26 11:58:44 -08:00
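The pattern described in this commit might look roughly like the sketch below. The field names (`expand_templates`, `url_base`, `templates`) and the functions `extract_page` and `extract_all` are hypothetical; only the idea of bundling settings into a `SimpleNamespace` and passing it explicitly to the worker comes from the commit message.

```python
from functools import partial
from multiprocessing import Pool
from types import SimpleNamespace

# Hypothetical settings: command-line flags plus values precomputed
# before extraction starts, gathered in one picklable object.
options = SimpleNamespace(
    expand_templates=True,
    url_base="https://en.wikipedia.org/wiki",
    templates={},
)


def extract_page(options, title):
    """Worker: reads everything it needs from its arguments, not from globals."""
    url = f"{options.url_base}/{title.replace(' ', '_')}"
    mode = "expanded" if options.expand_templates else "raw"
    return f"{title}\t{url}\t{mode}"


def extract_all(options, titles, processes=2):
    """Run the worker in subprocesses, shipping `options` along with each job."""
    worker = partial(extract_page, options)
    with Pool(processes) as pool:
        return pool.map(worker, titles)

# On Windows, call extract_all(...) only under `if __name__ == "__main__":`,
# because child processes re-import the main module.
```

Since Windows starts subprocesses by spawning a fresh interpreter, module-level globals set at runtime in the parent are not visible to the child; passing a single object as an argument sidesteps that.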
attardi
f6f80e2350
ignoredTags
2017-02-26 01:00:05 +01:00
attardi
5f1fb5c995
Declared global ignoredTags
2017-02-26 00:58:56 +01:00
attardi
25edeebafb
Moved ignoredTags to top.
2017-02-26 00:53:48 +01:00
attardi
82196d1156
Define discardedElements
2017-02-26 00:39:26 +01:00