206 Commits

Author SHA1 Message Date
attardi
8ef37c87e2 Fix to script invocation. 2020-12-04 11:31:15 +01:00
attardi
3179a4c393 Import default_template. 2020-12-04 09:52:05 +01:00
Giuseppe Attardi
87549a91a6
Create python-publish.yml
Action to publish package to PyPi.
2020-11-27 16:48:03 +01:00
attardi
3eeffcb470 Put back the missing scripts. 2020-10-05 12:07:41 +02:00
Giuseppe Attardi
834e384cbe
Merge pull request #214 from AndyTheFactory/patch-1
Update README.md
2020-07-24 18:27:18 +02:00
Andy ThePHPFactory
0577d75e71
Update README.md
Without the camel case, I get "No module named wikiextractor.Wikiextractor" in Python 3 on Windows 10.
2020-07-24 00:32:51 +03:00
Giuseppe Attardi
6beefe96df
Update extract.sh 2020-07-22 18:29:37 +02:00
attardi
6675c69fcc Created PyPi release. 2020-07-22 18:25:35 +02:00
attardi
b13d447d93 Removed scripts directory. 2020-07-22 15:03:20 +02:00
attardi
62bdbe6106 Upgrade to Python 3.3+. 2020-07-22 14:12:37 +02:00
Giuseppe Attardi
6408a430fc
Merge pull request #183 from albertvillanova/fix-encoding
Force 'utf-8' encoding without relying on platform-dependent default
2020-07-22 12:18:17 +02:00
Giuseppe Attardi
08985cab49
Merge pull request #205 from TDesjardins/patch-1
Fix typo
2020-07-22 12:14:36 +02:00
Giuseppe Attardi
79f76061e3
Update README.md 2020-07-22 11:39:15 +02:00
Giuseppe Attardi
6e9cac2357
Update README.md 2020-07-22 11:38:18 +02:00
Giuseppe Attardi
f8282ab410
Create LICENSE 2020-07-22 11:34:16 +02:00
Tino Desjardins
e4abb4cbd0
Fix typo 2020-03-29 07:48:13 +02:00
Giuseppe Attardi
16186e290d
Update WikiExtractor.py
cgi.escape was deprecated and removed in Python 3.8. Using html.escape is recommended.
2020-03-01 16:29:00 +01:00
Giuseppe Attardi
e3dca79742
Update WikiExtractor.py
WikiExtractor takes the contributor ID as revision ID.
2020-03-01 16:23:39 +01:00
Albert Villanova del Moral
ff9a70cd6d Force 'utf-8' encoding without relying on platform-dependent default
On Windows, the default encoding is 'cp1252' and this raises a UnicodeDecodeError.

Fix #89 #144 #165
2019-07-13 18:21:43 +02:00
Giuseppe Attardi
3162bb6c3c
Merge pull request #137 from AriesLL/master
change argument parser for no_templates
2019-04-13 12:41:15 +02:00
Giuseppe Attardi
29e3a932dd
Merge pull request #134 from dvzubarev/fix-crash
Fix crash on entry without namespace attribute.
2019-04-13 12:40:09 +02:00
Giuseppe Attardi
f859630a20
Merge branch 'master' into fix-crash 2019-04-13 12:39:41 +02:00
attardi
57a75c5f0a Merge branch 'nathj07-add_extra_fields_to_cirrus_output' 2019-04-13 12:37:17 +02:00
attardi
93cbcdb9df Merge branch 'add_extra_fields_to_cirrus_output' of https://github.com/nathj07/wikiextractor into nathj07-add_extra_fields_to_cirrus_output 2019-04-13 12:36:05 +02:00
attardi
baa4794842 Merge branch 'zwChan-master' 2019-04-13 12:22:59 +02:00
attardi
45c2212f64 Merge branch 'master' of https://github.com/zwChan/wikiextractor into zwChan-master 2019-04-13 12:19:36 +02:00
Giuseppe Attardi
5bf4df62fa
Merge pull request #143 from danduma/master
Bug fix for list items
2019-04-13 11:43:09 +02:00
Giuseppe Attardi
275dcc9ac5
Merge pull request #152 from karlstratos/master
minor regex improvement
2019-04-13 11:42:02 +02:00
Nathan Davies
1e4236de42 extract language and revision from cirrus search
This simple push extracts the language and the page revision. These are then added to the XML.
2019-03-25 14:28:43 +00:00
Karl Stratos
f9d57324c2 minimized complexity 2018-03-22 16:10:12 -05:00
Karl Stratos
ecc7cef402 do not include title in text 2018-03-22 12:51:47 -05:00
Karl
e689ef3233 bash scripts for extraction commands 2018-03-22 09:54:34 -05:00
Karl Stratos
4ba4e9f683 Augmented disambig regex to catch disambiguation pages marked by the switch __DISAMBIG__. Augmented key regex to catch plus/minus signs. 2018-03-17 09:10:40 -07:00
Daniel
45e56d4e9e
Update WikiExtractor.py
Fix bug where it fails with an exception when n="1" or n="A"
2017-11-08 14:28:36 +00:00
Peipei Zhou
209e2b422f change argument parser for no_templates 2017-08-10 14:51:54 -07:00
denin
24db54b2c8 Fix crash on entry without namespace attribute.
It occurs on enwiki-20170508-cirrussearch-content.json.gz
for entry with id AVQXnGH_62ewIKYZMTMP
2017-05-23 15:24:10 +03:00
Zhiwei Chen
169eaaf208 remove noisy print 2017-04-29 12:53:19 -04:00
Zhiwei Chen
e249508255 log categories statistics info 2017-04-29 12:50:47 -04:00
Zhiwei Chen
397a92894b filter_categories use depth 4 under Health 2017-04-29 12:44:13 -04:00
Zhiwei Chen
5274829e16 print friendly error msg 2017-04-28 14:57:54 -04:00
Zhiwei Chen
cc04dae71c log save to file; log page statistic info; 2017-04-28 12:36:46 -04:00
root
b8323a8efc encoding fix 2017-04-28 02:04:29 -04:00
Zhiwei Chen
1f76fd9473 encoding fix 2017-04-28 01:53:46 -04:00
Zhiwei Chen
ef0af20178 fix category not utf8 error 2017-04-28 01:42:21 -04:00
Zhiwei Chen
52ed1ef9ae fix category not utf8 error 2017-04-28 01:23:31 -04:00
Zhiwei Chen
7903b739f5 fix category not utf8 error 2017-04-28 01:17:45 -04:00
Zhiwei Chen
8e92f464cf add readme 2017-04-27 20:15:17 -04:00
Zhiwei Chen
9cf2a2a883 add feature filtering by category of wiki 2017-04-27 19:57:41 -04:00
Giuseppe Attardi
2a5e6aebc0 Merge pull request #119 from BrenBarn/compact-lists
Fix problems that occurred when a list was the first thing in a section.
2017-03-08 12:10:04 +01:00
BrenBarn
674e9a0264 Fix problems that occurred when a list was the first thing in a section.
There were bugs that caused content to be dropped or erroneously included if a list was the first thing in a section and --lists was specified.  This change should fix #117 and #118.
2017-03-08 01:01:31 -08:00