attardi
|
1c57c06596
|
Fix to urllib.parse.quote.
|
2020-12-16 17:13:58 +01:00 |
|
attardi
|
ca5c40f68b
|
Use param escape_doc.
|
2020-12-05 20:37:12 +01:00 |
|
attardi
|
3150f604e9
|
Use bz2.open.
|
2020-12-05 20:13:46 +01:00 |
|
attardi
|
a2e078f3be
|
Better handling of encoding.
|
2020-12-05 19:27:01 +01:00 |
|
attardi
|
0933664d70
|
Fully working on python 3.
|
2020-12-05 11:21:46 +01:00 |
|
attardi
|
5b4302bca0
|
Merge branch 'master' of https://github.com/attardi/wikiextractor
|
2020-12-05 09:08:38 +01:00 |
|
attardi
|
9b5c1cb238
|
Fix to load_templates.
|
2020-12-05 09:07:43 +01:00 |
|
Giuseppe Attardi
|
d2732b1477
|
Update README.md
|
2020-12-04 19:12:01 +01:00 |
|
attardi
|
2ba214ab99
|
Fix NameError #225.
|
2020-12-04 18:43:22 +01:00 |
|
attardi
|
8ef37c87e2
|
Fix to script invocation.
|
2020-12-04 11:31:15 +01:00 |
|
attardi
|
3179a4c393
|
Import default_template.
|
2020-12-04 09:52:05 +01:00 |
|
Giuseppe Attardi
|
87549a91a6
|
Create python-publish.yml
Action to publish package to PyPI.
|
2020-11-27 16:48:03 +01:00 |
|
attardi
|
3eeffcb470
|
Put back the missing scripts.
|
2020-10-05 12:07:41 +02:00 |
|
Giuseppe Attardi
|
834e384cbe
|
Merge pull request #214 from AndyTheFactory/patch-1
Update README.md
|
2020-07-24 18:27:18 +02:00 |
|
Andy ThePHPFactory
|
0577d75e71
|
Update README.md
Without the camel case, I get "No module named wikiextractor.Wikiextractor" in Python 3 on Windows 10.
|
2020-07-24 00:32:51 +03:00 |
|
Giuseppe Attardi
|
6beefe96df
|
Update extract.sh
|
2020-07-22 18:29:37 +02:00 |
|
attardi
|
6675c69fcc
|
Created PyPI release.
|
2020-07-22 18:25:35 +02:00 |
|
attardi
|
b13d447d93
|
Removed scripts directory.
|
2020-07-22 15:03:20 +02:00 |
|
attardi
|
62bdbe6106
|
Upgrade to Python 3.3+.
|
2020-07-22 14:12:37 +02:00 |
|
Giuseppe Attardi
|
6408a430fc
|
Merge pull request #183 from albertvillanova/fix-encoding
Force 'utf-8' encoding without relying on platform-dependent default
|
2020-07-22 12:18:17 +02:00 |
|
Giuseppe Attardi
|
08985cab49
|
Merge pull request #205 from TDesjardins/patch-1
Fix typo
|
2020-07-22 12:14:36 +02:00 |
|
Giuseppe Attardi
|
79f76061e3
|
Update README.md
|
2020-07-22 11:39:15 +02:00 |
|
Giuseppe Attardi
|
6e9cac2357
|
Update README.md
|
2020-07-22 11:38:18 +02:00 |
|
Giuseppe Attardi
|
f8282ab410
|
Create LICENSE
|
2020-07-22 11:34:16 +02:00 |
|
Tino Desjardins
|
e4abb4cbd0
|
Fix typo
|
2020-03-29 07:48:13 +02:00 |
|
Giuseppe Attardi
|
16186e290d
|
Update WikiExtractor.py
cgi.escape was deprecated and removed in Python 3.8. Using html.escape is recommended.
|
2020-03-01 16:29:00 +01:00 |
|
Giuseppe Attardi
|
e3dca79742
|
Update WikiExtractor.py
WikiExtractor takes the contributor ID as revision ID.
|
2020-03-01 16:23:39 +01:00 |
|
Albert Villanova del Moral
|
ff9a70cd6d
|
Force 'utf-8' encoding without relying on platform-dependent default
On Windows, the default encoding is 'cp1252' and this raises a UnicodeDecodeError.
Fix #89 #144 #165
|
2019-07-13 18:21:43 +02:00 |
|
Giuseppe Attardi
|
3162bb6c3c
|
Merge pull request #137 from AriesLL/master
change argument parser for no_templates
|
2019-04-13 12:41:15 +02:00 |
|
Giuseppe Attardi
|
29e3a932dd
|
Merge pull request #134 from dvzubarev/fix-crash
Fix crash on entry without namespace attribute.
|
2019-04-13 12:40:09 +02:00 |
|
Giuseppe Attardi
|
f859630a20
|
Merge branch 'master' into fix-crash
|
2019-04-13 12:39:41 +02:00 |
|
attardi
|
57a75c5f0a
|
Merge branch 'nathj07-add_extra_fields_to_cirrus_output'
|
2019-04-13 12:37:17 +02:00 |
|
attardi
|
93cbcdb9df
|
Merge branch 'add_extra_fields_to_cirrus_output' of https://github.com/nathj07/wikiextractor into nathj07-add_extra_fields_to_cirrus_output
|
2019-04-13 12:36:05 +02:00 |
|
attardi
|
baa4794842
|
Merge branch 'zwChan-master'
|
2019-04-13 12:22:59 +02:00 |
|
attardi
|
45c2212f64
|
Merge branch 'master' of https://github.com/zwChan/wikiextractor into zwChan-master
|
2019-04-13 12:19:36 +02:00 |
|
Giuseppe Attardi
|
5bf4df62fa
|
Merge pull request #143 from danduma/master
Bug fix for list items
|
2019-04-13 11:43:09 +02:00 |
|
Giuseppe Attardi
|
275dcc9ac5
|
Merge pull request #152 from karlstratos/master
minor regex improvement
|
2019-04-13 11:42:02 +02:00 |
|
Nathan Davies
|
1e4236de42
|
Extract language and revision from cirrus search
This simple push extracts the language and the page revision. These are then added to the XML.
|
2019-03-25 14:28:43 +00:00 |
|
Karl Stratos
|
f9d57324c2
|
minimized complexity
|
2018-03-22 16:10:12 -05:00 |
|
Karl Stratos
|
ecc7cef402
|
do not include title in text
|
2018-03-22 12:51:47 -05:00 |
|
Karl
|
e689ef3233
|
bash scripts for extraction commands
|
2018-03-22 09:54:34 -05:00 |
|
Karl Stratos
|
4ba4e9f683
|
Augmented disambig regex to catch disambiguation pages marked by the switch __DISAMBIG__. Augmented key regex to catch plus/minus signs.
|
2018-03-17 09:10:40 -07:00 |
|
Daniel
|
45e56d4e9e
|
Update WikiExtractor.py
Fix bug where it fails with an exception when n="1" or n="A"
|
2017-11-08 14:28:36 +00:00 |
|
Peipei Zhou
|
209e2b422f
|
change argument parser for no_templates
|
2017-08-10 14:51:54 -07:00 |
|
denin
|
24db54b2c8
|
Fix crash on entry without namespace attribute.
It occurs on enwiki-20170508-cirrussearch-content.json.gz
for entry with id AVQXnGH_62ewIKYZMTMP
|
2017-05-23 15:24:10 +03:00 |
|
Zhiwei Chen
|
169eaaf208
|
remove noisy print
|
2017-04-29 12:53:19 -04:00 |
|
Zhiwei Chen
|
e249508255
|
log categories statistics info
|
2017-04-29 12:50:47 -04:00 |
|
Zhiwei Chen
|
397a92894b
|
filter_categories use depth 4 under Health
|
2017-04-29 12:44:13 -04:00 |
|
Zhiwei Chen
|
5274829e16
|
print friendly error msg
|
2017-04-28 14:57:54 -04:00 |
|
Zhiwei Chen
|
cc04dae71c
|
Save log to file; log page statistics info.
|
2017-04-28 12:36:46 -04:00 |
|
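The encoding fix recorded in commit ff9a70cd6d (forcing 'utf-8' instead of relying on the platform default) can be illustrated with a minimal sketch. The helper names `write_text_utf8` and `read_text_utf8` are hypothetical and do not come from the WikiExtractor source; they only show the general pattern of passing `encoding` explicitly to `open`, assuming the input is UTF-8 text.

```python
import os
import tempfile


def write_text_utf8(path, text):
    """Write text as UTF-8, ignoring the platform default encoding."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def read_text_utf8(path):
    """Read text as UTF-8, ignoring the platform default encoding."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


if __name__ == "__main__":
    # Round-trip a string containing non-ASCII characters.
    path = os.path.join(tempfile.mkdtemp(), "sample.txt")
    write_text_utf8(path, "Ángel café")
    assert read_text_utf8(path) == "Ángel café"

    # The UTF-8 bytes of "Á" include 0x81, which cp1252 leaves undefined,
    # so decoding them as cp1252 raises UnicodeDecodeError.
    try:
        "Á".encode("utf-8").decode("cp1252")
    except UnicodeDecodeError as exc:
        print("cp1252 decode failed:", exc)
```

Without the explicit `encoding` argument, `open` falls back to a locale-dependent default, which is why the same file may read correctly on one platform and fail on another.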