224 Commits

Author SHA1 Message Date
Giuseppe Attardi
f4f9d2d008
Update WikiExtractor.py
Version 3.0.6
2021-10-14 14:29:26 +02:00
Giuseppe Attardi
ae7898a92e
Update README.md 2021-10-14 13:46:00 +02:00
Giuseppe Attardi
1053fe2030
Update WikiExtractor.py
Fix for TypeError: cannot pickle '_io.TextIOWrapper' object on MacOS.
Allow -b0 for saving a single article per file.
2021-10-14 13:41:27 +02:00
Giuseppe Attardi
0242d58c26
Update README.md 2021-10-14 10:49:46 +02:00
Giuseppe Attardi
881f3e4252
Merge pull request #241 from JonasTriki/fix-compressed-file-write
Add minor fix to OutputSplitter.write()
2021-03-18 01:14:59 +01:00
Jonas Triki
df435cad4d Explicitly use utf-8 when encoding data 2020-12-29 21:06:49 +01:00
Jonas Triki
0458f0c1fd Add minor fix to OutputSplitter.write() 2020-12-29 15:00:55 +01:00
attardi
6490f5361d Added back option --json. 2020-12-17 13:17:34 +01:00
attardi
95ddfaa451 Fix to urlbase. 2020-12-16 18:10:28 +01:00
attardi
1c57c06596 Fix to urllib.parse.quote. 2020-12-16 17:13:58 +01:00
attardi
ca5c40f68b Use param escape_doc. 2020-12-05 20:37:12 +01:00
attardi
3150f604e9 Use bz2.open. 2020-12-05 20:13:46 +01:00
attardi
a2e078f3be Better handling of encoding. 2020-12-05 19:27:01 +01:00
attardi
0933664d70 Fully working on python 3. 2020-12-05 11:21:46 +01:00
attardi
5b4302bca0 Merge branch 'master' of https://github.com/attardi/wikiextractor 2020-12-05 09:08:38 +01:00
attardi
9b5c1cb238 Fix to load_templates. 2020-12-05 09:07:43 +01:00
Giuseppe Attardi
d2732b1477
Update README.md 2020-12-04 19:12:01 +01:00
attardi
2ba214ab99 Fix NameError #225. 2020-12-04 18:43:22 +01:00
attardi
8ef37c87e2 Fix to script invocation. 2020-12-04 11:31:15 +01:00
attardi
3179a4c393 Import default_template. 2020-12-04 09:52:05 +01:00
Giuseppe Attardi
87549a91a6
Create python-publish.yml
Action to publish package to PyPi.
2020-11-27 16:48:03 +01:00
attardi
3eeffcb470 Put back the missing scripts. 2020-10-05 12:07:41 +02:00
Giuseppe Attardi
834e384cbe
Merge pull request #214 from AndyTheFactory/patch-1
Update README.md
2020-07-24 18:27:18 +02:00
Andy ThePHPFactory
0577d75e71
Update README.md
Without the camelcase I get "No module named wikiextractor.Wikiextractor" in Python 3 on Windows 10.
2020-07-24 00:32:51 +03:00
Giuseppe Attardi
6beefe96df
Update extract.sh 2020-07-22 18:29:37 +02:00
attardi
6675c69fcc Created PyPi release. 2020-07-22 18:25:35 +02:00
attardi
b13d447d93 Removed scripts directory. 2020-07-22 15:03:20 +02:00
attardi
62bdbe6106 Upgrade to Python 3.3+. 2020-07-22 14:12:37 +02:00
Giuseppe Attardi
6408a430fc
Merge pull request #183 from albertvillanova/fix-encoding
Force 'utf-8' encoding without relying on platform-dependent default
2020-07-22 12:18:17 +02:00
Giuseppe Attardi
08985cab49
Merge pull request #205 from TDesjardins/patch-1
Fix typo
2020-07-22 12:14:36 +02:00
Giuseppe Attardi
79f76061e3
Update README.md 2020-07-22 11:39:15 +02:00
Giuseppe Attardi
6e9cac2357
Update README.md 2020-07-22 11:38:18 +02:00
Giuseppe Attardi
f8282ab410
Create LICENSE 2020-07-22 11:34:16 +02:00
Tino Desjardins
e4abb4cbd0
Fix typo 2020-03-29 07:48:13 +02:00
Giuseppe Attardi
16186e290d
Update WikiExtractor.py
cgi.escape was deprecated and removed in Python 3.8. Using html.escape is recommended.
2020-03-01 16:29:00 +01:00
Giuseppe Attardi
e3dca79742
Update WikiExtractor.py
WikiExtractor takes the contributor ID as revision ID.
2020-03-01 16:23:39 +01:00
Albert Villanova del Moral
ff9a70cd6d Force 'utf-8' encoding without relying on platform-dependent default
On Windows, the default encoding is 'cp1252' and this raises a UnicodeDecodeError.

Fix #89 #144 #165
2019-07-13 18:21:43 +02:00
Giuseppe Attardi
3162bb6c3c
Merge pull request #137 from AriesLL/master
change argument parser for no_templates
2019-04-13 12:41:15 +02:00
Giuseppe Attardi
29e3a932dd
Merge pull request #134 from dvzubarev/fix-crash
Fix crash on entry without namespace attribute.
2019-04-13 12:40:09 +02:00
Giuseppe Attardi
f859630a20
Merge branch 'master' into fix-crash 2019-04-13 12:39:41 +02:00
attardi
57a75c5f0a Merge branch 'nathj07-add_extra_fields_to_cirrus_output' 2019-04-13 12:37:17 +02:00
attardi
93cbcdb9df Merge branch 'add_extra_fields_to_cirrus_output' of https://github.com/nathj07/wikiextractor into nathj07-add_extra_fields_to_cirrus_output 2019-04-13 12:36:05 +02:00
attardi
baa4794842 Merge branch 'zwChan-master' 2019-04-13 12:22:59 +02:00
attardi
45c2212f64 Merge branch 'master' of https://github.com/zwChan/wikiextractor into zwChan-master 2019-04-13 12:19:36 +02:00
Giuseppe Attardi
5bf4df62fa
Merge pull request #143 from danduma/master
Bug fix for list items
2019-04-13 11:43:09 +02:00
Giuseppe Attardi
275dcc9ac5
Merge pull request #152 from karlstratos/master
minor regex improvement
2019-04-13 11:42:02 +02:00
Nathan Davies
1e4236de42 Extract language and revision from Cirrus search
This simple push extracts the language and the page revision. These are then added to the XML.
2019-03-25 14:28:43 +00:00
Karl Stratos
f9d57324c2 minimized complexity 2018-03-22 16:10:12 -05:00
Karl Stratos
ecc7cef402 do not include title in text 2018-03-22 12:51:47 -05:00
Karl
e689ef3233 bash scripts for extraction commands 2018-03-22 09:54:34 -05:00
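
Two of the fixes recorded above are easy to illustrate in isolation: replacing cgi.escape with html.escape (16186e290d) and forcing 'utf-8' instead of relying on the platform default encoding (ff9a70cd6d). The following is only a minimal sketch of those two ideas, not the code from WikiExtractor.py; the function names escape_text and roundtrip_utf8 are hypothetical and chosen for illustration.

```python
import html
import os
import tempfile


def escape_text(text: str) -> str:
    """Escape &, < and > for XML/HTML output.

    html.escape replaces cgi.escape, which was removed in Python 3.8.
    quote=False keeps quotes untouched, matching cgi.escape's old default.
    """
    return html.escape(text, quote=False)


def roundtrip_utf8(path: str, text: str) -> str:
    """Write and re-read text with an explicit encoding.

    Without encoding="utf-8", open() uses the platform default
    (for example cp1252 on Windows), which can raise UnicodeDecodeError
    or UnicodeEncodeError on non-ASCII article text.
    """
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, encoding="utf-8") as f:
        return f.read()


if __name__ == "__main__":
    print(escape_text('a < b & "c"'))
    with tempfile.TemporaryDirectory() as tmp:
        print(roundtrip_utf8(os.path.join(tmp, "sample.txt"), "Zürich 東京"))
```

Passing the encoding explicitly at every open() call keeps the output identical across Linux, macOS and Windows, which is the behaviour the encoding-related commits appear to be aiming for.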