attardi
8ef37c87e2
Fix to script invocation.
2020-12-04 11:31:15 +01:00
attardi
3179a4c393
Import default_template.
2020-12-04 09:52:05 +01:00
Giuseppe Attardi
87549a91a6
Create python-publish.yml
...
Action to publish package to PyPI.
2020-11-27 16:48:03 +01:00
attardi
3eeffcb470
Put back the missing scripts.
2020-10-05 12:07:41 +02:00
Giuseppe Attardi
834e384cbe
Merge pull request #214 from AndyTheFactory/patch-1
...
Update README.md
2020-07-24 18:27:18 +02:00
Andy ThePHPFactory
0577d75e71
Update README.md
...
Without the camel case, I get "No module named wikiextractor.Wikiextractor" in Python 3 on Windows 10.
2020-07-24 00:32:51 +03:00
Giuseppe Attardi
6beefe96df
Update extract.sh
2020-07-22 18:29:37 +02:00
attardi
6675c69fcc
Created PyPI release.
2020-07-22 18:25:35 +02:00
attardi
b13d447d93
Removed scripts directory.
2020-07-22 15:03:20 +02:00
attardi
62bdbe6106
Upgrade to Python 3.3+.
2020-07-22 14:12:37 +02:00
Giuseppe Attardi
6408a430fc
Merge pull request #183 from albertvillanova/fix-encoding
...
Force 'utf-8' encoding without relying on platform-dependent default
2020-07-22 12:18:17 +02:00
Giuseppe Attardi
08985cab49
Merge pull request #205 from TDesjardins/patch-1
...
Fix typo
2020-07-22 12:14:36 +02:00
Giuseppe Attardi
79f76061e3
Update README.md
2020-07-22 11:39:15 +02:00
Giuseppe Attardi
6e9cac2357
Update README.md
2020-07-22 11:38:18 +02:00
Giuseppe Attardi
f8282ab410
Create LICENSE
2020-07-22 11:34:16 +02:00
Tino Desjardins
e4abb4cbd0
Fix typo
2020-03-29 07:48:13 +02:00
Giuseppe Attardi
16186e290d
Update WikiExtractor.py
...
cgi.escape was deprecated and removed in Python 3.8. Using html.escape is recommended.
2020-03-01 16:29:00 +01:00
Giuseppe Attardi
e3dca79742
Update WikiExtractor.py
...
WikiExtractor takes the contributor ID as revision ID.
2020-03-01 16:23:39 +01:00
Albert Villanova del Moral
ff9a70cd6d
Force 'utf-8' encoding without relying on platform-dependent default
...
On Windows, the default encoding is 'cp1252' and this raises a UnicodeDecodeError.
Fix #89 #144 #165
2019-07-13 18:21:43 +02:00
Giuseppe Attardi
3162bb6c3c
Merge pull request #137 from AriesLL/master
...
change argument parser for no_templates
2019-04-13 12:41:15 +02:00
Giuseppe Attardi
29e3a932dd
Merge pull request #134 from dvzubarev/fix-crash
...
Fix crash on entry without namespace attribute.
2019-04-13 12:40:09 +02:00
Giuseppe Attardi
f859630a20
Merge branch 'master' into fix-crash
2019-04-13 12:39:41 +02:00
attardi
57a75c5f0a
Merge branch 'nathj07-add_extra_fields_to_cirrus_output'
2019-04-13 12:37:17 +02:00
attardi
93cbcdb9df
Merge branch 'add_extra_fields_to_cirrus_output' of https://github.com/nathj07/wikiextractor into nathj07-add_extra_fields_to_cirrus_output
2019-04-13 12:36:05 +02:00
attardi
baa4794842
Merge branch 'zwChan-master'
2019-04-13 12:22:59 +02:00
attardi
45c2212f64
Merge branch 'master' of https://github.com/zwChan/wikiextractor into zwChan-master
2019-04-13 12:19:36 +02:00
Giuseppe Attardi
5bf4df62fa
Merge pull request #143 from danduma/master
...
Bug fix for list items
2019-04-13 11:43:09 +02:00
Giuseppe Attardi
275dcc9ac5
Merge pull request #152 from karlstratos/master
...
minor regex improvement
2019-04-13 11:42:02 +02:00
Nathan Davies
1e4236de42
extract language and revision from cirrus search
...
This simple push extracts the language and the page revision. These are then added to the XML.
2019-03-25 14:28:43 +00:00
Karl Stratos
f9d57324c2
minimized complexity
2018-03-22 16:10:12 -05:00
Karl Stratos
ecc7cef402
do not include title in text
2018-03-22 12:51:47 -05:00
Karl
e689ef3233
bash scripts for extraction commands
2018-03-22 09:54:34 -05:00
Karl Stratos
4ba4e9f683
Augmented disambig regex to catch disambiguation pages marked by the switch __DISAMBIG__. Augmented key regex to catch plus/minus signs.
2018-03-17 09:10:40 -07:00
Daniel
45e56d4e9e
Update WikiExtractor.py
...
Fix bug where it fails with an exception when n="1" or n="A"
2017-11-08 14:28:36 +00:00
Peipei Zhou
209e2b422f
change argument parser for no_templates
2017-08-10 14:51:54 -07:00
denin
24db54b2c8
Fix crash on entry without namespace attribute.
...
It occurs on enwiki-20170508-cirrussearch-content.json.gz
for entry with id AVQXnGH_62ewIKYZMTMP
2017-05-23 15:24:10 +03:00
Zhiwei Chen
169eaaf208
remove noisy print
2017-04-29 12:53:19 -04:00
Zhiwei Chen
e249508255
log categories statistics info
2017-04-29 12:50:47 -04:00
Zhiwei Chen
397a92894b
filter_categories use depth 4 under Health
2017-04-29 12:44:13 -04:00
Zhiwei Chen
5274829e16
print friendly error msg
2017-04-28 14:57:54 -04:00
Zhiwei Chen
cc04dae71c
save log to file; log page statistics info
2017-04-28 12:36:46 -04:00
root
b8323a8efc
encoding fix
2017-04-28 02:04:29 -04:00
Zhiwei Chen
1f76fd9473
encoding fix
2017-04-28 01:53:46 -04:00
Zhiwei Chen
ef0af20178
fix category not utf8 error
2017-04-28 01:42:21 -04:00
Zhiwei Chen
52ed1ef9ae
fix category not utf8 error
2017-04-28 01:23:31 -04:00
Zhiwei Chen
7903b739f5
fix category not utf8 error
2017-04-28 01:17:45 -04:00
Zhiwei Chen
8e92f464cf
add readme
2017-04-27 20:15:17 -04:00
Zhiwei Chen
9cf2a2a883
add feature filtering by category of wiki
2017-04-27 19:57:41 -04:00
Giuseppe Attardi
2a5e6aebc0
Merge pull request #119 from BrenBarn/compact-lists
...
Fix problems that occurred when a list was the first thing in a section.
2017-03-08 12:10:04 +01:00
BrenBarn
674e9a0264
Fix problems that occurred when a list was the first thing in a section.
...
There were bugs that caused content to be dropped or erroneously included if a list was the first thing in a section and --lists was specified. This change should fix #117 and #118.
2017-03-08 01:01:31 -08:00