attardi
6f22be4702
Merge branch 'master' of https://github.com/attardi/wikiextractor
2016-02-11 13:21:14 +01:00
attardi
911dacda3a
Added emulation of Lua module If_empty.
2016-02-11 13:20:23 +01:00
Giuseppe Attardi
36f9467c33
Merge pull request #48 from rom1504/patch-1
...
fix typo in Wikipedia Cirrus Extractor section
2016-02-11 10:28:47 +01:00
Romain Beaumont
3eb4c4e3c3
fix typo in Wikipedia Cirrus Extractor section
2016-02-11 09:55:55 +01:00
attardi
8dcf73bd3e
Merge branch 'master' of https://github.com/attardi/wikiextractor
2016-02-11 01:04:25 +01:00
attardi
b2c371678c
See ChangeLog.
2016-02-11 01:03:31 +01:00
Giuseppe Attardi
c8afa84e95
Merge pull request #46 from mrshu/mrshu/add-setup
...
update: Add setup.py
2016-02-06 02:42:52 +01:00
mr.Shu
22103664fc
update: Add setup.py
...
* Add the first version of setup.py in order to simplify creation of a
real `wikiextractor` command.
Signed-off-by: mr.Shu <mr@shu.io>
2016-02-05 23:56:36 +01:00
attardi
fc89e2514e
See ChangeLog.
2016-02-04 11:23:40 +01:00
attardi
49464c0210
Merge branch 'master' of https://github.com/attardi/wikiextractor
2016-02-04 11:09:31 +01:00
attardi
3cebfdd4c0
Updated Copyright.
2016-02-04 11:08:37 +01:00
Giuseppe Attardi
0bb3061e79
Update README.md
2015-12-03 13:00:12 +01:00
Giuseppe Attardi
a412c7e3ab
Merge pull request #37 from nathj07/escape_extracted_text
...
added a new flag and its usage
2015-12-02 14:47:11 +01:00
Giuseppe Attardi
03e18ffbc8
See ChangeLog.
2015-11-20 00:34:23 +01:00
Giuseppe Attardi
285b119370
Remove DEBUG.
2015-11-20 00:07:50 +01:00
Giuseppe Attardi
113dab796c
Fixed.
2015-11-20 00:06:23 +01:00
Giuseppe Attardi
e5720f5c52
See ChangeLog.
2015-11-20 00:04:59 +01:00
Nathan Davies
811d32e98d
Updating the README with new help
2015-11-13 08:32:32 -08:00
Nathan Davies
d1e21c2b6a
added a new flag and its usage
...
The new flag is --escapedoc; if set, the clean function runs cgi.escape(text) before returning the text to be included in <doc></doc>.
This is a non-breaking change.
2015-11-13 03:01:44 -08:00
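The escaping step described in this commit can be sketched as follows. This is a minimal illustration, not the project's actual code: `clean` and `wrap_doc` here are hypothetical stand-ins, and `html.escape` is used in place of `cgi.escape`, which was removed from the standard library in Python 3.8.

```python
import html


def clean(text, escape_doc=False):
    """Hypothetical stand-in for the extractor's clean step."""
    text = text.strip()
    if escape_doc:
        # html.escape replaces cgi.escape (removed in Python 3.8).
        # quote=False mirrors cgi.escape's default of leaving quotes untouched.
        text = html.escape(text, quote=False)
    return text


def wrap_doc(text, escape_doc=False):
    """Wrap cleaned text in <doc> tags (attributes omitted for brevity)."""
    return "<doc>\n%s\n</doc>" % clean(text, escape_doc)


if __name__ == "__main__":
    print(wrap_doc("a < b & c", escape_doc=True))
```

Because escaping is off by default, output produced without the flag is unchanged, which is consistent with the commit's non-breaking claim.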
Giuseppe Attardi
9229e50bb3
Update README.md
2015-10-25 17:03:17 +01:00
orangain
02f9561100
Dropped redundant global declarations.
2015-10-17 11:48:16 +02:00
orangain
09a968809e
Compliance to PEP 8.
2015-10-17 11:34:24 +02:00
Giuseppe Attardi
d7fc4788f4
See ChangeLog.
2015-09-29 15:31:19 +02:00
Giuseppe Attardi
90d1c1ebcf
See ChangeLog.
2015-09-14 20:24:10 +02:00
Giuseppe Attardi
ecd24f3fc6
See ChangeLog.
2015-09-14 18:05:36 +02:00
Giuseppe Attardi
b8cd2574e0
See ChangeLog.
2015-08-30 21:52:02 +02:00
Giuseppe Attardi
bebdc8c899
Removed extra logging.debug.
2015-08-30 21:23:15 +02:00
Giuseppe Attardi
70956025f1
Minor fix contribution.
2015-08-30 21:18:17 +02:00
Giuseppe Attardi
d5b354597f
See ChangeLog
2015-08-30 21:17:26 +02:00
Giuseppe Attardi
b7e676e1f5
Merge pull request #31 from orangain/fix-progress-report
...
Fix progress report
2015-08-13 10:25:28 +02:00
orangain
3cfa6dcee8
Fix progress report
...
Reported count and rate of processing were wrong:
* Reported number of extracted articles was fewer than the true value by 1.
* Reported rate of processing was completely different from the true value.
2015-08-13 01:11:57 +09:00
Giuseppe Attardi
5057c130cc
Merge pull request #29 from Munzey/master
...
fix for #28 - discardElement tags should be case insensitive
2015-06-24 18:20:36 +02:00
tristan
7a1b552b0c
fix for #28 - discardElement tags should be case insensitive
2015-06-24 17:55:20 +02:00
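One way to make discarded tags case insensitive is to compile the matching pattern with `re.IGNORECASE`. The sketch below assumes a simple regex-based removal; the tag list and the `drop_elements` helper are hypothetical and do not reproduce the project's own routine.

```python
import re

# Hypothetical list of elements to discard; the real list lives in the extractor.
DISCARD_ELEMENTS = ["gallery", "timeline"]


def drop_elements(text, tags=DISCARD_ELEMENTS):
    """Remove <tag>...</tag> spans regardless of the tag's letter case."""
    for tag in tags:
        name = re.escape(tag)
        pattern = re.compile(
            r"<\s*%s\b[^>]*>.*?<\s*/\s*%s\s*>" % (name, name),
            re.DOTALL | re.IGNORECASE,  # IGNORECASE also matches <Gallery>, <GALLERY>
        )
        text = pattern.sub("", text)
    return text
```

This sketch does not handle nested elements of the same name; it only shows the effect of the flag.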
Giuseppe Attardi
d8a15dd0ba
Merge pull request #27 from gojomo/multiprocessing
...
multiprocess speedup; stdin/stdout/single-file options; stable ordering; sparser progress logging
2015-06-20 08:58:10 +02:00
Gordon Mohr
55beb4a426
restore default section-handling
2015-06-19 18:06:34 -07:00
Gordon Mohr
d420d729e7
more summary/timing logging; less bulk/repeat logging
2015-06-19 17:42:21 -07:00
Gordon Mohr
5b647e2249
stable ordering; skip dups; accept compressed template-file
2015-06-19 03:15:45 -07:00
Gordon Mohr
190aae11a1
raise the default number of processes to the number of cores
2015-06-18 14:52:20 -07:00
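A default tied to the core count can be expressed with `multiprocessing.cpu_count()`. The parser below is a hedged sketch: the `--processes` option name and help text are assumptions for illustration, not a copy of the project's argument handling.

```python
import argparse
from multiprocessing import cpu_count


def build_parser():
    """Build a parser whose worker count defaults to the number of CPU cores."""
    parser = argparse.ArgumentParser(description="illustrative option parser")
    parser.add_argument(
        "--processes",  # assumed option name
        type=int,
        default=cpu_count(),
        help="number of worker processes (default: number of cores)",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print("using %d processes" % args.processes)
```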
Gordon Mohr
e3515e2ecf
single-file; stdout; dir/multi-file
2015-06-18 14:49:03 -07:00
Gordon Mohr
5d32701400
messy 1st approach
2015-06-17 18:49:26 -07:00
Giuseppe Attardi
694cd5a7f4
Merge pull request #25 from dragoon/patch-2
...
multiline tag match fix
2015-06-14 09:12:30 +02:00
Roman Prokofyev
70ee947a8b
multiline tag match fix
...
Add re.DOTALL so that multiline tag definitions are also matched.
2015-06-12 14:23:14 +02:00
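The effect of `re.DOTALL` on a tag whose definition spans lines can be seen in a small sketch. The pattern here is a simplified assumption, not the regex from the commit.

```python
import re

# A tag whose attributes continue on the next line.
text = '<ref\n name="x">note</ref> body'

# Without DOTALL, '.' does not match '\n', so the opening tag is missed.
without_dotall = re.compile(r"<ref.*?>")
# With DOTALL, '.' also matches newlines, so the whole opening tag is found.
with_dotall = re.compile(r"<ref.*?>", re.DOTALL)

if __name__ == "__main__":
    print(without_dotall.search(text))
    print(with_dotall.search(text).group(0))
```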
Giuseppe Attardi
625d4b69b3
Merge pull request #24 from dragoon/patch-1
...
Fix regex for <ref> tag when it's not self-closing
2015-06-10 18:40:48 +02:00
Roman Prokofyev
b99aaf19aa
Fix regex for <ref> tag when it's not self-closing
...
In some articles <ref> tag appears like this:
<ref name="Ahmed Rashid/The Telegraph">{{cite
The previous regex breaks when it sees the forward slash ("Rashid/The"). The new regex stops at the earliest occurrence of the closing bracket, so there is no need to pre-filter characters.
2015-06-10 15:40:27 +02:00
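The difference between pre-filtering characters and stopping at the first closing bracket can be sketched with two patterns. Both are hypothetical reconstructions for illustration; the commit's actual before and after regexes are not shown in this log.

```python
import re

sample = '<ref name="Ahmed Rashid/The Telegraph">{{cite'

# Assumed "before" style: forbid '/' inside the tag to avoid self-closing <ref/>.
# A slash inside an attribute value then prevents any match.
prefiltered = re.compile(r"<ref[^/>]*>")

# Assumed "after" style: lazily consume anything up to the first '>'.
lazy = re.compile(r"<ref.*?>", re.DOTALL)

if __name__ == "__main__":
    print(prefiltered.search(sample))
    print(lazy.search(sample).group(0))
```

Note that the lazy form on its own would also match a self-closing `<ref ... />`, so a real pattern needs to tell the two cases apart by other means.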
Giuseppe Attardi
c15d93c40a
See ChangeLog.
2015-06-03 00:06:35 +02:00
Giuseppe Attardi
147e36df5b
See ChangeLog.
2015-06-03 00:01:45 +02:00
Giuseppe Attardi
5b0d88a16c
See ChangeLog.
2015-05-29 20:52:27 +02:00
Giuseppe Attardi
f041f9143f
Merge branch 'master' of https://github.com/attardi/wikiextractor
2015-05-06 16:09:12 +02:00
Giuseppe Attardi
d5cca5da43
See ChangeLog.
2015-05-06 16:08:27 +02:00
Giuseppe Attardi
45b4658e72
Update README.md
2015-04-26 08:57:25 +02:00