Commit Graph

138 Commits

Author SHA1 Message Date
Nathan Davies
15e589e5cf putting the double line spacing back 2017-02-02 05:38:10 -08:00
Nathan Davies
b9c29e36bb Merge branch 'master' into ISSUE-101_discard_and_ignore_flags 2017-02-01 05:29:17 -08:00
Nathan Davies
12fb5e587d improved comment around regex 2017-02-01 05:26:44 -08:00
Nathan Davies
663a3dea73 tidying up some of the code and adding comments. 2017-01-31 16:52:59 -08:00
Giuseppe Attardi
ba616e2f85 Merge pull request #103 from xiaoling/lists
closes lists at empty lines and adds list item count
2017-01-28 10:26:12 +01:00
Xiao Ling
e646440185 closes lists at empty lines and adds list item count 2017-01-27 16:20:47 -08:00
Nathan Davies
3701b779b3 added extra replace section to tidy up output when retaining tables 2017-01-23 14:37:24 +00:00
Nathan Davies
e835e8c004 Added new flags
--discard_elements - allowing you to customise which elements are discarded
--ignored_tags - allowing you to customise which tags are ignored
--keep_tables - allows the contents of the tables in the original article to be retained. This does not render HTML tables
2017-01-23 14:18:21 +00:00
Giuseppe Attardi
b5d97f64f9 Merge pull request #100 from orangain/fix-oom-in-single-core
Ensure that process_count >= 1
2017-01-15 14:38:19 +01:00
orangain
ddec683957 Ensure that process_count >= 1 2017-01-15 21:20:17 +09:00
attardi
7449ac95ba text_type 2017-01-15 10:09:40 +01:00
attardi
6660973646 See ChangeLog. 2017-01-15 09:08:35 +01:00
Giuseppe Attardi
e00eacb372 Update version number. 2017-01-04 13:07:10 -08:00
Giuseppe Attardi
ba74b992da Python 3 decode
Skip decode('utf-8') in Python 3.
2017-01-04 13:05:15 -08:00
Giuseppe Attardi
87dbb62961 Merge pull request #94 from seong889/patch-1
Update README.md
2016-12-10 11:38:45 +01:00
Seongjun Hong
31f848d620 Update README.md
removed option : escapedoc
2016-12-08 15:48:05 +09:00
attardi
499997910c See ChangeLog. 2016-10-29 10:51:08 +02:00
attardi
f1989bcc11 See ChangeLog. 2016-10-29 10:19:28 +02:00
attardi
ce600138f4 See ChangeLog 2016-10-29 10:03:07 +02:00
Giuseppe Attardi
ff12b69e09 Merge pull request #82 from sente/patch-2
Fix typo in README.md
2016-10-29 09:08:58 +02:00
attardi
636c9ea9f4 See ChangeLog. 2016-08-31 09:00:51 +02:00
attardi
e6a051c949 See ChangeLog. 2016-08-30 18:17:14 +02:00
Stuart Powers
fe0ac18fb9 Fix typo in README.md
"preprocesssng" -> "preprocessing"
2016-08-30 00:41:21 -04:00
attardi
d167742d16 See ChangeLog. 2016-08-29 23:34:47 +02:00
attardi
5cb7da320e See ChangeLog. 2016-08-19 14:11:37 +02:00
attardi
2942d1e19d See ChangeLog. 2016-08-11 10:34:21 +02:00
Giuseppe Attardi
419fe97d7a Merge pull request #67 from sethcleveland/remove_python2_string_encoding
Remove python2 extract utf8 encoding and log extract exceptions
2016-06-24 09:03:11 +02:00
Seth Cleveland
eacccbc6eb Remove python2 extract utf8 encoding and log extract exceptions 2016-06-20 10:39:01 -05:00
attardi
0f703c0aae Merged PR from Seth Cleveland. 2016-06-19 13:10:36 +02:00
attardi
f9b8e8ac02 Added support for Python 3. 2016-06-19 12:53:31 +02:00
Giuseppe Attardi
aee387b566 Merge pull request #66 from orangain/support-python3
Support Python 3
2016-06-19 12:46:52 +02:00
orangain
3ccd368aa6 Update README.md about Python 2/3 support 2016-06-18 13:44:45 +09:00
orangain
cb7d42d10f Add tox.ini 2016-06-18 13:44:45 +09:00
orangain
7b21d10ccd Seek StringIO to position 0 on every output
truncate(0) does not guarantee that the position is seeked to 0.
2016-06-18 13:44:45 +09:00
orangain
b19e341ce2 Use text type as a page text and encode them when writing to a file 2016-06-18 13:44:45 +09:00
orangain
6851fe4b3f Use // instead of / for integer division
Add `division` to future import.
2016-06-18 13:44:45 +09:00
orangain
9322b7ba54 Make imports Python 2/3 compatible
* Use `from __future__ import unicode_literals` and replace `u''` literals
  with `''`.
* Use `io.StringIO` instead of `cStringIO.StringIO` for Py2/3 compatibility.
* Define a const `PY2` which is True in Python 2 but False in Python 3.
* Import `quote` and `name2codepoint` from different modules between
  Python 2 and 3.
* Use Python 3's name in Python 2 for `zip`, `zip_longest`, `range` and `chr`.
* Use `text_type` as a type for `unicode` in Python 2 and `str` in Python 3.
* Use `sorted()` to sort dict's `items()`.
* Implement `__next__` in NextFile and call next() built-in function.
2016-06-18 13:44:45 +09:00
orangain
8749df0a81 Add .gitignore for python 2016-06-18 13:44:45 +09:00
orangain
d44b8130b5 Add test cases for Python 2 2016-06-18 13:44:45 +09:00
attardi
60e4082440 See ChangeLog. 2016-03-23 15:14:13 +01:00
attardi
bcc3d124b4 See ChangeLog. 2016-03-19 11:36:24 +01:00
Giuseppe Attardi
9521b90c08 Merge pull request #56 from spyysalo/master
Match internal links in external links (fixes #55)
2016-03-12 12:48:50 +01:00
Sampo Pyysalo
7a5b5e5765 Match internal links in external links
See attardi/wikiextractor/issues/#55
2016-03-12 11:42:06 +00:00
attardi
6af9c283eb See ChangeLog. 2016-03-12 08:15:01 +01:00
attardi
ab0d008512 See ChangeLog. 2016-03-06 17:27:39 +01:00
attardi
3bdcf6a4ad Reduce spool queue to 10%. 2016-02-20 12:13:23 +01:00
attardi
0726948142 Typo. 2016-02-20 10:49:21 +01:00
attardi
730cfc07f9 See ChangeLog. 2016-02-20 10:45:58 +01:00
attardi
ca2a34ccce See ChangeLog. 2016-02-15 09:04:46 +01:00
attardi
6d0577ef10 See ChangeLog. 2016-02-15 01:22:38 +01:00