Commit Graph

  • f6f80e2350 ignoredTags attardi 2017-02-26 01:00:05 +0100
  • 5f1fb5c995 Declared global ignoredTags attardi 2017-02-26 00:58:56 +0100
  • 25edeebafb Moved ignoredTags to top. attardi 2017-02-26 00:53:48 +0100
  • 82196d1156 Define discardedElements attardi 2017-02-26 00:39:26 +0100
  • db51f0b45c Merge pull request #105 from Cecca/master Giuseppe Attardi 2017-02-11 18:12:25 +0100
  • 7ae45fcff7 Add json output Matteo Ceccarello 2017-02-10 10:36:04 +0100
  • ea9c368e52 Fix to use of this instead of self. attardi 2017-02-06 08:36:00 +0100
  • c88ae9736d Merge pull request #102 from nathj07/ISSUE-101_discard_and_ignore_flags Giuseppe Attardi 2017-02-02 15:31:03 +0100
  • 15e589e5cf putting the double line spacing back Nathan Davies 2017-02-02 05:38:10 -0800
  • b9c29e36bb Merge branch 'master' into ISSUE-101_discard_and_ignore_flags Nathan Davies 2017-02-01 05:29:17 -0800
  • 12fb5e587d improved comment around regex Nathan Davies 2017-02-01 05:26:44 -0800
  • 663a3dea73 tidying up some of the code and adding comments. Nathan Davies 2017-01-31 16:52:59 -0800
  • 118b37b2ca Merge b93cd67e31 into ba616e2f85 Adrian Englhardt 2017-01-28 18:29:07 +0000
  • ba616e2f85 Merge pull request #103 from xiaoling/lists Giuseppe Attardi 2017-01-28 10:26:12 +0100
  • e646440185 closes lists at empty lines and adds list item count Xiao Ling 2017-01-27 16:20:43 -0800
  • b93cd67e31 Merge branch 'master' into plain-output Adrian Englhardt 2017-01-24 19:52:53 +0100
  • 3701b779b3 added extra replace section to tidy up output when retaining tables Nathan Davies 2017-01-23 14:37:24 +0000
  • e835e8c004 Added new flags Nathan Davies 2017-01-23 14:18:21 +0000
  • b5d97f64f9 Merge pull request #100 from orangain/fix-oom-in-single-core Giuseppe Attardi 2017-01-15 14:38:19 +0100
  • ddec683957 Ensure that process_count >= 1 orangain 2017-01-15 16:46:54 +0900
  • 7449ac95ba text_type attardi 2017-01-15 10:09:40 +0100
  • 6660973646 See ChangeLog. attardi 2017-01-15 09:08:35 +0100
  • bf9360b51e Merge 17bde88194 into e00eacb372 orangain 2017-01-14 15:42:11 +0000
  • 17bde88194 Decode binary types even in Python 3 orangain 2017-01-15 00:16:02 +0900
  • 2673316c21 Merge 4f0c226e7e into e00eacb372 Leonid Boytsov 2017-01-08 03:26:03 +0000
  • e00eacb372 Update version number. Giuseppe Attardi 2017-01-04 13:07:10 -0800
  • ba74b992da Python 3 decode Giuseppe Attardi 2017-01-04 13:05:15 -0800
  • 4f0c226e7e Let's keep paragraph separators. searchivarius 2016-12-30 12:33:26 -0500
  • 87dbb62961 Merge pull request #94 from seong889/patch-1 Giuseppe Attardi 2016-12-10 11:38:45 +0100
  • 31f848d620 Update README.md Seongjun Hong 2016-12-08 15:48:05 +0900
  • e2a3b27003 Added '--no-doc' and '--no-title' option englhardt 2016-11-27 11:50:27 +0100
  • 499997910c See ChangeLog. attardi 2016-10-29 10:51:08 +0200
  • f1989bcc11 See ChangeLog. attardi 2016-10-29 10:19:28 +0200
  • 27a6bee638 Merge 0e234c9138 into ce600138f4 Josh Newman 2016-10-29 08:06:21 +0000
  • ce600138f4 See ChangeLog attardi 2016-10-29 10:03:07 +0200
  • ff12b69e09 Merge pull request #82 from sente/patch-2 Giuseppe Attardi 2016-10-29 09:08:58 +0200
  • b78a6934ee Fix binary read from stdin Egor Melnikov 2016-10-27 19:10:23 +0300
  • 56c37d69c2 fix issue #1: Empty-element tag in wikipedia dump leads to wrong document id attribution Elias Zervudakis 2016-10-13 16:12:49 +0300
  • 636c9ea9f4 See ChangeLog. attardi 2016-08-31 09:00:51 +0200
  • e6a051c949 See ChangeLog. attardi 2016-08-30 18:17:14 +0200
  • fe0ac18fb9 Fix typo in README.md Stuart Powers 2016-08-30 00:41:21 -0400
  • d167742d16 See ChangeLog. attardi 2016-08-29 23:34:47 +0200
  • 240c8a9458 Merge 14775dfd0c into 5cb7da320e okb1100 2016-08-28 15:42:55 +0000
  • 14775dfd0c h1 tag for title okb1100 2016-08-28 18:41:12 +0300
  • 5cb7da320e See ChangeLog. attardi 2016-08-19 14:11:37 +0200
  • 2942d1e19d See ChangeLog. attardi 2016-08-11 10:34:21 +0200
  • 0f96765bdc Merge 4fb5564f41 into 419fe97d7a hatzel 2016-07-11 14:28:24 +0000
  • 4fb5564f41 Added the capability to output csv. Hans Ole Hatzel 2016-07-11 16:20:49 +0200
  • 419fe97d7a Merge pull request #67 from sethcleveland/remove_python2_string_encoding Giuseppe Attardi 2016-06-24 09:03:11 +0200
  • 0e234c9138 Name a package in setup.py Joshua Newman 2016-06-20 21:36:48 -0700
  • eacccbc6eb Remove python2 extract utf8 encoding and log extract exceptions Seth Cleveland 2016-06-20 10:39:01 -0500
  • 0f703c0aae Merged PR from Seth Cleveland. attardi 2016-06-19 13:10:36 +0200
  • f9b8e8ac02 Added support for Python 3. attardi 2016-06-19 12:53:31 +0200
  • aee387b566 Merge pull request #66 from orangain/support-python3 Giuseppe Attardi 2016-06-19 12:46:52 +0200
  • 3ccd368aa6 Update README.md about Python 2/3 support orangain 2016-06-18 12:25:51 +0900
  • cb7d42d10f Add tox.ini orangain 2016-06-18 12:25:01 +0900
  • 7b21d10ccd Seek StringIO to position 0 on every output orangain 2016-06-17 23:28:58 +0900
  • b19e341ce2 Use text type as a page text and encode them when writing to a file orangain 2016-06-18 11:40:10 +0900
  • 6851fe4b3f Use // instead of / for integer division orangain 2016-06-18 11:37:17 +0900
  • 9322b7ba54 Make imports Python 2/3 compatible orangain 2016-06-18 11:36:33 +0900
  • 8749df0a81 Add .gitignore for python orangain 2016-06-17 20:30:32 +0900
  • d44b8130b5 Add test cases for Python 2 orangain 2016-06-17 22:21:50 +0900
  • d6af7cf1ee Merge f38acd14b3 into 60e4082440 Seth Cleveland 2016-06-13 18:06:40 +0000
  • f38acd14b3 Enhance output and filtering options Seth Cleveland 2016-06-13 13:04:57 -0500
  • 73fa3c2f1f Merge 30b4289ffc into 60e4082440 José M. Camacho 2016-03-31 14:53:11 +0000
  • 30b4289ffc Added '--no-doc' and '--no-title' options to WikiExtractor.py José M. Camacho 2016-03-31 16:46:52 +0200
  • 60e4082440 See ChangeLog. attardi 2016-03-23 15:14:13 +0100
  • bcc3d124b4 See ChangeLog. attardi 2016-03-19 11:36:24 +0100
  • 9521b90c08 Merge pull request #56 from spyysalo/master Giuseppe Attardi 2016-03-12 12:48:50 +0100
  • 7a5b5e5765 Match internal links in external links Sampo Pyysalo 2016-03-12 11:42:06 +0000
  • 6af9c283eb See ChangeLog. attardi 2016-03-12 08:15:01 +0100
  • ab0d008512 See ChangeLog. attardi 2016-03-06 17:27:39 +0100
  • 3bdcf6a4ad Reduce spool queue to 10%. attardi 2016-02-20 12:13:23 +0100
  • 0726948142 Typo. attardi 2016-02-20 10:49:21 +0100
  • 730cfc07f9 See ChangeLog. attardi 2016-02-20 10:45:58 +0100
  • ca2a34ccce See ChangeLog. attardi 2016-02-15 09:04:46 +0100
  • 6d0577ef10 See ChangeLog. attardi 2016-02-15 01:22:38 +0100
  • 834cad6a35 See ChangeLog. attardi 2016-02-12 23:31:21 +0100
  • b04760ecd8 See ChangeLog. attardi 2016-02-12 18:16:54 +0100
  • 6f22be4702 Merge branch 'master' of https://github.com/attardi/wikiextractor attardi 2016-02-11 13:21:14 +0100
  • 911dacda3a Added emulation of Lua module If_empty. attardi 2016-02-11 13:20:23 +0100
  • 36f9467c33 Merge pull request #48 from rom1504/patch-1 Giuseppe Attardi 2016-02-11 10:28:47 +0100
  • 3eb4c4e3c3 fix typo in Wikipedia Cirrus Extractor section Romain Beaumont 2016-02-11 09:55:55 +0100
  • 8dcf73bd3e Merge branch 'master' of https://github.com/attardi/wikiextractor attardi 2016-02-11 01:04:25 +0100
  • b2c371678c See ChangeLog. attardi 2016-02-11 01:03:31 +0100
  • c8afa84e95 Merge pull request #46 from mrshu/mrshu/add-setup Giuseppe Attardi 2016-02-06 02:42:52 +0100
  • 22103664fc update: Add setup.py mr.Shu 2016-02-05 23:56:36 +0100
  • fc89e2514e See ChangeLog. attardi 2016-02-04 11:23:40 +0100
  • 49464c0210 Merge branch 'master' of https://github.com/attardi/wikiextractor attardi 2016-02-04 11:09:31 +0100
  • 3cebfdd4c0 Updated Copyright. attardi 2016-02-04 11:08:37 +0100
  • 04e723ceef Merge pull request #1 from infolab-csail/develop Alvaro Morales 2016-01-27 15:56:40 -0500
  • 52616638fc Merge 92450a54a6 into 0bb3061e79 Alvaro Morales 2016-01-27 20:48:00 +0000
  • 92450a54a6 Basic attempt to modularize markup cleaning function Alvaro Morales 2016-01-27 15:47:06 -0500
  • 578371d2e1 Add gitignore Alvaro Morales 2016-01-27 15:46:36 -0500
  • 22ba358ef4 Move scripts into scripts/ directory Alvaro Morales 2016-01-27 15:46:22 -0500
  • 0bb3061e79 Update README.md Giuseppe Attardi 2015-12-03 13:00:12 +0100
  • a412c7e3ab Merge pull request #37 from nathj07/escape_extracted_text Giuseppe Attardi 2015-12-02 14:47:11 +0100
  • 03e18ffbc8 See ChangeLog. Giuseppe Attardi 2015-11-20 00:34:23 +0100
  • 285b119370 Remove DEBUG. Giuseppe Attardi 2015-11-20 00:07:50 +0100
  • 113dab796c Fixed. Giuseppe Attardi 2015-11-20 00:06:23 +0100