wikiextractor

Author	SHA1	Message	Date
Giuseppe Attardi	db51f0b45c	Merge pull request #105 from Cecca/master Add json output	2017-02-11 18:12:25 +01:00
Matteo Ceccarello	7ae45fcff7	Add json output This commit adds a new output format to the program, namely json. When invoked with the --json flag, the program will write several files with several pages per file (as before) with one json object per line representing a single page. The information contained in this json object is the same as in the default format, but is somewhat more straightforward to parse for other tools. The running time is the same as the default format, as well as the compressed output size. The following is a simple benchmark on an I5 machine using as input enwiki-20170120-pages-meta-current1.xml-p000000010p000030303.bz2: $ ./WikiExtractor.py -o xml --compress --no-templates input.bz2 INFO: Finished 3-process extraction of 15084 articles in 335.7s (44.9 art/s) $ ./WikiExtractor.py -o json --json --compress --no-templates input.bz2 INFO: Finished 3-process extraction of 15084 articles in 336.8s (44.8 art/s) $ du -sh json xml 69M json 69M xml	2017-02-10 20:13:27 +01:00
attardi	ea9c368e52	Fix to use of this instead of self.	2017-02-06 08:36:00 +01:00
Giuseppe Attardi	c88ae9736d	Merge pull request #102 from nathj07/ISSUE-101_discard_and_ignore_flags Issue 101 discard and ignore flags and keep_tables	2017-02-02 15:31:03 +01:00
Nathan Davies	15e589e5cf	putting the double line spacing back	2017-02-02 05:38:10 -08:00
Nathan Davies	b9c29e36bb	Merge branch 'master' into ISSUE-101_discard_and_ignore_flags	2017-02-01 05:29:17 -08:00
Nathan Davies	12fb5e587d	improved comment around regex	2017-02-01 05:26:44 -08:00
Nathan Davies	663a3dea73	tidying up some of the code and adding comments.	2017-01-31 16:52:59 -08:00
Giuseppe Attardi	ba616e2f85	Merge pull request #103 from xiaoling/lists closes lists at empty lines and adds list item count	2017-01-28 10:26:12 +01:00
Xiao Ling	e646440185	closes lists at empty lines and adds list item count	2017-01-27 16:20:47 -08:00
Nathan Davies	3701b779b3	added extra replace section to tidy up output when retaining tables	2017-01-23 14:37:24 +00:00
Nathan Davies	e835e8c004	Added new flags: --discard_elements, allowing you to customise which elements are discarded; --ignored_tags, allowing you to customise which tags are ignored; --keep_tables, allowing the contents of the tables in the original article to be retained. This does not render HTML tables	2017-01-23 14:18:21 +00:00
Giuseppe Attardi	b5d97f64f9	Merge pull request #100 from orangain/fix-oom-in-single-core Ensure that process_count >= 1	2017-01-15 14:38:19 +01:00
orangain	ddec683957	Ensure that process_count >= 1	2017-01-15 21:20:17 +09:00
attardi	7449ac95ba	text_type	2017-01-15 10:09:40 +01:00
attardi	6660973646	See ChangeLog.	2017-01-15 09:08:35 +01:00
Giuseppe Attardi	e00eacb372	Update version number.	2017-01-04 13:07:10 -08:00
Giuseppe Attardi	ba74b992da	Python 3 decode Skip decode('utf-8') in Python 3.	2017-01-04 13:05:15 -08:00
Giuseppe Attardi	87dbb62961	Merge pull request #94 from seong889/patch-1 Update README.md	2016-12-10 11:38:45 +01:00
Seongjun Hong	31f848d620	Update README.md removed option : escapedoc	2016-12-08 15:48:05 +09:00
attardi	499997910c	See ChangeLog.	2016-10-29 10:51:08 +02:00
attardi	f1989bcc11	See ChangeLog.	2016-10-29 10:19:28 +02:00
attardi	ce600138f4	See ChangeLog	2016-10-29 10:03:07 +02:00
Giuseppe Attardi	ff12b69e09	Merge pull request #82 from sente/patch-2 Fix typo in README.md	2016-10-29 09:08:58 +02:00
attardi	636c9ea9f4	See ChangeLog.	2016-08-31 09:00:51 +02:00
attardi	e6a051c949	See ChangeLog.	2016-08-30 18:17:14 +02:00
Stuart Powers	fe0ac18fb9	Fix typo in README.md "preprocesssng" -> "preprocessing"	2016-08-30 00:41:21 -04:00
attardi	d167742d16	See ChangeLog.	2016-08-29 23:34:47 +02:00
attardi	5cb7da320e	See ChangeLog.	2016-08-19 14:11:37 +02:00
attardi	2942d1e19d	See ChangeLog.	2016-08-11 10:34:21 +02:00
Giuseppe Attardi	419fe97d7a	Merge pull request #67 from sethcleveland/remove_python2_string_encoding Remove python2 extract utf8 encoding and log extract exceptions	2016-06-24 09:03:11 +02:00
Seth Cleveland	eacccbc6eb	Remove python2 extract utf8 encoding and log extract exceptions	2016-06-20 10:39:01 -05:00
attardi	0f703c0aae	Merged PR from Seth Cleveland.	2016-06-19 13:10:36 +02:00
attardi	f9b8e8ac02	Added support for Python 3.	2016-06-19 12:53:31 +02:00
Giuseppe Attardi	aee387b566	Merge pull request #66 from orangain/support-python3 Support Python 3	2016-06-19 12:46:52 +02:00
orangain	3ccd368aa6	Update README.md about Python 2/3 support	2016-06-18 13:44:45 +09:00
orangain	cb7d42d10f	Add tox.ini	2016-06-18 13:44:45 +09:00
orangain	7b21d10ccd	Seek StringIO to position 0 on every output truncate(0) does not guarantee that the position is seeked to 0.	2016-06-18 13:44:45 +09:00
orangain	b19e341ce2	Use text type as a page text and encode them when writing to a file	2016-06-18 13:44:45 +09:00
orangain	6851fe4b3f	Use // instead of / for integer division Add `division` to future import.	2016-06-18 13:44:45 +09:00
orangain	9322b7ba54	Make imports Python 2/3 compatible * Use `from __future__ import unicode_literals` and replace `u''` literals with `''`. * Use `io.StringIO` instead of `cStringIO.StringIO` for Py2/3 compatibility. * Define a const `PY2` which is True in Python 2 but False in Python 3. * Import `quote` and `name2codepoint` from different modules between Python 2 and 3. * Use Python 3's name in Python 2 for `zip`, `zip_longest`, `range` and `chr`. * Use `text_type` as a type for `unicode` in Python 2 and `str` in Python 3. * Use `sorted()` to sort dict's `items()`. * Implement `__next__` in NextFile and call next() built-in function.	2016-06-18 13:44:45 +09:00
orangain	8749df0a81	Add .gitignore for python	2016-06-18 13:44:45 +09:00
orangain	d44b8130b5	Add test cases for Python 2	2016-06-18 13:44:45 +09:00
attardi	60e4082440	See ChangeLog.	2016-03-23 15:14:13 +01:00
attardi	bcc3d124b4	See ChangeLog.	2016-03-19 11:36:24 +01:00
Giuseppe Attardi	9521b90c08	Merge pull request #56 from spyysalo/master Match internal links in external links (fixes #55)	2016-03-12 12:48:50 +01:00
Sampo Pyysalo	7a5b5e5765	Match internal links in external links See attardi/wikiextractor/issues/#55	2016-03-12 11:42:06 +00:00
attardi	6af9c283eb	See ChangeLog.	2016-03-12 08:15:01 +01:00
attardi	ab0d008512	See ChangeLog.	2016-03-06 17:27:39 +01:00
attardi	3bdcf6a4ad	Reduce spool queue to 10%.	2016-02-20 12:13:23 +01:00
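
Commit 7ae45fcff7 describes the --json output as files holding one JSON object per line, each object representing a single page. A minimal sketch of reading such output in Python follows. The field names `id`, `url`, `title` and `text` and the sample records are assumptions made for illustration (the commit message only says the object carries the same information as the default format), and `iter_pages` is a hypothetical helper, not part of WikiExtractor.

```python
import io
import json


def iter_pages(lines):
    """Yield one dict per non-empty line of line-delimited JSON input."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical sample standing in for one extracted output file;
# the keys below are assumed, not taken from the commit message.
sample = io.StringIO(
    '{"id": "12", "url": "https://example.org/wiki?curid=12", '
    '"title": "Anarchism", "text": "First page text."}\n'
    '{"id": "25", "url": "https://example.org/wiki?curid=25", '
    '"title": "Autism", "text": "Second page text."}\n'
)

pages = list(iter_pages(sample))
for page in pages:
    print(page["id"], page["title"])
```

For output written with --compress, a text-mode handle from `bz2.open(path, "rt", encoding="utf-8")` could probably be passed in place of the `StringIO` object, assuming the files are bzip2-compressed.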

192 Commits
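
The compatibility steps listed in commit 9322b7ba54 can be sketched roughly as below. This is an illustrative reconstruction from the commit message, not the code in WikiExtractor.py. The names `PY2` and `text_type` come from the message, while the exact import layout is an assumption. The `from __future__ import division, unicode_literals` line the commits mention is left out here so the snippet can be pasted anywhere in a file.

```python
import sys

# True on Python 2, False on Python 3, as described in the commit message.
PY2 = sys.version_info[0] == 2

if PY2:
    # Python 2 module locations and names (this branch never runs on Python 3).
    from urllib import quote
    from htmlentitydefs import name2codepoint
    from itertools import izip as zip, izip_longest as zip_longest
    range = xrange
    chr = unichr
    text_type = unicode
else:
    # Python 3 module locations; zip, range and chr already behave as needed.
    from urllib.parse import quote
    from html.entities import name2codepoint
    from itertools import zip_longest
    text_type = str

# The same calls now work under either interpreter.
print(quote("a b"))            # percent-encodes the space
print(name2codepoint["amp"])   # code point of "&"
print(7 // 2)                  # explicit integer division
```

After this block the rest of a module can refer to `quote`, `name2codepoint`, `zip_longest` and `text_type` without checking the interpreter version again.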