126 Commits

Author SHA1 Message Date
Albert Villanova del Moral
ff9a70cd6d Force 'utf-8' encoding without relying on platform-dependent default
On Windows, the default encoding is 'cp1252' and this raises a UnicodeDecodeError.

Fix #89 #144 #165
2019-07-13 18:21:43 +02:00
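The idea behind this fix can be sketched as follows. This is a minimal illustration, not the code from the commit: the helper names `write_sample` and `read_text_utf8` are hypothetical. It shows a file being opened with an explicit `encoding="utf-8"` so that the result does not depend on the platform default (such as 'cp1252' on Windows).

```python
import io
import os
import tempfile


def write_sample(text):
    """Write `text` to a temporary file as UTF-8 and return its path (hypothetical helper)."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    # io.open accepts a file descriptor and closes it when the block exits.
    with io.open(fd, "w", encoding="utf-8") as handle:
        handle.write(text)
    return path


def read_text_utf8(path):
    """Read a file as UTF-8 regardless of the platform's default encoding (hypothetical helper)."""
    # Without encoding=..., open() falls back to the locale's preferred encoding,
    # which can raise UnicodeDecodeError on non-ASCII input.
    with io.open(path, "r", encoding="utf-8") as handle:
        return handle.read()
```

Using `io.open` rather than the built-in `open` is a choice made here so the sketch behaves the same on Python 2 and Python 3; on Python 3 alone the built-in `open` takes the same `encoding` argument.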
Giuseppe Attardi
3162bb6c3c
Merge pull request #137 from AriesLL/master
change argument parser for no_templates
2019-04-13 12:41:15 +02:00
attardi
45c2212f64 Merge branch 'master' of https://github.com/zwChan/wikiextractor into zwChan-master 2019-04-13 12:19:36 +02:00
Giuseppe Attardi
5bf4df62fa
Merge pull request #143 from danduma/master
Bug fix for list items
2019-04-13 11:43:09 +02:00
Karl Stratos
f9d57324c2 minimized complexity 2018-03-22 16:10:12 -05:00
Karl Stratos
ecc7cef402 do not include title in text 2018-03-22 12:51:47 -05:00
Karl Stratos
4ba4e9f683 Augmented disambig regex to catch disambiguation pages marked by the switch __DISAMBIG__. Augmented key regex to catch plus/minus signs. 2018-03-17 09:10:40 -07:00
Daniel
45e56d4e9e
Update WikiExtractor.py
Fix bug where it fails with an exception when n="1" or n="A"
2017-11-08 14:28:36 +00:00
Peipei Zhou
209e2b422f change argument parser for no_templates 2017-08-10 14:51:54 -07:00
Zhiwei Chen
169eaaf208 remove noisy print 2017-04-29 12:53:19 -04:00
Zhiwei Chen
e249508255 log categories statistics info 2017-04-29 12:50:47 -04:00
Zhiwei Chen
5274829e16 print friendly error msg 2017-04-28 14:57:54 -04:00
Zhiwei Chen
cc04dae71c log save to file; log page statistic info; 2017-04-28 12:36:46 -04:00
root
b8323a8efc encoding fix 2017-04-28 02:04:29 -04:00
Zhiwei Chen
1f76fd9473 encoding fix 2017-04-28 01:53:46 -04:00
Zhiwei Chen
7903b739f5 fix category not utf8 error 2017-04-28 01:17:45 -04:00
Zhiwei Chen
9cf2a2a883 add feature filtering by category of wiki 2017-04-27 19:57:41 -04:00
BrenBarn
674e9a0264 Fix problems that occurred when a list was the first thing in a section.
There were bugs that caused content to be dropped or erroneously included if a list was the first thing in a section and --lists was specified.  This change should fix #117 and #118.
2017-03-08 01:01:31 -08:00
attardi
5414b7fda8 Completed module String 2017-03-04 04:22:30 +01:00
attardi
c9432abcd0 Define #ifexists 2017-03-03 19:44:48 +01:00
attardi
3ea2da809b Fix for empty templates. 2017-03-03 18:52:17 +01:00
attardi
8fd8da77f4 Updated version number. 2017-03-02 05:58:05 +01:00
BrenBarn
e7bb889e0e Removed some old comments 2017-02-26 12:41:58 -08:00
BrenBarn
19d358eee8 Factor all info that needs to be passed to subprocesses into "options" variable
In order for things to work properly on Windows (and to make the communication between processes more clear in general), the parent process should communicate with the subprocess function only via its arguments, not via shared global variables.  This change takes all the global variables that used to be implicitly shared with the subprocess, and puts them into a single "options" object (a dict-like SimpleNamespace).  This object is then passed to the subprocess functions.

The "options" object includes not only the values of command-line arguments (e.g., "--no-templates") but also stuff like the "URL base" and template definitions that are precomputed before the main extraction begins.
2017-02-26 11:58:44 -08:00
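The pattern described in this commit message can be sketched roughly as below. The field names `url_base` and `expand_templates` and the functions `make_options` and `page_header` are assumptions made for illustration, not the names used in WikiExtractor.py. The sketch collects the settings in a `SimpleNamespace`, checks that the object survives a pickle round trip (which is how `multiprocessing` hands arguments to a child process when it cannot rely on forked globals, as on Windows), and has the worker-side function read only from its arguments.

```python
import pickle
from types import SimpleNamespace


def make_options():
    """Bundle everything a worker needs into one object (field names are hypothetical)."""
    return SimpleNamespace(
        url_base="https://en.wikipedia.org/wiki",  # precomputed before extraction starts
        expand_templates=False,                    # e.g. the effect of --no-templates
    )


def page_header(options, page_id, title):
    """Worker-side function: reads settings only from `options`, never from globals."""
    url = "%s?curid=%s" % (options.url_base, page_id)
    return '<doc id="%s" url="%s" title="%s">' % (page_id, url, title)


def roundtrip(options):
    """Simulate sending `options` to another process by pickling and unpickling it."""
    return pickle.loads(pickle.dumps(options))
```

In a real run, `options` would be passed in the `args` tuple of `multiprocessing.Process`, and the child would call functions like `page_header` with the copy it receives.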
attardi
f6f80e2350 ignoredTags 2017-02-26 01:00:05 +01:00
attardi
5f1fb5c995 Declared global ignoredTags 2017-02-26 00:58:56 +01:00
attardi
25edeebafb Moved ignoredTags to top. 2017-02-26 00:53:48 +01:00
attardi
82196d1156 Define discardedElements 2017-02-26 00:39:26 +01:00
Matteo Ceccarello
7ae45fcff7 Add json output
This commit adds a new output format to the program, namely json. When
invoked with the --json flag, the program will write several files with
several pages per file (as before) with one json object per line
representing a single page.

The information contained in this json object is the same as in the
default format, but is somewhat more straightforward to parse for other
tools.

The running time is the same as the default format, as well as the
compressed output size. The following is a simple benchmark on an I5
machine using as input
enwiki-20170120-pages-meta-current1.xml-p000000010p000030303.bz2:

$ ./WikiExtractor.py -o xml --compress --no-templates input.bz2
INFO: Finished 3-process extraction of 15084 articles in 335.7s (44.9 art/s)
$ ./WikiExtractor.py -o json --json --compress --no-templates input.bz2
INFO: Finished 3-process extraction of 15084 articles in 336.8s (44.8 art/s)
$ du -sh json xml
69M     json
69M     xml
2017-02-10 20:13:27 +01:00
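The "one json object per line" layout described in this commit message might be produced and consumed along the following lines. This is a sketch under assumptions: the keys `id`, `title` and `text` and the function names are hypothetical, and the actual output of the --json flag may carry different fields.

```python
import json


def pages_to_json_lines(pages):
    """Serialize each page dict as one JSON object per line."""
    # json.dumps escapes newlines inside strings, so every object stays on one line.
    return "".join(json.dumps(page, sort_keys=True) + "\n" for page in pages)


def read_json_lines(text):
    """Parse text produced by pages_to_json_lines back into a list of dicts."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]
```

A downstream tool can then process an output file line by line without parsing the `<doc>` markup of the default format.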
attardi
ea9c368e52 Fix to use of this instead of self. 2017-02-06 08:36:00 +01:00
Nathan Davies
15e589e5cf putting the double line spacing back 2017-02-02 05:38:10 -08:00
Nathan Davies
b9c29e36bb Merge branch 'master' into ISSUE-101_discard_and_ignore_flags 2017-02-01 05:29:17 -08:00
Nathan Davies
12fb5e587d improved comment around regex 2017-02-01 05:26:44 -08:00
Nathan Davies
663a3dea73 tidying up some of the code and adding comments. 2017-01-31 16:52:59 -08:00
Xiao Ling
e646440185 closes lists at empty lines and adds list item count 2017-01-27 16:20:47 -08:00
Nathan Davies
3701b779b3 added extra replace section to tidy up output when retaining tables 2017-01-23 14:37:24 +00:00
Nathan Davies
e835e8c004 Added new flags
--discard_elements - allowing you to customise which elements are discarded
--ignored_tags - allowing you to customise which tags are ignored
--keep_tables - allows the contents of tables in the original article to be retained. This does not render HTML tables
2017-01-23 14:18:21 +00:00
orangain
ddec683957 Ensure that process_count >= 1 2017-01-15 21:20:17 +09:00
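A guard of this kind can be as small as the sketch below. The function names are hypothetical, and the "CPU count minus one" default is an assumption about how a zero could arise (on a single-core machine), not something stated in the commit.

```python
from multiprocessing import cpu_count


def clamp_process_count(requested):
    """Return `requested` as an int, but never less than 1."""
    return max(1, int(requested))


def default_process_count():
    """Leave one core free for the parent process, but always use at least one worker."""
    return clamp_process_count(cpu_count() - 1)
```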
attardi
7449ac95ba text_type 2017-01-15 10:09:40 +01:00
attardi
6660973646 See ChangeLog. 2017-01-15 09:08:35 +01:00
Giuseppe Attardi
e00eacb372 Update version number. 2017-01-04 13:07:10 -08:00
Giuseppe Attardi
ba74b992da Python 3 decode
Skip decode('utf-8') in Python 3.
2017-01-04 13:05:15 -08:00
attardi
ce600138f4 See ChangeLog 2016-10-29 10:03:07 +02:00
attardi
636c9ea9f4 See ChangeLog. 2016-08-31 09:00:51 +02:00
attardi
e6a051c949 See ChangeLog. 2016-08-30 18:17:14 +02:00
attardi
d167742d16 See ChangeLog. 2016-08-29 23:34:47 +02:00
attardi
5cb7da320e See ChangeLog. 2016-08-19 14:11:37 +02:00
attardi
2942d1e19d See ChangeLog. 2016-08-11 10:34:21 +02:00
Seth Cleveland
eacccbc6eb Remove python2 extract utf8 encoding and log extract exceptions 2016-06-20 10:39:01 -05:00
attardi
0f703c0aae Merged PR from Seth Cleveland. 2016-06-19 13:10:36 +02:00