2017-01-15 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (process_dump): use text_type to decide whether
to use decode('utf-8').

2016-10-29 Giuseppe Attardi <attardi@di.unipi.it>

* setup.py: use scripts instead of console_scripts.

* WikiExtractor.py (pages_from): handle self-closing <text />.

2016-08-31 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (findMatchingBraces): fixed ambiguous cases
like {{{!}} {{!}}}

2016-08-30 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (Extractor.wiki2text): drop templates before
dropping tables.
(Extractor.templateParams): fixed RE for separating var from value.

(main): removed option xml_namespaces, since we extract only
articles, i.e. those with namespace 0.

(Extractor.clean): removed option --escapedoc, use --html instead.

* WikiExtractor.py (Extractor.transform): Keep <nowiki> or else
further expansion will be triggered.

2016-08-29 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (callParserFunction): reworked order of
evaluation of template parameters, according to
https://www.mediawiki.org/wiki/Help:Templates#Order_of_evaluation
Template parameters are fully evaluated before they are passed to the template.
The branching parser functions are an exception, since evaluation
depends on values tested.

* WikiExtractor.py (callParserFunction): reworked handling of
parser functions.
Handle expansion of:
{{numero romano|1999}}
where Template:Numero romano = {{{{{|safesubst:}}}#invoke:Numero romano|main}}
and parameter body has no arguments, by getting them from frame.
Instead, expansion of:
{{Str endswith|1999|a.C.}}
where Template:Str endswith = {{#ifeq:{{#invoke:String|sub|s={{{1|}}}| -{{#invoke:String|len|s={{{2}}}}} |ignore_errors=true}}|{{{2|}}}|sì}}
uses parameters in template.

* WikiExtractor.py (Extractor.extract): Adds an h1 tag for title if --html is
passed. Patch contributed by okb1100 on GitHub.

* WikiExtractor.py (modules): added partial emulation of modules String and Roman.

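For illustration only, a minimal sketch of the kind of conversion a Roman
numeral module performs for {{numero romano|1999}}. The function to_roman is
hypothetical and is not the emulation code in WikiExtractor.py.

```python
def to_roman(n):
    """Convert a positive integer to a Roman numeral (subtractive notation)."""
    # Hypothetical helper, ordered from the largest value to the smallest.
    numerals = [
        (1000, 'M'), (900, 'CM'), (500, 'D'), (400, 'CD'),
        (100, 'C'), (90, 'XC'), (50, 'L'), (40, 'XL'),
        (10, 'X'), (9, 'IX'), (5, 'V'), (4, 'IV'), (1, 'I'),
    ]
    out = []
    for value, symbol in numerals:
        # Emit the symbol as many times as its value fits, keep the remainder.
        count, n = divmod(n, value)
        out.append(symbol * count)
    return ''.join(out)


print(to_roman(1999))  # MCMXCIX
```
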
2016-08-26 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (nowiki): handle <nowiki>...</nowiki>.

2016-08-22 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (Extractor.extract): fixed UTF-8 encoding with
option -a.

2016-08-19 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (Extractor): fixed misspelled keepLists.

2016-08-11 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (Extract): made escape_doc a class variable.

2016-06-20 Seth Cleveland <scleveland@turnitin.com>

* WikiExtractor.py: Fix string encoding error for Python 2 and log
extract exceptions.

2016-06-19 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py: support for Python 3 by orangain@gmail.com

2016-03-23 Seth Cleveland <scleveland@turnitin.com>

* WikiExtractor.py: add filtering options -- min text length, xml
namespace, and disambig pages. Added option to print the revision
id with each document.

2016-03-23 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (compact): properly emit section headers when
Extractor.keepSections is set.

2016-03-19 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (main): restored option --sections.

2016-03-12 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (findBalanced): fix to bad single tuple
parameter to findBalanced.

2016-03-06 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (process_dump): handle templates with stdin.

2016-02-20 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (reduce_process): tell mapper to wait when
spool gets too long.

2016-02-19 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (extract_process): use out.truncate() instead
of out.close()

2016-02-15 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (Extractor.clean): turned into method.
(process_dump): control spool length to avoid filling up memory.

2016-02-14 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (load_templates): save also <id> in templates.
(pages_from): common iterator for extracting pages from dump, used
both for analyzing pages, templates and single articles.

2016-02-13 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (reduce_process): close file here.

* extractPage.py (process_data): allow range of ids.

2016-02-12 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (reduce_process): moved here creation of OutputSplitter.
(compact): Extractor.keepLists allows preserving lists in output.
(main): added new option --lists for preserving lists in output.
(compact): reset listLevel when entering new section.
(ignoredTags): removed 'div' from here, since it is in discardedTags.
(ignoredTags): moved 'sub' and 'sup' to discardedTags.

2016-02-11 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (if_empty): added implementation of Lua module
for "Template:If empty".
(compact): preserve lists also in text mode.
(process_dump): log ID and title of pages.

2016-02-10 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (process_dump): detect templates by element
<ns>10</ns> rather than by colon in title.
(Extractor.templateParams): restore space in RE [^= ]:
m = re.match(' *([^= ]*?) *=(.*)', param, re.DOTALL)

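As an illustration of the restored pattern, a minimal sketch of how it
separates a parameter name from its value. The helper name split_param and
the sample inputs are hypothetical and not part of WikiExtractor.py.

```python
import re


def split_param(param):
    """Split 'name=value' into (name, value).

    Returns (None, param) when the pattern does not match, which is how a
    positional (unnamed) parameter is treated in this sketch.
    """
    # Same regular expression as quoted in the entry above.
    m = re.match(' *([^= ]*?) *=(.*)', param, re.DOTALL)
    if m:
        return m.group(1), m.group(2)
    return None, param


print(split_param(' date =April 2015'))
print(split_param('celebrative'))
```

re.DOTALL lets the value part span several lines, since template arguments
may contain newlines.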
2016-02-04 Giuseppe Attardi <attardi@di.unipi.it>

* cirrus-extract.py: added.
* README.md: added mention of Cirrus extract.

2015-11-20 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (makeExternalLink): fixed.
(main): dropped confusing option --section.

2015-10-29 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py: /usr/bin/env

2015-09-29 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py: fixed argparse default.
(load_templates): save also modules, for a future release capable
of applying them.

2015-09-14 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (Extractor.templateParams): allow space in key
(process_dump): terminate by joining processes rather than queues.

* WikiExtractor.py (process_dump): eliminated ordering_queue in
favor of termination signals to jobs_queue.

2015-09-13 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (process_dump): queue.put() dropped second arg
True, since it is the default.
(process_dump): queue renamed to jobs_queue.
(main): restored default bytes to 1M.
Dropped confusing option to write to a single file: use -o - for that.

2015-08-30 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (main): check presence of title element in
single article.
(load_templates): reconstruct the template namespace from the
title of the first template in the saved templates.
(define_template): match also #redirect as used in French.

2015-06-02 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (Extractor.expandTemplate): extend frame before
subst(), since there may be recursion in default parameter value,
e.g. {{OTRS|celebrative|date=April 2015}} in article 21637542 in
enwiki.

2015-05-06 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (main): fixed arg.namespaces.
(compact): use fillvalue=' ' in izip_longest.

2015-04-26 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (clean): use re.U when matching \W, or else Chinese
characters will be lost.

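A small sketch of the problem behind this fix. Under Python 2, \W without
re.U treats every non-ASCII character as a non-word character; under Python 3
the same behavior can be reproduced with re.ASCII, which is what the example
below does. The function name and the sample text are illustrative only.

```python
import re


def strip_nonword(text, ascii_only=False):
    """Replace each run of non-word characters with a single space."""
    # re.ASCII mimics Python 2 matching without re.U; re.UNICODE is the
    # Python 3 default for str patterns and is spelled out for clarity.
    flags = re.ASCII if ascii_only else re.UNICODE
    return re.sub(r'\W+', ' ', text, flags=flags)


sample = '中文 text'
print(strip_nonword(sample))                   # Chinese characters are kept
print(strip_nonword(sample, ascii_only=True))  # Chinese characters are dropped
```
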
2015-04-25 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (sharp_expr): use unicode() instead of str()
or else Chinese article 596814 fails.
(splitParameters): use findMatchingBraces instead of findBalanced,
to properly handle {{{4|{{{{{subst|}}}CURRENTYEAR}}}}
(Extractor.extract): handle currentyear, currentmonth and currentday

2015-04-23 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py: make UTF-8 the default encoding

2015-04-22 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (replaceInternalLinks): function for replacing
internal links, modeled after MediaWiki original.
(replaceExternalLinks): revised taking into account the former.
(replaceInternalLinks): fix to nested iterator.

2015-04-20 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (findBalanced): dropped optional arguments.
(make_anchor_tag): don't use splitParameters() since we must
consider also single [, which do not count in parameters.
(sharp_switch): use split() to split at first = sign.
(main): grouped command options.
(main): removed option -B.
(Template): class used to represent templates. It provides a method
for parameter substitution.
Templates are parsed on demand when used and stored in a cache.

2015-04-19 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (clean): use dropNext to drop discardElements.
(discardElements): added div.
(compact): avoid duplicated header line with option --sections.
(compact): skip empty list items.
(main): changed logger format.
(ignoredTags): added abbr.
(clean): handle extension SyntaxHighlight.
(main): use % parameters in logging.
(main): added option --html, that subsumes --links and --sections
for producing HTML output instead of plain text.
(Extractor): moved here variables keepLinks, keepSections, toHTML.
(compact): modified to produce HTML lists.

2015-04-18 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (compact): strip lists of initial characters.

2015-04-17 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (sharp_switch): fixed handling of default.
(uc, lc, ucfirst, lcfirst): parser functions.

2015-04-16 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (MagicWords): dealing with MagicWords.
This required rewriting code as methods of class Extractor, since
some MagicWords are document related and are being handled in
different threads, hence they cannot be globals.
(findMatchingBraces): fix to ambiguities resolution.
(clean): fixed dealing with trail for make_anchor_tag()
(substParameter): expand defaultValue only if used.
(selfClosing_tag_patterns): avoid matching besides tag end.

2015-04-15 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (expandTemplates): increase depth only when
calling expandTemplate()
(define_template): removed \n in onlyincludeAccumulator, drop
<noinclude> always.
(sharp_invoke): restored support for #invoke, by adding parameter
frame to expandTemplate.
(main): allow specifying G in --bytes.
(make_anchor_tag): urlencode link.
(wikiLink): properly match anchor.
(make_anchor_tag): use splitParameters to separate parts of link.
(splitParameters): include pair [], since arguments may contain
wikilinks.

2015-04-14 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (clean): dropped removal of preformatted lines,
since they are hard to distinguish: templates may
introduce lines with leading blanks.
(discardElements): added 'small'.
(ignoredTags): removed 'small'.
(make_anchor_tag): fixed RE for wikiLink.
(sharp_expr): added infix operators.
(Infix): support for infix operators.
(Extractor.extract): moved here logging of document being processed.
(clean): rewritten handling of wikilinks since using RE is too slow.
(maxTemplateRecursionLevels): increased to 30.

2015-04-13 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (findMatchingBraces): rewritten to handle
ambiguities.
(substParameter): only evaluate name and default.
(main): fixed option --article.

2015-04-12 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (ExtractorThread): enabled multithread version.
(findMatchingBraces): handle isolated braces.
(expandTemplates): recurse on result from expandTemplate().

2015-04-11 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (sharp_switch): deal properly with #default.
(OutputSplitter): update to new-style classes

* WikiExtractor.py (selfClosingTags): added nowiki.

* WikiExtractor.py (bold_italic, bold): allow single quote inside,
e.g. '''[[Chinese New Year|New Year's Eve]]'''.

* WikiExtractor.py (templateParams): fix pattern to match
parameter name.

* WikiExtractor.py (substParameter): use splitParameters()

* WikiExtractor.py (main): added --no-templates option.

* WikiExtractor.py (substParameter): added parameter param_depth
to control depth of parameter expansion, separately from depth,
used for template expansion.

2015-04-10 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (callParserFunction): return '' also in case of
failure.

2015-04-09 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (expandTemplates): replaced frame parameter with
depth, used to limit max template recursion.

2015-04-07 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (main): added --debug option.

2015-01-24 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (splitParameters): rewritten template
processing by performing proper parsing of all balanced
expressions in templates invocation and expansion, using iterator
findBalanced().

2015-01-18 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (expandTemplates): template expansion now working.

2015-01-11 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (externalLink): replaced .* with appropriate
[^x]* here and elsewhere.

2015-01-10 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (main): added option --article for processing a
single article.
(main): get dump from file rather than from stdin, so that
preprocessing does not need to save data to temp file.

2014-02-25 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (ignoreTag): make / optional.

2013-12-15 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (clean): added template expansion

2013-10-14 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py: added wiktionary and wikt to the namespaces
(used e.g. in http://en.wikipedia.org/wiki?curid=12)

2013-05-09 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (main): properly handle keepLinks option.

2013-04-05 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (compact): keep lines ending with ':'.

2013-04-02 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py: obtain prefix from dump.

2013-01-27 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (WikiDocument): add newline after <doc>.
Release version 2.3.

2012-12-30 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (process_data): added patch by Humberto
Pereira, who claims a 10x improvement in speed.
(main): added option to set acceptedNamespaces

2012-11-01 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (get_url): create URL from Id rather than from title.

2012-06-28 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (OutputSplitter.reserve): added method to
invoke before writing.
(WikiDocument): use reserve() before writing whole page.
(main): added version number and option -v.

2012-05-17 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py (main): added option to preserve sections as
HTML headers and lists as <LI>.

2012-05-08 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py: Released version 2.0.

* test/test.xml: added sample to test hard cases for extractor.

* WikiExtractor.py (dropNested): Completely rewritten to be more
compliant with the MediaWiki markup language.
Use proper parsing functions to handle nested structures.
Improved performance by reducing creation of lists and strings.
Use htmlentitydefs instead of hand crafted list.
Added parameter -b to set URL for site.
Extensive use of RegExpr instead of specific string tests.
Deal with preformatted text.
Added parameter acceptedNamespaces to select namespaces to retain
in page titles or wiki links.
TODO:
1. handle Template expansion. See WikiPrep
(http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep/)
2. Use full parser in order to better deal with nested and
unbalanced expressions.

2012-02-15 Stefano Dei Rossi <deirossi@semawiki.di.unipi.it>

* WikiExtractor.py (WikiExtractor): replaced with a simple space
instead of u'\u00A0'.

2012-02-10 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py: added Copyright.

2011-11-03 Antonio Fuschetto <fuschett@di.unipi.it>

* WikiExtractor.py: updated version to 1.6 (Oct 17).

2011-10-17 Giuseppe Attardi <attardi@di.unipi.it>

* WikiExtractor.py: turned prefix into a parameter.

2011-07-29 Antonio Fuschetto <fuschett@di.unipi.it>

* WikiExtractor.py (init): fixed bugs in apostrophe_bold_pattern and
apostrophe_italic_pattern.

2011-07-28 Antonio Fuschetto <fuschett@di.unipi.it>

* WikiExtractor.py (__garbage_namespaces): added "file" namespace to
remove list.

2011-07-10 Antonio Fuschetto <fuschett@di.unipi.it>

* WikiExtractor.py (get_wiki_document_url): changed the handling of
URL prefix (anchors don't use the prefix but relative URLs).

2011-06-26 Antonio Fuschetto <fuschett@di.unipi.it>

* WikiExtractor.py (extract_document): changed the handling of
wikilinks, adding an anchor tag for each link with a reference to the
Wikipedia document.

* WikiExtractor.py (WikiExtractor): changed the handling of
placeholders: from "[Formula 12]" to "formula_12".

2011-04-06 Antonio Fuschetto <fuschett@di.unipi.it>

* WikiExtractor.py (init): fixed bugs in apostrophe_bold_pattern and
apostrophe_italic_pattern.

* WikiExtractor.py (compact): drop lines ending with ':'
(these are sentences preceding list items); fixed some bugs.

* WikiExtractor.py: released version 1.1.

2011-03-12 Antonio Fuschetto <fuschett@di.unipi.it>

* wiki-extractor.py (main): removed the sentence splitting option.

* wiki-extractor.py: fixed some bugs; released version 1.0; changed
filename to "Wiki-Extractor.py" according to Tanl module names.

2011-03-01 Antonio Fuschetto <fuschett@di.unipi.it>

* wiki-extractor.py (main): added cross platform path management.