Fixed infinite loop in template expansion.

Giuseppe Attardi 2015-04-09 15:24:34 +02:00
parent bc8066fcca
commit 9f49bada4d
2 changed files with 366 additions and 67 deletions

ChangeLog (new file, 210 lines added)

@ -0,0 +1,210 @@
2015-04-09 Giuseppe Attardi <attardi@di.unipi.it>
* WikiExtractor.py (expandTemplates): replaced frame parameter with
depth, used to limit max template recursion.
2015-04-07 Giuseppe Attardi <attardi@di.unipi.it>
* WikiExtractor.py (main): added --debug option.
2015-01-24 Giuseppe Attardi <attardi@di.unipi.it>
* WikiExtractor.py (splitParameters): rewrote template
processing to properly parse all balanced expressions in template
invocation and expansion, using the iterator findBalanced().
2015-01-18 Giuseppe Attardi <attardi@di.unipi.it>
* WikiExtractor.py (expandTemplates): template expansion now working.
2015-01-11 Giuseppe Attardi <attardi@di.unipi.it>
* WikiExtractor.py (externalLink): replaced .* with appropriate
[^x]* here and elsewhere.
2015-01-10 Giuseppe Attardi <attardi@di.unipi.it>
* WikiExtractor.py (main): added option --article for processing a
single article.
(main): get dump from file rather than from stdin, so that
preprocessing does not need to save data to a temp file.
2014-02-25 Giuseppe Attardi <attardi@di.unipi.it>
* WikiExtractor.py (ignoreTag): make / optional.
2013-12-15 Giuseppe Attardi <attardi@di.unipi.it>
* WikiExtractor.py (clean): added template expansion
2013-10-14 Giuseppe Attardi <attardi@di.unipi.it>
* WikiExtractor.py: added wiktionary and wikt to the namespaces
(used e.g. in http://en.wikipedia.org/wiki?curid=12)
2013-05-09 Giuseppe Attardi <attardi@di.unipi.it>
* WikiExtractor.py (main): handle properly keepLinks option.
2013-04-05 Giuseppe Attardi <attardi@di.unipi.it>
* WikiExtractor.py (compact): keep lines ending with ':'.
2013-04-02 Giuseppe Attardi <attardi@di.unipi.it>
* WikiExtractor.py: obtain prefix from dump.
2013-01-27 Giuseppe Attardi <attardi@di.unipi.it>
* WikiExtractor.py (WikiDocument): add newline after <doc>.
Release version 2.3.
2012-12-30 Giuseppe Attardi <attardi@di.unipi.it>
* WikiExtractor.py (process_data): added patch by Humberto
Pereira, who claims a 10x improvement in speed.
(main): added option to set acceptedNamespaces
2012-11-01 Giuseppe Attardi <attardi@di.unipi.it>
* WikiExtractor.py (get_url): create URL from Id rather than from title.
2012-06-28 Giuseppe Attardi <attardi@di.unipi.it>
* WikiExtractor.py (OutputSplitter.reserve): added method to
invoke before writing.
(WikiDocument): use reserve() before writing whole page.
(main): added version number and option -v.
2012-05-17 Giuseppe Attardi <attardi@di.unipi.it>
* WikiExtractor.py (main): added option to preserve sections as
HTML headers and lists as <LI>.
2012-05-08 Giuseppe Attardi <attardi@di.unipi.it>
* WikiExtractor.py: Released version 2.0.
* test/test.xml: added sample to test hard cases for extractor.
* WikiExtractor.py (dropNested): Completely rewritten to be more
compliant with the MediaWiki markup language.
Use proper parsing functions to handle nested structures.
Improved performance by reducing creation of lists and strings.
Use htmlentitydefs instead of hand crafted list.
Added parameter -b to set URL for site.
Extensive use of RegExpr instead of specific string tests.
Deal with preformatted text.
Added parameter acceptedNamespaces to select namespaces to retain
in page titles or wiki links.
TODO:
1. handle Template expansion. See WikiPrep
(http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep/)
2. Use full parser in order to better deal with nested and
unbalanced expressions.
2011-02-10 Giuseppe Attardi <attardi@di.unipi.it>
* WikiExtractor.py: added Copyright.
2012-02-15 Stefano Dei Rossi <deirossi@semawiki.di.unipi.it>
* WikiExtractor.py (WikiExtractor): &nbsp; replaced with a simple space
instead of u'\u00A0'.
2009-11-03 Antonio Fuschetto <fuschett@di.unipi.it>
* WikiExtractor.py: updated version to 1.6 (Oct 17).
2009-10-17 Giuseppe Attardi <attardi@di.unipi.it>
* WikiExtractor.py: turned prefix into a parameter.
2009-07-29 Antonio Fuschetto <fuschett@di.unipi.it>
* WikiExtractor.py (init): fixed bugs in apostrophe_bold_pattern and
apostrophe_italic_pattern.
2009-07-28 Antonio Fuschetto <fuschett@di.unipi.it>
* WikiExtractor.py (__garbage_namespaces): added "file" namespace to
remove list.
2009-07-10 Antonio Fuschetto <fuschett@di.unipi.it>
* WikiExtractor.py (get_wiki_document_url): changed the handling of
URL prefix (anchors use relative URLs instead of the prefix).
2009-06-26 Antonio Fuschetto <fuschett@di.unipi.it>
* WikiExtractor.py (extract_document): changed the handling of
wikilinks, adding an anchor tag for each link with a reference to the
Wikipedia document.
* WikiExtractor.py (WikiExtractor): changed the handling of
placeholders: from "[Formula 12]" to "formula_12".
2009-04-06 Antonio Fuschetto <fuschett@di.unipi.it>
* WikiExtractor.py (init): fixed bugs in apostrophe_bold_pattern and
apostrophe_italic_pattern.
* WikiExtractor.py (compact): drop lines ending with ':'
(these are sentences preceding list items); fixed some bugs.
* WikiExtractor.py: released version 1.1.
2009-03-12 Antonio Fuschetto <fuschett@di.unipi.it>
* wiki-extractor.py (main): removed the sentence splitting option.
* wiki-extractor.py: fixed some bugs; released version 1.0; changed
filename to "Wiki-Extractor.py" according to Tanl module names.
2009-03-01 Antonio Fuschetto <fuschett@di.unipi.it>
* wiki-extractor.py (main): added cross platform path management.
2008-12-12 Antonio Fuschetto <fuschett@di.unipi.it>
* wiki-extractor.py: fixed incorrect cleaning of apostrophes preceding
italic and bold text.
2008-10-27 Antonio Fuschetto <fuschett@di.unipi.it>
* wiki-extractor.py: script complete rewriting (ver 0.8).
2008-10-27 Giuseppe Attardi <attardi@di.unipi.it>
* wiki-extractor.py: added CopyLeft.
2008-07-20 Giuseppe Attardi <attardi@di.unipi.it>
* wiki-extractor.py (main): renamed option gzip to bzip.
* wiki-extractor.py (Document.__str__): removed global variables.
* wiki-extractor.py (Document): turned split_sentences, clean_document,
print_document into methods.
2008-07-15 Antonio Fuschetto <fuschett@di.unipi.it>
* wiki-extractor.py: changed object serialization using standard
pickle module.
2008-06-28 Antonio Fuschetto <fuschett@di.unipi.it>
* wiki-extractor.py: added handling of badly formatted italics.
2008-06-27 Antonio Fuschetto <fuschett@di.unipi.it>
* wiki-extractor.py: fixed incorrect conversion between encodings;
added handling of badly formatted wikilinks; added handling of
Unicode characters (numeric entities); added handling of italics
used as quoted text.
2008-06-26 Giuseppe Attardi <attardi@di.unipi.it>
* wiki-extractor.py (main): turned global variables _infile and
_outfile into locals.
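
The 2015-04-09 entry describes the core of this commit: a depth counter replaces the frame list so that template recursion can be cut off. A minimal sketch of that idea follows, written in Python 3 rather than the Python 2 of WikiExtractor.py. The TEMPLATES table, the INVOCATION pattern and the expand() function are hypothetical stand-ins, not the project's code; only the limit of 16 is taken from maxTemplateRecursionLevels in the diff.

```python
import re

# Same value as maxTemplateRecursionLevels in the diff.
MAX_DEPTH = 16

# Hypothetical template store. "loop" includes itself, so a naive
# expansion of {{loop}} would never terminate.
TEMPLATES = {
    "greet": "Hello, {{name}}!",
    "name": "world",
    "loop": "x{{loop}}",
}

# Matches a parameterless {{title}} invocation with no nested braces.
# (An assumption made to keep the sketch short; real wikitext needs
# balanced-brace parsing.)
INVOCATION = re.compile(r"\{\{([^{}]*)\}\}")


def expand(text, depth=0):
    """Expand {{title}} invocations, giving up once depth exceeds MAX_DEPTH."""
    if depth > MAX_DEPTH:
        # Drop the text instead of recursing forever.
        return ""

    def replace(match):
        body = TEMPLATES.get(match.group(1).strip())
        if body is None:
            return ""  # unknown template: drop the invocation
        # Each nested expansion carries an incremented depth.
        return expand(body, depth + 1)

    return INVOCATION.sub(replace, text)


if __name__ == "__main__":
    print(expand("{{greet}}"))
    print(len(expand("{{loop}}")))
```

Passing an integer down the call chain keeps the limit local to each branch of the recursion, whereas a shared list such as the old frame default argument has to be pushed and popped consistently by every caller.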

WikiExtractor.py

@ -270,20 +270,17 @@ dots = re.compile(r'\.{4,}')
#----------------------------------------------------------------------
# Expand templates
# Derived from:
# wikiprep.pl - Preprocess Wikipedia XML dumps
# Copyright (C) 2007 Evgeniy Gabrilovich
maxTemplateRecursionLevels = 16
maxParameterRecursionLevels = 10
# check for template beginning
reOpen = re.compile('(?<!{){{(?!{)', re.DOTALL)
def expandTemplates(text, frame=[]):
def expandTemplates(text, depth=0):
"""
:param frame: contains pairs (title, args) of previous invocations.
Template definitions can span several lines.
:param depth: recursion level.
Templates are frequently nested. Occasionally, parsing mistakes may cause
template insertion to enter an infinite loop, for instance when trying to
@ -299,14 +296,7 @@ def expandTemplates(text, frame=[]):
Therefore, we limit the number of iterations of nested template inclusion.
"""
# This evaluates expressions outside in.
# Recursion:
# expandTemplates(text, l)
# repeat maxTemplateRecursionLevels
# for templ_i in text
# replace templ_i with expandTemplate(templ_i)
# expandTemplates(instantatiated templ_i, l+1)
# until no templates present
# template = "{{" parts "}}"
for l in xrange(maxTemplateRecursionLevels):
res = ''
@ -314,7 +304,7 @@ def expandTemplates(text, frame=[]):
# look for matching {{...}}
for s,e in findMatchingBraces(text, '{{', 2):
res += text[cur:s]
res += expandTemplate(text[s+2:e-2], frame)
res += expandTemplate(text[s+2:e-2], depth+l)
cur = e
if cur == 0:
@ -390,20 +380,20 @@ def splitParameters(paramsList, sep='|'):
return parameters
def templateParams(parameters):
def templateParams(parameters, frame):
"""
Build a dictionary with positional or name key to parameters.
Build a dictionary with positional or name key to expanded parameters.
:param parameters: the parts[1:] of a template, i.e. all except the title.
"""
templateParams = {}
if not parameters:
return templateParams
# evaluate parameters, sice they may contain templates, including the
# evaluate parameters, since they may contain templates, including the
# symbol "=".
# {{#ifexpr: {{{1}}} = 1 }}
for i,p in enumerate(parameters):
parameters[i] = expandTemplates(p)
parameters = [expandTemplates(p, frame) for p in parameters]
# Parameters can be either named or unnamed. In the latter case, their
# name is defined by their ordinal position (1, 2, 3, ...).
@ -457,9 +447,35 @@ def templateParams(parameters):
return templateParams
def findMatchingBraces(text, openDelim, ldelim):
"""
:param openDelim: RE matching opening delimiter.
:param ldelim: number of braces in openDelim.
"""
# Parsing is done with respect to pairs of double braces {{..}} delimiting
# a template, and pairs of triple braces {{{..}}} delimiting a tplarg. If
# double opening braces are followed by triple closing braces or
# conversely, this is taken as delimiting a template, with one left-over
# brace outside it, taken as plain text. For any pattern of braces this
# defines a set of templates and tplargs such that any two are either
# separate or nested (not overlapping).
# Unmatched double rectangular closing brackets can be in a template or
# tplarg, but unmatched double rectangular opening brackets
# cannot. Unmatched double or triple closing braces inside a pair of
# double rectangular brackets are treated as plain text.
# Other formulation: in ambiguity between template or tplarg on one hand,
# and a link on the other hand, the structure with the rightmost opening
# takes precedence, even if this is the opening of a link without any
# closing, so not producing an actual link.
# In the case of more than three opening braces the last three are assumed
# to belong to a tplarg, unless there is no matching triple of closing
# braces, in which case the last two opening braces are assumed to
# belong to a template.
reOpen = re.compile(openDelim)
cur = 0
# scan text after {{ looking for matching }}
# scan text after {{ (openDelim) looking for matching }}
while True:
m = reOpen.search(text, cur)
if m:
@ -634,17 +650,59 @@ magicWords = set([
substWords = 'subst:|safesubst:'
def expandTemplate(templateInvocation, frame):
def expandTemplate(templateInvocation, depth):
"""
Expands template invocation.
:see braceSubstitution at
:param templateInvocation: the parts of a template.
:param depth: recursion depth.
:see http://meta.wikimedia.org/wiki/Help:Expansion for an explanation of
the process.
See in particular: Expansion of names and values
http://meta.wikimedia.org/wiki/Help:Expansion#Expansion_of_names_and_values
For most parser functions all names and values are expanded, regardless of
what is relevant for the result. The branching functions (#if, #ifeq,
#iferror, #ifexist, #ifexpr, #switch) are exceptions.
All names in a template call are expanded, and the titles of the tplargs
in the template body, after which it is determined which values must be
expanded, and for which tplargs in the template body the first part
(default).
In the case of a tplarg, any parts beyond the first are never expanded.
The possible name and the value of the first part is expanded if the title
does not match a name in the template call.
:see code for braceSubstitution at
https://doc.wikimedia.org/mediawiki-core/master/php/html/Parser_8php_source.html#3397:
"""
#logging.info('INVOCATION ' + templateInvocation) # DEBUG
# Templates and tplargs are decomposed in the same way, with pipes as
# separator, even though eventually any parts in a tplarg after the first
# (the parameter default) are ignored, and an equals sign in the first
# part is treated as plain text.
# Pipes inside inner templates and tplargs, or inside double rectangular
# brackets within the template or tplargs are not taken into account in
# this decomposition.
# The first part is called title, the other parts are simply called parts.
# If a part has one or more equals signs in it, the first equals sign
# determines the division into name = value. Equals signs inside inner
# templates and tplargs, or inside double rectangular brackets within the
# part are not taken into account in this decomposition. Parts without
# equals sign are indexed 1, 2, .., given as attribute in the <name> tag.
logging.debug('INVOCATION ' + templateInvocation)
if depth > maxTemplateRecursionLevels:
return ''
parts = splitParameters(templateInvocation)
# part1 is the portion before the first |
part1 = expandTemplates(parts[0].strip(), frame)
part1 = expandTemplates(parts[0].strip(), depth + 1)
# SUBST
if re.match(substWords, part1):
@ -663,7 +721,7 @@ def expandTemplate(templateInvocation, frame):
if colon > 1:
funct = part1[:colon]
parts[0] = part1[colon+1:].strip() # side-effect (parts[0] not used later)
ret = callParserFunction(funct, parts, frame)
ret = callParserFunction(funct, parts)
if ret is not None:
return ret
@ -677,10 +735,18 @@ def expandTemplate(templateInvocation, frame):
# Perform parameter substitution
template = templates[title]
#logging.info('TEMPLATE ' + template) # DEBUG
logging.debug('TEMPLATE ' + template)
# A parameter reference ( {{{...}}} ) may contain other parameters
# as well as templates, e.g.:
# tplarg = "{{{" parts "}}}"
# parts = [ title *( "|" part ) ]
# part = ( part-name "=" part-value ) / ( part-value )
# part-name = wikitext-L3
# part-value = wikitext-L3
# wikitext-L3 = literal / template / tplarg / link / comment /
# line-eating-comment / unclosed-comment /
# xmlish-element / *wikitext-L3
# A tplarg may contain other parameters as well as templates, e.g.:
# {{{text|{{{quote|{{{1|{{error|Error: No text given}}}}}}}}}}}
# hence no simple RE like this would work:
# '{{{((?:(?!{{{).)*?)}}}'
@ -690,18 +756,32 @@ def expandTemplate(templateInvocation, frame):
# {{{appointe{{#if:{{{appointer14|}}}|r|d}}14|}}}
# Because of the multiple uses of double-brace and triple-brace
# syntax, expressions can sometimes be ambiguous. It may be helpful or
# necessary to include spaces to resolve such ambiguity, for example
# by writing {{ {{{xxx}}} }} or {{{ {{xxx}} }}}, rather than typing
# five consecutive braces.
# syntax, expressions can sometimes be ambiguous.
# Precedence rules specified here:
# http://www.mediawiki.org/wiki/Preprocessor_ABNF#Ideal_precedence
# resolve ambiguities like this:
# {{{{ }}}} -> { {{{ }}} }
# {{{{{ }}}}} -> {{ {{{ }}} }}
#
# :see: https://en.wikipedia.org/wiki/Help:Template#Handling_parameters
params = templateParams(parts[1:])
# build a dict of name-values for the expanded parameters
params = templateParams(parts[1:], depth)
# We perform substitution iteratively.
# We also limit the maximum number of iterations to avoid too long or
# even endless loops (in case of malformed input).
# :see: http://meta.wikimedia.org/wiki/Help:Expansion#Distinction_between_variables.2C_parser_functions.2C_and_templates
#
# Parameter values are assigned to parameters in two (?) passes.
# Therefore a parameter name in a template can depend on the value of
# another parameter of the same template, regardless of the order in
# which they are specified in the template call, for example, using
# Template:ppp containing "{{{{{{p}}}}}}", {{ppp|p=q|q=r}} and even
# {{ppp|q=r|p=q}} gives r, but using Template:tvvv containing
# "{{{{{{{{{p}}}}}}}}}", {{tvvv|p=q|q=r|r=s}} gives s.
for i in xrange(maxParameterRecursionLevels):
result = ''
start = 0
@ -713,7 +793,7 @@ def expandTemplate(templateInvocation, frame):
for s,e in findBalanced(template, ['{{{', '{{'], ['}}}', '}}'],
['(?<!{){{{', '{{'], 0):
result += template[start:s] + substParameter(template[s+3:e-3],
params)
params, i)
start = e
n += 1
if n == 0: # no match
@ -723,12 +803,9 @@ def expandTemplate(templateInvocation, frame):
else:
logging.warn('Reached maximum parameter recursions: '
+ str(maxParameterRecursionLevels))
l = len(frame)
if l < maxTemplateRecursionLevels:
#logging.info('instantiated ' + str(l) + ' ' + template) # DEBUG
frame.append((title, params))
ret = expandTemplates(template, frame)
frame.pop()
if depth < maxTemplateRecursionLevels:
logging.debug('instantiated ' + str(depth) + ' ' + template)
ret = expandTemplates(template, depth + 1)
return ret
else:
logging.warn('Reached max template recursion: '
@ -739,9 +816,13 @@ def expandTemplate(templateInvocation, frame):
# The page being included could not be identified
return ""
def substParameter(parameter, templateParams):
def substParameter(parameter, templateParams, depth):
"""
:param parameter: the parts of a tplarg.
:param templateParams: dict of name-values template parameters.
"""
# the parameter name itself might contain parameters, e.g.:
# the parameter name itself might contain templates, e.g.:
# appointe{{#if:{{{appointer14|}}}|r|d}}14|
if '{{{' in parameter:
@ -749,15 +830,18 @@ def substParameter(parameter, templateParams):
start = 0
for s,e in findMatchingBraces(parameter, '(?<!{){{{(?!{)', 3):
subst += parameter[start:s] + substParameter(parameter[s+3:e-3],
templateParams)
templateParams,
depth + 1)
start = e
parameter = subst + parameter[start:]
if '{{' in parameter:
# FIXME: pass frame to limit recursion
parameter = expandTemplates(parameter)
parameter = expandTemplates(parameter, depth + 1)
m = re.match('([^|]*)\|(.*)', parameter, flags=re.DOTALL)
# any parts in a tplarg after the first (the parameter default) are
# ignored, and an equals sign in the first part is treated as plain text.
m = re.match('([^|]*)\|([^|]*)', parameter, flags=re.DOTALL)
if m:
# This parameter has a default value
paramName = m.group(1)
@ -779,8 +863,7 @@ def substParameter(parameter, templateParams):
# case we drop them.
return ''
# Surplus parameters - i.e., those assigned values in template
# invocation but not used in the template body - are simply
# ignored.
# invocation but not used in the template body - are simply ignored.
def ucfirst(string):
""":return: a string with its first character uppercase"""
@ -924,20 +1007,21 @@ def sharp_switch(primary, *templateParams):
return default
return ''
def sharp_invoke(module, function, frame):
functions = modules.get(module)
if functions:
funct = functions.get(function)
if funct:
templateTitle = fullyQualifiedTemplateTitle(function)
# find parameters in frame whose title is the one of the original
# template invocation
pair = next((x for x in frame if x[0] == templateTitle), None)
if pair:
return funct(*pair[1].values())
else:
return funct()
return None
# Extension Scribunto
# def sharp_invoke(module, function, frame):
# functions = modules.get(module)
# if functions:
# funct = functions.get(function)
# if funct:
# templateTitle = fullyQualifiedTemplateTitle(function)
# # find parameters in frame whose title is the one of the original
# # template invocation
# pair = next((x for x in frame if x[0] == templateTitle), None)
# if pair:
# return funct(*pair[1].values())
# else:
# return funct()
# return None
parserFunctions = {
@ -981,7 +1065,7 @@ parserFunctions = {
}
def callParserFunction(functionName, args, frame):
def callParserFunction(functionName, args):
"""
Parser functions have similar syntax as templates, except that
the first argument is everything after the first colon.
@ -990,9 +1074,9 @@ def callParserFunction(functionName, args, frame):
"""
try:
if functionName == '#invoke':
# special handling of frame
return sharp_invoke(args[0].strip(), args[1].strip(), frame)
# if functionName == '#invoke':
# # special handling of frame
# return sharp_invoke(args[0].strip(), args[1].strip(), frame)
if functionName in parserFunctions:
return parserFunctions[functionName](*args)
except:
@ -1632,6 +1716,8 @@ def main():
help="accepted namespaces")
parser.add_argument("-q", "--quiet", action="store_true",
help="suppress reporting progress info")
parser.add_argument("--debug", action="store_true",
help="print debug info")
parser.add_argument("-s", "--sections", action="store_true",
help="preserve sections")
parser.add_argument("-a", "--article", action="store_true",
@ -1667,8 +1753,11 @@ def main():
if args.namespaces:
acceptedNamespaces = set(args.namespaces.split(','))
logger = logging.getLogger()
if not args.quiet:
logging.basicConfig(level=logging.INFO)
logger.setLevel(logging.INFO)
if args.debug:
logger.setLevel(logging.DEBUG)
input_file = args.input
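
The last hunk sets the root logger level from the --quiet and --debug flags. A self-contained sketch of that pattern in Python 3 is given below. The configure_logging() helper and the explicit WARNING fallback are assumptions added for illustration; the diff itself leaves the level untouched when --quiet is given.

```python
import argparse
import logging


def configure_logging(argv):
    """Parse --quiet/--debug and set the root logger level accordingly."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="suppress reporting progress info")
    parser.add_argument("--debug", action="store_true",
                        help="print debug info")
    args = parser.parse_args(argv)

    # Assumption: fall back to WARNING (the root logger's default level)
    # so the result does not depend on earlier calls.
    level = logging.WARNING
    if not args.quiet:
        level = logging.INFO
    if args.debug:
        # --debug is checked last, so it wins even when --quiet is given.
        level = logging.DEBUG

    logger = logging.getLogger()
    logger.setLevel(level)
    return logger


if __name__ == "__main__":
    logging.basicConfig()
    log = configure_logging(["--debug"])
    log.debug("INVOCATION %s", "{{example}}")
```

Because the debug check comes after the quiet check, passing both flags yields DEBUG output, which matches the order of the two if statements in the hunk.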