Fixed infinite loop in template expansion.

2015-04-09 15:24:34 +02:00
commit 9f49bada4d
parent bc8066fcca
2 changed files with 366 additions and 67 deletions
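
The fix replaces the `frame` list with an integer `depth` that is passed down each recursive call and compared against `maxTemplateRecursionLevels`. The sketch below illustrates only that idea. `TEMPLATES`, `INVOCATION`, `expand` and `MAX_DEPTH` are hypothetical names introduced for this example, and the single-level brace regex is a deliberate simplification of the balanced-brace parsing done in WikiExtractor.py.

```python
import re

MAX_DEPTH = 16  # same value as maxTemplateRecursionLevels in the patch

# Hypothetical template store; 'loop' includes itself without end.
TEMPLATES = {
    'greet': 'hello {{name}}',
    'name': 'world',
    'loop': 'x{{loop}}',
}

# Simplification: matches only {{title}} with no nested braces inside.
INVOCATION = re.compile(r'\{\{([^{}]*)\}\}')

def expand(text, depth=0):
    """Expand {{title}} invocations, giving up beyond MAX_DEPTH levels."""
    if depth > MAX_DEPTH:
        return ''  # cut off runaway recursion instead of looping forever

    def replace(match):
        body = TEMPLATES.get(match.group(1).strip(), '')
        return expand(body, depth + 1)

    return INVOCATION.sub(replace, text)
```

With this table, `expand('{{greet}}')` resolves normally, while `expand('{{loop}}')` terminates once the depth bound is exceeded rather than recursing indefinitely.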
--- a/210
+++ b/210
@ -0,0 +1,210 @@
+2015-04-09  Giuseppe Attardi  <attardi@di.unipi.it>
+
+	* WikiExtractor.py (expandTemplates): replaced frame parameter with
+	depth, used to limit max template recursion.
+
+2015-04-07  Giuseppe Attardi  <attardi@di.unipi.it>
+
+	* WikiExtractor.py (main): added --debug option.
+
+2015-01-24  Giuseppe Attardi  <attardi@di.unipi.it>
+
+	* WikiExtractor.py (splitParameters): rewritten template
+	processing by performing proper parsing of all balanced
+	expressions in template invocation and expansion, using iterator
+	findBalanced().
+
+2015-01-18  Giuseppe Attardi  <attardi@di.unipi.it>
+
+	* WikiExtractor.py (expandTemplates): template expansion now working.
+
+2015-01-11  Giuseppe Attardi  <attardi@di.unipi.it>
+
+	* WikiExtractor.py (externalLink): replaced .* with appropriate
+	[^x]* here and elsewhere.
+
+2015-01-10  Giuseppe Attardi  <attardi@di.unipi.it>
+
+	* WikiExtractor.py (main): added option --article for processing a
+	single article.
+	(main): get dump from file rather than from stdin, so that
+	preprocessing does not need to save data to temp file.
+
+2014-02-25  Giuseppe Attardi  <attardi@di.unipi.it>
+
+	* WikiExtractor.py (ignoreTag): make / optional.
+
+2013-12-15  Giuseppe Attardi  <attardi@di.unipi.it>
+
+	* WikiExtractor.py (clean): added template expansion
+
+2013-10-14  Giuseppe Attardi  <attardi@di.unipi.it>
+
+	* WikiExtractor.py: added wiktionary and wikt to the namespaces
+	(used e.g. in http://en.wikipedia.org/wiki?curid=12)
+
+2013-05-09  Giuseppe Attardi  <attardi@di.unipi.it>
+
+	* WikiExtractor.py (main): handle properly keepLinks option.
+
+2013-04-05  Giuseppe Attardi  <attardi@di.unipi.it>
+
+	* WikiExtractor.py (compact): keep lines ending with ':'.
+
+2013-04-02  Giuseppe Attardi  <attardi@di.unipi.it>
+
+	* WikiExtractor.py: obtain prefix from dump.
+
+2013-01-27  Giuseppe Attardi  <attardi@di.unipi.it>
+
+	* WikiExtractor.py (WikiDocument): add newline after <doc>.
+	Release version 2.3.
+
+2012-12-30  Giuseppe Attardi  <attardi@di.unipi.it>
+
+	* WikiExtractor.py (process_data): added patch by Humberto
+	Pereira, who claims a 10x improvement in speed.
+	(main): added option to set acceptedNamespaces
+
+2012-11-01  Giuseppe Attardi  <attardi@di.unipi.it>
+
+	* WikiExtractor.py (get_url): create URL from Id rather than from title.
+
+2012-06-28  Giuseppe Attardi  <attardi@di.unipi.it>
+
+	* WikiExtractor.py (OutputSplitter.reserve): added method to
+	invoke before writing.
+	(WikiDocument): use reserve() before writing whole page.
+	(main): added version number and option -v.
+
+2012-05-17  Giuseppe Attardi  <attardi@di.unipi.it>
+
+	* WikiExtractor.py (main): added option to preserve sections as
+	HTML headers and lists as <LI>.
+
+2012-05-08  Giuseppe Attardi  <attardi@di.unipi.it>
+
+	* WikiExtractor.py: Released version 2.0.
+
+	* test/test.xml: added sample to test hard cases for extractor.
+
+	* WikiExtractor.py (dropNested): Completely rewritten to be more
+	compliant with WikiMedia Markup language.
+	Use proper parsing functions to handle nested structures.
+	Improved performance by reducing creation of lists and strings.
+	Use htmlentitydefs instead of hand crafted list.
+	Added parameter -b to set URL for site.
+	Extensive use of RegExpr instead of specific string tests.
+	Deal with preformatted text.
+	Added parameter acceptedNamespaces to select namespaces to retain
+	in page titles or wiki links.
+	TODO:
+	1. handle Template expansion. See WikiPrep
+	   (http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep/)
+	2. Use full parser in order to better deal with nested and
+	   unbalanced expressions.
+
+2011-02-10  Giuseppe Attardi  <attardi@di.unipi.it>
+
+	* WikiExtractor.py: added Copyright.
+
+2012-02-15  Stefano Dei Rossi  <deirossi@semawiki.di.unipi.it>
+
+	* WikiExtractor.py (WikiExtractor): &nbsp; replaced with a simple space
+	instead of u'\u00A0'.
+
+2009-11-03  Antonio Fuschetto  <fuschett@di.unipi.it>
+
+	* WikiExtractor.py: updated version to 1.6 (Oct 17).
+
+2009-10-17  Giuseppe Attardi  <attardi@di.unipi.it>
+
+	* WikiExtractor.py: turned prefix into a parameter.
+
+2009-07-29  Antonio Fuschetto  <fuschett@di.unipi.it>
+
+	* WikiExtractor.py (init): fixed bugs in apostrophe_bold_pattern and
+	apostrophe_italic_pattern.
+
+2009-07-28  Antonio Fuschetto  <fuschett@di.unipi.it>
+
+	* WikiExtractor.py (__garbage_namespaces): added "file" namespace to
+	remove list.
+
+2009-07-10  Antonio Fuschetto  <fuschett@di.unipi.it>
+
+	* WikiExtractor.py (get_wiki_document_url): changed the handling of
+	URL prefix (anchors don't use the prefix but relative URLs).
+
+2009-06-26  Antonio Fuschetto  <fuschett@di.unipi.it>
+
+	* WikiExtractor.py (extract_document): changed the handling of
+	wikilinks, adding an anchor tag for each link with a reference to the
+	Wikipedia document.
+
+	* WikiExtractor.py (WikiExtractor): changed the handling of
+	placeholders: from "[Formula 12]" to "formula_12".
+
+2009-04-06  Antonio Fuschetto  <fuschett@di.unipi.it>
+
+	* WikiExtractor.py (init): fixed bugs in apostrophe_bold_pattern and
+	apostrophe_italic_pattern.
+
+	* WikiExtractor.py (compact): drop lines ending with ':'
+	(these are sentences preceding list items); fixed some bugs.
+
+	* WikiExtractor.py: released version 1.1.
+
+2009-03-12  Antonio Fuschetto  <fuschett@di.unipi.it>
+
+	* wiki-extractor.py (main): removed the sentence splitting option.
+
+	* wiki-extractor.py: fixed some bugs; released version 1.0; changed
+	filename to "Wiki-Extractor.py" according to Tanl module names.
+
+2009-03-01  Antonio Fuschetto  <fuschett@di.unipi.it>
+
+	* wiki-extractor.py (main): added cross platform path management.
+
+2008-12-12  Antonio Fuschetto  <fuschett@di.unipi.it>
+
+	* wiki-extractor.py: fixed a wrong cleaning of apostrophes prior to
+	italic and bold text.
+
+2008-10-27  Antonio Fuschetto  <fuschett@di.unipi.it>
+
+	* wiki-extractor.py: script complete rewriting (ver 0.8).
+
+2008-10-27  Giuseppe Attardi  <attardi@di.unipi.it>
+
+	* wiki-extractor.py: added CopyLeft.
+
+2008-07-20  Giuseppe Attardi  <attardi@di.unipi.it>
+
+	* wiki-extractor.py (main): renamed option gzip to bzip.
+
+	* wiki-extractor.py (Document.__str__): removed global variables.
+
+	* wiki-extractor.py (Document): turned split_sentences, clean_document,
+	print_document into methods.
+
+2008-07-15  Antonio Fuschetto  <fuschett@di.unipi.it>
+
+	* wiki-extractor.py: changed object serialization using standard
+	pickle module.
+
+2008-06-28  Antonio Fuschetto  <fuschett@di.unipi.it>
+
+	* wiki-extractor.py: added the management of italics with a bad format.
+
+2008-06-27  Antonio Fuschetto  <fuschett@di.unipi.it>
+
+	* wiki-extractor.py: fixed a wrong conversion between encodings;
+	added the management of wikilinks with a bad format; added the
+	management of unicode characters (numeric entities); added the management
+	of italics like quoted text.
+
+2008-06-26  Giuseppe Attardi  <attardi@di.unipi.it>
+
+	* wiki-extractor.py (main): turned global variables _infile and
+	_outfile into locals.
--- a/WikiExtractor.py
+++ b/WikiExtractor.py
@ -270,20 +270,17 @@ dots = re.compile(r'\.{4,}')
 #----------------------------------------------------------------------
 # Expand templates

-# Derived from:
-# wikiprep.pl - Preprocess Wikipedia XML dumps
-# Copyright (C) 2007 Evgeniy Gabrilovich
-
 maxTemplateRecursionLevels = 16
 maxParameterRecursionLevels = 10

 # check for template beginning
 reOpen = re.compile('(?<!{){{(?!{)', re.DOTALL)

-def expandTemplates(text, frame=[]):
+def expandTemplates(text, depth=0):
    """
    :param frame: contains pairs (title, args) of previous invocations.
    Template definitions can span several lines.
+    :param depth: recursion level.

    Templates are frequently nested. Occasionally, parsing mistakes may cause
    template insertion to enter an infinite loop, for instance when trying to
@ -299,14 +296,7 @@ def expandTemplates(text, frame=[]):
    Therefore, we limit the number of iterations of nested template inclusion.
    """

-    # This evaluates expressions outside in.
-    # Recursion:
-    #   expandTemplates(text, l)
-    #     repeat maxTemplateRecursionLevels
-    #        for templ_i in text
-    #            replace templ_i with expandTemplate(templ_i)
-    #                                    expandTemplates(instantatiated templ_i, l+1)
-    #        until no templates present
+    # template        = "{{" parts "}}"

    for l in xrange(maxTemplateRecursionLevels):
        res = ''
@ -314,7 +304,7 @@ def expandTemplates(text, frame=[]):
        # look for matching {{...}}
        for s,e in findMatchingBraces(text, '{{', 2):
            res += text[cur:s]
-            res += expandTemplate(text[s+2:e-2], frame)
+            res += expandTemplate(text[s+2:e-2], depth+l)
            cur = e

        if cur == 0:
@ -390,20 +380,20 @@ def splitParameters(paramsList, sep='|'):

    return parameters

-def templateParams(parameters):
+def templateParams(parameters, frame):
    """
-    Build a dictionary with positional or name key to parameters.
+    Build a dictionary with positional or name key to expanded parameters.
+    :param parameters: the parts[1:] of a template, i.e. all except the title.
    """
    templateParams = {}

    if not parameters:
        return templateParams

-    # evaluate parameters, sice they may contain templates, including the
+    # evaluate parameters, since they may contain templates, including the
    # symbol "=".
    # {{#ifexpr: {{{1}}} = 1 }}
-    for i,p in enumerate(parameters):
-        parameters[i] = expandTemplates(p)
+    parameters = [expandTemplates(p, frame) for p in parameters]

    # Parameters can be either named or unnamed. In the latter case, their
    # name is defined by their ordinal position (1, 2, 3, ...).
@ -457,9 +447,35 @@ def templateParams(parameters):
    return templateParams

 def findMatchingBraces(text, openDelim, ldelim):
+    """
+    :param openDelim: RE matching opening delimiter.
+    :param ldelim: number of braces in openDelim.
+    """
+    # Parsing is done with respect to pairs of double braces {{..}} delimiting
+    # a template, and pairs of triple braces {{{..}}} delimiting a tplarg. If
+    # double opening braces are followed by triple closing braces or
+    # conversely, this is taken as delimiting a template, with one left-over
+    # brace outside it, taken as plain text. For any pattern of braces this
+    # defines a set of templates and tplargs such that any two are either
+    # separate or nested (not overlapping).
+
+    # Unmatched double rectangular closing brackets can be in a template or
+    # tplarg, but unmatched double rectangular opening brackets
+    # cannot. Unmatched double or triple closing braces inside a pair of
+    # double rectangular brackets are treated as plain text.
+    # Other formulation: in ambiguity between template or tplarg on one hand,
+    # and a link on the other hand, the structure with the rightmost opening
+    # takes precedence, even if this is the opening of a link without any
+    # closing, so not producing an actual link.
+
+    # In the case of more than three opening braces the last three are assumed
+    # to belong to a tplarg, unless there is no matching triple of closing
+    # braces, in which case the last two opening braces are assumed to
+    # belong to a template.
+
    reOpen = re.compile(openDelim)
    cur = 0
-    # scan text after {{ looking for matching }}
+    # scan text after {{ (openDelim) looking for matching }}
    while True:
        m = reOpen.search(text, cur)
        if m:
@ -634,17 +650,59 @@ magicWords = set([

 substWords = 'subst:|safesubst:'

-def expandTemplate(templateInvocation, frame):
+def expandTemplate(templateInvocation, depth):
    """
    Expands template invocation.
-    :see braceSubstitution at
+    :param templateInvocation: the parts of a template.
+    :param depth: recursion depth.
+
+    :see http://meta.wikimedia.org/wiki/Help:Expansion for an explanation of
+    the process.
+
+    See in particular: Expansion of names and values
+    http://meta.wikimedia.org/wiki/Help:Expansion#Expansion_of_names_and_values
+
+    For most parser functions all names and values are expanded, regardless of
+    what is relevant for the result. The branching functions (#if, #ifeq,
+    #iferror, #ifexist, #ifexpr, #switch) are exceptions.
+
+    All names in a template call are expanded, and the titles of the tplargs
+    in the template body, after which it is determined which values must be
+    expanded, and for which tplargs in the template body the first part
+    (default).
+
+    In the case of a tplarg, any parts beyond the first are never expanded.
+    The possible name and the value of the first part is expanded if the title
+    does not match a name in the template call.
+
+    :see code for braceSubstitution at
    https://doc.wikimedia.org/mediawiki-core/master/php/html/Parser_8php_source.html#3397:
+
    """

-    #logging.info('INVOCATION ' + templateInvocation) # DEBUG
+    # Templates and tplargs are decomposed in the same way, with pipes as
+    # separator, even though eventually any parts in a tplarg after the first
+    # (the parameter default) are ignored, and an equals sign in the first
+    # part is treated as plain text.
+    # Pipes inside inner templates and tplargs, or inside double rectangular
+    # brackets within the template or tplargs are not taken into account in
+    # this decomposition.
+    # The first part is called title, the other parts are simply called parts.
+
+    # If a part has one or more equals signs in it, the first equals sign
+    # determines the division into name = value. Equals signs inside inner
+    # templates and tplargs, or inside double rectangular brackets within the
+    # part are not taken into account in this decomposition. Parts without
+    # equals sign are indexed 1, 2, .., given as attribute in the <name> tag.
+
+    logging.debug('INVOCATION ' + templateInvocation)
+
+    if depth > maxTemplateRecursionLevels:
+        return ''
+
    parts = splitParameters(templateInvocation)
    # part1 is the portion before the first |
-    part1 = expandTemplates(parts[0].strip(), frame)
+    part1 = expandTemplates(parts[0].strip(), depth + 1)

    # SUBST
    if re.match(substWords, part1):
@ -663,7 +721,7 @@ def expandTemplate(templateInvocation, frame):
    if colon > 1:
        funct = part1[:colon]
        parts[0] = part1[colon+1:].strip() # side-effect (parts[0] not used later)
-        ret = callParserFunction(funct, parts, frame)
+        ret = callParserFunction(funct, parts)
        if ret is not None:
            return ret

@ -677,10 +735,18 @@ def expandTemplate(templateInvocation, frame):
        # Perform parameter substitution

        template = templates[title]
-        #logging.info('TEMPLATE ' + template) # DEBUG
+        logging.debug('TEMPLATE ' + template)

-        # A parameter reference ( {{{...}}} ) may contain other parameters
-        # as well as templates, e.g.:
+        # tplarg          = "{{{" parts "}}}"
+        # parts           = [ title *( "|" part ) ]
+        # part            = ( part-name "=" part-value ) / ( part-value )
+        # part-name       = wikitext-L3
+        # part-value      = wikitext-L3
+        # wikitext-L3     = literal / template / tplarg / link / comment /
+        #                   line-eating-comment / unclosed-comment /
+        #                   xmlish-element / *wikitext-L3
+
+        # A tplarg may contain other parameters as well as templates, e.g.:
        #  {{{text|{{{quote|{{{1|{{error|Error: No text given}}}}}}}}}}}
        # hence no simple RE like this would work:
        # '{{{((?:(?!{{{).)*?)}}}'
@ -690,18 +756,32 @@ def expandTemplate(templateInvocation, frame):
        # {{{appointe{{#if:{{{appointer14|}}}|r|d}}14|}}}

        # Because of the multiple uses of double-brace and triple-brace
-        # syntax, expressions can sometimes be ambiguous. It may be helpful or
-        # necessary to include spaces to resolve such ambiguity, for example
-        # by writing {{ {{{xxx}}} }} or {{{ {{xxx}} }}}, rather than typing
-        # five consecutive braces.
+        # syntax, expressions can sometimes be ambiguous.
+        # Precedence rules specified here:
+        # http://www.mediawiki.org/wiki/Preprocessor_ABNF#Ideal_precedence
+        # resolve ambiguities like this:
+        # {{{{ }}}} -> { {{{ }}} }
+        # {{{{{ }}}}} -> {{ {{{ }}} }}
+        # 
        # :see: https://en.wikipedia.org/wiki/Help:Template#Handling_parameters

-        params = templateParams(parts[1:])
+        # build a dict of name-values for the expanded parameters
+        params = templateParams(parts[1:], depth)

        # We perform substitution iteratively.
        # We also limit the maximum number of iterations to avoid too long or
        # even endless loops (in case of malformed input).

+        # :see: http://meta.wikimedia.org/wiki/Help:Expansion#Distinction_between_variables.2C_parser_functions.2C_and_templates
+        #
+        # Parameter values are assigned to parameters in two (?) passes.
+        # Therefore a parameter name in a template can depend on the value of
+        # another parameter of the same template, regardless of the order in
+        # which they are specified in the template call, for example, using
+        # Template:ppp containing "{{{{{{p}}}}}}", {{ppp|p=q|q=r}} and even
+        # {{ppp|q=r|p=q}} gives r, but using Template:tvvv containing
+        # "{{{{{{{{{p}}}}}}}}}", {{tvvv|p=q|q=r|r=s}} gives s.
+
        for i in xrange(maxParameterRecursionLevels):
            result = ''
            start = 0
@ -713,7 +793,7 @@ def expandTemplate(templateInvocation, frame):
            for s,e in findBalanced(template, ['{{{', '{{'], ['}}}', '}}'],
                                    ['(?<!{){{{', '{{'], 0):
                result += template[start:s] + substParameter(template[s+3:e-3],
-                                                             params)
+                                                             params, i)
                start = e
                n += 1
            if n == 0:          # no match
@ -723,12 +803,9 @@ def expandTemplate(templateInvocation, frame):
        else:
            logging.warn('Reachead maximum parameter recursions: '
                         + str(maxParameterRecursionLevels))
-        l = len(frame)
-        if l < maxTemplateRecursionLevels:
-            #logging.info('instantiated ' + str(l) + ' ' + template) # DEBUG
-            frame.append((title, params))
-            ret =  expandTemplates(template, frame)
-            frame.pop()
+        if depth < maxTemplateRecursionLevels:
+            logging.debug('instantiated ' + str(depth) + ' ' + template)
+            ret = expandTemplates(template, depth + 1)
            return ret
        else:
            logging.warn('Reached max template recursion: '
@ -739,9 +816,13 @@ def expandTemplate(templateInvocation, frame):
        # The page being included could not be identified
        return ""

-def substParameter(parameter, templateParams):
+def substParameter(parameter, templateParams, depth):
+    """
+    :param parameter: the parts of a tplarg.
+    :param templateParams: dict of name-values template parameters.
+    """

-    # the parameter name itself might contain parameters, e.g.:
+    # the parameter name itself might contain templates, e.g.:
    # appointe{{#if:{{{appointer14|}}}|r|d}}14|

    if '{{{' in parameter:
@ -749,15 +830,18 @@ def substParameter(parameter, templateParams):
        start = 0
        for s,e in findMatchingBraces(parameter, '(?<!{){{{(?!{)', 3):
            subst += parameter[start:s] + substParameter(parameter[s+3:e-3],
-                                                         templateParams)
+                                                         templateParams,
+                                                         depth + 1)
            start = e
        parameter = subst + parameter[start:]

    if '{{' in parameter:
-        # FIXME: pass frame to limit recursion
-        parameter = expandTemplates(parameter)
+        parameter = expandTemplates(parameter, depth + 1)

-    m = re.match('([^|]*)\|(.*)', parameter, flags=re.DOTALL)
+    # any parts in a tplarg after the first (the parameter default) are
+    # ignored, and an equals sign in the first part is treated as plain text.
+
+    m = re.match('([^|]*)\|([^|]*)', parameter, flags=re.DOTALL)
    if m:
        # This parameter has a default value
        paramName = m.group(1)
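
The effect of tightening the pattern in `substParameter` can be checked in isolation. The sketch below compares the old and new expressions on a made-up tplarg body. The input string `'title|default|ignored'` and the names `OLD`, `NEW`, `old_default` and `new_default` are illustrative assumptions, not part of the patch.

```python
import re

# Old pattern: group 2 swallows everything after the first pipe.
OLD = re.compile(r'([^|]*)\|(.*)', re.DOTALL)
# New pattern: group 2 stops at the next pipe, so later parts are ignored.
NEW = re.compile(r'([^|]*)\|([^|]*)', re.DOTALL)

tplarg = 'title|default|ignored'  # hypothetical tplarg body

old_default = OLD.match(tplarg).group(2)
new_default = NEW.match(tplarg).group(2)
```

Here `old_default` still contains the trailing part, while `new_default` holds only the first default, which matches the comment that parts after the first are ignored.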
@ -779,8 +863,7 @@ def substParameter(parameter, templateParams):
        # case we drop them.
        return ''
    # Surplus parameters - i.e., those assigned values in template
-    # invocation but not used in the template body - are simply
-    # ignored.
+    # invocation but not used in the template body - are simply ignored.

 def ucfirst(string):
    """:return: a string with its first character uppercase"""
@ -924,20 +1007,21 @@ def sharp_switch(primary, *templateParams):
        return default
    return ''

-def sharp_invoke(module, function, frame):
-    functions = modules.get(module)
-    if functions:
-        funct = functions.get(function)
-        if funct:
-            templateTitle = fullyQualifiedTemplateTitle(function)
-            # find parameters in frame whose title is the one of the original
-            # template invocation
-            pair = next((x for x in frame if x[0] == templateTitle), None)
-            if pair:
-                return funct(*pair[1].values())
-            else:
-                return funct()
-    return None
+# Extension Scribunto
+# def sharp_invoke(module, function, frame):
+#     functions = modules.get(module)
+#     if functions:
+#         funct = functions.get(function)
+#         if funct:
+#             templateTitle = fullyQualifiedTemplateTitle(function)
+#             # find parameters in frame whose title is the one of the original
+#             # template invocation
+#             pair = next((x for x in frame if x[0] == templateTitle), None)
+#             if pair:
+#                 return funct(*pair[1].values())
+#             else:
+#                 return funct()
+#     return None

 parserFunctions = {

@ -981,7 +1065,7 @@ parserFunctions = {

 }

-def callParserFunction(functionName, args, frame):
+def callParserFunction(functionName, args):
    """
    Parser functions have similar syntax as templates, except that
    the first argument is everything after the first colon.
@ -990,9 +1074,9 @@ def callParserFunction(functionName, args, frame):
    """
  
    try:
-       if functionName == '#invoke':
-           # special handling of frame
-           return sharp_invoke(args[0].strip(), args[1].strip(), frame)
+       # if functionName == '#invoke':
+       #     # special handling of frame
+       #     return sharp_invoke(args[0].strip(), args[1].strip(), frame)
       if functionName in parserFunctions:
           return parserFunctions[functionName](*args)
    except:
@ -1632,6 +1716,8 @@ def main():
                        help="accepted namespaces")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="suppress reporting progress info")
+    parser.add_argument("--debug", action="store_true",
+                        help="print debug info")
    parser.add_argument("-s", "--sections", action="store_true",
                        help="preserve sections")
    parser.add_argument("-a", "--article", action="store_true",
@ -1667,8 +1753,11 @@ def main():
    if args.namespaces:
        acceptedNamespaces = set(args.ns.split(','))

+    logger = logging.getLogger()
    if not args.quiet:
-        logging.basicConfig(level=logging.INFO)
+        logger.setLevel(logging.INFO)
+    if args.debug:
+        logger.setLevel(logging.DEBUG)

    input_file = args.input
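
The interaction between `--quiet` and `--debug` in the last hunk can be tried on its own. This is a minimal sketch, assuming only the two flags shown in the diff. `configure_logging` is a hypothetical helper written for this example, whereas the patch sets the levels inline in `main()`.

```python
import argparse
import logging

def configure_logging(argv):
    """Hypothetical helper mirroring the level selection added to main()."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="suppress reporting progress info")
    parser.add_argument("--debug", action="store_true",
                        help="print debug info")
    args = parser.parse_args(argv)

    logger = logging.getLogger()  # root logger, as in the patch
    if not args.quiet:
        logger.setLevel(logging.INFO)
    if args.debug:
        # checked last, so --debug wins even when --quiet is also given
        logger.setLevel(logging.DEBUG)
    return logger.level
```

Because the `--debug` test comes second, passing both flags leaves the root logger at `DEBUG`. With `--quiet` alone the level is left unchanged.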