400 lines
14 KiB
Plaintext
400 lines
14 KiB
Plaintext
2015-11-20 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py (makeExternalLink): fixed.
|
|
|
|
2015-10-29 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py: /usr/bin/env
|
|
|
|
2015-09-29 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py: fixed argparse default.
|
|
(load_templates): save also modules, for a future release capaple
|
|
of applying them.
|
|
|
|
2015-09-14 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py (Extractor.templateParams): allow space in key
|
|
(process_dump): terminate by joining processes rather than queues.
|
|
|
|
* WikiExtractor.py (process_dump): eliminated ordering_queue in
|
|
favor of termination signals to jobs_queue.
|
|
|
|
2015-09-13 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py (process_dump): queue.put() dropped second arg
|
|
True, since it is the default.
|
|
(process_dump): queue renamed to jobs_queue.
|
|
(main): restored default bytes to 1M.
|
|
Dropped confusing option to write to a single file: use -o - for that.
|
|
|
|
2015-08-30 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py (main): check presence of title element in
|
|
single article.
|
|
(load_templates): reconstruct the template namespace from the
|
|
title of the first template in the saved templates.
|
|
(define_template): match also #redirect as used in French.
|
|
|
|
2015-06-02 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py (Extractor.expandTemplate): extend frame before
|
|
subst(), since there may be recursion in default parameter value,
|
|
e.g. {{OTRS|celebrative|date=April 2015}} in article 21637542 in
|
|
enwiki.
|
|
|
|
2015-05-06 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py (main): fixed arg.namespaces.
|
|
(compact): use fillvalue=' ' in izip_longest.
|
|
|
|
2015-04-26 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py (clean): use re.U when matching \W or chinese
|
|
characters will be lost.
|
|
|
|
2015-04-25 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py (sharp_expr): use unicode() instread od str()
|
|
or else chinese article 596814 fails.
|
|
(splitParameters): use findMatchingBraces istead of findBalanced,
|
|
to properly handle {{{4|{{{{{subst|}}}CURRENTYEAR}}}}
|
|
(Extractor.extract): handle currentyear, currentmonth and currentday
|
|
|
|
2015-04-23 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py: make UTF-8 the default encoding
|
|
|
|
2015-04-22 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py (replaceInternalLinks): function for replacing
|
|
internal links, modeled after MediaWiki original.
|
|
(replaceExternalLinks): revised taking into account the former.
|
|
(replaceInternalLinks): fix to nested iterator.
|
|
|
|
2015-04-20 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py (findBalanced): dropped optional arguments.
|
|
(make_anchor_tag): dont use splitParameters() since we must
|
|
consider also single [, which do not count ib parameters.
|
|
(sharp_switch): use split() to split at fist = sign.
|
|
(main): grouped command options.
|
|
(main): removed option -B.
|
|
(Template): class used to represent templates. It provides a method
|
|
for parameter subsitution.
|
|
Templats are parsed on demand when used and stored in a cache.
|
|
|
|
2015-04-19 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py (clean): use dropNext to drop discardElements.
|
|
(discardElements): added div.
|
|
(compact): avoid duplicated header line with optio --sections.
|
|
(compact): skip empty list items.
|
|
(main): changed logger format.
|
|
(ignoredTags): added abbr.
|
|
(clean): handle extension SyntaxHighlight.
|
|
(main): use % parameters in logging.
|
|
(main): added option --html, that subsumes --links and --sections
|
|
for producing HTML output instead of plain text.
|
|
(Extractor): moved here variables keepLinks, keepSections, toHTML.
|
|
(compact): modified to produce HTML lists.
|
|
|
|
2015-04-18 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py (compact): strip lists of initial characters.
|
|
|
|
2015-04-17 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py (sharp_switch): fixed handling of default.
|
|
(uc, lc, ucfirst, lcfirst): parser functions.
|
|
|
|
2015-04-16 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py (MagicWords): dealing with MagicWords.
|
|
This required rewriting code as methods of class Extractor, since
|
|
some MagicWords are document related and are being handled in
|
|
different threads, hence thay canot be globals.
|
|
(findMatchingBraces): fix to ambiguities resolution.
|
|
(clean): fixed dealing with trail for make_anchor-tag()
|
|
(substParameter): expand defaultValue only if used.
|
|
(selfClosing_tag_patterns): avoid matching besides tag end..
|
|
|
|
2015-04-15 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py (expandTemplates): increase depth only when
|
|
calling expandTemplate()
|
|
(define_template): removed \n in onlyincludeAccumulator, drop
|
|
<noinclude> always.
|
|
(sharp_invoke): restored support for #invoke, by adding parameter
|
|
frame to expandTemplate.
|
|
(main): allow specifying G in --bytes.
|
|
(make_anchor_tag): urlencode link.
|
|
(wikiLink): properly match anchor.
|
|
(make_anchor_tag): use splitParameters to separate parts of link.
|
|
(splitParameters): include pair [], since arguments may contain
|
|
wikilinks.
|
|
|
|
2015-04-14 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py (clean): dropped removal of preformatted lines,
|
|
since it is hard to distinguich them, since templates may
|
|
introdice lines with starting blanks.
|
|
(discardElements) added 'small'.
|
|
(ignoredTags) removed 'small'.
|
|
(make_anchor_tag): fixed RE for wikiLink.
|
|
(sharp_expr): added infix operators.
|
|
(Infix): support for infix operators.
|
|
(Extractor.extract): moved here logging of document being processed.
|
|
(clean): rewritten handling of wikilinks since using RE is to slow.
|
|
(maxTemplateRecursionLevels): increased to 30.
|
|
|
|
2015-04-13 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py (findMatchingBraces): rewritten to handle
|
|
ambiguities.
|
|
(substParameter): only evaluate name and default.
|
|
(main): fixed option --article.
|
|
|
|
2015-04-12 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py (ExtractorThread): enabled multithread version.
|
|
(findMatchingBraces): handle isolated braces.
|
|
(expandTemplates): recurse on result from expandTemplate().
|
|
|
|
2015-04-11 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py (sharp_switch): deal properly with #default.
|
|
(OutputSplitter): update to new-style classes
|
|
|
|
* WikiExtractor.py (selfClosingTags): added nowiki.
|
|
|
|
* WikiExtractor.py (bold_italic, bold): allow single quote inside,
|
|
e.g. '''[[Chinese New Year|New Year's Eve]]'''.
|
|
|
|
* WikiExtractor.py (templateParams): fix pattern to match
|
|
parameter name.
|
|
|
|
* WikiExtractor.py (substParameter): use splitParameters()
|
|
|
|
* WikiExtractor.py (main): added --no-templates option.
|
|
|
|
* WikiExtractor.py (substParameter): added parameter param_depth
|
|
to control depth of parameter expansion, separately from depth,
|
|
used for template expansion.
|
|
|
|
2015-04-10 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py (callParserFunction): return '' also in case of
|
|
failure.
|
|
|
|
2015-04-09 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py (expandTemplates): replaced frame parameter with
|
|
depth, used to limit max template recursion.
|
|
|
|
2015-04-07 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py (main): added --debug option.
|
|
|
|
2015-01-24 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py (splitParameters): rewritten template
|
|
processing by performing proper parsing of all balanced
|
|
expressions in templates invocation and expansion, using iterator
|
|
findBalanced().
|
|
|
|
2015-01-18 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py (expandTemplates): template expansion now working.
|
|
|
|
2015-01-11 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py (externalLink): replaced .* with appropriate
|
|
[^x]* here and elsewhere.
|
|
|
|
2015-01-10 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py (main): added option --article for processing a
|
|
single article.
|
|
(main): get dump rm file rather than frpm stdin, so that
|
|
preprocessing does not need to save data to temp file.
|
|
|
|
2014-02-25 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py (ignoreTag): make / optional.
|
|
|
|
2013-12-15 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py (clean): added template expansion
|
|
|
|
2013-10-14 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py: added wiktionary and wikt to the namespaces
|
|
(used e.g. in http://en.wikipedia.org/wiki?curid=12)
|
|
|
|
2013-05-09 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py (main): handle properly keepLinks option.
|
|
|
|
2013-04-05 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py (compact): keep lines ending with ':'.
|
|
|
|
2013-04-02 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py: obtain prefix from dump.
|
|
|
|
2013-01-27 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py (WikiDocument): add newline after <doc>.
|
|
Release version 2.3.
|
|
|
|
2012-12-30 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py (process_data): added patch by Humberto
|
|
Pereira, who claims a 10x improvement in speed.
|
|
(main): added option to set acceptedNamespaces
|
|
|
|
2012-11-01 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py (get_url): create URL from Id instead than from title.
|
|
|
|
2012-06-28 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py (OutputSplitter.reserve): added method to
|
|
invoke before writing.
|
|
(WikiDocument): use reserve() before writing whole page.
|
|
(main): added version number and option -v.
|
|
|
|
2012-05-17 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py (main): added option to preserve sections as
|
|
HTML headers and lists as <LI>.
|
|
|
|
2012-05-08 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py: Released version 2.0.
|
|
|
|
* test/test.xml: added sample to test hard cases for extractor.
|
|
|
|
* WikiExtractor.py (dropNested): Completely rewritten to be more
|
|
compliant to WikiMedia Markup language.
|
|
Use proper parsing fuctions to handle nested structures.
|
|
Improved performance by reducing creation of lists and strings.
|
|
Use htmlentitydefs instead of hand crafted list.
|
|
Added parameter -b to set URL for site.
|
|
Extensive use of RegExpr instead of specific string tests.
|
|
Deal with preformatted text.
|
|
Added parameter accepetedNamespaces to select namespaces to retain
|
|
in page titles or wiki links.
|
|
TODO:
|
|
1. handle Template expansion. See WikiPrep
|
|
(http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep/)
|
|
2. Use full parser in order to better deal with nested and
|
|
unbalanced expressions.
|
|
|
|
2011-02-10 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py: added Copyright.
|
|
|
|
2012-02-15 Stefano Dei Rossi <deirossi@semawiki.di.unipi.it>
|
|
|
|
* WikiExtractor.py (WikiExtractor): replaced with a simple space
|
|
instead of u'\u00A0'.
|
|
|
|
2009-11-03 Antonio Fuschetto <fuschett@di.unipi.it>
|
|
|
|
* WikiExtractor.py: updated version to 1.6 (Oct 17).
|
|
|
|
2009-10-17 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* WikiExtractor.py: turned prefix into a parameter.
|
|
|
|
2009-07-29 Antonio Fuschetto <fuschett@di.unipi.it>
|
|
|
|
* WikiExtractor.py (init): fixed bugs in apostrophe_bold_pattern and
|
|
apostrophe_italic_pattern.
|
|
|
|
2009-07-28 Antonio Fuschetto <fuschett@di.unipi.it>
|
|
|
|
* WikiExtractor.py (__garbage_namespaces): added "file" namespace to
|
|
remove list.
|
|
|
|
2009-07-10 Antonio Fuschetto <fuschett@di.unipi.it>
|
|
|
|
* WikiExtractor.py (get_wiki_document_url): changed the handling of
|
|
URL prefix (anchors don't use prefix but a relative URLs).
|
|
|
|
2009-06-26 Antonio Fuschetto <fuschett@di.unipi.it>
|
|
|
|
* WikiExtractor.py (extract_document): changed the handling of
|
|
wikilinks, adding an anchor tag for each link with a reference to the
|
|
Wikipedia document.
|
|
|
|
* WikiExtractor.py (WikiExtractor): changed the handling of
|
|
placeholders: from "[Formula 12]" to "formula_12".
|
|
|
|
2009-04-06 Antonio Fuschetto <fuschett@di.unipi.it>
|
|
|
|
* WikiExtractor.py (init): fixed bugs in apostrophe_bold_pattern and
|
|
apostrophe_italic_pattern.
|
|
|
|
* WikiExtractor.py (compact): drop lines ending with ':'
|
|
(these are sentences preceding list items); fixed some bugs.
|
|
|
|
* WikiExtractor.py: released version 1.1.
|
|
|
|
2009-03-12 Antonio Fuschetto <fuschett@di.unipi.it>
|
|
|
|
* wiki-extractor.py (main): removed the sentence splitting option.
|
|
|
|
* wiki-extractor.py: fixed some bugs; released version 1.0; changed
|
|
filename to "Wiki-Extractor.py" according to Tanl module names.
|
|
|
|
2009-03-01 Antonio Fuschetto <fuschett@di.unipi.it>
|
|
|
|
* wiki-extractor.py (main): added cross platform path management.
|
|
|
|
2008-12-12 Antonio Fuschetto <fuschett@di.unipi.it>
|
|
|
|
* wiki-extractor.py: fixed a wrong cleaning of apostrophes prior
|
|
italic and bold text.
|
|
|
|
2008-10-27 Antonio Fuschetto <fuschett@di.unipi.it>
|
|
|
|
* wiki-extractor.py: script complete rewriting (ver 0.8).
|
|
|
|
2008-10-27 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* wiki-extractor.py: added CopyLeft.
|
|
|
|
2008-07-20 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* wiki-extractor.py (main): renamed option gzip to bzip.
|
|
|
|
* wiki-extractor.py (Document.__str__): removed global variables.
|
|
|
|
* wiki-extractor.py (Document): turned split_sentences, clean_document,
|
|
print_document into methods.
|
|
|
|
2008-07-15 Antonio Fuschetto <fuschett@di.unipi.it>
|
|
|
|
* wiki-extractor.py: changed object serialization using standard
|
|
pickle module.
|
|
|
|
2008-06-28 Antonio Fuschetto <fuschett@di.unipi.it>
|
|
|
|
* wiki-extractor.py: added the management of italics with a bad format.
|
|
|
|
2008-06-27 Antonio Fuschetto <fuschett@di.unipi.it>
|
|
|
|
* wiki-extractor.py: fixed a wrong use of conversion between conding;
|
|
added the management of wikilink with a bad format; added the
|
|
menagement of unicode character (numeric entity); added the management
|
|
of italics like quoted text.
|
|
|
|
2008-06-26 Giuseppe Attardi <attardi@di.unipi.it>
|
|
|
|
* wiki-extractor.py (main): turned global variables _infile and
|
|
_outfile into locals.
|