From 8e92f464cf3bddb8e752ad4283c7277817ffdeb6 Mon Sep 17 00:00:00 2001
From: Zhiwei Chen
Date: Thu, 27 Apr 2017 20:15:17 -0400
Subject: [PATCH] add readme

---
 README.md | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/README.md b/README.md
index 4e8cc9e..c93b7dc 100644
--- a/README.md
+++ b/README.md
@@ -37,6 +37,7 @@ Each file will contains several documents in this [document format](http://media
                         [-l] [-s] [--lists] [-ns ns1,ns2]
                         [--templates TEMPLATES] [--no-templates] [-r]
                         [--min_text_length MIN_TEXT_LENGTH]
+                        [--filter_category path_of_categories_file]
                         [--filter_disambig_pages] [-it abbr,b,big]
                         [-de gallery,timeline,noinclude] [--keep_tables]
                         [--processes PROCESSES] [-q] [--debug] [-a] [-v]
@@ -91,6 +92,18 @@ Each file will contains several documents in this [document format](http://media
   --min_text_length MIN_TEXT_LENGTH
                         Minimum expanded text length required to write
                         document (default=0)
+  --filter_category path_of_categories_file
+                        Include or exclude specific categories from the dataset. Specify the categories in
+                        file 'path_of_categories_file'. Format:
+                        One category one line, and if the line starts with:
+                            1) #: Comments, ignored;
+                            2) ^: the categories will be in excluding-categories
+                            3) others: the categories will be in including-categories.
+                        Priority:
+                            1) If excluding-categories is not empty, and any category of a page exists in excluding-categories, the page will be excluded; else
+                            2) If including-categories is not empty, and no category of a page exists in including-categories, the page will be excluded; else
+                            3) the page will be included
+
   --filter_disambig_pages
                         Remove pages from output that contain disabmiguation
                         markup (default=False)
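
The file format and priority rules described in the `--filter_category` help text can be illustrated with a minimal Python sketch. It is not the code from this patch: the function names `parse_category_filter` and `keep_page` are hypothetical, and the sketch assumes that blank lines are skipped, that surrounding whitespace is stripped, and that category names are compared by exact, case-sensitive match. None of these details is stated in the help text.

```python
def parse_category_filter(lines):
    """Split filter-file lines into (including, excluding) category sets.

    Hypothetical helper: skipping blank lines and stripping whitespace
    are assumptions, not behavior documented in the help text.
    """
    including, excluding = set(), set()
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # comment lines (and, by assumption, blank lines) are ignored
        if line.startswith("^"):
            name = line[1:].strip()
            if name:
                excluding.add(name)  # "^Name" goes to excluding-categories
        else:
            including.add(line)  # anything else goes to including-categories
    return including, excluding


def keep_page(page_categories, including, excluding):
    """Apply the three priority rules from the help text, in order."""
    cats = set(page_categories)
    # Rule 1: any match in a non-empty excluding set drops the page.
    if excluding and cats & excluding:
        return False
    # Rule 2: with a non-empty including set, a page needs at least one match.
    if including and not (cats & including):
        return False
    # Rule 3: otherwise the page is kept.
    return True


if __name__ == "__main__":
    including, excluding = parse_category_filter(
        ["# science only", "Physics", "Chemistry", "^Stubs"]
    )
    for cats in (["Physics"], ["Physics", "Stubs"], ["History"]):
        print(cats, keep_page(cats, including, excluding))
```

Because rule 1 is checked first, a page that belongs to both an included and an excluded category is dropped. With both sets empty, every page is kept.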