add readme

This commit is contained in:
Zhiwei Chen 2017-04-27 20:15:17 -04:00
parent 9cf2a2a883
commit 8e92f464cf

View File

@ -37,6 +37,7 @@ Each file will contains several documents in this [document format](http://media
[-l] [-s] [--lists] [-ns ns1,ns2]
[--templates TEMPLATES] [--no-templates] [-r]
[--min_text_length MIN_TEXT_LENGTH]
[--filter_category path_of_categories_file]
[--filter_disambig_pages] [-it abbr,b,big]
[-de gallery,timeline,noinclude] [--keep_tables]
[--processes PROCESSES] [-q] [--debug] [-a] [-v]
@ -91,6 +92,18 @@ Each file will contains several documents in this [document format](http://media
--min_text_length MIN_TEXT_LENGTH
Minimum expanded text length required to write
document (default=0)
--filter_category path_of_categories_file
Include or exclude specific categories from the dataset. Specify the categories in
file 'path_of_categories_file'. Format:
One category one line, and if the line starts with:
1) #: Comments, ignored;
2) ^: the categories will be in excluding-categories
3) others: the categories will be in including-categories.
Priority:
1) If excluding-categories is not empty, and any category of a page exists in excluding-categories, the page will be excluded; else
2) If including-categories is not empty, and no category of a page exists in including-categories, the page will be excluded; else
3) the page will be included
--filter_disambig_pages
Remove pages from output that contain disabmiguation
markup (default=False)