Update README.md

2018-12-27 02:20:46 +01:00 · 2018-12-27 02:20:46 +01:00 · 0c0f3129cd
commit 0c0f3129cd
parent e95c70c844
1 changed files with 23 additions and 8 deletions
--- a/README.md
+++ b/README.md
@@ -2,7 +2,7 @@
 Fbcrawl is an advanced crawler for Facebook, written in python, based on the [Scrapy](https://scrapy.org/) framework. 

 ## DONATIONS
-Fbcrawl is free software. It is not "free as beer" nor "free as speech", it is "free as a toilet": it is always available and working, but someone as to keep it clean and tidy, and I am the only one at the moment, it is not a community project. Please consider make a donation, it will keep this project alive and if I see actual interest from people I will get on with the TODO list. One of the my long-term goal is to refactor the framework with a gui, connections to databases and graph vizualitations. These tasks would take at least a couple of months of work, and I will be able to afford them only with your support! Thank you :)
+Fbcrawl is free software. It is not "free as beer" nor "free as speech", it is "free as a toilet": it is always available and working, but someone has to keep it clean and tidy, and I am the only one at the moment; it is not a community project. Please consider making a donation: it will keep this project alive, and if I see actual interest from people I will get on with the [TODO](https://github.com/rugantio/fbcrawl/blob/master/README.md#TODO) list. One of my long-term goals is to refactor the framework with a GUI, connections to databases and graph visualizations. These tasks would take at least a couple of months of work, and I will be able to afford them only with your support! Thank you :)

 [![paypal](https://www.paypalobjects.com/en_US/IT/i/btn/btn_donateCC_LG.gif)](https://www.paypal.com/cgi-bin/webscr?cmd=_donations&business=G96T8U8W7UZDL&currency_code=EUR&source=url)

@@ -149,22 +149,37 @@ scrapy crawl comments -a email="EMAILTOLOGIN" -a password="PASSWORDTOLOGIN" -a p
 Make sure that the `page` option is a proper post link, that begins with the pagename and is accessible from mbasic.facebook.com.

 # TODO
+~~Comments and commentators are naively parsed~~
+* ~~write a spider that crawls all the metadata possible~~

-Number of comments is wrong, it only counts direct comments and not reply comments, because that's how `mbasic.facebook.com` works. Also the number of shares is not retrieved. To fix both of these issues:
+The crawler only works in Italian:
+* add English interface support

+The number of shares is not retrieved because it is not available in `mbasic.facebook.com`. Also, the number of comments field only counts direct comments and not reply comments, because that's how mbasic works. To fix both of these issues:
 * extract URL of post and use m.facebook.com to retrieve these data

-At the moment crawling starts from the beginning of 2017, it needs to go back until 2004:
+Improve feature extraction. Not all metadata are crawled:
+* scrape reactions from comments

+Some other interesting features can be derived. Comments and commentators can be related more closely to the post and the page:
+* count comments from same commentator under a post
+* add features representing connections between commentators (-> reply-to, <- replied-to)
+
+Crawling starts from the beginning of 2017; it needs to go back to 2004:
 * write appropriate recursive functions in parse_page
 * set two parameters at runtime (**from** and **until**) in \__init__
 * memorize datetime in a local variable in the parsing method and check that the datetime of the post respects the period, otherwise stop crawling
 * this is faster than using the pipeline but might not be as accurate, so change pipelines.py and settings.py accordingly

-The crawler works only in italian:
+Better handling of data:
+* merge comments and posts and use JSON instead of CSV
+* add PostgreSQL pipeline for simple CSV
+* add MongoDB pipeline for more complex JSON

-* add english interface support
+Integrate data visualization:
+* display reactions and other features (comments, shares etc.) as timeseries
+* add networkx or graph-tool support to display connections (features as links) between posts and people (nodes)
+* integrate Gephi or export to Gephi

-Comments and commentators are naively parsed:
-
-* write a spyder that crawls all the metadata possible
+The script is not very user friendly:
+* create a GUI using PyQt
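
The **from**/**until** TODO item above (keep the post datetime in a local variable and stop crawling once it falls outside the period) could be sketched roughly as follows. This is only an illustration, not fbcrawl code: `parse_date` and `posts_in_period` are hypothetical names, posts are assumed to be dicts with an ISO-formatted `date` key, and the sketch assumes posts arrive newest first, which may not match what `parse_page` actually sees.

```python
from datetime import datetime
from typing import Dict, Iterable, Iterator


def parse_date(value: str) -> datetime:
    """Parse a YYYY-MM-DD string (assumed format, for illustration only)."""
    return datetime.strptime(value, "%Y-%m-%d")


def posts_in_period(
    posts: Iterable[Dict[str, str]],
    date_from: datetime,
    date_until: datetime,
) -> Iterator[Dict[str, str]]:
    """Yield posts dated within [date_from, date_until].

    Assumes posts are ordered newest first, so iteration can stop
    as soon as a post older than date_from is seen.
    """
    for post in posts:
        posted = parse_date(post["date"])  # local variable holding the post datetime
        if posted > date_until:
            continue  # too recent: skip it, keep going back in time
        if posted < date_from:
            break  # older than the period: stop crawling
        yield post


# Hypothetical sample data, newest first.
sample_posts = [
    {"date": "2018-12-01"},
    {"date": "2017-06-15"},
    {"date": "2016-03-02"},
    {"date": "2017-01-01"},  # never reached: iteration stops at the 2016 post
]
selected = list(
    posts_in_period(sample_posts, parse_date("2017-01-01"), parse_date("2017-12-31"))
)
```

Stopping at the first out-of-range post is what makes this faster than filtering in a pipeline, and also why it might be less accurate: a post that appears out of order after the cutoff is never seen.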