From 0c0f3129cd66bb8c5b57d7214612295bb4324ae7 Mon Sep 17 00:00:00 2001
From: Rugantio Costa
Date: Thu, 27 Dec 2018 02:20:46 +0100
Subject: [PATCH] Update README.md

---
 README.md | 31 +++++++++++++++++++++++--------
 1 file changed, 23 insertions(+), 8 deletions(-)

diff --git a/README.md b/README.md
index b1051b1..1bc0e8f 100644
--- a/README.md
+++ b/README.md
@@ -2,7 +2,7 @@ Fbcrawl is an advanced crawler for Facebook, written in python, based on the [Scrapy](https://scrapy.org/) framework.
 
 ## DONATIONS
-Fbcrawl is free software. It is not "free as beer" nor "free as speech", it is "free as a toilet": it is always available and working, but someone as to keep it clean and tidy, and I am the only one at the moment, it is not a community project. Please consider make a donation, it will keep this project alive and if I see actual interest from people I will get on with the TODO list. One of the my long-term goal is to refactor the framework with a gui, connections to databases and graph vizualitations. These tasks would take at least a couple of months of work, and I will be able to afford them only with your support! Thank you :)
+Fbcrawl is free software. It is not "free as in beer" nor "free as in speech", it is "free as a toilet": it is always available and working, but someone has to keep it clean and tidy, and at the moment I am the only one; it is not a community project. Please consider making a donation: it will keep this project alive, and if I see actual interest from people I will get on with the [TODO](https://github.com/rugantio/fbcrawl/blob/master/README.md#TODO) list. One of my long-term goals is to refactor the framework with a GUI, database connections and graph visualizations. These tasks would take at least a couple of months of work, and I will be able to afford them only with your support! Thank you :)
 
 [![paypal](https://www.paypalobjects.com/en_US/IT/i/btn/btn_donateCC_LG.gif)](https://www.paypal.com/cgi-bin/webscr?cmd=_donations&business=G96T8U8W7UZDL&currency_code=EUR&source=url)
 
@@ -149,22 +149,37 @@ scrapy crawl comments -a email="EMAILTOLOGIN" -a password="PASSWORDTOLOGIN" -a p
 Make sure that the `page` option is a proper post link, that begins with the pagename and is accessible from mbasic.facebook.com.
 
 # TODO
+~~Comments and commentators are naively parsed~~
+* ~~write a spider that crawls all the metadata possible~~
 
-Number of comments is wrong, it only counts direct comments and not reply comments, because that's how `mbasic.facebook.com` works. Also the number of shares is not retrieved. To fix both of these issues:
+The crawler only works in Italian:
+* add English interface support
 
+The number of shares is not retrieved, because it is not exposed by `mbasic.facebook.com`. Also, the comments count only includes direct comments and not replies, because that is how mbasic works. To fix both of these issues:
 * extract URL of post and use m.facebook.com to retrieve these data
 
-At the moment crawling starts from the beginning of 2017, it needs to go back until 2004:
+Improve feature extraction, since not all metadata are crawled yet:
+* scrape reactions from comments
+
+Some other interesting features can be derived. Comments and commentators can be related more closely to the post and the page:
+* count comments from the same commentator under a post
+* add features representing connections between commentators (-> reply-to, <- replied-to)
+
+Crawling currently starts from the beginning of 2017; it needs to go back to 2004:
 * write appropriate recursive functions in parse_page
 * set two parameters at runtime (**from** and **until**) in \__init__
 * memorize the datetime in a local variable in the parsing method and check that the datetime of the post respects the period, otherwise stop crawling
 * this is faster than using the pipeline but might not be as accurate, so change pipelines.py and settings.py accordingly
 
-The crawler works only in italian:
+Better handling of data:
+* merge comments and posts and use JSON instead of CSV
+* add a PostgreSQL pipeline for the simple CSV
+* add a MongoDB pipeline for the more complex JSON
 
-* add english interface support
+Integrate data visualization:
+* display reactions and other features (comments, shares etc.) as time series
+* add networkx or graph-tool support to display connections (features as links) between posts and people (nodes)
+* integrate Gephi or export to a Gephi-readable format
 
-Comments and commentators are naively parsed:
-
-* write a spyder that crawls all the metadata possible
+The script is not very user-friendly:
+* create a GUI using PyQt
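
Below are a few illustrative sketches of how some of the TODO items in the patched README might be approached. None of this is part of fbcrawl; spider names, selectors, field names and settings in the sketches are assumptions.

For the shares / full comment counts item: mbasic does not expose shares, so one option is to follow the same post on m.facebook.com and finish filling the item there. A minimal Scrapy sketch, with placeholder callbacks and XPath:

```python
import scrapy


class PostStatsSketch(scrapy.Spider):
    """Sketch: follow a post from mbasic to m.facebook.com to read the missing counts."""
    name = 'post_stats_sketch'

    def parse_post(self, response):
        # mbasic.facebook.com does not expose shares and flattens reply comments,
        # so re-request the same post on the richer mobile site.
        item = response.meta['item']
        full_url = response.url.replace('mbasic.facebook.com', 'm.facebook.com')
        yield scrapy.Request(full_url, callback=self.parse_full_counts, meta={'item': item})

    def parse_full_counts(self, response):
        item = response.meta['item']
        # Placeholder XPath -- the real m.facebook.com markup has to be inspected.
        item['shares'] = response.xpath('//span[contains(text(), "share")]/text()').get()
        yield item
```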
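The date-bounded crawl described in the TODO (recursive parse_page, **from**/**until** set at runtime in \__init__, and a datetime check that stops the crawl once posts fall outside the period) could be prototyped as below. `from` is a reserved word in Python, so the sketch takes `date_from`/`date_until`; the selectors and the date parsing are placeholders.

```python
import datetime
import scrapy


class DateBoundedSketch(scrapy.Spider):
    """Sketch: bound the crawl period inside the spider instead of a pipeline."""
    name = 'fb_dates_sketch'

    def __init__(self, date_from='2004-02-04', date_until='2018-12-31', *args, **kwargs):
        # e.g. scrapy crawl fb_dates_sketch -a date_from="2017-01-01" -a date_until="2017-12-31"
        super().__init__(*args, **kwargs)
        self.date_from = datetime.datetime.strptime(date_from, '%Y-%m-%d')
        self.date_until = datetime.datetime.strptime(date_until, '%Y-%m-%d')

    def parse_page(self, response):
        for post in response.xpath('//article'):            # placeholder selector
            post_date = self.extract_date(post)
            if post_date > self.date_until:
                continue                                     # newer than the window, keep scanning
            if post_date < self.date_from:
                return                                       # older than the window, stop crawling
            yield {'date': post_date.isoformat(),
                   'url': response.urljoin(post.xpath('.//a/@href').get() or '')}
        older = response.xpath('//a[contains(text(), "more")]/@href').get()
        if older:
            yield response.follow(older, callback=self.parse_page)  # recurse on the next page

    def extract_date(self, post):
        # Placeholder: fbcrawl would need to parse Facebook's localized date strings here.
        raw = post.xpath('.//abbr/text()').get() or ''
        return datetime.datetime.strptime(raw, '%d %B %Y')
```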
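For the PostgreSQL/MongoDB pipeline items, Scrapy's standard item-pipeline hooks are sufficient. The sketch below is a conventional MongoDB pipeline for pipelines.py; the `MONGO_URI` and `MONGO_DATABASE` setting names are chosen here for illustration and would be defined in settings.py.

```python
# pipelines.py -- sketch of a MongoDB pipeline (requires pymongo)
import pymongo


class MongoPipelineSketch:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Setting names are assumptions; define them in settings.py.
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI', 'mongodb://localhost:27017'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'fbcrawl'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # One JSON-like document per crawled post or comment,
        # stored in a collection named after the spider.
        self.db[spider.name].insert_one(dict(item))
        return item
```

It would then be enabled through `ITEM_PIPELINES` in settings.py, e.g. `ITEM_PIPELINES = {'fbcrawl.pipelines.MongoPipelineSketch': 300}` (module path assumed).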
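For the visualization and Gephi items, one possible first step is to turn a crawled CSV into a graph and export it in a format Gephi can open. The column names below (`source`, `post_id`) are assumptions about the exported fields, not fbcrawl's actual schema.

```python
# Sketch: commentators and posts as nodes, "commented on" as edges (requires networkx, pandas)
import networkx as nx
import pandas as pd

comments = pd.read_csv('comments.csv')        # output of the comments spider
g = nx.DiGraph()
for _, row in comments.iterrows():
    g.add_node(row['source'], kind='commentator')
    g.add_node(row['post_id'], kind='post')
    g.add_edge(row['source'], row['post_id'])

nx.write_gexf(g, 'fbcrawl.gexf')              # GEXF files open directly in Gephi
```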