Update README.md

Rugantio Costa 2018-12-27 02:20:46 +01:00 committed by GitHub
parent e95c70c844
commit 0c0f3129cd


@@ -2,7 +2,7 @@
Fbcrawl is an advanced crawler for Facebook, written in Python, based on the [Scrapy](https://scrapy.org/) framework.
## DONATIONS
Fbcrawl is free software. It is not "free as beer" nor "free as speech"; it is "free as a toilet": it is always available and working, but someone has to keep it clean and tidy, and at the moment I am the only one, since this is not a community project. Please consider making a donation: it will keep this project alive, and if I see actual interest from people I will get on with the [TODO](https://github.com/rugantio/fbcrawl/blob/master/README.md#TODO) list. One of my long-term goals is to refactor the framework with a GUI, database connections, and graph visualizations. These tasks would take at least a couple of months of work, and I will only be able to afford them with your support! Thank you :)
[![paypal](https://www.paypalobjects.com/en_US/IT/i/btn/btn_donateCC_LG.gif)](https://www.paypal.com/cgi-bin/webscr?cmd=_donations&business=G96T8U8W7UZDL&currency_code=EUR&source=url)
@@ -149,22 +149,37 @@ scrapy crawl comments -a email="EMAILTOLOGIN" -a password="PASSWORDTOLOGIN" -a p
Make sure that the `page` option is a proper post link, one that begins with the page name and is accessible from mbasic.facebook.com.
# TODO
~~Comments and commentators are naively parsed~~
* ~~write a spyder that crawls all the metadata possible~~
The crawler only works in Italian:
* add English interface support
The number of shares is not retrieved, since it is not available on `mbasic.facebook.com`. The number-of-comments field also counts only direct comments, not replies, because that is how mbasic works. To fix both of these issues:
* extract URL of post and use m.facebook.com to retrieve these data
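The host swap suggested above could be sketched as follows; the function name and behavior are illustrative, not part of the crawler:

```python
from urllib.parse import urlparse, urlunparse

def to_mobile_url(post_url):
    """Rewrite an mbasic.facebook.com post URL to its m.facebook.com
    equivalent, where share and reply counts would be retrievable."""
    parts = urlparse(post_url)
    if parts.netloc == "mbasic.facebook.com":
        parts = parts._replace(netloc="m.facebook.com")
    return urlunparse(parts)
```

A spider could then issue a second request to the rewritten URL and parse the extra counts there.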
Improve feature extraction. Not all metadata are crawled:
* scrape reactions from comments
Some other interesting features can be derived. Comments and commentators can be related better to post and page:
* count comments from same commentator under a post
* add features representing connections between commentators (-> reply-to, <- replied-to)
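Both derived features above could be computed post-crawl from a flat list of comment records; the field names (`author`, `reply_to`) are assumptions, not the crawler's actual schema:

```python
from collections import Counter

def commentator_stats(comments):
    """Given comment dicts with an 'author' and an optional 'reply_to'
    field, return per-author comment counts and (replier, replied-to)
    edges linking commentators."""
    counts = Counter(c["author"] for c in comments)
    edges = [(c["author"], c["reply_to"]) for c in comments if c.get("reply_to")]
    return counts, edges
```

The edge list is exactly the "-> reply-to, <- replied-to" relation, ready to feed into a graph library later.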
Crawling starts from the beginning of 2017; it needs to go back until 2004:
* write appropriate recursive functions in parse_page
* set two parameters at runtime (**from** and **until**) in \__init__
* memorize the datetime in a local variable in the parsing method and check that the datetime of the post respects the period, otherwise stop crawling
* this is faster than using the pipeline but might not be as accurate, so change pipelines.py and settings.py accordingly
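A minimal sketch of the runtime period check described above (since `from` is a reserved word in Python, a spider would receive it via keyword arguments; names here are illustrative):

```python
from datetime import datetime

def in_period(post_date, date_from, date_until):
    """Return True if post_date falls inside [date_from, date_until].
    The parsing method would stop following 'older posts' links as soon
    as a post's datetime drops below date_from."""
    return date_from <= post_date <= date_until

# Hypothetical runtime parameters, e.g. from
# scrapy crawl fb -a from="2004-01-01" -a until="2017-01-01"
date_from = datetime(2004, 1, 1)
date_until = datetime(2017, 1, 1)
```

Checking inside the parsing method, as the list notes, avoids fetching pages that would only be discarded later by the pipeline.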
Better handling of data:
* merge comments and posts and use JSON instead of CSV
* add a PostgreSQL pipeline for simple CSV
* add a MongoDB pipeline for more complex JSON
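A Scrapy item pipeline for the merged JSON output could look like the sketch below; a pipeline is just a class with a `process_item` method, so no Scrapy import is required (class and file names are illustrative):

```python
import json

class JsonLinesPipeline:
    """Write each crawled item as one JSON object per line, so posts
    and comments land in a single stream instead of separate CSVs."""

    def open_spider(self, spider):
        self.file = open("items.jl", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # dict(item) also works for Scrapy Item objects
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()
```

Enabling it would be a one-line entry under `ITEM_PIPELINES` in settings.py.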
Integrate data visualization:
* display reactions and other features (comments, shares etc.) as timeseries
* add networkx or graph-tools support to display connections (features as links) between posts and people (nodes)
* integrate Gephi or save out to a Gephi-compatible format
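Gephi can import a plain CSV edge list via its spreadsheet importer, so a first step toward the integration above could be exporting the commentator connections as edges (the sample data is illustrative):

```python
import csv

def export_edges(edges, path="edges.csv"):
    """Write (source, target) pairs as a CSV edge list with the
    Source/Target header that Gephi's spreadsheet import expects."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Source", "Target"])
        writer.writerows(edges)
```

The same edge list would also load directly into networkx for in-script analysis.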
The script is not very user friendly:
* create a GUI using PyQt