Update README.md
parent e95c70c844
commit 0c0f3129cd

README.md
Fbcrawl is an advanced crawler for Facebook, written in Python and based on the [Scrapy](https://scrapy.org/) framework.
## DONATIONS
Fbcrawl is free software. It is not "free as in beer" nor "free as in speech", it is "free as in toilet": it is always available and working, but someone has to keep it clean and tidy, and at the moment I am the only one; it is not a community project. Please consider making a donation: it will keep this project alive, and if I see actual interest from people I will get on with the [TODO](https://github.com/rugantio/fbcrawl/blob/master/README.md#TODO) list. One of my long-term goals is to refactor the framework with a GUI, database connections and graph visualizations. These tasks would take at least a couple of months of work, and I will only be able to afford them with your support! Thank you :)
[![paypal](https://www.paypalobjects.com/en_US/IT/i/btn/btn_donateCC_LG.gif)](https://www.paypal.com/cgi-bin/webscr?cmd=_donations&business=G96T8U8W7UZDL&currency_code=EUR&source=url)
@@ -149,22 +149,37 @@ scrapy crawl comments -a email="EMAILTOLOGIN" -a password="PASSWORDTOLOGIN" -a p
Make sure that the `page` option is a proper post link, one that begins with the page name and is accessible from mbasic.facebook.com.
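For example, a full invocation might look like `scrapy crawl comments -a email="EMAILTOLOGIN" -a password="PASSWORDTOLOGIN" -a page="https://mbasic.facebook.com/story.php?story_fbid=1234567890&id=0987654321"`, where the story URL is a made-up placeholder to be replaced with a real post link.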
# TODO

~~Comments and commentators are naively parsed~~

* ~~write a spider that crawls all the metadata possible~~
The crawler only works in Italian:

* add english interface support (a possible approach is sketched below)
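As a rough sketch of that support, the interface strings the parser matches on could be keyed by locale, so adding English means adding table entries instead of new parsing logic. The labels below are illustrative guesses, not the strings Facebook actually serves:

```python
# A sketch, not fbcrawl's actual code: per-language table of the interface
# strings the parser matches on. These labels are illustrative guesses.
UI_STRINGS = {
    'it': {'more_posts': 'Altro', 'shares': 'condivisioni'},
    'en': {'more_posts': 'More', 'shares': 'shares'},
}

def label(lang, key):
    # fall back to Italian, the only interface the crawler handles today
    return UI_STRINGS.get(lang, UI_STRINGS['it'])[key]
```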
The number of shares is not retrieved; it is not available on `mbasic.facebook.com`. Also, the number-of-comments field only counts direct comments, not replies, because that is how mbasic works. To fix both of these issues:

* extract the URL of the post and use m.facebook.com to retrieve these data (see the sketch below)
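A minimal sketch of the idea, assuming the story URL stays valid across both hosts and using a hypothetical selector for the share count:

```python
import scrapy

class SharesSketchSpider(scrapy.Spider):
    """Sketch, not fbcrawl's actual code: re-request the post on
    m.facebook.com, where share counts are exposed, instead of mbasic."""
    name = 'shares_sketch'

    def parse_post(self, response):
        # the same story URL works on both hosts; swap the host to get
        # the m.facebook.com rendering
        m_url = response.url.replace('mbasic.facebook.com', 'm.facebook.com')
        yield scrapy.Request(m_url, callback=self.parse_shares)

    def parse_shares(self, response):
        # hypothetical selector: adjust to whatever markup
        # m.facebook.com actually serves for the share count
        shares = response.xpath(
            '//a[contains(@href, "/shares/")]/text()').get()
        yield {'url': response.url, 'shares': shares}
```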
Improve feature extraction. Not all metadata are crawled:

* scrape reactions from comments
Some other interesting features can be derived. Comments and commentators can be related better to posts and pages:

* count comments from the same commentator under a post
* add features representing connections between commentators (-> reply-to, <- replied-to); a pandas sketch follows this list
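A sketch of how these features could be derived offline with pandas; the column names (`url` for the post, `commentator`, `reply_to`) are assumptions about the CSV export:

```python
import pandas as pd

# Sketch: derive commentator features from the crawler's comments CSV.
df = pd.read_csv('comments.csv')

# comments from the same commentator under the same post
counts = df.groupby(['url', 'commentator']).size().rename('n_comments')
print(counts.sort_values(ascending=False).head(10))

# directed reply pairs (commentator -> replied-to), where available
pairs = df.dropna(subset=['reply_to'])[['commentator', 'reply_to']]
print(pairs.value_counts().head(10))
```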
Crawling starts from the beginning of 2017; it needs to go back to 2004 (a sketch follows this list):

* write appropriate recursive functions in parse_page
* set two parameters at runtime (**from** and **until**) in \__init__
* memorize the datetime in a local variable in the parsing method and check that the datetime of each post respects the period, otherwise stop crawling
* this is faster than using the pipeline but might not be as accurate, so change pipelines.py and settings.py accordingly
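A minimal sketch of those items together, with hypothetical selectors and parameter names (Scrapy passes `-a` arguments to the spider's `__init__`; since `from` is a Python keyword, it is renamed here):

```python
import datetime
import scrapy

class PageSketchSpider(scrapy.Spider):
    """Sketch of runtime date bounds; names are illustrative, not
    fbcrawl's actual API."""
    name = 'page_sketch'

    def __init__(self, date_from='2004-01-01', until='2017-01-01', **kwargs):
        super().__init__(**kwargs)
        self.date_from = datetime.date.fromisoformat(date_from)
        self.until = datetime.date.fromisoformat(until)

    def parse_page(self, response):
        post_date = self.extract_date(response)
        if post_date < self.date_from:
            return  # older than the requested window: stop recursing
        if post_date <= self.until:
            yield {'date': post_date.isoformat()}
        # follow the link to older posts (selector is hypothetical)
        older = response.xpath('//a[contains(@href, "show_more")]/@href').get()
        if older is not None:
            yield response.follow(older, callback=self.parse_page)

    def extract_date(self, response):
        # hypothetical stand-in for real timestamp parsing
        raw = response.xpath('//abbr/text()').get() or '2017-01-01'
        return datetime.date.fromisoformat(raw[:10])
```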
Better handling of data (a minimal pipeline sketch follows this list):

* merge comments and posts and use JSON instead of CSV
* add a postgresql pipeline for simple CSV
* add a mongodb pipeline for more complex JSON
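A sketch of what the postgresql pipeline could look like; the connection details and `posts` table are assumptions, and a mongodb pipeline would have the same shape with pymongo:

```python
import psycopg2

class PostgresPipeline:
    """Sketch of a PostgreSQL item pipeline; enable it in settings.py
    via ITEM_PIPELINES. Table and field names are assumptions."""

    def open_spider(self, spider):
        self.conn = psycopg2.connect(dbname='fbcrawl', user='fbcrawl')
        self.cur = self.conn.cursor()

    def process_item(self, item, spider):
        self.cur.execute(
            'INSERT INTO posts (source, date, text) VALUES (%s, %s, %s)',
            (item.get('source'), item.get('date'), item.get('text')),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()
```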
Integrate data visualization (sketched after this list):

* display reactions and other features (comments, shares etc.) as timeseries
* add networkx or graph-tool support to display connections (features as links) between posts and people (nodes)
* integrate gephi or save out to gephi
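A sketch of both ideas with pandas and networkx, again assuming `date`, `commentator` and `reply_to` columns in the CSV export; the GEXF file at the end opens directly in Gephi:

```python
import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd

# Sketch: weekly timeseries of comment volume plus a reply graph
# exported for Gephi. Column names are assumptions about the export.
df = pd.read_csv('comments.csv', parse_dates=['date'])

df.set_index('date').resample('W').size().plot(title='comments per week')
plt.savefig('comments_per_week.png')

g = nx.DiGraph()
for _, row in df.dropna(subset=['reply_to']).iterrows():
    g.add_edge(row['commentator'], row['reply_to'])  # reply-to edge
nx.write_gexf(g, 'comments.gexf')  # GEXF opens directly in Gephi
```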
The script is not very user friendly:

* create a gui using pyqt