Update README.md

Rugantio Costa 2019-02-05 04:49:57 +01:00 committed by GitHub
parent bdeae9f4b5
commit 98642cffd8


@@ -140,7 +140,7 @@ By design scrapy is **asynchronous**, it will not return time ordered rows, you
While the crawl is running you can check in the console that the spiders are working correctly; to show more information, change the last line of `settings.py` to `LOG_LEVEL = 'DEBUG'`. At the end of the process, if everything has gone right, the result can be visualized as a table.
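For reference, a minimal sketch of that last line of `settings.py` (the value currently set in the project may differ):

```python
# settings.py -- last line; raise verbosity while debugging the spiders.
# Switch back to a quieter level (e.g. 'INFO') for normal runs.
LOG_LEVEL = 'DEBUG'
```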
The "-o " option states that result is to be saved in a .csv file (comma separated values), similar to a txt file that can be interpreted as a table. Fbcrawl can also save to JSON easily, but this feature is not implemented.
- Keep in mind that the default behavior is to append the crawled items at the bottom of an already existing file rather than overwriting it, so you might want to prefix your scrapy command with something like `rm OLDTABLE.csv;`. There are many other ways of exporting; check out the [exporter reference](https://doc.scrapy.org/en/latest/topics/exporters.html) if you want to know more.
+ Keep in mind that the default behavior is to append the crawled items at the bottom of an already existing file rather than overwriting it, so you might want to prefix your scrapy command with something like `rm OLDTABLE.csv; scrapy crawl fb etc.`. There are many other ways of exporting; check out the [exporter reference](https://doc.scrapy.org/en/latest/topics/exporters.html) if you want to know more.
More information regarding Scrapy's [Deployment](https://doc.scrapy.org/en/latest/topics/deploy.html) and [Common Practices](https://doc.scrapy.org/en/latest/topics/practices.html) is available in the official documentation.
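The same workflow can also be driven from a short Python script using Scrapy's `CrawlerProcess` (the approach described in the Common Practices page linked above). The sketch below deletes the previous dump before crawling, so items are not appended below stale data; the keyword arguments (`email`, `password`, `page`) mirror the `-a` options used elsewhere in this README and are assumptions to adjust against the spider's actual signature.

```python
import os

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

OUTPUT = 'DUMPFILE.csv'

# Remove any previous dump so the new items are not appended below old ones.
if os.path.exists(OUTPUT):
    os.remove(OUTPUT)

settings = get_project_settings()
settings.set('FEED_FORMAT', 'csv')  # 'json' would work just as well
settings.set('FEED_URI', OUTPUT)    # newer Scrapy versions use the FEEDS setting instead

process = CrawlerProcess(settings)
# 'fb' is the spider name used in this README; the keyword arguments play the
# role of the -a options on the command line (names assumed, check the spider).
process.crawl('fb', email='EMAIL', password='PASSWORD', page='PAGENAME')
process.start()  # blocks until the crawl is finished
```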
@@ -164,15 +164,14 @@ Make sure that the `page` option is a proper post link, that begins with the page
~~Crawling starts from the beginning of 2017; it needs to go back to 2006:~~
* ~~write appropriate recursive functions in parse_page~~
- Retrieve CSV timely ordered:
- * ~~Implement synchronous crawling to~~
+ ~~Retrieve CSV timely ordered:~~
+ * ~~Implement synchronous crawling~~
~~Comments and commentators are not parsed:~~
* ~~write a spider that crawls all the comments from a given post~~
* scrape reactions from comments
* add features representing connections between commentators (-> reply-to, <- replied-to)
The number of shares is not retrieved, since it is not available on `mbasic.facebook.com`. The number-of-comments field also counts only direct comments, not replies, because that is how mbasic works. To fix both of these issues:
* extract the URL of the post and use m.facebook.com to retrieve this data, as sketched below
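As a rough illustration of that last point, a hypothetical helper that rewrites an mbasic post link into its `m.facebook.com` counterpart, from which the share count and full comment count would then have to be scraped (the URL shown is a made-up example):

```python
def to_mobile_url(mbasic_url: str) -> str:
    """Map an mbasic.facebook.com post link to the m.facebook.com version."""
    return mbasic_url.replace('mbasic.facebook.com', 'm.facebook.com', 1)

# Hypothetical example:
# to_mobile_url('https://mbasic.facebook.com/story.php?story_fbid=123&id=456')
# -> 'https://m.facebook.com/story.php?story_fbid=123&id=456'
```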