Merge branch 'master' of https://github.com/rugantio/fbcrawl
commit 96d3423b8d

@@ -140,7 +140,7 @@ By design scrapy is **asynchronous**, it will not return time ordered rows, you

While the crawling occurs you can investigate the correct working of the spiders in the console; to show more information, change the last line of settings.py to `LOG_LEVEL = 'DEBUG'`. At the end of the process, if everything has been done right, the result can be visualized in a table.
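
This is a minimal sketch of that one-line change (only the `LOG_LEVEL` key is named by this README; the comments are assumptions about sensible defaults):

```python
# fbcrawl/settings.py -- last line of the file
# 'DEBUG' logs every request and parsed item while the spiders run;
# switch back to a quieter level such as 'INFO' for normal crawls.
LOG_LEVEL = 'DEBUG'
```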

The `-o` option states that the result is to be saved in a .csv file (comma separated values), similar to a txt file that can be interpreted as a table. Fbcrawl could also save to JSON easily, but this feature is not implemented.

Keep in mind that the default behavior is to append the crawled items at the bottom of an already existing file, not to overwrite it, so you might want to prefix your scrapy command with something like `rm OLDTABLE.csv; scrapy crawl fb etc.`. There are many other ways of exporting; check out the [exporter reference](https://doc.scrapy.org/en/latest/topics/exporters.html) if you want to know more.
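
A minimal sketch of that pattern (the file name is a placeholder and the spider's `-a` arguments are omitted for brevity; note that Scrapy 2.4 and later also accept a capital `-O` flag that overwrites the output file in one step):

```sh
# -o appends to an existing file, so delete the stale table first.
rm -f TABLE.csv; scrapy crawl fb -o TABLE.csv

# On Scrapy >= 2.4 the overwrite flag makes the rm unnecessary:
scrapy crawl fb -O TABLE.csv
```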

More information regarding Scrapy's [Deployment](https://doc.scrapy.org/en/latest/topics/deploy.html) and [Common Practices](https://doc.scrapy.org/en/latest/topics/practices.html) is available in the official documentation.

@@ -164,15 +164,14 @@ Make sure that the `page` option is a proper post link, that begins with the pag

~~Crawling starts from the beginning of 2017; it needs to go back to 2006:~~
* ~~write appropriate recursive functions in parse_page~~

~~Retrieve the CSV in chronological order:~~
* ~~Implement synchronous crawling~~

~~Comments and commentators are not parsed:~~
* ~~write a spider that crawls all the comments from a given post~~
* scrape reactions from comments
* add features representing connections between commentators (-> reply-to, <- replied-to)

The number of shares is not retrieved because it is not available on `mbasic.facebook.com`. Also, the number-of-comments field only counts direct comments, not replies, because that is how mbasic works. To fix both of these issues:
* extract the URL of the post and use m.facebook.com to retrieve these data (a hedged sketch follows)
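
Everything in the sketch below is hypothetical: the spider name, the URL rewrite and the selector only illustrate the plan above and are not code from this repository:

```python
import scrapy

class SharesSpider(scrapy.Spider):
    """Hypothetical helper: re-request a post on m.facebook.com,
    where the share count is exposed, instead of mbasic.facebook.com."""
    name = 'shares'

    def parse(self, response):
        # Rewrite the mbasic URL of the post to its m.facebook.com twin.
        m_url = response.url.replace('mbasic.facebook.com', 'm.facebook.com')
        yield response.follow(m_url, callback=self.parse_share_count)

    def parse_share_count(self, response):
        # Placeholder XPath; the real m.facebook.com markup must be inspected.
        shares = response.xpath('//a[contains(@href, "/shares/")]/text()').get()
        yield {'url': response.url, 'shares': shares}
```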