rugantio 2018-08-27 02:21:51 +02:00
parent 8359748b81
commit cf6313e4b1
6 changed files with 30 additions and 6 deletions


@@ -0,0 +1 @@
+,rugantio,alice,26.08.2018 17:29,file:///home/rugantio/.config/libreoffice/4;


@@ -128,4 +128,25 @@ scrapy crawl fb -a email="EMAILTOLOGIN" -a password="PASSWORDTOLOGIN" -a page="N
```
Keep in mind that the default behavior is to append newly crawled items to an already existing output file, not to overwrite it. There are many other ways of exporting; check out the [exporter reference](https://doc.scrapy.org/en/latest/topics/exporters.html) if you want to know more.
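For example, if you would rather start from a clean file on every run, a small pipeline built on Scrapy's `CsvItemExporter` can overwrite instead of append. This is a minimal sketch; the class name and output file name are arbitrary choices:
```python
from scrapy.exporters import CsvItemExporter

class OverwriteCsvPipeline(object):
    def open_spider(self, spider):
        self.file = open('fbcrawl_output.csv', 'wb')  # 'wb' truncates any old file
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()
```
To use it, register it in `ITEM_PIPELINES` in settings.py, alongside or instead of `FbcrawlPipeline`.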
More information regarding Scrapy's [Deployment](https://doc.scrapy.org/en/latest/topics/deploy.html) and [Common Practices](https://doc.scrapy.org/en/latest/topics/practices.html) is available in the official documentation.
# TODO
The number of comments is wrong: only direct comments are counted, not replies, because that's how `mbasic.facebook.com` works. The number of shares is also not retrieved. To fix both of these issues:
* extract the URL of each post and use m.facebook.com to retrieve this data (a sketch follows below)
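A hedged sketch of the idea: follow every post permalink on `m.facebook.com` and read the counters there. It assumes both methods live on the spider; the XPaths and the `parse_post_stats` name are illustrative assumptions, not verified selectors:
```python
def parse_page(self, response):
    for href in response.xpath('//a[contains(@href, "story.php")]/@href').extract():
        # hypothetical: resolve the permalink, then swap the host to m.facebook.com
        url = response.urljoin(href).replace('mbasic.facebook.com', 'm.facebook.com')
        yield response.follow(url, callback=self.parse_post_stats)

def parse_post_stats(self, response):
    # hypothetical selector: the share counter is rendered as text like "42 shares"
    shares_text = response.xpath(
        '//a[contains(@href, "shares")]/text()').extract_first(default='0')
    shares = int(''.join(ch for ch in shares_text if ch.isdigit()) or 0)
    yield {'url': response.url, 'shares': shares}
```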
At the moment crawling starts from the beginning of 2017; it needs to go back to 2004:
* write appropriate recursive functions in parse_page
* set two parameters at runtime (**from** and **until**) in \__init__
* store the datetime in a local variable in the parsing method and check that the datetime of each post respects the period, otherwise stop crawling (see the sketch after this list)
* this is faster than using the pipeline but might not be as accurate, so change pipelines.py and settings.py accordingly
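A minimal sketch of these ideas, assuming the spider already takes email/password/page as `-a` arguments; the **from**/**until** parameter names, the `@data-date` attribute and the "show more" selector are hypothetical:
```python
import scrapy
from datetime import datetime

class FbSpider(scrapy.Spider):
    name = 'fb'

    def __init__(self, email='', password='', page='', **kwargs):
        # 'from' is a reserved word in Python, so read both bounds out of kwargs
        self.date_from = datetime.strptime(kwargs.pop('from', '2004-02-04'), '%Y-%m-%d').date()
        self.date_until = datetime.strptime(kwargs.pop('until', '2018-03-04'), '%Y-%m-%d').date()
        super().__init__(**kwargs)
        self.email, self.password, self.page = email, password, page

    def parse_page(self, response):
        for post in response.xpath('//article'):
            raw = post.xpath('./@data-date').extract_first()  # hypothetical attribute
            if raw is None:
                continue
            post_date = datetime.strptime(raw, '%Y-%m-%d').date()
            if post_date < self.date_from:
                return  # posts come newest-first, so nothing older can follow
            if post_date <= self.date_until:
                yield {'date': post_date}  # ...plus the other item fields
        # recurse into the next page of older posts
        older = response.xpath('//a[contains(@href, "show_more")]/@href').extract_first()
        if older:
            yield response.follow(older, callback=self.parse_page)
```
Run it as `scrapy crawl fb -a email="..." -a password="..." -a page="..." -a from=2010-01-01 -a until=2012-12-31`.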
The crawler works only in Italian:
* add English interface support (see the sketch below)
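Month names in dates are the main locale-dependent piece; a minimal sketch of a lookup-table approach that covers both interfaces (the helper name is arbitrary):
```python
MONTHS = {
    # Italian interface
    'gennaio': 1, 'febbraio': 2, 'marzo': 3, 'aprile': 4, 'maggio': 5, 'giugno': 6,
    'luglio': 7, 'agosto': 8, 'settembre': 9, 'ottobre': 10, 'novembre': 11, 'dicembre': 12,
    # English interface
    'january': 1, 'february': 2, 'march': 3, 'april': 4, 'may': 5, 'june': 6,
    'july': 7, 'august': 8, 'september': 9, 'october': 10, 'november': 11, 'december': 12,
}

def month_number(name):
    """Map a month name from either interface language to its number."""
    return MONTHS[name.strip().lower()]
```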
Comments and commenters are ignored:
* write a spider that crawls all the possible metadata


@@ -10,7 +10,9 @@ from datetime import datetime
 class FbcrawlPipeline(object):
     def process_item(self, item, spider):
-        if item['date'] < datetime(2017,3,4).date():
-            raise DropItem("Dropping element because it's older than 04/03/2017")
+        if item['date'] < datetime(2017,1,1).date():
+            raise DropItem("Dropping element because it's older than 01/01/2017")
+        elif item['date'] > datetime(2018,3,4).date():
+            raise DropItem("Dropping element because it's newer than 04/03/2018")
         else:
             return item
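Once the period is passed to the spider at runtime (the **from**/**until** TODO above), the pipeline no longer needs hard-coded dates. A sketch, assuming the spider exposes hypothetical `date_from` and `date_until` attributes set in its \__init__:
```python
from scrapy.exceptions import DropItem

class FbcrawlPipeline(object):
    def process_item(self, item, spider):
        # bounds come from the spider instead of being hard-coded here
        if item['date'] < spider.date_from:
            raise DropItem("Dropping element older than {}".format(spider.date_from))
        if item['date'] > spider.date_until:
            raise DropItem("Dropping element newer than {}".format(spider.date_until))
        return item
```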


@@ -64,9 +64,9 @@ ROBOTSTXT_OBEY = False
 # Configure item pipelines
 # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
-#ITEM_PIPELINES = {
-#'fbcrawl.pipelines.FbcrawlPipeline': 300,
-#}
+ITEM_PIPELINES = {
+    'fbcrawl.pipelines.FbcrawlPipeline': 300,
+}
 
 # Enable and configure the AutoThrottle extension (disabled by default)
 # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
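For reference, throttling can be switched on right below those comment lines; these values are the defaults from Scrapy's generated settings.py template, shown for illustration:
```python
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5           # the initial download delay (seconds)
AUTOTHROTTLE_MAX_DELAY = 60            # maximum delay in case of high latencies
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per remote server
#AUTOTHROTTLE_DEBUG = True             # show throttling stats for every response
```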