From f447621f0bd8adfa1d705514382439325db9840d Mon Sep 17 00:00:00 2001
From: lzjqsdd
Date: Sat, 23 Apr 2016 15:36:15 +0800
Subject: [PATCH] Describe problems encountered during crawling
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 README.md | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/README.md b/README.md
index 91d9549..9cc5f0f 100644
--- a/README.md
+++ b/README.md
@@ -16,3 +16,8 @@ scrapy crawlall
 ```shell
 scrapy crawl toutiao
 ```
+
+### Notes
+ The crawled news is UTF-8 encoded; it is not garbled text.
+ NetEase News content from 2015 uses a different format from the 2016 content. It can still be crawled, but the XPath parsing needs to be modified.
+ With the default parameters, about 130,000 items can be crawled. They are saved to title.json (without the news content) or news.json (with the news content); choose between the two in setting.py.