NewsSpider/README.md
2016-04-23 15:36:42 +08:00

## Supported sites
- Toutiao (今日头条)
- NetEase News (网易新闻)
- Tencent News (腾讯新闻)
## Running
### Run all spiders at once
```shell
git clone https://github.com/lzjqsdd/NewsSpider.git
cd NewsSpider/news_spider
scrapy crawlall
```
### Run a single spider
```shell
scrapy crawl toutiao
```
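### Notes
- Scraped news is stored as UTF-8. It is not garbled text.
- NetEase News content from 2015 uses a different format from 2016 content. It can still be scraped, but the XPath parsing rules need to be modified.
- With the default parameters, about 130,000 items are scraped. They are saved to `title.json` (without article content) or `news.json` (with article content); choose between the two in `setting.py`.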
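
The UTF-8 note above can be illustrated with a small sketch. It assumes the output is JSON and uses a made-up record with a hypothetical `title` field; the actual field names and layout of `title.json` and `news.json` may differ. The point is only that escaped or byte-level output decodes back to readable Chinese.

```python
import json

# Hypothetical record; the real field names in title.json / news.json may differ.
record = {"title": "今日头条新闻标题"}

# json.dumps escapes non-ASCII characters by default, so the text shows up
# as \uXXXX sequences. This looks unreadable but is not corrupted.
escaped = json.dumps(record)
print(escaped)

# Parsing the escaped text recovers the original characters.
decoded = json.loads(escaped)
print(decoded["title"])

# The same record written as raw UTF-8 bytes, as it might be stored on disk.
raw_bytes = json.dumps(record, ensure_ascii=False).encode("utf-8")

# Decoding the bytes as UTF-8 before parsing gives the same record back.
restored = json.loads(raw_bytes.decode("utf-8"))
print(restored == record)
```

If the files look garbled in an editor or terminal, opening them with an explicit `encoding="utf-8"` is usually enough to read them correctly.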