Update README.md
parent 2e32a3cd91
commit 41a2cc1c38
52
README.md
@@ -4,46 +4,48 @@
- Extract triples
- Build a Chinese knowledge graph

![](./kg/kg.png)

### Environment

- python 3.6
- requests: HTTP requests
- re: URL matching
- re: URL regex matching
- bs4: web page parsing
- pickle: saving progress
- threading: threads
- threading: multithreading
- neo4j: graph database for the knowledge graph
- pip install neo4j-driver: neo4j Python driver
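
As a rough illustration of how `re` and `pickle` from the list above might be used in a crawler like this one, here is a minimal sketch. The URL pattern, the function names, and the layout of the saved state are assumptions made for this example; they are not taken from the project's code.

```python
import pickle
import re

# Hypothetical pattern: matches Baidu Baike item URLs (http or https).
ITEM_URL = re.compile(r"^https?://baike\.baidu\.com/item/.+")


def is_item_url(url):
    """Return True if url looks like a Baidu Baike item page."""
    return ITEM_URL.match(url) is not None


def save_progress(path, pending, visited):
    """Persist the crawl state so an interrupted crawl can resume later."""
    with open(path, "wb") as f:
        pickle.dump({"pending": set(pending), "visited": set(visited)}, f)


def load_progress(path):
    """Load the saved state, or start with empty sets if no file exists."""
    try:
        with open(path, "rb") as f:
            state = pickle.load(f)
    except FileNotFoundError:
        return set(), set()
    return state["pending"], state["visited"]
```

Saving both the pending and the visited URL sets is one possible design: it lets a restarted crawl skip pages it has already fetched.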

### Directory

- spider/ crawls web pages
- ie/ extracts information from web pages
- kg/ extracts triples

### Run:
### Set these paths before running:

Line 38 of spider/html_paser.py sets the web page storage path:
```
python spider_main.py
path='/data/ruben/data/webpages/'  # custom directory for webpages
```
Line 11 of ie/extract-para.py sets the web page storage path:
```
pages=glob.glob('/data/ruben/data/webpages/*')
```
Line 37 of ie/extract-table.py sets the web page storage path:
```
pages=glob.glob('/data/ruben/data/webpages/*')
```
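
The same directory is hard-coded in three scripts. One way to avoid editing three files is to read the path from a single place. The sketch below is only an illustration: the `WEBPAGES_DIR` environment variable and the helper names are hypothetical, and the default is the path shown in this README.

```python
import glob
import os

# Default taken from this README; WEBPAGES_DIR is a hypothetical override.
DEFAULT_DIR = "/data/ruben/data/webpages/"


def webpages_dir():
    """Return the directory where crawled pages are stored."""
    return os.environ.get("WEBPAGES_DIR", DEFAULT_DIR)


def list_pages(directory=None):
    """Return the saved page files in sorted order."""
    directory = directory or webpages_dir()
    return sorted(glob.glob(os.path.join(directory, "*")))
```

With a helper like this, a line such as `pages=glob.glob(...)` could be replaced by `pages = list_pages()`.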

### Web page save path:
### Code directory

Mine is ```/data/webpages```; to change it, edit the path in ```html_parser.py```.
### Run log:
- spider/ crawls the raw web pages
- ie/ parses the body text from the web pages and extracts structured information from it
- kg/ extracts triples and stores them in the neo4j database
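
To make the last step more concrete, here is a minimal sketch of writing (head, relation, tail) triples to neo4j with the Python driver. It is not the code in `kg/`: the `Entity` label, the generic `REL` relationship type, and the function names are assumptions, and the import path assumes a driver version that exposes `neo4j.GraphDatabase`.

```python
# Cypher cannot take a relationship type as a parameter, so this sketch
# stores the predicate as a property on a generic REL relationship.
MERGE_TRIPLE = (
    "MERGE (h:Entity {name: $head}) "
    "MERGE (t:Entity {name: $tail}) "
    "MERGE (h)-[:REL {name: $rel}]->(t)"
)


def triple_params(head, rel, tail):
    """Build the query parameters for one triple, trimming whitespace."""
    return {"head": head.strip(), "rel": rel.strip(), "tail": tail.strip()}


def store_triples(uri, user, password, triples):
    """Write triples to neo4j; MERGE avoids creating duplicates on reruns."""
    from neo4j import GraphDatabase  # imported here so the helpers work without the driver

    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            for head, rel, tail in triples:
                session.run(MERGE_TRIPLE, triple_params(head, rel, tail))
    finally:
        driver.close()
```

A call might look like `store_triples("bolt://localhost:7687", "neo4j", "<password>", triples)`, where the URI and credentials are placeholders for your own setup.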


### Code execution order:

```
craw 68357 : http://baike.baidu.com/item/%E8%BF%87%E9%80%9F%E7%BB%AF%E9%97%BB
Save to disk filename:webpages/非常主播
craw 68358 : http://baike.baidu.com/item/%E5%B8%82%E5%9C%BA%E8%A7%84%E6%A8%A1
Save to disk filename:webpages/市场规模
craw 68359 : https://baike.baidu.com/item/%E6%B8%85%E6%99%8F%E5%9B%AD
Save to disk filename:webpages/清晏园
craw 68360 : http://baike.baidu.com/item/%E5%AE%9D%E8%8E%B1%E5%9D%9E
Save to disk filename:webpages/宝莱坞
craw 68361 : https://baike.baidu.com/item/%E5%BA%93%E6%96%AF%E7%A7%91%E5%9F%8E
Save to disk filename:webpages/库斯科城
python spider/spider_main.py
python ie/extract-para.py
python ie/extract-table.py
python kg/test_neo4j.py
```
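
If you prefer to start the four scripts with one command, a small wrapper along these lines could work. This is a sketch, not part of the repository: `build_commands` and `run_pipeline` are hypothetical names, and it assumes it is run from the repository root and that each script exits on its own.

```python
import subprocess
import sys

# Script paths as listed in this README, in execution order.
PIPELINE = [
    "spider/spider_main.py",
    "ie/extract-para.py",
    "ie/extract-table.py",
    "kg/test_neo4j.py",
]


def build_commands(python=sys.executable):
    """Return one command (argument list) per pipeline step."""
    return [[python, script] for script in PIPELINE]


def run_pipeline():
    """Run each step in order and stop at the first failure."""
    for cmd in build_commands():
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline()` stops as soon as one step fails, so a later stage does not run on incomplete output.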

![](./kg/kg.png)