1. Add logging; 2. Add some images; 3. Optimize some logic; 4. Update README.md

ruben 2019-11-21 23:46:16 +08:00
parent 4492395a51
commit 97252ff6b6
8 changed files with 76 additions and 52 deletions

1
.gitignore vendored

@@ -2,3 +2,4 @@ __pycache__/
webpages/*
*.html
*.txt
*.log


@@ -1,47 +1,59 @@
#### Tips
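# Open-source web knowledge graph project
- Crawl Chinese pages from Baidu Baike
- Parse triples and page content
- Build a Chinese knowledge graph
- Build an encyclopedia bot (in progress)
##### update 20191121
- Migrated the code to the Scrapy crawling framework
- Optimized part of the extraction code
- Moved data persistence to MongoDB
- Fixed the broken chatbot
- Opened the neo4j admin interface so the resulting knowledge graph can be viewed
##### Tips
- For problems with the project, please open an issue.
- For anything not suitable for public discussion, please send an email.
- The ChatBot is available at [this link](http://bot.rubenxiao.com/)
# Open-source web knowledge graph project
- Crawl Chinese pages from Baidu Baike
- Extract [1,000,000+ triples](https://raw.githubusercontent.com/lixiang0/WEB_KG/master/kg/triples.txt)
- Build a Chinese knowledge graph
- The finished encyclopedia knowledge graph is available at [this link](http://kg.rubenxiao.com/); username neo4j, password 123.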
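### Environment
- python 3.6
- requests: HTTP requests
- re: URL regex matching
- bs4: HTML parsing
- pickle: saving progress
- threading: multithreading
- scrapy: web crawling and page parsing
- neo4j: graph database for the knowledge graph; for installation see [this link](http://blog.rubenxiao.com/posts/install-neo4j.html)
- pip install neo4j-driver: Python driver for neo4j
- pip install pymongo: Python support for MongoDB
- MongoDB: for database installation see [this link](https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/)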
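### Code directories
### Running the code:
- spider/ fetches the raw web pages
- ie/ parses the body text from the pages and extracts structured information from it
- kg/ extracts triples and stores them in the neo4j database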
```
cd WEB_KG/baike
scrapy crawl baike
```
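The two shell commands above can also be driven from Python. The following is a minimal sketch, assuming Scrapy is installed in the current interpreter's environment and that the project directory is `WEB_KG/baike` as in the commands above; the helper names `crawl_command` and `run_crawl` are hypothetical and not part of the repository.

```python
import subprocess
import sys


def crawl_command(spider="baike"):
    """Build the argument list for running a Scrapy spider by name."""
    # `python -m scrapy` uses the interpreter running this script, which
    # avoids picking up a `scrapy` executable from a different environment.
    return [sys.executable, "-m", "scrapy", "crawl", spider]


def run_crawl(project_dir="WEB_KG/baike", spider="baike"):
    """Run the spider from the Scrapy project directory and return its exit code."""
    # cwd plays the role of `cd WEB_KG/baike`; the path is an assumption
    # taken from the README commands.
    completed = subprocess.run(crawl_command(spider), cwd=project_dir)
    return completed.returncode
```

Calling `run_crawl()` blocks until the crawl ends; as with the shell command, the crawl can be interrupted with Ctrl+C.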
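### Execution order:
- 1. In the spider directory, run python spider_main.py
- 2. In the ie directory, run python extract-para.py
- 3. In the ie directory, run python extract-table.py
- 4. In the kg directory, run python build-triple-from-table.py
- 5. In the kg directory, run python insert_to_neo4j.py
Step 2 is optional for this project.
Execution screen (press Ctrl+C to stop)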
![](./imgs/kg5.png)
### Knowledge graph visualization
![](./kg/kg.png)
![](./imgs/kg.png)
### Web page content stored in MongoDB
![](./imgs/kg3.png)
### Triples stored in MongoDB
![](./imgs/kg4.png)
### neo4j admin interface
![](./imgs/kg2.png)


@@ -8,8 +8,13 @@ import re
import pymongo
from scrapy.selector import Selector
from neo4j.v1 import GraphDatabase
import logging
import time
logfile_name = time.ctime(time.time()).replace(' ', '_')
if not os.path.exists('logs/'):
    os.mkdir('logs/')
logging.basicConfig(filename=f'logs/{logfile_name}.log', filemode='a+',
                    format='%(levelname)s - %(asctime)s - %(message)s', datefmt='%d-%b-%y %H:%M:%S')
class BaikeSpider(scrapy.Spider):
name = 'baike'
allowed_domains = ['baike.baidu.com']
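A note on the logging setup added in this hunk: `time.ctime()` produces strings such as `Thu Nov 21 23:46:16 2019`, so after replacing spaces the log file name still contains colons, which are not valid in file names on Windows. The following is a minimal sketch of an alternative, assuming a `logs` directory as in the diff; the helper names `make_log_path` and `configure_logging` are hypothetical and do not appear in the commit.

```python
import logging
import os
import time


def make_log_path(log_dir="logs", now=None):
    """Return a log file path whose name contains no spaces or colons."""
    # Format the timestamp explicitly instead of using time.ctime();
    # now=None means "use the current time".
    stamp = time.strftime("%Y%m%d_%H%M%S", time.localtime(now))
    os.makedirs(log_dir, exist_ok=True)  # no error if the directory already exists
    return os.path.join(log_dir, f"{stamp}.log")


def configure_logging(log_dir="logs"):
    """Configure the root logger to write to a timestamped file; return the path."""
    path = make_log_path(log_dir)
    logging.basicConfig(
        filename=path,
        filemode="a",
        level=logging.WARNING,  # the root logger's default level, stated for clarity
        format="%(levelname)s - %(asctime)s - %(message)s",
        datefmt="%d-%b-%y %H:%M:%S",
    )
    return path
```

Because no `level` is passed in the commit, only messages at WARNING and above reach the file, which is consistent with the spider logging triples through `logging.warning`.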
@@ -59,23 +64,27 @@ class BaikeSpider(scrapy.Spider):
# process triples
entity = ''.join(response.xpath(
'//h1/text()').extract()).replace('/', '')
'//h1/text()').getall()).replace('/', '')
attrs = response.xpath(
'//dt[contains(@class,"basicInfo-item name")]').extract()
'//dt[contains(@class,"basicInfo-item name")]').getall()
values = response.xpath(
'//dd[contains(@class,"basicInfo-item value")]').extract()
'//dd[contains(@class,"basicInfo-item value")]').getall()
if len(attrs)!= len(values):
return
with self.driver.session() as session:
for i, attr in enumerate(attrs):
try:
for attr,value in zip(attrs,values):
# attr
temp = Selector(text=attr).xpath(
'//dt/text()|//dt/a/text()').extract()
attr = ''.join(temp).replace('\n', '').replace('', '').replace(
':', '').replace('\xa0', '').replace(' ', '').replace('', '').replace('', '')
'//dt//text()').getall()
attr = ''.join(temp).replace('\xa0', '')
# value
temp = Selector(text=values[i]).xpath(
'//dd/text()|//dd/a/text()').extract()
value = ''.join(temp).replace('\n', '')
values = Selector(text=value).xpath(
'//dd/text()|//dd/a//text()').getall()
for value in values:
try:
value=value.replace('\n','')
logging.warning(entity+'_'+attr+'_'+value)
self.db_triples.insert_one({
"_id": entity+'_'+attr+'_'+value,
"item_name": entity,
@@ -85,3 +94,5 @@ class BaikeSpider(scrapy.Spider):
except pymongo.errors.DuplicateKeyError:
pass
session.write_transaction(self.add_node, entity, attr, value)
except Exception:
logging.error('\n---'.join(attrs)+'\n_________________'+'\n---'.join(values))
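The pairing logic introduced by this change (length check, `zip` over attributes and values, one triple per value fragment, duplicates ignored) can be illustrated without Scrapy or MongoDB. The following is a minimal sketch under stated assumptions: the input is already plain text rather than HTML, a dict stands in for the MongoDB collection, and the `attr` and `value` field names are hypothetical because the diff elides the rest of the document. It also uses a separate name, `fragments`, for the inner list so the outer `values` list is not rebound inside the loop.

```python
def build_triples(entity, attrs, values):
    """Pair attribute names with their value fragments and return triple documents.

    `attrs` is a list of attribute names; `values` is a list of lists, one list
    of text fragments per attribute. Both shapes are assumptions of this sketch.
    """
    # Mirror the guard in the diff: skip the page when the columns do not line up.
    if len(attrs) != len(values):
        return []

    store = {}  # stands in for a MongoDB collection keyed by _id
    for attr, fragments in zip(attrs, values):
        attr = attr.replace("\xa0", "")
        for fragment in fragments:
            value = fragment.replace("\n", "")
            if not value:
                continue  # addition of this sketch: ignore empty fragments
            triple_id = entity + "_" + attr + "_" + value
            # setdefault keeps the first document for an _id, which plays the
            # role of catching pymongo.errors.DuplicateKeyError and moving on.
            store.setdefault(triple_id, {
                "_id": triple_id,
                "item_name": entity,
                "attr": attr,    # hypothetical field name
                "value": value,  # hypothetical field name
            })
    return list(store.values())
```

Using the concatenated `entity_attr_value` string as `_id` is what makes repeated crawls idempotent: inserting the same triple twice collides on the key instead of creating a second document.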

BIN imgs/kg.png: new binary file, not shown (440 KiB)
BIN imgs/kg2.png: new binary file, not shown (155 KiB)
BIN imgs/kg3.png: new binary file, not shown (404 KiB)
BIN imgs/kg4.png: new binary file, not shown (135 KiB)
BIN imgs/kg5.png: new binary file, not shown (312 KiB)