add cocoNLP information extraction tool and Chinese event extraction resources

2018-12-14 10:35:51 +08:00 · 2018-12-14 10:35:51 +08:00 · ea8c74e005
commit ea8c74e005
parent c4c4aa1e33
1 changed files with 61 additions and 1 deletions
--- a/README.md
+++ b/README.md
@ -3,7 +3,7 @@
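 Many of these packages are fun and worth bookmarking, enough to satisfy anyone's collecting habit!
 If you find this useful, please share and star it. Thanks!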

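-Contents include: **Chinese and English sensitive words, language detection, Chinese and international mobile/landline number location and carrier lookup, gender inference from names, mobile number extraction, ID card number extraction, email extraction, Chinese and Japanese name corpora, Chinese abbreviation corpus, character decomposition dictionary, word sentiment scores, stop words, reactionary word list, violence and terrorism word list, Traditional/Simplified Chinese conversion, English approximation of Chinese pronunciation, Wang Feng lyrics generator, occupation name lexicon, synonym lexicon, antonym lexicon, negation word lexicon, car brand lexicon, car parts lexicon, segmentation of unspaced English text, various Chinese word vectors, company name collection, classical poetry corpus, IT lexicon, finance lexicon, idiom lexicon, place name lexicon, historical figures lexicon, poetry lexicon, medical lexicon, food and drink lexicon, legal lexicon, automotive lexicon, animal lexicon, Chinese chat corpus, Chinese rumor data**.
+Contents include: **Chinese and English sensitive words, language detection, Chinese and international mobile/landline number location and carrier lookup, gender inference from names, mobile number extraction, ID card number extraction, email extraction, Chinese and Japanese name corpora, Chinese abbreviation corpus, character decomposition dictionary, word sentiment scores, stop words, reactionary word list, violence and terrorism word list, Traditional/Simplified Chinese conversion, English approximation of Chinese pronunciation, Wang Feng lyrics generator, occupation name lexicon, synonym lexicon, antonym lexicon, negation word lexicon, car brand lexicon, car parts lexicon, segmentation of unspaced English text, various Chinese word vectors, company name collection, classical poetry corpus, IT lexicon, finance lexicon, idiom lexicon, place name lexicon, historical figures lexicon, poetry lexicon, medical lexicon, food and drink lexicon, legal lexicon, automotive lexicon, animal lexicon, Chinese chat corpus, Chinese rumor data, Baidu Chinese QA dataset, collection of sentence similarity matching algorithms, BERT resources, text generation and summarization tools, cocoNLP information extraction tool**.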

 **1\. textfilter: Chinese and English sensitive word filtering**  [observerss/textfilter](https://github.com/observerss/textfilter)
 ```
@ -45,6 +45,9 @@ print(detect_langs(s3))    # detect_langs() outputs all detected language types


 **4\. phone: Chinese mobile number location lookup:** [ls0f/phone](https://github.com/ls0f/phone)
+
+> Now integrated into the Python package [cocoNLP](https://github.com/fighting41love/cocoNLP); feel free to try it.
+
 ```
 from phone import Phone
 p  = Phone()
@ -78,11 +81,17 @@ phone('(817) 569-8900'); // return ['+18175698900, 'USA']
 ('female', 0.9759486128949907)
 ```
 **7\. Regular expression for extracting emails**
+
+> Now integrated into the Python package [cocoNLP](https://github.com/fighting41love/cocoNLP); feel free to try it.
+
 ```
 email_pattern = '^[*#\u4e00-\u9fa5 a-zA-Z0-9_.-]+@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*\.[a-zA-Z0-9]{2,6}$'
 emails = re.findall(email_pattern, text, flags=0)
 ```
 **8\. Regular expression for extracting phone numbers**
+
+> Now integrated into the Python package [cocoNLP](https://github.com/fighting41love/cocoNLP); feel free to try it.
+
 ```
 cellphone_pattern = '^((13[0-9])|(14[0-9])|(15[0-9])|(17[0-9])|(18[0-9]))\d{8}$'
 phoneNumbers = re.findall(cellphone_pattern, text, flags=0)
@ -93,6 +102,9 @@ IDCards_pattern = r'^([1-9]\d{5}[12]\d{3}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])
 IDs = re.findall(IDCards_pattern, text, flags=0)
 ```
 **10.  Name corpus:** [wainshine/Chinese-Names-Corpus](https://github.com/wainshine/Chinese-Names-Corpus)
+
+> Name extraction is available in the Python package [cocoNLP](https://github.com/fighting41love/cocoNLP); feel free to try it.
+
 ```
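 Chinese names (modern and ancient), Japanese names, Chinese surnames and given names, forms of address (大姨妈, 小姨妈, etc.), English->Chinese names (e.g. 李约翰), idiom dictionary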
 ```
@ -173,6 +185,9 @@ say wo i ni
 See the data folder in this repo: [data](https://github.com/fighting41love/funNLP/tree/master/data)
 ```
 **26\. Time extraction:**
+
+> Now integrated into the Python package [cocoNLP](https://github.com/fighting41love/cocoNLP); feel free to try it.
+
 ```
 Test run at 9:44 on June 7, 2016; the results are as follows

@ -258,5 +273,50 @@ publishTime: time the rumor was reported

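 An open-source toolkit built on TensorFlow that aims to support a broad range of machine learning tasks, especially text generation tasks such as machine translation, dialogue, summarization, content manipulation, and language modeling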

+**38. Chinese event extraction:** [github](https://github.com/liuhuanyong/ComplexEventExtraction)
+
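+Chinese compound event extraction, covering conditional, causal, sequential, and reversal events, with the extracted events assembled into an event logic graph.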
+
+**39. cocoNLP: ** [github](https://github.com/fighting41love/cocoNLP)
+
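+Extraction of names, addresses, emails, mobile numbers, mobile number locations, and similar information, plus the RAKE phrase extraction algorithm.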
+> pip install cocoNLP
+
+```
+>>> from cocoNLP.extractor import extractor
+
+>>> ex = extractor()
+
+>>> text = '急寻特朗普，男孩，于2018年11月27号11时在陕西省安康市汉滨区走失。丢失发型短发，...如有线索，请迅速与警方联系：18100065143，132-6156-2938，baizhantang@sina.com.cn 和yangyangfuture at gmail dot com'
+
+# Extract emails
+>>> emails = ex.extract_email(text)
+>>> print(emails)
+
+['baizhantang@sina.com.cn', 'yangyangfuture@gmail.com.cn']
+# Extract mobile numbers
+>>> cellphones = ex.extract_cellphone(text,nation='CHN')
+>>> print(cellphones)
+
+['18100065143', '13261562938']
+# Extract mobile number location and carrier
+>>> cell_locs = [ex.extract_cellphone_location(cell,'CHN') for cell in cellphones]
+>>> print(cell_locs)
+
+cellphone_location [{'phone': '18100065143', 'province': '上海', 'city': '上海', 'zip_code': '200000', 'area_code': '021', 'phone_type': '电信'}]
+# Extract addresses
+>>> locations = ex.extract_locations(text)
+>>> print(locations)
+['陕西省安康市汉滨区', '安康市汉滨区', '汉滨区']
+# Extract time points
+>>> times = ex.extract_time(text)
+>>> print(times)
+time {"type": "timestamp", "timestamp": "2018-11-27 11:00:00"}
+# Extract names
+>>> name = ex.extract_name(text)
+>>> print(name)
+特朗普
+
+```

 [jieba](https://github.com/fxsjy/jieba) and [hanlp](https://github.com/hankcs/pyhanlp) need no introduction.
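
Review note on the regex snippets in items 7 and 8 of this README: both patterns are wrapped in `^...$`, so `re.findall` returns a hit only when the entire input string fits the pattern, and the capturing groups make it return group fragments rather than whole matches. A minimal sketch of how they might be adapted for free text follows. It assumes simplified character classes (the Chinese-character range, the space, `*`, and `#` are dropped from the email local part), and the names `EMAIL_RE`, `CELLPHONE_RE`, `extract_emails`, and `extract_cellphones` are hypothetical. It is not cocoNLP's implementation.

```python
import re

# Assumption: unanchored, simplified variants of the README's patterns.
# Non-capturing groups (?:...) make re.findall return whole matches.
EMAIL_RE = re.compile(
    r'[a-zA-Z0-9_.-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*\.[a-zA-Z0-9]{2,6}'
)

# Same prefixes as the README pattern (13x, 14x, 15x, 17x, 18x) followed by
# 8 digits; the lookarounds reject matches inside longer digit runs.
CELLPHONE_RE = re.compile(
    r'(?<!\d)(?:13[0-9]|14[0-9]|15[0-9]|17[0-9]|18[0-9])\d{8}(?!\d)'
)


def extract_emails(text):
    """Return every email-like substring found in text."""
    return EMAIL_RE.findall(text)


def extract_cellphones(text):
    """Return every 11-digit mainland-China-style mobile number in text."""
    return CELLPHONE_RE.findall(text)


if __name__ == '__main__':
    sample = '请联系：18100065143，baizhantang@sina.com.cn'
    print(extract_emails(sample))
    print(extract_cellphones(sample))
```

A number written with separators, such as `132-6156-2938` in the item 39 sample text, would not be matched by this sketch; stripping the separators first would be a separate step.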