🌿 Chinese synonyms: a toolkit for chatbots and intelligent question answering

Synonyms

Chinese Synonyms for Natural Language Processing and Understanding.

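The best Chinese synonyms toolkit.

synonyms can be used for many natural language understanding tasks: text alignment, recommendation algorithms, similarity calculation, semantic shift, keyword extraction, concept extraction, automatic summarization, search engines, and more.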

Welcome

pip install -U synonyms

Compatible with Python 2 and Python 3. The current stable version is v2.0.

Samples

Usage

synonyms#nearby

import synonyms
print("人脸: %s" % (synonyms.nearby("人脸")))
print("识别: %s" % (synonyms.nearby("识别")))
print("NOT_EXIST: %s" % (synonyms.nearby("NOT_EXIST")))

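synonyms.nearby(WORD) returns a list containing two items: [[nearby_words], [nearby_words_score]]. nearby_words holds the synonyms of WORD, stored as a list and sorted by distance from nearest to farthest. nearby_words_score holds the distance score of the word at the corresponding position in nearby_words. Scores lie in the (0-1) range, and the closer a score is to 1, the more similar the word. For example: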

synonyms.nearby(人脸) = [
    ["图片", "图像", "通过观察", "数字图像", "几何图形", "脸部", "图象", "放大镜", "面孔", "Mii"], 
    [0.597284, 0.580373, 0.568486, 0.535674, 0.531835, 0.530095, 0.525344, 0.524009, 0.523101, 0.516046]]

For out-of-vocabulary (OOV) words, it returns [[], []]. The current vocabulary size is 125,792.
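The two parallel lists are often easier to work with as (word, score) pairs. The following is a minimal sketch, assuming only the return shape described above. The helper name `pair_nearby` is hypothetical and not part of the synonyms package, and the sample data is copied from the "人脸" output rather than produced by a live call.

```python
def pair_nearby(result):
    """Turn [[words], [scores]] into a list of (word, score) tuples."""
    words, scores = result
    return list(zip(words, scores))


# First three entries of the README output for "人脸" (not a live call).
sample = [["图片", "图像", "通过观察"], [0.597284, 0.580373, 0.568486]]
pairs = pair_nearby(sample)

# An OOV word yields [[], []], which pairs up to an empty list.
oov_pairs = pair_nearby([[], []])

for word, score in pairs:
    print("%s\t%s" % (word, score))
```

With the package installed, `pair_nearby(synonyms.nearby("人脸"))` should work the same way, since it relies only on the documented two-list structure.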

synonyms#compare

Compare the similarity of two sentences.

    sen1 = "发生历史性变革"
    sen2 = "发生历史性变革"
    r = synonyms.compare(sen1, sen2, seg=True)

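The seg parameter specifies whether synonyms.compare segments sen1 and sen2 into words; it defaults to True. The return value lies in the range [0-1], and the closer it is to 1, the more similar the two sentences are.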

旗帜引领方向 vs 道路决定命运: 0.429
旗帜引领方向 vs 旗帜指引道路: 0.93
发生历史性变革 vs 发生历史性变革: 1.0
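  • Sentence similarity accuracy

Tested on the SentenceSim corpus.

The test corpus contains 7516 sentence pairs.
Threshold: 0.5
  similarity > 0.5: the pair is reported as similar;
  similarity < 0.5: the pair is reported as not similar.

Evaluation results:

Correct: 6626, Incorrect: 890, Accuracy: 88.15%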

For details on the distance calculation and threshold selection, see "enhance Synonyms#compare".
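The evaluation procedure above can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation script: the helper names are hypothetical, the gold labels in the toy example are invented for demonstration, and a score of exactly 0.5 (which the description above leaves undefined) is treated here as not similar.

```python
def is_similar(score, threshold=0.5):
    """Apply the threshold rule; a score equal to the threshold counts as
    not similar (an assumption, the README does not cover this case)."""
    return score > threshold


def evaluate(scores, labels, threshold=0.5):
    """Return (correct, incorrect, accuracy) for predicted vs. gold labels."""
    correct = sum(
        1 for s, y in zip(scores, labels) if is_similar(s, threshold) == y
    )
    total = len(labels)
    return correct, total - correct, correct / float(total)


# Toy example: the first three scores come from the compare samples above,
# the gold labels are made up for illustration.
toy_scores = [0.93, 0.429, 1.0, 0.7]
toy_labels = [True, False, True, False]
toy_result = evaluate(toy_scores, toy_labels)

# The reported figures are consistent with each other: 6626 + 890 = 7516.
reported_total = 6626 + 890
reported_accuracy = 6626 / float(reported_total)
print("accuracy: %.4f" % reported_accuracy)
```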

synonyms#display

Prints synonyms in a readable format, which is convenient for debugging. display calls the synonyms#nearby method.

>>> synonyms.display("飞机")
'飞机'近义词:
  1. 架飞机:0.837399
  2. 客机:0.764609
  3. 直升机:0.762116
  4. 民航机:0.750519
  5. 航机:0.750116
  6. 起飞:0.735736
  7. 战机:0.734975
  8. 飞行中:0.732649
  9. 航空器:0.723945
  10. 运输机:0.720578
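To illustrate how a display-style printer can be layered on top of nearby, here is a rough sketch. It is not the package's actual implementation: `format_nearby` and `fake_nearby` are hypothetical names, and `fake_nearby` is a stand-in that returns the first two entries of the output above so the example runs without the package installed.

```python
def format_nearby(word, nearby):
    """Format the [[words], [scores]] result of `nearby(word)` as a
    numbered list, mirroring the layout of the output shown above."""
    words, scores = nearby(word)
    lines = ["'%s'近义词:" % word]
    for rank, (w, s) in enumerate(zip(words, scores), start=1):
        lines.append("  %d. %s:%s" % (rank, w, s))
    return "\n".join(lines)


def fake_nearby(word):
    # Stand-in for synonyms.nearby; data copied from the "飞机" output.
    return [["架飞机", "客机"], [0.837399, 0.764609]]


text = format_nearby("飞机", fake_nearby)
print(text)
```

Passing `synonyms.nearby` in place of `fake_nearby` should give a comparable listing, assuming the return structure documented in the nearby section.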

PCA

Demo

$ pip install -r Requirements.txt
$ python demo.py

Data

synonyms/data/words.nearby.x.pklz # compressed pickle object

The data is built from wikidata-corpus.
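For readers unfamiliar with compressed pickle files, the sketch below shows one common way such a file is written and read. It assumes gzip compression and a dict mapping each word to a [[words], [scores]] entry; both are assumptions, since the README only says "compressed pickle object", and the real file may use a different compressor or layout. The helper names and the toy file are hypothetical.

```python
import gzip
import os
import pickle
import tempfile


def save_pklz(obj, path):
    """Pickle `obj` into a gzip-compressed file (assumed format)."""
    with gzip.open(path, "wb") as f:
        # Protocol 2 is readable from both Python 2 and Python 3.
        pickle.dump(obj, f, protocol=2)


def load_pklz(path):
    """Load an object from a gzip-compressed pickle file."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)


# Toy data in a hypothetical word -> [[words], [scores]] layout.
toy = {"人脸": [["图片", "图像"], [0.597284, 0.580373]]}
path = os.path.join(tempfile.mkdtemp(), "toy.nearby.pklz")
save_pklz(toy, path)
restored = load_pklz(path)
```

Pickle files can execute arbitrary code when loaded, so only files from a trusted source should be unpickled.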

Benchmark

Tested with Python 3 on a MacBook Pro.

python benchmark.py

++++++++++ OS Name and version ++++++++++

Platform: Darwin

Kernel: 16.7.0

Architecture: ('64bit', '')

++++++++++ CPU Cores ++++++++++

Cores: 4

CPU Load: 60

++++++++++ System Memory ++++++++++

meminfo 8GB

synonyms#nearby: 100000 loops, best of 3 epochs: 0.209 usec per loop
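A "best of 3" per-loop figure like the one above is the kind of number the standard `timeit` module produces. The sketch below shows the general approach only; it is not the contents of benchmark.py, and `lookup` is a hypothetical dictionary-backed stand-in for synonyms.nearby so the example runs on its own.

```python
import timeit

# Hypothetical in-memory table standing in for the real synonym data.
table = {"人脸": [["图片"], [0.597284]]}


def lookup(word):
    """Dictionary lookup that returns [[], []] for unknown words."""
    return table.get(word, [[], []])


loops = 100000
# Run 3 epochs of `loops` calls each and keep the fastest epoch.
best = min(timeit.repeat(lambda: lookup("人脸"), number=loops, repeat=3))
usec_per_loop = best / loops * 1e6
print("%d loops, best of 3 epochs: %.3f usec per loop" % (loops, usec_per_loop))
```

Taking the minimum across epochs reduces the influence of background load, which matters on a machine that reports a CPU load of 60.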

Live Sharing

Online talk: Synonyms, a Chinese synonyms toolkit @ 2018-02-07

Statement

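Synonyms is released under the GPL 3.0 license. The data and code may be used in research and commercial products, provided that the citation and URL are stated in any published content, such as media, journals, magazines, or blogs.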

@online{Synonyms:hain2017,
  author = {Hai Liang Wang, Hu Ying Xi},
  title = {中文近义词工具包Synonyms},
  year = 2017,
  url = {https://github.com/huyingxi/Synonyms},
  urldate = {2017-09-27}
}

Any data or project derived from Synonyms must also be open and must carry the same statement.

References

wikidata-corpus

word2vec: derivation of the principles and code analysis (word2vec原理推导与代码分析)

Authors

Hai Liang Wang

Hu Ying Xi

Give credits to

Word2vec by Google

Wikimedia: source of the training corpus

gensim: word2vec.py

SentenceSim: similarity evaluation corpus

jieba: Chinese word segmentation

License

GPL3.0