init similarities project.

This commit is contained in:
shibing624 2022-02-23 19:44:53 +08:00
parent a88748119f
commit 487469a423
19 changed files with 1029 additions and 2 deletions

28
.github/ISSUE_TEMPLATE/bug-report.md vendored Normal file

@ -0,0 +1,28 @@
---
name: Bug Report
about: Create a report to help us improve
title: ''
labels: bug
assignees: ''
---
### Describe the bug
Please provide a clear and concise description of what the bug is. If applicable, add screenshots to help explain your problem, especially for visualization related problems.
### To Reproduce
Please provide a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve) here. We hope we can simply copy/paste/run it. It is also nice to share a hosted runnable script (e.g. Google Colab), especially for hardware-related problems.
### Describe your attempts
- [ ] I checked the documentation and found no answer
- [ ] I checked to make sure that this is not a duplicate issue
You should also provide the code snippets you tried as a workaround, the StackOverflow solutions you have walked through, or your best guess at the cause if you can't locate it (e.g. cosmic radiation).
### Context
- **OS** [e.g. Windows 10, macOS 10.14]:
- **Hardware** [e.g. CPU only, GTX 1080 Ti]:
### Additional Information
Other things you want the developers to know.


@ -0,0 +1,23 @@
---
name: Feature Request
about: Suggest an idea for this project
title: ''
labels: enhancement
assignees: ''
---
- [ ] I checked to make sure that this is not a duplicate issue
- [ ] I'm submitting the request to the correct repository (for model requests)
### Is your feature request related to a problem? Please describe.
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
### Describe the solution you'd like
A clear and concise description of what you want to happen.
### Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
### Additional Information
Other things you want the developers to know.


@ -0,0 +1,18 @@
---
name: Usage Question
about: Ask a question about similarities usage
title: ''
labels: question
assignees: ''
---
### Describe the Question
Please provide a clear and concise description of what the question is.
### Describe your attempts
- [ ] I walked through the tutorials
- [ ] I checked the documentation
- [ ] I checked to make sure that this is not a duplicate question
You may also provide a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve) you tried as a workaround, or a StackOverflow solution that you have walked through.

17
.github/stale.yml vendored Normal file

@ -0,0 +1,17 @@
# Number of days of inactivity before an issue becomes stale
daysUntilStale: 60
# Number of days of inactivity before a stale issue is closed
daysUntilClose: 7
# Issues with these labels will never be considered stale
exemptLabels:
- pinned
- security
# Label to use when marking an issue as stale
staleLabel: wontfix
# Comment to post when marking an issue as stale. Set to `false` to disable
markComment: >
  This issue has been automatically marked as stale because it has not had
  recent activity. It will be closed if no further activity occurs. Thank you
  for your contributions. (This issue was closed automatically by the bot due to prolonged inactivity; feel free to ask again if needed.)
# Comment to post when closing a stale issue. Set to `false` to disable
closeComment: false

1
.gitignore vendored

@ -127,3 +127,4 @@ dmypy.json
# Pyre type checker
.pyre/
.idea/
.idea/

9
CONTRIBUTING.md Normal file

@ -0,0 +1,9 @@
# Contributing
We are happy to accept your contributions to make `similarities` better and more awesome! To avoid unnecessary work on either
side, please stick to the following process:
1. Check if there is already [an issue](https://github.com/shibing624/similarities/issues) for your concern.
2. If there is not, open a new one to start a discussion. We hate to close finished PRs!
3. If we decide your concern needs code changes, we would be happy to accept a pull request. Please consider the
commit guidelines below.

183
README.md

@ -1,2 +1,181 @@
# Similarities
[![PyPI version](https://badge.fury.io/py/similarities.svg)](https://badge.fury.io/py/similarities)
[![Downloads](https://pepy.tech/badge/similarities)](https://pepy.tech/project/similarities)
[![Contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](CONTRIBUTING.md)
[![GitHub contributors](https://img.shields.io/github/contributors/shibing624/similarities.svg)](https://github.com/shibing624/similarities/graphs/contributors)
[![License Apache 2.0](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE)
[![python_version](https://img.shields.io/badge/Python-3.5%2B-green.svg)](requirements.txt)
[![GitHub issues](https://img.shields.io/github/issues/shibing624/similarities.svg)](https://github.com/shibing624/similarities/issues)
[![Wechat Group](http://vlog.sfyc.ltd/wechat_everyday/wxgroup_logo.png?imageView2/0/w/60/h/20)](#Contact)
Similarities is a toolkit for computing similarity scores between texts, implementing multiple lexical and semantic matching models.
**similarities** implements Word2Vec, RankBM25, BERT, Sentence-BERT, CoSENT and other text-representation and text-similarity models, and compares their performance on semantic text-matching tasks.
**Guide**
- [Feature](#Feature)
- [Evaluate](#Evaluate)
- [Install](#install)
- [Usage](#usage)
- [Contact](#Contact)
- [Citation](#Citation)
- [Reference](#reference)
# Feature
### Text embedding models
- [Word2Vec](similarities/word2vec.py): word-vector lookup based on the large-scale, high-quality Chinese word vectors released by Tencent AI Lab ([8-million-word lite version](https://pan.baidu.com/s/1La4U4XNFe8s5BJqxPQpeiQ), file light_Tencent_AILab_ChineseEmbedding.bin, password: tawe); this project represents a sentence as the average of its word vectors (see the sketch after this list)
- [SBert(Sentence-BERT)](similarities/sentence_bert): a sentence-embedding model that balances quality and efficiency; a classification head is trained with supervision for text matching, and at prediction time sentence vectors are compared directly with cosine similarity; this project reproduces Sentence-BERT training and prediction in PyTorch
- [CoSENT(Cosine Sentence)](similarities/cosent): CoSENT introduces a ranking loss that brings training closer to prediction, converging faster and performing better than Sentence-BERT; this project implements CoSENT training and prediction in PyTorch
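As an illustration of the sentence-as-average-of-word-vectors idea behind the Word2Vec representation above, here is a minimal sketch using gensim and jieba (both listed in requirements.txt); the vector file path and the tokenization choices are assumptions for demonstration, not the project's actual implementation.
```python
import jieba
import numpy as np
from gensim.models import KeyedVectors

# Load the Tencent word vectors in word2vec binary format (path is an assumption)
w2v = KeyedVectors.load_word2vec_format('light_Tencent_AILab_ChineseEmbedding.bin', binary=True)

def sentence_vector(sentence: str) -> np.ndarray:
    # Tokenize with jieba and average the vectors of in-vocabulary tokens
    tokens = [t for t in jieba.lcut(sentence) if t in w2v.key_to_index]
    if not tokens:
        return np.zeros(w2v.vector_size)
    return np.mean([w2v[t] for t in tokens], axis=0)

emb = sentence_vector('如何更换花呗绑定银行卡')
print(emb.shape)
```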
### Text similarity measures
- Cosine similarity: the cosine of the angle between the two vectors (see the sketch below this list)
- Dot product: the inner product of the two vectors after normalization
- Word Mover's Distance (WMD): measures, using word vectors, the minimum distance the words of one text must travel in semantic space to reach the words of the other text
- [RankBM25](similarities/bm25.py): a BM25 variant that scores the similarity between a query and documents and returns a ranking of the docs
- [SemanticSearch](https://github.com/shibing624/similarities/blob/master/similarities/sbert.py#L80): vector similarity search using cosine similarity + top-k, roughly an order of magnitude faster than brute-force pairwise comparison
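A minimal NumPy sketch of the first two measures (the toy vectors are made up for demonstration and are not produced by the library):
```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def normalized_dot_product(a: np.ndarray, b: np.ndarray) -> float:
    # Inner product after L2-normalizing both vectors; equals cosine similarity
    return float(np.dot(a / np.linalg.norm(a), b / np.linalg.norm(b)))

emb1 = np.array([0.1, 0.3, 0.5])  # toy embeddings
emb2 = np.array([0.2, 0.1, 0.4])
print(cosine_similarity(emb1, emb2), normalized_dot_product(emb1, emb2))
```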
# Evaluate
### Text matching
- Evaluation results on English matching datasets:
| Arch | Backbone | Model Name | English-STS-B |
| :-- | :--- | :--- | :-: |
| GloVe | glove | Avg_word_embeddings_glove_6B_300d | 61.77 |
| BERT | bert-base-uncased | BERT-base-cls | 20.29 |
| BERT | bert-base-uncased | BERT-base-first_last_avg | 59.04 |
| BERT | bert-base-uncased | BERT-base-first_last_avg-whiten(NLI) | 63.65 |
| SBERT | sentence-transformers/bert-base-nli-mean-tokens | SBERT-base-nli-cls | 73.65 |
| SBERT | sentence-transformers/bert-base-nli-mean-tokens | SBERT-base-nli-first_last_avg | 77.96 |
| SBERT | xlm-roberta-base | paraphrase-multilingual-MiniLM-L12-v2 | 84.42 |
| CoSENT | bert-base-uncased | CoSENT-base-first_last_avg | 69.93 |
| CoSENT | sentence-transformers/bert-base-nli-mean-tokens | CoSENT-base-nli-first_last_avg | 79.68 |
- Evaluation results on Chinese matching datasets:
| Arch | Backbone | Model Name | ATEC | BQ | LCQMC | PAWSX | STS-B | Avg | QPS |
| :-- | :--- | :--- | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| CoSENT | hfl/chinese-macbert-base | CoSENT-macbert-base | 50.39 | **72.93** | **79.17** | **60.86** | **80.51** | **68.77** | 2572 |
| CoSENT | Langboat/mengzi-bert-base | CoSENT-mengzi-base | **50.52** | 72.27 | 78.69 | 12.89 | 80.15 | 58.90 | 2502 |
| CoSENT | bert-base-chinese | CoSENT-bert-base | 49.74 | 72.38 | 78.69 | 60.00 | 80.14 | 68.19 | 2653 |
| SBERT | bert-base-chinese | SBERT-bert-base | 46.36 | 70.36 | 78.72 | 46.86 | 66.41 | 61.74 | 1365 |
| SBERT | hfl/chinese-macbert-base | SBERT-macbert-base | 47.28 | 68.63 | **79.42** | 55.59 | 64.82 | 63.15 | 1948 |
| CoSENT | hfl/chinese-roberta-wwm-ext | CoSENT-roberta-ext | **50.81** | **71.45** | **79.31** | **61.56** | **81.13** | **68.85** | - |
| SBERT | hfl/chinese-roberta-wwm-ext | SBERT-roberta-ext | 48.29 | 69.99 | 79.22 | 44.10 | 72.42 | 62.80 | - |
- Chinese matching results of the models released by this project:
| Arch | Backbone | Model Name | ATEC | BQ | LCQMC | PAWSX | STS-B | Avg | QPS |
| :-- | :--- | :---- | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| Word2Vec | word2vec | w2v-light-tencent-chinese | 20.00 | 31.49 | 59.46 | 2.57 | 55.78 | 33.86 | 10283 |
| SBERT | xlm-roberta-base | paraphrase-multilingual-MiniLM-L12-v2 | 18.42 | 38.52 | 63.96 | 10.14 | 78.90 | 41.99 | 2371 |
| CoSENT | hfl/chinese-macbert-base | similarities-base-chinese | 31.93 | 42.67 | 70.16 | 17.21 | 79.30 | **48.25** | 2572 |
Notes:
- All scores are Spearman correlation coefficients (a small sketch of computing one follows this list).
- Each result is obtained by training only on the dataset's train split and evaluating on its test split; no external data is used.
- `paraphrase-multilingual-MiniLM-L12-v2` refers to `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`, the multilingual version of `paraphrase-MiniLM-L12-v2`; it is fast, accurate, and supports Chinese.
- `CoSENT-macbert-base` reaches SOTA for its parameter size; it is trained with the CoSENT method, and the results can be reproduced by running the code under [similarities/cosent](similarities/cosent).
- `SBERT-macbert-base` is trained with the SBERT method; the results can be reproduced by running the code under [similarities/sentence_bert](similarities/sentence_bert).
- `similarities-base-chinese` is trained with the CoSENT method on the Chinese STS-B data based on MacBERT; the model has been uploaded to the HuggingFace model hub: [shibing624/similarities-base-chinese](https://huggingface.co/shibing624/similarities-base-chinese).
- `w2v-light-tencent-chinese` is the Word2Vec model built on the Tencent word vectors; it runs on CPU.
- All pretrained models can be loaded through transformers, e.g. MacBERT: `--pretrained_model_path hfl/chinese-macbert-base`.
- The Chinese matching datasets can be downloaded via the links in the [Datasets](#datasets) section below.
- Experiments on Chinese matching tasks show that the best pooling is `first_last_avg`; at prediction time SBert's `mean pooling` can be used instead with little loss in quality.
- QPS was measured on a Tesla V100 GPU with 32 GB of memory.
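For reference, a minimal sketch of computing a Spearman score from predicted similarities and gold labels; scipy is an assumed extra dependency (it is not listed in requirements.txt), and the arrays are toy values.
```python
from scipy.stats import spearmanr

predicted = [0.91, 0.12, 0.55, 0.78]  # model similarity scores (toy values)
gold = [1.0, 0.0, 0.6, 0.8]           # human-annotated labels (toy values)

# Spearman rank correlation, the metric reported in the tables above
corr, _ = spearmanr(predicted, gold)
print(round(corr * 100, 2))
```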
# Demo
Official Demo: http://42.193.145.218/product/short_text_sim/
HuggingFace Demo: https://huggingface.co/spaces/shibing624/similarities
![](docs/hf.png)
# Install
```
pip3 install -U similarities
```
or
```
git clone https://github.com/shibing624/similarities.git
cd similarities
python3 setup.py install
```
### Datasets
Common Chinese semantic matching datasets, covering five tasks: [ATEC](https://github.com/IceFlameWorm/NLP_Datasets/tree/master/ATEC), [BQ](http://icrc.hitsz.edu.cn/info/1037/1162.htm), [LCQMC](http://icrc.hitsz.edu.cn/Article/show/171.html), [PAWSX](https://arxiv.org/abs/1908.11828), and [STS-B](https://github.com/pluto-junzeng/CNSD).
You can download each dataset from its link, or get them all from [Baidu Netdisk (extraction code: qkt6)](https://pan.baidu.com/s/1d6jSiU1wHQAEMWJi7JJWCQ).
The senteval_cn directory aggregates the evaluation datasets, and senteval_cn.zip is an archive of that directory; downloading either one is enough.
# Usage
### 1. Compute text embeddings
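A minimal sketch, following [examples/base_demo.py](./examples/base_demo.py):
```python
from similarities import BertSimilarity

model = BertSimilarity("shibing624/text2vec-base-chinese")  # Chinese sentence embedding model (CoSENT)
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
sentence_embeddings = model.encode(sentences)
print(type(sentence_embeddings), sentence_embeddings.shape)
```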
### 2. Compute the similarity score between two sentences
Example: [semantic_text_similarity.py](./examples/semantic_text_similarity.py)
> The sentence cosine similarity `score` ranges over [-1, 1]; the larger the value, the more similar the two sentences.
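A minimal sketch of scoring a sentence pair with `similarity_score`, following [examples/base_demo.py](./examples/base_demo.py):
```python
from similarities import BertSimilarity

model = BertSimilarity("shibing624/text2vec-base-chinese")
score = model.similarity_score(['如何更换花呗绑定银行卡'], ['花呗更改绑定银行卡'])
print(score)
```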
### 3. Compute similarity scores between a query and a corpus of documents
The typical use case is finding the text in a candidate corpus that is most similar to the query, e.g. question matching for QA and similar-text retrieval.
> `score` ranges over [-1, 1]; the larger the value, the closer the query is to the corpus entry.
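A minimal sketch that ranks a toy corpus against a query by reusing `similarity_score` pairwise; the corpus sentences are only for demonstration, and a dedicated semantic-search API would avoid this brute-force loop:
```python
from similarities import BertSimilarity

model = BertSimilarity("shibing624/text2vec-base-chinese")
query = '如何更换花呗绑定银行卡'
corpus = ['花呗更改绑定银行卡', '我在北京打篮球', '一个女人在看书。']  # toy corpus

# Score the query against every corpus sentence, then sort by score descending
scored = [(doc, model.similarity_score([query], [doc])) for doc in corpus]
for doc, score in sorted(scored, key=lambda x: x[1], reverse=True):
    print(doc, score)
```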
# Contact
- Issues (suggestions): [![GitHub issues](https://img.shields.io/github/issues/shibing624/similarities.svg)](https://github.com/shibing624/similarities/issues)
- Email: xuming624@qq.com
- WeChat:
Add me on WeChat: *ID xuming624, note: name-company-NLP* to join the NLP discussion group.
<img src="docs/wechat.jpeg" width="200" />
# Citation
If you use similarities in your research, please cite it as follows:
```latex
@misc{similarities,
title={similarities: A Tool for Computing Similarity Scores},
author={Ming Xu},
howpublished={https://github.com/shibing624/similarities},
year={2022}
}
```
# License
Licensed under [The Apache License 2.0](/LICENSE); free for commercial use. Please include a link to similarities and the license in your product documentation.
# Contribute
The project code is still rough; if you can improve it, contributions back to this project are welcome. Before submitting, please note the following two points:
- Add corresponding unit tests under `tests`
- Run `python setup.py test` and make sure all unit tests pass
After that, you can submit the PR.
# Reference
- [A Simple but Tough-to-Beat Baseline for Sentence Embeddings[Sanjeev Arora and Yingyu Liang and Tengyu Ma, 2017]](https://openreview.net/forum?id=SyK00v5xx)
- [A comparison of four methods for computing text similarity [Yves Peirsman]](https://zhuanlan.zhihu.com/p/37104535)
- [On text matching and multi-turn retrieval](https://zhuanlan.zhihu.com/p/111769969)

BIN
docs/hf.png Normal file



@ -0,0 +1,538 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>SBERT.net Models</title>
<!-- Vue.js -->
<script src="https://cdnjs.cloudflare.com/ajax/libs/vue/2.6.12/vue.min.js" integrity="sha512-BKbSR+cfyxLdMAsE0naLReFSLg8/pjbgfxHh/k/kUC82Hy7r6HtR5hLhobaln2gcTvzkyyehrdREdjpsQwy2Jw==" crossorigin="anonymous" referrerpolicy="no-referrer"></script>
<!-- Bootstrap -->
<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css"
integrity="sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh" crossorigin="anonymous">
<script src="https://code.jquery.com/jquery-3.4.1.slim.min.js"
integrity="sha384-J6qa4849blE2+poT4WnyKhv5vZF5SrPo0iEjwBvKU7imGFAV0wwj1yYfoRSJoZ+n"
crossorigin="anonymous"></script>
<script src="https://cdn.jsdelivr.net/npm/popper.js@1.16.0/dist/umd/popper.min.js"
integrity="sha384-Q6E9RHvbIyZFJoft+2mJbHaEWldlvI9IOYy5n3zV9zzTtmI3UksdQRVvoxMfooAo"
crossorigin="anonymous"></script>
<script src="https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/js/bootstrap.min.js"
integrity="sha384-wfSDF2E50Y2D1uUdj0O3uMBJnjuUD4Ih7YwaYd1iqfktj0Uod8GCExl3Og8ifwB6"
crossorigin="anonymous"></script>
<!-- Axios -->
<!-- <script src="https://cdnjs.cloudflare.com/ajax/libs/axios/0.21.1/axios.min.js" integrity="sha512-bZS47S7sPOxkjU/4Bt0zrhEtWx0y0CRkhEp8IckzK+ltifIIE9EMIMTuT/mEzoIMewUINruDBIR/jJnbguonqQ==" crossorigin="anonymous" referrerpolicy="no-referrer"></script> -->
<!-- Font-awesome -->
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.15.3/css/all.min.css"
integrity="sha512-iBBXm8fW90+nuLcSKlbmrPcLa0OT92xO1BIsZ+ywDWZCvqsWgccV3gFoRBv0z+8dLJgyAHIhR35VZc2oM/gI1w=="
crossorigin="anonymous" referrerpolicy="no-referrer"/>
<!-- Lodash -->
<script src="https://cdnjs.cloudflare.com/ajax/libs/lodash.js/4.17.21/lodash.min.js"
integrity="sha512-WFN04846sdKMIP5LKNphMaWzU7YpMyCU245etK3g/2ARYbPK9Ub18eG+ljU96qKRCWh+quCY7yefSmlkQw1ANQ=="
crossorigin="anonymous" referrerpolicy="no-referrer"></script>
<style>
.fa-active {
color: #337ab7;
}
.header-cell {
cursor: pointer;
}
.models-table thead th {
position: sticky;
top: 0;
z-index: 1;
background-color: #ffffff;
}
.info-icon {
color: #007bff;
}
.info-icon:hover {
color: #0056b3;
}
.info-icon-model {
padding-left: 10px;
}
.bs-popover-auto[x-placement^=bottom], .bs-popover-bottom {
margin-top: .5rem;
}
.popover {
max-width: 400px;
}
</style>
</head>
<body>
<div id="app">
<table class="table table-bordered table-sm">
<thead>
<tr>
<th class="header-cell" @click="sortAsc = (sortBy=='name') ? sortAsc = !sortAsc : false; sortBy='name'">
<i class="fas fa-active" v-if="sortBy == 'name'" v-bind:class="{ 'fa-sort-amount-up': !sortAsc, 'fa-sort-amount-down-alt': sortAsc }"></i>
Model Name
</th>
<th class="header-cell text-center" @click="sortAsc = (sortBy=='stsb') ? sortAsc = !sortAsc : false; sortBy='stsb'">
<i class="fas fa-active" v-if="sortBy == 'stsb'" v-bind:class="{ 'fa-sort-amount-up': !sortAsc, 'fa-sort-amount-down-alt': sortAsc }"></i>
STSb
<span class="info-icon" data-trigger="hover" data-toggle="popover" title="STSbenchmark" data-content="Spearman-rank correlation on the STSbenchmark test set. Higher = Better" data-html="true" data-placement="bottom"><i class="fas fa-info-circle"></i></span>
</th>
<th class="header-cell text-center" @click="sortAsc = (sortBy=='dupq') ? sortAsc = !sortAsc : false; sortBy='dupq'">
<i class="fas fa-active" v-if="sortBy == 'dupq'" v-bind:class="{ 'fa-sort-amount-up': !sortAsc, 'fa-sort-amount-down-alt': sortAsc }"></i>
DupQ
<span class="info-icon" data-trigger="hover" data-toggle="popover" title="Duplicate Questions" data-content="Combination of two datasets for duplicate questions detection:<br>Mean-Average-Precision on the Quora Duplicate Questions Semantic Search test set.<br>Average-Precision on the Sprint duplicate questions test set.<br> Higher = Better" data-html="true" data-placement="bottom"><i class="fas fa-info-circle"></i></span>
</th>
<th class="header-cell text-center" @click="sortAsc = (sortBy=='TwitterP') ? sortAsc = !sortAsc : false; sortBy='TwitterP'">
<i class="fas fa-active" v-if="sortBy == 'TwitterP'" v-bind:class="{ 'fa-sort-amount-up': !sortAsc, 'fa-sort-amount-down-alt': sortAsc }"></i>
TwitterP
<span class="info-icon" data-trigger="hover" data-toggle="popover" title="Twitter Paraphrases" data-content="A test to find tweets that are considered paraphrases. Combination of the SemEval2015 Tweet paraphrase test set and the Twitter-URL-Corpus test set. Performance is measured using Average Precision.<br> Higher = Better" data-html="true" data-placement="bottom"><i class="fas fa-info-circle"></i></span>
</th>
<th class="header-cell text-center" @click="sortAsc = (sortBy=='scidocs') ? sortAsc = !sortAsc : false; sortBy='scidocs'">
<i class="fas fa-active" v-if="sortBy == 'scidocs'" v-bind:class="{ 'fa-sort-amount-up': !sortAsc, 'fa-sort-amount-down-alt': sortAsc }"></i>
SciDocs
<span class="info-icon" data-trigger="hover" data-toggle="popover" title="SciDocs" data-content="A test to find similar scientific publications given a paper title. From SciDocs, we use the information which papers are often co-cited, co-read, or co-viewed. Performance is measured using MAP.<br> Higher = Better" data-html="true" data-placement="bottom"><i class="fas fa-info-circle"></i></span>
</th>
<th class="header-cell text-center" @click="sortAsc = (sortBy=='clustering') ? sortAsc = !sortAsc : false; sortBy='clustering'">
<i class="fas fa-active" v-if="sortBy == 'clustering'" v-bind:class="{ 'fa-sort-amount-up': !sortAsc, 'fa-sort-amount-down-alt': sortAsc }"></i>
Clustering
<span class="info-icon" data-trigger="hover" data-toggle="popover" title="Clustering" data-content="Here we test how well the embeddings can be used for clustering. We use three datasets: email subjects from 20NewsGroups, titles from 199 popular subreddits, questions from 121 StackExchanges. We cluster different sentence collections with sentences from 10-50 categories. Performance is measured using V-Measure. <br> Higher = Better" data-html="true" data-placement="bottom"><i class="fas fa-info-circle"></i></span>
</th>
<th class="header-cell text-center" @click="sortAsc = (sortBy=='final_score') ? sortAsc = !sortAsc : false; sortBy='final_score'">
<i class="fas fa-active" v-if="sortBy == 'final_score'" v-bind:class="{ 'fa-sort-amount-up': !sortAsc, 'fa-sort-amount-down-alt': sortAsc }"></i>
Avg. Performance
<span class="info-icon" data-trigger="hover" data-toggle="popover" title="Average Performance" data-content="Average Performance over all tasks.<br>Higher = Better" data-html="true" data-placement="bottom"><i class="fas fa-info-circle"></i></span>
</th>
<th class="header-cell text-center" @click="sortAsc = (sortBy=='speed') ? sortAsc = !sortAsc : false; sortBy='speed'">
<i class="fas fa-active" v-if="sortBy == 'speed'" v-bind:class="{ 'fa-sort-amount-up': !sortAsc, 'fa-sort-amount-down-alt': sortAsc }"></i>
Speed
<span class="info-icon" data-trigger="hover" data-toggle="popover" title="Encoding Speed" data-content="Encoding speed (sentences / sec) on a V100 GPU.<br>Higher = Better" data-html="true" data-placement="bottom"><i class="fas fa-info-circle"></i></span>
</th>
</tr>
</thead>
<tbody>
<tr v-for="item in sortedModels">
<td>
{{ item.name }}
<span class="info-icon info-icon-model" data-trigger="hover" data-toggle="popover" :title="item.name" :data-content="'<b>Base Model:</b> '+item.base_model+'<br><b>Pooling:</b> '+item.pooling+'<br><b>Training Data:</b> '+item.training_data+'<br><b>Dimensions:</b> '+item.dim+'<br><b>Size:</b> '+item.size+' MB'" data-html="true" data-placement="bottom"><i class="fas fa-info-circle"></i></span>
</td>
<td class="text-center">{{ item.stsb.toFixed(2) }}</td>
<td class="text-center"><span :title="'Quora: '+item.qqp.toFixed(2)+' Sprint: '+item.sprint.toFixed(2)">{{ item.dupq.toFixed(2) }}</span></td>
<td class="text-center">{{ item.TwitterP.toFixed(2) }}</td>
<td class="text-center">{{ item.scidocs.toFixed(2) }}</td>
<td class="text-center"><span :title="'Newsgroups: '+item.newsgroups.toFixed(2)+' Reddit: '+item.reddit.toFixed(2)+' StackExchange: '+item.stackexchange.toFixed(2)">{{ item.clustering.toFixed(2) }}</span></td>
<td class="text-center">{{ item.final_score.toFixed(2) }}</td>
<td class="text-center">{{ item.speed }}</td>
</tr>
</tbody>
</table>
</div>
<script>
var app = new Vue({
el: '#app',
data: {
models: [
{
"name": "stsb-mpnet-base-v2",
"base_model": "microsoft/mpnet-base",
"pooling": "Mean Pooling",
"training_data": "NLI+STSb",
"stsb": 88.57,
"qqp": 79.74,
"sprint": 90.34,
"TwitterP": 75.35,
"scidocs": 72.48,
"newsgroups": 31.56,
"reddit": 39.52,
"stackexchange": 46.41,
"speed": 2800,
"size": 386,
"dim": 768
},
{
"name": "stsb-roberta-base-v2",
"base_model": "roberta-base",
"pooling": "Mean Pooling",
"training_data": "NLI+STSb",
"stsb": 87.21,
"qqp": 78.68,
"sprint": 86.42,
"TwitterP": 73.44,
"scidocs": 69.83,
"newsgroups": 26.87,
"reddit": 36.91,
"stackexchange": 45.48,
"speed": 2300,
"size": 440,
"dim": 768
},
{
"name": "stsb-distilroberta-base-v2",
"base_model": "distilroberta-base",
"pooling": "Mean Pooling",
"training_data": "NLI+STSb",
"stsb": 86.41,
"qqp": 78.13,
"sprint": 87.28,
"TwitterP": 73.68,
"scidocs": 69.85,
"newsgroups": 28.63,
"reddit": 38.26,
"stackexchange": 46.16,
"speed": 4000,
"size": 292,
"dim": 768
},
{
"name": "nli-mpnet-base-v2",
"base_model": "microsoft/mpnet-base",
"pooling": "Mean Pooling",
"training_data": "NLI",
"stsb": 86.53,
"qqp": 80.65,
"sprint": 85.79,
"TwitterP": 76.24,
"scidocs": 72.90,
"newsgroups": 36.56,
"reddit": 42.68,
"stackexchange": 50.90,
"speed": 2800,
"size": 385,
"dim": 768
},
{
"name": "nli-roberta-base-v2",
"base_model": "roberta-base",
"pooling": "Mean Pooling",
"training_data": "NLI",
"stsb": 85.54,
"qqp": 78.73,
"sprint": 81.67,
"TwitterP": 74.28,
"scidocs": 69.86,
"newsgroups": 31.28,
"reddit": 39.58,
"stackexchange": 49.51,
"speed": 2300,
"size": 440,
"dim": 768
},
{
"name": "nli-distilroberta-base-v2",
"base_model": "distilroberta-base",
"pooling": "Mean Pooling",
"training_data": "NLI",
"stsb": 84.38,
"qqp": 78.47,
"sprint": 83.03,
"TwitterP": 73.86,
"scidocs": 70.23,
"newsgroups": 31.87,
"reddit": 39.12,
"stackexchange": 49.27,
"speed": 4000,
"size": 292,
"dim": 768
},
{
"name": "average_word_embeddings_glove.6B.300d",
"base_model": "Word Embeddings: GloVe",
"pooling": "Mean Pooling",
"training_data": "-",
"stsb": 61.77,
"qqp": 69.18,
"sprint": 86.96,
"TwitterP": 68.60,
"scidocs": 63.69,
"newsgroups": 26.65,
"reddit": 28.37,
"stackexchange": 36.37,
"speed": 34000,
"size": 422,
"dim": 300
},
{
"name": "average_word_embeddings_komninos",
"base_model": "Word Embeddings: Komninos et al.",
"pooling": "Mean Pooling",
"training_data": "-",
"stsb": 61.56,
"qqp": 69.83,
"sprint": 85.55,
"TwitterP": 71.23,
"scidocs": 65.25,
"newsgroups": 27.53,
"reddit": 29.54,
"stackexchange": 39.35,
"speed": 22000,
"size": 237,
"dim": 300
},
/*{
"name": "average_word_embeddings_levy_dependency",
"base_model": "Word Embeddings: Levy et al.",
"pooling": "Mean Pooling",
"training_data": "-",
"stsb": 59.22,
"qqp": 64.62,
"sprint": 80.12,
"TwitterP": 70.79,
"scidocs": 60.04 ,
"newsgroups": 22.72,
"reddit": 24.23,
"stackexchange": 33.66,
"speed": 22000,
"size": 186,
"dim": 300
},*/
{
"name": "paraphrase-MiniLM-L12-v2",
"base_model": "microsoft/MiniLM-L12-H384-uncased",
"pooling": "Mean Pooling",
"training_data": "AllNLI, sentence-compression, SimpleWiki, altlex, msmarco-triplets, quora_duplicates, coco_captions,flickr30k_captions, yahoo_answers_title_question, S2ORC_citation_pairs, stackexchange_duplicate_questions, wiki-atomic-edits",
"stsb": 84.41,
"qqp": 84.64,
"sprint": 89.91,
"TwitterP": 75.34,
"scidocs": 80.08,
"newsgroups": 41.81,
"reddit": 44.42,
"stackexchange": 54.63,
"speed": 7500,
"size": 118,
"dim": 384
},
{
"name": "paraphrase-MiniLM-L6-v2",
"base_model": "nreimers/MiniLM-L6-H384-uncased",
"pooling": "Mean Pooling",
"training_data": "AllNLI, sentence-compression, SimpleWiki, altlex, msmarco-triplets, quora_duplicates, coco_captions,flickr30k_captions, yahoo_answers_title_question, S2ORC_citation_pairs, stackexchange_duplicate_questions, wiki-atomic-edits",
"stsb": 84.12,
"qqp": 84.25,
"sprint": 90.21,
"TwitterP": 76.32,
"scidocs": 78.91,
"newsgroups": 40.16,
"reddit": 42.71,
"stackexchange": 53.14,
"speed": 14200,
"size": 80,
"dim": 384
},
{
"name": "paraphrase-MiniLM-L3-v2",
"base_model": "nreimers/MiniLM-L3-H384-uncased",
"pooling": "Mean Pooling",
"training_data": "AllNLI, sentence-compression, SimpleWiki, altlex, msmarco-triplets, quora_duplicates, coco_captions,flickr30k_captions, yahoo_answers_title_question, S2ORC_citation_pairs, stackexchange_duplicate_questions, wiki-atomic-edits",
"stsb": 82.41,
"qqp": 83.29,
"sprint": 92.9,
"TwitterP": 76.14,
"scidocs": 77.71,
"newsgroups": 37.73,
"reddit": 41.18,
"stackexchange": 51.25,
"speed": 19000,
"size": 61,
"dim": 384
},
{
"name": "paraphrase-distilroberta-base-v2",
"base_model": "distilroberta-base",
"pooling": "Mean Pooling",
"training_data": "AllNLI, sentence-compression, SimpleWiki, altlex, msmarco-triplets, quora_duplicates, coco_captions,flickr30k_captions, yahoo_answers_title_question, S2ORC_citation_pairs, stackexchange_duplicate_questions, wiki-atomic-edits",
"stsb": 85.37,
"qqp": 84.31,
"sprint": 89.64,
"TwitterP": 73.96,
"scidocs": 80.25,
"newsgroups": 42.12,
"reddit": 47.53,
"stackexchange": 57.90,
"speed": 4000,
"size": 292,
"dim": 768
},
{
"name": "paraphrase-TinyBERT-L6-v2",
"base_model": "https://huggingface.co/nreimers/TinyBERT_L-6_H-768_v2",
"pooling": "Mean Pooling",
"training_data": "AllNLI, sentence-compression, SimpleWiki, altlex, msmarco-triplets, quora_duplicates, coco_captions,flickr30k_captions, yahoo_answers_title_question, S2ORC_citation_pairs, stackexchange_duplicate_questions, wiki-atomic-edits",
"stsb": 84.91,
"qqp": 84.23,
"sprint": 89.62,
"TwitterP": 75.39,
"scidocs": 81.51,
"newsgroups": 43.82,
"reddit": 44.61,
"stackexchange": 55.69,
"speed": 4500,
"size": 238,
"dim": 768
},
{
"name": "paraphrase-mpnet-base-v2",
"base_model": "microsoft/mpnet-base",
"pooling": "Mean Pooling",
"training_data": "AllNLI, sentence-compression, SimpleWiki, altlex, msmarco-triplets, quora_duplicates, coco_captions,flickr30k_captions, yahoo_answers_title_question, S2ORC_citation_pairs, stackexchange_duplicate_questions, wiki-atomic-edits",
"stsb": 86.99,
"qqp": 84.93,
"sprint": 90.67,
"TwitterP": 76.05,
"scidocs": 80.57,
"newsgroups": 48.13,
"reddit": 50.52,
"stackexchange": 59.79,
"speed": 2800,
"size": 387,
"dim": 768
},
/*{
"name": "paraphrase-albert-base-v2",
"base_model": "albert-base-v2",
"pooling": "Mean Pooling",
"training_data": "AllNLI, sentence-compression, SimpleWiki, altlex, msmarco-triplets, quora_duplicates, coco_captions,flickr30k_captions, yahoo_answers_title_question, S2ORC_citation_pairs, stackexchange_duplicate_questions, wiki-atomic-edits",
"stsb": 83.38,
"qqp": 83.28,
"sprint": 90.45,
"TwitterP": 74.83,
"scidocs": 77.83,
"newsgroups": 39.88,
"reddit": 42.64,
"stackexchange": 52.95,
"speed": 2400,
"size": 43,
"dim": 768
},*/
{
"name": "paraphrase-albert-small-v2",
"base_model": "nreimers/albert-small-v2",
"pooling": "Mean Pooling",
"training_data": "AllNLI, sentence-compression, SimpleWiki, altlex, msmarco-triplets, quora_duplicates, coco_captions,flickr30k_captions, yahoo_answers_title_question, S2ORC_citation_pairs, stackexchange_duplicate_questions, wiki-atomic-edits",
"stsb": 83.40,
"qqp": 83.54,
"sprint": 89.6,
"TwitterP": 74.51,
"scidocs": 80.28,
"newsgroups": 40.54,
"reddit": 41.54,
"stackexchange": 52.74,
"speed": 5000,
"size": 43,
"dim": 768
},
{
"name": "paraphrase-multilingual-mpnet-base-v2",
"base_model": "Teacher: paraphrase-mpnet-base-v2; Student: xlm-roberta-base",
"pooling": "Mean Pooling",
"training_data": "Multi-lingual model of paraphrase-mpnet-base-v2, extended to 50+ languages.",
"stsb": 86.82,
"qqp": 83.91,
"sprint": 91.1,
"TwitterP": 76.52,
"scidocs": 78.66,
"newsgroups": 44.65,
"reddit": 45.01,
"stackexchange": 52.73,
"speed": 2500,
"size": 969,
"dim": 768
},
{
"name": "paraphrase-multilingual-MiniLM-L12-v2",
"base_model": "Teacher: paraphrase-MiniLM-L12-v2; Student: microsoft/Multilingual-MiniLM-L12-H384",
"pooling": "Mean Pooling",
"training_data": "Multi-lingual model of paraphrase-multilingual-MiniLM-L12-v2, extended to 50+ languages.",
"stsb": 84.42,
"qqp": 83.89,
"sprint": 91.15,
"TwitterP": 74.94,
"scidocs": 78.27,
"newsgroups": 40.36,
"reddit": 41.49,
"stackexchange": 49.75,
"speed": 7500,
"size": 418,
"dim": 384
},
{
"name": "distiluse-base-multilingual-cased-v1",
"base_model": "Teacher: mUSE; Student: distilbert-base-multilingual",
"pooling": "Mean Pooling",
"training_data": "Multi-Lingual model of Universal Sentence Encoder for 15 languages: Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish.",
"stsb": 80.62,
"qqp": 81.10,
"sprint": 88.54,
"TwitterP": 76.24,
"scidocs": 70.41,
"newsgroups": 32.97,
"reddit": 42.93,
"stackexchange": 44.30,
"speed": 4000,
"size": 482,
"dim": 512
},
{
"name": "distiluse-base-multilingual-cased-v2",
"base_model": "Teacher: mUSE; Student: distilbert-base-multilingual",
"pooling": "Mean Pooling",
"training_data": "Multi-Lingual model of Universal Sentence Encoder for 50 languages.",
"stsb": 80.75,
"qqp": 79.89,
"sprint": 87.15,
"TwitterP": 76.26,
"scidocs": 70.39,
"newsgroups": 29.96,
"reddit": 39.95,
"stackexchange": 41.19,
"speed": 4000,
"size": 481,
"dim": 512
}
],
sortBy: 'final_score',
sortAsc: false
},
methods: {
},
computed: {
sortedModels: function() {
//Add avg. for duplicate questions
let models_ext = this.models.map(function(elem, index) { elem.dupq = (elem.qqp + elem.sprint)/2.0; return elem;} );
//Add avg. for clustering
models_ext = models_ext.map(function(elem, index) { elem.clustering = (elem.newsgroups + elem.reddit + elem.stackexchange)/3.0; return elem;} );
//Final score
models_ext = models_ext.map(function(elem, index) { elem.final_score = (elem.stsb + elem.dupq + elem.TwitterP + elem.scidocs + elem.clustering)/ 5.0; return elem;} );
return _.orderBy(models_ext, (item) => item[this.sortBy] || (this.sortAsc ? 9999 : -9999), this.sortAsc ? 'asc' : 'desc')
}
}
})
</script>
<script>
$(function () {
$('[data-toggle="popover"]').popover()
})
</script>
</body>
</html>

BIN
docs/wechat.jpeg Normal file


21
examples/base_demo.py Normal file

@ -0,0 +1,21 @@
# -*- coding: utf-8 -*-
"""
@author:XuMing(xuming624@qq.com)
@description:
This basic example loads a pre-trained model from the web and uses it to
generate sentence embeddings for a given list of sentences.
"""
import sys
sys.path.append('..')
from similarities import BertSimilarity
if __name__ == '__main__':
    model = BertSimilarity("shibing624/text2vec-base-chinese")  # Chinese sentence embedding model (CoSENT)
    # Embed a list of sentences
    sentences = ['如何更换花呗绑定银行卡',
                 '花呗更改绑定银行卡']
    sentence_embeddings = model.encode(sentences)
    print(type(sentence_embeddings), sentence_embeddings.shape)
    similarity_score = model.similarity_score([sentences[0]], [sentences[1]])
    print(similarity_score)


@ -0,0 +1,19 @@
# -*- coding: utf-8 -*-
"""
@author:XuMing(xuming624@qq.com)
@description:
This basic example loads a pre-trained model from the web and uses it to
generate sentence embeddings for a given list of sentences.
"""
import sys
sys.path.append('..')
from similarities import BertSimilarity
model = BertSimilarity("shibing624/text2vec-base-chinese")  # Chinese sentence embedding model (CoSENT)
# Embed a list of sentences
sentences = ['如何更换花呗绑定银行卡',
'花呗更改绑定银行卡']
sentence_embeddings = model.encode(sentences)
print(type(sentence_embeddings), sentence_embeddings.shape)

40
examples/gradio_demo.py Normal file

@ -0,0 +1,40 @@
# -*- coding: utf-8 -*-
"""
@author:XuMing(xuming624@qq.com)
@description: pip install gradio
"""
import gradio as gr
from similarities import BertSimilarity
# Chinese sentence embedding model (CoSENT)
sim_model = BertSimilarity(model_name_or_path='shibing624/text2vec-base-chinese')


def ai_text(sentence1, sentence2):
    score = sim_model.similarity_score(sentence1, sentence2)
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentence1, sentence2, score))
    return score


if __name__ == '__main__':
    examples = [
        ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡'],
        ['我在北京打篮球', '我是北京人,我喜欢篮球'],
        ['一个女人在看书。', '一个女人在揉面团'],
        ['一个男人在车库里举重。', '一个人在举重。'],
    ]
    input1 = gr.inputs.Textbox(lines=2, placeholder="Enter First Sentence")
    input2 = gr.inputs.Textbox(lines=2, placeholder="Enter Second Sentence")
    output_text = gr.outputs.Textbox()
    gr.Interface(ai_text,
                 inputs=[input1, input2],
                 outputs=[output_text],
                 # theme="grass",
                 title="Chinese Text Matching Model shibing624/text2vec-base-chinese",
                 description="Copy or input Chinese text here. Submit and the machine will calculate the cosine score.",
                 article="Link to <a href='https://github.com/shibing624/similarities' style='color:blue;' target='_blank'>Github REPO</a>",
                 examples=examples
                 ).launch()

8
requirements.txt Normal file

@ -0,0 +1,8 @@
jieba>=0.39
loguru
transformers>=4.6.0
tokenizers>=0.10.3
tqdm
numpy
scikit-learn
gensim>=4.0.0

51
setup.py Normal file

@ -0,0 +1,51 @@
# -*- coding: utf-8 -*-
import sys
from setuptools import setup, find_packages
# Avoids IDE errors, but actual version is read from version.py
__version__ = None
exec(open('similarities/version.py').read())
if sys.version_info < (3,):
    sys.exit('Sorry, Python3 is required.')

with open('README.md', 'r', encoding='utf-8') as f:
    readme = f.read()

with open('requirements.txt', 'r', encoding='utf-8') as f:
    reqs = f.read()

setup(
    name='similarities',
    version=__version__,
    description='Similarities is a toolkit for computing similarity scores between two sets of strings.',
    long_description=readme,
    long_description_content_type='text/markdown',
    author='XuMing',
    author_email='xuming624@qq.com',
    url='https://github.com/shibing624/similarities',
    license="Apache License 2.0",
    zip_safe=False,
    python_requires=">=3.6.0",
    classifiers=[
        "Development Status :: 5 - Production/Stable",
        "Intended Audience :: Developers",
        "Intended Audience :: Education",
        "Intended Audience :: Science/Research",
        "License :: OSI Approved :: Apache Software License",
        "Operating System :: OS Independent",
        "Programming Language :: Python :: 3",
        "Programming Language :: Python :: 3.6",
        "Programming Language :: Python :: 3.7",
        "Programming Language :: Python :: 3.8",
        "Programming Language :: Python :: 3.9",
        "Topic :: Scientific/Engineering :: Artificial Intelligence",
    ],
    keywords='similarities,Chinese Text Similarity Calculation Tool,similarity,word2vec',
    install_requires=reqs.strip().split('\n'),
    packages=find_packages(exclude=['tests']),
    package_dir={'similarities': 'similarities'},
    package_data={'similarities': ['*.*', '../LICENSE', '../README.*', '../*.txt', 'utils/*',
                                   'data/*', ]},
)

7
similarities/__init__.py Normal file

@ -0,0 +1,7 @@
# -*- coding: utf-8 -*-
"""
@author:XuMing(xuming624@qq.com)
@description:
"""
from .similarity import BertSimilarity


@ -0,0 +1,34 @@
# -*- coding: utf-8 -*-
"""
@author:XuMing(xuming624@qq.com)
@description:
"""
from typing import List, Union, Optional
import numpy as np
from numpy import ndarray
from torch import Tensor
from loguru import logger
class BertSimilarity:
    def __init__(self, model_name_or_path=''):
        """
        Compute text similarity.
        :param model_name_or_path: name or path of the sentence embedding model
        """
        self.model_name_or_path = model_name_or_path
        self.model = None

    def encode(self, sentences: Union[List[str], str]) -> ndarray:
        # Placeholder: returns an empty array until the embedding model is wired in
        return np.array([])

    def similarity_score(self, sentences1: Union[List[str], str], sentences2: Union[List[str], str]):
        """
        Get similarity scores between sentences1 and sentences2
        :param sentences1: list, sentence1 list
        :param sentences2: list, sentence2 list
        :return: Matrix with res[i][j] = cos_sim(a[i], b[j])
        """
        # Placeholder: returns 0.0 until the embedding model is wired in
        return 0.0

7
similarities/version.py Normal file

@ -0,0 +1,7 @@
# -*- coding: utf-8 -*-
"""
@author:XuMing(xuming624@qq.com)
@description:
"""
__version__ = '0.0.1'

27
tests/test_sim_score.py Normal file

@ -0,0 +1,27 @@
# -*- coding: utf-8 -*-
"""
@author:XuMing(xuming624@qq.com)
@description:
"""
import sys
import unittest
sys.path.append('..')
from similarities import BertSimilarity
bert_model = BertSimilarity()
class IssueTestCase(unittest.TestCase):
    def test_sim_diff(self):
        a = '研究团队面向国家重大战略需求追踪国际前沿发展借鉴国际人工智能研究领域的科研模式有效整合创新资源解决复'
        b = '英汉互译比较语言学'
        r = bert_model.similarity_score(a, b)
        print(a, b, r)
        self.assertTrue(abs(r - 0.1733) < 0.001)


if __name__ == '__main__':
    unittest.main()