init similarities project.
parent a88748119f
commit 487469a423
28 .github/ISSUE_TEMPLATE/bug-report.md vendored Normal file
@@ -0,0 +1,28 @@
---
name: Bug Report
about: Create a report to help us improve
title: ''
labels: bug
assignees: ''
---

### Describe the bug

Please provide a clear and concise description of what the bug is. If applicable, add screenshots to help explain your problem, especially for visualization-related problems.

### To Reproduce

Please provide a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve) here. We hope we can simply copy/paste/run it. It is also nice to share a hosted runnable script (e.g. Google Colab), especially for hardware-related problems.

### Describe your attempts

- [ ] I checked the documentation and found no answer
- [ ] I checked to make sure that this is not a duplicate issue

You should also provide the code snippets you tried as a workaround, the StackOverflow solutions you walked through, or your best guess of the cause if you can't locate it (e.g. cosmic radiation).

### Context

- **OS** [e.g. Windows 10, macOS 10.14]:
- **Hardware** [e.g. CPU only, GTX 1080 Ti]:

### Additional Information

Other things you want the developers to know.
|
23 .github/ISSUE_TEMPLATE/feature-request.md vendored Normal file
@@ -0,0 +1,23 @@
---
name: Feature Request
about: Suggest an idea for this project
title: ''
labels: enhancement
assignees: ''
---

- [ ] I checked to make sure that this is not a duplicate issue
- [ ] I'm submitting the request to the correct repository (for model requests)

### Is your feature request related to a problem? Please describe.

A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

### Describe the solution you'd like

A clear and concise description of what you want to happen.

### Describe alternatives you've considered

A clear and concise description of any alternative solutions or features you've considered.

### Additional Information

Other things you want the developers to know.
|
18 .github/ISSUE_TEMPLATE/usage-question.md vendored Normal file
@@ -0,0 +1,18 @@
---
name: Usage Question
about: Ask a question about similarities usage
title: ''
labels: question
assignees: ''
---

### Describe the Question

Please provide a clear and concise description of what the question is.

### Describe your attempts

- [ ] I walked through the tutorials
- [ ] I checked the documentation
- [ ] I checked to make sure that this is not a duplicate question

You may also provide a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve) you tried as a workaround, or a StackOverflow solution that you have walked through.
|
17 .github/stale.yml vendored Normal file
@@ -0,0 +1,17 @@
# Number of days of inactivity before an issue becomes stale
daysUntilStale: 60
# Number of days of inactivity before a stale issue is closed
daysUntilClose: 7
# Issues with these labels will never be considered stale
exemptLabels:
  - pinned
  - security
# Label to use when marking an issue as stale
staleLabel: wontfix
# Comment to post when marking an issue as stale. Set to `false` to disable
markComment: >
  This issue has been automatically marked as stale because it has not had
  recent activity. It will be closed if no further activity occurs. Thank you
  for your contributions. (Closed automatically by the bot after long inactivity; feel free to open a new issue if needed.)
# Comment to post when closing a stale issue. Set to `false` to disable
closeComment: false
|
1 .gitignore vendored
@@ -127,3 +127,4 @@ dmypy.json

# Pyre type checker
.pyre/
.idea/
9 CONTRIBUTING.md Normal file
@@ -0,0 +1,9 @@
# Contributing

We are happy to accept your contributions to make `similarities` better and more awesome! To avoid unnecessary work on either side, please stick to the following process:

1. Check if there is already [an issue](https://github.com/shibing624/similarities/issues) for your concern.
2. If there is not, open a new one to start a discussion. We hate to close finished PRs!
3. If we decide your concern needs code changes, we would be happy to accept a pull request. Please consider the commit guidelines below.
183 README.md
@@ -1,2 +1,181 @@
# Similarities

[![PyPI version](https://badge.fury.io/py/similarities.svg)](https://badge.fury.io/py/similarities)
[![Downloads](https://pepy.tech/badge/similarities)](https://pepy.tech/project/similarities)
[![Contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](CONTRIBUTING.md)
[![GitHub contributors](https://img.shields.io/github/contributors/shibing624/similarities.svg)](https://github.com/shibing624/similarities/graphs/contributors)
[![License Apache 2.0](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE)
[![python_version](https://img.shields.io/badge/Python-3.5%2B-green.svg)](requirements.txt)
[![GitHub issues](https://img.shields.io/github/issues/shibing624/similarities.svg)](https://github.com/shibing624/similarities/issues)
[![Wechat Group](http://vlog.sfyc.ltd/wechat_everyday/wxgroup_logo.png?imageView2/0/w/60/h/20)](#Contact)

Similarities is a toolkit for computing similarity scores between texts, implementing a variety of literal and semantic matching models.

**similarities** implements multiple text-representation and text-similarity models (Word2Vec, RankBM25, BERT, Sentence-BERT, CoSENT) and compares their performance on text semantic matching (similarity computation) tasks.
**Guide**

- [Feature](#Feature)
- [Evaluate](#Evaluate)
- [Install](#install)
- [Usage](#usage)
- [Contact](#Contact)
- [Citation](#Citation)
- [Reference](#reference)
# Feature

### Text embedding models

- [Word2Vec](similarities/word2vec.py): word-vector retrieval based on Tencent AI Lab's large-scale, high-quality Chinese [word-embedding data (light version with 8 million Chinese words)](https://pan.baidu.com/s/1La4U4XNFe8s5BJqxPQpeiQ) (file name: light_Tencent_AILab_ChineseEmbedding.bin, password: tawe). This project builds the word2vec sentence representation by averaging the word vectors.
- [SBert (Sentence-BERT)](similarities/sentence_bert): a sentence-embedding model that trades off accuracy against efficiency. Training fits a supervised classification head on top; at prediction time the similarity is simply the cosine of the two sentence vectors. This project re-implements Sentence-BERT training and prediction in PyTorch.
- [CoSENT (Cosine Sentence)](similarities/cosent): CoSENT proposes a ranking loss that brings training closer to prediction, so the model converges faster and performs better than Sentence-BERT. This project implements CoSENT training and prediction in PyTorch.
### Text similarity measures

- Cosine similarity: the cosine of the angle between the two vectors
- Dot product: the inner product of the two L2-normalized vectors
- Word Mover's Distance: uses the word vectors of both texts and measures the minimum distance the words of one text must travel in semantic space to reach the words of the other
- [RankBM25](similarities/bm25.py): a BM25 variant that scores the similarity between a query and the documents and returns a ranking of the docs
- [SemanticSearch](https://github.com/shibing624/similarities/blob/master/similarities/sbert.py#L80): vector similarity retrieval using cosine similarity + top-k, an order of magnitude faster than brute-force pairwise computation
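The first two measures above are two views of the same quantity: once the vectors are L2-normalized, their plain dot product equals the cosine similarity. A minimal pure-Python sketch (illustrative only, not the toolkit's own code):

```python
import math

def cosine_similarity(a, b):
    # cos(a, b) = a . b / (||a|| * ||b||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

a, b = [1.0, 2.0, 3.0], [2.0, 4.0, 5.0]
# Dot product of the normalized vectors coincides with cosine similarity.
dot_of_normalized = sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))
assert abs(cosine_similarity(a, b) - dot_of_normalized) < 1e-9
```

Identical directions score 1 and opposite directions score -1, which is why the `score` values later in this README fall in the range [-1, 1].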
# Evaluate

### Text matching

- Results on the English matching dataset:

| Arch | Backbone | Model Name | English-STS-B |
| :-- | :--- | :--- | :-: |
| GloVe | glove | Avg_word_embeddings_glove_6B_300d | 61.77 |
| BERT | bert-base-uncased | BERT-base-cls | 20.29 |
| BERT | bert-base-uncased | BERT-base-first_last_avg | 59.04 |
| BERT | bert-base-uncased | BERT-base-first_last_avg-whiten(NLI) | 63.65 |
| SBERT | sentence-transformers/bert-base-nli-mean-tokens | SBERT-base-nli-cls | 73.65 |
| SBERT | sentence-transformers/bert-base-nli-mean-tokens | SBERT-base-nli-first_last_avg | 77.96 |
| SBERT | xlm-roberta-base | paraphrase-multilingual-MiniLM-L12-v2 | 84.42 |
| CoSENT | bert-base-uncased | CoSENT-base-first_last_avg | 69.93 |
| CoSENT | sentence-transformers/bert-base-nli-mean-tokens | CoSENT-base-nli-first_last_avg | 79.68 |

- Results on the Chinese matching datasets:

| Arch | Backbone | Model Name | ATEC | BQ | LCQMC | PAWSX | STS-B | Avg | QPS |
| :-- | :--- | :--- | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| CoSENT | hfl/chinese-macbert-base | CoSENT-macbert-base | 50.39 | **72.93** | **79.17** | **60.86** | **80.51** | **68.77** | 2572 |
| CoSENT | Langboat/mengzi-bert-base | CoSENT-mengzi-base | **50.52** | 72.27 | 78.69 | 12.89 | 80.15 | 58.90 | 2502 |
| CoSENT | bert-base-chinese | CoSENT-bert-base | 49.74 | 72.38 | 78.69 | 60.00 | 80.14 | 68.19 | 2653 |
| SBERT | bert-base-chinese | SBERT-bert-base | 46.36 | 70.36 | 78.72 | 46.86 | 66.41 | 61.74 | 1365 |
| SBERT | hfl/chinese-macbert-base | SBERT-macbert-base | 47.28 | 68.63 | **79.42** | 55.59 | 64.82 | 63.15 | 1948 |
| CoSENT | hfl/chinese-roberta-wwm-ext | CoSENT-roberta-ext | **50.81** | **71.45** | **79.31** | **61.56** | **81.13** | **68.85** | - |
| SBERT | hfl/chinese-roberta-wwm-ext | SBERT-roberta-ext | 48.29 | 69.99 | 79.22 | 44.10 | 72.42 | 62.80 | - |

- Chinese matching results of the models released by this project:

| Arch | Backbone | Model Name | ATEC | BQ | LCQMC | PAWSX | STS-B | Avg | QPS |
| :-- | :--- | :---- | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| Word2Vec | word2vec | w2v-light-tencent-chinese | 20.00 | 31.49 | 59.46 | 2.57 | 55.78 | 33.86 | 10283 |
| SBERT | xlm-roberta-base | paraphrase-multilingual-MiniLM-L12-v2 | 18.42 | 38.52 | 63.96 | 10.14 | 78.90 | 41.99 | 2371 |
| CoSENT | hfl/chinese-macbert-base | similarities-base-chinese | 31.93 | 42.67 | 70.16 | 17.21 | 79.30 | **48.25** | 2572 |

Notes:
- All result values are Spearman coefficients.
- Each result is obtained by training only on that dataset's train split and evaluating on its test split; no external data is used.
- `paraphrase-multilingual-MiniLM-L12-v2`, full name `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`, is the multilingual version of `paraphrase-MiniLM-L12-v2`; it is fast, effective, and supports Chinese.
- `CoSENT-macbert-base` reaches SOTA among models of the same parameter count. It is trained with the CoSENT method; the results can be reproduced by running the code under [similarities/cosent](similarities/cosent).
- `SBERT-macbert-base` is trained with the SBERT method; the results can be reproduced by running the code under [similarities/sentence_bert](similarities/sentence_bert).
- `similarities-base-chinese` is trained with the CoSENT method on the Chinese STS-B data, based on MacBERT. The model file has been uploaded to the HuggingFace model hub: [shibing624/similarities-base-chinese](https://huggingface.co/shibing624/similarities-base-chinese).
- `w2v-light-tencent-chinese` is the Word2Vec model of the Tencent word embeddings and runs on CPU.
- Each pretrained model can be loaded through transformers, e.g. the MacBERT model: `--pretrained_model_path hfl/chinese-macbert-base`.
- Download links for the Chinese matching datasets are [given below](#datasets).
- Experiments on the Chinese matching tasks show that the best pooling is `first_last_avg`; at prediction time SBert's `mean pooling` can be used with very little loss of quality.
- The GPU environment for the QPS test is a Tesla V100 with 32 GB of memory.
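Since all the scores above are Spearman coefficients, it may help to recall how that metric is computed: rank both score lists, then take the Pearson correlation of the ranks. A minimal pure-Python sketch, without tie handling, for illustration only:

```python
def spearman(xs, ys):
    # Spearman's rho = Pearson correlation of the two rank sequences.
    # Ties are not handled here, for brevity.
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r

    def pearson(a, b):
        n = len(a)
        ma, mb = sum(a) / n, sum(b) / n
        cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
        sa = sum((x - ma) ** 2 for x in a) ** 0.5
        sb = sum((y - mb) ** 2 for y in b) ** 0.5
        return cov / (sa * sb)

    return pearson(ranks(xs), ranks(ys))

# Gold labels vs. model scores: perfectly monotonic, so rho = 1.0
assert abs(spearman([0.0, 0.2, 0.6, 1.0], [0.11, 0.35, 0.70, 0.92]) - 1.0) < 1e-9
```

Because only the ranks matter, the metric rewards a model whose scores order sentence pairs correctly, regardless of the absolute score scale.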
# Demo

Official demo: http://42.193.145.218/product/short_text_sim/

HuggingFace demo: https://huggingface.co/spaces/shibing624/similarities

![](docs/hf.png)

# Install

```
pip3 install -U similarities
```

or

```
git clone https://github.com/shibing624/similarities.git
cd similarities
python3 setup.py install
```

### Datasets

Five common Chinese semantic matching datasets: [ATEC](https://github.com/IceFlameWorm/NLP_Datasets/tree/master/ATEC), [BQ](http://icrc.hitsz.edu.cn/info/1037/1162.htm), [LCQMC](http://icrc.hitsz.edu.cn/Article/show/171.html), [PAWSX](https://arxiv.org/abs/1908.11828), and [STS-B](https://github.com/pluto-junzeng/CNSD).

You can download each dataset from its link above, or from [Baidu Netdisk (extraction code: qkt6)](https://pan.baidu.com/s/1d6jSiU1wHQAEMWJi7JJWCQ).

The senteval_cn directory is the collection of evaluation datasets, and senteval_cn.zip is an archive of that directory; downloading either one is enough.
# Usage

### 1. Compute text embeddings
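For the Word2Vec representation described in the Feature section, a text embedding is simply the average of the text's word vectors. A pure-Python sketch with a made-up toy vocabulary (the real model loads light_Tencent_AILab_ChineseEmbedding.bin instead):

```python
def sentence_vector(tokens, word_vectors):
    # Average the vectors of the in-vocabulary tokens; out-of-vocabulary
    # tokens are skipped, and no matches at all yields a zero vector.
    dim = len(next(iter(word_vectors.values())))
    hits = [word_vectors[t] for t in tokens if t in word_vectors]
    if not hits:
        return [0.0] * dim
    return [sum(col) / len(hits) for col in zip(*hits)]

# Toy 3-dimensional vocabulary; the values are made up for illustration.
w2v = {
    "我": [0.1, 0.2, 0.3],
    "喜欢": [0.4, 0.0, 0.2],
    "猫": [0.3, 0.1, 0.5],
}
vec = sentence_vector(["我", "喜欢", "猫"], w2v)
```

This averaging step is what the Word2Vec model here performs after looking each token up in the Tencent embedding table.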
### 2. Compute the similarity score between two sentences

Example: [semantic_text_similarity.py](./examples/semantic_text_similarity.py)

> The sentence cosine similarity `score` is in the range [-1, 1]; the larger the value, the more similar the sentences.
### 3. Compute similarity scores between a sentence and a set of documents

This typically finds the texts in a candidate document set that are most similar to a query, and is commonly used for question matching in QA scenarios and for text similarity retrieval.

> `Score` is in the range [-1, 1]; the larger the value, the more similar the query is to the corpus text.
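Conceptually this is a query-vs-corpus ranking: embed the query, score it against every corpus embedding, and keep the top-k hits. A brute-force pure-Python sketch, assuming the sentences are already embedded (the project's SemanticSearch does the same cosine + top-k ranking in efficient batched form):

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def semantic_search(query_vec, corpus_vecs, top_k=3):
    # Score the query against every corpus embedding, then keep the
    # top_k (corpus_index, score) pairs, best match first.
    scored = [(i, cos_sim(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

corpus = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
hits = semantic_search([1.0, 0.0], corpus, top_k=2)  # best: corpus[0], then corpus[2]
```

Each hit pairs a corpus index with its `Score`, so the caller can map the indices back to the original candidate texts.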
# Contact

- Issues (suggestions): [![GitHub issues](https://img.shields.io/github/issues/shibing624/similarities.svg)](https://github.com/shibing624/similarities/issues)
- Email me: xuming, xuming624@qq.com
- WeChat me: add my *WeChat ID: xuming624, with the note: name-company-NLP* to join the NLP discussion group.

<img src="docs/wechat.jpeg" width="200" />
# Citation

If you use similarities in your research, please cite it as follows:

```latex
@misc{similarities,
  title={similarities: A Tool for Computing Similarity Scores},
  author={Ming Xu},
  howpublished={https://github.com/shibing624/similarities},
  year={2022}
}
```
# License

The license is [The Apache License 2.0](/LICENSE), free for commercial use. Please include a link to similarities and the license in your product documentation.

# Contribute

The project code is still rough; if you have improvements, you are welcome to submit them back to this project. Before submitting, please note the following two points:

- Add corresponding unit tests in `tests`
- Run all unit tests with `python setup.py test` and make sure they all pass

Then you can submit a PR.
# Reference

- [A Simple but Tough-to-Beat Baseline for Sentence Embeddings [Sanjeev Arora, Yingyu Liang, and Tengyu Ma, 2017]](https://openreview.net/forum?id=SyK00v5xx)
- [A comparison of four methods for computing text similarity [Yves Peirsman]](https://zhuanlan.zhihu.com/p/37104535)
- [On text matching and multi-turn retrieval](https://zhuanlan.zhihu.com/p/111769969)

BIN docs/hf.png Normal file
Binary file not shown.
After Width: | Height: | Size: 238 KiB
538 docs/models_en_sentence_embeddings.html Normal file
@@ -0,0 +1,538 @@
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>SBERT.net Models</title>
    <!-- Vue.js -->
    <script src="https://cdnjs.cloudflare.com/ajax/libs/vue/2.6.12/vue.min.js" integrity="sha512-BKbSR+cfyxLdMAsE0naLReFSLg8/pjbgfxHh/k/kUC82Hy7r6HtR5hLhobaln2gcTvzkyyehrdREdjpsQwy2Jw==" crossorigin="anonymous" referrerpolicy="no-referrer"></script>

    <!-- Bootstrap -->
    <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css"
          integrity="sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh" crossorigin="anonymous">

    <script src="https://code.jquery.com/jquery-3.4.1.slim.min.js"
            integrity="sha384-J6qa4849blE2+poT4WnyKhv5vZF5SrPo0iEjwBvKU7imGFAV0wwj1yYfoRSJoZ+n"
            crossorigin="anonymous"></script>
    <script src="https://cdn.jsdelivr.net/npm/popper.js@1.16.0/dist/umd/popper.min.js"
            integrity="sha384-Q6E9RHvbIyZFJoft+2mJbHaEWldlvI9IOYy5n3zV9zzTtmI3UksdQRVvoxMfooAo"
            crossorigin="anonymous"></script>
    <script src="https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/js/bootstrap.min.js"
            integrity="sha384-wfSDF2E50Y2D1uUdj0O3uMBJnjuUD4Ih7YwaYd1iqfktj0Uod8GCExl3Og8ifwB6"
            crossorigin="anonymous"></script>

    <!-- Axios -->
    <!-- <script src="https://cdnjs.cloudflare.com/ajax/libs/axios/0.21.1/axios.min.js" integrity="sha512-bZS47S7sPOxkjU/4Bt0zrhEtWx0y0CRkhEp8IckzK+ltifIIE9EMIMTuT/mEzoIMewUINruDBIR/jJnbguonqQ==" crossorigin="anonymous" referrerpolicy="no-referrer"></script> -->

    <!-- Font-awesome -->
    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.15.3/css/all.min.css"
          integrity="sha512-iBBXm8fW90+nuLcSKlbmrPcLa0OT92xO1BIsZ+ywDWZCvqsWgccV3gFoRBv0z+8dLJgyAHIhR35VZc2oM/gI1w=="
          crossorigin="anonymous" referrerpolicy="no-referrer"/>

    <!-- Lodash -->
    <script src="https://cdnjs.cloudflare.com/ajax/libs/lodash.js/4.17.21/lodash.min.js"
            integrity="sha512-WFN04846sdKMIP5LKNphMaWzU7YpMyCU245etK3g/2ARYbPK9Ub18eG+ljU96qKRCWh+quCY7yefSmlkQw1ANQ=="
            crossorigin="anonymous" referrerpolicy="no-referrer"></script>

    <style>
        .fa-active {
            color: #337ab7;
        }

        .header-cell {
            cursor: pointer;
        }

        .models-table thead th {
            position: sticky;
            top: 0;
            z-index: 1;
            background-color: #ffffff;
        }

        .info-icon {
            color: #007bff;
        }

        .info-icon:hover {
            color: #0056b3;
        }

        .info-icon-model {
            padding-left: 10px;
        }

        .bs-popover-auto[x-placement^=bottom], .bs-popover-bottom {
            margin-top: .5rem;
        }

        .popover {
            max-width: 400px;
        }
    </style>
</head>
<body>

<div id="app">
    <table class="table table-bordered table-sm models-table">
        <thead>
        <tr>
            <th class="header-cell" @click="sortAsc = (sortBy=='name') ? sortAsc = !sortAsc : false; sortBy='name'">
                <i class="fas fa-active" v-if="sortBy == 'name'" v-bind:class="{ 'fa-sort-amount-up': !sortAsc, 'fa-sort-amount-down-alt': sortAsc }"></i>
                Model Name
            </th>
            <th class="header-cell text-center" @click="sortAsc = (sortBy=='stsb') ? sortAsc = !sortAsc : false; sortBy='stsb'">
                <i class="fas fa-active" v-if="sortBy == 'stsb'" v-bind:class="{ 'fa-sort-amount-up': !sortAsc, 'fa-sort-amount-down-alt': sortAsc }"></i>
                STSb
                <span class="info-icon" data-trigger="hover" data-toggle="popover" title="STSbenchmark" data-content="Spearman-rank correlation on the STSbenchmark test set. Higher = Better" data-html="true" data-placement="bottom"><i class="fas fa-info-circle"></i></span>
            </th>
            <th class="header-cell text-center" @click="sortAsc = (sortBy=='dupq') ? sortAsc = !sortAsc : false; sortBy='dupq'">
                <i class="fas fa-active" v-if="sortBy == 'dupq'" v-bind:class="{ 'fa-sort-amount-up': !sortAsc, 'fa-sort-amount-down-alt': sortAsc }"></i>
                DupQ
                <span class="info-icon" data-trigger="hover" data-toggle="popover" title="Duplicate Questions" data-content="Combination of two datasets for duplicate questions detection:<br>Mean-Average-Precision on the Quora Duplicate Questions Semantic Search test set.<br>Average-Precision on the Sprint duplicate questions test set.<br> Higher = Better" data-html="true" data-placement="bottom"><i class="fas fa-info-circle"></i></span>
            </th>
            <th class="header-cell text-center" @click="sortAsc = (sortBy=='TwitterP') ? sortAsc = !sortAsc : false; sortBy='TwitterP'">
                <i class="fas fa-active" v-if="sortBy == 'TwitterP'" v-bind:class="{ 'fa-sort-amount-up': !sortAsc, 'fa-sort-amount-down-alt': sortAsc }"></i>
                TwitterP
                <span class="info-icon" data-trigger="hover" data-toggle="popover" title="Twitter Paraphrases" data-content="A test to find tweets that are considered paraphrases. Combination of the SemEval2015 Tweet paraphrase test set and the Twitter-URL-Corpus test set. Performance is measured using Average Precision.<br> Higher = Better" data-html="true" data-placement="bottom"><i class="fas fa-info-circle"></i></span>
            </th>
            <th class="header-cell text-center" @click="sortAsc = (sortBy=='scidocs') ? sortAsc = !sortAsc : false; sortBy='scidocs'">
                <i class="fas fa-active" v-if="sortBy == 'scidocs'" v-bind:class="{ 'fa-sort-amount-up': !sortAsc, 'fa-sort-amount-down-alt': sortAsc }"></i>
                SciDocs
                <span class="info-icon" data-trigger="hover" data-toggle="popover" title="SciDocs" data-content="A test to find similar scientific publications given a paper title. From SciDocs, we use the information which papers are often co-cited, co-read, or co-viewed. Performance is measured using MAP.<br> Higher = Better" data-html="true" data-placement="bottom"><i class="fas fa-info-circle"></i></span>
            </th>
            <th class="header-cell text-center" @click="sortAsc = (sortBy=='clustering') ? sortAsc = !sortAsc : false; sortBy='clustering'">
                <i class="fas fa-active" v-if="sortBy == 'clustering'" v-bind:class="{ 'fa-sort-amount-up': !sortAsc, 'fa-sort-amount-down-alt': sortAsc }"></i>
                Clustering
                <span class="info-icon" data-trigger="hover" data-toggle="popover" title="Clustering" data-content="Here we test how well the embeddings can be used for clustering. We use three datasets: email subjects from 20NewsGroups, titles from 199 popular subreddits, questions from 121 StackExchanges. We cluster different sentence collections with sentences from 10-50 categories. Performance is measured using V-Measure. <br> Higher = Better" data-html="true" data-placement="bottom"><i class="fas fa-info-circle"></i></span>
            </th>
            <th class="header-cell text-center" @click="sortAsc = (sortBy=='final_score') ? sortAsc = !sortAsc : false; sortBy='final_score'">
                <i class="fas fa-active" v-if="sortBy == 'final_score'" v-bind:class="{ 'fa-sort-amount-up': !sortAsc, 'fa-sort-amount-down-alt': sortAsc }"></i>
                Avg. Performance
                <span class="info-icon" data-trigger="hover" data-toggle="popover" title="Average Performance" data-content="Average Performance over all tasks.<br>Higher = Better" data-html="true" data-placement="bottom"><i class="fas fa-info-circle"></i></span>
            </th>
            <th class="header-cell text-center" @click="sortAsc = (sortBy=='speed') ? sortAsc = !sortAsc : false; sortBy='speed'">
                <i class="fas fa-active" v-if="sortBy == 'speed'" v-bind:class="{ 'fa-sort-amount-up': !sortAsc, 'fa-sort-amount-down-alt': sortAsc }"></i>
                Speed
                <span class="info-icon" data-trigger="hover" data-toggle="popover" title="Encoding Speed" data-content="Encoding speed (sentences / sec) on a V100 GPU.<br>Higher = Better" data-html="true" data-placement="bottom"><i class="fas fa-info-circle"></i></span>
            </th>
        </tr>
        </thead>
        <tbody>
        <tr v-for="item in sortedModels">
            <td>
                {{ item.name }}
                <span class="info-icon info-icon-model" data-trigger="hover" data-toggle="popover" :title="item.name" :data-content="'<b>Base Model:</b> '+item.base_model+'<br><b>Pooling:</b> '+item.pooling+'<br><b>Training Data:</b> '+item.training_data+'<br><b>Dimensions:</b> '+item.dim+'<br><b>Size:</b> '+item.size+' MB'" data-html="true" data-placement="bottom"><i class="fas fa-info-circle"></i></span>
            </td>
            <td class="text-center">{{ item.stsb.toFixed(2) }}</td>
            <td class="text-center"><span :title="'Quora: '+item.qqp.toFixed(2)+' Sprint: '+item.sprint.toFixed(2)">{{ item.dupq.toFixed(2) }}</span></td>
            <td class="text-center">{{ item.TwitterP.toFixed(2) }}</td>
            <td class="text-center">{{ item.scidocs.toFixed(2) }}</td>
            <td class="text-center"><span :title="'Newsgroups: '+item.newsgroups.toFixed(2)+' Reddit: '+item.reddit.toFixed(2)+' StackExchange: '+item.stackexchange.toFixed(2)">{{ item.clustering.toFixed(2) }}</span></td>
            <td class="text-center">{{ item.final_score.toFixed(2) }}</td>
            <td class="text-center">{{ item.speed }}</td>
        </tr>
        </tbody>
    </table>
</div>
<script>
    // Shared training-data description used by all paraphrase-* models below.
    var PARAPHRASE_DATA = "AllNLI, sentence-compression, SimpleWiki, altlex, msmarco-triplets, quora_duplicates, coco_captions,flickr30k_captions, yahoo_answers_title_question, S2ORC_citation_pairs, stackexchange_duplicate_questions, wiki-atomic-edits";

    var app = new Vue({
        el: '#app',
        data: {
            models: [
                {"name": "stsb-mpnet-base-v2", "base_model": "microsoft/mpnet-base", "pooling": "Mean Pooling", "training_data": "NLI+STSb",
                 "stsb": 88.57, "qqp": 79.74, "sprint": 90.34, "TwitterP": 75.35, "scidocs": 72.48,
                 "newsgroups": 31.56, "reddit": 39.52, "stackexchange": 46.41, "speed": 2800, "size": 386, "dim": 768},
                {"name": "stsb-roberta-base-v2", "base_model": "roberta-base", "pooling": "Mean Pooling", "training_data": "NLI+STSb",
                 "stsb": 87.21, "qqp": 78.68, "sprint": 86.42, "TwitterP": 73.44, "scidocs": 69.83,
                 "newsgroups": 26.87, "reddit": 36.91, "stackexchange": 45.48, "speed": 2300, "size": 440, "dim": 768},
                {"name": "stsb-distilroberta-base-v2", "base_model": "distilroberta-base", "pooling": "Mean Pooling", "training_data": "NLI+STSb",
                 "stsb": 86.41, "qqp": 78.13, "sprint": 87.28, "TwitterP": 73.68, "scidocs": 69.85,
                 "newsgroups": 28.63, "reddit": 38.26, "stackexchange": 46.16, "speed": 4000, "size": 292, "dim": 768},
                {"name": "nli-mpnet-base-v2", "base_model": "microsoft/mpnet-base", "pooling": "Mean Pooling", "training_data": "NLI",
                 "stsb": 86.53, "qqp": 80.65, "sprint": 85.79, "TwitterP": 76.24, "scidocs": 72.90,
                 "newsgroups": 36.56, "reddit": 42.68, "stackexchange": 50.90, "speed": 2800, "size": 385, "dim": 768},
                {"name": "nli-roberta-base-v2", "base_model": "roberta-base", "pooling": "Mean Pooling", "training_data": "NLI",
                 "stsb": 85.54, "qqp": 78.73, "sprint": 81.67, "TwitterP": 74.28, "scidocs": 69.86,
                 "newsgroups": 31.28, "reddit": 39.58, "stackexchange": 49.51, "speed": 2300, "size": 440, "dim": 768},
                {"name": "nli-distilroberta-base-v2", "base_model": "distilroberta-base", "pooling": "Mean Pooling", "training_data": "NLI",
                 "stsb": 84.38, "qqp": 78.47, "sprint": 83.03, "TwitterP": 73.86, "scidocs": 70.23,
                 "newsgroups": 31.87, "reddit": 39.12, "stackexchange": 49.27, "speed": 4000, "size": 292, "dim": 768},
                {"name": "average_word_embeddings_glove.6B.300d", "base_model": "Word Embeddings: GloVe", "pooling": "Mean Pooling", "training_data": "-",
                 "stsb": 61.77, "qqp": 69.18, "sprint": 86.96, "TwitterP": 68.60, "scidocs": 63.69,
                 "newsgroups": 26.65, "reddit": 28.37, "stackexchange": 36.37, "speed": 34000, "size": 422, "dim": 300},
                {"name": "average_word_embeddings_komninos", "base_model": "Word Embeddings: Komninos et al.", "pooling": "Mean Pooling", "training_data": "-",
                 "stsb": 61.56, "qqp": 69.83, "sprint": 85.55, "TwitterP": 71.23, "scidocs": 65.25,
                 "newsgroups": 27.53, "reddit": 29.54, "stackexchange": 39.35, "speed": 22000, "size": 237, "dim": 300},
                /*{"name": "average_word_embeddings_levy_dependency", "base_model": "Word Embeddings: Levy et al.", "pooling": "Mean Pooling", "training_data": "-",
                 "stsb": 59.22, "qqp": 64.62, "sprint": 80.12, "TwitterP": 70.79, "scidocs": 60.04,
                 "newsgroups": 22.72, "reddit": 24.23, "stackexchange": 33.66, "speed": 22000, "size": 186, "dim": 300},*/
                {"name": "paraphrase-MiniLM-L12-v2", "base_model": "microsoft/MiniLM-L12-H384-uncased", "pooling": "Mean Pooling", "training_data": PARAPHRASE_DATA,
                 "stsb": 84.41, "qqp": 84.64, "sprint": 89.91, "TwitterP": 75.34, "scidocs": 80.08,
                 "newsgroups": 41.81, "reddit": 44.42, "stackexchange": 54.63, "speed": 7500, "size": 118, "dim": 384},
                {"name": "paraphrase-MiniLM-L6-v2", "base_model": "nreimers/MiniLM-L6-H384-uncased", "pooling": "Mean Pooling", "training_data": PARAPHRASE_DATA,
                 "stsb": 84.12, "qqp": 84.25, "sprint": 90.21, "TwitterP": 76.32, "scidocs": 78.91,
                 "newsgroups": 40.16, "reddit": 42.71, "stackexchange": 53.14, "speed": 14200, "size": 80, "dim": 384},
                {"name": "paraphrase-MiniLM-L3-v2", "base_model": "nreimers/MiniLM-L3-H384-uncased", "pooling": "Mean Pooling", "training_data": PARAPHRASE_DATA,
                 "stsb": 82.41, "qqp": 83.29, "sprint": 92.9, "TwitterP": 76.14, "scidocs": 77.71,
                 "newsgroups": 37.73, "reddit": 41.18, "stackexchange": 51.25, "speed": 19000, "size": 61, "dim": 384},
                {"name": "paraphrase-distilroberta-base-v2", "base_model": "distilroberta-base", "pooling": "Mean Pooling", "training_data": PARAPHRASE_DATA,
                 "stsb": 85.37, "qqp": 84.31, "sprint": 89.64, "TwitterP": 73.96, "scidocs": 80.25,
                 "newsgroups": 42.12, "reddit": 47.53, "stackexchange": 57.90, "speed": 4000, "size": 292, "dim": 768},
                {"name": "paraphrase-TinyBERT-L6-v2", "base_model": "https://huggingface.co/nreimers/TinyBERT_L-6_H-768_v2", "pooling": "Mean Pooling", "training_data": PARAPHRASE_DATA,
                 "stsb": 84.91, "qqp": 84.23, "sprint": 89.62, "TwitterP": 75.39, "scidocs": 81.51,
                 "newsgroups": 43.82, "reddit": 44.61, "stackexchange": 55.69, "speed": 4500, "size": 238, "dim": 768},
                {"name": "paraphrase-mpnet-base-v2", "base_model": "microsoft/mpnet-base", "pooling": "Mean Pooling",
                 "training_data": PARAPHRASE_DATA,
|
||||||
|
"stsb": 86.99,
|
||||||
|
"qqp": 84.93,
|
||||||
|
"sprint": 90.67,
|
||||||
|
"TwitterP": 76.05,
|
||||||
|
"scidocs": 80.57,
|
||||||
|
"newsgroups": 48.13,
|
||||||
|
"reddit": 50.52,
|
||||||
|
"stackexchange": 59.79,
|
||||||
|
"speed": 2800,
|
||||||
|
"size": 387,
|
||||||
|
"dim": 768
|
||||||
|
},
|
||||||
|
/*{
|
||||||
|
"name": "paraphrase-albert-base-v2",
|
||||||
|
"base_model": "albert-base-v2",
|
||||||
|
"pooling": "Mean Pooling",
|
||||||
|
"training_data": "AllNLI, sentence-compression, SimpleWiki, altlex, msmarco-triplets, quora_duplicates, coco_captions,flickr30k_captions, yahoo_answers_title_question, S2ORC_citation_pairs, stackexchange_duplicate_questions, wiki-atomic-edits",
|
||||||
|
"stsb": 83.38,
|
||||||
|
"qqp": 83.28,
|
||||||
|
"sprint": 90.45,
|
||||||
|
"TwitterP": 74.83,
|
||||||
|
"scidocs": 77.83,
|
||||||
|
"newsgroups": 39.88,
|
||||||
|
"reddit": 42.64,
|
||||||
|
"stackexchange": 52.95,
|
||||||
|
"speed": 2400,
|
||||||
|
"size": 43,
|
||||||
|
"dim": 768
|
||||||
|
},*/
|
||||||
|
{
|
||||||
|
"name": "paraphrase-albert-small-v2",
|
||||||
|
"base_model": "nreimers/albert-small-v2",
|
||||||
|
"pooling": "Mean Pooling",
|
||||||
|
"training_data": "AllNLI, sentence-compression, SimpleWiki, altlex, msmarco-triplets, quora_duplicates, coco_captions,flickr30k_captions, yahoo_answers_title_question, S2ORC_citation_pairs, stackexchange_duplicate_questions, wiki-atomic-edits",
|
||||||
|
"stsb": 83.40,
|
||||||
|
"qqp": 83.54,
|
||||||
|
"sprint": 89.6,
|
||||||
|
"TwitterP": 74.51,
|
||||||
|
"scidocs": 80.28,
|
||||||
|
"newsgroups": 40.54,
|
||||||
|
"reddit": 41.54,
|
||||||
|
"stackexchange": 52.74,
|
||||||
|
"speed": 5000,
|
||||||
|
"size": 43,
|
||||||
|
"dim": 768
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"name": "paraphrase-multilingual-mpnet-base-v2",
|
||||||
|
"base_model": "Teacher: paraphrase-mpnet-base-v2; Student: xlm-roberta-base",
|
||||||
|
"pooling": "Mean Pooling",
|
||||||
|
"training_data": "Multi-lingual model of paraphrase-mpnet-base-v2, extended to 50+ languages.",
|
||||||
|
"stsb": 86.82,
|
||||||
|
"qqp": 83.91,
|
||||||
|
"sprint": 91.1,
|
||||||
|
"TwitterP": 76.52,
|
||||||
|
"scidocs": 78.66,
|
||||||
|
"newsgroups": 44.65,
|
||||||
|
"reddit": 45.01,
|
||||||
|
"stackexchange": 52.73,
|
||||||
|
"speed": 2500,
|
||||||
|
"size": 969,
|
||||||
|
"dim": 768
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"name": "paraphrase-multilingual-MiniLM-L12-v2",
|
||||||
|
"base_model": "Teacher: paraphrase-MiniLM-L12-v2; Student: microsoft/Multilingual-MiniLM-L12-H384",
|
||||||
|
"pooling": "Mean Pooling",
|
||||||
|
            "training_data": "Multi-lingual model of paraphrase-MiniLM-L12-v2, extended to 50+ languages.",
            "stsb": 84.42,
            "qqp": 83.89,
            "sprint": 91.15,
            "TwitterP": 74.94,
            "scidocs": 78.27,
            "newsgroups": 40.36,
            "reddit": 41.49,
            "stackexchange": 49.75,
            "speed": 7500,
            "size": 418,
            "dim": 384
        },
        {
            "name": "distiluse-base-multilingual-cased-v1",
            "base_model": "Teacher: mUSE; Student: distilbert-base-multilingual",
            "pooling": "Mean Pooling",
            "training_data": "Multi-Lingual model of Universal Sentence Encoder for 15 languages: Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish.",
            "stsb": 80.62,
            "qqp": 81.10,
            "sprint": 88.54,
            "TwitterP": 76.24,
            "scidocs": 70.41,
            "newsgroups": 32.97,
            "reddit": 42.93,
            "stackexchange": 44.30,
            "speed": 4000,
            "size": 482,
            "dim": 512
        },
        {
            "name": "distiluse-base-multilingual-cased-v2",
            "base_model": "Teacher: mUSE; Student: distilbert-base-multilingual",
            "pooling": "Mean Pooling",
            "training_data": "Multi-Lingual model of Universal Sentence Encoder for 50 languages.",
            "stsb": 80.75,
            "qqp": 79.89,
            "sprint": 87.15,
            "TwitterP": 76.26,
            "scidocs": 70.39,
            "newsgroups": 29.96,
            "reddit": 39.95,
            "stackexchange": 41.19,
            "speed": 4000,
            "size": 481,
            "dim": 512
        }
        ],
        sortBy: 'final_score',
        sortAsc: false
    },
    methods: {

    },
    computed: {
        sortedModels: function() {
            // Add avg. for duplicate questions
            let models_ext = this.models.map(function(elem, index) { elem.dupq = (elem.qqp + elem.sprint) / 2.0; return elem; });

            // Add avg. for clustering
            models_ext = models_ext.map(function(elem, index) { elem.clustering = (elem.newsgroups + elem.reddit + elem.stackexchange) / 3.0; return elem; });

            // Final score
            models_ext = models_ext.map(function(elem, index) { elem.final_score = (elem.stsb + elem.dupq + elem.TwitterP + elem.scidocs + elem.clustering) / 5.0; return elem; });

            return _.orderBy(models_ext, (item) => item[this.sortBy] || (this.sortAsc ? 9999 : -9999), this.sortAsc ? 'asc' : 'desc')
        }
    }
})
</script>

<script>
    $(function () {
        $('[data-toggle="popover"]').popover()
    })
</script>

</body>
</html>
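The `sortedModels` computed property in the script above averages `qqp` and `sprint` into a duplicate-question score, the three clustering datasets into a clustering score, and then five category scores into `final_score`. A minimal Python sketch of that aggregation (the dict below reuses the paraphrase-mpnet-base-v2 numbers from the table; the function name is illustrative, not part of the page):

```python
def final_score(m):
    # Mirrors the page's computed properties: dupq and clustering are
    # intermediate averages, final_score averages the five categories.
    dupq = (m["qqp"] + m["sprint"]) / 2.0
    clustering = (m["newsgroups"] + m["reddit"] + m["stackexchange"]) / 3.0
    return (m["stsb"] + dupq + m["TwitterP"] + m["scidocs"] + clustering) / 5.0

mpnet = {"stsb": 86.99, "qqp": 84.93, "sprint": 90.67, "TwitterP": 76.05,
         "scidocs": 80.57, "newsgroups": 48.13, "reddit": 50.52,
         "stackexchange": 59.79}
print(round(final_score(mpnet), 2))  # 76.84
```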
BIN docs/wechat.jpeg Normal file
Binary file not shown. Size: 40 KiB
21 examples/base_demo.py Normal file
@@ -0,0 +1,21 @@
# -*- coding: utf-8 -*-
"""
@author:XuMing(xuming624@qq.com)
@description:
This basic example loads a pre-trained model from the web and uses it to
generate sentence embeddings for a given list of sentences.
"""
import sys

sys.path.append('..')
from similarities import BertSimilarity

if __name__ == '__main__':
    model = BertSimilarity("shibing624/text2vec-base-chinese")  # Chinese sentence embedding model (CoSENT)
    # Embed a list of sentences
    sentences = ['如何更换花呗绑定银行卡',
                 '花呗更改绑定银行卡']
    sentence_embeddings = model.encode(sentences)
    print(type(sentence_embeddings), sentence_embeddings.shape)
    similarity_score = model.similarity_score([sentences[0]], [sentences[1]])
    print(similarity_score)
19 examples/computing_embeddings.py Normal file
@@ -0,0 +1,19 @@
# -*- coding: utf-8 -*-
"""
@author:XuMing(xuming624@qq.com)
@description:
This basic example loads a pre-trained model from the web and uses it to
generate sentence embeddings for a given list of sentences.
"""

import sys

sys.path.append('..')
from similarities import BertSimilarity

model = BertSimilarity("shibing624/text2vec-base-chinese")  # Chinese sentence embedding model (CoSENT)
# Embed a list of sentences
sentences = ['如何更换花呗绑定银行卡',
             '花呗更改绑定银行卡']
sentence_embeddings = model.encode(sentences)
print(type(sentence_embeddings), sentence_embeddings.shape)
40 examples/gradio_demo.py Normal file
@@ -0,0 +1,40 @@
# -*- coding: utf-8 -*-
"""
@author:XuMing(xuming624@qq.com)
@description: pip install gradio
"""

import gradio as gr
from similarities import BertSimilarity

# Chinese sentence embedding model (CoSENT)
sim_model = BertSimilarity(model_name_or_path='shibing624/text2vec-base-chinese')


def ai_text(sentence1, sentence2):
    score = sim_model.similarity_score(sentence1, sentence2)
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentence1, sentence2, score))

    return score


if __name__ == '__main__':
    examples = [
        ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡'],
        ['我在北京打篮球', '我是北京人,我喜欢篮球'],
        ['一个女人在看书。', '一个女人在揉面团'],
        ['一个男人在车库里举重。', '一个人在举重。'],
    ]
    input1 = gr.inputs.Textbox(lines=2, placeholder="Enter First Sentence")
    input2 = gr.inputs.Textbox(lines=2, placeholder="Enter Second Sentence")

    output_text = gr.outputs.Textbox()
    gr.Interface(ai_text,
                 inputs=[input1, input2],
                 outputs=[output_text],
                 # theme="grass",
                 title="Chinese Text Matching Model shibing624/text2vec-base-chinese",
                 description="Copy or input Chinese text here. Submit and the machine will calculate the cosine score.",
                 article="Link to <a href='https://github.com/shibing624/similarities' style='color:blue;' target='_blank'>Github REPO</a>",
                 examples=examples
                 ).launch()
8 requirements.txt Normal file
@@ -0,0 +1,8 @@
jieba>=0.39
loguru
transformers>=4.6.0
tokenizers>=0.10.3
tqdm
numpy
scikit-learn
gensim>=4.0.0
51 setup.py Normal file
@@ -0,0 +1,51 @@
# -*- coding: utf-8 -*-
import sys

from setuptools import setup, find_packages

# Avoids IDE errors, but actual version is read from version.py
__version__ = None
exec(open('similarities/version.py').read())

if sys.version_info < (3,):
    sys.exit('Sorry, Python3 is required.')

with open('README.md', 'r', encoding='utf-8') as f:
    readme = f.read()

with open('requirements.txt', 'r', encoding='utf-8') as f:
    reqs = f.read()

setup(
    name='similarities',
    version=__version__,
    description='Similarities is a toolkit for computing similarity scores between two sets of strings.',
    long_description=readme,
    long_description_content_type='text/markdown',
    author='XuMing',
    author_email='xuming624@qq.com',
    url='https://github.com/shibing624/similarities',
    license="Apache License 2.0",
    zip_safe=False,
    python_requires=">=3.6.0",
    classifiers=[
        "Development Status :: 5 - Production/Stable",
        "Intended Audience :: Developers",
        "Intended Audience :: Education",
        "Intended Audience :: Science/Research",
        "License :: OSI Approved :: Apache Software License",
        "Operating System :: OS Independent",
        "Programming Language :: Python :: 3",
        "Programming Language :: Python :: 3.6",
        "Programming Language :: Python :: 3.7",
        "Programming Language :: Python :: 3.8",
        "Programming Language :: Python :: 3.9",
        "Topic :: Scientific/Engineering :: Artificial Intelligence",
    ],
    keywords='similarities,Chinese Text Similarity Calculation Tool,similarity,word2vec',
    install_requires=reqs.strip().split('\n'),
    packages=find_packages(exclude=['tests']),
    package_dir={'similarities': 'similarities'},
    package_data={'similarities': ['*.*', '../LICENSE', '../README.*', '../*.txt', 'utils/*',
                                   'data/*', ]}
)
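setup.py above reads the package version by exec-ing version.py rather than importing the package. A tiny sketch of why that pattern works (the string literal below stands in for `open('similarities/version.py').read()`):

```python
# exec-ing the version file populates __version__ in a namespace without
# importing the package itself, which could require dependencies that are
# not yet installed at setup time.
namespace = {}
exec("__version__ = '0.0.1'", namespace)  # stands in for the file contents
print(namespace["__version__"])  # 0.0.1
```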
7 similarities/__init__.py Normal file
@@ -0,0 +1,7 @@
# -*- coding: utf-8 -*-
"""
@author:XuMing(xuming624@qq.com)
@description:
"""

from .similarity import BertSimilarity
34 similarities/similarity.py Normal file
@@ -0,0 +1,34 @@
# -*- coding: utf-8 -*-
"""
@author:XuMing(xuming624@qq.com)
@description:
"""

from typing import List, Union, Optional
import numpy as np
from numpy import ndarray
from torch import Tensor
from loguru import logger


class BertSimilarity:
    def __init__(self, model_name_or_path=''):
        """
        Compute text similarity.
        :param model_name_or_path: BERT model name or path
        """
        self.model_name_or_path = model_name_or_path
        self.model = None

    def encode(self, sentences: Union[List[str], str]) -> ndarray:
        return np.array([])

    def similarity_score(self, sentences1: Union[List[str], str], sentences2: Union[List[str], str]):
        """
        Get similarity scores between sentences1 and sentences2
        :param sentences1: list, sentence1 list
        :param sentences2: list, sentence2 list
        :return: Matrix with res[i][j] = cos_sim(a[i], b[j])
        """
        return 0.0
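`similarity_score` above is still a stub, but its docstring promises a matrix with `res[i][j] = cos_sim(a[i], b[j])`. A minimal NumPy sketch of that computation (an illustration of the promised contract, not the project's eventual implementation; the function and variable names are made up here):

```python
import numpy as np

def cos_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # Row-normalize both embedding matrices; the matrix product then
    # yields res[i][j] = cos_sim(a[i], b[j]).
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

emb1 = np.array([[1.0, 0.0], [0.0, 1.0]])  # two unit vectors
emb2 = np.array([[1.0, 1.0]])              # one query vector
print(cos_sim(emb1, emb2))  # each entry is about 0.7071
```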
7 similarities/version.py Normal file
@@ -0,0 +1,7 @@
# -*- coding: utf-8 -*-
"""
@author:XuMing(xuming624@qq.com)
@description:
"""

__version__ = '0.0.1'
27 tests/test_sim_score.py Normal file
@@ -0,0 +1,27 @@
# -*- coding: utf-8 -*-
"""
@author:XuMing(xuming624@qq.com)
@description:
"""
import sys
import unittest

sys.path.append('..')

from similarities import BertSimilarity

bert_model = BertSimilarity()


class IssueTestCase(unittest.TestCase):

    def test_sim_diff(self):
        a = '研究团队面向国家重大战略需求追踪国际前沿发展借鉴国际人工智能研究领域的科研模式有效整合创新资源解决复'
        b = '英汉互译比较语言学'
        r = bert_model.similarity_score(a, b)
        print(a, b, r)
        self.assertTrue(abs(r - 0.1733) < 0.001)


if __name__ == '__main__':
    unittest.main()