init similarities project.

This commit is contained in:
shibing624 2022-02-23 19:44:53 +08:00
parent a88748119f
commit 487469a423
19 changed files with 1029 additions and 2 deletions

28
.github/ISSUE_TEMPLATE/bug-report.md vendored Normal file

@ -0,0 +1,28 @@
---
name: Bug Report
about: Create a report to help us improve
title: ''
labels: bug
assignees: ''
---
### Describe the bug
Please provide a clear and concise description of what the bug is. If applicable, add screenshots to help explain your problem, especially for visualization related problems.
### To Reproduce
Please provide a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve) here. We hope we can simply copy/paste/run it. It is also nice to share a hosted runnable script (e.g. Google Colab), especially for hardware-related problems.
### Describe your attempts
- [ ] I checked the documentation and found no answer
- [ ] I checked to make sure that this is not a duplicate issue
You should also provide the code snippets you tried as a workaround, the StackOverflow solutions you have walked through, or your best guess at the cause if you can't locate it (e.g. cosmic radiation).
### Context
- **OS** [e.g. Windows 10, macOS 10.14]:
- **Hardware** [e.g. CPU only, GTX 1080 Ti]:
### Additional Information
Other things you want the developers to know.


@ -0,0 +1,23 @@
---
name: Feature Request
about: Suggest an idea for this project
title: ''
labels: enhancement
assignees: ''
---
- [ ] I checked to make sure that this is not a duplicate issue
- [ ] I'm submitting the request to the correct repository (for model requests)
### Is your feature request related to a problem? Please describe.
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
### Describe the solution you'd like
A clear and concise description of what you want to happen.
### Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
### Additional Information
Other things you want the developers to know.


@ -0,0 +1,18 @@
---
name: Usage Question
about: Ask a question about similarities usage
title: ''
labels: question
assignees: ''
---
### Describe the Question
Please provide a clear and concise description of what the question is.
### Describe your attempts
- [ ] I walked through the tutorials
- [ ] I checked the documentation
- [ ] I checked to make sure that this is not a duplicate question
You may also provide a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve) you tried as a workaround, or a StackOverflow solution that you have walked through.

17
.github/stale.yml vendored Normal file

@ -0,0 +1,17 @@
# Number of days of inactivity before an issue becomes stale
daysUntilStale: 60
# Number of days of inactivity before a stale issue is closed
daysUntilClose: 7
# Issues with these labels will never be considered stale
exemptLabels:
- pinned
- security
# Label to use when marking an issue as stale
staleLabel: wontfix
# Comment to post when marking an issue as stale. Set to `false` to disable
markComment: >
  This issue has been automatically marked as stale because it has not had
  recent activity. It will be closed if no further activity occurs. Thank you
  for your contributions. (This issue was closed automatically by the bot due to prolonged inactivity; feel free to ask again if needed.)
# Comment to post when closing a stale issue. Set to `false` to disable
closeComment: false

1
.gitignore vendored

@ -127,3 +127,4 @@ dmypy.json
# Pyre type checker
.pyre/
.idea/
.idea/

9
CONTRIBUTING.md Normal file

@ -0,0 +1,9 @@
# Contributing
We are happy to accept your contributions to make `similarities` better and more awesome! To avoid unnecessary work on either
side, please stick to the following process:
1. Check if there is already [an issue](https://github.com/shibing624/similarities/issues) for your concern.
2. If there is not, open a new one to start a discussion. We hate to close finished PRs!
3. If we decide your concern needs code changes, we would be happy to accept a pull request. Please consider the
commit guidelines below.

183
README.md

@ -1,2 +1,181 @@
# Similarities
[![PyPI version](https://badge.fury.io/py/similarities.svg)](https://badge.fury.io/py/similarities)
[![Downloads](https://pepy.tech/badge/similarities)](https://pepy.tech/project/similarities)
[![Contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](CONTRIBUTING.md)
[![GitHub contributors](https://img.shields.io/github/contributors/shibing624/similarities.svg)](https://github.com/shibing624/similarities/graphs/contributors)
[![License Apache 2.0](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE)
[![python_version](https://img.shields.io/badge/Python-3.5%2B-green.svg)](requirements.txt)
[![GitHub issues](https://img.shields.io/github/issues/shibing624/similarities.svg)](https://github.com/shibing624/similarities/issues)
[![Wechat Group](http://vlog.sfyc.ltd/wechat_everyday/wxgroup_logo.png?imageView2/0/w/60/h/20)](#Contact)
Similarities is a toolkit for computing similarity scores between texts, implementing multiple lexical and semantic matching models.
**similarities** implements Word2Vec, RankBM25, BERT, Sentence-BERT, CoSENT and other text-representation and text-similarity models, and compares their performance on semantic text-matching tasks.
**Guide**
- [Feature](#Feature)
- [Evaluate](#Evaluate)
- [Install](#install)
- [Usage](#usage)
- [Contact](#Contact)
- [Citation](#Citation)
- [Reference](#reference)
# Feature
### Text embedding models
- [Word2Vec](similarities/word2vec.py): word-vector lookup based on the large-scale, high-quality Chinese word vectors released by Tencent AI Lab ([8-million-word lite version](https://pan.baidu.com/s/1La4U4XNFe8s5BJqxPQpeiQ), file light_Tencent_AILab_ChineseEmbedding.bin, password: tawe); this project represents a sentence as the average of its word vectors (see the sketch after this list)
- [SBert(Sentence-BERT)](similarities/sentence_bert): a sentence-embedding model that balances quality and efficiency; a classification head is trained with supervision for text matching, and at prediction time sentence vectors are compared directly with cosine similarity; this project reproduces Sentence-BERT training and prediction in PyTorch
- [CoSENT(Cosine Sentence)](similarities/cosent): CoSENT introduces a ranking loss that brings training closer to prediction, converging faster and performing better than Sentence-BERT; this project implements CoSENT training and prediction in PyTorch
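As an illustration of the sentence-as-average-of-word-vectors idea behind the Word2Vec representation above, here is a minimal sketch using gensim and jieba (both listed in requirements.txt); the vector file path and the tokenization choices are assumptions for demonstration, not the project's actual implementation.
```python
import jieba
import numpy as np
from gensim.models import KeyedVectors

# Load the Tencent word vectors in word2vec binary format (path is an assumption)
w2v = KeyedVectors.load_word2vec_format('light_Tencent_AILab_ChineseEmbedding.bin', binary=True)

def sentence_vector(sentence: str) -> np.ndarray:
    # Tokenize with jieba and average the vectors of in-vocabulary tokens
    tokens = [t for t in jieba.lcut(sentence) if t in w2v.key_to_index]
    if not tokens:
        return np.zeros(w2v.vector_size)
    return np.mean([w2v[t] for t in tokens], axis=0)

emb = sentence_vector('如何更换花呗绑定银行卡')
print(emb.shape)
```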
### Text similarity measures
- Cosine similarity: the cosine of the angle between the two vectors (see the sketch below this list)
- Dot product: the inner product of the two vectors after normalization
- Word Mover's Distance (WMD): measures, using word vectors, the minimum distance the words of one text must travel in semantic space to reach the words of the other text
- [RankBM25](similarities/bm25.py): a BM25 variant that scores the similarity between a query and documents and returns a ranking of the docs
- [SemanticSearch](https://github.com/shibing624/similarities/blob/master/similarities/sbert.py#L80): vector similarity search using cosine similarity + top-k, roughly an order of magnitude faster than brute-force pairwise comparison
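A minimal NumPy sketch of the first two measures (the toy vectors are made up for demonstration and are not produced by the library):
```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def normalized_dot_product(a: np.ndarray, b: np.ndarray) -> float:
    # Inner product after L2-normalizing both vectors; equals cosine similarity
    return float(np.dot(a / np.linalg.norm(a), b / np.linalg.norm(b)))

emb1 = np.array([0.1, 0.3, 0.5])  # toy embeddings
emb2 = np.array([0.2, 0.1, 0.4])
print(cosine_similarity(emb1, emb2), normalized_dot_product(emb1, emb2))
```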
# Evaluate
### Text matching
- Evaluation results on English matching datasets:
| Arch | Backbone | Model Name | English-STS-B |
| :-- | :--- | :--- | :-: |
| GloVe | glove | Avg_word_embeddings_glove_6B_300d | 61.77 |
| BERT | bert-base-uncased | BERT-base-cls | 20.29 |
| BERT | bert-base-uncased | BERT-base-first_last_avg | 59.04 |
| BERT | bert-base-uncased | BERT-base-first_last_avg-whiten(NLI) | 63.65 |
| SBERT | sentence-transformers/bert-base-nli-mean-tokens | SBERT-base-nli-cls | 73.65 |
| SBERT | sentence-transformers/bert-base-nli-mean-tokens | SBERT-base-nli-first_last_avg | 77.96 |
| SBERT | xlm-roberta-base | paraphrase-multilingual-MiniLM-L12-v2 | 84.42 |
| CoSENT | bert-base-uncased | CoSENT-base-first_last_avg | 69.93 |
| CoSENT | sentence-transformers/bert-base-nli-mean-tokens | CoSENT-base-nli-first_last_avg | 79.68 |
- Evaluation results on Chinese matching datasets:
| Arch | Backbone | Model Name | ATEC | BQ | LCQMC | PAWSX | STS-B | Avg | QPS |
| :-- | :--- | :--- | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| CoSENT | hfl/chinese-macbert-base | CoSENT-macbert-base | 50.39 | **72.93** | **79.17** | **60.86** | **80.51** | **68.77** | 2572 |
| CoSENT | Langboat/mengzi-bert-base | CoSENT-mengzi-base | **50.52** | 72.27 | 78.69 | 12.89 | 80.15 | 58.90 | 2502 |
| CoSENT | bert-base-chinese | CoSENT-bert-base | 49.74 | 72.38 | 78.69 | 60.00 | 80.14 | 68.19 | 2653 |
| SBERT | bert-base-chinese | SBERT-bert-base | 46.36 | 70.36 | 78.72 | 46.86 | 66.41 | 61.74 | 1365 |
| SBERT | hfl/chinese-macbert-base | SBERT-macbert-base | 47.28 | 68.63 | **79.42** | 55.59 | 64.82 | 63.15 | 1948 |
| CoSENT | hfl/chinese-roberta-wwm-ext | CoSENT-roberta-ext | **50.81** | **71.45** | **79.31** | **61.56** | **81.13** | **68.85** | - |
| SBERT | hfl/chinese-roberta-wwm-ext | SBERT-roberta-ext | 48.29 | 69.99 | 79.22 | 44.10 | 72.42 | 62.80 | - |
- Chinese matching results of the models released by this project:
| Arch | Backbone | Model Name | ATEC | BQ | LCQMC | PAWSX | STS-B | Avg | QPS |
| :-- | :--- | :---- | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| Word2Vec | word2vec | w2v-light-tencent-chinese | 20.00 | 31.49 | 59.46 | 2.57 | 55.78 | 33.86 | 10283 |
| SBERT | xlm-roberta-base | paraphrase-multilingual-MiniLM-L12-v2 | 18.42 | 38.52 | 63.96 | 10.14 | 78.90 | 41.99 | 2371 |
| CoSENT | hfl/chinese-macbert-base | similarities-base-chinese | 31.93 | 42.67 | 70.16 | 17.21 | 79.30 | **48.25** | 2572 |
Notes:
- All scores are Spearman correlation coefficients (a small sketch of computing one follows this list).
- Each result is obtained by training only on the dataset's train split and evaluating on its test split; no external data is used.
- `paraphrase-multilingual-MiniLM-L12-v2` refers to `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`, the multilingual version of `paraphrase-MiniLM-L12-v2`; it is fast, accurate, and supports Chinese.
- `CoSENT-macbert-base` reaches SOTA for its parameter size; it is trained with the CoSENT method, and the results can be reproduced by running the code under [similarities/cosent](similarities/cosent).
- `SBERT-macbert-base` is trained with the SBERT method; the results can be reproduced by running the code under [similarities/sentence_bert](similarities/sentence_bert).
- `similarities-base-chinese` is trained with the CoSENT method on the Chinese STS-B data based on MacBERT; the model has been uploaded to the HuggingFace model hub: [shibing624/similarities-base-chinese](https://huggingface.co/shibing624/similarities-base-chinese).
- `w2v-light-tencent-chinese` is the Word2Vec model built on the Tencent word vectors; it runs on CPU.
- All pretrained models can be loaded through transformers, e.g. MacBERT: `--pretrained_model_path hfl/chinese-macbert-base`.
- The Chinese matching datasets can be downloaded via the links in the [Datasets](#datasets) section below.
- Experiments on Chinese matching tasks show that the best pooling is `first_last_avg`; at prediction time SBert's `mean pooling` can be used instead with little loss in quality.
- QPS was measured on a Tesla V100 GPU with 32 GB of memory.
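For reference, a minimal sketch of computing a Spearman score from predicted similarities and gold labels; scipy is an assumed extra dependency (it is not listed in requirements.txt), and the arrays are toy values.
```python
from scipy.stats import spearmanr

predicted = [0.91, 0.12, 0.55, 0.78]  # model similarity scores (toy values)
gold = [1.0, 0.0, 0.6, 0.8]           # human-annotated labels (toy values)

# Spearman rank correlation, the metric reported in the tables above
corr, _ = spearmanr(predicted, gold)
print(round(corr * 100, 2))
```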
# Demo
Official Demo: http://42.193.145.218/product/short_text_sim/
HuggingFace Demo: https://huggingface.co/spaces/shibing624/similarities
![](docs/hf.png)
# Install
```
pip3 install -U similarities
```
or
```
git clone https://github.com/shibing624/similarities.git
cd similarities
python3 setup.py install
```
### Datasets
Common Chinese semantic matching datasets, covering five tasks: [ATEC](https://github.com/IceFlameWorm/NLP_Datasets/tree/master/ATEC), [BQ](http://icrc.hitsz.edu.cn/info/1037/1162.htm), [LCQMC](http://icrc.hitsz.edu.cn/Article/show/171.html), [PAWSX](https://arxiv.org/abs/1908.11828), and [STS-B](https://github.com/pluto-junzeng/CNSD).
You can download each dataset from its link, or get them all from [Baidu Netdisk (extraction code: qkt6)](https://pan.baidu.com/s/1d6jSiU1wHQAEMWJi7JJWCQ).
The senteval_cn directory aggregates the evaluation datasets, and senteval_cn.zip is an archive of that directory; downloading either one is enough.
# Usage
### 1. Compute text embeddings
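A minimal sketch, following [examples/base_demo.py](./examples/base_demo.py):
```python
from similarities import BertSimilarity

model = BertSimilarity("shibing624/text2vec-base-chinese")  # Chinese sentence embedding model (CoSENT)
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
sentence_embeddings = model.encode(sentences)
print(type(sentence_embeddings), sentence_embeddings.shape)
```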
### 2. Compute the similarity score between two sentences
Example: [semantic_text_similarity.py](./examples/semantic_text_similarity.py)
> The sentence cosine similarity `score` ranges over [-1, 1]; the larger the value, the more similar the two sentences.
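A minimal sketch of scoring a sentence pair with `similarity_score`, following [examples/base_demo.py](./examples/base_demo.py):
```python
from similarities import BertSimilarity

model = BertSimilarity("shibing624/text2vec-base-chinese")
score = model.similarity_score(['如何更换花呗绑定银行卡'], ['花呗更改绑定银行卡'])
print(score)
```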
### 3. Compute similarity scores between a query and a corpus of documents
The typical use case is finding the text in a candidate corpus that is most similar to the query, e.g. question matching for QA and similar-text retrieval.
> `score` ranges over [-1, 1]; the larger the value, the closer the query is to the corpus entry.
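A minimal sketch that ranks a toy corpus against a query by reusing `similarity_score` pairwise; the corpus sentences are only for demonstration, and a dedicated semantic-search API would avoid this brute-force loop:
```python
from similarities import BertSimilarity

model = BertSimilarity("shibing624/text2vec-base-chinese")
query = '如何更换花呗绑定银行卡'
corpus = ['花呗更改绑定银行卡', '我在北京打篮球', '一个女人在看书。']  # toy corpus

# Score the query against every corpus sentence, then sort by score descending
scored = [(doc, model.similarity_score([query], [doc])) for doc in corpus]
for doc, score in sorted(scored, key=lambda x: x[1], reverse=True):
    print(doc, score)
```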
# Contact
- Issues (suggestions): [![GitHub issues](https://img.shields.io/github/issues/shibing624/similarities.svg)](https://github.com/shibing624/similarities/issues)
- Email: xuming624@qq.com
- WeChat:
Add me on WeChat: *ID xuming624, note: name-company-NLP* to join the NLP discussion group.
<img src="docs/wechat.jpeg" width="200" />
# Citation
If you use similarities in your research, please cite it as follows:
```latex
@misc{similarities,
title={similarities: A Tool for Computing Similarity Scores},
author={Ming Xu},
howpublished={https://github.com/shibing624/similarities},
year={2022}
}
```
# License
Licensed under [The Apache License 2.0](/LICENSE); free for commercial use. Please include a link to similarities and the license in your product documentation.
# Contribute
The project code is still rough; if you can improve it, contributions back to this project are welcome. Before submitting, please note the following two points:
- Add corresponding unit tests under `tests`
- Run `python setup.py test` and make sure all unit tests pass
After that, you can submit the PR.
# Reference
- [A Simple but Tough-to-Beat Baseline for Sentence Embeddings[Sanjeev Arora and Yingyu Liang and Tengyu Ma, 2017]](https://openreview.net/forum?id=SyK00v5xx)
- [A comparison of four methods for computing text similarity [Yves Peirsman]](https://zhuanlan.zhihu.com/p/37104535)
- [On text matching and multi-turn retrieval](https://zhuanlan.zhihu.com/p/111769969)

BIN
docs/hf.png Normal file



@ -0,0 +1,538 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>SBERT.net Models</title>
<!-- Vue.js -->
<script src="https://cdnjs.cloudflare.com/ajax/libs/vue/2.6.12/vue.min.js" integrity="sha512-BKbSR+cfyxLdMAsE0naLReFSLg8/pjbgfxHh/k/kUC82Hy7r6HtR5hLhobaln2gcTvzkyyehrdREdjpsQwy2Jw==" crossorigin="anonymous" referrerpolicy="no-referrer"></script>
<!-- Bootstrap -->
<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css"
integrity="sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh" crossorigin="anonymous">
<script src="https://code.jquery.com/jquery-3.4.1.slim.min.js"
integrity="sha384-J6qa4849blE2+poT4WnyKhv5vZF5SrPo0iEjwBvKU7imGFAV0wwj1yYfoRSJoZ+n"
crossorigin="anonymous"></script>
<script src="https://cdn.jsdelivr.net/npm/popper.js@1.16.0/dist/umd/popper.min.js"
integrity="sha384-Q6E9RHvbIyZFJoft+2mJbHaEWldlvI9IOYy5n3zV9zzTtmI3UksdQRVvoxMfooAo"
crossorigin="anonymous"></script>
<script src="https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/js/bootstrap.min.js"
integrity="sha384-wfSDF2E50Y2D1uUdj0O3uMBJnjuUD4Ih7YwaYd1iqfktj0Uod8GCExl3Og8ifwB6"
crossorigin="anonymous"></script>
<!-- Axios -->
<!-- <script src="https://cdnjs.cloudflare.com/ajax/libs/axios/0.21.1/axios.min.js" integrity="sha512-bZS47S7sPOxkjU/4Bt0zrhEtWx0y0CRkhEp8IckzK+ltifIIE9EMIMTuT/mEzoIMewUINruDBIR/jJnbguonqQ==" crossorigin="anonymous" referrerpolicy="no-referrer"></script> -->
<!-- Font-awesome -->
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.15.3/css/all.min.css"
integrity="sha512-iBBXm8fW90+nuLcSKlbmrPcLa0OT92xO1BIsZ+ywDWZCvqsWgccV3gFoRBv0z+8dLJgyAHIhR35VZc2oM/gI1w=="
crossorigin="anonymous" referrerpolicy="no-referrer"/>
<!-- Lodash -->
<script src="https://cdnjs.cloudflare.com/ajax/libs/lodash.js/4.17.21/lodash.min.js"
integrity="sha512-WFN04846sdKMIP5LKNphMaWzU7YpMyCU245etK3g/2ARYbPK9Ub18eG+ljU96qKRCWh+quCY7yefSmlkQw1ANQ=="
crossorigin="anonymous" referrerpolicy="no-referrer"></script>
<style>
.fa-active {
color: #337ab7;
}
.header-cell {
cursor: pointer;
}
.models-table thead th {
position: sticky;
top: 0;
z-index: 1;
background-color: #ffffff;
}
.info-icon {
color: #007bff;
}
.info-icon:hover {
color: #0056b3;
}
.info-icon-model {
padding-left: 10px;
}
.bs-popover-auto[x-placement^=bottom], .bs-popover-bottom {
margin-top: .5rem;
}
.popover {
max-width: 400px;
}
</style>
</head>
<body>
<div id="app">
<table class="table table-bordered table-sm">
<thead>
<tr>
<th class="header-cell" @click="sortAsc = (sortBy=='name') ? sortAsc = !sortAsc : false; sortBy='name'">
<i class="fas fa-active" v-if="sortBy == 'name'" v-bind:class="{ 'fa-sort-amount-up': !sortAsc, 'fa-sort-amount-down-alt': sortAsc }"></i>
Model Name
</th>
<th class="header-cell text-center" @click="sortAsc = (sortBy=='stsb') ? sortAsc = !sortAsc : false; sortBy='stsb'">
<i class="fas fa-active" v-if="sortBy == 'stsb'" v-bind:class="{ 'fa-sort-amount-up': !sortAsc, 'fa-sort-amount-down-alt': sortAsc }"></i>
STSb
<span class="info-icon" data-trigger="hover" data-toggle="popover" title="STSbenchmark" data-content="Spearman-rank correlation on the STSbenchmark test set. Higher = Better" data-html="true" data-placement="bottom"><i class="fas fa-info-circle"></i></span>
</th>
<th class="header-cell text-center" @click="sortAsc = (sortBy=='dupq') ? sortAsc = !sortAsc : false; sortBy='dupq'">
<i class="fas fa-active" v-if="sortBy == 'dupq'" v-bind:class="{ 'fa-sort-amount-up': !sortAsc, 'fa-sort-amount-down-alt': sortAsc }"></i>
DupQ
<span class="info-icon" data-trigger="hover" data-toggle="popover" title="Duplicate Questions" data-content="Combination of two datasets for duplicate questions detection:<br>Mean-Average-Precision on the Quora Duplicate Questions Semantic Search test set.<br>Average-Precision on the Sprint duplicate questions test set.<br> Higher = Better" data-html="true" data-placement="bottom"><i class="fas fa-info-circle"></i></span>
</th>
<th class="header-cell text-center" @click="sortAsc = (sortBy=='TwitterP') ? sortAsc = !sortAsc : false; sortBy='TwitterP'">
<i class="fas fa-active" v-if="sortBy == 'TwitterP'" v-bind:class="{ 'fa-sort-amount-up': !sortAsc, 'fa-sort-amount-down-alt': sortAsc }"></i>
TwitterP
<span class="info-icon" data-trigger="hover" data-toggle="popover" title="Twitter Paraphrases" data-content="A test to find tweets that are considered paraphrases. Combination of the SemEval2015 Tweet paraphrase test set and the Twitter-URL-Corpus test set. Performance is measured using Average Precision.<br> Higher = Better" data-html="true" data-placement="bottom"><i class="fas fa-info-circle"></i></span>
</th>
<th class="header-cell text-center" @click="sortAsc = (sortBy=='scidocs') ? sortAsc = !sortAsc : false; sortBy='scidocs'">
<i class="fas fa-active" v-if="sortBy == 'scidocs'" v-bind:class="{ 'fa-sort-amount-up': !sortAsc, 'fa-sort-amount-down-alt': sortAsc }"></i>
SciDocs
<span class="info-icon" data-trigger="hover" data-toggle="popover" title="SciDocs" data-content="A test to find similar scientific publications given a paper title. From SciDocs, we use the information which papers are often co-cited, co-read, or co-viewed. Performance is measured using MAP.<br> Higher = Better" data-html="true" data-placement="bottom"><i class="fas fa-info-circle"></i></span>
</th>
<th class="header-cell text-center" @click="sortAsc = (sortBy=='clustering') ? sortAsc = !sortAsc : false; sortBy='clustering'">
<i class="fas fa-active" v-if="sortBy == 'clustering'" v-bind:class="{ 'fa-sort-amount-up': !sortAsc, 'fa-sort-amount-down-alt': sortAsc }"></i>
Clustering
<span class="info-icon" data-trigger="hover" data-toggle="popover" title="Clustering" data-content="Here we test how well the embeddings can be used for clustering. We use three datasets: email subjects from 20NewsGroups, titles from 199 popular subreddits, questions from 121 StackExchanges. We cluster different sentence collections with sentences from 10-50 categories. Performance is measured using V-Measure. <br> Higher = Better" data-html="true" data-placement="bottom"><i class="fas fa-info-circle"></i></span>
</th>
<th class="header-cell text-center" @click="sortAsc = (sortBy=='final_score') ? sortAsc = !sortAsc : false; sortBy='final_score'">
<i class="fas fa-active" v-if="sortBy == 'final_score'" v-bind:class="{ 'fa-sort-amount-up': !sortAsc, 'fa-sort-amount-down-alt': sortAsc }"></i>
Avg. Performance
<span class="info-icon" data-trigger="hover" data-toggle="popover" title="Average Performance" data-content="Average Performance over all tasks.<br>Higher = Better" data-html="true" data-placement="bottom"><i class="fas fa-info-circle"></i></span>
</th>
<th class="header-cell text-center" @click="sortAsc = (sortBy=='speed') ? sortAsc = !sortAsc : false; sortBy='speed'">
<i class="fas fa-active" v-if="sortBy == 'speed'" v-bind:class="{ 'fa-sort-amount-up': !sortAsc, 'fa-sort-amount-down-alt': sortAsc }"></i>
Speed
<span class="info-icon" data-trigger="hover" data-toggle="popover" title="Encoding Speed" data-content="Encoding speed (sentences / sec) on a V100 GPU.<br>Higher = Better" data-html="true" data-placement="bottom"><i class="fas fa-info-circle"></i></span>
</th>
</tr>
</thead>
<tbody>
<tr v-for="item in sortedModels">
<td>
{{ item.name }}
<span class="info-icon info-icon-model" data-trigger="hover" data-toggle="popover" :title="item.name" :data-content="'<b>Base Model:</b> '+item.base_model+'<br><b>Pooling:</b> '+item.pooling+'<br><b>Training Data:</b> '+item.training_data+'<br><b>Dimensions:</b> '+item.dim+'<br><b>Size:</b> '+item.size+' MB'" data-html="true" data-placement="bottom"><i class="fas fa-info-circle"></i></span>
</td>
<td class="text-center">{{ item.stsb.toFixed(2) }}</td>
<td class="text-center"><span :title="'Quora: '+item.qqp.toFixed(2)+' Sprint: '+item.sprint.toFixed(2)">{{ item.dupq.toFixed(2) }}</span></td>
<td class="text-center">{{ item.TwitterP.toFixed(2) }}</td>
<td class="text-center">{{ item.scidocs.toFixed(2) }}</td>
<td class="text-center"><span :title="'Newsgroups: '+item.newsgroups.toFixed(2)+' Reddit: '+item.reddit.toFixed(2)+' StackExchange: '+item.stackexchange.toFixed(2)">{{ item.clustering.toFixed(2) }}</span></td>
<td class="text-center">{{ item.final_score.toFixed(2) }}</td>
<td class="text-center">{{ item.speed }}</td>
</tr>
</tbody>
</table>
</div>
<script>
var app = new Vue({
el: '#app',
data: {
models: [
{
"name": "stsb-mpnet-base-v2",
"base_model": "microsoft/mpnet-base",
"pooling": "Mean Pooling",
"training_data": "NLI+STSb",
"stsb": 88.57,
"qqp": 79.74,
"sprint": 90.34,
"TwitterP": 75.35,
"scidocs": 72.48,
"newsgroups": 31.56,
"reddit": 39.52,
"stackexchange": 46.41,
"speed": 2800,
"size": 386,
"dim": 768
},
{
"name": "stsb-roberta-base-v2",
"base_model": "roberta-base",
"pooling": "Mean Pooling",
"training_data": "NLI+STSb",
"stsb": 87.21,
"qqp": 78.68,
"sprint": 86.42,
"TwitterP": 73.44,
"scidocs": 69.83,
"newsgroups": 26.87,
"reddit": 36.91,
"stackexchange": 45.48,
"speed": 2300,
"size": 440,
"dim": 768
},
{
"name": "stsb-distilroberta-base-v2",
"base_model": "distilroberta-base",
"pooling": "Mean Pooling",
"training_data": "NLI+STSb",
"stsb": 86.41,
"qqp": 78.13,
"sprint": 87.28,
"TwitterP": 73.68,
"scidocs": 69.85,
"newsgroups": 28.63,
"reddit": 38.26,
"stackexchange": 46.16,
"speed": 4000,
"size": 292,
"dim": 768
},
{
"name": "nli-mpnet-base-v2",
"base_model": "microsoft/mpnet-base",
"pooling": "Mean Pooling",
"training_data": "NLI",
"stsb": 86.53,
"qqp": 80.65,
"sprint": 85.79,
"TwitterP": 76.24,
"scidocs": 72.90,
"newsgroups": 36.56,
"reddit": 42.68,
"stackexchange": 50.90,
"speed": 2800,
"size": 385,
"dim": 768
},
{
"name": "nli-roberta-base-v2",
"base_model": "roberta-base",
"pooling": "Mean Pooling",
"training_data": "NLI",
"stsb": 85.54,
"qqp": 78.73,
"sprint": 81.67,
"TwitterP": 74.28,
"scidocs": 69.86,
"newsgroups": 31.28,
"reddit": 39.58,
"stackexchange": 49.51,
"speed": 2300,
"size": 440,
"dim": 768
},
{
"name": "nli-distilroberta-base-v2",
"base_model": "distilroberta-base",
"pooling": "Mean Pooling",
"training_data": "NLI",
"stsb": 84.38,
"qqp": 78.47,
"sprint": 83.03,
"TwitterP": 73.86,
"scidocs": 70.23,
"newsgroups": 31.87,
"reddit": 39.12,
"stackexchange": 49.27,
"speed": 4000,
"size": 292,
"dim": 768
},
{
"name": "average_word_embeddings_glove.6B.300d",
"base_model": "Word Embeddings: GloVe",
"pooling": "Mean Pooling",
"training_data": "-",
"stsb": 61.77,
"qqp": 69.18,
"sprint": 86.96,
"TwitterP": 68.60,
"scidocs": 63.69,
"newsgroups": 26.65,
"reddit": 28.37,
"stackexchange": 36.37,
"speed": 34000,
"size": 422,
"dim": 300
},
{
"name": "average_word_embeddings_komninos",
"base_model": "Word Embeddings: Komninos et al.",
"pooling": "Mean Pooling",
"training_data": "-",
"stsb": 61.56,
"qqp": 69.83,
"sprint": 85.55,
"TwitterP": 71.23,
"scidocs": 65.25,
"newsgroups": 27.53,
"reddit": 29.54,
"stackexchange": 39.35,
"speed": 22000,
"size": 237,
"dim": 300
},
/*{
"name": "average_word_embeddings_levy_dependency",
"base_model": "Word Embeddings: Levy et al.",
"pooling": "Mean Pooling",
"training_data": "-",
"stsb": 59.22,
"qqp": 64.62,
"sprint": 80.12,
"TwitterP": 70.79,
"scidocs": 60.04 ,
"newsgroups": 22.72,
"reddit": 24.23,
"stackexchange": 33.66,
"speed": 22000,
"size": 186,
"dim": 300
},*/
{
"name": "paraphrase-MiniLM-L12-v2",
"base_model": "microsoft/MiniLM-L12-H384-uncased",
"pooling": "Mean Pooling",
"training_data": "AllNLI, sentence-compression, SimpleWiki, altlex, msmarco-triplets, quora_duplicates, coco_captions,flickr30k_captions, yahoo_answers_title_question, S2ORC_citation_pairs, stackexchange_duplicate_questions, wiki-atomic-edits",
"stsb": 84.41,
"qqp": 84.64,
"sprint": 89.91,
"TwitterP": 75.34,
"scidocs": 80.08,
"newsgroups": 41.81,
"reddit": 44.42,
"stackexchange": 54.63,
"speed": 7500,
"size": 118,
"dim": 384
},
{
"name": "paraphrase-MiniLM-L6-v2",
"base_model": "nreimers/MiniLM-L6-H384-uncased",
"pooling": "Mean Pooling",
"training_data": "AllNLI, sentence-compression, SimpleWiki, altlex, msmarco-triplets, quora_duplicates, coco_captions,flickr30k_captions, yahoo_answers_title_question, S2ORC_citation_pairs, stackexchange_duplicate_questions, wiki-atomic-edits",
"stsb": 84.12,
"qqp": 84.25,
"sprint": 90.21,
"TwitterP": 76.32,
"scidocs": 78.91,
"newsgroups": 40.16,
"reddit": 42.71,
"stackexchange": 53.14,
"speed": 14200,
"size": 80,
"dim": 384
},
{
"name": "paraphrase-MiniLM-L3-v2",
"base_model": "nreimers/MiniLM-L3-H384-uncased",
"pooling": "Mean Pooling",
"training_data": "AllNLI, sentence-compression, SimpleWiki, altlex, msmarco-triplets, quora_duplicates, coco_captions,flickr30k_captions, yahoo_answers_title_question, S2ORC_citation_pairs, stackexchange_duplicate_questions, wiki-atomic-edits",
"stsb": 82.41,
"qqp": 83.29,
"sprint": 92.9,
"TwitterP": 76.14,
"scidocs": 77.71,
"newsgroups": 37.73,
"reddit": 41.18,
"stackexchange": 51.25,
"speed": 19000,
"size": 61,
"dim": 384
},
{
"name": "paraphrase-distilroberta-base-v2",
"base_model": "distilroberta-base",
"pooling": "Mean Pooling",
"training_data": "AllNLI, sentence-compression, SimpleWiki, altlex, msmarco-triplets, quora_duplicates, coco_captions,flickr30k_captions, yahoo_answers_title_question, S2ORC_citation_pairs, stackexchange_duplicate_questions, wiki-atomic-edits",
"stsb": 85.37,
"qqp": 84.31,
"sprint": 89.64,
"TwitterP": 73.96,
"scidocs": 80.25,
"newsgroups": 42.12,
"reddit": 47.53,
"stackexchange": 57.90,
"speed": 4000,
"size": 292,
"dim": 768
},
{
"name": "paraphrase-TinyBERT-L6-v2",
"base_model": "https://huggingface.co/nreimers/TinyBERT_L-6_H-768_v2",
"pooling": "Mean Pooling",
"training_data": "AllNLI, sentence-compression, SimpleWiki, altlex, msmarco-triplets, quora_duplicates, coco_captions,flickr30k_captions, yahoo_answers_title_question, S2ORC_citation_pairs, stackexchange_duplicate_questions, wiki-atomic-edits",
"stsb": 84.91,
"qqp": 84.23,
"sprint": 89.62,
"TwitterP": 75.39,
"scidocs": 81.51,
"newsgroups": 43.82,
"reddit": 44.61,
"stackexchange": 55.69,
"speed": 4500,
"size": 238,
"dim": 768
},
{
"name": "paraphrase-mpnet-base-v2",
"base_model": "microsoft/mpnet-base",
"pooling": "Mean Pooling",
"training_data": "AllNLI, sentence-compression, SimpleWiki, altlex, msmarco-triplets, quora_duplicates, coco_captions,flickr30k_captions, yahoo_answers_title_question, S2ORC_citation_pairs, stackexchange_duplicate_questions, wiki-atomic-edits",
"stsb": 86.99,
"qqp": 84.93,
"sprint": 90.67,
"TwitterP": 76.05,
"scidocs": 80.57,
"newsgroups": 48.13,
"reddit": 50.52,
"stackexchange": 59.79,
"speed": 2800,
"size": 387,
"dim": 768
},
/*{
"name": "paraphrase-albert-base-v2",
"base_model": "albert-base-v2",
"pooling": "Mean Pooling",
"training_data": "AllNLI, sentence-compression, SimpleWiki, altlex, msmarco-triplets, quora_duplicates, coco_captions,flickr30k_captions, yahoo_answers_title_question, S2ORC_citation_pairs, stackexchange_duplicate_questions, wiki-atomic-edits",
"stsb": 83.38,
"qqp": 83.28,
"sprint": 90.45,
"TwitterP": 74.83,
"scidocs": 77.83,
"newsgroups": 39.88,
"reddit": 42.64,
"stackexchange": 52.95,
"speed": 2400,
"size": 43,
"dim": 768
},*/
{
"name": "paraphrase-albert-small-v2",
"base_model": "nreimers/albert-small-v2",
"pooling": "Mean Pooling",
"training_data": "AllNLI, sentence-compression, SimpleWiki, altlex, msmarco-triplets, quora_duplicates, coco_captions,flickr30k_captions, yahoo_answers_title_question, S2ORC_citation_pairs, stackexchange_duplicate_questions, wiki-atomic-edits",
"stsb": 83.40,
"qqp": 83.54,
"sprint": 89.6,
"TwitterP": 74.51,
"scidocs": 80.28,
"newsgroups": 40.54,
"reddit": 41.54,
"stackexchange": 52.74,
"speed": 5000,
"size": 43,
"dim": 768
},
{
"name": "paraphrase-multilingual-mpnet-base-v2",
"base_model": "Teacher: paraphrase-mpnet-base-v2; Student: xlm-roberta-base",
"pooling": "Mean Pooling",
"training_data": "Multi-lingual model of paraphrase-mpnet-base-v2, extended to 50+ languages.",
"stsb": 86.82,
"qqp": 83.91,
"sprint": 91.1,
"TwitterP": 76.52,
"scidocs": 78.66,
"newsgroups": 44.65,
"reddit": 45.01,
"stackexchange": 52.73,
"speed": 2500,
"size": 969,
"dim": 768
},
{
"name": "paraphrase-multilingual-MiniLM-L12-v2",
"base_model": "Teacher: paraphrase-MiniLM-L12-v2; Student: microsoft/Multilingual-MiniLM-L12-H384",
"pooling": "Mean Pooling",
"training_data": "Multi-lingual model of paraphrase-multilingual-MiniLM-L12-v2, extended to 50+ languages.",
"stsb": 84.42,
"qqp": 83.89,
"sprint": 91.15,
"TwitterP": 74.94,
"scidocs": 78.27,
"newsgroups": 40.36,
"reddit": 41.49,
"stackexchange": 49.75,
"speed": 7500,
"size": 418,
"dim": 384
},
{
"name": "distiluse-base-multilingual-cased-v1",
"base_model": "Teacher: mUSE; Student: distilbert-base-multilingual",
"pooling": "Mean Pooling",
"training_data": "Multi-Lingual model of Universal Sentence Encoder for 15 languages: Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish.",
"stsb": 80.62,
"qqp": 81.10,
"sprint": 88.54,
"TwitterP": 76.24,
"scidocs": 70.41,
"newsgroups": 32.97,
"reddit": 42.93,
"stackexchange": 44.30,
"speed": 4000,
"size": 482,
"dim": 512
},
{
"name": "distiluse-base-multilingual-cased-v2",
"base_model": "Teacher: mUSE; Student: distilbert-base-multilingual",
"pooling": "Mean Pooling",
"training_data": "Multi-Lingual model of Universal Sentence Encoder for 50 languages.",
"stsb": 80.75,
"qqp": 79.89,
"sprint": 87.15,
"TwitterP": 76.26,
"scidocs": 70.39,
"newsgroups": 29.96,
"reddit": 39.95,
"stackexchange": 41.19,
"speed": 4000,
"size": 481,
"dim": 512
}
],
sortBy: 'final_score',
sortAsc: false
},
methods: {
},
computed: {
sortedModels: function() {
//Add avg. for duplicate questions
let models_ext = this.models.map(function(elem, index) { elem.dupq = (elem.qqp + elem.sprint)/2.0; return elem;} );
//Add avg. for clustering
models_ext = models_ext.map(function(elem, index) { elem.clustering = (elem.newsgroups + elem.reddit + elem.stackexchange)/3.0; return elem;} );
//Final score
models_ext = models_ext.map(function(elem, index) { elem.final_score = (elem.stsb + elem.dupq + elem.TwitterP + elem.scidocs + elem.clustering)/ 5.0; return elem;} );
return _.orderBy(models_ext, (item) => item[this.sortBy] || (this.sortAsc ? 9999 : -9999), this.sortAsc ? 'asc' : 'desc')
}
}
})
</script>
<script>
$(function () {
$('[data-toggle="popover"]').popover()
})
</script>
</body>
</html>

BIN
docs/wechat.jpeg Normal file


21
examples/base_demo.py Normal file

@ -0,0 +1,21 @@
# -*- coding: utf-8 -*-
"""
@author:XuMing(xuming624@qq.com)
@description:
This basic example loads a pre-trained model from the web and uses it to
generate sentence embeddings for a given list of sentences.
"""
import sys
sys.path.append('..')
from similarities import BertSimilarity
if __name__ == '__main__':
    model = BertSimilarity("shibing624/text2vec-base-chinese")  # Chinese sentence embedding model (CoSENT)
    # Embed a list of sentences
    sentences = ['如何更换花呗绑定银行卡',
                 '花呗更改绑定银行卡']
    sentence_embeddings = model.encode(sentences)
    print(type(sentence_embeddings), sentence_embeddings.shape)
    similarity_score = model.similarity_score([sentences[0]], [sentences[1]])
    print(similarity_score)


@ -0,0 +1,19 @@
# -*- coding: utf-8 -*-
"""
@author:XuMing(xuming624@qq.com)
@description:
This basic example loads a pre-trained model from the web and uses it to
generate sentence embeddings for a given list of sentences.
"""
import sys
sys.path.append('..')
from similarities import BertSimilarity
model = BertSimilarity("shibing624/text2vec-base-chinese")  # Chinese sentence embedding model (CoSENT)
# Embed a list of sentences
sentences = ['如何更换花呗绑定银行卡',
'花呗更改绑定银行卡']
sentence_embeddings = model.encode(sentences)
print(type(sentence_embeddings), sentence_embeddings.shape)

40
examples/gradio_demo.py Normal file

@ -0,0 +1,40 @@
# -*- coding: utf-8 -*-
"""
@author:XuMing(xuming624@qq.com)
@description: pip install gradio
"""
import gradio as gr
from similarities import BertSimilarity
# Chinese sentence embedding model (CoSENT)
sim_model = BertSimilarity(model_name_or_path='shibing624/text2vec-base-chinese')


def ai_text(sentence1, sentence2):
    score = sim_model.similarity_score(sentence1, sentence2)
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentence1, sentence2, score))
    return score


if __name__ == '__main__':
    examples = [
        ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡'],
        ['我在北京打篮球', '我是北京人,我喜欢篮球'],
        ['一个女人在看书。', '一个女人在揉面团'],
        ['一个男人在车库里举重。', '一个人在举重。'],
    ]
    input1 = gr.inputs.Textbox(lines=2, placeholder="Enter First Sentence")
    input2 = gr.inputs.Textbox(lines=2, placeholder="Enter Second Sentence")
    output_text = gr.outputs.Textbox()
    gr.Interface(ai_text,
                 inputs=[input1, input2],
                 outputs=[output_text],
                 # theme="grass",
                 title="Chinese Text Matching Model shibing624/text2vec-base-chinese",
                 description="Copy or input Chinese text here. Submit and the machine will calculate the cosine score.",
                 article="Link to <a href='https://github.com/shibing624/similarities' style='color:blue;' target='_blank'>Github REPO</a>",
                 examples=examples
                 ).launch()

8
requirements.txt Normal file

@ -0,0 +1,8 @@
jieba>=0.39
loguru
transformers>=4.6.0
tokenizers>=0.10.3
tqdm
numpy
scikit-learn
gensim>=4.0.0

51
setup.py Normal file

@ -0,0 +1,51 @@
# -*- coding: utf-8 -*-
import sys
from setuptools import setup, find_packages
# Avoids IDE errors, but actual version is read from version.py
__version__ = None
exec(open('similarities/version.py').read())
if sys.version_info < (3,):
    sys.exit('Sorry, Python3 is required.')

with open('README.md', 'r', encoding='utf-8') as f:
    readme = f.read()

with open('requirements.txt', 'r', encoding='utf-8') as f:
    reqs = f.read()

setup(
    name='similarities',
    version=__version__,
    description='Similarities is a toolkit for computing similarity scores between two sets of strings.',
    long_description=readme,
    long_description_content_type='text/markdown',
    author='XuMing',
    author_email='xuming624@qq.com',
    url='https://github.com/shibing624/similarities',
    license="Apache License 2.0",
    zip_safe=False,
    python_requires=">=3.6.0",
    classifiers=[
        "Development Status :: 5 - Production/Stable",
        "Intended Audience :: Developers",
        "Intended Audience :: Education",
        "Intended Audience :: Science/Research",
        "License :: OSI Approved :: Apache Software License",
        "Operating System :: OS Independent",
        "Programming Language :: Python :: 3",
        "Programming Language :: Python :: 3.6",
        "Programming Language :: Python :: 3.7",
        "Programming Language :: Python :: 3.8",
        "Programming Language :: Python :: 3.9",
        "Topic :: Scientific/Engineering :: Artificial Intelligence",
    ],
    keywords='similarities,Chinese Text Similarity Calculation Tool,similarity,word2vec',
    install_requires=reqs.strip().split('\n'),
    packages=find_packages(exclude=['tests']),
    package_dir={'similarities': 'similarities'},
    package_data={'similarities': ['*.*', '../LICENSE', '../README.*', '../*.txt', 'utils/*',
                                   'data/*', ]},
)

7
similarities/__init__.py Normal file

@ -0,0 +1,7 @@
# -*- coding: utf-8 -*-
"""
@author:XuMing(xuming624@qq.com)
@description:
"""
from .similarity import BertSimilarity


@ -0,0 +1,34 @@
# -*- coding: utf-8 -*-
"""
@author:XuMing(xuming624@qq.com)
@description:
"""
from typing import List, Union, Optional
import numpy as np
from numpy import ndarray
from torch import Tensor
from loguru import logger
class BertSimilarity:
    def __init__(self, model_name_or_path=''):
        """
        Compute text similarity.
        :param model_name_or_path: name or path of the sentence embedding model
        """
        self.model_name_or_path = model_name_or_path
        self.model = None

    def encode(self, sentences: Union[List[str], str]) -> ndarray:
        # Placeholder: returns an empty array until the embedding model is wired in
        return np.array([])

    def similarity_score(self, sentences1: Union[List[str], str], sentences2: Union[List[str], str]):
        """
        Get similarity scores between sentences1 and sentences2
        :param sentences1: list, sentence1 list
        :param sentences2: list, sentence2 list
        :return: Matrix with res[i][j] = cos_sim(a[i], b[j])
        """
        # Placeholder: returns 0.0 until the embedding model is wired in
        return 0.0

7
similarities/version.py Normal file

@ -0,0 +1,7 @@
# -*- coding: utf-8 -*-
"""
@author:XuMing(xuming624@qq.com)
@description:
"""
__version__ = '0.0.1'

27
tests/test_sim_score.py Normal file

@ -0,0 +1,27 @@
# -*- coding: utf-8 -*-
"""
@author:XuMing(xuming624@qq.com)
@description:
"""
import sys
import unittest
sys.path.append('..')
from similarities import BertSimilarity
bert_model = BertSimilarity()
class IssueTestCase(unittest.TestCase):
    def test_sim_diff(self):
        a = '研究团队面向国家重大战略需求追踪国际前沿发展借鉴国际人工智能研究领域的科研模式有效整合创新资源解决复'
        b = '英汉互译比较语言学'
        r = bert_model.similarity_score(a, b)
        print(a, b, r)
        self.assertTrue(abs(r - 0.1733) < 0.001)


if __name__ == '__main__':
    unittest.main()