update termsim.

This commit is contained in:
shibing624 2022-03-05 03:10:45 +08:00
parent 413796bdfc
commit be6af69acc
12 changed files with 1543 additions and 68 deletions

View File

@ -3,6 +3,8 @@ message: "If you use this software, please cite it as below."
authors:
- family-names: "Xu"
given-names: "Ming"
title: "Similarities: Compute Similarity Score for humans"
orcid: "https://orcid.org/0000-0003-3402-7159"
title: "Similarities: Compute similarity score for humans"
url: "https://github.com/shibing624/similarities"
date-released: 2022-02-28
version: 0.0.3

View File

@ -49,45 +49,6 @@ Similarities is a toolkit for computing similarity scores between texts.
| GloVe | glove | Avg_word_embeddings_glove_6B_300d | 61.77 |
| BERT | bert-base-uncased | BERT-base-cls | 20.29 |
| BERT | bert-base-uncased | BERT-base-first_last_avg | 59.04 |
| BERT | bert-base-uncased | BERT-base-first_last_avg-whiten(NLI) | 63.65 |
| SBERT | sentence-transformers/bert-base-nli-mean-tokens | SBERT-base-nli-cls | 73.65 |
| SBERT | sentence-transformers/bert-base-nli-mean-tokens | SBERT-base-nli-first_last_avg | 77.96 |
| SBERT | xlm-roberta-base | paraphrase-multilingual-MiniLM-L12-v2 | 84.42 |
| CoSENT | bert-base-uncased | CoSENT-base-first_last_avg | 69.93 |
| CoSENT | sentence-transformers/bert-base-nli-mean-tokens | CoSENT-base-nli-first_last_avg | 79.68 |
- Evaluation results on Chinese matching datasets:
| Arch | Backbone | Model Name | ATEC | BQ | LCQMC | PAWSX | STS-B | Avg | QPS |
| :-- | :--- | :--- | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| CoSENT | hfl/chinese-macbert-base | CoSENT-macbert-base | 50.39 | **72.93** | **79.17** | **60.86** | **80.51** | **68.77** | 2572 |
| CoSENT | Langboat/mengzi-bert-base | CoSENT-mengzi-base | **50.52** | 72.27 | 78.69 | 12.89 | 80.15 | 58.90 | 2502 |
| CoSENT | bert-base-chinese | CoSENT-bert-base | 49.74 | 72.38 | 78.69 | 60.00 | 80.14 | 68.19 | 2653 |
| SBERT | bert-base-chinese | SBERT-bert-base | 46.36 | 70.36 | 78.72 | 46.86 | 66.41 | 61.74 | 1365 |
| SBERT | hfl/chinese-macbert-base | SBERT-macbert-base | 47.28 | 68.63 | **79.42** | 55.59 | 64.82 | 63.15 | 1948 |
| CoSENT | hfl/chinese-roberta-wwm-ext | CoSENT-roberta-ext | **50.81** | **71.45** | **79.31** | **61.56** | **81.13** | **68.85** | - |
| SBERT | hfl/chinese-roberta-wwm-ext | SBERT-roberta-ext | 48.29 | 69.99 | 79.22 | 44.10 | 72.42 | 62.80 | - |
- Chinese matching evaluation results of the models released by this project:
| Arch | Backbone | Model Name | ATEC | BQ | LCQMC | PAWSX | STS-B | Avg | QPS |
| :-- | :--- | :---- | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| Word2Vec | word2vec | w2v-light-tencent-chinese | 20.00 | 31.49 | 59.46 | 2.57 | 55.78 | 33.86 | 10283 |
| SBERT | xlm-roberta-base | paraphrase-multilingual-MiniLM-L12-v2 | 18.42 | 38.52 | 63.96 | 10.14 | 78.90 | 41.99 | 2371 |
| CoSENT | hfl/chinese-macbert-base | similarities-base-chinese | 31.93 | 42.67 | 70.16 | 17.21 | 79.30 | **48.25** | 2572 |
Notes:
- All result values are Spearman correlation coefficients.
- Each result is obtained by training only on the dataset's train split and evaluating on its test split; no external data is used.
- The model name `paraphrase-multilingual-MiniLM-L12-v2` is short for `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`, the multilingual version of `paraphrase-MiniLM-L12-v2`; it is fast, performs well, and supports Chinese.
- `CoSENT-macbert-base` reaches SOTA performance for its parameter scale. It is trained with the CoSENT method; the results can be reproduced by running the code under [similarities/cosent](similarities/cosent).
- `SBERT-macbert-base` is trained with the SBERT method; the results can be reproduced by running the code under [similarities/sentence_bert](similarities/sentence_bert).
- `similarities-base-chinese` is trained with the CoSENT method on the Chinese STS-B data, based on MacBERT; the model file has been uploaded to the HuggingFace model hub: [shibing624/similarities-base-chinese](https://huggingface.co/shibing624/similarities-base-chinese).
- `w2v-light-tencent-chinese` is a Word2Vec model built from the Tencent word embeddings; it is loaded and used on CPU.
- All pretrained models can be loaded via transformers, e.g. the MacBERT model: `--pretrained_model_path hfl/chinese-macbert-base`.
- The Chinese matching datasets can be downloaded from the [links below](#dataset).
- Experiments on the Chinese matching tasks show that `first_last_avg` pooling works best; at prediction time, SBert's `mean pooling` can be used instead with only a small loss in performance.
- QPS is measured on GPU: a Tesla V100 with 32GB memory.
# Demo
@ -111,11 +72,6 @@ cd similarities
python3 setup.py install
```
### Dataset
Common Chinese semantic matching datasets, covering 5 tasks: [ATEC](https://github.com/IceFlameWorm/NLP_Datasets/tree/master/ATEC), [BQ](http://icrc.hitsz.edu.cn/info/1037/1162.htm), [LCQMC](http://icrc.hitsz.edu.cn/Article/show/171.html), [PAWSX](https://arxiv.org/abs/1908.11828), and [STS-B](https://github.com/pluto-junzeng/CNSD).
You can download each dataset from its link, or download them all from [Baidu Netdisk (extraction code: qkt6)](https://pan.baidu.com/s/1d6jSiU1wHQAEMWJi7JJWCQ).
The senteval_cn directory collects all evaluation datasets, and senteval_cn.zip is an archive of that directory; downloading either one is enough.
# Usage
@ -124,7 +80,7 @@ python3 setup.py install
### 2. Compute similarity scores between sentences
Example: [semantic_text_similarity.py](./examples/semantic_text_similarity.py)
Example: [examples/semantic_text_similarity.py](./examples/semantic_text_similarity.py)
> The sentence cosine similarity `score` is in the range [-1, 1]; the larger the value, the more similar the sentences.
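A minimal sketch of the idea (this is not the content of the example script; it assumes the `text2vec` Word2Vec encoder and the `cos_sim` helper added in this commit):

```python
# Rough sketch (not the official example script): encode two sentences, then score them.
from text2vec import Word2Vec
from similarities.similarity import cos_sim

model = Word2Vec()
emb1 = model.encode("如何更换花呗绑定银行卡")
emb2 = model.encode("花呗更改绑定银行卡")
print(cos_sim(emb1, emb2))  # score in [-1, 1]; higher means more similar
```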
@ -144,7 +100,7 @@ python3 setup.py install
- Issues (suggestions): [![GitHub issues](https://img.shields.io/github/issues/shibing624/similarities.svg)](https://github.com/shibing624/similarities/issues)
- Email me, xuming: xuming624@qq.com
- WeChat me:
Add me on *WeChat ID: xuming624, note: personal name-company-NLP* to join the NLP discussion group.
Add me on *WeChat ID: xuming624, note: name-company-NLP* to join the NLP discussion group.
<img src="docs/wechat.jpeg" width="200" />
@ -154,10 +110,10 @@ python3 setup.py install
If you use similarities in your research, please cite it in the following format:
```latex
@misc{similarities,
@software{similarities,
title={similarities: A Tool for Compute Similarity Score},
author={Ming Xu},
howpublished={https://github.com/shibing624/similarities},
url={https://github.com/shibing624/similarities},
year={2022}
}
```
@ -178,5 +134,4 @@ python3 setup.py install
# Reference
- [A Simple but Tough-to-Beat Baseline for Sentence Embeddings[Sanjeev Arora and Yingyu Liang and Tengyu Ma, 2017]](https://openreview.net/forum?id=SyK00v5xx)
- [A Comparison of Four Methods for Computing Text Similarity [Yves Peirsman]](https://zhuanlan.zhihu.com/p/37104535)
- [On Text Matching and Multi-turn Retrieval](https://zhuanlan.zhihu.com/p/111769969)
- [liuhuanyong/SentenceSimilarity](https://github.com/liuhuanyong/SentenceSimilarity)

View File

@ -1,7 +1,9 @@
# -*- coding: utf-8 -*-
"""
@author:XuMing(xuming624@qq.com)
@description:
@description:
This package contains implementations of pairwise similarity queries.
"""
from .similarity import BertSimilarity
# bring classes directly into package namespace, to save some typing

View File

@ -6,29 +6,246 @@
from typing import List, Union, Optional
import numpy as np
from numpy import ndarray
from torch import Tensor
import scipy
from loguru import logger
import torch
import logging
import scipy.sparse
from gensim import utils, matutils
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
from enum import Enum, unique
class BertSimilarity:
def __init__(self, model_name_or_path=''):
def cos_sim(v1: Union[torch.Tensor, np.ndarray], v2: Union[torch.Tensor, np.ndarray]):
"""
Computes the cosine similarity cos_sim(a[i], b[j]) for all i and j.
:return: Matrix with res[i][j] = cos_sim(a[i], b[j])
"""
if not isinstance(v1, torch.Tensor):
v1 = torch.tensor(v1)
if not isinstance(v2, torch.Tensor):
v2 = torch.tensor(v2)
if len(v1.shape) == 1:
v1 = v1.unsqueeze(0)
if len(v2.shape) == 1:
v2 = v2.unsqueeze(0)
v1_norm = torch.nn.functional.normalize(v1, p=2, dim=1)
v2_norm = torch.nn.functional.normalize(v2, p=2, dim=1)
return torch.mm(v1_norm, v2_norm.transpose(0, 1))
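# Example (sketch): cos_sim accepts 1-D or 2-D inputs and always returns a score matrix, e.g.
#   cos_sim(np.array([1.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]))  # -> tensor([[1., 0.]])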
class EncoderType(Enum):
FIRST_LAST_AVG = 0
LAST_AVG = 1
CLS = 2
POOLER = 3
MEAN = 4
def __str__(self):
return self.name
@staticmethod
def from_string(s):
try:
return EncoderType[s]
except KeyError:
raise ValueError()
class Similarity:
"""
Compute cosine similarity of a dynamic query against a corpus of documents ('the index').
The index supports adding new documents dynamically.
"""
def __init__(self, model_name_or_path=None, docs=None):
"""
Calculate text similarity
:param similarity_type:
:param embedding_type:
Parameters
----------
model_name_or_path : str
Name or path of the model to load.
docs : iterable of str
Corpus of documents to index.
"""
self.model_name_or_path = model_name_or_path
self.model = None
logger.debug(f'Loading model {model_name_or_path}')
logger.debug(f"Device: {device}")
def encode(self, sentences: Union[List[str], str]) -> ndarray:
return np.array([])
self.normalize = True
self.keyedvectors = None
self.docs = []
self.norm = False
if docs is not None:
self.add_documents(docs)
def similarity_score(self, sentences1: Union[List[str], str], sentences2: Union[List[str], str]):
def __len__(self):
"""Get length of index."""
return len(self.docs)
def __str__(self):
return "%s" % (self.__class__.__name__)
def add_documents(self, corpus):
"""Extend the index with new documents.
Parameters
----------
corpus : iterable of str
Documents to add to the index.
"""
Get similarity scores between sentences1 and sentences2
:param sentences1: list, sentence1 list
:param sentences2: list, sentence2 list
for doc in corpus:
self.docs.append(doc)
if len(self.docs) % 10000 == 0:
logger.info("PROGRESS: fresh_shard size=%i", len(self.docs))
def get_vector(self, text, norm=False):
"""Get the key's vector, as a 1D numpy array.
Parameters
----------
text : str
Key for vector to return.
norm : bool, optional
If True, the resulting vector will be L2-normalized (unit Euclidean length).
Returns
-------
numpy.ndarray
Vector for the specified key.
Raises
------
KeyError
If the given key doesn't exist.
"""
pass
def similarity(
self, text1: Union[List[str], str], text2: Union[List[str], str]
) -> Union[np.ndarray, torch.Tensor]:
"""
Compute similarity between two list of texts.
:param text1: list, sentence1 list
:param text2: list, sentence2 list
:return: Matrix with res[i][j] = cos_sim(a[i], b[j])
"""
return 0.0
if not text1 or not text2:
return np.array([])
if isinstance(text1, str):
text1 = [text1] # type: ignore
if isinstance(text2, str):
text2 = [text2] # type: ignore
pass
def distance(self, text1: Union[List[str], str], text2: Union[List[str], str]):
"""Compute cosine distance between two keys.
Calculate 1 - :meth:`~gensim.models.keyedvectors.KeyedVectors.similarity`.
Parameters
----------
w1 : str
Input key.
w2 : str
Input key.
Returns
-------
float
Distance between `w1` and `w2`.
"""
return 1 - self.similarity(text1, text2)
def most_similar(self, query: Union[List[str], str], topn=10, threshold=0, exponent=2.0):
"""
Get the topn most similar texts
:param query: str, query text
:param topn: int, number of results to return
:return: list, topn most similar texts
"""
if query not in self.keyedvectors:
logger.debug('an out-of-dictionary term "%s"', query)
else:
most_similar = self.keyedvectors.most_similar(query, topn=topn)
for t2, similarity in most_similar:
if similarity > threshold:
yield (t2, similarity ** exponent)
def semantic_search(
self,
query_embeddings: Union[torch.Tensor, np.ndarray],
corpus_embeddings: Union[torch.Tensor, np.ndarray],
query_chunk_size: int = 100,
corpus_chunk_size: int = 500000,
top_k: int = 10,
score_function=cos_sim
):
"""
This function performs a cosine similarity search between a list of query embeddings and a list of corpus embeddings.
It can be used for Information Retrieval / Semantic Search for corpora up to about 1 Million entries.
:param query_embeddings: A 2 dimensional tensor with the query embeddings.
:param corpus_embeddings: A 2 dimensional tensor with the corpus embeddings.
:param query_chunk_size: Process this many queries at a time. Increasing the value increases speed but requires more memory.
:param corpus_chunk_size: Scan the corpus in chunks of this many entries. Increasing the value increases speed but requires more memory.
:param top_k: Retrieve the top k matching entries.
:param score_function: Function for computing scores. By default, cosine similarity.
:return: Returns a sorted list with decreasing cosine similarity scores. Entries are dictionaries with the keys 'corpus_id' and 'score'
"""
if isinstance(query_embeddings, (np.ndarray, np.generic)):
query_embeddings = torch.from_numpy(query_embeddings)
elif isinstance(query_embeddings, list):
query_embeddings = torch.stack(query_embeddings)
if len(query_embeddings.shape) == 1:
query_embeddings = query_embeddings.unsqueeze(0)
if isinstance(corpus_embeddings, (np.ndarray, np.generic)):
corpus_embeddings = torch.from_numpy(corpus_embeddings)
elif isinstance(corpus_embeddings, list):
corpus_embeddings = torch.stack(corpus_embeddings)
# Check that corpus and queries are on the same device
query_embeddings = query_embeddings.to(device)
corpus_embeddings = corpus_embeddings.to(device)
queries_result_list = [[] for _ in range(len(query_embeddings))]
for query_start_idx in range(0, len(query_embeddings), query_chunk_size):
# Iterate over chunks of the corpus
for corpus_start_idx in range(0, len(corpus_embeddings), corpus_chunk_size):
# Compute cosine similarity
cos_scores = score_function(query_embeddings[query_start_idx:query_start_idx + query_chunk_size],
corpus_embeddings[corpus_start_idx:corpus_start_idx + corpus_chunk_size])
# Get top-k scores
cos_scores_top_k_values, cos_scores_top_k_idx = torch.topk(cos_scores, min(top_k, len(cos_scores[0])),
dim=1, largest=True, sorted=False)
cos_scores_top_k_values = cos_scores_top_k_values.cpu().tolist()
cos_scores_top_k_idx = cos_scores_top_k_idx.cpu().tolist()
for query_itr in range(len(cos_scores)):
for sub_corpus_id, score in zip(cos_scores_top_k_idx[query_itr],
cos_scores_top_k_values[query_itr]):
corpus_id = corpus_start_idx + sub_corpus_id
query_id = query_start_idx + query_itr
queries_result_list[query_id].append({'corpus_id': corpus_id, 'score': score})
# Sort and strip to top_k results
for idx in range(len(queries_result_list)):
queries_result_list[idx] = sorted(queries_result_list[idx], key=lambda x: x['score'], reverse=True)
queries_result_list[idx] = queries_result_list[idx][0:top_k]
return queries_result_list
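# Usage sketch for semantic_search (assumes 2-D float embeddings produced by the same encoder):
#   hits = Similarity().semantic_search(query_embeddings, corpus_embeddings, top_k=3)
#   hits[0]  # -> [{'corpus_id': 12, 'score': 0.93}, ...], the best matches for the first query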

similarities/termsim.py (new file, 591 lines)
View File

@ -0,0 +1,591 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Copyright (C) 2018 Vit Novotny <witiko@mail.muni.cz>
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html
"""
This module provides classes that deal with term similarities.
Adjust the Index to compute term similarities.
"""
import math
from loguru import logger
from typing import Dict, List, Tuple, Set, Optional, Union
import numpy as np
import torch
import jieba
import jieba.posseg
from text2vec import Word2Vec
from similarities.similarity import cos_sim, Similarity
import os
from similarities.utils.distance import cosine_distance
from simhash import Simhash
from similarities.utils.tfidf import TFIDF
pwd_path = os.path.dirname(os.path.abspath(__file__))
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
class WordEmbeddingSimilarity(object):
"""
Computes cosine similarities between word embeddings and retrieves most
similar terms for a given term.
Notes
-----
By fitting the word embeddings to a vocabulary that you will be using, you
can eliminate all out-of-vocabulary (OOV) words that you would otherwise
receive from the `most_similar` method. In subword models such as fastText,
this procedure will also infer word-vectors for words from your vocabulary
that previously had no word-vector.
Parameters
----------
keyedvectors : :class:`~text2vec.Word2Vec`
The word embeddings.
docs: list of str
"""
def __init__(self, keyedvectors: Word2Vec, docs: List[str] = None):
# super().__init__()
self.keyedvectors = keyedvectors
self.docs = []
self.docs_embeddings = np.array([])
if docs is not None:
self.add_documents(docs)
def __len__(self):
"""Get length of index."""
return self.docs_embeddings.shape[0]
def __str__(self):
return "%s" % (self.__class__.__name__)
def add_documents(self, docs):
"""Extend the index with new documents.
Parameters
----------
docs : iterable of list of str
"""
self.docs += docs
docs_embeddings = self.get_vector(docs)
if self.docs_embeddings.size > 0:
self.docs_embeddings = np.vstack((self.docs_embeddings, docs_embeddings))
else:
self.docs_embeddings = docs_embeddings
logger.info(f"Add docs size: {len(docs)}, total size: {len(self.docs)}")
def get_vector(self, text):
return self.keyedvectors.encode(text)
def similarity(self, text1, text2, score_function=cos_sim):
text_emb1 = self.get_vector(text1)
text_emb2 = self.get_vector(text2)
return score_function(text_emb1, text_emb2)
def distance(self, text1, text2):
"""Compute cosine distance between two keys.
Calculate 1 - :meth:`~gensim.models.keyedvectors.KeyedVectors.similarity`.
Parameters
----------
w1 : str
Input key.
w2 : str
Input key.
Returns
-------
float
Distance between `w1` and `w2`.
"""
return 1 - self.similarity(text1, text2)
def semantic_search(
self,
query_embeddings: Union[torch.Tensor, np.ndarray],
corpus_embeddings: Union[torch.Tensor, np.ndarray],
query_chunk_size: int = 100,
corpus_chunk_size: int = 500000,
top_k: int = 10,
score_function=cos_sim
):
"""
This function performs a cosine similarity search between a list of query embeddings and a list of corpus embeddings.
It can be used for Information Retrieval / Semantic Search for corpora up to about 1 Million entries.
:param query_embeddings: A 2 dimensional tensor with the query embeddings.
:param corpus_embeddings: A 2 dimensional tensor with the corpus embeddings.
:param query_chunk_size: Process this many queries at a time. Increasing the value increases speed but requires more memory.
:param corpus_chunk_size: Scan the corpus in chunks of this many entries. Increasing the value increases speed but requires more memory.
:param top_k: Retrieve the top k matching entries.
:param score_function: Function for computing scores. By default, cosine similarity.
:return: Returns a sorted list with decreasing cosine similarity scores. Entries are dictionaries with the keys 'corpus_id' and 'score'
"""
if isinstance(query_embeddings, (np.ndarray, np.generic)):
query_embeddings = torch.from_numpy(query_embeddings)
elif isinstance(query_embeddings, list):
query_embeddings = torch.stack(query_embeddings)
if len(query_embeddings.shape) == 1:
query_embeddings = query_embeddings.unsqueeze(0)
if isinstance(corpus_embeddings, (np.ndarray, np.generic)):
corpus_embeddings = torch.from_numpy(corpus_embeddings)
elif isinstance(corpus_embeddings, list):
corpus_embeddings = torch.stack(corpus_embeddings)
# Check that corpus and queries are on the same device
query_embeddings = query_embeddings.to(device)
corpus_embeddings = corpus_embeddings.to(device)
queries_result_list = [[] for _ in range(len(query_embeddings))]
for query_start_idx in range(0, len(query_embeddings), query_chunk_size):
# Iterate over chunks of the corpus
for corpus_start_idx in range(0, len(corpus_embeddings), corpus_chunk_size):
# Compute cosine similarity
cos_scores = score_function(query_embeddings[query_start_idx:query_start_idx + query_chunk_size],
corpus_embeddings[corpus_start_idx:corpus_start_idx + corpus_chunk_size])
# Get top-k scores
cos_scores_top_k_values, cos_scores_top_k_idx = torch.topk(cos_scores, min(top_k, len(cos_scores[0])),
dim=1, largest=True, sorted=False)
cos_scores_top_k_values = cos_scores_top_k_values.cpu().tolist()
cos_scores_top_k_idx = cos_scores_top_k_idx.cpu().tolist()
for query_itr in range(len(cos_scores)):
for sub_corpus_id, score in zip(cos_scores_top_k_idx[query_itr],
cos_scores_top_k_values[query_itr]):
corpus_id = corpus_start_idx + sub_corpus_id
query_id = query_start_idx + query_itr
queries_result_list[query_id].append({'corpus_id': corpus_id, 'score': score})
# Sort and strip to top_k results
for idx in range(len(queries_result_list)):
queries_result_list[idx] = sorted(queries_result_list[idx], key=lambda x: x['score'], reverse=True)
queries_result_list[idx] = queries_result_list[idx][0:top_k]
return queries_result_list
def most_similar(self, query, topn=10):
result = []
query_embeddings = self.get_vector(query)
hits = self.semantic_search(query_embeddings, self.docs_embeddings, top_k=topn)
hits = hits[0] # Get the hits for the first query
print("Input question:", query)
for hit in hits[0:topn]:
result.append((self.docs[hit['corpus_id']], round(hit['score'], 4)))
print("\t{:.3f}\t{}".format(hit['score'], self.docs[hit['corpus_id']]))
print("\n\n========\n")
return result
class CilinSimilarity(object):
"""
Computes similarities between texts based on the Cilin (词林) semantic thesaurus
and retrieves the most similar texts for a given query.
"""
default_cilin_path = os.path.join(pwd_path, 'data', 'cilin.txt')
def __init__(self, cilin_path: str = default_cilin_path, docs: List[str] = None):
super().__init__()
self.cilin_dict = self.load_cilin_dict(cilin_path) # Cilin(词林) semantic dictionary
self.docs = []
if docs is not None:
self.add_documents(docs)
def __len__(self):
"""Get length of index."""
return len(self.docs)
def __str__(self):
return "%s" % (self.__class__.__name__)
def add_documents(self, docs):
"""Extend the index with new documents.
Parameters
----------
docs : iterable of list of str
"""
self.docs += docs
logger.info(f"Add docs size: {len(docs)}, total size: {len(self.docs)}")
@staticmethod
def load_cilin_dict(path):
"""加载词林语义词典"""
sem_dict = {}
for line in open(path, 'r', encoding='utf-8'):
line = line.strip()
terms = line.split(' ')
sem_type = terms[0]
words = terms[1:]
for word in words:
if word not in sem_dict:
sem_dict[word] = sem_type
else:
sem_dict[word] += ';' + sem_type
for word, sem_type in sem_dict.items():
sem_dict[word] = sem_type.split(';')
return sem_dict
def _compute_word_sim(self, word1, word2):
"""
Compute the similarity between two words, taking the max over all sense pairs.
:param word1:
:param word2:
:return:
"""
sems_word1 = self.cilin_dict.get(word1, [])
sems_word2 = self.cilin_dict.get(word2, [])
score_list = [self._compute_sem(sem_word1, sem_word2) for sem_word1 in sems_word1 for sem_word2 in sems_word2]
if score_list:
return max(score_list)
else:
return 0
def _compute_sem(self, sem1, sem2):
"""
Compute word similarity based on Cilin semantic codes.
:param sem1:
:param sem2:
:return:
"""
sem1 = [sem1[0], sem1[1], sem1[2:4], sem1[4], sem1[5:7], sem1[-1]]
sem2 = [sem2[0], sem2[1], sem2[2:4], sem2[4], sem2[5:7], sem2[-1]]
score = 0
for index in range(len(sem1)):
if sem1[index] == sem2[index]:
if index in [0, 1]:
score += 3
elif index == 2:
score += 2
elif index in [3, 4]:
score += 1
return score / 10
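# Worked example (sketch): for Cilin codes 'Aa01A01=' and 'Aa01B02#', the first three
# segments ('A', 'a', '01') match, so the score is (3 + 3 + 2) / 10 = 0.8.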
def similarity(self, text1, text2):
"""
Compute sentence similarity from word-level similarities.
:param text1:
:param text2:
:return:
"""
words1 = [word.word for word in jieba.posseg.cut(text1) if word.flag[0] not in ['u', 'x', 'w']]
words2 = [word.word for word in jieba.posseg.cut(text2) if word.flag[0] not in ['u', 'x', 'w']]
score_words1 = []
score_words2 = []
for word1 in words1:
score = max(self._compute_word_sim(word1, word2) for word2 in words2)
score_words1.append(score)
for word2 in words2:
score = max(self._compute_word_sim(word2, word1) for word1 in words1)
score_words2.append(score)
similarity_score = max(sum(score_words1) / len(words1), sum(score_words2) / len(words2))
return similarity_score
def distance(self, text1, text2):
"""Compute cosine distance between two keys."""
return 1 - self.similarity(text1, text2)
def most_similar(self, query, topn=10):
result = []
for doc in self.docs:
score = self.similarity(query, doc)
result.append((doc, round(score, 4)))
result.sort(key=lambda x: x[1], reverse=True)
return result[:topn]
class HownetSimilarity(object):
"""
Computes similarities between texts based on the HowNet semantic dictionary
and retrieves the most similar texts for a given query.
"""
default_hownet_path = os.path.join(pwd_path, 'data', 'hownet.dat')
def __init__(self, hownet_path: str = default_hownet_path, docs: List[str] = None):
super().__init__()
self.hownet_dict = self.load_hownet_dict(hownet_path)  # HowNet semantic dictionary
self.docs = []
if docs is not None:
self.add_documents(docs)
def __len__(self):
"""Get length of index."""
return len(self.docs)
def __str__(self):
return "%s" % (self.__class__.__name__)
def add_documents(self, docs):
"""Extend the index with new documents.
Parameters
----------
docs : iterable of list of str
"""
self.docs += docs
logger.info(f"Add docs size: {len(docs)}, total size: {len(self.docs)}")
@staticmethod
def load_hownet_dict(path):
"""加载Hownet语义词典"""
hownet_dict = {}
for line in open(path, 'r', encoding='utf-8'):
words = [word for word in line.strip().replace(' ', '>').replace('\t', '>').split('>') if word != '']
word = words[0]
word_def = words[2]
hownet_dict[word] = word_def.split(',')
return hownet_dict
def _compute_sem(self, sem1, sem2):
"""计算语义相似度"""
sem_inter = set(sem1).intersection(set(sem2))
sem_union = set(sem1).union(set(sem2))
return float(len(sem_inter)) / float(len(sem_union))
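# Example (sketch): sememe sets {'人', '音乐'} and {'人', '表演'} share 1 of 3 distinct
# sememes, so the Jaccard score is 1 / 3 ≈ 0.33.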
def _compute_word_sim(self, word1, word2):
"""比较两个词语之间的相似度"""
DEFS_word1 = self.hownet_dict.get(word1, [])
DEFS_word2 = self.hownet_dict.get(word2, [])
scores = [self._compute_sem(DEF_word1, DEF_word2) for DEF_word1 in DEFS_word1 for DEF_word2 in DEFS_word2]
if scores:
return max(scores)
else:
return 0
def similarity(self, text1, text2):
"""
Compute sentence similarity from word-level similarities.
:param text1:
:param text2:
:return:
"""
words1 = [word.word for word in jieba.posseg.cut(text1) if word.flag[0] not in ['u', 'x', 'w']]
words2 = [word.word for word in jieba.posseg.cut(text2) if word.flag[0] not in ['u', 'x', 'w']]
score_words1 = []
score_words2 = []
for word1 in words1:
score = max(self._compute_word_sim(word1, word2) for word2 in words2)
score_words1.append(score)
for word2 in words2:
score = max(self._compute_word_sim(word2, word1) for word1 in words1)
score_words2.append(score)
similarity_score = max(sum(score_words1) / len(words1), sum(score_words2) / len(words2))
return similarity_score
def distance(self, text1, text2):
"""Compute cosine distance between two keys."""
return 1 - self.similarity(text1, text2)
def most_similar(self, query, topn=10):
result = []
for doc in self.docs:
score = self.similarity(query, doc)
result.append((doc, round(score, 4)))
result.sort(key=lambda x: x[1], reverse=True)
return result[:topn]
class SimhashSimilarity(object):
"""
Computes Simhash-based similarities between texts and retrieves the most
similar texts for a given query.
"""
def __init__(self, docs: List[str] = None, hashbits=64):
super().__init__()
self.docs = []
self.hashbits = hashbits
self.docs_embeddings = np.array([])
if docs is not None:
self.add_documents(docs)
def __len__(self):
"""Get length of index."""
return len(self.docs)
def __str__(self):
return "%s" % (self.__class__.__name__)
def add_documents(self, docs):
"""Extend the index with new documents.
Parameters
----------
docs : iterable of list of str
"""
self.docs += docs
docs_embeddings = []
for doc in docs:
doc_emb = self._get_code(doc)
docs_embeddings.append(doc_emb)
if len(docs_embeddings) % 10000 == 0:
logger.debug(f"Progress, add docs size: {len(docs_embeddings)}")
if self.docs_embeddings.size > 0:
self.docs_embeddings = np.vstack((self.docs_embeddings, docs_embeddings))
else:
self.docs_embeddings = docs_embeddings
logger.info(f"Add docs size: {len(docs)}, total size: {len(self.docs)}")
def _hamming_distance(self, code_s1, code_s2):
"""利用64位数计算海明距离"""
x = (code_s1 ^ code_s2) & ((1 << self.hashbits) - 1)
ans = 0
while x:
ans += 1
x &= x - 1
return ans
def _get_features(self, string):
"""
Tokenize the text and extract its features, using POS tags to filter out function words and other irrelevant tokens.
:param string:
:return:
"""
word_list = [word.word for word in jieba.posseg.cut(string) if
word.flag[0] not in ['u', 'x', 'w', 'o', 'p', 'c', 'm', 'q']]
return word_list
def _get_code(self, string):
"""对全文进行编码"""
return Simhash(self._get_features(string)).value
def similarity(self, text1, text2):
"""
Compute the Hamming-distance based similarity between two sentences.
:param text1:
:param text2:
:return:
"""
code_s1 = self._get_code(text1)
code_s2 = self._get_code(text2)
similarity_score = (100 - self._hamming_distance(code_s1, code_s2) * 100 / self.hashbits) / 100
return similarity_score
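# Example: with hashbits=64 and a Hamming distance of 16 between the two hashes,
# similarity = (100 - 16 * 100 / 64) / 100 = 0.75.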
def distance(self, text1, text2):
"""Compute cosine distance between two keys."""
return 1 - self.similarity(text1, text2)
def most_similar(self, query, topn=10):
result = []
query_emb = self._get_code(query)
for doc, doc_emb in zip(self.docs, self.docs_embeddings):
score = (100 - self._hamming_distance(query_emb, doc_emb) * 100 / self.hashbits) / 100
result.append((doc, round(score, 4)))
result.sort(key=lambda x: x[1], reverse=True)
return result[:topn]
class TfidfSimilarity(object):
"""
Computes TF-IDF based cosine similarities between texts and retrieves the most
similar texts for a given query.
"""
def __init__(self, docs: List[str] = None):
super().__init__()
self.docs = []
self.docs_embeddings = np.array([])
self.tfidf = TFIDF()
if docs is not None:
self.add_documents(docs)
def __len__(self):
"""Get length of index."""
return len(self.docs)
def __str__(self):
return "%s" % (self.__class__.__name__)
def add_documents(self, docs):
"""Extend the index with new documents.
Parameters
----------
docs : iterable of list of str
"""
self.docs += docs
docs_embeddings = np.array(self.tfidf.get_tfidf(docs))
if self.docs_embeddings.size > 0:
self.docs_embeddings = np.vstack((self.docs_embeddings, docs_embeddings))
else:
self.docs_embeddings = docs_embeddings
logger.info(f"Add docs size: {len(docs)}, total size: {len(self.docs)}")
def similarity(self, text1, text2):
"""
Compute the cosine similarity between two sentences based on TF-IDF vectors.
:param text1:
:param text2:
:return:
"""
tfidf_features = self.tfidf.get_tfidf([text1, text2])
return cosine_distance(np.array(tfidf_features[0]), np.array(tfidf_features[1]))
def distance(self, text1, text2):
"""Compute cosine distance between two keys."""
return 1 - self.similarity(text1, text2)
def most_similar(self, query, topn=10):
result = []
query_emb = np.array(self.tfidf.get_tfidf([query])[0])
for doc, doc_emb in zip(self.docs, self.docs_embeddings):
score = cosine_distance(query_emb, doc_emb)
result.append((doc, round(score, 4)))
result.sort(key=lambda x: x[1], reverse=True)
return result[:topn]
if __name__ == '__main__':
wm = Word2Vec()
list_of_docs = ["This is a test1", "This is a test2", "This is a test3"]
list_of_docs2 = ["that is test4", "that is a test5", "that is a test6"]
m = WordEmbeddingSimilarity(wm, list_of_docs)
m.add_documents(list_of_docs2)
v = m.get_vector("This is a test1")
print(v[:10], v.shape)
print(m.similarity("This is a test1", "that is a test5"))
print(m.distance("This is a test1", "that is a test5"))
print(m.most_similar("This is a test1"))
text1 = '周杰伦是一个歌手'
text2 = '刘若英是个演员'
m = CilinSimilarity()
print(m.similarity(text1, text2))
print(m.distance(text1, text2))
zh_list = ['刘若英是个演员', '他唱歌很好听', 'women喜欢这首歌']
m.add_documents(zh_list)
print(m.most_similar('刘若英是演员'))
m = HownetSimilarity()
print(m.similarity(text1, text2))
print(m.distance(text1, text2))
zh_list = ['刘若英是个演员', '他唱歌很好听', 'women喜欢这首歌']
m.add_documents(zh_list)
print(m.most_similar('刘若英是演员'))
m = SimhashSimilarity()
print(m.similarity(text1, text2))
print(m.distance(text1, text2))
zh_list = ['刘若英是个演员', '他唱歌很好听', 'women喜欢这首歌']
m.add_documents(zh_list)
print(m.most_similar('刘若英是演员'))
m = TfidfSimilarity()
print(m.similarity(text1, text2))
print(m.distance(text1, text2))
zh_list = ['刘若英是个演员', '他唱歌很好听', 'women喜欢这首歌', '我不是演员吗']
m.add_documents(zh_list)
print(m.most_similar('刘若英是演员'))

View File

@ -3,3 +3,4 @@
@author:XuMing(xuming624@qq.com)
@description:
"""

View File

@ -0,0 +1,231 @@
# -*- coding: utf-8 -*-
"""
@author:XuMing(xuming624@qq.com)
@description:
"""
from difflib import SequenceMatcher
import numpy as np
zero_bit = 0.000000001
def try_divide(x, y, val=0.0):
"""
try to divide two numbers
"""
if y != 0.0:
val = float(x) / y
return val
def cosine_distance(v1, v2):
"""
Cosine score between two vectors.
Note: despite the name, this returns the cosine similarity, not 1 - similarity.
"""
up = np.dot(v1, v2)
down = np.linalg.norm(v1) * np.linalg.norm(v2)
return try_divide(up, down)
def hamming_distance(v1, v2):  # Hamming distance
n = int(v1, 2) ^ int(v2, 2)
return bin(n & 0xffffffff).count('1')
def euclidean_distance(v1, v2):  # Euclidean distance
return np.sqrt(np.sum(np.square(v1 - v2)))
def manhattan_distance(v1, v2):  # Manhattan distance
return np.sum(np.abs(v1 - v2))
def chebyshev_distance(v1, v2):  # Chebyshev distance
return np.max(np.abs(v1 - v2))
def minkowski_distance(v1, v2, p=2):  # Minkowski distance (p=2 reduces to the Euclidean distance)
return np.power(np.sum(np.power(np.abs(v1 - v2), p)), 1 / p)
def euclidean_distance_standardized(v1, v2):  # standardized Euclidean distance
v1_v2 = np.vstack([v1, v2])
sk_v1_v2 = np.var(v1_v2, axis=0, ddof=1)
return np.sqrt(((v1 - v2) ** 2 / (sk_v1_v2 + zero_bit * np.ones_like(sk_v1_v2))).sum())
def edit_distance(str1, str2):
try:
# very fast
# http://stackoverflow.com/questions/14260126/how-python-levenshtein-ratio-is-computed
import Levenshtein
d = Levenshtein.distance(str1, str2) / float(max(len(str1), len(str2)))
except:
# https://docs.python.org/2/library/difflib.html
d = 1. - SequenceMatcher(lambda x: x == " ", str1, str2).ratio()
return d
def pearson_correlation_distance(v1, v2):  # Pearson correlation coefficient
v1_v2 = np.vstack([v1, v2])
return np.corrcoef(v1_v2)[0][1]
def jaccard_similarity_coefficient_distance(v1, v2):  # Jaccard similarity coefficient
# computed directly from the formula
v1 = np.asarray(v1)
v2 = np.asarray(v2)
up = np.double(np.bitwise_and((v1 != v2), np.bitwise_or(v1 != 0, v2 != 0)).sum())
down = np.double(np.bitwise_or(v1 != 0, v2 != 0).sum() + zero_bit)
return try_divide(up, down)
def wmd_distance(model, sent1_cut_list, sent2_cut_list):  # WMD (Word Mover's Distance)
"""
Word Mover's Distance between two tokenized sentences
:param model: gensim word2vec model
:param sent1_cut_list:
:param sent2_cut_list:
:return:
"""
distance = model.wmdistance(sent1_cut_list, sent2_cut_list)
return distance
def is_str_match(str1, str2, threshold=1.0):
assert 0.0 <= threshold <= 1.0, "Wrong threshold."
if float(threshold) == 1.0:
return str1 == str2
else:
return (1. - edit_distance(str1, str2)) >= threshold
def longest_match_size(str1, str2):
sq = SequenceMatcher(lambda x: x == " ", str1, str2)
match = sq.find_longest_match(0, len(str1), 0, len(str2))
return match.size
def longest_match_ratio(str1, str2):
sq = SequenceMatcher(lambda x: x == " ", str1, str2)
match = sq.find_longest_match(0, len(str1), 0, len(str2))
return try_divide(match.size, min(len(str1), len(str2)))
def jaccard_coef(A, B):
if not isinstance(A, set):
A = set(A)
if not isinstance(B, set):
B = set(B)
return try_divide(float(len(A.intersection(B))), len(A.union(B)))
def num_of_common_sub_str(str1, str2):
"""
Find the length of the longest common substring of two strings.
Idea: build a 2-D table that records whether consecutive positions match.
"""
lstr1 = len(str1)
lstr2 = len(str2)
record = [[0 for i in range(lstr2 + 1)] for j in range(lstr1 + 1)]  # one extra row and column
max_num = 0  # length of the longest match
for i in range(lstr1):
for j in range(lstr2):
if str1[i] == str2[j]:
# extend the match when characters are equal
record[i + 1][j + 1] = record[i][j] + 1
if record[i + 1][j + 1] > max_num:
# keep the maximum match length
max_num = record[i + 1][j + 1]
return max_num
def string_hash(source):
if source == "":
return 0
else:
x = ord(source[0]) << 7
m = 1000003
mask = 2 ** 128 - 1
for c in source:
x = ((x * m) ^ ord(c)) & mask
x ^= len(source)
if x == -1:
x = -2
x = bin(x).replace('0b', '').zfill(64)[-64:]
return str(x)
def sim_hash(text):
import jieba.analyse
seg = jieba.cut(text)
key_word = jieba.analyse.extract_tags('|'.join(seg), topK=20, withWeight=True, allowPOS=())
# sort by weight first, then by word
key_list = []
for feature, weight in key_word:
weight = int(weight * 20)
temp = []
for f in string_hash(feature):
if f == '1':
temp.append(weight)
else:
temp.append(-weight)
key_list.append(temp)
content_list = np.sum(np.array(key_list), axis=0)
# no keywords could be extracted from the text
if len(key_list) == 0:
return '00'
simhash = ''
for c in content_list:
if c > 0:
simhash = simhash + '1'
else:
simhash = simhash + '0'
return simhash
def normalization(x):
"""
Min-max normalization.
:param x:
:return:
"""
return [(float(i) - min(x)) / float(max(x) - min(x) + zero_bit) for i in x]
def z_score(x, axis=0):
"""
Z-score standardization.
:param x: arrary, numpy
:param axis: int, 0
:return: arrary, numpy
"""
x = np.array(x).astype(float)
xr = np.rollaxis(x, axis=axis)
xr -= np.mean(x, axis=axis)
xr /= np.std(x, axis=axis)
return x
if __name__ == '__main__':
vec1_test = np.array([1, 38, 17, 32])
vec2_test = np.array([5, 6, 8, 9])
str1_test = "你到底是谁?"
str2_test = "没想到我是谁,是真样子"
print(euclidean_distance(vec1_test, vec2_test))
print(cosine_distance(vec1_test, vec2_test))
print(manhattan_distance(vec1_test, vec2_test))
print(euclidean_distance(vec1_test, vec2_test))
print(cosine_distance(vec1_test, vec2_test))
print('hamming_distance:', str1_test, str2_test, hamming_distance(sim_hash(str1_test), sim_hash(str2_test)))
print(edit_distance(str1_test, str2_test))
print(num_of_common_sub_str(str1_test, str2_test))
print(normalization(vec1_test))  # min-max normalization to [0, 1]
print(z_score(vec1_test))  # z-score standardization (centered around 0)

View File

@ -0,0 +1,36 @@
# -*- coding: utf-8 -*-
"""
@author:XuMing(xuming624@qq.com)
@description: Download file.
"""
import requests
import os
import sys
from tqdm.autonotebook import tqdm
def http_get(url, path):
"""
Downloads a URL to a given path on disc
"""
if os.path.dirname(path) != '':
os.makedirs(os.path.dirname(path), exist_ok=True)
req = requests.get(url, stream=True)
if req.status_code != 200:
print("Exception when trying to download {}. Response {}".format(url, req.status_code), file=sys.stderr)
req.raise_for_status()
return
download_filepath = path + "_part"
with open(download_filepath, "wb") as file_binary:
content_length = req.headers.get('Content-Length')
total = int(content_length) if content_length is not None else None
progress = tqdm(unit="B", total=total, unit_scale=True)
for chunk in req.iter_content(chunk_size=1024):
if chunk: # filter out keep-alive new chunks
progress.update(len(chunk))
file_binary.write(chunk)
os.rename(download_filepath, path)
progress.close()
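# Usage sketch (hypothetical URL and local path):
#   http_get("https://example.com/files/stopwords.txt", "./data/stopwords.txt")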

View File

@ -0,0 +1,171 @@
# -*- coding: utf-8 -*-
"""
@author:XuMing(xuming624@qq.com)
@description:
"""
class NgramUtil(object):
def __init__(self):
pass
@staticmethod
def unigrams(words):
"""
Input: a list of words, e.g., ["I", "am", "Denny"]
Output: a list of unigram
"""
assert type(words) == list
return words
@staticmethod
def bigrams(words, join_string, skip=0):
"""
Input: a list of words, e.g., ["I", "am", "Denny"]
Output: a list of bigram, e.g., ["I_am", "am_Denny"]
"""
assert type(words) == list
L = len(words)
if L > 1:
lst = []
for i in range(L - 1):
for k in range(1, skip + 2):
if i + k < L:
lst.append(join_string.join([words[i], words[i + k]]))
else:
# set it as unigram
lst = NgramUtil.unigrams(words)
return lst
@staticmethod
def trigrams(words, join_string, skip=0):
"""
Input: a list of words, e.g., ["I", "am", "Denny"]
Output: a list of trigram, e.g., ["I_am_Denny"]
"""
assert type(words) == list
L = len(words)
if L > 2:
lst = []
for i in range(L - 2):
for k1 in range(1, skip + 2):
for k2 in range(1, skip + 2):
if i + k1 < L and i + k1 + k2 < L:
lst.append(join_string.join([words[i], words[i + k1], words[i + k1 + k2]]))
else:
# set it as bigram
lst = NgramUtil.bigrams(words, join_string, skip)
return lst
@staticmethod
def fourgrams(words, join_string):
"""
Input: a list of words, e.g., ["I", "am", "Denny", "boy"]
Output: a list of trigram, e.g., ["I_am_Denny_boy"]
"""
assert type(words) == list
L = len(words)
if L > 3:
lst = []
for i in range(L - 3):
lst.append(join_string.join([words[i], words[i + 1], words[i + 2], words[i + 3]]))
else:
# set it as trigram
lst = NgramUtil.trigrams(words, join_string)
return lst
@staticmethod
def uniterms(words):
return NgramUtil.unigrams(words)
@staticmethod
def biterms(words, join_string):
"""
Input: a list of words, e.g., ["I", "am", "Denny", "boy"]
Output: a list of biterm, e.g., ["I_am", "I_Denny", "I_boy", "am_Denny", "am_boy", "Denny_boy"]
"""
assert type(words) == list
L = len(words)
if L > 1:
lst = []
for i in range(L - 1):
for j in range(i + 1, L):
lst.append(join_string.join([words[i], words[j]]))
else:
# set it as uniterm
lst = NgramUtil.uniterms(words)
return lst
@staticmethod
def triterms(words, join_string):
"""
Input: a list of words, e.g., ["I", "am", "Denny", "boy"]
Output: a list of triterm, e.g., ["I_am_Denny", "I_am_boy", "I_Denny_boy", "am_Denny_boy"]
"""
assert type(words) == list
L = len(words)
if L > 2:
lst = []
for i in range(L - 2):
for j in range(i + 1, L - 1):
for k in range(j + 1, L):
lst.append(join_string.join([words[i], words[j], words[k]]))
else:
# set it as biterm
lst = NgramUtil.biterms(words, join_string)
return lst
@staticmethod
def fourterms(words, join_string):
"""
Input: a list of words, e.g., ["I", "am", "Denny", "boy", "ha"]
Output: a list of fourterm, e.g., ["I_am_Denny_boy", "I_am_Denny_ha", "I_am_boy_ha", "I_Denny_boy_ha", "am_Denny_boy_ha"]
"""
assert type(words) == list
L = len(words)
if L > 3:
lst = []
for i in range(L - 3):
for j in range(i + 1, L - 2):
for k in range(j + 1, L - 1):
for l in range(k + 1, L):
lst.append(join_string.join([words[i], words[j], words[k], words[l]]))
else:
# set it as triterm
lst = NgramUtil.triterms(words, join_string)
return lst
@staticmethod
def ngrams(words, ngram, join_string=" "):
"""
wrapper for ngram
"""
if ngram == 1:
return NgramUtil.unigrams(words)
elif ngram == 2:
return NgramUtil.bigrams(words, join_string)
elif ngram == 3:
return NgramUtil.trigrams(words, join_string)
elif ngram == 4:
return NgramUtil.fourgrams(words, join_string)
elif ngram == 12:
unigram = NgramUtil.unigrams(words)
bigram = [x for x in NgramUtil.bigrams(words, join_string) if len(x.split(join_string)) == 2]
return unigram + bigram
elif ngram == 123:
unigram = NgramUtil.unigrams(words)
bigram = [x for x in NgramUtil.bigrams(words, join_string) if len(x.split(join_string)) == 2]
trigram = [x for x in NgramUtil.trigrams(words, join_string) if len(x.split(join_string)) == 3]
return unigram + bigram + trigram
@staticmethod
def nterms(words, nterm, join_string=" "):
"""wrapper for nterm"""
if nterm == 1:
return NgramUtil.uniterms(words)
elif nterm == 2:
return NgramUtil.biterms(words, join_string)
elif nterm == 3:
return NgramUtil.triterms(words, join_string)
elif nterm == 4:
return NgramUtil.fourterms(words, join_string)
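# Usage sketch:
#   NgramUtil.ngrams(["I", "am", "Denny"], 2, "_")  # -> ['I_am', 'am_Denny']
#   NgramUtil.biterms(["I", "am", "Denny"], "_")    # -> ['I_am', 'I_Denny', 'am_Denny']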

View File

@ -0,0 +1,164 @@
# -*- coding: utf-8 -*-
# Author: dorianbrown
# Brief: https://github.com/dorianbrown/rank_bm25
import math
from multiprocessing import Pool, cpu_count
import numpy as np
"""
All of these algorithms have been taken from the paper:
Trotmam et al, Improvements to BM25 and Language Models Examined
Here we implement all the BM25 variations mentioned.
"""
class BM25:
def __init__(self, corpus, tokenizer=None):
self.corpus_size = len(corpus)
self.avgdl = 0
self.doc_freqs = []
self.idf = {}
self.doc_len = []
self.tokenizer = tokenizer
if tokenizer:
corpus = self._tokenize_corpus(corpus)
nd = self._initialize(corpus)
self._calc_idf(nd)
def _initialize(self, corpus):
nd = {} # word -> number of documents with word
num_doc = 0
for document in corpus:
self.doc_len.append(len(document))
num_doc += len(document)
frequencies = {}
for word in document:
if word not in frequencies:
frequencies[word] = 0
frequencies[word] += 1
self.doc_freqs.append(frequencies)
for word, freq in frequencies.items():
if word not in nd:
nd[word] = 0
nd[word] += 1
self.avgdl = num_doc / self.corpus_size
return nd
def _tokenize_corpus(self, corpus):
pool = Pool(cpu_count())
tokenized_corpus = pool.map(self.tokenizer, corpus)
return tokenized_corpus
def _calc_idf(self, nd):
raise NotImplementedError()
def get_scores(self, query):
raise NotImplementedError()
def get_top_n(self, query, documents, n=5):
assert self.corpus_size == len(documents), "The documents given don't match the index corpus!"
scores = self.get_scores(query)
top_n = np.argsort(scores)[::-1][:n]
return [documents[i] for i in top_n]
class BM25Okapi(BM25):
def __init__(self, corpus, tokenizer=None, k1=1.5, b=0.75, epsilon=0.25):
self.k1 = k1
self.b = b
self.epsilon = epsilon
super().__init__(corpus, tokenizer)
def _calc_idf(self, nd):
"""
Calculates frequencies of terms in documents and in corpus.
This algorithm sets a floor on the idf values to eps * average_idf
"""
# collect idf sum to calculate an average idf for epsilon value
idf_sum = 0
# collect words with negative idf to set them a special epsilon value.
# idf can be negative if word is contained in more than half of documents
negative_idfs = []
for word, freq in nd.items():
idf = math.log(self.corpus_size - freq + 0.5) - math.log(freq + 0.5)
self.idf[word] = idf
idf_sum += idf
if idf < 0:
negative_idfs.append(word)
self.average_idf = idf_sum / len(self.idf)
eps = self.epsilon * self.average_idf
for word in negative_idfs:
self.idf[word] = eps
def get_scores(self, query):
"""
The ATIRE BM25 variant uses an idf function which uses a log(idf) score. To prevent negative idf scores,
this algorithm also adds a floor to the idf value of epsilon.
See [Trotman, A., X. Jia, M. Crane, Towards an Efficient and Effective Search Engine] for more info
:param query: str
:return: array
"""
scores = np.zeros(self.corpus_size)
doc_len = np.array(self.doc_len)
for q in query:
q_freq = np.array([(doc.get(q) or 0) for doc in self.doc_freqs])
scores += (self.idf.get(q) or 0) * (q_freq * (self.k1 + 1) /
(q_freq + self.k1 * (1 - self.b + self.b * doc_len / self.avgdl)))
return scores
class BM25L(BM25):
def __init__(self, corpus, tokenizer=None, k1=1.5, b=0.75, delta=0.5):
# Algorithm specific parameters
self.k1 = k1
self.b = b
self.delta = delta
super().__init__(corpus, tokenizer)
def _calc_idf(self, nd):
for word, freq in nd.items():
idf = math.log(self.corpus_size + 1) - math.log(freq + 0.5)
self.idf[word] = idf
def get_scores(self, query):
scores = np.zeros(self.corpus_size)
doc_len = np.array(self.doc_len)
for q in query:
q_freq = np.array([(doc.get(q) or 0) for doc in self.doc_freqs])
ctd = q_freq / (1 - self.b + self.b * doc_len / self.avgdl)
scores += (self.idf.get(q) or 0) * q_freq * (self.k1 + 1) * (ctd + self.delta) / \
(self.k1 + ctd + self.delta)
return scores
class BM25Plus(BM25):
def __init__(self, corpus, tokenizer=None, k1=1.5, b=0.75, delta=1):
# Algorithm specific parameters
self.k1 = k1
self.b = b
self.delta = delta
super().__init__(corpus, tokenizer)
def _calc_idf(self, nd):
for word, freq in nd.items():
idf = math.log((self.corpus_size + 1) / freq)
self.idf[word] = idf
def get_scores(self, query):
scores = np.zeros(self.corpus_size)
doc_len = np.array(self.doc_len)
for q in query:
q_freq = np.array([(doc.get(q) or 0) for doc in self.doc_freqs])
scores += (self.idf.get(q) or 0) * (self.delta + (q_freq * (self.k1 + 1)) /
(self.k1 * (1 - self.b + self.b * doc_len / self.avgdl) + q_freq))
return scores
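# Usage sketch (whitespace-tokenized toy corpus):
#   corpus = ["Hello there good man!", "It is quite windy in London", "How is the weather today?"]
#   bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
#   bm25.get_top_n("windy london".split(), corpus, n=1)  # -> ['It is quite windy in London']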

View File

@ -0,0 +1,79 @@
# -*- coding: utf-8 -*-
"""
@author:XuMing(xuming624@qq.com)
@description:
"""
import os
import jieba
import jieba.posseg
from jieba.analyse.tfidf import DEFAULT_IDF, _get_abs_path
pwd_path = os.path.abspath(os.path.dirname(__file__))
default_stopwords_file = os.path.join(pwd_path, '../data/stopwords.txt')
def load_stopwords(file_path):
stopwords = set()
if file_path and os.path.exists(file_path):
with open(file_path, 'r', encoding='utf-8') as f:
for line in f:
line = line.strip()
stopwords.add(line)
return stopwords
class IDFLoader(object):
def __init__(self, idf_path=None):
self.path = ""
self.idf_freq = {}
self.median_idf = 0.0
if idf_path:
self.set_new_path(idf_path)
def set_new_path(self, new_idf_path):
if self.path != new_idf_path:
self.path = new_idf_path
content = open(new_idf_path, 'rb').read().decode('utf-8')
self.idf_freq = {}
for line in content.splitlines():
word, freq = line.strip().split(' ')
self.idf_freq[word] = float(freq)
self.median_idf = sorted(
self.idf_freq.values())[len(self.idf_freq) // 2]
def get_idf(self):
return self.idf_freq, self.median_idf
class TFIDF:
def __init__(self, idf_path=None, stopwords=None):
self.stopwords = stopwords if stopwords else load_stopwords(default_stopwords_file)
self.idf_loader = IDFLoader(idf_path or DEFAULT_IDF)
self.idf_freq, self.median_idf = self.idf_loader.get_idf()
def set_idf_path(self, idf_path):
new_abs_path = _get_abs_path(idf_path)
if not os.path.isfile(new_abs_path):
raise Exception("IDF file does not exist: " + new_abs_path)
self.idf_loader.set_new_path(new_abs_path)
self.idf_freq, self.median_idf = self.idf_loader.get_idf()
def get_tfidf(self, sentences):
"""
Extract keywords from sentence using TF-IDF algorithm.
"""
result = []
for sentence in sentences:
words = [word.word for word in jieba.posseg.cut(sentence) if word.flag[0] not in ['u', 'x', 'w']]
words = [word for word in words if word.lower() not in self.stopwords or len(word.strip()) < 2]
word_idf = {word: self.idf_freq.get(word, self.median_idf) for word in words}
freqs = []
for w in list(self.idf_freq.keys()):
freqs.append(word_idf.get(w, 0))
result.append(freqs)
return result
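# Usage sketch: get_tfidf returns one dense vector per sentence, aligned with the loaded
# IDF vocabulary, so two sentences can be compared with a cosine score, e.g.
#   vec1, vec2 = TFIDF().get_tfidf(["如何更换花呗绑定银行卡", "花呗更改绑定银行卡"])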

View File

@ -3,3 +3,29 @@
@author:XuMing(xuming624@qq.com)
@description:
"""
import os
import jieba
import logging
class JiebaTokenizer(object):
def __init__(self, dict_path='', custom_word_freq_dict=None):
self.model = jieba
self.model.default_logger.setLevel(logging.ERROR)
# initialize the main dictionary
if os.path.exists(dict_path):
self.model.set_dictionary(dict_path)
# load the user-defined dictionary
if custom_word_freq_dict:
for w, f in custom_word_freq_dict.items():
self.model.add_word(w, freq=f)
def tokenize(self, sentence, cut_all=False, HMM=True):
"""
Tokenize a sentence into a list of words.
:param sentence: the sentence to tokenize
:param cut_all: full mode, disabled by default
:param HMM: whether to use the HMM model for new-word discovery, enabled by default
:return: A list of strings.
"""
return self.model.lcut(sentence, cut_all=cut_all, HMM=HMM)
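# Usage sketch (the exact segmentation depends on the jieba dictionary):
#   tokenizer = JiebaTokenizer()
#   tokenizer.tokenize("刘若英是个演员")  # -> e.g. ['刘若英', '是', '个', '演员']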