https://github.com/chatopera/Synonyms/issues/112 Optimize word vectors, enlarge the vocabulary, speed up downloads

Hai Liang Wang 2020-09-21 14:22:38 +08:00
parent 1fdfaaea40
commit 0dbe1ec7cd
9 changed files with 493 additions and 63 deletions

2
.gitignore vendored

@ -12,4 +12,4 @@ synonyms.egg-info
.vscode/
build/
.env
synonyms/data/words.vector
synonyms/data/words.vector*
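
The trailing `*` presumably lets the ignore rule also cover the compressed `words.vector.gz` file that this commit starts downloading. A rough illustration with Python's `fnmatch` follows; this is only an approximation, since gitignore matching has its own rules (notably around `/`), and the helper name `ignored` is hypothetical:

```python
from fnmatch import fnmatchcase

OLD_PATTERN = "synonyms/data/words.vector"
NEW_PATTERN = "synonyms/data/words.vector*"


def ignored(path, pattern):
    """Approximate one gitignore rule with shell-style globbing."""
    return fnmatchcase(path, pattern)


paths = ["synonyms/data/words.vector", "synonyms/data/words.vector.gz"]
# The old rule only matches the uncompressed file; the new one matches both.
old_hits = [p for p in paths if ignored(p, OLD_PATTERN)]
new_hits = [p for p in paths if ignored(p, NEW_PATTERN)]
print(old_hits)
print(new_hits)
```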


@ -1,4 +1,4 @@
Copyright (2018-2020) Hu Ying Xi<>, Hai Liang Wang<hain@chatopera.com>
Copyright (2018-2020) Chatopera Inc. <https://www.chatopera.com>
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:


@ -26,6 +26,8 @@ pip install -U synonyms
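Compatible with py2 and py3. The current stable version is [v3.x](https://github.com/chatopera/Synonyms/releases).
**Note: the first use after installation downloads the word vector file; the download speed depends on network conditions.**
![](./assets/3.gif)
**Node.js users can now use [node-synonyms](https://www.npmjs.com/package/node-synonyms).**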
@ -80,7 +82,7 @@ synonyms.nearby(人脸, 10) = (
095, 0.525344, 0.524009, 0.523101, 0.516046])
```
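In the OOV (out-of-vocabulary) case, `([], [])` is returned. Current dictionary size: 125,792
In the OOV (out-of-vocabulary) case, `([], [])` is returned. Current dictionary size: 435,729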
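
Callers may want to guard against the empty OOV result before indexing into it. The sketch below assumes the `(words, scores)` tuple shape shown for `synonyms.nearby`; the helper name `top_synonyms` is hypothetical, and literal tuples stand in for real library output so the snippet runs without the word vectors:

```python
def top_synonyms(result, k=3):
    """Return up to k (word, score) pairs, or [] when the query was OOV."""
    words, scores = result
    if not words:  # OOV: nearby() is documented to return ([], [])
        return []
    return list(zip(words, scores))[:k]


# Stand-ins for synonyms.nearby(...) output (values taken from the README sample).
oov_result = ([], [])
hit_result = (["飞机", "直升机", "客机"], [1.0, 0.8423391, 0.8393003])

print(top_synonyms(oov_result))
print(top_synonyms(hit_result, k=2))
```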
### synonyms#compare
@ -107,16 +109,16 @@ synonyms.nearby(人脸, 10) = (
```
>>> synonyms.display("飞机")
'飞机'近义词:
1. 架飞机:0.837399
2. 客机:0.764609
3. 直升机:0.762116
4. 民航机:0.750519
5. 航机:0.750116
6. 起飞:0.735736
7. 战机:0.734975
8. 飞行中:0.732649
9. 航空器:0.723945
10. 运输机:0.720578
1. 飞机:1.0
2. 直升机:0.8423391
3. 客机:0.8393003
4. 滑翔机:0.7872388
5. 军用飞机:0.7832081
6. 水上飞机:0.77857226
7. 运输机:0.7724742
8. 航机:0.7664748
9. 航空器:0.76592904
10. 民航机:0.74209654
```
`SIZE` is the number of words to print; the default is 10.
@ -182,7 +184,7 @@ HowNet也被称为知网它并不只是一个语义字典而是一个
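### Comparison
Synonyms has a vocabulary of 125,792 words. Below, a few words that appear in Tongyici Cilin (同义词词林), HowNet (知网), and Synonyms are chosen to compare their similarity scores:
Synonyms has a vocabulary of 435,729 words. Below, a few words that appear in Tongyici Cilin (同义词词林), HowNet (知网), and Synonyms are chosen to compare their similarity scores: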
![](./assets/5.png)
@ -190,6 +192,12 @@ Synonyms 的词表容量是 125,792下面选择一些在同义词词林、知
More [comparison results](./VALUATION.md).
## Used by
[List of dependent GitHub users](https://github.com/chatopera/Synonyms/network/dependents?package_id=UGFja2FnZS01MjY2NDc1Nw%3D%3D)
![](./assets/6.png)
## Benchmark
Test with py3, MacBook Pro.
@ -242,7 +250,7 @@ meminfo 8GB
# Promotion
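[Chatopera Cloud Service](https://bot.chatopera.com/dashboard) is a one-stop solution for building enterprise chatbots. It combines information retrieval, machine learning, a chatbot scripting syntax, and speech recognition, and is built for customized chatbots and natural language interaction.
[Chatopera Cloud Service](https://bot.chatopera.com/dashboard)
<p align="center">
<b>Chatopera Cloud Service</b><br>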
@ -251,6 +259,8 @@ meminfo 8GB
</a>
</p>
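The Chatopera bot platform includes knowledge base, multi-turn dialogue, intent recognition, and speech recognition components. It standardizes chatbot development and supports scenarios such as enterprise OA Q&A, HR Q&A, intelligent customer service, and online marketing. It is a one-stop, pay-as-you-go way to build chatbots and bring them online.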
# References
[wikidata-corpus](https://github.com/Samurais/wikidata-corpus)
@ -273,9 +283,9 @@ Google 发布的[word2vec](https://code.google.com/archive/p/word2vec/),该库
# Authors
[Hai Liang Wang](http://blog.chatbot.io/webcv/)
[Hai Liang Wang](https://pre-angel.com/peoples/hailiang-wang/)
[Hu Ying Xi](https://github.com/chatopera/)
[Hu Ying Xi](https://github.com/huyingxi)
# Give credits to
@ -293,6 +303,14 @@ Google 发布的[word2vec](https://code.google.com/archive/p/word2vec/),该库
[MIT](./LICENSE)
Copyright (2018-2020) Chatopera Inc. <https://www.chatopera.com>
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.OF
[![chatoper banner][co-banner-image]][co-url]
[co-banner-image]: https://user-images.githubusercontent.com/3538629/42383104-da925942-8168-11e8-8195-868d5fcec170.png


@ -1,6 +1,7 @@
# synonyms score evaluation [(v3.11.0)](https://pypi.python.org/pypi/synonyms/3.11.0)
| Word 1 | Word 2 | synonyms | Human rating |
| --- | --- | --- | --- |
# synonyms score evaluation [(v3.12.0)](https://pypi.python.org/pypi/synonyms/3.12.0)
| Word 1 | Word 2 | synonyms | Human rating |
| ------ | -------- | -------- | -------- |
| 轿车 | 汽车 | 0.892 | 0.98 |
| 宝石 | 宝物 | 1.0 | 0.96 |
| 旅游 | 游历 | 0.649 | 0.96 |
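
One simple way to summarize how far the `synonyms` scores sit from the human ratings is the mean absolute difference. The sketch below uses only the three rows visible in this hunk, so the number it prints says nothing about the full table; the function name `mean_abs_diff` is hypothetical:

```python
# (word 1, word 2, synonyms score, human rating): the three rows shown in this hunk
rows = [
    ("轿车", "汽车", 0.892, 0.98),
    ("宝石", "宝物", 1.0, 0.96),
    ("旅游", "游历", 0.649, 0.96),
]


def mean_abs_diff(table):
    """Average of |model score - human rating| over all rows."""
    return sum(abs(model - human) for _, _, model, human in table) / len(table)


print(round(mean_abs_diff(rows), 3))
```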

BIN
assets/6.png Normal file

Binary file not shown (size: 185 KiB).


@ -15,7 +15,7 @@ export PATH=/opt/miniconda3/envs/venv-py3/bin:$PATH
cd $baseDir/..
if [ -f .env ]; then
echo "load env with" `pwd`"/.env"
source .env
#source .env
fi
python demo.py


@ -4,20 +4,19 @@ LONGDOC = """
Synonyms
=====================
Chinese Synonyms for Natural Language Processing and Understanding.
中文近义词
Welcome
-------
https://github.com/chatopera/Synonyms
"""
setup(
name='synonyms',
version='3.11.0',
description=' 中文近义词聊天机器人智能问答工具包Chinese Synonyms for Natural Language Processing and Understanding',
version='3.12.0',
description='中文近义词聊天机器人智能问答工具包Chinese Synonyms for Natural Language Processing and Understanding',
long_description=LONGDOC,
author='Hai Liang Wang, Hu Ying Xi',
author_email='hailiang.hl.wang@gmail.com',
author_email='hain@chatopera.com',
url='https://github.com/chatopera/Synonyms',
license="MIT",
classifiers=[
@ -32,6 +31,7 @@ setup(
'Programming Language :: Python :: 3',
'Programming Language :: Python :: 3.5',
'Programming Language :: Python :: 3.6',
'Programming Language :: Python :: 3.7',
'Topic :: Text Processing',
'Topic :: Text Processing :: Indexing',
'Topic :: Text Processing :: Linguistic'],
@ -48,5 +48,4 @@ setup(
'synonyms': [
'**/*.gz',
'**/*.txt',
'**/*.vector',
'LICENSE']})


@ -20,7 +20,7 @@ from __future__ import division
__copyright__ = "Copyright (c) (2017-2020) Chatopera Inc. All Rights Reserved"
__author__ = "Hu Ying Xi<>, Hai Liang Wang<hailiang.hl.wang@gmail.com>"
__date__ = "2017-09-27"
__version__ = "3.11.0"
__version__ = "3.12.0"
import os
import sys
@ -56,6 +56,7 @@ from .utils import cosine
from .utils import is_digit
import jieba
from .jieba import posseg as _tokenizer
import wget
'''
globals
@ -119,19 +120,28 @@ def _segment_words(sen):
word embedding
'''
# vectors
_f_model = os.path.join(curdir, 'data', 'words.vector')
_f_url = os.environ.get("SYNONYMS_WORD2VEC_BIN_URL_ZH_CN", "https://static-public.chatopera.com/ml/synonyms/words.vector.gz")
_f_model = os.path.join(curdir, 'data', 'words.vector.gz')
_download_model = not os.path.exists(_f_model)
if "SYNONYMS_WORD2VEC_BIN_MODEL_ZH_CN" in ENVIRON:
    _f_model = ENVIRON["SYNONYMS_WORD2VEC_BIN_MODEL_ZH_CN"]
    _download_model = False
def _load_w2v(model_file=_f_model, binary=True):
    '''
    load word2vec model
    '''
    if not os.path.exists(model_file):
        print("os.path : ", os.path)
    if not os.path.exists(model_file) and _download_model:
        print("\n[Synonyms] downloading data from %s to %s ... \n this only happens if SYNONYMS_WORD2VEC_BIN_URL_ZH_CN is not present and Synonyms initialization for the first time. \n It would take minutes that depends on network." % (_f_url, model_file))
        wget.download(_f_url, out = model_file)
        print("\n[Synonyms] download is done.\n")
    elif not os.path.exists(model_file):
        print("[Synonyms] os.path : ", os.path)
        raise Exception("Model file [%s] does not exist." % model_file)
    return KeyedVectors.load_word2vec_format(
        model_file, binary=binary, unicode_errors='ignore')
print(">> Synonyms on loading vectors [%s] ..." % _f_model)
print("[Synonyms] on loading vectors [%s] ..." % _f_model)
_vectors = _load_w2v(model_file=_f_model)
def _get_wv(sentence, ignore=False):
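
The hunk above changes how the model file is located: an explicit `SYNONYMS_WORD2VEC_BIN_MODEL_ZH_CN` path disables downloading, otherwise the bundled path is used and the file is fetched from `SYNONYMS_WORD2VEC_BIN_URL_ZH_CN` (or the default URL) when it is missing. A minimal standalone sketch of that decision, under the assumption that this reading of the diff is right; `resolve_model` is a hypothetical helper, not part of the package, and unlike the real code it does not raise when an overridden path is missing:

```python
import os

DEFAULT_URL = "https://static-public.chatopera.com/ml/synonyms/words.vector.gz"


def resolve_model(environ, default_path, exists=os.path.exists):
    """Return (model_path, download_url); download_url is None when no download is needed."""
    url = environ.get("SYNONYMS_WORD2VEC_BIN_URL_ZH_CN", DEFAULT_URL)
    override = environ.get("SYNONYMS_WORD2VEC_BIN_MODEL_ZH_CN")
    if override is not None:
        # A user-supplied model path turns downloading off entirely.
        return override, None
    if exists(default_path):
        # The bundled location already holds the vectors.
        return default_path, None
    # First run: the vectors must be fetched to the bundled location.
    return default_path, url


path, url = resolve_model({}, "data/words.vector.gz", exists=lambda p: False)
print(path, url)
```

Passing `exists` as a parameter keeps the sketch testable without touching the file system.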

402
synonyms/wget.py Normal file

@ -0,0 +1,402 @@
#!/usr/bin/env python
"""
Download utility as an easy way to get file from the net
python -m wget <URL>
python wget.py <URL>
Downloads: http://pypi.python.org/pypi/wget/
Development: http://bitbucket.org/techtonik/python-wget/
wget.py is not option compatible with Unix wget utility,
to make command line interface intuitive for new people.
Public domain by anatoly techtonik <techtonik@gmail.com>
Also available under the terms of MIT license
Copyright (c) 2010-2014 anatoly techtonik
"""
import sys, shutil, os
import tempfile
import math
PY3K = sys.version_info >= (3, 0)
if PY3K:
    import urllib.request as urllib
    import urllib.parse as urlparse
else:
    import urllib
    import urlparse
__version__ = "2.3-beta1"
def filename_from_url(url):
    """:return: detected filename or None"""
    fname = os.path.basename(urlparse.urlparse(url).path)
    if len(fname.strip(" \n\t.")) == 0:
        return None
    return fname
def filename_from_headers(headers):
    """Detect filename from Content-Disposition headers if present.
    http://greenbytes.de/tech/tc2231/
    :param: headers as dict, list or string
    :return: filename from content-disposition header or None
    """
    if type(headers) == str:
        headers = headers.splitlines()
    if type(headers) == list:
        headers = dict([x.split(':', 1) for x in headers])
    cdisp = headers.get("Content-Disposition")
    if not cdisp:
        return None
    cdtype = cdisp.split(';')
    if len(cdtype) == 1:
        return None
    if cdtype[0].strip().lower() not in ('inline', 'attachment'):
        return None
    # several filename params is illegal, but just in case
    fnames = [x for x in cdtype[1:] if x.strip().startswith('filename=')]
    if len(fnames) > 1:
        return None
    name = fnames[0].split('=')[1].strip(' \t"')
    name = os.path.basename(name)
    if not name:
        return None
    return name
def filename_fix_existing(filename):
    """Expands name portion of filename with numeric ' (x)' suffix to
    return filename that doesn't exist already.
    """
    dirname = '.'
    name, ext = filename.rsplit('.', 1)
    names = [x for x in os.listdir(dirname) if x.startswith(name)]
    names = [x.rsplit('.', 1)[0] for x in names]
    suffixes = [x.replace(name, '') for x in names]
    # filter suffixes that match ' (x)' pattern
    suffixes = [x[2:-1] for x in suffixes
                if x.startswith(' (') and x.endswith(')')]
    indexes = [int(x) for x in suffixes
               if set(x) <= set('0123456789')]
    idx = 1
    if indexes:
        idx += sorted(indexes)[-1]
    return '%s (%d).%s' % (name, idx, ext)
# --- terminal/console output helpers ---
def get_console_width():
    """Return width of available window area. Autodetection works for
    Windows and POSIX platforms. Returns 80 for others
    Code from http://bitbucket.org/techtonik/python-pager
    """
    if os.name == 'nt':
        STD_INPUT_HANDLE = -10
        STD_OUTPUT_HANDLE = -11
        STD_ERROR_HANDLE = -12
        # get console handle
        from ctypes import windll, Structure, byref
        try:
            from ctypes.wintypes import SHORT, WORD, DWORD
        except ImportError:
            # workaround for missing types in Python 2.5
            from ctypes import (
                c_short as SHORT, c_ushort as WORD, c_ulong as DWORD)
        console_handle = windll.kernel32.GetStdHandle(STD_OUTPUT_HANDLE)
        # CONSOLE_SCREEN_BUFFER_INFO Structure
        class COORD(Structure):
            _fields_ = [("X", SHORT), ("Y", SHORT)]
        class SMALL_RECT(Structure):
            _fields_ = [("Left", SHORT), ("Top", SHORT),
                        ("Right", SHORT), ("Bottom", SHORT)]
        class CONSOLE_SCREEN_BUFFER_INFO(Structure):
            _fields_ = [("dwSize", COORD),
                        ("dwCursorPosition", COORD),
                        ("wAttributes", WORD),
                        ("srWindow", SMALL_RECT),
                        ("dwMaximumWindowSize", DWORD)]
        sbi = CONSOLE_SCREEN_BUFFER_INFO()
        ret = windll.kernel32.GetConsoleScreenBufferInfo(console_handle, byref(sbi))
        if ret == 0:
            return 0
        return sbi.srWindow.Right+1
    elif os.name == 'posix':
        from fcntl import ioctl
        from termios import TIOCGWINSZ
        from array import array
        winsize = array("H", [0] * 4)
        try:
            ioctl(sys.stdout.fileno(), TIOCGWINSZ, winsize)
        except IOError:
            pass
        return (winsize[1], winsize[0])[0]
    return 80
def bar_thermometer(current, total, width=80):
    """Return thermometer style progress bar string. `total` argument
    can not be zero. The minimum size of bar returned is 3. Example:
    [.......... ]
    Control and trailing symbols (\r and spaces) are not included.
    See `bar_adaptive` for more information.
    """
    # number of dots on thermometer scale
    avail_dots = width-2
    shaded_dots = int(math.floor(float(current) / total * avail_dots))
    return '[' + '.'*shaded_dots + ' '*(avail_dots-shaded_dots) + ']'
def bar_adaptive(current, total, width=80):
    """Return progress bar string for given values in one of three
    styles depending on available width:
    [.. ] downloaded / total
    downloaded / total
    [.. ]
    if total value is unknown or <= 0, show bytes counter using two
    adaptive styles:
    %s / unknown
    %s
    if there is not enough space on the screen, do not display anything
    returned string doesn't include control characters like \r used to
    place cursor at the beginning of the line to erase previous content.
    this function leaves one free character at the end of string to
    avoid automatic linefeed on Windows.
    """
    # process special case when total size is unknown and return immediately
    if not total or total < 0:
        msg = "%s / unknown" % current
        if len(msg) < width: # leaves one character to avoid linefeed
            return msg
        if len("%s" % current) < width:
            return "%s" % current
    # --- adaptive layout algorithm ---
    #
    # [x] describe the format of the progress bar
    # [x] describe min width for each data field
    # [x] set priorities for each element
    # [x] select elements to be shown
    # [x] choose top priority element min_width < avail_width
    # [x] lessen avail_width by value if min_width
    # [x] exclude element from priority list and repeat
    # 10% [.. ] 10/100
    # pppp bbbbb sssssss
    min_width = {
        'percent': 4, # 100%
        'bar': 3, # [.]
        'size': len("%s" % total)*2 + 3, # 'xxxx / yyyy'
    }
    priority = ['percent', 'bar', 'size']
    # select elements to show
    selected = []
    avail = width
    for field in priority:
        if min_width[field] < avail:
            selected.append(field)
            avail -= min_width[field]+1 # +1 is for separator or for reserved space at
            # the end of line to avoid linefeed on Windows
    # render
    output = ''
    for field in selected:
        if field == 'percent':
            # fixed size width for percentage
            output += ('%s%%' % (100 * current // total)).rjust(min_width['percent'])
        elif field == 'bar': # [. ]
            # bar takes its min width + all available space
            output += bar_thermometer(current, total, min_width['bar']+avail)
        elif field == 'size':
            # size field has a constant width (min == max)
            output += ("%s / %s" % (current, total)).rjust(min_width['size'])
        selected = selected[1:]
        if selected:
            output += ' ' # add field separator
    return output
# --/ console helpers
__current_size = 0 # global state variable, which exists solely as a
# workaround against Python 3.3.0 regression
# http://bugs.python.org/issue16409
# fixed in Python 3.3.1
def callback_progress(blocks, block_size, total_size, bar_function):
    """callback function for urlretrieve that is called when connection is
    created and when once for each block
    draws adaptive progress bar in terminal/console
    use sys.stdout.write() instead of "print,", because it allows one more
    symbol at the line end without linefeed on Windows
    :param blocks: number of blocks transferred so far
    :param block_size: in bytes
    :param total_size: in bytes, can be -1 if server doesn't return it
    :param bar_function: another callback function to visualize progress
    """
    global __current_size
    width = min(100, get_console_width())
    if sys.version_info[:3] == (3, 3, 0): # regression workaround
        if blocks == 0: # first call
            __current_size = 0
        else:
            __current_size += block_size
        current_size = __current_size
    else:
        current_size = min(blocks*block_size, total_size)
    progress = bar_function(current_size, total_size, width)
    if progress:
        sys.stdout.write("\r" + progress)
class ThrowOnErrorOpener(urllib.FancyURLopener):
    def http_error_default(self, url, fp, errcode, errmsg, headers):
        raise Exception("%s: %s" % (errcode, errmsg))
def download(url, out=None, bar=bar_adaptive):
    """High level function, which downloads URL into tmp file in current
    directory and then renames it to filename autodetected from either URL
    or HTTP headers.
    :param bar: function to track download progress (visualize etc.)
    :param out: output filename or directory
    :return: filename where URL is downloaded to
    """
    names = dict()
    names["out"] = out or ''
    names["url"] = filename_from_url(url)
    # get filename for temp file in current directory
    prefix = (names["url"] or names["out"] or ".") + "."
    (fd, tmpfile) = tempfile.mkstemp(".tmp", prefix=prefix, dir=".")
    os.close(fd)
    os.unlink(tmpfile)
    # set progress monitoring callback
    def callback_charged(blocks, block_size, total_size):
        # 'closure' to set bar drawing function in callback
        callback_progress(blocks, block_size, total_size, bar_function=bar)
    if bar:
        callback = callback_charged
    else:
        callback = None
    (tmpfile, headers) = ThrowOnErrorOpener().retrieve(url, tmpfile, callback)
    names["header"] = filename_from_headers(headers)
    if os.path.isdir(names["out"]):
        filename = names["header"] or names["url"]
        filename = names["out"] + "/" + filename
    else:
        filename = names["out"] or names["header"] or names["url"]
    # add numeric ' (x)' suffix if filename already exists
    if os.path.exists(filename):
        filename = filename_fix_existing(filename)
    shutil.move(tmpfile, filename)
    #print headers
    return filename
usage = """\
usage: wget.py [options] URL
options:
-o --output FILE|DIR output filename or directory
-h --help
--version
"""
if __name__ == "__main__":
    if len(sys.argv) < 2 or "-h" in sys.argv or "--help" in sys.argv:
        sys.exit(usage)
    if "--version" in sys.argv:
        sys.exit("wget.py " + __version__)
    from optparse import OptionParser
    parser = OptionParser()
    parser.add_option("-o", "--output", dest="output")
    (options, args) = parser.parse_args()
    url = sys.argv[1]
    filename = download(args[0], out=options.output)
    print("")
    print("Saved under %s" % filename)
r"""
features that require more tuits for urlretrieve API
http://www.python.org/doc/2.6/library/urllib.html#urllib.urlretrieve
[x] autodetect filename from URL
[x] autodetect filename from headers - Content-Disposition
http://greenbytes.de/tech/tc2231/
[ ] make HEAD request to detect temp filename from Content-Disposition
[ ] process HTTP status codes (i.e. 404 error)
http://ftp.de.debian.org/debian/pool/iso-codes_3.24.2.orig.tar.bz2
[ ] catch KeyboardInterrupt
[ ] optionally preserve incomplete file
[x] create temp file in current directory
[ ] resume download (broken connection)
[ ] resume download (incomplete file)
[x] show progress indicator
http://mail.python.org/pipermail/tutor/2005-May/038797.html
[x] do not overwrite downloaded file
[x] rename file automatically if exists
[x] optionally specify path for downloaded file
[ ] options plan
[x] -h, --help, --version (CHAOS speccy)
[ ] clpbar progress bar style
_ 30.0Mb at 3.0 Mbps eta: 0:00:20 30% [===== ]
[ ] test "bar \r" print with \r at the end of line on Windows
[ ] process Python 2.x urllib.ContentTooShortError exception gracefully
(ideally retry and continue download)
(tmpfile, headers) = urllib.urlretrieve(url, tmpfile, callback_progress)
File "C:\Python27\lib\urllib.py", line 93, in urlretrieve
return _urlopener.retrieve(url, filename, reporthook, data)
File "C:\Python27\lib\urllib.py", line 283, in retrieve
"of %i bytes" % (read, size), result)
urllib.ContentTooShortError: retrieval incomplete: got only 15239952 out of 24807571 bytes
[ ] find out if urlretrieve may return unicode headers
[ ] test suite for unsafe filenames from url and from headers
[ ] security checks
[ ] filename_from_url
[ ] filename_from_headers
[ ] MITM redirect from https URL
[ ] https certificate check
[ ] size+hash check helpers
[ ] fail if size is known and mismatch
[ ] fail if hash mismatch
"""
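
The TODO list above leaves the "size+hash check helpers" unimplemented, which matters more now that the package downloads its word vectors at first use. A minimal sketch of what such a helper could look like, using only the standard library; `verify_download` and its parameters are hypothetical and not part of `wget.py`, and the commit does not publish any expected size or digest for `words.vector.gz`:

```python
import hashlib
import os


def verify_download(path, expected_size=None, expected_sha256=None, chunk_size=1 << 20):
    """Raise ValueError if the file's size or SHA-256 digest does not match."""
    if expected_size is not None:
        actual_size = os.path.getsize(path)
        if actual_size != expected_size:
            raise ValueError("size mismatch: got %d, expected %d"
                             % (actual_size, expected_size))
    if expected_sha256 is not None:
        digest = hashlib.sha256()
        with open(path, "rb") as handle:
            # Read in chunks so large model files are not loaded into memory at once.
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        if digest.hexdigest() != expected_sha256.lower():
            raise ValueError("sha256 mismatch for %s" % path)
    return path
```

A caller would run this on the file returned by `download(...)` before loading it, and delete the file when the check fails so that the next start retries the download.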