Chinese long-text classification, short-sentence classification, multi-label classification, and sentence-pair similarity with Keras NLP; base classes for character/word/sentence embedding layers (embeddings) and network layers (graph); FastText, TextCNN, CharCNN, TextRNN, RCNN, DCNN, DPCNN, VDCNN, CRNN, Bert, Xlnet, Albert, Attention, DeepMoji, HAN, CapsuleNet, Transformer-encoder, Seq2seq, SWEM, LEAM, TextGCN
Keras-TextClassification
keras_textclassification (main code, work in progress...)
- Bert-finetune
- FastText
- TextCNN
- charCNN
- TextRNN
- TextRCNN
- TextDCNN
- TextDPCNN
- TextVDCNN
- TextCRNN
- DeepMoji
- SelfAttention
- HAN
- CapsuleNet
run (using FastText as an example)
- 1. Go to the keras_textclassification/m01_FastText directory
- 2. Train: run train.py, e.g. python train.py
- 3. Predict: run predict.py, e.g. python predict.py
- Note: the default is a random embedding with no pre-training, and the bundled training/validation corpora contain only 100 samples each; see the data section below for the full download. A minimal sketch of the FastText architecture follows these steps.
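For orientation, here is a minimal, illustrative sketch of a FastText-style classifier in Keras (averaged token embeddings feeding a softmax layer), roughly the architecture m01_FastText implements. The vocabulary size, sequence length, embedding dimension and class count are placeholder values, not the project's defaults.

```python
# Illustrative sketch only, not the project's exact graph code.
# vocab_size / max_len / embed_dim / num_classes are placeholder values.
from keras.layers import Dense, Embedding, GlobalAveragePooling1D, Input
from keras.models import Model

vocab_size, max_len, embed_dim, num_classes = 20000, 50, 300, 17

inputs = Input(shape=(max_len,), dtype="int32")        # token ids, padded to max_len
x = Embedding(vocab_size, embed_dim)(inputs)           # token embeddings
x = GlobalAveragePooling1D()(x)                        # FastText: average over tokens
outputs = Dense(num_classes, activation="softmax")(x)  # class probabilities

model = Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```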
run (test/sample examples)
- Samples for bert, word2vec and random embeddings are under the test/ directory; note that the word2vec (char or word), random-word, and bert (chinese_L-12_H-768_A-12) resources are not fully bundled and must be downloaded first. A simplified embedding-loading sketch follows this list.
- predict_bert_text_cnn.py
- tet_char_bert_embedding.py
- tet_char_random_embedding.py
- tet_char_word2vec_embedding.py
- tet_word_random_embedding.py
- tet_word_word2vec_embedding.py
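Each of the samples above wires a different embedding type into the graph. As a rough, hedged illustration of the word2vec case (not the project's own embedding class, and assuming the gensim 4.x API), a pre-trained .vec file such as w2v_model_wiki_char.vec can be turned into a frozen Keras Embedding layer like this:

```python
# Simplified illustration, assuming gensim 4.x; the project's own embedding
# classes differ. Loads word2vec-format character vectors into a frozen
# Keras Embedding layer.
import numpy as np
from gensim.models import KeyedVectors
from keras.layers import Embedding

kv = KeyedVectors.load_word2vec_format("w2v_model_wiki_char.vec")  # char vectors (see data section)

tokens = ["[PAD]"] + list(kv.index_to_key)          # reserve index 0 for padding
matrix = np.zeros((len(tokens), kv.vector_size))
for i, token in enumerate(tokens[1:], start=1):
    matrix[i] = kv[token]                           # copy the pre-trained vector

embedding_layer = Embedding(input_dim=len(tokens),
                            output_dim=kv.vector_size,
                            weights=[matrix],
                            trainable=False)        # keep pre-trained vectors fixed
```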
keras_textclassification/data
- Data download
** Only part of the data is included in the GitHub repo; for the rest, go to: https://pan.baidu.com/s/1I3vydhmFEQ9nuPG2fDou8Q extraction code: rket
- baidu_qa_2019 (Baidu QA corpus; only the title is used as the classification sample; 17 classes, one of which is the empty string ''; compressed and uploaded)
- baike_qa_train.csv
- baike_qa_valid.csv
- embeddings
- chinese_L-12_H-768_A-12/ (Google's pre-trained Chinese BERT model; compressed and uploaded)
- term_char.txt (uploaded; complete in the repo; wiki character vocabulary; a Xinhua dictionary or similar also works)
- term_word.txt (not uploaded; only partial in the repo; can be built from the word-vector vocabulary)
- w2v_model_merge_short.vec (not uploaded; only partial in the repo; word vectors; you can substitute your own)
- w2v_model_wiki_char.vec (uploaded to the Baidu netdisk; only partial in the repo; character vectors trained on Chinese Wikipedia; you can substitute your own)
- model
- Storage location for pre-trained models
Project notes
- Base classes are provided for the network (graph) and the embeddings (word, character and sentence); the concrete models inherit from them, which keeps the code simple (see the sketch below)
- conf stores the paths for project data and models, data holds the datasets and corpora, and etl is the data-preprocessing module
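As a hedged sketch of that base-class pattern (the names BaseGraph, create_model and TextCNNGraph are illustrative, not the project's actual API): the base class owns the shared compile/fit logic, and each concrete model only overrides the network-building method.

```python
# Illustrative inheritance pattern only; class names and hyper-parameters
# are hypothetical, not the project's actual classes.
from keras.layers import Conv1D, Dense, Embedding, GlobalMaxPooling1D, Input
from keras.models import Model


class BaseGraph:
    """Owns shared compile/fit logic; subclasses only define the network."""

    def __init__(self, hyper_parameters):
        self.hp = hyper_parameters
        self.model = None
        self.create_model()

    def create_model(self):
        raise NotImplementedError  # each concrete model builds its own graph

    def fit(self, x_train, y_train, x_valid, y_valid):
        self.model.compile(optimizer="adam",
                           loss="categorical_crossentropy",
                           metrics=["accuracy"])
        self.model.fit(x_train, y_train,
                       validation_data=(x_valid, y_valid),
                       epochs=self.hp.get("epochs", 10),
                       batch_size=self.hp.get("batch_size", 64))


class TextCNNGraph(BaseGraph):
    """Concrete model: overrides only create_model."""

    def create_model(self):
        inputs = Input(shape=(self.hp["max_len"],), dtype="int32")
        x = Embedding(self.hp["vocab_size"], self.hp["embed_dim"])(inputs)
        x = Conv1D(filters=128, kernel_size=3, activation="relu")(x)  # n-gram features
        x = GlobalMaxPooling1D()(x)                                   # max-over-time pooling
        outputs = Dense(self.hp["num_classes"], activation="softmax")(x)
        self.model = Model(inputs, outputs)
```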
Models and corresponding papers
- FastText: Bag of Tricks for Efficient Text Classification
- TextCNN: Convolutional Neural Networks for Sentence Classification
- charCNN: Character-Aware Neural Language Models
- TextRNN: Recurrent Neural Network for Text Classification with Multi-Task Learning
- RCNN: Recurrent Convolutional Neural Networks for Text Classification
- DCNN: A Convolutional Neural Network for Modelling Sentences
- DPCNN: Deep Pyramid Convolutional Neural Networks for Text Categorization
- VDCNN: Very Deep Convolutional Networks for Text Classification
- CRNN: A C-LSTM Neural Network for Text Classification
- DeepMoji: Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm
- SelfAttention: Attention Is All You Need
- HAN: Hierarchical Attention Networks for Document Classification
- CapsuleNet: Dynamic Routing Between Capsules
References / Acknowledgements
- Text classification project: https://github.com/mosu027/TextClassification
- Text classification (Kanshan Cup): https://github.com/brightmart/text_classification
- Kashgari project: https://github.com/BrikerMan/Kashgari
- Text classification by lpty: https://github.com/lpty/classifier
- Keras text classification: https://github.com/ShawnyXiao/TextClassification-Keras
- Keras text classification: https://github.com/AlexYangLi/TextClassification