multi-class and charCNN-zhang

This commit is contained in:
yongzhuo 2019-08-14 22:30:32 +08:00
parent 7761430b97
commit 4c86e528d7
12 changed files with 1866 additions and 3 deletions

View File

@ -27,6 +27,7 @@
# run(test/sample实例)
- bert,word2vec,random样例在test/目录下, 注意word2vec(char or word), random-word, bert(chinese_L-12_H-768_A-12)未全部加载,需要下载
- multi_class/目录下以text-cnn为例进行多标签分类实例转化为multi-onehot标签类别分类则取一定阀值的类
- predict_bert_text_cnn.py
- tet_char_bert_embedding.py
- tet_char_random_embedding.py
@ -40,6 +41,10 @@
- baidu_qa_2019百度qa问答语料只取title作为分类样本17个类有一个是空'',已经压缩上传)
- baike_qa_train.csv
- baike_qa_valid.csv
-byte_multi_news今日头条2018新闻标题多标签语料1070个标签fate233爬取, 地址为: [byte_multi_news](https://github.com/fate233/toutiao-multilevel-text-classfication-dataset)
-labels.csv
-train.csv
-valid.csv
- embeddings
- chinese_L-12_H-768_A-12/(取谷歌预训练好点的模型,已经压缩上传)
- term_char.txt(已经上传, 项目中已全, wiki字典, 还可以用新华字典什么的)
@ -57,7 +62,8 @@
# 模型与论文paper题与地址
* FastText: [Bag of Tricks for Efficient Text Classification](https://arxiv.org/abs/1607.01759)
* TextCNN [Convolutional Neural Networks for Sentence Classification](https://arxiv.org/abs/1408.5882)
* charCNN [Character-Aware Neural Language Models](https://arxiv.org/abs/1508.06615)
* charCNN-kim [Character-Aware Neural Language Models](https://arxiv.org/abs/1508.06615)
* charCNN-zhang: [Character-level Convolutional Networks for Text Classification](https://arxiv.org/pdf/1509.01626.pdf)
* TextRNN [Recurrent Neural Network for Text Classification with Multi-Task Learning](https://www.ijcai.org/Proceedings/16/Papers/408.pdf)
* RCNN [Recurrent Convolutional Neural Networks for Text Classification](http://www.nlpr.ia.ac.cn/cip/~liukang/liukangPageFile/Recurrent%20Convolutional%20Neural%20Networks%20for%20Text%20Classification.pdf)
* DCNN: [A Convolutional Neural Network for Modelling Sentences](https://arxiv.org/abs/1404.2188)

View File

@ -21,6 +21,11 @@ path_embedding_vector_word2vec_word = path_root + '/data/embeddings/w2v_model_me
path_baidu_qa_2019_train = path_root + '/data/baidu_qa_2019/baike_qa_train.csv'
path_baidu_qa_2019_valid = path_root + '/data/baidu_qa_2019/baike_qa_valid.csv'
# 今日头条新闻多标签分类
path_byte_multi_news_train = path_root + '/data/byte_multi_news/train.csv'
path_byte_multi_news_valid = path_root + '/data/byte_multi_news/valid.csv'
path_byte_multi_news_label = path_root + '/data/byte_multi_news/labels.csv'
# fast_text config
# 模型目录
path_model_dir = path_root + "/data/model/fast_text/"

File diff suppressed because it is too large Load Diff

View File

@ -0,0 +1,9 @@
# 数据
*来源: [https://github.com/fate233/toutiao-multilevel-text-classfication-dataset](https://github.com/fate233/toutiao-multilevel-text-classfication-dataset)
*标签: 1070个
*数目:原始语料每个文件大约60万条有5个文件
*分隔符: |,|
*处理后数据:news_world/other,news_world|,|美国总统不是特朗普,而是班农
*处理后数据格式:labelquestion
*原始数据:1000866069|,|tip,news|,|【互联网资讯】PPT设计宝典!十招教你做出拿得出手的PPT|,|互联网,美国,ppt,powerpoint,幻灯片,演示文稿,微软,字体列表|,|
*原始数据格式:新闻ID分类代码新闻字符串仅含标题新闻关键词新闻类型

View File

@ -0,0 +1,170 @@
label|,|ques
news_sports/basketball/NBA,news_sports/other,news_sports|,|那些被詹皇打散重建的球队
news_tech/internet/e-business/O2O,news_finance/entrepreneurship/other,news_finance/entrepreneurship,news_finance|,|“三个成倍”是衡量O2O创业项目能否成功的关键
news_sports,news_entertainment|,|快看!梅西演电影了 网友:拉玛西亚影视学院最正宗
news_society/civil/social_insurance,news_society/other,news_agriculture/farmer,news_society,news_agriculture|,|农民50岁还干活只为买“退休”一次缴满9万元划算不划算
news_home/other,news_home|,|只需做好这四点,就能让你养的天竺葵全年花开不断!
news_travel/other,news_home/floriculture,news_travel|,|长沙,红梅盛开含笑迎接新年
news|,|赋得真情必动人,那些轻触心尖儿的情诗
news_society/news_law/statute,news_politics/other,news_society/news_law,news_society,news_politics|,|「第212期」重拳出击治“老赖”丨不开玩笑拿下13个亿
news_finance/investment/other,news_tech/other,news_politics/other,news_finance/investment,news_finance,news_tech,news_politics|,|中国品牌创新发展工程推荐企业
news_food/other,news_food|,|新人报到 | 15分钟“一人食”营养早餐怎么做
news_fashion/dressing,news_entertainment/vulgar,news_fashion|,|泫雅新MV将BRA外穿得太性感差点被禁
news_tech/other,news_tech|,|奥迪、乐视、丰田要让汽车VR从虚拟到现实还有几道坎
news_sports/tennis,news_sports|,|与费纳德同处一个时代的不幸,他最有话说
emotion/marriage/other,emotion/affection/other,emotion/marriage,emotion/affection,emotion|,|小姑子帮我带大儿子二十年后她找我借3万我给她转账30万
news_entertainment/hk_taiwan_entertainment,news_entertainment|,|因为一句爱国话在台湾疯狂掉粉 为什么还有人去骂他不爱国?
news_health/other,news_health|,|业内动态 | ERAS理念下的术后镇痛研讨会纪实总结
news|,|吸烟有害健康,戒不了,怎么办?让茶帮你
news_entertainment/west_entertainment/other,news_entertainment/west_entertainment,news_entertainment|,|《哈利波特》“斯内普教授”艾伦·里克曼去世
news_society/news_law/other,news_house/house_renting,news_society/news_law,news_society|,|“119”警示丨三小场所、出租屋火灾案例警示
digital/cellphone/huawei_phone,digital/operating_system/Android,news_tech/software,digital/operating_system,digital/hardware,digital/cellphone,digital,news_tech|,|国内首发安卓7.0 华为Mate 9之EMUI 5.0体验
news_sports/olympic_games,news_sports/ping_pong,news_sports|,|她是国乒第一个在奥运会让球的选手,让球后选择退役并远嫁韩国
news_entertainment/other,news_entertainment|,|49岁翁虹全家近照富豪老公为生儿子改名女儿可爱文静像美妈
digital/cellphone/other,digital/operating_system/Android,news/pandian,digital/cellphone,digital|,|手机也疯狂!盘点那些打破世界纪录的手机
emotion/marriage/other,emotion/affection/other,emotion/marriage,emotion/affection,emotion/jitang,emotion/low_grage_emotion,emotion|,|当妻子重要的不是有漂亮脸蛋,重要的是你必须有智慧
news_home/decoration/furniture,news_home/decoration,news_home|,|简单装修怎么搭配实木家具?这几个案例请收下
news_history/chinese_history/ancient_chinese_history/other,news_history/chinese_history/ancient_chinese_history,news_history/chinese_history,news_history|,|短暂存在的三川郡,影响中国两千年的三件事
news_society/other,news_society|,|注意啦!高速上遇到这事,千万别停车!别开车门!别开窗!
digital/coolplay/other,digital/cellphone/other,digital/computer_peripheral/mechanical_keyboard,digital/coolplay,digital/cellphone,digital|,|装逼族的“大宝剑”之花式秀键盘
news_society/civil/other,news_society/civil,government,news_society|,|成安交警大队2016年中秋小长假期间两公布一提示
news_travel/self_driving_tour,news_travel|,|我国唯一融山海湖为一体的秘境
news|,|谁家少得了书桌?不过这么美的你肯定没见过
news|,|独家|陈灼昊案判决书全文
news_military/weaponry/other,news_edu/art_education/painting,news_military/weaponry,news_military|,|致歉···弄错国产54手枪和仿制原版的苏联TT-33手枪
news_entertainment,news|,|考试作弊是门艺术~但是谁用过这些奇葩方法?
news_collect/plaything,news_collect|,|蜜蜡附有色膜是什么鬼?
news_pet/other,news_pet|,|老爷爷救了两受伤狐狸,可当它们伤好了准备放走时,却发生了这事
digital/cellphone/other,digital/hardware,digital/cellphone,digital|,|打造极致Hi-Fi影视手机 国产品牌谁也不敢与vivo为敌
news|,|父母的位置摆不正,子孙就会很危险!
emotion/affection/other,emotion/marriage/extramarital_affair,emotion/marriage,emotion/affection,emotion/low_grage_emotion,emotion|,|激情过后——请给你的婚外情披上合法的外衣
news_society/other,news_society|,|鸡蛋如何装进油桶?今天全面解析
news_sports/basketball/NBA,news_sports/basketball,news_sports|,|他是03年NBA选秀三杰最大苦主防守如铜墙铁壁一般
government,news_society/civil/other,news_society/civil,news_society|,|[余票信息]快关注、早出手,春运之路更顺心
digital/coolplay/wearable_devices,digital/other,news_tech/internet_cellphone,news_tech,digital|,|小米生态链近期发狂!?1MORE、90分近日又有新品问世!
news_culture/other,news_food/tea,news_culture|,|品精典布朗山金瓜,鉴一片茶叶里多少春秋!
news_tech,news_food/wine_drinking/other,news_finance/entrepreneurship/other,news_food/wine_drinking,news_finance/entrepreneurship,news|,|程序员创业首先学会喝酒!
news_fashion/cosmetology/body_care/other,news_fashion/cosmetology/body_care,news_fashion/other,news/pandian,news_fashion|,|经常按摩疏通淋巴,还可以瘦脸!明星们都这么做!
news|,|数学建模思想|从现实对象到数学模型!
news_edu/edu_upgrade/english_language,news_edu/other,news_edu|,|1706四六级全科备考计划
news_culture/art/calligraphy,news_edu/art_education/painting,news_collect/other,news_culture/art,news_culture,news_collect|,|3.3一元拍卖会︱名家书画专场
digital/cellphone/other,digital/photography/camera,digital/cellphone,digital|,|这也能叫黑科技双镜头cool1 dual笑而不语
news_car/other,news_car|,|4S店猫腻之选装
news_history|,|毛主席曾经特赦过七次战犯,为何没有特赦汪精卫的夫人陈璧君?
news_car/sports_car,news|,|这款车只卖给中国人,太霸道了!
digital/coolplay/virtual_reality,digital/hardware,news_tech/artificial_intelligence,technique/natural_language_processing,news_tech|,|自然裸手ARVR体验 uSens凌感正式发布Fingo
news_entertainment/gossip,news_entertainment|,|近期娱乐圈的四大重磅新闻 一条刷新三观两条让人心痛不愿相信!
news_health/regimen/other,news_health/medical_news/other,news_health/cancer/other,news_health/cancer,news_health/regimen,news_health/medical_news,news_health|,|精品原创︱大肠癌肝转移患者“能活多久”?
news|,|最爱二次元美女丨少女前线美图
news_fashion/dressing/other,news_fashion/dressing,news_fashion|,|2016最火的吊带衫搭配方案全在这了
news_finance/investment/stock/other,news_finance/investment/stock,news_agriculture/planting,news_finance/investment,news_agriculture,news_finance|,|尿素 暴涨背后的4个理由 及后市预测!
lottery/other,lottery|,|大乐透16139期关注凤尾28、29 不要错过38亿奖池
news_finance/investment/stock/other,news_finance/investment/stock,news_finance/investment,news_finance|,|宇辉战舰盘前点睛20160711
news_baby/baby_growth/baby_parenting/other,news_food/menu/other,news_baby/baby_growth/baby_parenting,news_baby/baby_nurturing/baby_nursing,news_baby/baby_nurturing,news_food/menu,news_baby/baby_growth,news_food,news_baby,news_psychology|,|半大孩子喜欢的几个家常菜
news_photography|,|归零归零 丨 更新预告 采集神秘白色体液的小护士 5月20日
news_food/dessert,news_food|,|美得如此丧心病狂的甜点,是想让我吃还是不吃呢~
news_fashion/dressing/other,funny/marvel,news_fashion/dressing,news_fashion,funny|,|被路边小哥拉进理发店,进去之后竟然是这样的!
news_car/car_usage/other,news_car/huohua_car_usage,news_car/car_usage,news_car|,|开车三年半,有一样东西该换了,可许多温州人却不知道!
news_finance/entrepreneurship/other,news_tech/maker,news_finance/entrepreneurship,news_finance,news_tech|,|创客100第十五期开放日以前他用报道把副部级官员送进监狱今天他用内容“勾引”读者
news_tech/other,news_tech|,|熟记这11个最常用PS快捷键 早日成为专业设计师
news_sports/football/football_world/football_england,news_sports/football/football_world,news_sports/football,news_sports|,|足球风云汇:收官献礼,这场大球不可错过
news_astrology/numerology/chinese_zodiac,news_culture/other,news_culture|,|第一轮生肖贺岁纪念币价值连城,你见过哪个?
news_sports/football/football_world/other,news_sports/football/football_world,news_sports/football,news_sports|,|足坛15大终身无缘世界杯的巨星谁最遗憾
news_society/other,news_politics/other,news_military/other,news_society,news_politics,news_military|,|淄博中心城区公安武警联勤武装巡逻启动
news_food/menu/other,news_food/menu,news_food|,|凉拌猪心:几道小动作“除腥增鲜”做出优质下酒菜!
science_all/animal/snake,science_all/other,science_all|,|蛇来啦!简单讲下曼巴属的区分!
news_baby/pregnancy/pregnancying,news_baby/other,news_baby|,|中国唐氏儿成了美国网红,唐筛不合格,你愿意把胎儿生下来吗?
news_tech/internet/e-business/other,news_tech/internet/e-business,news_agriculture/farming,news_tech/internet,news_agriculture,news_tech|,|冷链学术|冷链物流基础上的生鲜电商发展研究
news_health/sexual_health/women_health/gynecology,news_health/sexual_health/women_health,news_health/sexual_health,news_health|,|关于子宫内膜异位症,你应该先看看这一篇!
news_society/news_law/other,news_politics/politics_law,news_society/news_law,government,news_society|,|临漳交警执勤捡手机 归还失主受称赞
news_history/chinese_history/ancient_chinese_history/other,news_history/chinese_history/ancient_chinese_history,news_history/chinese_history,news_history|,|赵匡胤一脉能重新获得江山,靠得竟是一个女人的梦!
news_history/chinese_history/ancient_chinese_history/other,news_history/chinese_history/ancient_chinese_history,news_history/chinese_history,news_history|,|秦国年仅12岁的奇才吕不韦不到的事他办到了
news_culture/reading/poetry,news|,|网友最有情怀20首诗词我有一壶酒可以慰风尘
news_home/decoration/furniture,news_home/decoration,news_home|,|装修老司机给出的实用小窍门,句句戳心!总有几点是你所忽略掉的
news|,|浙江仙居上演荧光炫彩夜跑
news_military/weaponry/other,news_military/weaponry,news_military/military_china,news_military|,|MP5冲锋枪如此成功适合中国警察吗其实有远比它更好的选择
news_society/news_law/statute,news_society/news_law,government,news_society|,|法庭启动联动司法 诉前成功调解家庭纠纷
funny/other,funny|,|哈哈哈哈哈!最强买家秀来袭,双十一防剁手全靠它了!
news_baby/baby_growth/baby_parenting/other,news_baby/baby_growth/baby_parenting,news_baby/baby_growth,news_baby|,|教孩子学会情绪管理成才情商占80%智商仅占20%
news_edu/edu_upgrade/other,news_edu/edu_upgrade,news_edu|,|兴趣是最好的老师|小芸老师开导学生软硬兼施智慧满满
news_health/regimen/other,news_health/cancer/other,news_health/cancer,news/pandian,news_health/regimen,news_health|,|全球十大垃圾食品,吃了只会增加负担!
news_society/other,news_society|,|加拿大3名男警察性侵“警花” 或利用“失身酒”
news_society/civil/other,news_politics/politics_law,news_society/civil,news_society|,|中秋节期间玉山县道路出行提示
news_home/floriculture,news|,|花箱盆栽合理利用花园空间
news_history/chinese_history/ancient_chinese_history/other,news_history/chinese_history/ancient_chinese_history,news_culture/other,news_history/chinese_history,news_culture|,|微观茶市丨顾绍培师生紫砂作品展东莞启幕
digital/cellphone/other,digital/hardware,digital/cellphone,digital|,|小米5s真正的强大总是由内而外
news_pet/dogs,news_pet|,|二哈陷入自恋中无法自拔,每天都照镜子欣赏自己,这幕让人笑喷
news_home/decoration/other,news_home/decoration,news_home|,|头一次见水电安装地面这样开槽,楼下邻居知道了定会“拼命”的!
news_tech,news_edu/online_education,news|,|网易有道CEO周枫在线教育的风暴之眼
news_tech/internet/mobile_internet/other,news_finance/investment/stock/computer_industry,news_finance/financing/other,news_tech/internet/mobile_internet,news_finance/business_management/marketing,news_finance/financing,news_tech/internet,news_finance,news_tech|,|每日e报京东去年为员工缴纳五险一金超27亿元 传乐视金融欲收购数码视讯支付牌照
news_society/news_law/other,news_society/news_law,news_society|,|冒充烟草专卖局工作人员 女子专骗开便利店老人
news_history/military_history/anti_japanese_war,emotion/affection/love,news_entertainment|,|《毕业歌》热血与激情铸就不灭青春
news_food/tea,emotion/jitang,news_food|,|夏至夜,一锅幸福的茶鸡蛋 | 松鼠的厨房
news|,|一根有斑点的香蕉竟然这么厉害?绝对长知识!
news_baby/pregnancy/pregnancying,news_baby/pregnancy,news_baby|,|与胎儿做游戏不是天方夜谭!踢肚游戏,可培养出健壮、灵敏的宝宝
news_tech,news_finance/business_management/marketing,news|,|Annie岳被忽略的分类信息网站营销
news_fashion/fashion_man/mens_clothing,news_fashion/dressing/shoes,news_fashion/dressing,news_fashion|,|想要问问你敢不敢,像我这样穿运动服走红毯
news_history/military_history/anti_japanese_war,news_history/other,news_history|,|他屡败日寇后日寇以厚禄诱降 被他断然拒绝 朱德赞其太行屏障
news_culture/reading/literature/other,news_culture/art/other,news_culture/reading/literature,news_culture/reading/poetry,news_culture/art,news_culture/reading,news_culture|,|诗歌文学——凤鸣新春
news|,|小雪丨天渐寒,雪渐盛,又是一年将尽时
news_travel/other,news_travel|,|赏红叶—郎山红叶忆英雄
news_health/regimen/diet_therapy,news_health/traditional_chinese_medicine/chinese_herbology,news_health/regimen/healthy_eating,news_health/traditional_chinese_medicine,news_health/regimen,news_health|,|那些药膳靠谱么?
news_history,news_culture/reading/literature/ancient_poetry,news_culture/reading/literature/other,news_culture/reading/literature,news_culture/traditional_chinese/other,news_culture/traditional_chinese,news_society/other,news_society,news_culture|,|国庆长假探亲,亲戚间到底为何攀比?
digital/cellphone/other,digital/cellphone,news_tech/internet_cellphone,digital,news_tech|,|每1.1秒卖出一台2016年OPPO为何销量暴涨
news_travel/other,news_travel|,|原来“世外桃源”真的存在!
news_history/military_history/anti_japanese_war,news_history/chinese_history/modern_history_china,news_history/other,news_history|,|缅怀12.9关于12.9,请收下这些!
news_finance,news|,|新书《京东》即将上市:为什么是京东?
news_food/menu/chinese_food,news_food/catering_industry,news_food/cooking_skill,news_food|,|刘忠明收徒好威风!
news|,|全北京向上看:双彩虹配火烧云 一半彩虹一半火焰
news_tech/other,news_travel/travel_industry,news_travel,news_tech|,|新趋势:超三成网友愿在航空公司购买出境游产品
news_sports/other,news_sports|,|中断运动两星期会怎样?
news|,|春季患上心理感冒怎么办?
news_culture/art/handicraft,news_culture/art,news_culture|,|上周无锡灵山梵宫失火的廊厅里到底藏有哪些绝世珍宝?
news_society/civil/environmental_protection/environmental_pollution,news_society/civil/environmental_protection/air_pollution,news_society/civil/environmental_protection,news_society/civil/conserve_energy_emission,news_finance/other,news_finance|,|本想做个倒爷不料搞出世界500强24年不上市他为啥牛
news_history/other,news_history|,|他是李广的孙子,以五千步兵力战匈奴八万大军
news_entertainment/film_tv/movie/euro_and_us_movie,news_entertainment/west_entertainment/hollywood,news_entertainment/film_tv/movie,news_entertainment/film_tv,news_entertainment|,|能看到这两位“黑帮教父”同台对垒,是多少影迷的夙愿
news_fashion/fashion_man/mens_clothing,news_fashion/dressing/shoes,news_fashion/fashion_man,news_fashion|,|你真以为你看全了GUCCI男装秀太天真了
news_car/new_energy_car/electric_car,news_car/luxury_car,news_car/car_industry,news_car/german_car,news_car|,|增速要高于高端车市场 奔驰如何押宝中国?
news_edu/other,news_tech/software,news_edu|,|不会做的题,找“题谷”看老师视频讲解啊!
news_comic/blood_cartoon,news|,|母亲节,你老妈请求添加你为好友
news_society/weather/other,news_society/weather,news_society|,|烟台今起3天都有阵雨气温仍较高今天最高温31℃
news_food/menu/other,news_food/menu,news_food|,|用最简单的方法炒出最好吃的土豆丝,一上桌抢着吃
news_home/other,news_home|,|新型环保背景墙,敬请欣赏
news_tech,news|,|双十一战火已燃,京东拒绝刷单出奇招
news_sports/fight/boxing,news_sports/fight/integrated_combat,news_sports/fight,news_sports|,|邹市明被高估拳迷年轻5岁打伦龙满地找牙 KO播求不是梦
technique,science_all,news_world,news_travel,news_tech,news_sports,news_society,news_politics,news_pet,news_military,news_house,news_home,news_history,news_health,news_game,news_food,news_finance,news_fashion,news_entertainment,news_edu,news_culture,news_comic,news_car,news_baby,news_astrology,lottery,immigration,funny,fashion_wedding,emotion,digital,buddhism|,|花卉瓶插的理想水温是多少
news_tech,news|,|国内资本为何蜂拥投资直播行业?
news_society/other,news_society|,|事关3700000宝鸡人的好消息距离2018年还
news_tech/internet/mobile_internet/other,news_finance/entrepreneurship/other,news_tech/internet/mobile_internet,news_finance/entrepreneurship,news_finance|,|推出伙伴创业计划,滴滴到底想要干什么?
news_tech/internet/mobile_internet/other,news_tech/internet/mobile_internet,news_tech/other,news_media,news_tech|,|活在朋友圈的自拍一代,把整个世界都当成了背景!
news_photography|,|藤椅上小憩的长腿美少女,笑容清纯甜美,身姿窈窕,极品女神!
news_fashion/cosmetology/make_up/other,news_fashion/cosmetology/beautify/skin_care,news_fashion/cosmetology/beautify,news_fashion/cosmetology/make_up,news_fashion/other,news_fashion|,|冒着被海关叔叔扣下也要带回的药妆在这里(附香邂格蕾获奖名单)
news_food/snack,news_food|,|今日立冬,北京的吃货们该进补了
news_photography|,|美翻了泉州夜景,你见过了吗?
news_culture/other,emotion/jitang,news_culture|,|国馆丨总要争输赢的人,没有未来
emotion/jitang,news_psychology|,|测一下你的社交性格?
news_history/chinese_history/ancient_chinese_history/other,news_history/chinese_history/ancient_chinese_history,news_history/chinese_history,news_history|,|康熙皇帝为什么是中国历史上最了解西方科学技术的帝王
news_finance/investment/other,news_society/civil/social_insurance,news_finance/financing/insurance,news_finance/investment,news_society/civil,news_finance|,|医保到底有没有用?有什么用?
news/pandian,news_comic/japan_cartoon,news_comic/manhua,news_comic|,|打破主角光环!解析路飞四档
news_society/other,news_society|,|抢订退伍兵在行动,你被抢了吗?
news_sports/basketball/NBA,news_sports/basketball,news_sports|,|由飞踹下体引发的溃败,更加证明格林才是勇士胜负的关键手!
news_politics/politics_law,news_society/other,news_society|,|临漳交警春节前夕开展夜查统一行动
digital,news|,|让家里wifi信号增强10倍只要一个它
news_home/other,news_agriculture/animal_husbandry,news_agriculture,news_home|,|原本让人头疼的水葫芦,你却在家养的让人羡慕!
news_food/seafood,news|,|据说这款被称为史上最强的饵料能把红虫都打闭口?
news_finance/investment/stock/usa_stock,news_finance/financing/other,news_finance/macro_economic/macro_economic_china,news_finance/financing,news_finance/macro_economic,news_finance|,|国际评定机构下调中国信用主权评级?不必太在意
news_food/menu,news_food|,|日本豆腐的9种不同做法合集
news_comic/manhua/other,news/pandian,news_comic/manhua,news_comic|,|细思密恐|漫画|细数人类创造的那些恐怖发明
news_politics,news_house,news_finance|,|两会初探:代表委员们如何给房地产建言献策?
news_food/other,news_food|,|栗蘑美食系列之三“栗蘑炖鸡”
news_tech/internet/e-business/other,news_finance/business_management/marketing,news_tech/internet/e-business,news_tech/other,news/singles_day,news_edu/philosophy,news_tech|,|好物哲学+内容营销 京东超市赢在双11
news_astrology/other,news_astrology|,|星座女神:本周运势0704—0710
digital/coolplay/wearable_devices,digital/coolplay,digital|,|Basslet手环让乐迷在外也能感受低音炮的震撼
news_sports/football/football_china/china_league,news_sports/football/football_china,news_sports/other,news_sports|,|卡纳瓦罗有能力带领天津权健冲超?我不太相信
news_finance/investment/foreign_exchange,news_finance/other,news_finance|,|人民币中间价大跌超400点海外资产配置势在必行
news_entertainment/gossip,news_entertainment|,|出轨、把妹、各取所需?这些明星的人设为何说崩就崩
news_house/other,news/pandian,news_house|,|翻开2016 : 看看升龙集团成功的八大关键词
news_entertainment/film_tv/movie/other,news_entertainment/film_tv/movie,news_entertainment/film_tv,news_pet/cats,news_entertainment|,|《九条命》,毛裤先生笑傲三甲
news_baby/baby_growth/other,news_baby/baby_nurturing/baby_nursing,news_baby/baby_nurturing,news_baby/baby_growth,news_baby|,|宝宝说话晚是“贵人语迟”?说话晚的孩子智商高吗?
news_sports/basketball/NBA,news_sports/other,news_sports|,|连续两个赛季18次三双威少再成历史第一胡子兄弟爆双熊成砥柱
news_entertainment/film_tv/movie/horror_movie,news_entertainment/film_tv/movie,news_entertainment/film_tv,news_entertainment|,|不超过1000个人看过这部有点像《前目的地》的悬疑恐怖片
Can't render this file because it has a wrong number of fields in line 11.

View File

@ -0,0 +1,98 @@
label|,|ques
news_tech/internet/internet_finance/p2p_finance,news_finance/investment/finance_management,news_finance/other,news_finance|,|行业整顿后的P2P理财会变成什么样愉见财经
emotion/jitang,news|,|老婆:今天小年,你好吗
news_astrology/other,news_astrology|,|测试 | 流言蜚语能伤你多少
news_finance/investment/stock/usa_stock,news|,|"朵女郎"登陆纽约时代广场纳斯达克大屏为中国健儿加油
news_entertainment/other,news_entertainment|,|贝克汉姆虐布鲁克林千百回,却始终待小七如初恋
news_history/chinese_history/ancient_chinese_history/other,news_history/chinese_history/ancient_chinese_history,news_history/chinese_history,news_history|,|明朝的庚戌之变究竟是谁之过?
news_health/other,news_health|,|手术中医生想要“套套”,护士却不解其意
news_essay|,|无忧之人老练,竟然也有烦心之事
digital/cellphone/other,digital/cellphone,digital|,|小米有史以来下架最快的手机,你有买过么?
news_world/other,news_world|,|美国总统不是特朗普,而是班农
news_edu/edu_upgrade/college_entrance_examination,news_edu/edu_upgrade,news_edu|,|沈阳一考点挂霸气条幅 刷爆朋友圈
news_local,emotion/jitang,news|,|几年的放纵,换来的是一生卑微,致济南所有青春里的孩子!
news_comic/manhua/other,news_comic/manhua,news_society/other,news_society|,|怎样成为一个人见人爱的奇男子?「颜值篇」
news_game/mobile_game/other,emotion/affection/other,emotion/affection,emotion/jitang,news_game/mobile_game,emotion,news_game|,|感情不是游戏,谁也伤不起
news_society/civil/environmental_protection/air_pollution,news_society/civil/environmental_protection/environmental_pollution,news_society/civil/environmental_protection,news_society/other,news_society|,|全球污染城市排名出炉,面对严重雾霾三岁以下儿童家长怎么应对!
news_baby/pregnancy/other,news_baby/pregnancy,news_baby|,|我的暖男天使
emotion,news|,|吃雪饼没变旺被举报:生活里的奇葩为什么多了起来?
news_society/other,news_society|,|如果你不了解这个词经常出入会有危险!
emotion/marriage/other,emotion/marriage,news_essay|,|休闲空间你以为你是谁?
news_tech,digital|,|出门访友 异地也能轻松下载
news_baby/baby_growth/baby_parenting/pre_education,news_tech/other,technique/cloud_computing,news_edu/family_education,news_tech|,|探访成都明星创业公司之铁皮人:智能玩具不能取代亲子教育
news_tech/internet/other,news_tech/internet,news_tech|,|巨人雅虎没有失败!英雄迟暮的故事还有很多,因为大家喜欢听
news_edu/edu_upgrade/chinese_language,news_culture/reading/poetry,news|,|高中语文诗词鉴赏四步教学法
news_culture/art/traditional_opera,news/pandian,news_entertainment/drama,news_culture/art,news_culture|,|盘点2016年蒲剧大事记
news_society/other,news_society|,|惊艳!武汉首辆有轨电车试跑,真的快通车了
funny/other,funny|,|妹妹对姐姐说:姐今天蚊子好多哦~
news_society,news|,|黄梅|龙感湖飞鸟翔集
news|,|“最初一公里”是国产水果提高市场竞争力的关键
news_fashion/dressing/other,news_home/decoration/home_design,news_home/other,news_design/other,news_fashion/dressing,news_design,news_home,news_fashion|,|2016年度流行色搭配9款客厅设计势不可挡
news_politics/politics_law,news_society/other,news_society|,|学习消防安全常识二十条
news_health/body_building/other,news_health/body_building,news_health/slimming,news_health|,|健身千万别忽略小腿的锻炼6种方式增强小腿力量减缓腿部衰老
news_sports/basketball/other,news/pandian,news_sports/basketball,news_sports|,|第一大前走好,盘点邓肯和其他顶级大前的对位。
news_society/other,news_society|,|寻找“我身边的工匠”手机诵读大赛 3000元大奖邀您参加
news_car/car_usage/other,news_car/huohua_car_usage,news_car/car_usage,news_car|,|汽车天窗7个使用小技巧
news,news_home/other,news_home|,|火龙果小森林制作日记及教程
news_edu/edu_upgrade/adult_education,news_politics/other,news_edu/other,news_edu,news_politics|,|马克思主义学院在职研修班10月23日截止报名!
news_history/world_history/tomb,news_history/archaeology,news_culture/historical_relic,news_history|,|一个灵异的陵墓,墓道刻有一句诅咒,千百年来,盗墓贼均无一生还
news_history,news|,|狠太子弑父皇,在宫中大开杀戒,下令剖开父皇爱妃的胸膛
news_health/cancer/breast_cancer,news_health/cancer,news_health/health_soft,news_health|,|女人刮腋毛会不会惹乳腺癌?
news_military/navy/warship,news_military/weaponry/aircraft_carrier,news_entertainment/other,news_military/navy,news_military/weaponry,news_entertainment|,|感动!一个海军航空兵的士兵突击
digital/coolplay/smart_home/other,digital/coolplay/smart_home,digital/coolplay,digital|,|一款可以远程遥控的开关----COOLTOUCH无线智能开关
news,video_funny|,|央视财经频道就静宁县苹果产业进行专访
news_tech/other,digital/hardware,news_tech|,|跨界化反乐视“413生态共享之夜”是一道怎样的化学方程式
emotion/marriage/other,news_society/other,news_baby/other,emotion/marriage,news_baby,news_society|,|连续三胎生下脑瘫孩子,心力交瘁的夫妻俩得知原因后陷入绝望
emotion/jitang,news_essay|,|我们都坐在没有月光的屋子里
news_tech/internet/mobile_internet/other,news_tech/internet/mobile_internet,news_tech/other,news_tech|,|专车新政落地,神州、优步、滴滴谁最受益?
news_travel/other,news_travel|,|天下第一油菜花
news_food/menu/other,news_food/seafood,news_food/menu,news_food|,|男人最爱的下酒菜!这么烧,鲜香味美,一盘铁定不够吃
news_fashion/cosmetology/body_care/other,news_health/body_building/other,news_fashion/cosmetology/body_care,news_health/body_building,news_health/slimming,news_health|,|牛掰的是:已经有肌肉的人仍在塑形脱脂变型男,而你仍是个喷子.
funny,news|,|62岁老人曾4小时玩108次跳楼机
emotion/marriage/other,emotion/affection/love,emotion/affection,emotion/marriage,emotion/jitang,emotion|,|“如果不是对的人,无论追多久也不会换来爱情的”
digital/operating_system/Android,digital/pad,digital/hardware,news_tech/software,digital|,|联想:安卓平板还是有市场的
news_food/menu/other,news_food/menu,news_food|,|告诉你,我家的肉饼都是这样煎出来的
news_food/other,news_food|,|味道 | 大文豪元好问曾用它祭雁,这到底是什么美味?
news_food/other,news_food|,|我想和你做件温暖又健康的事!吃小火锅,你请!
news_pet/other,news_pet|,|故事:为了泡妞打赌,傻缺穿着红裤衩去河边撵狗,中邪差点跳了河
news_food/wine_drinking/grape_wine,news_food/wine_drinking,news|,|怎样做葡萄酒,来自酒厂的秘方教你自制红酒
news_food,news|,|我们驱车20多公里就为了这一顿魂牵梦萦的牛肉
news_politics/politics_law,news_society/other,government,news_society|,|校车安全检查
news_fashion/cosmetology/beautify/skin_care,news_health/regimen/other,news_health/daily_health,news_health/regimen,news_health|,|嫩手炸弹洗碗布上的细菌竟是带有40亿个活细菌伤手又伤身
emotion/affection/other,emotion/affection,emotion/low_grage_emotion,emotion/jitang,emotion|,|男人的爱,女人永远觉得不够
news_baby,news|,|6月8日海洋日阅读推荐这是我们的海洋我们的责任
news_society/news_law/statute,news_politics/other,news_society/news_law,news_society,news_politics|,|夜深了,法院会议室的灯还亮着······
news_entertainment/film_tv/movie/film_festival,news_entertainment/other,news_entertainment|,|第三届丝路电影节形象大使成龙获奥斯卡终身成就奖
news_society/other,news_society|,|肯陪伴、悦成长,从“心”关爱留守儿童
digital/coolplay/virtual_reality,news_tech/other,digital/coolplay,news_tech,digital|,|排名第一遥遥领先三星Gear VR占到七成以上
news_society/civil/philanthropy,news_edu/edu_upgrade/college,news_edu/other,news_edu|,|陈天桥捐1.15亿美元背后:为啥中国富豪喜欢给美国大学捐款?
digital/coolplay/other,digital/headset,digital/hardware,digital/coolplay,digital|,|iPhone7音质救星——3款千元级便携耳放推荐
news_pet/cats,news_pet|,|猫咪躲水沟盖下,七十多岁老奶奶怕它饿坏,趴地上塞饲料,太感人
news_history/other,news/pandian,news_history|,|揭秘朱元璋杀人:众大臣装疯卖傻才能免于一死
news_tech/internet/other,news_tech/internet,news_tech|,|一天纳税1亿不过是小目标马云的大目标是要做“英雄联盟”
news_world/diplomacy/middle_east,news_military/military_world,news_world/diplomacy,news_world,news_military|,|俄罗斯不惧拼刺刀 贴近北约部署核导弹 美国使“阴招”应对
news_fashion/dressing,news_fashion|,|穿好boyfriend jeans时尚变身没商量
news_culture/traditional_chinese/other,news_politics/other,news_culture/traditional_chinese,news_agriculture/countryside,news_culture,news_politics|,|每年3000万扶持文化旅游发展还首设孟子文化奖……邹城这是给全国人民送大礼
news_baby/baby_growth,news_baby|,|看哭了无数妈妈的断奶故事
news_society/other,news_politics/other,news_society,news_politics|,|海西消防力保元宵节期间全州消防安全形势稳定
news_entertainment|,|蒋勤勤18岁出道青涩照美貌赛天仙(图)
emotion/affection/other,emotion/affection,emotion/low_grage_emotion,emotion|,|老公让她人流七次,爱爱时避孕到底是谁的任务?
news_comic/japan_cartoon/other,news_comic/japan_cartoon,news_comic/blood_cartoon,news_comic|,|没有这个齐神看真的是要死了,幸好续作制作决定来得像龙卷风
news_astrology/other,news_astrology|,|星座女神本周运势10.17—10.23
news_military/other,news_military|,|盘点各种“异型”枪,你所不知道的伪装枪
news_car/car_exhibition,news_car/south_korean_car,news_car/car_new_arrival,news_car|,|或成起亚最快量产车Stinger亮相北美车展
news_health/other,news_health|,|8个动作帮你 瘦腿!瘦腰!瘦手臂,一个月大变身!
news_game/mobile_game/other,news_fashion/fashion_man/other,news_fashion/fashion_man,news_game/mobile_game,news_society/positive,news_fashion,news_game|,|吉田羊 × Acne Studios
news_tech/internet/mobile_internet/other,news_tech/internet/e-business/O2O,news_tech/internet/mobile_internet,news_tech/internet,news_tech|,|美团与点评合并,这是一个时代的开始?也许是一个时代的结束
science_all/other,science_all|,|男子翻修房子时,发现神秘地下室,真是稀奇!
news_tech/software,digital/cellphone,digital,news_tech|,|iPhone花样式刷机教程
news_sports/swimming,news_sports|,|中国赚钱最多的运动员,第一原来是她
digital/appliances/small_home_appliance,news_health/other,news_health|,|提醒!这种杯子不能用来喝水,赶紧换掉,修武、武陟、焦作人家里都有!
news_entertainment/film_tv/movie/other,news_entertainment/film_tv/movie,news_entertainment/film_tv,news_entertainment|,|实力派黄渤将加盟《西游记:女儿国》
news_sports/olympic_games/other,news_entertainment/other,news_sports/ping_pong,news_sports/olympic_games,news_sports,news_entertainment|,|张继科成大忙人 元宵晚会他将演唱这首老歌 你还记得当初的她吗
news_society/news_law/statute,news_society/news_law,news_society|,|阆中开展夏季治安清查 一批涉案人员被拘留
digital/other,digital|,|小蚁智能行车记录仪实测画面曝光,值得期待!
news_culture/reading/poetry,news|,|诗词闲读2天净沙·秋
news|,|慧保周报2016年第22周
digital/cellphone/other,digital/hardware,digital/cellphone,news_tech/internet_cellphone,digital,news_tech|,|MTK的芯片能不能配得上高端手机
news_entertainment/film_tv/movie/chinese_movie,news_entertainment/film_tv/movie/horror_movie,news_entertainment/film_tv/movie,news_entertainment/film_tv,news_entertainment|,|她做了半个世纪的配角演员,成香港电影传奇,却一生单身无人伴
Can't render this file because it has a wrong number of fields in line 4.

View File

@ -18,6 +18,48 @@ import re
import os
def txt_read(file_path, encode_type='utf-8'):
"""
读取txt文件默认utf8格式
:param file_path: str, 文件路径
:param encode_type: str, 编码格式
:return: list
"""
list_line = []
try:
file = open(file_path, 'r', encoding=encode_type)
while True:
line = file.readline()
line = line.strip()
if not line:
break
list_line.append(line)
file.close()
except Exception as e:
print(str(e))
finally:
return list_line
def txt_write(list_line, file_path, type='w', encode_type='utf-8'):
"""
txt写入list文件
:param listLine:list, list文件写入要带"\n"
:param filePath:str, 写入文件的路径
:param type: str, 写入类型, w, a等
:param encode_type:
:return:
"""
try:
file = open(file_path, type, encoding=encode_type)
file.writelines(list_line)
file.close()
except Exception as e:
print(str(e))
def extract_chinese(text):
"""
只提取出中文字母和数字
@ -178,3 +220,141 @@ class PreprocessText:
random.shuffle(indexs)
x_, y_ = x_[indexs], y_[indexs]
return x_, y_
def transform_multilabel_to_multihot(sample, label=1070):
"""
:param sample: [1, 2, 3, 4]
:param label: 1022
:return: [1, 0, 1, 1, ......]
"""
result = np.zeros(label)
result[sample] = 1
res = result.tolist()
# res = ''.join([str(r) for r in res])
return res
class PreprocessTextMulti:
"""
数据预处理, 输入为csv格式, [label,ques]
"""
def __init__(self):
self.l2i_i2l = None
if os.path.exists(path_fast_text_model_l2i_i2l):
self.l2i_i2l = load_json(path_fast_text_model_l2i_i2l)
def prereocess_idx(self, pred):
if os.path.exists(path_fast_text_model_l2i_i2l):
pred_i2l = {}
i2l = self.l2i_i2l['i2l']
for i in range(len(pred)):
pred_i2l[i2l[str(i)]] = pred[i]
pred_i2l_rank = [sorted(pred_i2l.items(), key=lambda k: k[1], reverse=True)]
return pred_i2l_rank
else:
raise RuntimeError("path_fast_text_model_label2index is None")
def prereocess_pred_xid(self, pred):
if os.path.exists(path_fast_text_model_l2i_i2l):
pred_l2i = {}
l2i = self.l2i_i2l['l2i']
for i in range(len(pred)):
pred_l2i[pred[i]] = l2i[pred[i]]
pred_l2i_rank = [sorted(pred_l2i.items(), key=lambda k: k[1], reverse=True)]
return pred_l2i_rank
else:
raise RuntimeError("path_fast_text_model_label2index is None")
def preprocess_label_ques_to_idx(self, embedding_type, path, embed, rate=1, shuffle=True):
if type(path) == str:
label_ques = txt_read(path)
ques = list()
label = list()
for lq in label_ques[1:]:
lqs = lq.split('|,|')
ques.append(lqs[1])
label.append(lqs[0])
elif type(path) == list and ',' in path[0]:
label = [label_ques.split(',')[0] for label_ques in path]
ques = [label_ques.split(',')[1] for label_ques in path]
else:
raise RuntimeError('type of path is not true')
len_ql = int(rate * len(ques))
if len_ql <= 50: # 数量较少时候全取, 不管rate
len_ql = len(ques)
ques = ques[: len_ql]
label = label[: len_ql]
print('rate ok!')
ques = [str(q).strip().upper() for q in ques]
# label = [[l_ for l_ in str(l).upper().split(',')] for l in label]
# 获取单个标签, 如 ['news,tips', 'news,news_tech']转化为['news','tips','news_tech']
# label_list = []
# for l in label:
# label_single = str(l).strip().upper().split(',')
# label_list = label_list + label_single
# label_set = set(label_list)
# len_label_set = len(label_set)
# print(len_label_set)
from keras_textclassification.conf.path_config import path_byte_multi_news_label
byte_multi_news_label = txt_read(path_byte_multi_news_label)
byte_multi_news_label = [i.strip().upper() for i in byte_multi_news_label]
label_set = set(byte_multi_news_label)
len_label_set = len(label_set)
# 保存标签类标,数字文本转化
count = 0
label2index = {}
index2label = {}
for label_one in label_set:
label2index[label_one] = count
index2label[count] = label_one
count = count + 1
l2i_i2l = {}
l2i_i2l['l2i'] = label2index
l2i_i2l['i2l'] = index2label
save_json(l2i_i2l, path_fast_text_model_l2i_i2l)
x = []
for que in ques:
que_embed = embed.sentence2idx(que)
x.append(que_embed) # [[], ]
print('que_embed ok!')
# 转化为多标签类标
label_multi_list = []
count = 0
for l in label:
count += 1
label_single = str(l).strip().upper().split(',')
label_single_index = [l2i_i2l['l2i'][ls] for ls in label_single]
label_multi = transform_multilabel_to_multihot(label_single_index, label=len_label_set)
label_multi_list.append(label_multi)
print('label_multi_list ok!')
if embedding_type == 'bert':
x_, y_ = np.array(x), np.array(label_multi_list)
if shuffle:
indexs = [ids for ids in range(len(y_))]
random.shuffle(indexs)
x_, y_ = x_[indexs], y_[indexs]
x_1 = np.array([x[0] for x in x_])
x_2 = np.array([x[1] for x in x_])
x_all = [x_1, x_2]
return x_all, y_
else:
x_, y_ = np.array(x), np.array(label_multi_list)
if shuffle:
indexs = [ids for ids in range(len(y_))]
random.shuffle(indexs)
x_, y_ = x_[indexs], y_[indexs]
return x_, y_

View File

@ -2,4 +2,55 @@
# !/usr/bin/python
# @time :2019/6/12 14:32
# @author :Mo
# @function :
# @function :graph of charCNN-zhang
# @paper : Character-level Convolutional Networks for Text Classification(https://arxiv.org/pdf/1509.01626.pdf)
from __future__ import print_function, division
# char cnn
from keras.layers import Convolution1D, MaxPooling1D, ThresholdedReLU
from keras.layers import Dense, Dropout, Flatten
from keras.models import Model
from keras_textclassification.base.graph import graph
class CharCNNGraph(graph):
def __init__(self, hyper_parameters):
"""
初始化
:param hyper_parameters: json超参
"""
self.char_cnn_layers = hyper_parameters['model'].get('char_cnn_layers',
[[256, 7, 3], [256, 7, 3], [256, 3, -1], [256, 3, -1], [256, 3, -1], [256, 3, 3]],)
self.full_connect_layers = hyper_parameters['model'].get('full_connect_layers', [1024, 1024],)
self.threshold = hyper_parameters['model'].get('threshold', 1e-6)
super().__init__(hyper_parameters)
def create_model(self, hyper_parameters):
"""
构建神经网络
:param hyper_parameters:json, hyper parameters of network
:return: tensor, moedl
"""
super().create_model(hyper_parameters)
x = self.word_embedding.output
# x = Reshape((self.len_max, self.embed_size, 1))(embedding_output) # (None, 50, 30, 1)
# cnn + pool
for char_cnn_size in self.char_cnn_layers:
x = Convolution1D(filters = char_cnn_size[0],
kernel_size = char_cnn_size[1],)(x)
x = ThresholdedReLU(self.threshold)(x)
if char_cnn_size[2] != -1:
x = MaxPooling1D(pool_size = char_cnn_size[2],
strides = 1)(x)
x = Flatten()(x)
# full-connect
for full in self.full_connect_layers:
x = Dense(units=full,)(x)
x = ThresholdedReLU(self.threshold)(x)
x = Dropout(self.dropout)(x)
output = Dense(units=self.label, activation=self.activate_classify)(x)
self.model = Model(inputs=self.word_embedding.input, outputs=output)
self.model.summary(120)

View File

@ -0,0 +1,100 @@
# -*- coding: UTF-8 -*-
# !/usr/bin/python
# @time :2019/8/14 11:08
# @author :Mo
# @function :train of CharCNNGraph_kim with baidu-qa-2019 in question title
# 适配linux
import pathlib
import sys
import os
project_path = str(pathlib.Path(os.path.abspath(__file__)).parent.parent.parent)
sys.path.append(project_path)
# 地址
from keras_textclassification.conf.path_config import path_model, path_fineture, path_model_dir, path_hyper_parameters
# 训练验证数据地址
from keras_textclassification.conf.path_config import path_baidu_qa_2019_train, path_baidu_qa_2019_valid
# 数据预处理, 删除文件目录下文件
from keras_textclassification.data_preprocess.text_preprocess import PreprocessText, delete_file
# 模型图
from keras_textclassification.m03_CharCNN.graph_zhang import CharCNNGraph as Graph
# 计算时间
import time
def train(hyper_parameters=None, rate=1.0):
"""
训练函数
:param hyper_parameters: json, 超参数
:param rate: 比率, 抽出rate比率语料取训练
:return: None
"""
if not hyper_parameters:
hyper_parameters = {
'len_max': 50, # 句子最大长度, 固定 推荐20-50
'embed_size': 300, # 字/词向量维度
'vocab_size': 20000, # 这里随便填的,会根据代码里修改
'trainable': True, # embedding是静态的还是动态的, 即控制可不可以微调
'level_type': 'char', # 级别, 最小单元, 字/词, 填 'char' or 'word'
'embedding_type': 'random', # 级别, 嵌入类型, 还可以填'random'、 'bert' or 'word2vec"
'gpu_memory_fraction': 0.66, #gpu使用率
'model': {'label': 17, # 类别数
'batch_size': 32, # 批处理尺寸, 感觉原则上越大越好,尤其是样本不均衡的时候, batch_size设置影响比较大
'filters': [2, 3, 4, 5], # 卷积核尺寸
'filters_num': 300, # 卷积个数 text-cnn:300-600
'channel_size': 1, # CNN通道数
'dropout': 0.5, # 随机失活, 概率
'decay_step': 100, # 学习率衰减step, 每N个step衰减一次
'decay_rate': 0.9, # 学习率衰减系数, 乘法
'epochs': 20, # 训练最大轮次
'patience': 10, # 早停,2-3就好
'lr': 1e-3, # 学习率, 对训练会有比较大的影响, 如果准确率一直上不去,可以考虑调这个参数
'l2': 1e-9, # l2正则化
'activate_classify': 'softmax', # 最后一个layer, 即分类激活函数
'loss': 'categorical_crossentropy', # 损失函数
'metrics': 'accuracy', # 保存更好模型的评价标准
'is_training': True, # 训练后者是测试模型
'model_path': path_model,
# 模型地址, loss降低则保存的依据, save_best_only=True, save_weights_only=True
'path_hyper_parameters': path_hyper_parameters, # 模型(包括embedding),超参数地址,
'path_fineture': path_fineture, # 保存embedding trainable地址, 例如字向量、词向量、bert向量等
# only charCNN_zhang
'threshold': 1e-6,
'char_cnn_layers': [[256, 7, 3], [256, 7, 3], [256, 3, -1], [256, 3, -1], [256, 3, -1], [256, 3, 3]], # small
# [[1024, 7, 3], [1024, 7, 3], [1024, 3, -1], [1024, 3, -1], [1024, 3, -1], [1024, 3, 3]], # large
'full_connect_layers': [1024, 1024], # [2048, 2048], large
},
'embedding': {'layer_indexes': [12], # bert取的层数,
# 'corpus_path': '', # embedding预训练数据地址,不配则会默认取conf里边默认的地址
},
'data':{'train_data': path_baidu_qa_2019_train, # 训练数据
'val_data': path_baidu_qa_2019_valid # 验证数据
},
}
# 删除先前存在的模型\embedding微调模型等
delete_file(path_model_dir)
time_start = time.time()
# graph初始化
graph = Graph(hyper_parameters)
print("graph init ok!")
ra_ed = graph.word_embedding
# 数据预处理
pt = PreprocessText()
x_train, y_train = pt.preprocess_label_ques_to_idx(hyper_parameters['embedding_type'],
hyper_parameters['data']['train_data'],
ra_ed, rate=rate, shuffle=True)
x_val, y_val = pt.preprocess_label_ques_to_idx(hyper_parameters['embedding_type'],
hyper_parameters['data']['val_data'],
ra_ed, rate=rate, shuffle=True)
print("data propress ok!")
print(len(y_train))
# 训练
graph.fit(x_train, y_train, x_val, y_val)
print("耗时:" + str(time.time()-time_start))
if __name__=="__main__":
train(rate=0.01) # sample条件下设为1,否则训练语料可能会很少

View File

@ -1,5 +1,5 @@
# -*- coding: UTF-8 -*-
# !/usr/bin/python
# @time :2019/6/3 10:50
# @time :2019/8/14 21:23
# @author :Mo
# @function :

View File

@ -0,0 +1,84 @@
# -*- coding: UTF-8 -*-
# !/usr/bin/python
# @time :2019/8/14 17:40
# @author :Mo
# @function :
# 适配linux
import pathlib
import sys
import os
project_path = str(pathlib.Path(os.path.abspath(__file__)).parent.parent.parent)
sys.path.append(project_path)
# 地址
from keras_textclassification.conf.path_config import path_model, path_fineture, path_model_dir, path_hyper_parameters
# 训练验证数据地址
from keras_textclassification.conf.path_config import path_byte_multi_news_valid, path_byte_multi_news_train
# 数据预处理, 删除文件目录下文件
from keras_textclassification.data_preprocess.text_preprocess import PreprocessTextMulti, read_and_process, load_json
# 模型图
from keras_textclassification.m02_TextCNN.graph import TextCNNGraph as Graph
# 模型评估
from sklearn.metrics import classification_report
# 计算时间
import time
import numpy as np
def pred_input(path_hyper_parameter=path_hyper_parameters):
# 输入预测
# 加载超参数
hyper_parameters = load_json(path_hyper_parameter)
pt = PreprocessTextMulti()
# 模式初始化和加载
graph = Graph(hyper_parameters)
graph.load_model()
ra_ed = graph.word_embedding
ques = '我要打王者荣耀'
# str to token
ques_embed = ra_ed.sentence2idx(ques)
if hyper_parameters['embedding_type'] == 'bert':
x_val_1 = np.array([ques_embed[0]])
x_val_2 = np.array([ques_embed[1]])
x_val = [x_val_1, x_val_2]
else:
x_val = ques_embed
# 预测
pred = graph.predict(x_val)
print(pred)
# 取id to label and pred
pre = pt.prereocess_idx(pred[0])
ls_nulti = []
for ls in pre[0]:
if ls[1] >= 0.1:
ls_nulti.append(ls)
print(pre[0])
print(ls_nulti)
while True:
print("请输入: ")
ques = input()
ques_embed = ra_ed.sentence2idx(ques)
print(ques_embed)
if hyper_parameters['embedding_type'] == 'bert':
x_val_1 = np.array([ques_embed[0]])
x_val_2 = np.array([ques_embed[1]])
x_val = [x_val_1, x_val_2]
else:
x_val = ques_embed
pred = graph.predict(x_val)
pre = pt.prereocess_idx(pred[0])
ls_nulti = []
for ls in pre[0]:
if ls[1] >= 0.1:
ls_nulti.append(ls)
print(pre[0])
print(ls_nulti)
if __name__=="__main__":
# 测试集预测
# pred_tet(path_test=path_byte_multi_news_valid, rate=1) # sample条件下设为1,否则训练语料可能会很少
# 可输入 input 预测
pred_input()

View File

@ -0,0 +1,90 @@
# -*- coding: UTF-8 -*-
# !/usr/bin/python
# @time :2019/8/14 16:14
# @author :Mo
# @function :
# 适配linux
import pathlib
import sys
import os
project_path = str(pathlib.Path(os.path.abspath(__file__)).parent.parent.parent)
sys.path.append(project_path)
# 地址
from keras_textclassification.conf.path_config import path_model, path_fineture, path_model_dir, path_hyper_parameters
# 训练验证数据地址
from keras_textclassification.conf.path_config import path_byte_multi_news_train, path_byte_multi_news_valid
# 数据预处理, 删除文件目录下文件
from keras_textclassification.data_preprocess.text_preprocess import PreprocessTextMulti, delete_file
# 模型图
from keras_textclassification.m02_TextCNN.graph import TextCNNGraph as Graph
# 计算时间
import time
def train(hyper_parameters=None, rate=1.0):
if not hyper_parameters:
hyper_parameters = {
'len_max': 50, # 句子最大长度, 固定推荐20-50, bert越长会越慢, 占用空间也会变大, 本地win10-4G设为20就好, 过大小心OOM
'embed_size': 300, # 字/词向量维度, bert取768, word取300, char可以更小些
'vocab_size': 20000, # 这里随便填的,会根据代码里修改
'trainable': True, # embedding是静态的还是动态的, 即控制可不可以微调
'level_type': 'char', # 级别, 最小单元, 字/词, 填 'char' or 'word', 注意:word2vec模式下训练语料要首先切好
'embedding_type': 'random', # 级别, 嵌入类型, 还可以填'random'、 'bert' or 'word2vec"
'gpu_memory_fraction': 0.66, #gpu使用率
'model': {'label': 1070, # 类别数
'batch_size': 32, # 批处理尺寸, 感觉原则上越大越好,尤其是样本不均衡的时候, batch_size设置影响比较大
'dropout': 0.5, # 随机失活, 概率
'decay_step': 100, # 学习率衰减step, 每N个step衰减一次
'decay_rate': 0.9, # 学习率衰减系数, 乘法
'epochs': 20, # 训练最大轮次
'patience': 3, # 早停,2-3就好
'lr': 1e-3, # 学习率, bert取5e-5, 其他取1e-3, 对训练会有比较大的影响, 如果准确率一直上不去,可以考虑调这个参数
'l2': 1e-9, # l2正则化
'activate_classify': 'softmax', # 最后一个layer, 即分类激活函数
'loss': 'categorical_crossentropy', # 损失函数, 可能有问题, 可以自己定义
'metrics': 'top_k_categorical_accuracy', # 1070个类, 太多了先用topk, 这里数据k设置为最大:33
# 'metrics': 'categorical_accuracy', # 保存更好模型的评价标准
'is_training': True, # 训练后者是测试模型
'model_path': path_model,
# 模型地址, loss降低则保存的依据, save_best_only=True, save_weights_only=True
'path_hyper_parameters': path_hyper_parameters, # 模型(包括embedding),超参数地址,
'path_fineture': path_fineture, # 保存embedding trainable地址, 例如字向量、词向量、bert向量等
},
'embedding': {'layer_indexes': [13], # bert取的层数
# 'corpus_path': '', # embedding预训练数据地址,不配则会默认取conf里边默认的地址
},
'data':{'train_data': path_byte_multi_news_train, # 训练数据
'val_data': path_byte_multi_news_valid, # 验证数据
},
}
# 删除先前存在的模型和embedding微调模型等
delete_file(path_model_dir)
time_start = time.time()
# graph初始化
graph = Graph(hyper_parameters)
print("graph init ok!")
ra_ed = graph.word_embedding
# 数据预处理
pt = PreprocessTextMulti()
x_train, y_train = pt.preprocess_label_ques_to_idx(hyper_parameters['embedding_type'],
hyper_parameters['data']['train_data'],
ra_ed, rate=rate, shuffle=True)
print('train data propress ok!')
x_val, y_val = pt.preprocess_label_ques_to_idx(hyper_parameters['embedding_type'],
hyper_parameters['data']['val_data'],
ra_ed, rate=rate, shuffle=True)
print("data propress ok!")
print(len(y_train))
# 训练
graph.fit(x_train, y_train, x_val, y_val)
print("耗时:" + str(time.time()-time_start))
if __name__=="__main__":
train(rate=0.01)
# 注意: 4G的080Ti的GPU、win10下batch_size=32,len_max=20, gpu<=0.87, 应该就可以bert-fineture了。
# 全量数据训练一轮(batch_size=32),就能达到80%准确率(验证集), 效果还是不错的
# win10下出现过错误,gpu、len_max、batch_size配小一点就好:ailed to allocate 3.56G (3822520832 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory