diff --git a/.travis.yml b/.travis.yml
new file mode 100644
index 0000000..5927dbe
--- /dev/null
+++ b/.travis.yml
@@ -0,0 +1,6 @@
+language: python
+python:
+  - "2.7"
+install:
+  - pip install -r requirements.txt
+script: pytest
diff --git a/LICENSE b/LICENSE
new file mode 100644
index 0000000..cd6a57c
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,21 @@
+The MIT License (MIT)
+
+Copyright (c) 2016 JackeyGao
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.
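diff --git a/README.md b/README.md
index a18d32a..7492183 100644
--- a/README.md
+++ b/README.md
@@ -1,41 +1,41 @@
-# chinese-poetry | [Crawling the Complete Song Ci and analyzing the data](http://jackeygao.io/words/crawl-ci.html)
+chinese-poetry
+==============
-![Classical Chinese poetry](https://raw.githubusercontent.com/jackeyGao/chinese-poetry/master/images/full-tang-poetry.png "Classical Chinese poetry")
+[![Build Status](https://travis-ci.org/jackeyGao/chinese-poetry.svg?branch=master)](https://travis-ci.org/jackeyGao/chinese-poetry)
+[![License](http://img.shields.io/badge/license-mit-blue.svg?style=flat-square)](https://github.com/jackeyGao/chinese-poetry/blob/master/LICENSE)
+[![tang poetry](https://img.shields.io/badge/tang%20poetry-5.5w-green.svg)]()
+[![song poetry](https://img.shields.io/badge/song%20poetry-22w-green.svg)]()
+[![song ci](https://img.shields.io/badge/song%20Ci-21k-green.svg)]()
-## Top words in Tang poetry
+The most complete database of classical Chinese literature, containing 55,000 Tang poems, 260,000 Song poems, and 21,000 Song ci. It covers nearly 14,000 poets of the Tang and Song dynasties and 1,500 ci writers of the Song period. The data was collected from the internet.
-![Top words in Tang poetry](https://raw.githubusercontent.com/jackeyGao/chinese-poetry/master/images/tang_text_topK.png "Top words in Tang poetry")
+**Why build this repository?** Classical poetry is a treasure of the Chinese nation and of the whole world, and we should pass it on. Printed anthologies exist, but most people do not own them, so in a sense these vast collections remain out of reach. Electronic copies are easy to share, which is why this open-source database was created. You can use the data for anything beneficial, and I may even be able to help.
-## Top Tang poets by number of works
+The collection of the classical poems was not documented: the dataset is huge and the target site imposes limits, so the crawl was interrupted frequently and took more than a week. The Complete Song Ci was added in 2017; see [Crawling the Complete Song Ci and analyzing the data](http://jackeygao.io/words/crawl-ci.html).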
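-![Top Tang poets by number of works](https://raw.githubusercontent.com/jackeyGao/chinese-poetry/master/images/tang_author_topK.png "Top Tang poets by number of works")
-## Top words in Song poetry
+## Data analysis
-![Top words in Song poetry](https://raw.githubusercontent.com/jackeyGao/chinese-poetry/master/images/song_text_topK.png "Top words in Song poetry")
+Some simple frequency analyses
-## Top Song poets by number of works
+|||
+| :---: | :---: |
+| ![Top words in Tang poetry](https://raw.githubusercontent.com/jackeyGao/chinese-poetry/master/images/tang_text_topK.png "Top words in Tang poetry") Top words in Tang poetry | ![Top Tang poets by number of works](https://raw.githubusercontent.com/jackeyGao/chinese-poetry/master/images/tang_author_topK.png "Top Tang poets by number of works") Top Tang poets by number of works |
+| ![Top words in Song poetry](https://raw.githubusercontent.com/jackeyGao/chinese-poetry/master/images/song_text_topK.png "Top words in Song poetry") Top words in Song poetry | ![Top Song poets by number of works](https://raw.githubusercontent.com/jackeyGao/chinese-poetry/master/images/song_author_topK.png "Top Song poets by number of works") Top Song poets by number of works |
+| ![Top words in Song ci](https://raw.githubusercontent.com/jackeyGao/chinese-poetry/master/images/ci_words_topK.png "Top words in Song ci") Top words in Song ci | ![Top Song ci writers by number of works](https://raw.githubusercontent.com/jackeyGao/chinese-poetry/master/images/ci_author_topK.png "Top Song ci writers by number of works") Top Song ci writers by number of works |
-![Top Song poets by number of works](https://raw.githubusercontent.com/jackeyGao/chinese-poetry/master/images/song_author_topK.png "Top Song poets by number of works")
-
-## Top Song ci writers by number of works
-
-![Top Song ci writers by number of works](https://raw.githubusercontent.com/jackeyGao/chinese-poetry/master/images/ci_author_topK.png "Top Song ci writers by number of works")
-
-## Top words in Song ci
-
-![Top words in Song ci](https://raw.githubusercontent.com/jackeyGao/chinese-poetry/master/images/ci_words_topK.png)
-
-## Favorite cipai (tune names) of the two Song dynasties
 ![Favorite cipai (tune names) of the two Song dynasties](https://raw.githubusercontent.com/jackeyGao/chinese-poetry/master/images/ci_rhythmic_topK.png)
+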
Favorite cipai (tune names) of the two Song dynasties
 ## Data distribution format
-### Tang poetry JSON
+The classical poems are distributed as grouped JSON files in Traditional Chinese; keeping the traditional characters preserves the original data as faithfully as possible. The Song ci are distributed as a sqlite database in Simplified Chinese (for no special reason: the source was in Simplified Chinese, so convert it yourself if you need Traditional).
-```json
+### Poetry JSON structure
+
+```text
 [
   {
     "strains": [
@@ -56,7 +56,7 @@ ... each JSON file holds 1,000 Tang poem records.
 ]
 ```
-### Author JSON
+### Author JSON structure
 ```json
 [
@@ -64,14 +64,24 @@
     "name": "太宗皇帝",
     "desc": "帝姓李氏,諱世民,神堯次子,聰明英武。貞觀之治,庶幾成康,功德兼隆。由漢以來,未之有也。而銳情經術, >初建秦邸,即開文學館,召名儒十八人爲學士。既即位,殿左置弘文館,悉引內學士,番宿更休。聽朝之間,則與討論典籍,雜以文詠。或日昃夜艾,未嘗少怠。詩筆草隸,卓越前古。至於天文秀發,沈麗高朗,有唐三百年風雅之盛,帝實有以啓之焉。在位二十四年,諡曰文。集四十卷。館閣書目,詩一卷,六十九首。今編詩一卷。"
   },
-  ... all poet information is in a single file
+  ...
 ]
 ```
+## Showcase
-## Contributing
+- [animalize](https://github.com/animalize) **/** [QuanTangshi](https://github.com/animalize/QuanTangshi) *offline Complete Tang Poems app for Android*
+- [justdark](https://github.com/justdark) **/** [pytorch-poetry-gen](https://github.com/justdark/pytorch-poetry-gen) *a char-RNN based on pytorch*
+- [Clover27](https://github.com/Clover27) **/** [ancient-Chinese-poem-generator](https://github.com/Clover27/ancient-Chinese-poem-generator) *Ancient-Chinese-Poem-Generator*
+- Additions welcome
+
+## Contributing & discussion
+
+Open an issue to help improve this database; you can also email me at gaojunqi@outlook.com
+
+
-At present only the Complete Tang Poems are included; poems from after the Tang have not been collected. Classical poetry is a treasure of Chinese culture, so PRs that add reliable data sources are welcome.
 ## License
-[MIT](https://zh.wikipedia.org/wiki/MIT%E8%A8%B1%E5%8F%AF%E8%AD%89) License.
+
+[MIT](https://github.com/jackeyGao/chinese-poetry/blob/master/LICENSE) License.
diff --git a/images/WechatIMG1.jpeg b/images/WechatIMG1.jpeg
new file mode 100644
index 0000000..880c3ff
Binary files /dev/null and b/images/WechatIMG1.jpeg differ
diff --git a/json/poet.tang.49000.json b/json/poet.tang.49000.json
index 146092a..2ca8e41 100644
--- a/json/poet.tang.49000.json
+++ b/json/poet.tang.49000.json
@@ -15199,4 +15199,4 @@
     ],
     "title": "句"
   }
-]
\ No newline at end of file
+]
diff --git a/requirements.txt b/requirements.txt
new file mode 100644
index 0000000..7565aae
--- /dev/null
+++ b/requirements.txt
@@ -0,0 +1 @@
+pytest==3.1.0
diff --git a/test_poetry.py b/test_poetry.py
new file mode 100644
index 0000000..27454ec
--- /dev/null
+++ b/test_poetry.py
@@ -0,0 +1,34 @@
+# -*- coding: utf-8 -*-
+import io, json, os, sqlite3
+
+
+def check_json(f):
+    # Parse one file under ./json; fail with the file name if it is not valid JSON.
+    filepath = os.path.join('./json', f)
+    with io.open(filepath, encoding='utf-8') as file:
+        try:
+            json.loads(file.read())
+        except ValueError:
+            assert False, u"Validation of (%s) failed" % f
+    return True
+
+
+def test_json():
+    """
+    Check that every poetry JSON file is valid.
+    """
+    # Explicit loop: on Python 3, map() is lazy and would never run the checks.
+    for f in os.listdir('./json'):
+        check_json(f)
+
+
+def test_sqlite():
+    """
+    Check that the ci database file is valid.
+    """
+    conn = sqlite3.connect('./ci/ci.db')
+    c = conn.cursor()
+    c.execute("SELECT name FROM sqlite_master WHERE type='table'")
+    tables = c.fetchall()
+    conn.close()
+    assert len(tables) == 2, u"Unexpected sqlite file contents"