Merge pull request #2 from jackeyGao/master

merge
animalize
2017-07-14 11:07:33 +08:00
committed by GitHub
7 changed files with 100 additions and 28 deletions

6
.travis.yml Normal file

@ -0,0 +1,6 @@
language: python
python:
- "2.7"
install:
- pip install -r requirements.txt
script: pytest

21
LICENSE Normal file

@ -0,0 +1,21 @@
The MIT License (MIT)
Copyright (c) 2016 JackeyGao
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.


@ -1,41 +1,41 @@
# chinese-poetry | [Crawling the Complete Song Ci and analyzing the data](http://jackeygao.io/words/crawl-ci.html)
chinese-poetry
==============
![Chinese classical poetry](https://raw.githubusercontent.com/jackeyGao/chinese-poetry/master/images/full-tang-poetry.png "Chinese classical poetry")
[![Build Status](https://travis-ci.org/jackeyGao/chinese-poetry.svg?branch=master)](https://travis-ci.org/jackeyGao/chinese-poetry)
[![License](http://img.shields.io/badge/license-mit-blue.svg?style=flat-square)](https://github.com/jackeyGao/chinese-poetry/blob/master/LICENSE)
[![tang poetry](https://img.shields.io/badge/tang%20poetry-5.5w-green.svg)]()
[![song poetry](https://img.shields.io/badge/song%20poetry-22w-green.svg)]()
[![song ci](https://img.shields.io/badge/song%20Ci-21k-green.svg)]()
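## High-frequency words in Tang poetry
The most complete database of classical Chinese literature, containing 55,000 Tang poems, 260,000 Song poems, and 21,000 Song ci. It covers nearly 14,000 poets from the Tang and Song dynasties and 1.5K ci writers from the Northern and Southern Song. The data comes from the Internet.
![High-frequency words in Tang poetry](https://raw.githubusercontent.com/jackeyGao/chinese-poetry/master/images/tang_text_topK.png "High-frequency words in Tang poetry")
**Why build this repository?** Classical poetry is a treasure of the Chinese nation and of the whole world, and we should pass it on. Printed anthologies exist, but most people do not own them, so in a sense these vast collections are out of reach. Electronic copies are easy to duplicate, which is why this open-source database was created. You may use the data for anything beneficial, and I can even help you with it.
## Tang poetry authors by number of works
The collection of the classical poems was not documented. Because the data set is large and the target site imposed limits, the crawl was interrupted frequently and took more than a week. The Complete Song Ci was added in 2017; see [Crawling the Complete Song Ci and analyzing the data](http://jackeygao.io/words/crawl-ci.html).
![Tang poetry authors by number of works](https://raw.githubusercontent.com/jackeyGao/chinese-poetry/master/images/tang_author_topK.png "Tang poetry authors by number of works")
## High-frequency words in Song poetry
## Data analysis
![High-frequency words in Song poetry](https://raw.githubusercontent.com/jackeyGao/chinese-poetry/master/images/song_text_topK.png "High-frequency words in Song poetry")
Some simple frequency analyses
## Song poetry authors by number of works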
|||
| :---: | :---: |
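| ![High-frequency words in Tang poetry](https://raw.githubusercontent.com/jackeyGao/chinese-poetry/master/images/tang_text_topK.png "High-frequency words in Tang poetry") High-frequency words in Tang poetry | ![Tang poetry authors by number of works](https://raw.githubusercontent.com/jackeyGao/chinese-poetry/master/images/tang_author_topK.png "Tang poetry authors by number of works") Tang poetry authors by number of works |
| ![High-frequency words in Song poetry](https://raw.githubusercontent.com/jackeyGao/chinese-poetry/master/images/song_text_topK.png "High-frequency words in Song poetry") High-frequency words in Song poetry | ![Song poetry authors by number of works](https://raw.githubusercontent.com/jackeyGao/chinese-poetry/master/images/song_author_topK.png "Song poetry authors by number of works") Song poetry authors by number of works |
| ![High-frequency words in Song ci](https://raw.githubusercontent.com/jackeyGao/chinese-poetry/master/images/ci_words_topK.png "High-frequency words in Song ci") High-frequency words in Song ci | ![Song ci authors by number of works](https://raw.githubusercontent.com/jackeyGao/chinese-poetry/master/images/ci_author_topK.png "Song ci authors by number of works") Song ci authors by number of works |
![Song poetry authors by number of works](https://raw.githubusercontent.com/jackeyGao/chinese-poetry/master/images/song_author_topK.png "Song poetry authors by number of works")
## Song ci authors by number of works
![Song ci authors by number of works](https://raw.githubusercontent.com/jackeyGao/chinese-poetry/master/images/ci_author_topK.png "Song ci authors by number of works")
## High-frequency words in Song ci
![High-frequency words in Song ci](https://raw.githubusercontent.com/jackeyGao/chinese-poetry/master/images/ci_words_topK.png)
## Favorite cipai (tune names) of the Song dynasty
![Favorite cipai (tune names) of the Song dynasty](https://raw.githubusercontent.com/jackeyGao/chinese-poetry/master/images/ci_rhythmic_topK.png)
<center>Favorite cipai (tune names) of the Song dynasty</center>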
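## Data distribution format
### Tang poetry JSON
Classical poem data is distributed as grouped JSON files in Traditional Chinese; keeping the traditional characters preserves the original data to a greater extent. Song ci data is distributed as a sqlite database in Simplified Chinese (for no special reason: the source was in Simplified Chinese; convert it yourself if you need Traditional).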
```json
### Classical poem JSON structure
```text
[
{
"strains": [
@ -56,7 +56,7 @@
... each JSON file holds 1000 Tang poem records.
]
```
### Author JSON
### Author JSON structure
```json
[
@ -64,14 +64,24 @@
"name": "太宗皇帝",
"desc": "帝姓李氏,諱世民,神堯次子,聰明英武。貞觀之治,庶幾成康,功德兼隆。由漢以來,未之有也。而銳情經術, >初建秦邸,即開文學館,召名儒十八人爲學士。既即位,殿左置弘文館,悉引內學士,番宿更休。聽朝之間,則與討論典籍,雜以文詠。或日昃夜艾,未嘗少怠。詩筆草隸,卓越前古。至於天文秀發,沈麗高朗,有唐三百年風雅之盛,帝實有以啓之焉。在位二十四年,諡曰文。集四十卷。館閣書目,詩一卷,六十九首。今編詩一卷。"
},
... all poet information is in a single file
...
]
```
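As a rough illustration of the layout described above (grouped JSON files, each holding an array of records), the following sketch loads every JSON file in a directory into one list and indexes author records by their `name` field. The helper names `load_records` and `authors_by_name` are hypothetical, and the directory layout is an assumption; only the `name` and `desc` keys come from the sample shown here.

```python
import io
import json
import os


def load_records(directory):
    """Load every *.json file in `directory` and return one combined list."""
    records = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".json"):
            continue
        path = os.path.join(directory, name)
        # io.open with an explicit encoding behaves the same on Python 2.7 and 3.
        with io.open(path, encoding="utf-8") as fh:
            data = json.load(fh)
        if not isinstance(data, list):
            raise ValueError("%s does not contain a JSON array" % path)
        records.extend(data)
    return records


def authors_by_name(records):
    """Index author records by their "name" field (hypothetical helper)."""
    return {rec["name"]: rec for rec in records if "name" in rec}
```

Reading with an explicit UTF-8 encoding matters here because the records are in Traditional Chinese and the platform default encoding may differ.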
## Showcase
## Contributing
- [animalize](https://github.com/animalize) **/** [QuanTangshi](https://github.com/animalize/QuanTangshi) *Offline Complete Tang Poems for Android*
- [justdark](https://github.com/justdark) **/** [pytorch-poetry-gen](https://github.com/justdark/pytorch-poetry-gen) *a char-RNN based on pytorch*
- [Clover27](https://github.com/Clover27) **/** [ancient-Chinese-poem-generator](https://github.com/Clover27/ancient-Chinese-poem-generator) *Ancient-Chinese-Poem-Generator*
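- Additions welcome
## Contributing & discussion
Open an issue to help improve this database; you can also reach me by email at gaojunqi@outlook.com
<img src="https://raw.githubusercontent.com/jackeyGao/chinese-poetry/master/images/WechatIMG1.jpeg" width="150" height="200" />
At present only the Complete Tang Poems are recorded; some poems from after the Tang have not been collected. Classical poetry is a treasure of Chinese culture, so PRs that add reliable data sources are welcome.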
## License
[MIT](https://zh.wikipedia.org/wiki/MIT%E8%A8%B1%E5%8F%AF%E8%AD%89) license.
[MIT](https://github.com/jackeyGao/chinese-poetry/blob/master/LICENSE) license.

BIN
images/WechatIMG1.jpeg Normal file

Binary file not shown. Size: 25 KiB


@ -15199,4 +15199,4 @@
],
"title": "句"
}
]
]

1
requirements.txt Normal file

@ -0,0 +1 @@
pytest==3.1.0

34
test_poetry.py Normal file

@ -0,0 +1,34 @@
# -*- coding: utf-8 -*-
import os, json, sqlite3


def check_json(f):
    filepath = os.path.join('./json', f)
    with open(filepath) as file:
        try:
            _ = json.loads(file.read())
            return True
        except ValueError:
            assert False, u"Validation of (%s) failed" % f


def test_json():
    """
    Test that the classical poem JSON files are valid
    """
    # Explicit loop: map() is lazy on Python 3 and would never run the checks
    for f in os.listdir('./json'):
        check_json(f)


def test_sqlite():
    """
    Test that the ci database file is valid
    """
    conn = sqlite3.connect('./ci/ci.db')
    c = conn.cursor()
    c.execute("SELECT name FROM sqlite_master WHERE type='table'")
    tables = c.fetchall()
    assert len(tables) == 2, u"Sqlite file is abnormal"
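
The table-count check above queries `sqlite_master`. A minimal sketch of the same idea as a reusable helper is shown below; `list_tables` is a hypothetical name, not part of this commit, and the actual table names inside `ci.db` are not stated in the diff.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in a SQLite database, sorted."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        ).fetchall()
    finally:
        # Always release the file handle, even if the query fails.
        conn.close()
    return [row[0] for row in rows]
```

Note that `sqlite3.connect` creates an empty database when the path does not exist, so a wrong path yields an empty list rather than an error; comparing against expected table names would catch that case more clearly than a bare count.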