6
.travis.yml
Normal file
@ -0,0 +1,6 @@
language: python
python:
  - "2.7"
install:
  - pip install -r requirements.txt
script: pytest
21
LICENSE
Normal file
@ -0,0 +1,21 @@
The MIT License (MIT)

Copyright (c) 2016 JackeyGao

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
64
README.md
@ -1,41 +1,41 @@
# chinese-poetry | [Crawling the Complete Song Ci and analyzing the data](http://jackeygao.io/words/crawl-ci.html)
chinese-poetry
==============

[Build Status](https://travis-ci.org/jackeyGao/chinese-poetry)
[License](https://github.com/jackeyGao/chinese-poetry/blob/master/LICENSE)

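## High-frequency words in Tang poems
The most comprehensive database of classical Chinese literature, containing 55,000 Tang poems, 260,000 Song poems, and 21,000 Song ci. It covers nearly 14,000 poets from the Tang and Song dynasties and 1,500 ci writers from the Song period. The data was collected from the internet.

**Why build this repository?** Classical poetry is a treasure of the Chinese nation and of the whole world, and it deserves to be passed on. Printed anthologies exist, but most people do not own them; in a sense, these vast collections remain out of reach. Electronic copies are easy to duplicate, which is why this open-source database was created. You can use the data for anything beneficial, and I may even be able to help.

## Tang poets ranked by number of works
The collection of the classical poems was not documented. Because the dataset is large and the target site imposed limits, the crawl was interrupted often and took more than a week. The Complete Song Ci (全宋词) was added in 2017; see [Crawling the Complete Song Ci and analyzing the data](http://jackeygao.io/words/crawl-ci.html).

## High-frequency words in Song poems
## Data analysis

Some simple frequency analyses
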
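## Song poets ranked by number of works
| | |
| :---: | :---: |
| High-frequency words in Tang poems | Tang poets ranked by number of works |
| High-frequency words in Song poems | Song poets ranked by number of works |
| High-frequency words in Song ci | Song ci writers ranked by number of works |

## Song ci writers ranked by number of works

## High-frequency words in Song ci

## Favorite cipai (tune names) of the Song dynasty

<center>Favorite cipai (tune names) of the Song dynasty</center>
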
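## Data distribution format

### Tang poem JSON
The classical poems are distributed as grouped JSON files in Traditional Chinese; keeping the traditional characters preserves the source data as faithfully as possible. The Song ci are distributed as a SQLite database in Simplified Chinese (for no particular reason: the source was in Simplified Chinese, so convert it yourself if you need Traditional).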
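
As a minimal sketch of inspecting the SQLite distribution, the snippet below lists the tables in a database. The `list_tables` helper is hypothetical, and the demo builds an in-memory database with made-up table and column names because the real schema is not described here. To inspect the real file, pass a connection opened on `./ci/ci.db` instead.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in an open SQLite connection."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Demo on an in-memory database; table and column names are assumptions.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE ci (value TEXT)")
demo.execute("CREATE TABLE ciauthor (value TEXT)")
print(list_tables(demo))
demo.close()
```

The query against `sqlite_master` is the same one the repository's test uses to count tables.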

```json
### Classical poem JSON structure

```text
[
  {
    "strains": [
@ -56,7 +56,7 @@
  ... each JSON file holds 1000 Tang poem records.
]
```
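
A minimal sketch of reading one of these grouped files, assuming each file is a UTF-8 JSON array of poem objects as outlined above. Only the `strains` and `title` keys appear in this diff; the `load_poems` helper and the sample record are illustrative, not part of the repository.

```python
import io
import json
import os
import tempfile


def load_poems(path):
    """Load one grouped JSON file and return its list of poem records."""
    with io.open(path, encoding="utf-8") as fh:
        poems = json.load(fh)
    if not isinstance(poems, list):
        raise ValueError("expected a JSON array of poems in %s" % path)
    return poems


# Demo: write a one-record sample file, then read it back.
# The record's values are made up; only the key names come from the diff.
sample = [{"title": "sample title", "strains": ["sample strain line"]}]
handle, sample_path = tempfile.mkstemp(suffix=".json")
with os.fdopen(handle, "w") as fh:
    json.dump(sample, fh)
poems = load_poems(sample_path)
print(len(poems), poems[0]["title"])
os.remove(sample_path)
```

Checking that the top-level value is a list catches a truncated or mis-saved file early, before any code indexes into it.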
### Author JSON
### Author JSON structure

```json
[
@ -64,14 +64,24 @@
    "name": "太宗皇帝",
    "desc": "帝姓李氏,諱世民,神堯次子,聰明英武。貞觀之治,庶幾成康,功德兼隆。由漢以來,未之有也。而銳情經術, >初建秦邸,即開文學館,召名儒十八人爲學士。既即位,殿左置弘文館,悉引內學士,番宿更休。聽朝之間,則與討論典籍,雜以文詠。或日昃夜艾,未嘗少怠。詩筆草隸,卓越前古。至於天文秀發,沈麗高朗,有唐三百年風雅之盛,帝實有以啓之焉。在位二十四年,諡曰文。集四十卷。館閣書目,詩一卷,六十九首。今編詩一卷。"
  },
  ... all poet information is in a single file
  ...
]
```
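
Because all authors live in one array, a lookup by name is easiest after building an index. The sketch below assumes each entry carries the `name` and `desc` keys shown above; `index_authors` is a hypothetical helper, and the sample description is shortened for illustration.

```python
import json


def index_authors(authors):
    """Map each author's name to its description (empty if missing)."""
    return {entry["name"]: entry.get("desc", "") for entry in authors}


# A one-record sample in the shape shown above; the description is shortened.
sample_authors = json.loads(u'[{"name": "太宗皇帝", "desc": "帝姓李氏,諱世民"}]')
by_name = index_authors(sample_authors)
print(by_name[u"太宗皇帝"])
```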

## Showcase

## Contributing
- [animalize](https://github.com/animalize) **/** [QuanTangshi](https://github.com/animalize/QuanTangshi) *Offline Complete Tang Poems for Android*
- [justdark](https://github.com/justdark) **/** [pytorch-poetry-gen](https://github.com/justdark/pytorch-poetry-gen) *a char-RNN based on pytorch*
- [Clover27](https://github.com/Clover27) **/** [ancient-Chinese-poem-generator](https://github.com/Clover27/ancient-Chinese-poem-generator) *Ancient-Chinese-Poem-Generator*
- Additions welcome

## Contributing & discussion

Open an issue to help improve this database, or contact me by email at gaojunqi@outlook.com

<img src="https://raw.githubusercontent.com/jackeyGao/chinese-poetry/master/images/WechatIMG1.jpeg" width="150" height="200" />

Currently only the Complete Tang Poems are included; poems written after the Tang have not been collected. Classical poetry is a treasure of Chinese culture, so if you have a reliable data source, a PR adding it is welcome.

## License

[MIT](https://zh.wikipedia.org/wiki/MIT%E8%A8%B1%E5%8F%AF%E8%AD%89) license.

[MIT](https://github.com/jackeyGao/chinese-poetry/blob/master/LICENSE) license.

BIN
images/WechatIMG1.jpeg
Normal file
Binary file not shown.
Size: 25 KiB
@ -15199,4 +15199,4 @@
    ],
    "title": "句"
  }
]
]
1
requirements.txt
Normal file
@ -0,0 +1 @@
pytest==3.1.0
34
test_poetry.py
Normal file
@ -0,0 +1,34 @@
# -*- coding: utf-8 -*-
import json
import os
import sqlite3


def check_json(f):
    """Assert that ./json/<f> parses as JSON."""
    filepath = os.path.join('./json', f)
    with open(filepath) as file:
        try:
            json.loads(file.read())
        except ValueError:
            assert False, u"Validation of (%s) failed" % f
    return True


def test_json():
    """Check that every classical-poem JSON file is valid."""
    # Loop explicitly: on Python 3, map() is lazy and would skip the checks.
    for f in os.listdir('./json'):
        check_json(f)


def test_sqlite():
    """Check that the ci database file is valid."""
    conn = sqlite3.connect('./ci/ci.db')
    c = conn.cursor()
    c.execute("SELECT name FROM sqlite_master WHERE type='table'")
    tables = c.fetchall()
    conn.close()
    assert len(tables) == 2, u"Unexpected SQLite file contents"
||||