Reshaping, Reorganizing, and Aggregating

Reshaping, Reorganizing, and Aggregating

在半年之后,Introducing the Series and DataFrame 终于迎来了续集。本文是 Mastering pandas for finance 一书第三章: Reshaping, Reorganizing, and Aggregating 的学习笔记。

加载历史数据

原书上加载历史数据的方法已经不在可用,推荐使用 tushare 提供的数据。注册即可获得 API 密钥。

Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
import tushare as ts
import datetime as dt
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pro = ts.pro_api('你从Tushare申请的密钥')

# 加载历史数据
pingan = pro.daily(
ts_code = '000001.SZ',
start_date = "20180101",
end_date = "20181231"
)
wanke = pro.daily(
ts_code = '000002.SZ',
start_date = "20180101",
end_date = "20181231"
)

# # 导出为csv文件
# pingan.to_csv("pingan.csv")
# wanke.to_csv("wanke.csv")
#
# # 读入csv文件
# pingan = pd.read_csv('pingan.csv', index_col = 0, parse_dates = True)
# wanke = pd.read_csv('wanke.csv', index_col = 0, parse_dates = True)

及联多个数据框对象

Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
wanke.trade_date = pd.to_datetime(wanke.trade_date)
wanke.index = wanke.trade_date
pingan.trade_date = pd.to_datetime(pingan.trade_date)
pingan.index = pingan.trade_date
# 选择12月份的收盘数据
pingan01 = pingan['2018-12'][['close']]
# 选择11月份的收盘数据
pingan02 = pingan['2018-11'][['close']]
# 纵向连接两个序列的前三行
pd.concat([pingan01.head(3), pingan02.head(3)])
# close
# trade_date
# 2018-12-28 9.38
# 2018-12-27 9.28
# 2018-12-26 9.30
# 2018-11-30 10.36
# 2018-11-29 10.20
# 2018-11-28 10.28
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
# 索引也可以有重复值
wanke01 = wanke['2018-12'][['close']]
wanke01[:3]
# close
# trade_date
# 2018-12-28 23.82
# 2018-12-27 23.44
# 2018-12-26 23.80
pingan01[:3]
# close
# trade_date
# 2018-12-28 9.38
# 2018-12-27 9.28
# 2018-12-26 9.30
closes = pd.concat([wanke01[:3], pingan01[:3]],
keys = ["wanke", "pingan"])
closes
# close
# trade_date
# wanke 2018-12-28 23.82
# 2018-12-27 23.44
# 2018-12-26 23.80
# pingan 2018-12-28 9.38
# 2018-12-27 9.28
# 2018-12-26 9.30
closes.loc['pingan'][:3]
# close
# trade_date
# 2018-12-28 9.38
# 2018-12-27 9.28
# 2018-12-26 9.30

纵向合并含多个变量的数据框

Python
1
2
3
4
5
6
7
8
9
10
11
12
13
pinganav = pingan[['close', 'vol']]
wankeav = wanke[['close', 'vol']]
wankea = wanke[['close']]
pd.concat([pinganav[:3], wankea[:3]], sort = False,
keys = ["pingan", "wanke"])
# close vol
# trade_date
# pingan 2018-12-28 9.38 576604.00
# 2018-12-27 9.28 624593.27
# 2018-12-26 9.30 421140.60
# wanke 2018-12-28 23.82 NaN
# 2018-12-27 23.44 NaN
# 2018-12-26 23.80 NaN

内连接:只连接共有的部分

Python
1
2
3
4
5
6
7
8
9
pd.concat([pinganav[:3], wankea[:3]], join = 'inner')
# close
# trade_date
# 2018-12-28 9.38
# 2018-12-27 9.28
# 2018-12-26 9.30
# 2018-12-28 23.82
# 2018-12-27 23.44
# 2018-12-26 23.80

更改连接轴:横向连接

Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
pingana = pingan[['close']]
pd.concat([pingana[:3], wankea[:3]], axis = 1,
keys = ["pingan", "wanke"])
# pingan wanke
# close close
# trade_date
# 2018-12-28 9.38 23.82
# 2018-12-27 9.28 23.44
# 2018-12-26 9.30 23.80
pd.concat([pinganav[:3], wankeav[:3]], axis = 1,
keys = ["pingan", "wanke"])
# pingan wanke
# close vol close vol
# trade_date
# 2018-12-28 9.38 576604.00 23.82 322810.93
# 2018-12-27 9.28 624593.27 23.44 352501.78
# 2018-12-26 9.30 421140.60 23.80 221987.26
pd.concat([pinganav[:3], wankea[:3]], axis = 1, join = 'inner',
keys = ['pingan', 'wanke'])
# pingan wanke
# close vol close
# trade_date
# 2018-12-28 9.38 576604.00 23.82
# 2018-12-27 9.28 624593.27 23.44
# 2018-12-26 9.30 421140.60 23.80
Python
1
2
3
4
5
6
7
8
9
# 忽略索引的连接(连接之后重新建立索引)
pd.concat([pingana[:3], wankea[:3]], ignore_index = True)
# close
# 0 9.38
# 1 9.28
# 2 9.30
# 3 23.82
# 4 23.44
# 5 23.80

Merging DataFrame Objects

Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
pingana = pingana.reset_index()
pingana[:3]
# trade_date close
# 0 2018-12-28 9.38
# 1 2018-12-27 9.28
# 2 2018-12-26 9.30

pinganv = pingan[['vol']].reset_index()
pinganv[:3]
# trade_date vol
# 0 2018-12-28 576604.00
# 1 2018-12-27 624593.27
# 2 2018-12-26 421140.60

pd.merge(pingana, pinganv)[:3]
# trade_date close vol
# 0 2018-12-28 9.38 576604.00
# 1 2018-12-27 9.28 624593.27
# 2 2018-12-26 9.30 421140.60

# 合并方式:left/right/outer/inner
pingana05 = pingana[0:5]
pingana05
# trade_date close
# 0 2018-12-28 9.38
# 1 2018-12-27 9.28
# 2 2018-12-26 9.30
# 3 2018-12-25 9.34
# 4 2018-12-24 9.42

pingana24 = pingana[2:4]
pingana24
# trade_date close
# 2 2018-12-26 9.30
# 3 2018-12-25 9.34

# 默认使用内连接
pd.merge(pingana05, pingana24)
# trade_date close
# 0 2018-12-26 9.30
# 1 2018-12-25 9.34

# 外连接
pd.merge(pingana05, pingana24, how = "outer")
# trade_date close
# 0 2018-12-28 9.38
# 1 2018-12-27 9.28
# 2 2018-12-26 9.30
# 3 2018-12-25 9.34
# 4 2018-12-24 9.42

# 左连接
pd.merge(pingana05, pingana24, how = "left")
# trade_date close
# 0 2018-12-28 9.38
# 1 2018-12-27 9.28
# 2 2018-12-26 9.30
# 3 2018-12-25 9.34
# 4 2018-12-24 9.42

pd.merge(pingana05, pingana24, how = "right")
# trade_date close
# 0 2018-12-26 9.30
# 1 2018-12-25 9.34

数据透视表

Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
# 首先把股票的名称插入第一列:
pingan.insert(0, 'symbol', '平安银行')
wanke.insert(0, 'symbol', '万科A')
# 合并
combined = pd.concat([pingan, wanke]).sort_index()
s4p = combined[['symbol', 'close', 'open', 'high', 'low', 'vol']].reset_index()
closes = s4p.pivot(
index = 'trade_date',
columns = 'symbol',
values = 'close'
)
closes
# symbol 万科A 平安银行
# trade_date
# 2018-01-02 32.56 13.70
# 2018-01-03 32.33 13.33
# ... ...
# 2018-12-27 23.44 9.28
# 2018-12-28 23.82 9.38
# [243 rows x 2 columns]

Stacking and unstacking

Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
stackedCloses = closes.stack()
stackedCloses
# trade_date symbol
# 2018-01-02 万科A 32.56
# 平安银行 13.70
# 2018-01-03 万科A 32.33
# 平安银行 13.33
# ...
# 2018-12-27 万科A 23.44
# 平安银行 9.28
# 2018-12-28 万科A 23.82
# 平安银行 9.38
# Length: 486, dtype: float64

stackedCloses.loc['2018-01-02', '平安银行']
# 13.7

stackedCloses.loc['2018-01-02']
# trade_date symbol
# 2018-01-02 万科A 32.56
# 平安银行 13.70
# dtype: float64

stackedCloses.loc[:, '平安银行']
# trade_date
# 2018-01-02 13.70
# ...
# 2018-12-28 9.38
# Length: 243, dtype: float64

# 选择某只股票的:
unstackedCloses = stackedCloses.unstack()
unstackedCloses[:3]
# symbol 万科A 平安银行
# trade_date
# 2018-01-02 32.56 13.70
# 2018-01-03 32.33 13.33
# 2018-01-04 33.12 13.25

Melting

Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
melted = pd.melt(s4p, id_vars = ['trade_date', 'symbol'])
melted[:5]
# trade_date symbol variable value
# 0 2018-01-02 平安银行 close 13.70
# 1 2018-01-02 万科A close 32.56
# 2 2018-01-03 万科A close 32.33
# 3 2018-01-03 平安银行 close 13.33
# 4 2018-01-04 平安银行 close 13.25

melted[(melted.symbol == "平安银行") & (melted.trade_date == "2018-01-02")]
# trade_date symbol variable value
# 0 2018-01-02 平安银行 close 13.70
# 486 2018-01-02 平安银行 open 13.35
# 972 2018-01-02 平安银行 high 13.93
# 1458 2018-01-02 平安银行 low 13.32
# 1944 2018-01-02 平安银行 vol 2081592.55

Grouping and aggregating

Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
# Spliting
s4g = combined[['symbol', 'close']].reset_index()
s4g[:3]
# trade_date symbol close
# 0 2018-01-02 平安银行 13.70
# 1 2018-01-02 万科A 32.56
# 2 2018-01-03 万科A 32.33
s4g.insert(1, 'Year', pd.DatetimeIndex(s4g['trade_date']).year)
s4g.insert(2, 'Month', pd.DatetimeIndex(s4g['trade_date']).month)
s4g[:5]
# trade_date Year Month symbol close
# 0 2018-01-02 2018 1 平安银行 13.70
# 1 2018-01-02 2018 1 万科A 32.56
# 2 2018-01-03 2018 1 万科A 32.33
# 3 2018-01-03 2018 1 平安银行 13.33
# 4 2018-01-04 2018 1 平安银行 13.25

grouped = s4g.groupby('symbol')
type(grouped.groups)
# dict

grouped.groups
# {'万科A': Int64Index([ 1, 2, 5, 6, 8, 10, 13, 14, 16, 18,
# ...
# 466, 468, 471, 473, 474, 476, 479, 481, 483, 484],
# dtype='int64', length=243),
# '平安银行': Int64Index([ 0, 3, 4, 7, 9, 11, 12, 15, 17, 19,
# ...
# 467, 469, 470, 472, 475, 477, 478, 480, 482, 485],
# dtype='int64', length=243)}
len(grouped), grouped.ngroups
# (2, 2)

编写一个预览分组对象的函数:

Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
def print_groups(groupobject):
for name, group in groupobject:
print(name)
print(group.head())

print_groups(grouped)
# 万科A
# trade_date Year Month symbol close
# 1 2018-01-02 2018 1 万科A 32.56
# 2 2018-01-03 2018 1 万科A 32.33
# 5 2018-01-04 2018 1 万科A 33.12
# 6 2018-01-05 2018 1 万科A 34.76
# 8 2018-01-08 2018 1 万科A 35.99
# 平安银行
# trade_date Year Month symbol close
# 0 2018-01-02 2018 1 平安银行 13.70
# 3 2018-01-03 2018 1 平安银行 13.33
# 4 2018-01-04 2018 1 平安银行 13.25
# 7 2018-01-05 2018 1 平安银行 13.30
# 9 2018-01-08 2018 1 平安银行 12.96

Python
1
2
3
4
5
6
# 查看每组的大小
grouped.size()
# symbol
# 万科A 243
# 平安银行 243
# dtype: int64
Python
1
2
3
4
5
6
7
# 获取某个组的数据
grouped.get_group('万科A')
# trade_date Year Month symbol close
# 1 2018-01-02 2018 1 万科A 32.56
# .. ... ... ... ... ...
# 484 2018-12-28 2018 12 万科A 23.82
# [243 rows x 5 columns]
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
# 使用多个分组变量
mcg = s4g.groupby(['symbol', 'Year', 'Month'])
print_groups(mcg)
# ('万科A', 2018, 1)
# trade_date Year Month symbol close
# 1 2018-01-02 2018 1 万科A 32.56
# 2 2018-01-03 2018 1 万科A 32.33
# 5 2018-01-04 2018 1 万科A 33.12
# 6 2018-01-05 2018 1 万科A 34.76
# 8 2018-01-08 2018 1 万科A 35.99
# ('万科A', 2018, 2)
# trade_date Year Month symbol close
# 45 2018-02-01 2018 2 万科A 37.50
# 47 2018-02-02 2018 2 万科A 37.38
# 49 2018-02-05 2018 2 万科A 36.99
# 50 2018-02-06 2018 2 万科A 35.82
# 52 2018-02-07 2018 2 万科A 32.98
# …… …… ……
# ('平安银行', 2018, 12)
# trade_date Year Month symbol close
# 447 2018-12-03 2018 12 平安银行 10.59
# 449 2018-12-04 2018 12 平安银行 10.59
# 450 2018-12-05 2018 12 平安银行 10.45
# 453 2018-12-06 2018 12 平安银行 10.25
# 454 2018-12-07 2018 12 平安银行 10.28
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
# 设定分组变量为索引
mi = s4g.set_index(['symbol', 'Year', 'Month'])
# 因为我们现在已经设定了三级索引,所以现在可以直接按照索引的级别进行分组,例如使用第一级索引(level = 0)也就是股票代码进行分组:
mig_l1 = mi.groupby(level = 0)
print_groups(mig_l1)
# 万科A
# trade_date close
# symbol Year Month
# 万科A 2018 1 2018-01-02 32.56
# 1 2018-01-03 32.33
# 1 2018-01-04 33.12
# 1 2018-01-05 34.76
# 1 2018-01-08 35.99
# 平安银行
# trade_date close
# symbol Year Month
# 平安银行 2018 1 2018-01-02 13.70
# 1 2018-01-03 13.33
# 1 2018-01-04 13.25
# 1 2018-01-05 13.30
# 1 2018-01-08 12.96
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
# 也可以直接使用索引的名字
mig_l2 = mi.groupby(level = ['symbol', 'Year', 'Month'])
print_groups(mig_l2)
# ('万科A', 2018, 1)
# trade_date close
# symbol Year Month
# 万科A 2018 1 2018-01-02 32.56
# 1 2018-01-03 32.33
# 1 2018-01-04 33.12
# 1 2018-01-05 34.76
# 1 2018-01-08 35.99
# ('万科A', 2018, 2)
# trade_date close
# symbol Year Month
# 万科A 2018 2 2018-02-01 37.50
# 2 2018-02-02 37.38
# 2 2018-02-05 36.99
# 2 2018-02-06 35.82
# 2 2018-02-07 32.98
# ('万科A', 2018, 3)
# trade_date close
# ······ ······
# symbol Year Month
# 平安银行 2018 12 2018-12-03 10.59
# 12 2018-12-04 10.59
# 12 2018-12-05 10.45
# 12 2018-12-06 10.25
# 12 2018-12-07 10.28

聚合

Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
mig_l2.agg(np.mean)
s4g.groupby(['symbol', 'Year', 'Month'], as_index = False).agg(np.mean)[:5]
s4g.groupby(['symbol', 'Year', 'Month'], as_index = False).agg([np.mean, np.std])
# close
# mean std
# symbol Year Month
# 万科A 2018 1 37.344545 2.777506
# 2 34.046000 2.003632
# 3 32.299545 1.010778
# 4 30.981667 1.550697
# 5 27.046364 0.949995
# 6 26.726000 1.247327
# 7 23.405909 0.509889
# 8 22.773478 0.971749
# 9 23.756842 0.806061
# 10 22.208889 1.072150
# 11 24.757727 0.779224
# 12 25.167000 1.037487
# 平安银行 2018 1 13.891364 0.562954
# 2 12.764667 0.942367
# 3 11.671818 0.444218
# 4 11.306111 0.379026
# 5 10.772273 0.284738
# 6 9.852500 0.431934
# 7 8.963636 0.308507
# 8 9.359130 0.545710
# 9 10.214737 0.374987
# 10 10.630000 0.412624
# 11 10.587273 0.250774
# 12 9.979500 0.459124
# Python

Comments

Your browser is out-of-date!

Update your browser to view this website correctly. Update my browser now

×