ytq520

Plotting and Visualization, Data Structure Basics, Unit Testing

Fundamentals

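Data Analysis 7: Plotting and Visualization
Information visualization (also called plotting) is one of the most important tasks in data analysis. It may be part of the exploratory process, for example helping to identify outliers, find needed data transformations, or generate ideas for models. In other cases, building an interactive visualization may be the end goal of the work.
matplotlib is a desktop plotting package (mainly 2D) for creating publication-quality figures.
matplotlib supports many GUI backends on all major operating systems and can export figures to common vector and raster formats: PDF, SVG, JPG, PNG, BMP, GIF, and so on. The exact set of formats depends on the backend and the libraries installed.
Getting started with the matplotlib API
The usual import convention for matplotlib is: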
In [11]: import matplotlib.pyplot as plt
Run %matplotlib notebook in Jupyter, or %matplotlib in IPython, to enable interactive plotting. A first simple plot:
In [12]: import numpy as np
In [13]: data = np.arange(10)
In [14]: data
Out[14]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [15]: plt.plot(data)
Figure and Subplot
Plots in matplotlib reside within a Figure object. You can create a new Figure with plt.figure:
In [16]: fig = plt.figure()
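You cannot draw on an empty Figure; you must first create one or more subplots with add_subplot:
# The figure is laid out as a 2×2 grid (up to 4 plots), and the first of the 4 subplots is selected (numbering starts at 1)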
In [17]: ax1 = fig.add_subplot(2, 2, 1)
When you call a plotting function such as plt.plot, matplotlib draws on the last subplot used (creating one if necessary):
fig = plt.figure()
ax1 = fig.add_subplot(2, 2, 1)
# Histogram
plt.hist(np.random.randn(100), bins=20, color='k', alpha=0.3)
ax2 = fig.add_subplot(2, 2, 2)
plt.scatter(np.arange(30), np.arange(30) + 3 *
np.random.randn(30))
ax3 = fig.add_subplot(2, 2, 3)
plt.plot([1.5, 3.5, -2, 1.6])
In [20]: plt.plot(np.random.randn(50).cumsum(), 'k--')
'k--' is a style option that tells matplotlib to draw a black dashed line.
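Relying on the "last subplot used" can be fragile once a figure holds several subplots. As a minimal sketch (the Agg backend, the fixed seed, and the variable names are assumptions made for this example), the same three plots can be drawn by calling methods on the Axes objects that add_subplot returns:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

fig = plt.figure()
ax1 = fig.add_subplot(2, 2, 1)
ax2 = fig.add_subplot(2, 2, 2)
ax3 = fig.add_subplot(2, 2, 3)

# Each call targets a specific subplot, regardless of which one was used last
ax1.hist(rng.standard_normal(100), bins=20, color="k", alpha=0.3)
ax2.scatter(np.arange(30), np.arange(30) + 3 * rng.standard_normal(30))
ax3.plot(rng.standard_normal(50).cumsum(), "k--")
```

Calling the Axes methods directly makes the target of each plot explicit, which tends to be easier to follow in longer scripts.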
plt.subplots creates a new Figure and returns a NumPy array containing the created subplot objects:
In [24]: fig, axes = plt.subplots(2, 3)
In [25]: axes
Out[25]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7fb626374048>,
 <matplotlib.axes._subplots.AxesSubplot object at 0x7fb62625db00>,
 <matplotlib.axes._subplots.AxesSubplot object at 0x7fb6262f6c88>],
 [<matplotlib.axes._subplots.AxesSubplot object at 0x7fb6261a36a0>,
 <matplotlib.axes._subplots.AxesSubplot object at 0x7fb626181860>,
 <matplotlib.axes._subplots.AxesSubplot object at 0x7fb6260fd4e0>]],
 dtype=object)
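Because the returned object is an ordinary 2D NumPy array, individual subplots can be picked out by row and column. A short sketch, assuming the Agg backend and made-up titles:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(2, 3)

# axes is a 2D array, so one subplot is selected with [row, column]
axes[0, 1].plot(np.arange(10), "k--")

# axes.flat visits all six subplots in row-major order
for k, ax in enumerate(axes.flat):
    ax.set_title(f"subplot {k}")
```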
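Adjusting the spacing around subplots
By default, matplotlib leaves a certain amount of padding around the outside of the subplots and some spacing between them. This spacing is specified relative to the height and width of the plot, so if you resize the plot (either programmatically or manually), the spacing adjusts automatically. You can change it with the Figure's subplots_adjust method.
wspace and hspace set the horizontal and vertical spacing between subplots, expressed as a fraction of the average subplot width and height. Here is a small example that shrinks the spacing all the way to 0: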
fig, axes = plt.subplots(2, 2, sharex=True, sharey=True)
for i in range(2):
    for j in range(2):
        # Draw one histogram in each of the four shared-axis subplots
        axes[i, j].hist(np.random.randn(500), bins=50, color='k', alpha=0.5)
plt.subplots_adjust(wspace=0, hspace=0)
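Colors, markers, and line styles
matplotlib's plot function accepts arrays of x and y coordinates and, optionally, a string abbreviation indicating color and line style.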
In [30]: from numpy.random import randn
In [31]: plt.plot(randn(30).cumsum(), 'ko--')
The same plot can be written more explicitly:
plt.plot(np.random.randn(30).cumsum(), color='k', linestyle='dashed',
marker='o')
In line plots, the points between actual data points are interpolated linearly by default. This can be changed with the drawstyle option:
In [33]: data = np.random.randn(30).cumsum()
In [34]: plt.plot(data, 'k--', label='Default')
Out[34]: [<matplotlib.lines.Line2D at 0x7fb624d86160>]
In [35]: plt.plot(data, 'k-', drawstyle='steps-post', label='steps-post')
Out[35]: [<matplotlib.lines.Line2D at 0x7fb624d869e8>]
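To see how the string abbreviation relates to the explicit keywords, the sketch below draws the same data three ways and reads the resulting properties back from the Line2D objects. The Agg backend, seed, and variable names are assumptions for the example:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

data = np.random.default_rng(1).standard_normal(30).cumsum()

fig, ax = plt.subplots()
# Format string: color 'k', marker 'o', line style '--'
(short,) = ax.plot(data, "ko--")
# The same styling spelled out with keyword arguments
(explicit,) = ax.plot(data, color="k", linestyle="dashed", marker="o")
# A step plot instead of linear interpolation between points
(steps,) = ax.plot(data, "k-", drawstyle="steps-post", label="steps-post")
ax.legend(loc="best")
```

Both styling forms end up with the same marker and line style on the line object, so the choice between them is mostly one of readability.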
Setting the title, axis labels, ticks, and tick labels
In [37]: fig = plt.figure()
In [38]: fig.add_subplot(1, 1, 1)
In [39]: plt.plot(np.random.randn(1000).cumsum())
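To change the x-axis ticks, the simplest approach is to use the Axes methods set_xticks and set_xticklabels. The former tells matplotlib where to place the ticks along the data range; by default these locations are also used as the tick labels. set_xticklabels lets you use any other values as labels. The pyplot function plt.xticks used here sets both the locations and the labels of the current subplot in one call: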
In [40]: plt.xticks([0, 250, 500, 750, 1000], ['one', 'two', 'three', 'four',
'five'], rotation=30, fontsize='small')
In [42]: plt.title('My first matplotlib plot')
Out[42]: <matplotlib.text.Text at 0x7fb624d055f8>
In [43]: plt.xlabel('Stages')
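For comparison, here is a sketch of the same decoration done with the Axes methods named above rather than the pyplot functions. The Agg backend and the seed are assumptions for the example:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
ax.plot(np.random.default_rng(2).standard_normal(1000).cumsum())

# Place the ticks first, then replace their labels
ax.set_xticks([0, 250, 500, 750, 1000])
ax.set_xticklabels(["one", "two", "three", "four", "five"],
                   rotation=30, fontsize="small")
ax.set_title("My first matplotlib plot")
ax.set_xlabel("Stages")
```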
Adding a legend
In [44]: from numpy.random import randn
In [45]: fig = plt.figure(); fig.add_subplot(1, 1, 1)
In [46]: plt.plot(randn(1000).cumsum(), 'k', label='one')
Out[46]: [<matplotlib.lines.Line2D at 0x7fb624bdf860>]
In [47]: plt.plot(randn(1000).cumsum(), 'k--', label='two')
Out[47]: [<matplotlib.lines.Line2D at 0x7fb624be90f0>]
In [48]: plt.plot(randn(1000).cumsum(), 'k.', label='three')
Out[48]: [<matplotlib.lines.Line2D at 0x7fb624be9160>]
# Call plt.legend() to create the legend automatically from the label arguments
plt.legend(loc='best')
Reading a file and displaying a chart
import pandas as pd
fig = plt.figure()
fig.add_subplot(1, 1, 1)
data = pd.read_csv('examples/spx.csv', index_col=0, parse_dates=True)
spx = data['SPX']
plt.plot(spx, 'k-')
Saving a plot to a file
plt.savefig('figpath.png', dpi=400, bbox_inches='tight')
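savefig does not have to write to a path on disk; it also accepts a file-like object. A minimal sketch, assuming the Agg backend and an in-memory buffer (the variable names are made up for the example):

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1.5, 3.5, -2, 1.6])

buffer = io.BytesIO()
# The format is given explicitly because a buffer has no file extension
fig.savefig(buffer, format="png", dpi=100, bbox_inches="tight")
png_bytes = buffer.getvalue()
```

This can be handy when the image is served over a network or embedded in a report instead of being stored as a file.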
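Plotting with pandas and seaborn
Seaborn simplifies creating many common visualization types.
Line plots
Series and DataFrame each have a plot method for making various plot types. By default, they produce line plots: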
In [60]: s = pd.Series(np.random.randn(10).cumsum(), index=np.arange(0, 100,
10))
In [61]: s.plot()
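The Series object's index is passed to matplotlib for plotting on the x-axis; you can disable this with use_index=False. The x-axis ticks and limits can be adjusted with the xticks and xlim options, and the y-axis with yticks and ylim.
The DataFrame plot method plots each column as a separate line on the same subplot and creates a legend automatically: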
In [62]: df = pd.DataFrame(np.random.randn(10, 4).cumsum(0),
 ....: columns=['A', 'B', 'C', 'D'],
 ....: index=np.arange(0, 100, 10))
In [63]: df.plot()
Bar plots
plot.bar() and plot.barh() make vertical and horizontal bar plots, respectively. In this case, the Series or DataFrame index is used as the x (bar) or y (barh) ticks:
In [64]: fig, axes = plt.subplots(2, 1)
In [65]: data = pd.Series(np.random.rand(16), index=list('abcdefghijklmnop'))
In [66]: data.plot.bar(ax=axes[0], color='k', alpha=0.7)
Out[66]: <matplotlib.axes._subplots.AxesSubplot at 0x7fb62493d470>
In [67]: data.plot.barh(ax=axes[1], color='k', alpha=0.7)
For a DataFrame, bar plots group the values in each row together, side by side:
In [69]: df = pd.DataFrame(np.random.rand(6, 4),
 ....: index=['one', 'two', 'three', 'four', 'five',
'six'],
 ....: columns=pd.Index(['A', 'B', 'C', 'D'],
name='Genus'))
In [70]: df
Out[70]:
Genus A B C D
one 0.370670 0.602792 0.229159 0.486744
two 0.420082 0.571653 0.049024 0.880592
three 0.814568 0.277160 0.880316 0.431326
four 0.374020 0.899420 0.460304 0.100843
five 0.433270 0.125107 0.494675 0.961825
six 0.601648 0.478576 0.205690 0.560547
In [71]: df.plot.bar()
Passing stacked=True produces a stacked bar plot for a DataFrame, so the values in each row are stacked together:
In [73]: df.plot.barh(stacked=True, alpha=0.5)
A useful recipe for bar plots is visualizing the value frequencies of a Series with value_counts, for example:
s.value_counts().plot.bar()
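As a self-contained illustration of that recipe, the sketch below uses a small made-up Series (the values and the Agg backend are assumptions, not data from these notes):

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import pandas as pd

s = pd.Series(["a", "b", "b", "c", "b", "a"])  # made-up values

# value_counts returns the frequencies, most common value first
counts = s.value_counts()
# One bar per distinct value, labelled with that value
ax = counts.plot.bar(color="k", alpha=0.7)
```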
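Take the restaurant tipping dataset as an example. Suppose we want a stacked bar plot showing the percentage of data points for each party size on each day. Load the data with read_csv, then build a cross-tabulation by day and party size: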
In [75]: tips = pd.read_csv('examples/tips.csv')
In [76]: party_counts = pd.crosstab(tips['day'], tips['size'])
In [77]: party_counts
Out[77]:
size 1 2 3 4 5 6
day
Fri 1 16 1 1 0 0
Sat 2 53 18 13 1 0
Sun 0 39 15 18 3 1
Thur 1 48 4 5 1 3
# Not many 1- and 6-person parties
In [78]: party_counts = party_counts.loc[:, 2:5]
Then normalize so that each row sums to 1, and make the plot:
In [79]: party_pcts = party_counts.div(party_counts.sum(axis=1), axis=0)
In [80]: party_pcts
Out[80]:
size 2 3 4 5
day
Fri 0.888889 0.055556 0.055556 0.000000
Sat 0.623529 0.211765 0.152941 0.011765
Sun 0.520000 0.200000 0.240000 0.040000
Thur 0.827586 0.068966 0.086207 0.017241
In [81]: party_pcts.plot.bar()
The data show that party sizes tend to increase on the weekend.
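The cross-tabulation and row normalization can be tried without the CSV file. The sketch below uses a hypothetical miniature table (tips_small and its values are invented for the example) and also shows crosstab's normalize="index" option, which performs the same row-wise division:

```python
import pandas as pd

# Hypothetical miniature stand-in for the tips data
tips_small = pd.DataFrame({
    "day":  ["Fri", "Fri", "Sat", "Sat", "Sat", "Sun", "Sun", "Sun"],
    "size": [2, 3, 2, 2, 4, 2, 3, 4],
})

counts = pd.crosstab(tips_small["day"], tips_small["size"])
# Divide each row by its own total so that every row sums to 1
pcts = counts.div(counts.sum(axis=1), axis=0)
# crosstab can also do the row normalization directly
pcts_direct = pd.crosstab(tips_small["day"], tips_small["size"],
                          normalize="index")
```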
Using seaborn can reduce the amount of work. Here seaborn is used to look at the tipping percentage by day:
In [83]: import seaborn as sns
In [84]: tips['tip_pct'] = tips['tip'] / (tips['total_bill'] - tips['tip'])
In [85]: tips.head()
Out[85]:
 total_bill tip smoker day time size tip_pct
0 16.99 1.01 No Sun Dinner 2 0.063204
1 10.34 1.66 No Sun Dinner 3 0.191244
2 21.01 3.50 No Sun Dinner 3 0.199886
3 23.68 3.31 No Sun Dinner 2 0.162494
4 24.59 3.61 No Sun Dinner 4 0.172069
In [86]: sns.barplot(x='tip_pct', y='day', data=tips, orient='h')
The black lines drawn on the bars represent the 95% confidence interval.
In [88]: sns.barplot(x='tip_pct', y='day', hue='time', data=tips, orient='h')
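By default, each bar in seaborn's barplot shows the mean of the group, and the interval is estimated by bootstrapping. The NumPy sketch below illustrates the general idea of a percentile bootstrap interval for a mean; it is not seaborn's actual implementation, and the data are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
tip_pct = rng.normal(loc=0.16, scale=0.06, size=60)  # made-up values, not the tips data

# Resample with replacement many times and record the mean of each resample
boot_means = np.array([
    rng.choice(tip_pct, size=tip_pct.size, replace=True).mean()
    for _ in range(1000)
])

# The middle 95% of the bootstrap means gives the interval
low, high = np.percentile(boot_means, [2.5, 97.5])
mean = tip_pct.mean()
```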
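Histograms and density plots
A histogram is a kind of bar plot that gives a discretized display of value frequency. The data points are split into discrete, evenly spaced bins, and the number of data points in each bin is plotted. Using the tipping data from before, we can make a histogram of tip percentages of the total bill with the plot.hist method on the Series:
In [92]: tips['tip_pct'].plot.hist(bins=50)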
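The binning step behind a histogram can be inspected on its own with np.histogram. A small sketch on made-up values (not the tips data):

```python
import numpy as np

values = np.random.default_rng(0).uniform(0.0, 0.5, size=200)  # made-up values

# counts[i] is the number of points falling between edges[i] and edges[i + 1]
counts, edges = np.histogram(values, bins=50)
widths = np.diff(edges)  # all bins have the same width
```

With 50 bins there are 51 edges, and every data point lands in exactly one bin, so the counts add up to the number of points.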
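A related plot type is the density plot, which is formed by computing an estimate of a continuous probability distribution that might have generated the observed data. The usual procedure is to approximate this distribution as a mixture of kernels, that is, simpler distributions such as the normal distribution. Density plots are therefore also known as KDE (kernel density estimate) plots: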
In [94]: tips['tip_pct'].plot.density()
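To make the "mixture of kernels" idea concrete, here is a minimal NumPy sketch of a Gaussian kernel density estimate. The function name gaussian_kde_sketch, the fixed bandwidth of 0.5, and the sample data are assumptions for illustration; library implementations choose the bandwidth automatically:

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average of normal densities centred on each sample, evaluated on grid."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
# Bimodal sample: two normal components, as in a mixture
samples = np.concatenate([rng.normal(0, 1, 200), rng.normal(10, 2, 200)])
grid = np.linspace(-10, 25, 1401)
density = gaussian_kde_sketch(samples, grid, bandwidth=0.5)

# Riemann-sum approximation of the integral; a density should integrate to about 1
area = np.sum(density) * (grid[1] - grid[0])
```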
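Plotting with seaborn: distplot draws a histogram and a density estimate together (it has been deprecated since seaborn 0.11 in favor of histplot and displot)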
In [96]: comp1 = np.random.normal(0, 1, size=200)
In [97]: comp2 = np.random.normal(10, 2, size=200)
In [98]: values = pd.Series(np.concatenate([comp1, comp2]))
In [99]: sns.distplot(values, bins=100, color='k')
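Point plots or scatter plots are a useful way to examine the relationship between two one-dimensional data series.
Here we load the macrodata dataset from the statsmodels project, select a few variables, and compute log differences: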
In [100]: macro = pd.read_csv('examples/macrodata.csv')
In [101]: data = macro[['cpi', 'm1', 'tbilrate', 'unemp']]
In [102]: trans_data = np.log(data).diff().dropna()
In [103]: trans_data[-5:]
Out[103]:
 cpi m1 tbilrate unemp
198 -0.007904 0.045361 -0.396881 0.105361
199 -0.021979 0.066753 -2.277267 0.139762
200 0.002340 0.010286 0.606136 0.160343
201 0.008419 0.037461 -0.200671 0.127339
202 0.008894 0.012202 -0.405465 0.042560
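The log difference computed above is the logarithm of the ratio between consecutive rows, which approximates the period-over-period growth rate. A sketch on a tiny made-up table (the numbers are not from macrodata):

```python
import numpy as np
import pandas as pd

levels = pd.DataFrame({"cpi": [100.0, 102.0, 101.0],
                       "m1": [50.0, 55.0, 60.5]})  # made-up values

# Difference of logs between consecutive rows; the first row has no predecessor
log_diff = np.log(levels).diff().dropna()
# Equivalent formulation: log of the ratio to the previous row
ratio = np.log(levels / levels.shift(1)).dropna()
```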
seaborn's regplot function makes a scatter plot and fits a linear regression line:
In [105]: sns.regplot(x='m1', y='unemp', data=trans_data)
Out[105]: <matplotlib.axes._subplots.AxesSubplot at 0x7fb613720be0>
In [106]: plt.title('Changes in log %s versus log %s' % ('m1', 'unemp'))
seaborn also provides a convenient pairplot function, which supports placing histograms or density estimates of each variable along the diagonal:
In [107]: sns.pairplot(trans_data, diag_kind='kde', plot_kws={'alpha': 0.2})
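Facet grids and categorical data
When a dataset has additional grouping dimensions, seaborn's built-in catplot function (named factorplot before seaborn 0.9) simplifies making many kinds of faceted plots:
In [108]: sns.catplot(x='day', y='tip_pct', hue='time', col='smoker',
 .....: kind='bar', data=tips[tips.tip_pct < 1])
Instead of grouping by time with different colors within a facet, we can also expand the facet grid by adding one row per time value:
In [109]: sns.catplot(x='day', y='tip_pct', row='time',
 .....: col='smoker',
 .....: kind='bar', data=tips[tips.tip_pct < 1])
catplot supports other plot types that may be useful. For example, box plots (which show the median, quartiles, and outliers) can be an effective visualization type:
In [110]: sns.catplot(x='tip_pct', y='day', kind='box',
 .....: data=tips[tips.tip_pct < 0.5])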

###Data Structures – Linked List

class Node(object):
    def __init__(self, data):
        self.data = data
        self.next = None

    def set_data(self, data):
        self.data = data

    def set_next(self, node):
        self.next = node


node1 = Node('apple')
node2 = Node('pear')
node3 = Node('watermelon')
node4 = Node('Strawberry')

node1.set_next(node2)
node2.set_next(node3)
node3.set_next(node4)

node = node1
#
# while True:
#     print(node.data)
#     if node.next:
#         node = node.next
#         continue
#     break

# head = node1
# while True:
#     print(head.data)
#     if head.next is None:
#         break
#     head = head.next


class LinkedList(object):
    def __init__(self):
        self.head = None

    def append_item(self, item):
        temp = Node(item)
        if self.head is None:
            self.head = temp
            return temp.data
        current = self.head
        # Traverse the list until the last node
        while current.next is not None:
            current = current.next
        # current is now the last node in the list
        current.set_next(temp)
        return temp.data

    def remove_item(self, item):
        current = self.head
        pre = None
        while current is not None:
            if current.data == item:
                if not pre:
                    self.head = current.next
                else:
                    pre.set_next(current.next)
                break
            else:
                pre = current
                current = current.next

    def remove_by_index(self, index):
        # index is 0-based; reject an empty list or an out-of-range index
        if self.head is None or index < 0 or index >= self.size():
            print('LinkList is empty or index is out of range')
            return
        if index == 0:
            # Removing the head only requires moving the head pointer
            self.head = self.head.next
            return
        pre = self.head
        current = self.head.next
        j = 1
        while j < index:
            pre = current
            current = current.next
            j += 1
        # Unlink the node at position index
        pre.set_next(current.next)

    def add_item(self, item, index):
        # index is a 1-based position; values <= 1 insert at the head
        if index <= 1:
            temp = Node(item)
            temp.set_next(self.head)
            self.head = temp
            return temp.data
        if index > self.size():
            # Past the end: fall back to appending
            return self.append_item(item)
        temp = Node(item)
        count = 1
        pre = None
        current = self.head
        while count < index:
            count += 1
            pre = current
            current = current.next
        # Splice the new node in between pre and current
        pre.set_next(temp)
        temp.set_next(current)
        return temp.data

    def size(self):
        current = self.head
        count = 0
        while current is not None:
            count += 1
            current = current.next
        return count

    def get_item(self, index):
        # index is a 1-based position; returns None when nothing is found
        if self.head is None:
            print('LinkList is empty')
            return None
        i = 1
        current = self.head
        while current.next is not None and i < index:
            current = current.next
            i += 1
        if i == index:
            return current.data
        print('target does not exist')
        return None

    def index(self, item):
        # Returns the 0-based position of the first node holding item, or None
        if self.head is None:
            print('LinkList is empty')
            return None
        current = self.head
        i = 0
        while current.next is not None and current.data != item:
            current = current.next
            i += 1

        if current.data == item:
            return i
        print('value not found')
        return None

    def __str__(self):
        # Join the items with commas; an empty list yields an empty string
        items = []
        current = self.head
        while current is not None:
            items.append(str(current.data))
            current = current.next
        return ','.join(items)


link = LinkedList()
# a = link.append_item(5)
# print(a)
k = ['e', 2, 3, 'a', 'b', 'c', 'd']
list1 = []
for i in k:
    list1.append(link.append_item(i))
print(list1)
print(link.size())
print(link.get_item(4))
print(link.add_item('apple', 4))
link.remove_by_index(4)
link.remove_item('apple')
print(link)
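Several methods above repeat the same traversal loop. One possible refinement, sketched here as a separate hypothetical class (IterableLinkedList and its tail and length attributes are not part of the code above), is to expose traversal through __iter__ and to keep a tail pointer so that appending does not need to walk the list:

```python
class Node:
    def __init__(self, data):
        self.data = data
        self.next = None


class IterableLinkedList:
    """Hypothetical variant: O(1) append via a tail pointer, plus iteration."""

    def __init__(self):
        self.head = None
        self.tail = None
        self.length = 0

    def append_item(self, item):
        node = Node(item)
        if self.head is None:
            self.head = node
        else:
            self.tail.next = node
        self.tail = node
        self.length += 1

    def __iter__(self):
        # Yield the data of each node from head to tail
        current = self.head
        while current is not None:
            yield current.data
            current = current.next

    def __len__(self):
        return self.length


fruits = IterableLinkedList()
for name in ["apple", "pear", "watermelon"]:
    fruits.append_item(name)
print(",".join(fruits))
```

With __iter__ in place, built-ins such as list(), join, and for loops work on the linked list directly.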

###Data Structures – Stack

class Stack(object):

    def __init__(self, max_size):
        self.container = []
        self.max_size = max_size

    def push(self, item):
        # Check whether the stack is full
        if self.is_full():
            return 'stack is full'
        self.container.append(item)

    def pop(self):
        # Check whether the stack is empty
        if self.is_empty():
            return None
        # Remove and return the top item (the last one pushed)
        return self.container.pop()

    def size(self):
        return len(self.container)

    def is_empty(self):
        return len(self.container) == 0

    def is_full(self):
        return len(self.container) == self.max_size

    def index(self, item):
        for i in range(len(self.container)):
            if self.container[i] == item:
                print(i)


stack = Stack(6)
for i in range(5):
    stack.push(i)
print(stack.container)
stack.push('a')
print(stack.container)
# stack.pop(2)
# print(stack.container)
stack.index(4)
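The defining property of a stack is last in, first out. The sketch below shows that behavior with a hypothetical BoundedStack class that raises exceptions instead of returning a message string; the class name, the peek method, and the choice of exception types are assumptions for the example, not part of the code above:

```python
class BoundedStack:
    """Hypothetical variant of a fixed-capacity stack that raises on misuse."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._items = []

    def push(self, item):
        if len(self._items) >= self.max_size:
            raise OverflowError("stack is full")
        self._items.append(item)

    def pop(self):
        if not self._items:
            raise IndexError("pop from empty stack")
        return self._items.pop()

    def peek(self):
        # Look at the top item without removing it
        return self._items[-1]

    def __len__(self):
        return len(self._items)


stack = BoundedStack(3)
for ch in "abc":
    stack.push(ch)
popped = [stack.pop(), stack.pop()]  # last in, first out
print(popped)
```

Raising an exception makes a failed push impossible to ignore, whereas a returned string is easy to overlook at the call site.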

###Unit Testing

import unittest

class TestClass01(unittest.TestCase):

    def setUp(self):
        print('setup....')
        self.l = []
        self.l.append(3)

    def test_case02(self):
        self.assertEqual(len(self.l), 1)
        my_pi = 3.14
        self.assertFalse(isinstance(my_pi, int))

    def test_case01(self):
        my_str = "Carmack"
        my_int = 999
        self.assertTrue(isinstance(my_str, str))
        self.assertTrue(isinstance(my_int, int))

if __name__ == '__main__':
    unittest.main()
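unittest.main() is convenient for scripts, but tests can also be loaded and run programmatically, which makes the outcome available as an object. A minimal sketch (the class TestListSetup and its test names are made up for the example) that also shows setUp running before every test method:

```python
import io
import unittest


class TestListSetup(unittest.TestCase):

    def setUp(self):
        # Runs before every test method, so each test sees a fresh list
        self.items = [3]

    def test_length(self):
        self.assertEqual(len(self.items), 1)

    def test_mutation_is_isolated(self):
        # Changes made here do not leak into other tests
        self.items.append(4)
        self.assertEqual(self.items, [3, 4])


suite = unittest.TestLoader().loadTestsFromTestCase(TestListSetup)
# Send the runner's report to a buffer instead of stderr
result = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0).run(suite)
print(result.testsRun, result.wasSuccessful())
```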