A Singles' Day First Look at TensorFlow (Part 1): Preprocessing the Dataset

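November 11 was bound to mean something to me. Not because it is a shopping festival or Singles' Day, but because it was the day I first tried building a simple neural network with TensorFlow. I plan to record the process over a few posts.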

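I have recently been reading Fundamentals of Deep Learning. I chose it because its explanations are accessible and it states the key points plainly. But when I reached Chapter 3, "Implementing Neural Networks in TensorFlow", I was completely lost. For someone who has never touched TensorFlow, it is hard to grasp new concepts such as Graph and Session just by reading code. That reminded me of the SQ3R reading method described in Pragmatic Thinking and Learning (程序员的思维修炼), so I put the book down and went looking online for other TensorFlow material. The resources I recommend are: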

TensorFlow official documentation, Chinese edition
TF Girls "TensorFlow Tutorial" training guide (the instructor is quite funny)
- YouTube link
- bilibili link

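To my surprise, I managed to finish a basic neural network in a single day (even if that day ran from 1 p.m. to 2 a.m.). Now comes an important step of the SQ3R method, Recite: writing the process up as posts and publishing them on this blog.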

The dataset comes from The Street View House Numbers (SVHN) Dataset, a dataset for recognizing the digits that appear in street-view photos.

Reading the data

First download the Format 2 data, i.e. the .mat files. Let's explore the data in IPython:

In[1]: from scipy.io import loadmat as load

In[2]: train_data = load('data/train_32x32.mat')
  ...: test_data = load('data/test_32x32.mat')

In[3]: train_data.keys()
Out[3]: dict_keys(['__header__', '__version__', '__globals__', 'X', 'y'])

In[4]: train_data['X'].shape
Out[4]: (32, 32, 3, 73257)

In[5]: train_data['y'].shape
Out[5]: (73257, 1)

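Next we extract the samples and labels of the training and test sets. I don't want the labels to be a two-dimensional array, so I flatten them to one dimension: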

In[6]: train_samples = train_data['X']
  ...: train_labels = train_data['y'].reshape(train_data['y'].shape[0])
  ...: test_samples = test_data['X']
  ...: test_labels = test_data['y'].reshape(test_data['y'].shape[0])

In[7]: train_labels.shape
Out[7]: (73257,)
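
As a small aside, the flattening step can be tried without downloading anything. The sketch below uses a made-up five-element column vector as a stand-in for train_data['y'] (the values are invented for illustration) and shows that reshape(-1) and ravel() appear to give the same result as the reshape call used above:

```python
import numpy as np

# Synthetic stand-in for train_data['y']: a column vector of shape (5, 1).
# The values are invented for illustration only.
y = np.array([[1], [9], [2], [10], [3]], dtype=np.uint8)

flat_a = y.reshape(y.shape[0])  # the form used in the session above
flat_b = y.reshape(-1)          # let NumPy infer the length
flat_c = y.ravel()              # flatten to one dimension

print(y.shape, flat_a.shape)
```

Which spelling to use is a matter of taste; reshape(-1) simply avoids repeating the array name.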

Reordering the data dimensions

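At this point train_samples has shape (32, 32, 3, 73257), i.e. (image height, image width, channels, number of images). Oddly, the original format puts the number of images in the fourth dimension. We want train_samples to be (number of images, image height, image width, channels), i.e. (73257, 32, 32, 3). train_labels needs some changes too. Right now each label is simply the digit shown in the image, such as 3, and we want it to become a one-hot vector such as [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]. The slightly awkward part is that the .mat data contains no label 0; it uses 10 to represent the digit 0, so we need a small adjustment: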

import numpy as np

def reformat(samples, labels):
    # Reorder the axes of the raw data:
    # (height, width, channels, images) -> (images, height, width, channels)
    # Convert labels to one-hot encoding: [3] -> [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
    samples = np.transpose(samples, (3, 0, 1, 2))

    one_hot_labels = np.zeros((labels.shape[0], 10))
    for i, label in enumerate(labels):
        index = label if label != 10 else 0
        one_hot_labels[i, index] = 1.0

    return samples, one_hot_labels
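
To check the two transformations on something small, here is a hedged sketch that runs on random stand-in data rather than the real SVHN files. The name reformat_vectorized is my own, and it swaps the explicit loop for a modulo plus np.eye lookup, which should be equivalent because the labels only take the values 1 to 10:

```python
import numpy as np

def reformat_vectorized(samples, labels, num_classes=10):
    # (height, width, channels, images) -> (images, height, width, channels)
    samples = np.transpose(samples, (3, 0, 1, 2))
    # Labels run from 1 to 10, with 10 meaning the digit 0, so a modulo
    # maps 10 -> 0 and leaves 1..9 unchanged.
    indices = labels.astype(np.int64) % num_classes
    # Row k of the identity matrix is the one-hot vector for class k.
    one_hot_labels = np.eye(num_classes)[indices]
    return samples, one_hot_labels

# Random stand-in data: four fake 32x32 RGB images and invented labels.
rng = np.random.default_rng(0)
fake_samples = rng.integers(0, 256, size=(32, 32, 3, 4), dtype=np.uint8)
fake_labels = np.array([3, 10, 1, 9], dtype=np.uint8)

s, oh = reformat_vectorized(fake_samples, fake_labels)
print(s.shape)   # (4, 32, 32, 3)
print(oh[1])     # one-hot vector for the label 10, i.e. the digit 0
```

After the transpose, s[n] is the n-th image, the same pixels as fake_samples[..., n] in the original layout.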

Collapsing the channels and visualizing the data

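Next we collapse the three RGB channels of each image into a single grayscale channel, and at the same time rescale the pixel values to roughly the range -1.0 to 1.0: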

def normalize(samples):
    """
    @ samples: numpy array
    """
    samples = np.add.reduce(samples, keepdims=True, axis=3) / 3.0
    return samples / 128.0 - 1.0
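
To see what range the scaling actually produces, the following sketch applies the same function to two invented 2x2 images, one all black and one all white. Because the divisor is 128 rather than 127.5, the upper end of the range comes out slightly below 1.0:

```python
import numpy as np

def normalize(samples):
    # Average the three colour channels, keeping a channel axis of size 1.
    gray = np.add.reduce(samples, keepdims=True, axis=3) / 3.0
    # Map pixel values 0..255 to -1.0..(255/128 - 1).
    return gray / 128.0 - 1.0

# Invented test images with shape (images, height, width, channels).
black = np.zeros((1, 2, 2, 3), dtype=np.uint8)
white = np.full((1, 2, 2, 3), 255, dtype=np.uint8)

low = normalize(black)
high = normalize(white)
print(low.shape)               # (1, 2, 2, 1)
print(low.min(), high.max())   # -1.0 0.9921875
```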

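With the images converted to grayscale, we would like to look at both the original image and its grayscale version. The huidu parameter (pinyin for "grayscale") says whether the image to display is grayscale; if it is, the image is reshaped to two dimensions and drawn with a gray colormap: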

import matplotlib.pyplot as plt

def inspect(datasets, labels, i, huidu=False):
    # Print the label and display the image
    print(labels[i])

    if huidu:
        huidu_shape = (datasets.shape[1], datasets.shape[2])
        plt.imshow(datasets[i].reshape(huidu_shape), cmap="gray")
    else:
        plt.imshow(datasets[i])
    plt.show()

We now apply both preprocessing steps, reordering the dimensions and collapsing the channels, to the training and test sets:

In[8]: re_train_samples, re_train_labels = reformat(train_samples, train_labels)
  ...: re_test_samples, re_test_labels = reformat(test_samples, test_labels)

In[9]: final_train_samples = normalize(re_train_samples)
  ...: final_train_labels = re_train_labels
  ...: final_test_samples = normalize(re_test_samples)
  ...: final_test_labels = re_test_labels

Now pick an arbitrary sample from the training set and look at its original image and its grayscale version:

inspect(re_train_samples, train_labels, 31960)

inspect(final_train_samples, train_labels, 31960, huidu=True)

The second call displays the grayscale image.

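Good, everything runs as expected! Finally, we want to see how the labels are distributed in the training and test sets, so we draw a bar chart of the counts of the digits 0 to 9: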

from collections import Counter

def distribution(labels, name):
    # Count the labels and plot their distribution
    count = Counter(labels)
    y_pos = np.arange(len(count))
    y_count = [count[i] if i != 0 else count[10] for i in y_pos]
    plt.bar(y_pos, y_count, align='center', alpha=0.5)
    plt.xticks(y_pos, y_pos)
    plt.ylabel('Count')
    plt.title(name + ' Label Distribution')
    plt.show()

In[10]: distribution(train_labels, 'Train')

In[11]: distribution(test_labels, 'Test')

As you can see, train_labels and test_labels have similar distributions, which suggests the train/test split is reasonable, so we can keep using it.
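
Comparing two bar charts by eye is a little subjective, so here is a rough numeric version of the same check. It is only a sketch: label_frequencies is a helper name I made up, and the label arrays are tiny invented examples rather than the real SVHN labels. It converts counts to relative frequencies and reports the largest per-digit gap between the two sets:

```python
import numpy as np

def label_frequencies(labels, num_classes=10):
    # Raw labels use 10 for the digit 0; modulo maps them onto 0..9.
    digits = labels.astype(np.int64) % num_classes
    counts = np.bincount(digits, minlength=num_classes)
    return counts / counts.sum()

# Tiny invented label sets, standing in for train_labels and test_labels.
toy_train = np.array([1, 1, 2, 10, 3, 1, 2, 10])
toy_test = np.array([1, 2, 10, 3])

train_freq = label_frequencies(toy_train)
test_freq = label_frequencies(toy_test)
gap = np.abs(train_freq - test_freq).max()
print(train_freq[:4], test_freq[:4], gap)
```

A small gap on the real labels would support the visual impression that the two sets are similarly distributed; how small is small enough is a judgment call.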

That completes the preprocessing. We collect all of the code above into load.py for the neural network to use later:

from scipy.io import loadmat as load
import matplotlib.pyplot as plt
import numpy as np
from collections import Counter


# Image recognition with TensorFlow
def load_data():
    train_data = load('data/train_32x32.mat')
    test_data = load('data/test_32x32.mat')
    # extra_data = load('data/extra_32x32.mat')
    return train_data, test_data


def reformat(samples, labels):
    # Reorder the axes of the raw data:
    # (height, width, channels, images) -> (images, height, width, channels)
    # Convert labels to one-hot encoding: [3] -> [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
    samples = np.transpose(samples, (3, 0, 1, 2))

    one_hot_labels = np.zeros((labels.shape[0], 10))
    for i, label in enumerate(labels):
        index = label if label != 10 else 0
        one_hot_labels[i, index] = 1.0

    return samples, one_hot_labels


def normalize(samples):
    """
    Convert to grayscale: (R + G + B) / 3 (saves memory and speeds up training)
    Map pixel values from 0 ~ 255 to roughly -1.0 ~ 1.0
    @ samples: numpy array
    """
    samples = np.add.reduce(samples, keepdims=True, axis=3) / 3.0
    return samples / 128.0 - 1.0


def distribution(labels, name):
    # Count the labels and plot their distribution
    count = Counter(labels)
    y_pos = np.arange(len(count))
    y_count = [count[i] if i != 0 else count[10] for i in y_pos]
    plt.bar(y_pos, y_count, align='center', alpha=0.5)
    plt.xticks(y_pos, y_pos)
    plt.ylabel('Count')
    plt.title(name + ' Label Distribution')
    plt.show()


def inspect(datasets, labels, i, huidu=False):
    # Print the label and display the image
    print(labels[i])

    if huidu:
        huidu_shape = (datasets.shape[1], datasets.shape[2])
        plt.imshow(datasets[i].reshape(huidu_shape), cmap="gray")
    else:
        plt.imshow(datasets[i])
    plt.show()

train_data, test_data = load_data()

train_samples = train_data['X']
train_labels = train_data['y'].reshape(train_data['y'].shape[0])
test_samples = test_data['X']
test_labels = test_data['y'].reshape(test_data['y'].shape[0])
# test_samples = extra_data['X']
# test_labels = extra_data['y']

print('Train Data Samples Shape: ', train_samples.shape)
print('Train Data Labels Shape: ', train_labels.shape)

print('Test Data Samples Shape: ', test_samples.shape)
print('Test Data Labels Shape: ', test_labels.shape)

# print('Extra Data Samples Shape: ', extra_data['X'].shape)
# print('Extra Data Labels Shape: ', extra_data['y'].shape)

re_train_samples, re_train_labels = reformat(train_samples, train_labels)
re_test_samples, re_test_labels = reformat(test_samples, test_labels)

final_train_samples = normalize(re_train_samples)
final_train_labels = re_train_labels
final_test_samples = normalize(re_test_samples)
final_test_labels = re_test_labels

num_labels = final_train_labels.shape[1] # 10
image_size = final_train_samples.shape[1] # 32
num_channel = final_train_samples.shape[3] # 1

if __name__ == '__main__':
    # See some pictures
    inspect(re_train_samples, train_labels, 1, huidu=False)

    # See some gray pictures
    inspect(final_train_samples, train_labels, 1, huidu=True)

    # See the distribution of the labels
    distribution(train_labels, 'Train')
    distribution(test_labels, 'Test')

Closing thoughts

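You may be thinking: "Liar, you didn't write anything about TensorFlow at all." Heh, hold your fire. The neural network code will go into a separate program, network.py, and that is where TensorFlow comes in. There wasn't room to fit it all into one post, so I will cover building a basic neural network with TensorFlow in the next one.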

By the way, I find that going back over code I have written while writing it up is a great way to learn :) Here's to mastering TensorFlow soon!

XZY's BLOG