Understanding and Practicing RNNs

The most intuitive way to tell RNNs and CNNs apart is that a CNN operates on grid-like data and is therefore a natural fit for image processing, while an RNN is built for sequences and is therefore a natural fit for text. My interest in RNNs was sparked by a blog post by 寒小阳 & 龙心尘 showing that a neural network can imitate human writing and churn out paragraphs in the style of "小四" (the novelist Guo Jingming). A fellow student also happened to recommend Keras, the newly popular Python library for machine learning, so this is a good chance to get to know the magical recurrent neural network through both theory and practice.

RNN Structure and Principles, Briefly

First of all, an RNN is still a neural network; its basic structure is essentially the same as that of an ordinary neural network, as shown in the figure below (Ref).

RNN network structure

The difference is that a traditional neural network (a CNN included) assumes that all inputs and outputs are independent of one another, whereas the basic assumption of an RNN is that the elements of the input sequence influence each other. The RNN is called recurrent because the same network is applied repeatedly, once for every time step of the input sequence, and backpropagation is used to iteratively update the weights $(U, V, W)$. This is also why a single RNN layer can be "unrolled" into $n$ layers, where $n$ is the length of the input sequence, i.e. the total number of time steps, and why all the unrolled nodes of one RNN layer share the same set of parameters. The "unrolling" is only conceptual.

The RNN update process

Here $U$ holds the parameters from the input layer to the hidden layer $S$, $W$ holds the parameters carrying the hidden state from time $t$ to time $t+1$, and $V$ holds the parameters from the hidden layer to the output layer. The figure is taken from Part II, Chapter 10, "Sequence Modeling: Recurrent and Recursive Nets", of the Deep Learning book by Bengio and co-authors.

The basic RNN computation is as follows:

  1. $x_t$ is the input at time $t$, for example a one-hot encoding of a word or character;
  2. The hidden state is $s_t=f(Ux_t+Ws_{t-1})$, where $f$ is usually tanh or ReLU; $s_{t-1}$ is the hidden state at the previous time step, and for $t=0$ the state $s_{-1}$ is usually initialized to the zero vector;
  3. The output is $o_t=\mathrm{Softmax}(Vs_t)$.
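
To make the recurrence concrete, here is a minimal numpy sketch of the forward pass described above. The vocabulary size, hidden size, and random initialization are illustrative assumptions of mine; the point is only that the same $U$, $W$ and $V$ are reused at every time step.

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / np.sum(e)

vocab_size, hidden_size = 4, 8  # illustrative sizes, not from the post
U = np.random.randn(hidden_size, vocab_size) * 0.01   # input  -> hidden
W = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden -> hidden
V = np.random.randn(vocab_size, hidden_size) * 0.01   # hidden -> output

def forward(xs):
    # xs is a list of one-hot column vectors, one per time step
    s = np.zeros((hidden_size, 1))                 # s_{-1}: initial hidden state
    outputs = []
    for x in xs:                                   # the same U, W, V at every step
        s = np.tanh(np.dot(U, x) + np.dot(W, s))   # s_t = tanh(U x_t + W s_{t-1})
        outputs.append(softmax(np.dot(V, s)))      # o_t = Softmax(V s_t)
    return outputs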

It is worth noting that although the hidden state is updated over the entire input sequence, because of the way the network's parameters are trained only inputs from the recent past have much influence on the current state (the effect of earlier inputs fades), which roughly matches how human memory works.

That is the simplest form of RNN. Built on top of it are bidirectional RNNs and LSTMs, which I will not expand on here. Let us go straight to the character-level RNN we are about to use.

Character-Level RNN

In one of his blog posts, Karpathy describes the original idea of the character-level RNN (char-RNN for short): throw a pile of text at an RNN and let it learn at the character level, so that starting from a given letter or word it can emit the most likely sequence of characters and assemble them into words or sentences.

For example, take the four letters h, e, l, o. Map each of them to a k-dimensional vector and train the network on the string "hello". Once training is finished, start from the letter h and let the network keep appending the most likely next letter, building up a word.
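
As a small sketch of how that training sample is prepared (the variable names are mine, chosen to mirror the numpy demo further down): each character becomes a one-hot vector, and the target at every step is simply the next character in the string.

import numpy as np

chars = ['h', 'e', 'l', 'o']                      # the 4-character vocabulary
char_to_ix = {ch: i for i, ch in enumerate(chars)}
vocab_size = len(chars)

text = 'hello'
inputs  = [char_to_ix[c] for c in text[:-1]]      # 'h','e','l','l'
targets = [char_to_ix[c] for c in text[1:]]       # 'e','l','l','o' (the next chars)

def one_hot(ix):
    v = np.zeros((vocab_size, 1))                 # k-dimensional vector, k = 4 here
    v[ix] = 1
    return v

xs = [one_hot(ix) for ix in inputs]               # the sequence fed to the RNN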

The example is trivial and the network will of course end up producing "hello"; the point is to understand how char-RNN works. In the first step we feed the RNN the letter "h". Suppose the network assigns a confidence of 1.0 to "h" coming next, 2.2 to "e", -3.0 to "l" and 4.1 to "o". Since the training data is "hello", we want to raise the confidence of the correct next letter "e" (green) and lower the confidences of the other letters (red), and we do the same at every step: raise the confidence of the green letter. Because every part of the RNN is differentiable, backpropagation can be used to work out how to adjust the weights, and we then perform a parameter update. If we feed the RNN the same input again, the confidence of the correct character (e.g. "e" in the first step) will be slightly higher and the confidences of the incorrect characters slightly lower. Repeating this process many times, the network converges and its predictions match the training data: the next character is always the correct one.

Illustration of how char-RNN works
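
The "raise the correct letter, lower the rest" step is just the gradient of the softmax cross-entropy loss. A small illustration, using the confidence numbers quoted above (the formula dy = p; dy[target] -= 1 is the same one that appears in the numpy demo below):

import numpy as np

# unnormalized confidences for the letter after 'h', in the order [h, e, l, o]
y = np.array([1.0, 2.2, -3.0, 4.1])
p = np.exp(y) / np.sum(np.exp(y))   # softmax probabilities

target = 1                          # index of 'e', the correct next letter
dy = p.copy()
dy[target] -= 1                     # gradient of the cross-entropy loss w.r.t. y
# dy is negative only at 'e' and positive everywhere else, so a gradient
# step raises the confidence of 'e' and lowers the other confidences.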

A few more details: a standard softmax classifier is applied to the output at every time step, and all of the outputs are trained simultaneously. The parameters are updated with mini-batch stochastic gradient descent, or alternatively with per-parameter adaptive learning-rate methods such as RMSProp or Adam. Note also that the training data contains two "l"s, and the confidences the network produces at those two positions are different: the RNN conditions on the whole preceding context, not just on the previous letter.
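
For reference, this is roughly what one of the per-parameter adaptive updates mentioned above looks like. It is a generic RMSProp sketch of my own (the numpy demo below actually uses Adagrad), with the learning rate and decay chosen arbitrarily:

import numpy as np

def rmsprop_update(param, grad, cache, lr=1e-3, decay=0.9, eps=1e-8):
    # keep a running average of squared gradients and scale each
    # parameter's step size by it (one RMSProp update)
    cache = decay * cache + (1 - decay) * grad * grad
    param = param - lr * grad / (np.sqrt(cache) + eps)
    return param, cache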

At test time we get a probability distribution over the next character; from this distribution we pick (or sample) the next character, feed it back in, and repeat the process, and then we watch the magic happen!

For teaching purposes, Karpathy wrote a small character-level RNN language-model demo in Python/numpy (the link may require a proxy to reach from mainland China). It is only about 100 lines of code, so I reproduce it here in the hope that it gives you a direct, concrete feel for the model. Karpathy and his collaborators now focus on a faster and more capable Lua/Torch codebase.

"""
Minimal character-level Vanilla RNN model. Written by Andrej Karpathy (@karpathy)
BSD License
"""

import numpy as np

# data I/O 导入英文纯文本文件,建立词到索引、索引到词的映射
data = open('input.txt', 'r').read() # should be simple plain text file
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print 'data has %d characters, %d unique.' % (data_size, vocab_size)
char_to_ix = { ch:i for i,ch in enumerate(chars) }
ix_to_char = { i:ch for i,ch in enumerate(chars) }

# hyperparameters 定义隐藏层的神经元数量、每一步处理多长的序列以及学习速率
hidden_size = 100 # size of hidden layer of neurons
seq_length = 25 # number of steps to unroll the RNN for
learning_rate = 1e-1

# model parameters 随机生成神经网络参数矩阵,Wxh即U,Whh即W,Why即V,以及偏置单元
Wxh = np.random.randn(hidden_size, vocab_size)*0.01 # input to hidden
Whh = np.random.randn(hidden_size, hidden_size)*0.01 # hidden to hidden
Why = np.random.randn(vocab_size, hidden_size)*0.01 # hidden to output
bh = np.zeros((hidden_size, 1)) # hidden bias
by = np.zeros((vocab_size, 1)) # output bias

def lossFun(inputs, targets, hprev):
"""
inputs,targets are both list of integers.
hprev is Hx1 array of initial hidden state
returns the loss, gradients on model parameters, and last hidden state
"""

xs, hs, ys, ps = {}, {}, {}, {}
hs[-1] = np.copy(hprev)
loss = 0
# forward pass 正向传播过程,输入采用1-k向量表示,隐藏层函数采用tanh
for t in xrange(len(inputs)):
xs[t] = np.zeros((vocab_size,1)) # encode in 1-of-k representation
xs[t][inputs[t]] = 1
hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh) # hidden state 参考文献Learning Recurrent Neural Networks with Hessian-Free Optimization
ys[t] = np.dot(Why, hs[t]) + by # unnormalized log probabilities for next chars
ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars
loss += -np.log(ps[t][targets[t],0]) # softmax (cross-entropy loss)
# backward pass: compute gradients going backwards
dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
dbh, dby = np.zeros_like(bh), np.zeros_like(by)
dhnext = np.zeros_like(hs[0])
for t in reversed(xrange(len(inputs))):
dy = np.copy(ps[t])
dy[targets[t]] -= 1 # backprop into y. see http://cs231n.github.io/neural-networks-case-study/#grad if confused here
dWhy += np.dot(dy, hs[t].T)
dby += dy
dh = np.dot(Why.T, dy) + dhnext # backprop into h
dhraw = (1 - hs[t] * hs[t]) * dh # backprop through tanh nonlinearity
dbh += dhraw
dWxh += np.dot(dhraw, xs[t].T)
dWhh += np.dot(dhraw, hs[t-1].T)
dhnext = np.dot(Whh.T, dhraw)
for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
np.clip(dparam, -5, 5, out=dparam) # clip to mitigate exploding gradients
return loss, dWxh, dWhh, dWhy, dbh, dby, hs[len(inputs)-1]

def sample(h, seed_ix, n):
"""
sample a sequence of integers from the model
h is memory state, seed_ix is seed letter for first time step
"""

x = np.zeros((vocab_size, 1))
x[seed_ix] = 1
ixes = []
for t in xrange(n):
h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h) + bh)
y = np.dot(Why, h) + by
p = np.exp(y) / np.sum(np.exp(y))
ix = np.random.choice(range(vocab_size), p=p.ravel())
x = np.zeros((vocab_size, 1))
x[ix] = 1
ixes.append(ix)
return ixes

n, p = 0, 0
mWxh, mWhh, mWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
mbh, mby = np.zeros_like(bh), np.zeros_like(by) # memory variables for Adagrad
smooth_loss = -np.log(1.0/vocab_size)*seq_length # loss at iteration 0
while True:
# prepare inputs (we're sweeping from left to right in steps seq_length long)
if p+seq_length+1 >= len(data) or n == 0:
hprev = np.zeros((hidden_size,1)) # reset RNN memory
p = 0 # go from start of data
inputs = [char_to_ix[ch] for ch in data[p:p+seq_length]]
targets = [char_to_ix[ch] for ch in data[p+1:p+seq_length+1]]

# sample from the model now and then
if n % 100 == 0:
sample_ix = sample(hprev, inputs[0], 200)
txt = ''.join(ix_to_char[ix] for ix in sample_ix)
print '----\n %s \n----' % (txt, )

# forward seq_length characters through the net and fetch gradient
loss, dWxh, dWhh, dWhy, dbh, dby, hprev = lossFun(inputs, targets, hprev)
smooth_loss = smooth_loss * 0.999 + loss * 0.001
if n % 100 == 0: print 'iter %d, loss: %f' % (n, smooth_loss) # print progress

# perform parameter update with Adagrad
for param, dparam, mem in zip([Wxh, Whh, Why, bh, by],
[dWxh, dWhh, dWhy, dbh, dby],
[mWxh, mWhh, mWhy, mbh, mby]):
mem += dparam * dparam
param += -learning_rate * dparam / np.sqrt(mem + 1e-8) # adagrad update

p += seq_length # move data pointer
n += 1 # iteration counter

Keras

The author of Keras is either Greek or very fond of literature, because keras means horn in Greek. The literary image goes back to ancient Greek and Latin literature, first appearing in the Odyssey, where dream spirits are divided into those that deceive with false visions and reach mortals through a gate of ivory, and those that announce a future that will come to pass and arrive through a gate of horn. Keras is one of the high-level deep-learning wrapper libraries that have been getting more and more popular lately; a quick list of pros and cons follows. After three minutes with the documentation you can put together a deep-learning prototype, with pluggable neural-network layers, which is hard to praise enough. On the other hand it is not easy to debug, and it does little to deepen your understanding of the underlying principles.

Project: https://github.com/fchollet/keras
Documentation: http://keras.io/

Pros:

  • The documentation is thorough and detailed.
  • It provides a fairly high-level framework, so building a deep-learning prototype is very easy (see the short sketch after these lists).
  • It is updated frequently, is Python-based, and supports computation on both CPU and GPU.
  • The backend is now switchable: you can choose between Theano and TensorFlow.

Cons:

  • To really understand the underlying principles, you are still better off building things by hand.
  • Runtime efficiency is relatively low.
  • Debugging is not easy.
  • It changes too fast: Keras-based code found on GitHub usually has to be revised against the latest documentation before it will run.
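
To illustrate the "prototype in minutes" point above, here is a minimal sketch of a tiny classifier in the Sequential API. The layer sizes and the (commented-out) training call are made up for illustration, and it assumes the same Keras 1.x-era API used in the char-RNN script below:

from keras.models import Sequential
from keras.layers.core import Dense, Activation

model = Sequential()                    # layers are simply stacked in order
model.add(Dense(64, input_dim=100))     # 100 input features -> 64 hidden units
model.add(Activation('relu'))
model.add(Dense(10))                    # 10 output classes
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='sgd')
# model.fit(X_train, y_train, nb_epoch=10, batch_size=32)  # hypothetical data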

Char-RNN Using Keras

Since this is only a first try, I grabbed a Keras char-RNN script from GitHub, modified it a little, and ran it on the script of Shakespeare's The Tragedy of Julius Caesar.

# -*- coding: utf-8 -*-
"""
Created on Sat May 21 14:34:08 2016

@author: yangsicong
"""


import numpy as np

from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout, TimeDistributedDense
from keras.layers.recurrent import LSTM

text = open('./input.txt', 'r').read()
char_to_idx = { ch: i for (i, ch) in enumerate(sorted(list(set(text)))) }
idx_to_char = { i: ch for (ch, i) in char_to_idx.items() }
vocab_size = len(char_to_idx)

print('Working on %d characters (%d unique)' % (len(text), vocab_size))

SEQ_LENGTH = 64
BATCH_SIZE = 16
BATCH_CHARS = len(text) / BATCH_SIZE
LSTM_SIZE = 512
LAYERS = 3

def read_batches(text):
    # yield (X, Y) batches of one-hot encoded characters, with Y shifted one step ahead of X
    T = np.asarray([char_to_idx[c] for c in text], dtype=np.int32)
    X = np.zeros((BATCH_SIZE, SEQ_LENGTH, vocab_size))
    Y = np.zeros((BATCH_SIZE, SEQ_LENGTH, vocab_size))

    for i in range(0, BATCH_CHARS - SEQ_LENGTH - 1, SEQ_LENGTH):
        X[:] = 0
        Y[:] = 0
        for batch_idx in range(BATCH_SIZE):
            start = batch_idx * BATCH_CHARS + i
            for j in range(SEQ_LENGTH):
                X[batch_idx, j, T[start+j]] = 1
                Y[batch_idx, j, T[start+j+1]] = 1

        yield X, Y


def build_model(batch_size, seq_len):
    # stacked stateful LSTMs with a softmax over the vocabulary at every time step
    model = Sequential()
    model.add(LSTM(LSTM_SIZE, return_sequences=True, batch_input_shape=(batch_size, seq_len, vocab_size), stateful=True))
    model.add(Dropout(0.2))
    for l in range(LAYERS - 1):
        model.add(LSTM(LSTM_SIZE, return_sequences=True, stateful=True))
        model.add(Dropout(0.2))

    model.add(TimeDistributedDense(vocab_size))
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adagrad')
    return model


print 'Building model.'
test_model = build_model(1, 1)  # batch of 1, one character at a time, for sampling
training_model = build_model(BATCH_SIZE, SEQ_LENGTH)
print '... done'

def sample(epoch, sample_chars=256):
    # load the weights saved for this epoch and generate sample_chars characters
    test_model.reset_states()
    test_model.load_weights('./tmp/keras_char_rnn.%d.h5' % epoch)
    header = 'LSTM based '
    sampled = [char_to_idx[c] for c in header]

    # warm up the hidden state on the seed string
    for c in header:
        batch = np.zeros((1, 1, vocab_size))
        batch[0, 0, char_to_idx[c]] = 1
        test_model.predict_on_batch(batch)

    # then sample one character at a time from the softmax output
    for i in range(sample_chars):
        batch = np.zeros((1, 1, vocab_size))
        batch[0, 0, sampled[-1]] = 1
        softmax = test_model.predict_on_batch(batch)[0].ravel()
        sample = np.random.choice(range(vocab_size), p=softmax)
        sampled.append(sample)

    print ''.join([idx_to_char[c] for c in sampled])

for epoch in range(100):
    for i, (x, y) in enumerate(read_batches(text)):
        loss = training_model.train_on_batch(x, y)
        print epoch, i, loss

        if i % 1000 == 0:
            training_model.save_weights('./tmp/keras_char_rnn.%d.h5' % epoch, overwrite=True)
            sample(epoch)

The result: the fan on my little MacBook Air whirred away for more than two days, and the RNN went from babbling like a toddler just learning to speak to producing complete words and sentences. It could hardly be more magical.

- epoch:0, batch:1000, loss:2.13260865211
LSTM based fo?

OUREOTIN:
Why rom hames ane ttwe woe, wheu menerk.

SeDIS:
Ay, by shue le art feramt your of
Thin hen a soating lener ti vis
The to mime
Bo  rinh. oliwhew of: she hiant,
Anles woer deie hizh theew on wore,
The te dwyialt im sishor' va.

- epoch:40, batch:1000, loss:1.24812698364
LSTM based sight:
This splits agurated with your cheek to wor,
Even and more perjection of thy life,
You should not visit such aljufper'd up:
And you will did my deft, to know the bases:
Our sweet Worwing thrafts to such convey'd his
own's boar weakness. Take i

- epoch:99, batch:1000, loss:1.11117362976
LSTM based Norfaxen!

GREMIO:
Fight, they know it.

CATESBY:
My Lord of York, here he not appear'd toice.

Widow:
Fetch them well: is thy son was from me all:
Were my behind the earth rabes through himself.

PROSPERO:
I will confess him with your infroch

A Few Tangents

An article on using char-RNN to learn to write Chinese characters

There is also an episode of Black Mirror ("Be Right Back") in which a service learns from a boyfriend's messages and chat logs and then chats with his partner in his place, uncannily lifelike. In hindsight, this is exactly the kind of thing an RNN could be trained to do.
