Understanding and Practicing RNNs

The most intuitive way to tell RNNs and CNNs apart is that a CNN operates on grid-like data and is therefore a natural fit for image processing, while an RNN is built for sequences and is therefore a natural fit for text. My interest in RNNs was sparked by a blog post by 寒小阳 & 龙心尘 showing that a neural network can imitate human writing and churn out paragraphs in the style of "小四" (the novelist Guo Jingming). A fellow student also happened to recommend Keras, the newly popular Python library for machine learning, so this is a good chance to get to know the magical recurrent neural network through both theory and practice.

RNN Structure and Principles, Briefly

First of all, an RNN is still a neural network; its basic structure is essentially the same as that of an ordinary neural network, as shown in the figure below (Ref).

RNN network structure

The difference is that a traditional neural network (a CNN included) assumes that all inputs and outputs are independent of one another, whereas the basic assumption of an RNN is that the elements of the input sequence influence each other. The RNN is called recurrent because the same network is applied repeatedly, once for every time step of the input sequence, and backpropagation is used to iteratively update the weights $(U, V, W)$. This is also why a single RNN layer can be "unrolled" into $n$ layers, where $n$ is the length of the input sequence, i.e. the total number of time steps, and why all the unrolled nodes of one RNN layer share the same set of parameters. The "unrolling" is only conceptual.

The RNN update process

Here $U$ holds the parameters from the input layer to the hidden layer $S$, $W$ holds the parameters carrying the hidden state from time $t$ to time $t+1$, and $V$ holds the parameters from the hidden layer to the output layer. The figure is taken from Part II, Chapter 10, "Sequence Modeling: Recurrent and Recursive Nets", of the Deep Learning book by Bengio and co-authors.

The basic RNN computation is as follows:

  1. $x_t$ is the input at time $t$, for example a one-hot encoding of a word or character;
  2. The hidden state is $s_t=f(Ux_t+Ws_{t-1})$, where $f$ is usually tanh or ReLU; $s_{t-1}$ is the hidden state at the previous time step, and for $t=0$ the state $s_{-1}$ is usually initialized to the zero vector;
  3. The output is $o_t=\mathrm{Softmax}(Vs_t)$.
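
To make the recurrence concrete, here is a minimal numpy sketch of the forward pass described above. The vocabulary size, hidden size, and random initialization are illustrative assumptions of mine; the point is only that the same $U$, $W$ and $V$ are reused at every time step.

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / np.sum(e)

vocab_size, hidden_size = 4, 8  # illustrative sizes, not from the post
U = np.random.randn(hidden_size, vocab_size) * 0.01   # input  -> hidden
W = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden -> hidden
V = np.random.randn(vocab_size, hidden_size) * 0.01   # hidden -> output

def forward(xs):
    # xs is a list of one-hot column vectors, one per time step
    s = np.zeros((hidden_size, 1))                 # s_{-1}: initial hidden state
    outputs = []
    for x in xs:                                   # the same U, W, V at every step
        s = np.tanh(np.dot(U, x) + np.dot(W, s))   # s_t = tanh(U x_t + W s_{t-1})
        outputs.append(softmax(np.dot(V, s)))      # o_t = Softmax(V s_t)
    return outputs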

It is worth noting that although the hidden state is updated over the entire input sequence, because of the way the network's parameters are trained only inputs from the recent past have much influence on the current state (the effect of earlier inputs fades), which roughly matches how human memory works.

That is the simplest form of RNN. Built on top of it are bidirectional RNNs and LSTMs, which I will not expand on here. Let us go straight to the character-level RNN we are about to use.

Character-Level RNN

In one of his blog posts, Karpathy describes the original idea of the character-level RNN (char-RNN for short): throw a pile of text at an RNN and let it learn at the character level, so that starting from a given letter or word it can emit the most likely sequence of characters and assemble them into words or sentences.

For example, take the four letters h, e, l, o. Map each of them to a k-dimensional vector and train the network on the string "hello". Once training is finished, start from the letter h and let the network keep appending the most likely next letter, building up a word.
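
As a small sketch of how that training sample is prepared (the variable names are mine, chosen to mirror the numpy demo further down): each character becomes a one-hot vector, and the target at every step is simply the next character in the string.

import numpy as np

chars = ['h', 'e', 'l', 'o']                      # the 4-character vocabulary
char_to_ix = {ch: i for i, ch in enumerate(chars)}
vocab_size = len(chars)

text = 'hello'
inputs  = [char_to_ix[c] for c in text[:-1]]      # 'h','e','l','l'
targets = [char_to_ix[c] for c in text[1:]]       # 'e','l','l','o' (the next chars)

def one_hot(ix):
    v = np.zeros((vocab_size, 1))                 # k-dimensional vector, k = 4 here
    v[ix] = 1
    return v

xs = [one_hot(ix) for ix in inputs]               # the sequence fed to the RNN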

The example is trivial and the network will of course end up producing "hello"; the point is to understand how char-RNN works. In the first step we feed the RNN the letter "h". Suppose the network assigns a confidence of 1.0 to "h" coming next, 2.2 to "e", -3.0 to "l" and 4.1 to "o". Since the training data is "hello", we want to raise the confidence of the correct next letter "e" (green) and lower the confidences of the other letters (red), and we do the same at every step: raise the confidence of the green letter. Because every part of the RNN is differentiable, backpropagation can be used to work out how to adjust the weights, and we then perform a parameter update. If we feed the RNN the same input again, the confidence of the correct character (e.g. "e" in the first step) will be slightly higher and the confidences of the incorrect characters slightly lower. Repeating this process many times, the network converges and its predictions match the training data: the next character is always the correct one.

Illustration of how char-RNN works
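
The "raise the correct letter, lower the rest" step is just the gradient of the softmax cross-entropy loss. A small illustration, using the confidence numbers quoted above (the formula dy = p; dy[target] -= 1 is the same one that appears in the numpy demo below):

import numpy as np

# unnormalized confidences for the letter after 'h', in the order [h, e, l, o]
y = np.array([1.0, 2.2, -3.0, 4.1])
p = np.exp(y) / np.sum(np.exp(y))   # softmax probabilities

target = 1                          # index of 'e', the correct next letter
dy = p.copy()
dy[target] -= 1                     # gradient of the cross-entropy loss w.r.t. y
# dy is negative only at 'e' and positive everywhere else, so a gradient
# step raises the confidence of 'e' and lowers the other confidences.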

A few more details: a standard softmax classifier is applied to the output at every time step, and all of the outputs are trained simultaneously. The parameters are updated with mini-batch stochastic gradient descent, or alternatively with per-parameter adaptive learning-rate methods such as RMSProp or Adam. Note also that the training data contains two "l"s, and the confidences the network produces at those two positions are different: the RNN conditions on the whole preceding context, not just on the previous letter.
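
For reference, this is roughly what one of the per-parameter adaptive updates mentioned above looks like. It is a generic RMSProp sketch of my own (the numpy demo below actually uses Adagrad), with the learning rate and decay chosen arbitrarily:

import numpy as np

def rmsprop_update(param, grad, cache, lr=1e-3, decay=0.9, eps=1e-8):
    # keep a running average of squared gradients and scale each
    # parameter's step size by it (one RMSProp update)
    cache = decay * cache + (1 - decay) * grad * grad
    param = param - lr * grad / (np.sqrt(cache) + eps)
    return param, cache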

At test time we get a probability distribution over the next character; from this distribution we pick (or sample) the next character, feed it back in, and repeat the process, and then we watch the magic happen!

For teaching purposes, Karpathy wrote a small character-level RNN language-model demo in Python/numpy (the link may require a proxy to reach from mainland China). It is only about 100 lines of code, so I reproduce it here in the hope that it gives you a direct, concrete feel for the model. Karpathy and his collaborators now focus on a faster and more capable Lua/Torch codebase.

"""
Minimal character-level Vanilla RNN model. Written by Andrej Karpathy (@karpathy)
BSD License
"""

import numpy as np

# data I/O 导入英文纯文本文件,建立词到索引、索引到词的映射
data = open('input.txt', 'r').read() # should be simple plain text file
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print 'data has %d characters, %d unique.' % (data_size, vocab_size)
char_to_ix = { ch:i for i,ch in enumerate(chars) }
ix_to_char = { i:ch for i,ch in enumerate(chars) }

# hyperparameters 定义隐藏层的神经元数量、每一步处理多长的序列以及学习速率
hidden_size = 100 # size of hidden layer of neurons
seq_length = 25 # number of steps to unroll the RNN for
learning_rate = 1e-1

# model parameters 随机生成神经网络参数矩阵,Wxh即U,Whh即W,Why即V,以及偏置单元
Wxh = np.random.randn(hidden_size, vocab_size)*0.01 # input to hidden
Whh = np.random.randn(hidden_size, hidden_size)*0.01 # hidden to hidden
Why = np.random.randn(vocab_size, hidden_size)*0.01 # hidden to output
bh = np.zeros((hidden_size, 1)) # hidden bias
by = np.zeros((vocab_size, 1)) # output bias

def lossFun(inputs, targets, hprev):
"""
inputs,targets are both list of integers.
hprev is Hx1 array of initial hidden state
returns the loss, gradients on model parameters, and last hidden state
"""

xs, hs, ys, ps = {}, {}, {}, {}
hs[-1] = np.copy(hprev)
loss = 0
# forward pass 正向传播过程,输入采用1-k向量表示,隐藏层函数采用tanh
for t in xrange(len(inputs)):
xs[t] = np.zeros((vocab_size,1)) # encode in 1-of-k representation
xs[t][inputs[t]] = 1
hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh) # hidden state 参考文献Learning Recurrent Neural Networks with Hessian-Free Optimization
ys[t] = np.dot(Why, hs[t]) + by # unnormalized log probabilities for next chars
ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars
loss += -np.log(ps[t][targets[t],0]) # softmax (cross-entropy loss)
# backward pass: compute gradients going backwards
dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
dbh, dby = np.zeros_like(bh), np.zeros_like(by)
dhnext = np.zeros_like(hs[0])
for t in reversed(xrange(len(inputs))):
dy = np.copy(ps[t])
dy[targets[t]] -= 1 # backprop into y. see http://cs231n.github.io/neural-networks-case-study/#grad if confused here
dWhy += np.dot(dy, hs[t].T)
dby += dy
dh = np.dot(Why.T, dy) + dhnext # backprop into h
dhraw = (1 - hs[t] * hs[t]) * dh # backprop through tanh nonlinearity
dbh += dhraw
dWxh += np.dot(dhraw, xs[t].T)
dWhh += np.dot(dhraw, hs[t-1].T)
dhnext = np.dot(Whh.T, dhraw)
for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
np.clip(dparam, -5, 5, out=dparam) # clip to mitigate exploding gradients
return loss, dWxh, dWhh, dWhy, dbh, dby, hs[len(inputs)-1]

def sample(h, seed_ix, n):
"""
sample a sequence of integers from the model
h is memory state, seed_ix is seed letter for first time step
"""

x = np.zeros((vocab_size, 1))
x[seed_ix] = 1
ixes = []
for t in xrange(n):
h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h) + bh)
y = np.dot(Why, h) + by
p = np.exp(y) / np.sum(np.exp(y))
ix = np.random.choice(range(vocab_size), p=p.ravel())
x = np.zeros((vocab_size, 1))
x[ix] = 1
ixes.append(ix)
return ixes

n, p = 0, 0
mWxh, mWhh, mWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
mbh, mby = np.zeros_like(bh), np.zeros_like(by) # memory variables for Adagrad
smooth_loss = -np.log(1.0/vocab_size)*seq_length # loss at iteration 0
while True:
# prepare inputs (we're sweeping from left to right in steps seq_length long)
if p+seq_length+1 >= len(data) or n == 0:
hprev = np.zeros((hidden_size,1)) # reset RNN memory
p = 0 # go from start of data
inputs = [char_to_ix[ch] for ch in data[p:p+seq_length]]
targets = [char_to_ix[ch] for ch in data[p+1:p+seq_length+1]]

# sample from the model now and then
if n % 100 == 0:
sample_ix = sample(hprev, inputs[0], 200)
txt = ''.join(ix_to_char[ix] for ix in sample_ix)
print '----\n %s \n----' % (txt, )

# forward seq_length characters through the net and fetch gradient
loss, dWxh, dWhh, dWhy, dbh, dby, hprev = lossFun(inputs, targets, hprev)
smooth_loss = smooth_loss * 0.999 + loss * 0.001
if n % 100 == 0: print 'iter %d, loss: %f' % (n, smooth_loss) # print progress

# perform parameter update with Adagrad
for param, dparam, mem in zip([Wxh, Whh, Why, bh, by],
[dWxh, dWhh, dWhy, dbh, dby],
[mWxh, mWhh, mWhy, mbh, mby]):
mem += dparam * dparam
param += -learning_rate * dparam / np.sqrt(mem + 1e-8) # adagrad update

p += seq_length # move data pointer
n += 1 # iteration counter

Keras

The author of Keras is either Greek or very fond of literature, because keras means horn in Greek. The literary image goes back to ancient Greek and Latin literature, first appearing in the Odyssey, where dream spirits are divided into those that deceive with false visions and reach mortals through a gate of ivory, and those that announce a future that will come to pass and arrive through a gate of horn. Keras is one of the high-level deep-learning wrapper libraries that have been getting more and more popular lately; a quick list of pros and cons follows. After three minutes with the documentation you can put together a deep-learning prototype, with pluggable neural-network layers, which is hard to praise enough. On the other hand it is not easy to debug, and it does little to deepen your understanding of the underlying principles.

Project: https://github.com/fchollet/keras
Documentation: http://keras.io/

Pros:

  • The documentation is thorough and detailed.
  • It provides a fairly high-level framework, so building a deep-learning prototype is very easy (see the short sketch after these lists).
  • It is updated frequently, is Python-based, and supports computation on both CPU and GPU.
  • The backend is now switchable: you can choose between Theano and TensorFlow.

Cons:

  • To really understand the underlying principles, you are still better off building things by hand.
  • Runtime efficiency is relatively low.
  • Debugging is not easy.
  • It changes too fast: Keras-based code found on GitHub usually has to be revised against the latest documentation before it will run.
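
To illustrate the "prototype in minutes" point above, here is a minimal sketch of a tiny classifier in the Sequential API. The layer sizes and the (commented-out) training call are made up for illustration, and it assumes the same Keras 1.x-era API used in the char-RNN script below:

from keras.models import Sequential
from keras.layers.core import Dense, Activation

model = Sequential()                    # layers are simply stacked in order
model.add(Dense(64, input_dim=100))     # 100 input features -> 64 hidden units
model.add(Activation('relu'))
model.add(Dense(10))                    # 10 output classes
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='sgd')
# model.fit(X_train, y_train, nb_epoch=10, batch_size=32)  # hypothetical data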

Char-RNN Using Keras

Since this is only a first try, I grabbed a Keras char-RNN script from GitHub, modified it a little, and ran it on the script of Shakespeare's The Tragedy of Julius Caesar.

# -*- coding: utf-8 -*-
"""
Created on Sat May 21 14:34:08 2016

@author: yangsicong
"""


import numpy as np

from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout, TimeDistributedDense
from keras.layers.recurrent import LSTM

text = open('./input.txt', 'r').read()
char_to_idx = { ch: i for (i, ch) in enumerate(sorted(list(set(text)))) }
idx_to_char = { i: ch for (ch, i) in char_to_idx.items() }
vocab_size = len(char_to_idx)

print('Working on %d characters (%d unique)' % (len(text), vocab_size))

SEQ_LENGTH = 64
BATCH_SIZE = 16
BATCH_CHARS = len(text) / BATCH_SIZE
LSTM_SIZE = 512
LAYERS = 3

def read_batches(text):
    # yield (X, Y) batches of one-hot encoded characters, with Y shifted one step ahead of X
    T = np.asarray([char_to_idx[c] for c in text], dtype=np.int32)
    X = np.zeros((BATCH_SIZE, SEQ_LENGTH, vocab_size))
    Y = np.zeros((BATCH_SIZE, SEQ_LENGTH, vocab_size))

    for i in range(0, BATCH_CHARS - SEQ_LENGTH - 1, SEQ_LENGTH):
        X[:] = 0
        Y[:] = 0
        for batch_idx in range(BATCH_SIZE):
            start = batch_idx * BATCH_CHARS + i
            for j in range(SEQ_LENGTH):
                X[batch_idx, j, T[start+j]] = 1
                Y[batch_idx, j, T[start+j+1]] = 1

        yield X, Y


def build_model(batch_size, seq_len):
    # stacked stateful LSTMs with a softmax over the vocabulary at every time step
    model = Sequential()
    model.add(LSTM(LSTM_SIZE, return_sequences=True, batch_input_shape=(batch_size, seq_len, vocab_size), stateful=True))
    model.add(Dropout(0.2))
    for l in range(LAYERS - 1):
        model.add(LSTM(LSTM_SIZE, return_sequences=True, stateful=True))
        model.add(Dropout(0.2))

    model.add(TimeDistributedDense(vocab_size))
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adagrad')
    return model


print 'Building model.'
test_model = build_model(1, 1)  # batch of 1, one character at a time, for sampling
training_model = build_model(BATCH_SIZE, SEQ_LENGTH)
print '... done'

def sample(epoch, sample_chars=256):
    # load the weights saved for this epoch and generate sample_chars characters
    test_model.reset_states()
    test_model.load_weights('./tmp/keras_char_rnn.%d.h5' % epoch)
    header = 'LSTM based '
    sampled = [char_to_idx[c] for c in header]

    # warm up the hidden state on the seed string
    for c in header:
        batch = np.zeros((1, 1, vocab_size))
        batch[0, 0, char_to_idx[c]] = 1
        test_model.predict_on_batch(batch)

    # then sample one character at a time from the softmax output
    for i in range(sample_chars):
        batch = np.zeros((1, 1, vocab_size))
        batch[0, 0, sampled[-1]] = 1
        softmax = test_model.predict_on_batch(batch)[0].ravel()
        sample = np.random.choice(range(vocab_size), p=softmax)
        sampled.append(sample)

    print ''.join([idx_to_char[c] for c in sampled])

for epoch in range(100):
    for i, (x, y) in enumerate(read_batches(text)):
        loss = training_model.train_on_batch(x, y)
        print epoch, i, loss

        if i % 1000 == 0:
            training_model.save_weights('./tmp/keras_char_rnn.%d.h5' % epoch, overwrite=True)
            sample(epoch)

The result: the fan on my little MacBook Air whirred away for more than two days, and the RNN went from babbling like a toddler just learning to speak to producing complete words and sentences. It could hardly be more magical.

- epoch:0, batch:1000, loss:2.13260865211
LSTM based fo?

OUREOTIN:
Why rom hames ane ttwe woe, wheu menerk.

SeDIS:
Ay, by shue le art feramt your of
Thin hen a soating lener ti vis
The to mime
Bo  rinh. oliwhew of: she hiant,
Anles woer deie hizh theew on wore,
The te dwyialt im sishor' va.

- epoch:40, batch:1000, loss:1.24812698364
LSTM based sight:
This splits agurated with your cheek to wor,
Even and more perjection of thy life,
You should not visit such aljufper'd up:
And you will did my deft, to know the bases:
Our sweet Worwing thrafts to such convey'd his
own's boar weakness. Take i

- epoch:99, batch:1000, loss:1.11117362976
LSTM based Norfaxen!

GREMIO:
Fight, they know it.

CATESBY:
My Lord of York, here he not appear'd toice.

Widow:
Fetch them well: is thy son was from me all:
Were my behind the earth rabes through himself.

PROSPERO:
I will confess him with your infroch

A Few Tangents

An article on using char-RNN to learn to write Chinese characters

There is also an episode of Black Mirror ("Be Right Back") in which a service learns from a boyfriend's messages and chat logs and then chats with his partner in his place, uncannily lifelike. In hindsight, this is exactly the kind of thing an RNN could be trained to do.
