NLP: Study Notes on Memory Networks

2018-02-11

1. The Original Memory Network

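Memory Networks were proposed by Jason Weston et al. at FAIR in 2015, mainly for text question answering, where they delivered a significant improvement in F1 over earlier methods. The core observation is that typical recurrent neural networks (vanilla RNN, LSTM, GRU) rely on a single state vector to remember the sequence so far; this limits their memory capacity and makes it hard to retain facts seen far in the past. Memory Networks therefore add a separate memory component that can be read from and written to.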

1.1 Model structure

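The input component $I$ applies a feature transformation to the raw input data.
The memory component $G$ (called generalization in the paper) maintains the slots that store the input feature vectors; each time a new input arrives, it updates the existing memory slots.
The output component $O$ selects the memories that match the input and combines the input with those memories to produce an output representation.
The response component $R$ turns the output $o$ of $O$ into the final response.
The overall flow of the model has four steps:
- Input transformation: $x \rightarrow I(x)$
- Memory update given the new input: $m_{i}=G(m_{i}, I(x), m), \forall i$
- Output computation: $o=O(I(x), m)$
- Response generation from the output: $r=R(o)$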
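
The four steps above can be sketched as a toy pipeline. This is only an illustration under strong simplifying assumptions: features are bag-of-words counts, $G$ always writes to a new slot, $O$ picks a single memory by dot product, and $R$ answers with the last word of the selected fact. The class name `ToyMemoryNetwork` and all of its methods are hypothetical and not taken from the paper.

```python
import numpy as np


class ToyMemoryNetwork:
    """Toy I/G/O/R pipeline. All names and design choices are illustrative."""

    def __init__(self, vocab):
        self.word_to_id = {word: idx for idx, word in enumerate(vocab)}
        self.memory = []  # one (raw_text, feature_vector) pair per slot

    def I(self, text):
        # Input component: raw text -> bag-of-words feature vector.
        features = np.zeros(len(self.word_to_id))
        for word in text.lower().split():
            if word in self.word_to_id:
                features[self.word_to_id[word]] += 1.0
        return features

    def G(self, text):
        # Memory update: the simplest choice, write the input to a new slot.
        self.memory.append((text, self.I(text)))

    def O(self, question):
        # Output component: return the slot that best matches the question.
        q = self.I(question)
        scores = [float(q @ features) for _, features in self.memory]
        return self.memory[int(np.argmax(scores))]

    def R(self, output):
        # Response component: answer with the last word of the selected fact.
        text, _ = output
        return text.split()[-1]

    def answer(self, question):
        return self.R(self.O(question))


vocab = ["john", "mary", "went", "to", "the", "kitchen", "garden", "where", "is"]
net = ToyMemoryNetwork(vocab)
for fact in ["John went to the kitchen", "Mary went to the garden"]:
    net.G(fact)
print(net.answer("Where is Mary"))
```

A real Memory Network learns the feature maps and scoring functions instead of hard-coding them; the sketch only shows how the four components hand data to each other.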

1.2 Text question answering

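For text question answering, the core inference happens in the output module O and the response module R.
- The O module selects k (k=2) supporting memories, one per hop, with the second hop conditioned on the memory chosen in the first: $o_{1}=\arg\max\limits_{i=1,2,\dots,N}s_{O}(x, m_{i})$, $o_{2}=\arg\max\limits_{i=1,2,\dots,N}s_{O}([x, m_{o_{1}}], m_{i})$
- The R module selects the word of the final response: $r=\arg\max_{w \in W}s_{R}([x,m_{o_{1}},m_{o_{2}}], w)$, where $W$ is the set of all words in the vocabulary.
- Scoring function: $s(x,y)=\Phi_{x}(x)^{T}U^{T}U\Phi_{y}(y)$, where $\Phi$ maps its input to a feature vector and $U$ is the weight matrix to be learned.
- Training objective
  - A margin ranking loss is used, similar to the max-margin loss of an SVM.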
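
A minimal NumPy sketch of the scoring function, the two-hop memory selection, and a hinge-style margin ranking loss is given below. Several details are assumptions made for illustration only: the feature vectors are given directly instead of being computed by $\Phi$, the hop-2 query is formed by summing the question and the first memory, the already selected slot is masked out, and the margin value is arbitrary. The function names are hypothetical.

```python
import numpy as np


def score(phi_x, phi_y, U):
    # s(x, y) = Phi_x(x)^T U^T U Phi_y(y): a dot product in the embedded space.
    return float((U @ phi_x) @ (U @ phi_y))


def select_supporting_memories(phi_q, memories, U):
    # Hop 1: the memory that best matches the question alone.
    o1 = int(np.argmax([score(phi_q, m, U) for m in memories]))
    # Hop 2: condition on the question plus the first memory. Summing the two
    # feature vectors and masking slot o1 are simplifying assumptions.
    query = phi_q + memories[o1]
    scores = np.array([score(query, m, U) for m in memories])
    scores[o1] = -np.inf
    o2 = int(np.argmax(scores))
    return o1, o2


def margin_ranking_loss(pos_score, neg_scores, margin=0.1):
    # Hinge penalty for every wrong candidate that scores within `margin`
    # of the correct one; the margin value here is arbitrary.
    return float(np.sum(np.maximum(0.0, margin - pos_score + np.asarray(neg_scores))))


U = np.eye(4)  # identity stands in for a learned embedding matrix
memories = [np.array([1.0, 1.0, 0.0, 0.0]),
            np.array([0.0, 1.0, 1.0, 0.0]),
            np.array([0.0, 0.0, 0.0, 1.0])]
phi_q = np.array([1.0, 0.0, 0.0, 0.0])
o1, o2 = select_supporting_memories(phi_q, memories, U)
loss = margin_ranking_loss(1.0, [0.5, 0.95, 2.0])
```

During training, the loss would be summed over the wrong candidates at each selection step (first memory, second memory, response word), so that the correct choice outranks every alternative by at least the margin.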

2. Dynamic Memory Network

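The Dynamic Memory Network (DMN) was proposed by Ankit Kumar et al. in 2016, who also pointed out that many NLP tasks can be cast as QA tasks. DMN achieved strong results on several different tasks: text question answering, text classification, and part-of-speech tagging.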

2.1 Model architecture

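A separate episodic memory module: the episodic memory module usually needs several update passes, and the experimental results show that multiple passes are essential for tasks that require reasoning. In the model's architecture figure, the two lines correspond to two passes over the input facts; the number of passes is a hyperparameter that has to be tuned. The episode $e^{i}$ of pass $i$ is produced from the fact representations $c_{t}$, the question representation $q$, and the previous memory $m^{i-1}$, where $g_{t}^{i}$ is the attention gate of fact $t$ in pass $i$.
- $h_{t}^{i} = g_{t}^{i}GRU(c_{t}, h_{t-1}^{i})+(1-g_{t}^{i})h_{t-1}^{i}$
- $e^{i} = h_{T_{C}}^{i}$, where $T_{C}$ is the number of input facts
Memory module: the new memory is produced from the previous memory and the current episode, which in turn depends on the question representation. The memory is initialized as $m^{0}=q$.
- $m^{i} = GRU(e^{i}, m^{i-1})$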
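
The gated episode update and the memory update can be sketched in NumPy as follows. This is an illustration, not the paper's implementation: the GRU omits biases, the weights are random, and the attention gates are hand-picked constants instead of being computed from $c_{t}$, $m^{i-1}$ and $q$. The helper names `init_gru_params`, `gru_step` and `compute_episode` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_gru_params(input_dim, hidden_dim, rng):
    # Weight pairs (input, recurrent) for update gate, reset gate and candidate.
    return [rng.normal(scale=0.1, size=(hidden_dim, dim))
            for dim in (input_dim, hidden_dim) * 3]


def gru_step(x, h, params):
    # Standard GRU step (biases omitted for brevity).
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(Wz @ x + Uz @ h)
    r = sigmoid(Wr @ x + Ur @ h)
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))
    return z * h_tilde + (1.0 - z) * h


def compute_episode(facts, gates, params):
    # h_t = g_t * GRU(c_t, h_{t-1}) + (1 - g_t) * h_{t-1};  e = h_{T_C}
    h = np.zeros(params[1].shape[0])
    for c_t, g_t in zip(facts, gates):
        h = g_t * gru_step(c_t, h, params) + (1.0 - g_t) * h
    return h


rng = np.random.default_rng(0)
dim = 4
episode_params = init_gru_params(dim, dim, rng)
memory_params = init_gru_params(dim, dim, rng)
facts = rng.normal(size=(3, dim))  # c_1 .. c_3
q = rng.normal(size=dim)

memory = q  # m^0 = q
for gates in ([1.0, 0.0, 0.0], [0.0, 0.0, 1.0]):  # hand-picked gates, one list per pass
    episode = compute_episode(facts, gates, episode_params)
    memory = gru_step(episode, memory, memory_params)  # m^i = GRU(e^i, m^{i-1})
```

A gate of 0 leaves the hidden state untouched, so a fact the model does not attend to has no influence on the episode; this is what lets different passes focus on different facts.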
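The input representation, the question representation, and the answer generation are all based on GRUs.
An attention mechanism computes a matching score between each fact representation $c$, the internal memory $m$, and the question representation $q$. Overall, the attention computation is fairly involved.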
- $z(c,m,q)=[c, m, q, c\circ q, c\circ m, |c-q|, |c-m|, c^{T}W^{(b)}q, c^{T}W^{(b)}m]$
- $G(c,m,q)=\sigma(W^{(2)}\tanh(W^{(1)}z(c,m,q)+b^{(1)})+b^{(2)})$
- $g_{t}^{i} = G(c_{t}, m^{i-1}, q)$
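
The gate computation can be sketched in NumPy as below. The weights are random placeholders, the hidden size of the two-layer network is an arbitrary choice, and the feature vector includes the element-wise product $c\circ m$ that the DMN paper lists next to $c\circ q$. The function and parameter names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gate_features(c, m, q, W_b):
    # z(c, m, q): raw vectors, element-wise products, absolute differences,
    # and two bilinear similarity scalars.
    return np.concatenate([
        c, m, q,
        c * q, c * m,
        np.abs(c - q), np.abs(c - m),
        [c @ W_b @ q], [c @ W_b @ m],
    ])


def gate(c, m, q, p):
    # G(c, m, q) = sigmoid(W2 tanh(W1 z + b1) + b2), a scalar in (0, 1).
    z = gate_features(c, m, q, p["W_b"])
    hidden = np.tanh(p["W1"] @ z + p["b1"])
    return float(sigmoid(p["W2"] @ hidden + p["b2"]))


rng = np.random.default_rng(0)
d, hidden_dim = 4, 8
params = {
    "W_b": rng.normal(size=(d, d)),
    "W1": rng.normal(size=(hidden_dim, 7 * d + 2)),
    "b1": np.zeros(hidden_dim),
    "W2": rng.normal(size=hidden_dim),
    "b2": 0.0,
}
facts = rng.normal(size=(3, d))
q = rng.normal(size=d)
m_prev = q  # m^0 = q
gates = [gate(c_t, m_prev, q, params) for c_t in facts]  # g_t for one pass
```

For vectors of dimension d the feature vector has 7d + 2 entries, which is why `W1` has that many columns.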
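Compared with the original Memory Network, the Dynamic Memory Network adds an attention mechanism, and its memory is differentiable end to end.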

2.2 Code implementation

The complete DMN code is shown below; see the comments for details.

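import numpy as np
import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder, tf.contrib)

import babi_input  # bAbI loading helpers, see reference [3]
# AttentionGRUCell is defined in the second listing below.


class DynamicMemoryNetwork(object):
    '''
    Defines the structure of the Dynamic Memory Network
    '''
    def __init__(self, config):
        self.config = config  # hyperparameter configuration for the whole network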
        if self.config.train_mode:
            self.train, self.valid, self.word_embedding, self.max_q_len, self.max_sentences, self.max_sen_len, \
            self.num_supporting_facts, self.vocab_size = babi_input.load_babi(self.config, split_sentences=True)
        else:
            self.test, self.word_embedding, self.max_q_len, self.max_sentences, self.max_sen_len, \
            self.num_supporting_facts, self.vocab_size = babi_input.load_babi(self.config, split_sentences=True)
        self.question_holder = tf.placeholder(tf.int32, shape=(self.config.batch_size, self.max_q_len))
        self.input_placeholder = tf.placeholder(tf.int32, shape=(self.config.batch_size, self.max_sentences, self.max_sen_len))
        self.question_len_holder = tf.placeholder(tf.int32, shape=(self.config.batch_size, ))
        self.input_len_holder = tf.placeholder(tf.int32, shape=(self.config.batch_size, ))
        self.answer_holder = tf.placeholder(tf.int64, shape=(self.config.batch_size, ))
        self.rel_label_holder = tf.placeholder(tf.int32, shape=(self.config.batch_size, self.num_supporting_facts))
        self.dropout_placeholder = tf.placeholder(tf.float32)

        # Position encoding: weights each word embedding by its position in the sentence
        encoding = np.ones((self.config.embed_size, self.max_sen_len), dtype=np.float32)
        for i in range(1, self.config.embed_size+1):
            for j in range(1, self.max_sen_len+1):
                encoding[i-1, j-1] = (i - (self.config.embed_size-1)/2)*(j - (self.max_sen_len-1)/2)
        encoding = 1 + 4 * encoding / (self.config.embed_size*self.max_sen_len)
        self.encoding = np.transpose(encoding)
        self.embeddings = tf.Variable(self.word_embedding.astype(np.float32), name="embedding")

        self.output = self.inference()
        self.pred = self.get_predictions(self.output)
        self.calculate_loss = self.add_loss_op(self.output)
        self.train_step = self.add_training_op(self.calculate_loss)
        self.merged = tf.summary.merge_all()

    def get_input_represetation(self):
        '''
        Compute the representation of the input facts
        :return:
        '''
        inputs = tf.nn.embedding_lookup(self.embeddings, self.input_placeholder)
        inputs = tf.reduce_sum(inputs*self.encoding, 2)
        forward_gru_cell = tf.contrib.rnn.GRUCell(self.config.hidden_size)
        backward_gru_cell = tf.contrib.rnn.GRUCell(self.config.hidden_size)
        outputs, _ = tf.nn.bidirectional_dynamic_rnn(
            forward_gru_cell,
            backward_gru_cell,
            inputs,
            dtype=np.float32,
            sequence_length=self.input_len_holder
        )
        fact_vecs = tf.reduce_sum(tf.stack(outputs), axis=0)
        fact_vecs = tf.nn.dropout(fact_vecs, self.dropout_placeholder)

        return fact_vecs

    def get_question_representation(self):
        '''
        Compute the representation of the input question
        :return:
        '''
        questions = tf.nn.embedding_lookup(self.embeddings, self.question_holder)
        gru_cell = tf.contrib.rnn.GRUCell(self.config.hidden_size)
        _, q_vec = tf.nn.dynamic_rnn(
            gru_cell,
            questions,
            dtype=np.float32,
            sequence_length=self.question_len_holder
        )

        return q_vec

    def get_attention(self, q_vec, fact_vec, prev_memory, reuse):
        '''
        Compute the attention score of one fact
        :param q_vec:
        :param fact_vec:
        :param prev_memory:
        :param reuse:
        :return:
        '''
        with tf.variable_scope('attention', reuse=reuse):
            features = [fact_vec*q_vec, fact_vec*prev_memory, tf.abs(fact_vec - q_vec), tf.abs(fact_vec - prev_memory)]
            feature_vec = tf.concat(features, 1)
            attention = tf.contrib.layers.fully_connected(feature_vec,
                                                          self.config.embed_size,
                                                          activation_fn=tf.nn.tanh,
                                                          reuse=reuse,
                                                          scope='fc1')
            attention = tf.contrib.layers.fully_connected(attention,
                                                          1,
                                                          activation_fn=None,
                                                          reuse=reuse,
                                                          scope='fc2')
            return attention

    def generate_episode(self, memory, q_vec, fact_vecs, hop_index):
        '''
        Compute the episode for the next pass
        :param memory:
        :param q_vec:
        :param fact_vecs:
        :param hop_index:
        :return:
        '''
        # Argument order must match get_attention(q_vec, fact_vec, prev_memory, reuse)
        attentions = [tf.squeeze(self.get_attention(q_vec, fv, memory, bool(hop_index) or bool(i)), axis=1)
                      for i, fv in enumerate(tf.unstack(fact_vecs, axis=1))]
        attentions = tf.transpose(tf.stack(attentions))
        self.attentions.append(attentions)
        attentions = tf.nn.softmax(attentions)
        attentions = tf.expand_dims(attentions, axis=-1)
        reuse = True if hop_index > 0 else False
        print('fact_vecs:', fact_vecs.shape)
        print('attentions:', attentions.shape)
        gru_inputs = tf.concat([fact_vecs, attentions], 2)
        print('gru_inputs:', gru_inputs.shape)
        with tf.variable_scope('attention_gru', reuse=reuse):
            _, episode = tf.nn.dynamic_rnn(
                AttentionGRUCell(self.config.hidden_size),
                gru_inputs,
                dtype=np.float32,
                sequence_length=self.input_len_holder
            )
        return episode

    def add_answer_module(self, rnn_output, q_vec):
        '''
        Compute the answer logits
        :param rnn_output:
        :param q_vec:
        :return:
        '''
        rnn_output = tf.nn.dropout(rnn_output, self.dropout_placeholder)
        output = tf.layers.dense(
            tf.concat([rnn_output, q_vec], 1),
            self.vocab_size,
            activation=None
        )

        return output

    def inference(self):
        '''
        Compute the output representation from the memory
        :return:
        '''
        with tf.variable_scope('question', initializer=tf.contrib.layers.xavier_initializer()):
            q_vec = self.get_question_representation()

        with tf.variable_scope('input', initializer=tf.contrib.layers.xavier_initializer()):
            fact_vecs = self.get_input_represetation()

        self.attentions = []

        with tf.variable_scope('memory', initializer=tf.contrib.layers.xavier_initializer()):
            prev_memory = q_vec

            for i in range(self.config.num_hops):
                episode = self.generate_episode(prev_memory, q_vec, fact_vecs, i)
                with tf.variable_scope('hop_%d' % i):
                    prev_memory = tf.layers.dense(
                        tf.concat([prev_memory, episode, q_vec], 1),
                        self.config.hidden_size,
                        activation=tf.nn.relu
                    )
            output = prev_memory

        with tf.variable_scope('answer', initializer=tf.contrib.layers.xavier_initializer()):
            output = self.add_answer_module(output, q_vec)

        return output

    def get_predictions(self, output):
        preds = tf.nn.softmax(output)
        return tf.argmax(preds, 1)

    def add_loss_op(self, output):
        gate_loss = 0
        if self.config.strong_supervision:
            for i, attention in enumerate(self.attentions):
                labels = tf.gather(tf.transpose(self.rel_label_holder), 0)
                gate_loss += tf.reduce_sum(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=attention, labels=labels))

        loss = self.config.beta*tf.reduce_sum(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=output, labels=self.answer_holder)) + gate_loss

        for v in tf.trainable_variables():
            if 'bias' not in v.name.lower():
                loss += self.config.l2 * tf.nn.l2_loss(v)

        tf.summary.scalar('loss', loss)

        return loss

    def add_training_op(self, loss):
        optimizer = tf.train.AdamOptimizer(learning_rate=self.config.lr)
        grad_and_vars = optimizer.compute_gradients(loss)

        if self.config.clip_grads:
            grad_and_vars = [(tf.clip_by_norm(grad, self.config.max_grad_value), var) for grad, var in grad_and_vars]

        train_op = optimizer.apply_gradients(grad_and_vars)
        return train_op

    def run_epoch(self, session, data, num_epoch=0, train_writer=None, train_op=None, verbose=2, train=False):
        '''
        Compute the average loss and accuracy on the training or test set for the current epoch
        :param session:
        :param data:
        :param num_epoch:
        :param train_writer:
        :param train_op:
        :param verbose:
        :param train:
        :return:
        '''
        dropout = self.config.dropout
        total_steps = len(data[0]) // self.config.batch_size  # number of full batches
        total_loss = []
        accuracy = 0
        p = np.random.permutation(len(data[0]))
        qp, ip, ql, il, im, a, r = data
        qp, ip, ql, il, im, a, r = qp[p], ip[p], ql[p], il[p], im[p], a[p], r[p]
        for step in range(total_steps):
            index = range(step*self.config.batch_size, (step+1)*self.config.batch_size)
            feed = {
                self.question_holder: qp[index],
                self.input_placeholder: ip[index],
                self.question_len_holder: ql[index],
                self.input_len_holder: il[index],
                self.answer_holder: a[index],
                self.rel_label_holder: r[index],
                self.dropout_placeholder: dropout
            }
            if train_op is not None:
                loss, pred, summary, _ = session.run([self.calculate_loss, self.pred, self.merged, train_op],
                                                     feed_dict=feed)
            else:
                loss, pred, summary = session.run([self.calculate_loss, self.pred, self.merged], feed_dict=feed)
            if train_writer is not None:
                train_writer.add_summary(summary, num_epoch*total_steps+step)
            answers = a[step*self.config.batch_size:(step+1)*self.config.batch_size]
            accuracy += np.sum(answers == pred) / float(len(answers))
            total_loss.append(loss)

        return np.mean(total_loss), accuracy/float(total_steps)
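
The position encoding built in `__init__` and applied in `get_input_represetation` can be reproduced on its own in NumPy. The sketch below is a vectorized version of the double loop in the listing, evaluated with true division as under Python 3; the function names `position_encoding` and `encode_sentence` are hypothetical, and the all-ones word embeddings are placeholders.

```python
import numpy as np


def position_encoding(embed_size, max_sen_len):
    # Centered dimension index i and word position j (both 1-based),
    # using true division as under Python 3.
    i = np.arange(1, embed_size + 1) - (embed_size - 1) / 2
    j = np.arange(1, max_sen_len + 1) - (max_sen_len - 1) / 2
    encoding = 1 + 4 * np.outer(j, i) / (embed_size * max_sen_len)
    return encoding.astype(np.float32)  # shape: (max_sen_len, embed_size)


def encode_sentence(word_vectors, encoding):
    # Weighted sum over the word axis, like tf.reduce_sum(inputs * encoding, 2)
    # applied to a single sentence.
    return np.sum(word_vectors * encoding, axis=0)


encoding = position_encoding(embed_size=4, max_sen_len=3)
sentence = np.ones((3, 4), dtype=np.float32)  # placeholder word embeddings
sentence_vector = encode_sentence(sentence, encoding)
```

Unlike a plain bag-of-words sum, this weighting makes the sentence vector depend on word order, since the same word receives a different weight at each position.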

The code of the attention-based AttentionGRUCell class is as follows.

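import tensorflow as tf
from tensorflow.contrib.rnn import RNNCell  # TensorFlow 1.x
from tensorflow.python.util import nest


class AttentionGRUCell(RNNCell):
    '''
    GRU cell in which an external attention score replaces the update gate
    '''

    def __init__(self, num_units, activation=tf.tanh):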
        RNNCell.__init__(self)
        self._num_units = num_units
        self._activation_fn = activation

    @property
    def output_size(self):
        return self._num_units

    @property
    def state_size(self):
        return self._num_units

    def __call__(self, inputs, state, scope=None):
        with tf.variable_scope(scope or 'attention_gru_cell'):
            with tf.variable_scope('gates'):
                inputs, z = tf.split(inputs, num_or_size_splits=[self._num_units, 1], axis=1)
                r = self._linear([inputs, state], self._num_units, True)
                r = tf.nn.sigmoid(r)
            with tf.variable_scope('candidate'):
                r = r * self._linear(state, self._num_units, False)
            with tf.variable_scope('input'):
                x = self._linear(inputs, self._num_units,True)
            h_hat = self._activation_fn(r + x)
            new_h = (1 - z) * state + z * h_hat
        return new_h, new_h

    def _linear(self, args, output_size, bias, bias_start=0.0):
        if not nest.is_sequence(args):
            args = [args]
        total_arg_size = 0
        shapes = [a.get_shape() for a in args]
        for shape in shapes:
            total_arg_size += shape[1].value

        dtype = [a.dtype for a in args][0]
        scope = tf.get_variable_scope()
        with tf.variable_scope(scope) as outer_scope:
            weights = tf.get_variable('weights', [total_arg_size, output_size], dtype=dtype)
            if len(args) == 1:
                res = tf.matmul(args[0], weights)
            else:
                res = tf.matmul(tf.concat(args, 1), weights)
            if not bias:
                return res
            with tf.variable_scope(outer_scope) as inner_scope:
                inner_scope.set_partitioner(None)
                biases = tf.get_variable('biases', [output_size], dtype=dtype,
                                         initializer=tf.constant_initializer(bias_start, dtype=dtype))
            return tf.nn.bias_add(res, biases)
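
To make the arithmetic of `__call__` easier to follow, here is a single step of the same cell written in NumPy. It is a sketch with random weights; the parameter names (`W_gates`, `U_candidate`, `W_input` and the biases) are hypothetical labels for the variables created in the `gates`, `candidate` and `input` scopes.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def attention_gru_step(x, h, g, p):
    # Reset gate from the concatenated input and state ('gates' scope).
    r = sigmoid(np.concatenate([x, h]) @ p["W_gates"] + p["b_gates"])
    # Candidate state: the reset gate scales the recurrent term
    # ('candidate' and 'input' scopes).
    h_hat = np.tanh(r * (h @ p["U_candidate"]) + x @ p["W_input"] + p["b_input"])
    # The attention score g takes the place of the GRU update gate.
    return (1.0 - g) * h + g * h_hat


rng = np.random.default_rng(0)
n = 4
params = {
    "W_gates": rng.normal(size=(2 * n, n)),
    "b_gates": np.zeros(n),
    "U_candidate": rng.normal(size=(n, n)),
    "W_input": rng.normal(size=(n, n)),
    "b_input": np.zeros(n),
}
x, h = rng.normal(size=n), rng.normal(size=n)
skipped = attention_gru_step(x, h, 0.0, params)   # g = 0: state is carried over
attended = attention_gru_step(x, h, 1.0, params)  # g = 1: state becomes the candidate
```

In the TensorFlow listing the score arrives as the last column of `inputs` and is separated with `tf.split`, which is why `generate_episode` concatenates the attentions onto the fact vectors before calling `tf.nn.dynamic_rnn`.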

For the preprocessing and loading code of the bAbI dataset, see link [3] in the references at the end of this post.

Elias Xu
