
Encoding, Decoding, and Learning: The Math of Sequence-to-Sequence Translation

In this post, you’ll explore the mathematical foundations of sequence-to-sequence (seq2seq) models for machine translation. We’ll focus on the RNN Encoder–Decoder architecture, which consists of two main components: an encoder RNN and a decoder RNN. As an example, we’ll see how this model can be trained to translate English phrases into French 🇬🇧➡️🇫🇷.

Definition

Sequence-to-Sequence (Seq2Seq) Translation

Sequence-to-Sequence (Seq2Seq) is a deep learning framework designed to convert an input sequence (e.g., a sentence in French) into an output sequence (e.g., its English translation). It consists of two main components:

  •  Encoder – Processes the input sequence and compresses it into a context vector, a fixed-length representation capturing the input’s semantics.

  •  Decoder – Generates the output sequence step-by-step, conditioned on the context vector from the encoder.

Key Features

  • Handles variable-length input and output sequences (e.g., translating sentences of different lengths).
  • Supports various architectures: RNNs, LSTMs, GRUs, and more recently, Transformers (state-of-the-art).
  • Trained end-to-end using teacher forcing, i.e., during training the decoder is fed the true previous token rather than its own prediction.
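
To make the teacher-forcing idea concrete, here is a tiny sketch (PyTorch assumed; the token ids are hypothetical) showing that the decoder’s training inputs are the shifted ground-truth tokens, while its predictions are only compared against the targets in the loss:

```python
import torch

# Hypothetical target sentence as token ids: <sos> w1 w2 w3 <eos>
target = torch.tensor([1, 45, 812, 97, 2])
decoder_inputs  = target[:-1]   # <sos> w1 w2 w3  -> fed to the decoder at each step
decoder_targets = target[1:]    # w1 w2 w3 <eos>  -> compared with the decoder's predictions
```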


Preliminaries: Recurrent Neural Networks

Figure 1: RNN Architecture (Image by the author).

The Input $\textcolor{green}{x_t}$

In a Recurrent Neural Network (RNN), the input at time step $t$, denoted $\textcolor{green}{x_t}$, represents the data fed into the network at that point in the sequence. It is typically a vector (e.g., a word embedding) in $\mathbb{R}^d$, where $d$ is the embedding dimension.

As an example, let’s consider the 4-word sentence “Cryptocurrency is the future”. This sequence has length $m = 4$. The corresponding inputs are:

$$\textcolor{green}{x_1} = \text{embedding}(\textcolor{green}{\text{``Cryptocurrency''}});\quad \textcolor{green}{x_2} = \text{embedding}(\textcolor{green}{\text{``is''}});\quad \textcolor{green}{x_3} = \text{embedding}(\textcolor{green}{\text{``the''}});\quad \textcolor{green}{x_4} = \text{embedding}(\textcolor{green}{\text{``future''}})$$
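
As a quick illustration, here is a minimal sketch (PyTorch assumed; the vocabulary and the embedding dimension $d = 8$ are made up for this example) of how the four words could be mapped to the vectors $x_1, \dots, x_4$:

```python
import torch
import torch.nn as nn

vocab = {"cryptocurrency": 0, "is": 1, "the": 2, "future": 3}
d = 8                                              # embedding dimension (illustrative)
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=d)

tokens = ["cryptocurrency", "is", "the", "future"]
token_ids = torch.tensor([vocab[w] for w in tokens])
x = embedding(token_ids)                           # shape (4, d): rows are x_1, x_2, x_3, x_4
```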

Hidden State $\textcolor{red}{h_t}$

The hidden state serves as the network’s internal memory, encoding contextual information from the sequence up to time $t$, and it plays a key role in maintaining temporal dependencies. At each time step, the hidden state $\textcolor{red}{h_t}$ is updated based on the current input $\textcolor{green}{x_t}$ and the previous hidden state $\textcolor{red}{h_{t-1}}$:

$$\textcolor{red}{h_t} = f(W_{hh}\,\textcolor{red}{h_{t-1}} + W_{xh}\,\textcolor{green}{x_t} + b_h)$$

Here $W_{xh}$ and $W_{hh}$ are weight matrices, $b_h$ is a bias vector, and $f$ is a nonlinear activation function such as $\tanh$ or ReLU.
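
A single recurrence step can be written out directly. The sketch below (PyTorch assumed, with made-up sizes $n = 16$, $d = 8$, and $f = \tanh$) is one way to compute it:

```python
import torch

n, d = 16, 8                                       # hidden size and input dimension (illustrative)
W_hh = torch.randn(n, n)                           # hidden-to-hidden weights
W_xh = torch.randn(n, d)                           # input-to-hidden weights
b_h  = torch.zeros(n)                              # bias

h_prev = torch.zeros(n)                            # previous hidden state h_{t-1}
x_t    = torch.randn(d)                            # stand-in for the current input embedding
h_t    = torch.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)   # updated hidden state, shape (n,)
```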

Initial Hidden State $\textcolor{red}{h_0}$

The initial hidden state $\textcolor{red}{h_0}$ represents the starting memory of the RNN before any input is processed. It is typically initialized in one of the following ways:

  • As a zero vector:
    $\textcolor{red}{h_0} = \mathbf{0} \in \mathbb{R}^n$, where $n$ is the dimensionality of the hidden state.

  • With small random values (e.g., sampled from a normal distribution):
    $\textcolor{red}{h_0} \sim \mathcal{N}(0, \sigma^2 I)$

  • As a learned parameter:
    $\textcolor{red}{h_0}$ can also be treated as a trainable vector that is learned during training, allowing the model to adapt its initial memory based on the data.

The choice depends on the specific task, model design, and desired behavior at the start of the sequence.
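
The three options can be sketched as follows (PyTorch assumed; $n$ and $\sigma$ are chosen arbitrarily for illustration):

```python
import torch
import torch.nn as nn

n = 16                                             # hidden-state dimensionality
h0_zero    = torch.zeros(n)                        # zero vector
h0_random  = 0.01 * torch.randn(n)                 # small random values (sigma = 0.01 here)
h0_learned = nn.Parameter(torch.zeros(n))          # trainable vector, updated by the optimizer
```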

Output $\textcolor{blue}{y_t}$

The output at time step $t$ is computed from the hidden state:

$$\textcolor{blue}{y_t} = g(W_{hy}\,\textcolor{red}{h_t} + b_y)$$

where $W_{hy}$ is the output weight matrix, $b_y$ is a bias vector, and $g$ is an activation function such as softmax or sigmoid, depending on the task.
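
For a classification-style output such as predicting the next word, $g$ is typically softmax. A minimal sketch (PyTorch assumed, with a made-up vocabulary size $V$):

```python
import torch

n, V = 16, 1000                                    # hidden size and vocabulary size (illustrative)
W_hy = torch.randn(V, n)                           # hidden-to-output weights
b_y  = torch.zeros(V)                              # output bias
h_t  = torch.randn(n)                              # hidden state from the recurrence above

y_t = torch.softmax(W_hy @ h_t + b_y, dim=-1)      # probability distribution over the V tokens
```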

RNN Encoder–Decoder

The RNN Encoder–Decoder architecture, introduced by Cho et al. (2014) and Sutskever et al. (2014), operates by encoding an input sentence into a fixed-length vector and then decoding it to generate an output sequence.

Figure 2: Encoder–Decoder architecture (Image and comment from Cho et al. (2014)).

  • In this framework, the encoder processes the input sequence $\mathbf{x} = (\textcolor{green}{x_1}, \textcolor{green}{x_2}, \dots, \textcolor{green}{x_T})$ and transforms it into a context vector $\textcolor{red}{c}$. Typically, a recurrent neural network (RNN) is employed for this transformation as follows:

$$\textcolor{red}{h_t} = f(\textcolor{red}{h_{t-1}}, \textcolor{green}{x_t})$$

and

$$\textcolor{red}{c} = q(\{\textcolor{red}{h_1}, \dots, \textcolor{red}{h_T}\})$$

Here, $\textcolor{red}{h_t} \in \mathbb{R}^n$ represents the hidden state at time step $t$, and the context vector $\textcolor{red}{c}$ is derived from the sequence of hidden states. The functions $f$ and $q$ are nonlinear operations; for example, Sutskever et al. (2014) used an LSTM for $f$ and set $q(\{\textcolor{red}{h_1}, \dots, \textcolor{red}{h_T}\}) = \textcolor{red}{h_T}$ (a combined encoder–decoder sketch follows after this list).

  • The decoder is trained to generate each word $\textcolor{blue}{y_{t'}}$ based on the context vector $\textcolor{red}{c}$ and the previously generated words $\{\textcolor{blue}{y_1}, \dots, \textcolor{blue}{y_{t'-1}}\}$. It models the probability distribution over the output sequence $\mathbf{y}$ by factorizing it into a product of conditional probabilities:

$$p(\mathbf{y}) = \prod_{t'=1}^{T'} p(\textcolor{blue}{y_{t'}} \mid \{\textcolor{blue}{y_1}, \dots, \textcolor{blue}{y_{t'-1}}\}, \textcolor{red}{c})$$

where $\mathbf{y} = (\textcolor{blue}{y_1}, \dots, \textcolor{blue}{y_{T'}})$.

When using an RNN, each conditional probability is computed as $p(\textcolor{blue}{y_{t'}} \mid \{\textcolor{blue}{y_1}, \dots, \textcolor{blue}{y_{t'-1}}\}, \textcolor{red}{c}) = g(\textcolor{blue}{y_{t'-1}}, \textcolor{red}{h_{t'}}, \textcolor{red}{c})$, where $g$ is a nonlinear function (possibly multi-layered) that outputs the probability of $\textcolor{blue}{y_{t'}}$, and $\textcolor{red}{h_{t'}}$ denotes the decoder’s hidden state at time $t'$.
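
Putting the two components together, here is a minimal, self-contained sketch of the encoder–decoder loop (assumptions: PyTorch, GRU cells standing in for $f$, $q(\{h_1, \dots, h_T\}) = h_T$, the context vector concatenated with the previous token’s embedding at each decoder step, and made-up sizes and token ids):

```python
import torch
import torch.nn as nn

d, n, V = 8, 16, 1000                            # embedding dim, hidden size, vocab size (illustrative)
src_embed    = nn.Embedding(V, d)
tgt_embed    = nn.Embedding(V, d)
encoder_cell = nn.GRUCell(input_size=d, hidden_size=n)
decoder_cell = nn.GRUCell(input_size=d + n, hidden_size=n)
out_proj     = nn.Linear(n, V)                   # maps the decoder state to vocabulary logits

# --- Encoder: h_t = f(h_{t-1}, x_t), then c = q({h_1..h_T}) = h_T ------------
src_ids = torch.tensor([[4, 27, 3, 9]])          # hypothetical source token ids, shape (1, T)
h = torch.zeros(1, n)                            # h_0
for t in range(src_ids.size(1)):
    x_t = src_embed(src_ids[:, t])               # (1, d)
    h = encoder_cell(x_t, h)                     # (1, n)
c = h                                            # context vector

# --- Decoder: p(y_t' | y_<t', c) = g(y_{t'-1}, h_t', c), greedy decoding -----
h_dec = c.clone()                                # a common choice: start the decoder state from c
y_prev = torch.tensor([1])                       # <sos> token id (assumption)
output_ids = []
for _ in range(10):                              # generate at most 10 tokens
    inp = torch.cat([tgt_embed(y_prev), c], dim=-1)   # condition on y_{t'-1} and c
    h_dec = decoder_cell(inp, h_dec)
    probs = torch.softmax(out_proj(h_dec), dim=-1)    # distribution over the V tokens
    y_prev = probs.argmax(dim=-1)                     # greedy choice
    output_ids.append(y_prev.item())
```

During training, the greedy `argmax` choice would be replaced by teacher forcing (feeding the true previous token), and the per-step probabilities would be plugged into a cross-entropy loss over the target sequence.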

Time to Dive into the Implementation 💻

References
  1. Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 10.3115/v1/d14-1179
  2. Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. arXiv. 10.48550/ARXIV.1409.3215