A GRU (Gated Recurrent Unit) is a type of recurrent neural network used for sequence modelling problems such as language translation or speech recognition. A bi-GRU is a bidirectional version of a GRU: it processes the input sequence in two directions, from beginning to end and from end to beginning.
First, let’s focus on the GRU.
The architecture of a GRU is built around two gates: a reset gate and an update gate.
The reset gate determines how much of the previous hidden state to forget, and the update gate determines how much of the new input information to incorporate into the hidden state. The hidden state is then updated based on the values of the reset and update gates, as well as the new input.
Here is a more detailed explanation of the architecture of a GRU (a minimal code sketch follows the list):
- Input and hidden states: The GRU takes two inputs: the current input (x_t) and the previous hidden state (h_{t-1}).
- Reset gate: The reset gate (r_t) is a sigmoid activation layer that determines how much of the previous hidden state to forget. It is computed as: r_t = sigmoid(W_r * [h_{t-1}, x_t] + b_r)
- Update gate: The update gate (z_t) is also a sigmoid activation layer that determines how much of the new input information to incorporate into the hidden state. It is computed as: z_t = sigmoid(W_z * [h_{t-1}, x_t] + b_z)
- Candidate hidden state: The candidate hidden state (h_t’) is a tanh activation layer that calculates a new, candidate hidden state based on the current input and the previous hidden state, scaled by the reset gate. It is computed as: h_t’ = tanh(W * [r_t * h_{t-1}, x_t] + b)
- New hidden state: The new hidden state (h_t) is a combination of the previous hidden state and the candidate hidden state, scaled by the update gate. It is computed as: h_t = (1 - z_t) * h_{t-1} + z_t * h_t’
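To make these equations concrete, here is a minimal NumPy sketch of a single GRU step. The variable names (W_r, W_z, W, b_r, b_z, b) mirror the notation above, and the function name gru_cell and the weight shapes are just illustrative assumptions, not a reference implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, W_r, W_z, W, b_r, b_z, b):
    """One GRU time step, following the equations above.

    x_t:          current input, shape (input_dim,)
    h_prev:       previous hidden state h_{t-1}, shape (hidden_dim,)
    W_r, W_z, W:  weight matrices, shape (hidden_dim, hidden_dim + input_dim)
    b_r, b_z, b:  bias vectors, shape (hidden_dim,)
    """
    concat = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    r_t = sigmoid(W_r @ concat + b_r)                 # reset gate
    z_t = sigmoid(W_z @ concat + b_z)                 # update gate
    candidate_input = np.concatenate([r_t * h_prev, x_t])
    h_candidate = np.tanh(W @ candidate_input + b)    # candidate hidden state h_t'
    h_t = (1 - z_t) * h_prev + z_t * h_candidate      # new hidden state
    return h_t
```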
This process is repeated for each time step in the input sequence, with the hidden state at each time step being used as input to the next time step. The final hidden state is then used to make predictions for the task at hand, such as language translation or speech recognition.
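To make the recurrence explicit, here is a short sketch of how the cell would be unrolled over a sequence. It reuses the gru_cell function from the sketch above; the dimensions and randomly initialized parameters are made up purely for illustration.

```python
import numpy as np

# Toy dimensions and randomly initialized parameters (illustration only).
input_dim, hidden_dim, seq_len = 8, 16, 5
rng = np.random.default_rng(0)
W_r = rng.standard_normal((hidden_dim, hidden_dim + input_dim)) * 0.1
W_z = rng.standard_normal((hidden_dim, hidden_dim + input_dim)) * 0.1
W   = rng.standard_normal((hidden_dim, hidden_dim + input_dim)) * 0.1
b_r = np.zeros(hidden_dim)
b_z = np.zeros(hidden_dim)
b   = np.zeros(hidden_dim)

sequence = [rng.standard_normal(input_dim) for _ in range(seq_len)]

h_t = np.zeros(hidden_dim)        # initial hidden state h_0
for x_t in sequence:              # one gru_cell call per time step
    h_t = gru_cell(x_t, h_t, W_r, W_z, W, b_r, b_z, b)

# h_t is now the final hidden state, which would feed a task-specific
# output layer (e.g. a softmax over a vocabulary).
```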
The main difference between a GRU and a bi-GRU is that a bi-GRU has two separate hidden states, one for each direction, and it concatenates the final hidden states from both directions before making its final prediction. This allows the bi-GRU to capture information from both the past and the future of the input sequence, whereas a regular GRU only has access to information from the past.
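For comparison, here is roughly how a bi-GRU is used in practice. This sketch relies on PyTorch's built-in nn.GRU with bidirectional=True; the dimensions are arbitrary. Note that the output feature size is twice the hidden size, because the forward and backward hidden states are concatenated at each time step.

```python
import torch
import torch.nn as nn

input_size, hidden_size, seq_len, batch = 8, 16, 5, 2

bigru = nn.GRU(input_size, hidden_size, bidirectional=True, batch_first=True)
x = torch.randn(batch, seq_len, input_size)

output, h_n = bigru(x)

# output: (batch, seq_len, 2 * hidden_size) -- forward and backward
#         hidden states concatenated at every time step.
# h_n:    (2, batch, hidden_size) -- final hidden state of each direction.
print(output.shape)   # torch.Size([2, 5, 32])
print(h_n.shape)      # torch.Size([2, 2, 16])
```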
In practice, bi-GRUs are often used for tasks such as natural language processing, where the model needs to understand the context of a word in a sentence in order to make accurate predictions.