You can't just see. You have to know what and where to look.
Visual Attention is the biological process by which people dynamically focus on parts of their visual field to localize, identify, and characterize objects. Put another way, people see only a small portion of their field of view in high resolution. That high-resolution region (the part captured by the fovea) is where they extract the features of the thing they are looking at. They then shift focus to another part of their visual field and repeat feature extraction until they recognize what they are looking at. In other words, Visual Attention answers what to look at and where to look.
Visual Attention inspired the Attention mechanisms that have become ubiquitous in deep learning. These mechanisms have profoundly impacted tasks like image recognition and machine translation. The Transformer architecture at the heart of LLMs and Generative AI is based entirely on Attention mechanisms.
Given their inspiration, it is unsurprising that Attention mechanisms were first used for image recognition. Dr. Kunihiko Fukushima helped pioneer them by adding Selective Attention to his Neocognitron (link). He used efferent signals from the model's later stages to reinforce the earlier stages that responded to a recognized stimulus. In other words, when presented with two patterns, the model's early-stage neurons responded to both, so the neurons associated with each pattern turned on. The classification stage, trained to identify one of the two patterns, then sent efferent signals back to the earlier stages, reinforcing the neurons that responded to the recognized pattern. So, his improved Neocognitron could selectively focus on or ignore patterns.
Bahdanau et al.'s use of an Attention mechanism (Additive Attention) for machine translation helped shine a spotlight on it (link). At the time, RNN encoder-decoders were the cat's meow in machine translation. However, they compressed an entire sentence into a fixed-length vector, which limited how much context they could capture for each word. In practice, sentences had to be short. As a result, NMT (neural machine translation) systems needed something better. Bahdanau et al. extended an NMT encoder-decoder with an alignment model. Their alignment model scores the words that make up the sentence for a given prediction, so more relevant words are weighted more heavily. In other words, to predict the translation at time t (yₜ), their NMT model considers the decoder state at time t, the context for the word, and the previous translation (yₜ₋₁).
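Here is a rough sketch of that prediction step in NumPy. The single-layer readout, the toy dimensions, and the function names are assumptions for illustration, not Bahdanau et al.'s actual implementation:

```python
# A toy sketch of the prediction step: the next-word distribution is read off from
# the previous translation y_{t-1}, the decoder state s_t, and the context vector c_t.
# The single-layer readout and the dimensions are assumptions, not the paper's code.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def predict_next_word(y_prev_embed, s_t, c_t, W_out):
    """Score every vocabulary word from [y_{t-1}; s_t; c_t]."""
    features = np.concatenate([y_prev_embed, s_t, c_t])
    return softmax(W_out @ features)  # probability distribution over the vocabulary

# Toy dimensions: 8-dim embeddings/states/contexts, 20-word vocabulary.
rng = np.random.default_rng(0)
probs = predict_next_word(rng.normal(size=8), rng.normal(size=8),
                          rng.normal(size=8), rng.normal(size=(20, 24)))
print(probs.argmax())  # index of the most likely next word
```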
Bahdanau et al. used a bidirectional RNN (BiRNN) as their encoder. A forward RNN that reads the sentence left to right and a backward RNN that reads it right to left make up the BiRNN. Each direction supplies context from one side of a word. In other words,
hₜ = hidden state of Dₙ at time t
hᵣₜ = hidden state of Dᵣ at time t
where hₜ is the forward RNN's (Dₙ) representation of the word at time t, and hᵣₜ is the reverse RNN's (Dᵣ) representation of the word at time t.
So, the forward RNN produces a word's representation by referencing the words to its left, while the backward RNN references the ones to its right. The contextualized representation of the word at time t is simply the concatenation of the forward and backward RNNs' hidden states at time t. Put simply,
rₜ = Concat( hₜ, hᵣₜ )
c = [ r₁, r₂, .., rₙ ]
where rₜ is the contextualized representation of the word at time t. In other words, the BiRNN produces a representation sequence (c) that includes context from the words left and right of each word.
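A minimal NumPy sketch of that encoder might look like the following. The tanh cell and the toy dimensions are assumptions; the point is only that each rₜ concatenates a left-to-right and a right-to-left reading of the sentence:

```python
# A minimal sketch of the BiRNN encoder described above (assumed tanh cell and sizes).
import numpy as np

def rnn_pass(embeddings, W_x, W_h):
    """Run a simple tanh RNN over a sequence and return every hidden state."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x in embeddings:
        h = np.tanh(W_x @ x + W_h @ h)
        states.append(h)
    return states

def birnn_encode(embeddings, params):
    forward = rnn_pass(embeddings, params["Wx_f"], params["Wh_f"])                # h_t
    backward = rnn_pass(embeddings[::-1], params["Wx_r"], params["Wh_r"])[::-1]   # h_r,t
    return [np.concatenate([f, b]) for f, b in zip(forward, backward)]            # r_t

# Toy run: a 5-word sentence with 4-dim embeddings and 6-dim hidden states.
rng = np.random.default_rng(1)
params = {k: rng.normal(scale=0.1, size=(6, 4 if "x" in k else 6))
          for k in ["Wx_f", "Wh_f", "Wx_r", "Wh_r"]}
sentence = rng.normal(size=(5, 4))
c = birnn_encode(sentence, params)
print(len(c), c[0].shape)  # 5 contextualized representations, each 12-dimensional
```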
The Decoder is just an RNN. It processes c as the representation of the sentence. Every representation in c is potential context for the prediction at time t (yₜ). However, each representation (r₁, r₂, .., rₙ) impacts yₜ to a varying degree. A feedforward alignment model weights each representation by its relevance to the current prediction before the Decoder's RNN uses it.
eₜᵢ = a( sₜ₋₁, rᵢ )
αₜᵢ = exp( eₜᵢ ) / ∑ₖ exp( eₜₖ )
cₜ = ∑ᵢ αₜᵢrᵢ
where a is the alignment model, sₜ₋₁ is the decoder's previous hidden state (which summarizes the translation so far), eₜᵢ is the relevance of the word at position i to the prediction at time t, αₜᵢ is the amount to weight rᵢ by, and cₜ is the context vector the Decoder's RNN consumes at time t. In other words, to understand a word, you look at the words around it and weight them by relevance.
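Here is a minimal sketch of that weighting step. The one-hidden-layer scoring network stands in for the paper's feedforward alignment model; the dimensions and initialization are made up:

```python
# A minimal sketch of the additive alignment step: score each r_i against the
# decoder's previous state, softmax the scores, and take the weighted sum.
# Sizes and the exact scoring network are assumptions for illustration.
import numpy as np

def additive_attention(s_prev, reps, W_s, W_r, v):
    """Weight each contextualized representation r_i by its relevance to step t."""
    scores = np.array([v @ np.tanh(W_s @ s_prev + W_r @ r) for r in reps])  # e_t,i
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                                # alpha_t,i
    context = sum(w * r for w, r in zip(weights, reps))                     # c_t
    return weights, context

rng = np.random.default_rng(2)
reps = [rng.normal(size=12) for _ in range(5)]   # r_1..r_5 from the BiRNN
s_prev = rng.normal(size=10)                     # decoder state s_{t-1}
weights, c_t = additive_attention(s_prev, reps,
                                  rng.normal(size=(8, 10)),
                                  rng.normal(size=(8, 12)),
                                  rng.normal(size=8))
print(weights.round(2), c_t.shape)               # attention weights sum to 1
```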
Additive Attention improved NMT's contextual understanding by letting it focus on the relevant parts of the context. Additionally, looking at both the preceding and following context improves translation accuracy. For example, when translating the man into French (l'homme), looking at the following word (man) lets you correctly translate the into l' instead of le, la, or les. Additive Attention also lets NMTs handle longer sentences. In other words, Additive Attention solved the RNN's fixed-length vector compression problem with generated, contextualized vectors. However, RNNs still compute those vectors, so Bahdanau et al.'s NMT architecture remains limited by recurrence.
Cheng et al. generalized Additive Attention to create Self Attention (link). Instead of relying on recurrence, Self Attention relies on dot products. Self Attention uses sets of key and value vectors, both built from all the words in the sentence. Their dot products map each part of the keys to each part of the values based on relevance. Unlike recurrence, dot products can be computed in parallel, and they better capture long-range dependencies.
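A minimal sketch of dot-product Self Attention over a sentence might look like this. It is the generic formulation, not Cheng et al.'s exact LSTM memory-network, and the toy dimensions are assumptions:

```python
# A minimal sketch of dot-product self attention: every word attends to every other
# word, and the matrix products can run in parallel (no recurrence).
import numpy as np

def self_attention(X):
    """X holds one row per word; each row is re-expressed as a mix of all rows."""
    scores = X @ X.T                                   # relevance of every word pair
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)      # each row sums to 1
    return weights @ X                                 # contextualized representations

rng = np.random.default_rng(3)
sentence = rng.normal(size=(5, 8))                     # 5 words, 8-dim embeddings
print(self_attention(sentence).shape)                  # (5, 8): one new vector per word
```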
Cheng et al. applied Self Attention to Machine Reading. They replaced an LSTM's memory cell with a memory network. Additionally, they used Self Attention within their sequence encoder and for memory addressing. In other words, their use of Attention relates each word in a sentence to the other words in that sentence. Their approach bested or matched the state-of-the-art methods of the time in language modeling, sentiment analysis, and natural language inference.
It is hard to overstate the importance of attention mechanisms. They are a cornerstone of modern deep learning. To see their impact, look no further than OpenAI's ChatGPT and its contemporaries.