What is the attention mechanism in neural networks

What is attention

Attention mechanisms in neural networks have attracted a great deal of interest. In this article I will try to find what the various mechanisms have in common, describe their use cases, and explain the principles and implementation of two soft visual attention mechanisms.

Informally, a neural attention mechanism gives a neural network the ability to focus on a subset of its inputs (or features): it selects specific inputs. Let $x \in \mathbb{R}^d$ be the input vector, $z \in \mathbb{R}^k$ a feature vector, $a \in [0, 1]^k$ an attention vector, and $f_\phi(x)$ an attention network. In general, attention is implemented as:

$a = f_\phi(x)$

and $z_a = a \odot z$ [1]

In equation [1] above, $\odot$ denotes element-wise multiplication. Here we can introduce the concepts of soft and hard attention. In soft attention the mask values lie between 0 and 1; in hard attention they are forced to be exactly 0 or 1, i.e. $a \in \{0, 1\}^k$. In the latter case the mask can be used to index the feature vector directly, e.g. $z_a = z[a]$, which changes its dimensionality: only the selected entries remain.
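
Equation [1] and the soft/hard distinction can be sketched in a few lines of NumPy. The one-layer `attention_network`, its weights `W` and `b`, and the 0.5 threshold used to binarize the mask are assumptions made for illustration, not part of the original formulation.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def attention_network(x, W, b):
    """Hypothetical f_phi: a linear layer plus sigmoid, so a lies in [0, 1]^k."""
    return sigmoid(W @ x + b)

def soft_attention(a, z):
    """Soft attention: element-wise product, output keeps the shape of z."""
    return a * z

def hard_attention(a, z, threshold=0.5):
    """Hard attention: binarize the mask and index z with it (assumed threshold)."""
    mask = a > threshold          # boolean mask in {0, 1}^k
    return z[mask]                # keeps only the selected entries

x = np.array([1.0, -1.0])                                # input, d = 2
W = np.array([[10.0, 0.0], [0.0, 10.0], [0.0, 0.0]])     # k = 3
b = np.zeros(3)
z = np.array([2.0, 3.0, 4.0])                            # features, k = 3

a = attention_network(x, W, b)
z_soft = soft_attention(a, z)     # same length as z
z_hard = hard_attention(a, z)     # at most k entries
```

Note that the soft result always has k entries, while the hard result can be shorter, because masked-out features are dropped rather than set to zero.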

To understand why attention matters, we need to look at what a neural network really is: a function approximator. Which classes of functions it can approximate depends on its architecture. A typical neural network is implemented as a chain of matrix multiplications and element-wise non-linearities, in which the elements of the input or feature vectors interact with each other only through addition.

The attention mechanism can be used to compute a mask that is then multiplied with the features. This seemingly small extension considerably expands the space of functions the neural network can approximate and enables new applications.

Attention can be applied to any kind of input, regardless of its shape. For matrix-valued inputs such as images, we speak of visual attention. Let $I \in \mathbb{R}^{H \times W}$ be an image and $g \in \mathbb{R}^{h \times w}$ a glimpse, that is, the result of applying the attention mechanism to the image.

Hard attention

Hard attention for images has been in use for a long time, for example when cropping images. The concept is very simple and only requires indexing. In Python and TensorFlow, hard attention can be implemented as:
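
A minimal sketch of such a crop, assuming a NumPy array for the image (TensorFlow tensors accept the same slicing syntax). The function name `hard_attention_crop` and the argument names are illustrative.

```python
import numpy as np

def hard_attention_crop(image, y, x, h, w):
    """Return the h-by-w window whose top-left corner is at row y, column x."""
    return image[y:y + h, x:x + w]

image = np.arange(30).reshape(5, 6)      # toy 5x6 "image"
glimpse = hard_attention_crop(image, y=1, x=2, h=2, w=3)
# glimpse is [[8, 9, 10], [14, 15, 16]]
```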

The only problem with the above form is that it is non-differentiable. To learn the parameters of the model, you have to resort to something like the score-function estimator.
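
As a rough illustration of the score-function estimator (also known as REINFORCE), the sketch below treats the crop position as a categorical random variable and estimates the gradient of the expected reward with respect to the logits. The reward vector, the softmax parameterization and all function names are assumptions made for illustration.

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def score_function_gradient(theta, reward, n_samples, rng):
    """Monte Carlo estimate of d E[reward(y)] / d theta, with y ~ softmax(theta)."""
    p = softmax(theta)
    ys = rng.choice(len(p), size=n_samples, p=p)   # sampled hard choices
    grad_log_p = np.eye(len(p))[ys] - p            # d log p(y) / d theta
    return (reward[ys][:, None] * grad_log_p).mean(axis=0)

def exact_gradient(theta, reward):
    """Closed-form gradient of the same expectation, for comparison."""
    p = softmax(theta)
    return p * (reward - p @ reward)

theta = np.zeros(4)                        # logits over 4 candidate positions
reward = np.array([0.0, 1.0, 0.0, 0.0])    # hypothetical reward per position
rng = np.random.default_rng(0)

estimate = score_function_gradient(theta, reward, 100_000, rng)
exact = exact_gradient(theta, reward)
```

The estimate only needs samples of the discrete choice and their rewards, which is why it works even though the crop itself has no gradient.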

Soft attention

In its simplest variant, soft attention for images is no different from the vector-valued case implemented in equation [1]. The paper "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention" documents one of its early applications.

This model learns to attend to a specific part of the image while generating the word that describes that part.

However, this kind of soft attention is computationally wasteful. The masked-out parts of the input have no influence on the result, but they still have to be processed. It is also over-parameterized: the sigmoid activations that implement the attention are independent of each other, so multiple objects can be selected at the same time. In practice, however, we usually want to be selective and focus on one or a few elements of the scene.

In the following, I will discuss DRAW and Spatial Transformer Networks, which introduce two mechanisms that address the problems above. They can also resize the input, which further improves performance.

Gaussian attention

Gaussian attention uses parameterized one-dimensional Gaussian filters to create an image-sized attention map. Let $a_y \in \mathbb{R}^H$ and $a_x \in \mathbb{R}^W$ be the attention vectors along the y and x axes. The attention mask can then be written as:

$a = a_y a_x^T$

In the accompanying figure, the top row shows $a_x$, the rightmost column shows $a_y$, and the middle rectangle shows the resulting mask $a$. For visualization purposes the vectors contain only zeros and ones. In practice, they can be implemented as vectors of one-dimensional Gaussians. In general, the number of Gaussians equals the spatial dimension, and each vector is described by three parameters: the center of the first Gaussian μ, the distance d between the centers of consecutive Gaussians, and the standard deviation σ of the Gaussians. With this parameterization, both the attention and the glimpse are differentiable, which makes learning much easier.

Since only part of the image is selected in the example above while the rest is blacked out, attention in this form still seems wasteful. It works better if, instead of using the vectors directly, we cast them into matrices $A_y \in \mathbb{R}^{h \times H}$ and $A_x \in \mathbb{R}^{w \times W}$.

Now every row of each matrix holds one Gaussian, and the parameter d gives the distance between the centers of the Gaussians in consecutive rows. The glimpse can then be expressed as:

$g = A_y I A_x^T$

I recently used this mechanism in HART (Hierarchical Attentive Recurrent Tracking), a paper on object tracking with attentive RNNs.

Here is an example: the input image is on the left, and the attention glimpse is on the right. The glimpse shows the region marked by the green box in the main image.

The following code creates the matrix-valued masks described above for a mini-batch of samples in TensorFlow. To create $A_y$, you would call it as Ay = gaussian_mask(u, s, d, h, H), where u, s and d stand for μ, σ and d, respectively, all expressed in pixels.
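
A minimal NumPy sketch of such a function, assuming the signature `gaussian_mask(u, s, d, R, C)` used in the call above and an output of shape (batch, R, C). The per-row normalization and the `eps` constant are assumptions; a TensorFlow version would follow the same steps with tensor ops.

```python
import numpy as np

def gaussian_mask(u, s, d, R, C, eps=1e-8):
    """Build a batch of Gaussian attention matrices.

    u, s, d: per-sample centre of the first Gaussian, standard deviation,
             and shift between consecutive centres (all in pixels).
    R: number of rows in the mask, one Gaussian per row.
    C: number of columns in the mask.
    Returns an array of shape (batch, R, C).
    """
    u = np.asarray(u, dtype=np.float64).reshape(-1, 1, 1)
    s = np.asarray(s, dtype=np.float64).reshape(-1, 1, 1)
    d = np.asarray(d, dtype=np.float64).reshape(-1, 1, 1)

    rows = np.arange(R, dtype=np.float64).reshape(1, R, 1)
    cols = np.arange(C, dtype=np.float64).reshape(1, 1, C)

    centres = u + rows * d                              # (batch, R, 1)
    mask = np.exp(-0.5 * ((cols - centres) / s) ** 2)   # (batch, R, C)
    # Normalize each row so its weights sum to (almost exactly) one.
    return mask / (mask.sum(axis=2, keepdims=True) + eps)

# Two samples, masks with h = 3 rows over an image height of H = 10.
Ay = gaussian_mask(u=[2.0, 0.0], s=[1.0, 0.5], d=[1.5, 2.0], R=3, C=10)
```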

We can also write a function that extracts a glimpse directly from the image:
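
A sketch of such a function in NumPy, assuming images of shape (batch, H, W) and six attention parameters per image ordered as (u_y, s_y, d_y, u_x, s_x, d_x). The name `gaussian_glimpse` and this parameter layout are illustrative choices, and the mask helper is repeated so that the block runs on its own.

```python
import numpy as np

def gaussian_mask(u, s, d, R, C, eps=1e-8):
    """One normalized Gaussian per row; returns shape (batch, R, C)."""
    u, s, d = (np.asarray(p, dtype=np.float64).reshape(-1, 1, 1) for p in (u, s, d))
    rows = np.arange(R, dtype=np.float64).reshape(1, R, 1)
    cols = np.arange(C, dtype=np.float64).reshape(1, 1, C)
    mask = np.exp(-0.5 * ((cols - (u + rows * d)) / s) ** 2)
    return mask / (mask.sum(axis=2, keepdims=True) + eps)

def gaussian_glimpse(img, attn_params, glimpse_size):
    """Extract a glimpse g = A_y I A_x^T for every image in the batch.

    img: array of shape (batch, H, W).
    attn_params: array of shape (batch, 6), assumed order
                 (u_y, s_y, d_y, u_x, s_x, d_x).
    glimpse_size: (h, w) of the output glimpse.
    """
    H, W = img.shape[1:3]
    h, w = glimpse_size
    uy, sy, dy, ux, sx, dx = np.split(attn_params, 6, axis=1)
    Ay = gaussian_mask(uy, sy, dy, h, H)           # (batch, h, H)
    Ax = gaussian_mask(ux, sx, dx, w, W)           # (batch, w, W)
    return Ay @ img @ np.transpose(Ax, (0, 2, 1))  # (batch, h, w)

img = np.arange(2 * 6 * 8, dtype=np.float64).reshape(2, 6, 8)
# Very narrow Gaussians one pixel apart behave almost like a plain crop.
params = np.array([[1.0, 0.1, 1.0, 2.0, 0.1, 1.0]] * 2)
glimpse = gaussian_glimpse(img, params, (3, 4))
```

With a larger standard deviation or a distance d greater than one pixel, the same call blurs and downsamples the selected region instead of copying it.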

Spatial Transformer

The Spatial Transformer (STN) allows for more general transformations than differentiable image cropping, although cropping is one of its possible use cases. It consists of two components, a grid generator and a sampler. The grid generator specifies the grid of points to sample from, and the sampler does the actual sampling. With Sonnet, the recently released neural network library from DeepMind, implementing it in TensorFlow is very easy.
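
The two components can be sketched in plain NumPy as follows. This is not the Sonnet API: `grid_generator`, `bilinear_sampler` and the pixel-coordinate 2x3 affine matrix `theta` are hypothetical names and conventions chosen for illustration.

```python
import numpy as np

def grid_generator(theta, h, w):
    """Map every output pixel (y, x) to a source location via a 2x3 affine matrix.

    Returns an array of shape (2, h, w) holding source y and x coordinates.
    """
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys, xs, np.ones_like(ys)], axis=0).reshape(3, -1)
    src = np.asarray(theta, dtype=np.float64) @ coords.astype(np.float64)
    return src.reshape(2, h, w)

def bilinear_sampler(img, grid):
    """Sample a 2-D image at real-valued grid locations with bilinear interpolation."""
    H, W = img.shape
    y = np.clip(grid[0], 0, H - 1)
    x = np.clip(grid[1], 0, W - 1)
    y0, x0 = np.floor(y).astype(int), np.floor(x).astype(int)
    y1, x1 = np.minimum(y0 + 1, H - 1), np.minimum(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * img[y0, x0] + wx * img[y0, x1]
    bottom = (1 - wx) * img[y1, x0] + wx * img[y1, x1]
    return (1 - wy) * top + wy * bottom

img = np.arange(36, dtype=np.float64).reshape(6, 6)

# Scale 1 and offset (2, 1): equivalent to cropping img[2:5, 1:4].
theta_crop = [[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0]]
crop = bilinear_sampler(img, grid_generator(theta_crop, 3, 3))

# Scale 0.5 along y: the second output row falls halfway between two source rows.
theta_zoom = [[0.5, 0.0, 0.0],
              [0.0, 1.0, 0.0]]
zoom = bilinear_sampler(img, grid_generator(theta_zoom, 2, 1))
```

When `theta` contains only a scale and an offset per axis, it has four free parameters and the transformation reduces to a resizable crop; the off-diagonal entries add rotation and shear.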

Gaussian Attention vs. Spatial Transformer

Gaussian attention and the Spatial Transformer implement very similar behavior. How do we decide which one to choose? Here are some nuances:

Gaussian attention is an over-parameterized cropping mechanism: it requires six parameters but has only four degrees of freedom (y, x, height and width). STN needs only four parameters.

I haven't run any tests yet, but STN should be faster. It relies on linear interpolation at the sampling points, while Gaussian attention has to perform two matrix multiplications.

Gaussian attention should be easier to train. This is because each pixel in the glimpse can be a convex combination of a relatively large patch of pixels in the source image, which makes it easier to find the cause of an error. STN, on the other hand, relies on linear interpolation, so the gradient at each sampling point is non-zero only with respect to the two nearest pixels.
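
The difference in gradient support can be seen in a one-dimensional toy example. Since each output value is a weighted sum of source pixels, its gradient with respect to those pixels is just the weight vector. The helper names and the values 3.3 and 1.5 are arbitrary choices for illustration.

```python
import numpy as np

def linear_interp_weights(position, size):
    """Weights of each source pixel when sampling a 1-D signal at `position`."""
    weights = np.zeros(size)
    left = int(np.floor(position))
    frac = position - left
    weights[left] = 1.0 - frac
    if left + 1 < size:
        weights[left + 1] = frac
    return weights

def gaussian_weights(centre, sigma, size):
    """One normalized row of a Gaussian attention matrix."""
    cols = np.arange(size)
    w = np.exp(-0.5 * ((cols - centre) / sigma) ** 2)
    return w / w.sum()

lin = linear_interp_weights(3.3, 10)   # non-zero only at pixels 3 and 4
gauss = gaussian_weights(3.3, 1.5, 10) # spreads weight over many pixels

n_lin = np.count_nonzero(lin)
n_gauss = int((gauss > 1e-3).sum())
```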

The attention mechanism expands the capabilities of neural networks: it lets them approximate more complex functions or, put more intuitively, focus on specific parts of the input. It has improved performance on natural language benchmarks and enabled entirely new capabilities such as image captioning, addressing in memory networks, and neural programmers.

In my opinion, the most important use cases of attention have not yet been discovered. For example, we know that objects in a video are consistent and coherent and do not suddenly disappear from one frame to the next. The attention mechanism can be used to express this consistency. I will keep following how this develops.


Original release time: 2017-10-16