Paper Reading: Attention Is All You Need



A simple network architecture called “the Transformer” is proposed, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. It is more parallelizable and requires significantly less time to train.

The Transformer uses stacked self-attention and point-wise, fully connected layers for both the encoder and decoder.


WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs.


Recurrent models are inherently sequential, which precludes parallelization within training examples.

“Auto-regressive” - consuming the previously generated symbols as additional input when generating the next.
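To make the auto-regressive idea concrete, here is a minimal sketch of a generation loop in which each new symbol is produced from everything generated so far. The names decode_autoregressively and toy_next_token are hypothetical, and the toy function only stands in for a real decoder; this does not represent the paper's actual decoding strategy.

```python
def decode_autoregressively(next_token, start, end, max_len=10):
    """Generate symbols one at a time, feeding each back in as input."""
    generated = [start]
    while len(generated) < max_len:
        # The model sees every symbol generated so far.
        token = next_token(generated)
        generated.append(token)
        if token == end:
            break
    return generated


def toy_next_token(prefix):
    """Hypothetical stand-in for a decoder: counts up, then stops."""
    last = prefix[-1]
    return last + 1 if last < 3 else -1  # -1 plays the role of end-of-sequence


result = decode_autoregressively(toy_next_token, start=0, end=-1)
print(result)
```

Because each step depends on the output of the previous step, generation itself is sequential even when training can be parallelized.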

Looks like this paper has a great explanation of the attention mechanism. “An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.”
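The quoted definition can be sketched in a few lines. This sketch assumes the scaled dot product as the compatibility function (the choice the paper makes) and uses plain Python lists to avoid dependencies; the function names and the toy vectors are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Map one query and a set of key-value pairs to an output vector.

    Assumption: compatibility is the dot product scaled by sqrt(d_k).
    """
    d_k = len(query)
    # Compatibility of the query with each key.
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
        for key in keys
    ]
    weights = softmax(scores)
    # Output is the weighted sum of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


# Toy data, chosen for illustration only.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
output, weights = attention([1.0, 0.0], keys, values)
print(weights, output)
```

The query matches the first key more closely, so the first value receives the larger weight and dominates the output.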

Why self-attention:

  1. Total computational complexity per layer
  2. The amount of computation that can be parallelized
  3. The path length between long-range dependencies in the network
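The three criteria above can be compared with a rough sketch. It uses the asymptotic orders the paper reports for a self-attention layer and a recurrent layer, with constant factors dropped, so the numbers are orders of magnitude rather than measurements. Here n is the sequence length and d the representation dimension; the function names are hypothetical.

```python
def self_attention_costs(n, d):
    """Asymptotic costs of a self-attention layer (constants dropped)."""
    return {
        "ops_per_layer": n * n * d,  # O(n^2 * d)
        "sequential_steps": 1,       # O(1): all positions computed in parallel
        "max_path_length": 1,        # O(1): any two positions connect directly
    }


def recurrent_costs(n, d):
    """Asymptotic costs of a recurrent layer (constants dropped)."""
    return {
        "ops_per_layer": n * d * d,  # O(n * d^2)
        "sequential_steps": n,       # O(n): one step per position
        "max_path_length": n,        # O(n): signal passes through every step
    }


# Illustrative sizes only: a short sentence and a very long sequence.
for n in (50, 1000):
    print(n, self_attention_costs(n, 512), recurrent_costs(n, 512))
```

Under these orders, self-attention needs fewer operations per layer than recurrence whenever n is smaller than d, while its sequential steps and maximum path length stay constant regardless of n.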

Practical Value

What can you learn from this to make your research better?

Details and Problems

From the presenters’ point of view, what questions might the audience ask?

Attention: allowing modeling of dependencies without regard to their distance in the input or output sequences

Self-attention uses attention to relate different positions of the same sequence in order to model that sequence. NOTE: this is different from the task of translation.

Q: Can the Transformer be used in comprehension scenarios similar to those of the self-attention models?