# Transformers using PyTorch : Worklog Part 2

Avishek Sen Gupta on 14 January 2023

We continue looking at the Transformer architecture from where we left off in Part 1. At that point, we had set up the Encoder stack, but had stopped short of adding positional encoding and starting work on the Decoder stack. In this post, we will focus on setting up the training cycle.

Specifically, we will cover:

• Positional Encoding
• Setting up the basic training regime via Teacher Forcing (a minimal sketch of this appears below)

We will also lay out the dimensional analysis a little more clearly, and add necessary unit tests to verify intended functionality. The code is available here.
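
Since the training regime is the focus of this post, here is a minimal sketch of what a single teacher-forced training step looks like. The `transformer`, `optimiser`, and token arguments are placeholders rather than the repository's actual interface; the point is simply that, under Teacher Forcing, the Decoder is always fed the ground-truth target words shifted right by one, never its own predictions.

```python
import torch
import torch.nn as nn

def training_step(transformer, optimiser, source_words, target_words, start_token):
    # Teacher Forcing: the Decoder input is the ground-truth target sequence,
    # shifted right by one and prefixed with the start-of-sentence token.
    decoder_input = torch.cat([start_token.unsqueeze(0), target_words[:-1]])
    # The model is assumed to return one row of vocabulary logits per target position: (m, V)
    logits = transformer(source_words, decoder_input)
    # Cross-entropy between the predicted logits and the actual next words (indices of shape (m,))
    loss = nn.functional.cross_entropy(logits, target_words)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```
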

## Positional Encoding

You can see the code for visualising the positional encoding here. Both images below show the encoding map at different levels of zoom.
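
For reference, the standard sinusoidal encoding from the original Transformer paper, which is what produces this kind of striped map, can be computed as below; the function and argument names here are illustrative rather than the ones used in the repository.

```python
import torch

def sinusoidal_positional_encoding(max_words: int, d_model: int) -> torch.Tensor:
    # One row per word position: shape (max_words, 1) for broadcasting
    positions = torch.arange(max_words, dtype=torch.float32).unsqueeze(1)
    # Frequencies fall off geometrically across the embedding dimensions:
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    div_terms = torch.pow(10000.0, torch.arange(0, d_model, 2, dtype=torch.float32) / d_model)
    encoding = torch.zeros(max_words, d_model)
    encoding[:, 0::2] = torch.sin(positions / div_terms)
    encoding[:, 1::2] = torch.cos(positions / div_terms)
    return encoding
```

Plotting this matrix (word positions along one axis, embedding dimensions along the other) yields the banded maps shown in the images above.
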

The code in the main Transformer implementation which adds the positional embedding to the word embeddings can be found in the repository. The snippet below, from the Decoder's forward() method, shows how the encoder output is threaded through the Decoder stack; the attention sublayers are elided, and attribute names such as feedforward and feedforward_layer_norm are stand-ins for the ones in the repository.

```python
# The encoder output is injected directly into the sublayer of every Decoder. To build up the chain of Decoders
# in PyTorch, so that we can put the full stack inside a Sequential block, we simply inject the encoder output
# to the root Decoder, and have it output the encoder output (together with the actual Decoder output) as part of
# the Decoder's actual output to make it easy for the next Decoder in the stack to consume the Encoder and Decoder
# outputs
def forward(self, input):
    encoder_output, previous_stage_output = input
    # ... masked self-attention and Encoder-Decoder cross-attention sublayers elided;
    # their result is attention_output ...
    # The position-wise feed-forward network is applied to each word vector
    ffnn_outputs = torch.stack([self.feedforward(word) for word in attention_output])
    # Adds the residual connection to the output of the attention layer, then layer-normalises
    layer_normed_ffnn_output = self.feedforward_layer_norm(ffnn_outputs + attention_output)
    return (encoder_output, layer_normed_ffnn_output)
```


## Data Flow

The diagram below (you’ll need to zoom in) shows the data flow for a single Encoder/Decoder, with 8 attention blocks per multihead attention layer. $$n$$ represents the number of words passed into the Encoder. $$m$$ represents the number of words passed into the Decoder. $$V$$ represents the length of the full vocabulary.

The dimensions of the data at each stage are depicted to facilitate understanding.
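
As a quick sanity check on those dimensions, the sketch below walks through the shapes using the sizes from the original paper (an embedding width of 512 split across the 8 heads); the variable names and the concrete values of $$n$$, $$m$$ and $$V$$ are purely illustrative.

```python
import torch
import torch.nn as nn

n, m, V, d_model, num_heads = 10, 7, 50_000, 512, 8   # illustrative sizes

encoder_input = torch.randn(n, d_model)    # n source-word embeddings (+ positional encoding)
decoder_input = torch.randn(m, d_model)    # m target-word embeddings (+ positional encoding)

# Each of the 8 heads projects its inputs down to d_model / num_heads = 64 dimensions,
# and concatenating the head outputs restores the full d_model width.
head_dim = d_model // num_heads            # 64

# The Encoder and Decoder stacks preserve the (number of words x d_model) shape of their inputs;
# the tensors below are stand-ins for the real stack outputs.
encoder_output = torch.randn(n, d_model)
decoder_output = torch.randn(m, d_model)

# The final projection maps every Decoder position onto the full vocabulary.
vocabulary_projection = nn.Linear(d_model, V)
logits = vocabulary_projection(decoder_output)
print(logits.shape)                        # torch.Size([7, 50000])
```
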

## Notes on the Code

• During inference, the last word in the output is appended to the output buffer, which is then fed back into the Decoder for the next step; a sketch of this loop appears after these notes.
• The encoder output is injected directly into the sublayer of every Decoder. To build up the chain of Decoders in PyTorch, so that we can put the full stack inside a Sequential block, we simply inject the encoder output to the root Decoder, and have it output the encoder output (together with the actual Decoder output) as part of the Decoder’s actual output to make it easy for the next Decoder in the stack to consume the Encoder and Decoder outputs.
• The code does not yet set up parameters in a form suitable for optimisation. There are several Module subclasses which are really only there for the convenience of not having to call the forward() methods explicitly. In a sequel, we will collapse most of the parameters into only a couple of Module subclasses.

• The class diagram is shown above. The composition hierarchy is quite straightforward, though there are some associations missing because of the shortcomings of the tool used to generate this (Pyreverse).
• Specifically, DecoderStack contains a bunch of Decoders, and EncoderStack contains a bunch of Encoders.
• qkv_source and masked_qkv_source contain instances of SingleSourceQKVLayer.
• unmasked_qkv_source contains an instance of MultiSourceQKVLayer.
• Some of the members in the classes are repeated because of Pyreverse duplicating the information from type hints.
• More notes can be found in the source itself.
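
To make the first note above concrete, here is a minimal sketch of that inference loop. The names `transformer`, `start_token`, and `end_token` are placeholders (not the repository's API), the tokens are assumed to be 0-dimensional index tensors, and the model is assumed to return one row of vocabulary logits per Decoder position.

```python
import torch

def greedy_decode(transformer, source_words, start_token, end_token, max_words=50):
    # The output buffer starts with only the start-of-sentence token.
    output_words = [start_token]
    for _ in range(max_words):
        decoder_input = torch.stack(output_words)
        logits = transformer(source_words, decoder_input)   # (len(output_words), V)
        # Only the logits for the last position matter: they predict the next word...
        next_word = logits[-1].argmax()
        # ...and that last word is appended to the output buffer for the next iteration.
        output_words.append(next_word)
        if next_word.item() == end_token.item():
            break
    return torch.stack(output_words)
```
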

## Conclusion

We have built and tested the basic Transformer architecture. However, we still need to do the following:

• Build a proper vocabulary. Our current vocabulary is hard-coded, and contains random vectors.
• Several tensors are currently reused as parameters in multiple places; some of these need to be separate, independently-trained parameters.
• There are several Module subclasses. For optimisation, we will need to centralise where we register our parameters.
• We still need to train the Transformer.

We will work on all of the above in the sequel to this post.

tags: Machine Learning - PyTorch - Programming - Deep Learning - Transformers