Building from Scratch
- https://www.datacamp.com/tutorial/building-a-transformer-with-py-torch
- https://xiaosheng.blog/2022/06/28/use-pytorch-to-implement-transformer
- https://www.cnblogs.com/wevolf/p/12484972.html
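The tutorials linked above build the Transformer piece by piece; as a quick reference, here is a minimal PyTorch sketch of the core multi-head self-attention block. It is not taken from any of those posts, and the class and parameter names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadSelfAttention(nn.Module):
    """Standard multi-head self-attention written out from scratch (illustrative sketch)."""

    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        # Separate projections for queries, keys, values, plus the output projection.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, d_model = x.shape
        # Project and split into heads: (batch, num_heads, seq_len, head_dim).
        q = self.q_proj(x).view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        # Scaled dot-product attention.
        scores = q @ k.transpose(-2, -1) / (self.head_dim ** 0.5)
        out = F.softmax(scores, dim=-1) @ v
        # Merge the heads back and apply the output projection.
        out = out.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
        return self.out_proj(out)


x = torch.randn(2, 16, 512)                # (batch, seq_len, d_model)
print(MultiHeadSelfAttention()(x).shape)   # torch.Size([2, 16, 512])
```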
Grouped-query Attention/Multi-query Attention
- https://towardsdatascience.com/demystifying-gqa-grouped-query-attention-3fb97b678e4a
- GQA interpolates between multi-head attention (MHA) and multi-query attention (MQA), offering a practical trade-off between quality and speed: query heads are split into groups that each share a single key/value head, which shrinks the KV cache and reduces memory-bandwidth demands, making it well suited to scaling up models. GQA has replaced standard multi-head attention in recent models such as LLaMA-2 and Mistral 7B (see the sketch after this list).
- https://medium.com/@maxshapp/grouped-query-attention-gqa-explained-with-code-e56ee2a1df5a
- https://github.com/fkodom/grouped-query-attention-pytorch
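A minimal PyTorch sketch of the grouping idea described above follows. It is not the API of the fkodom repo linked here; the class and parameter names (GroupedQueryAttention, num_kv_heads) are illustrative assumptions. Setting num_kv_heads equal to num_heads recovers MHA, and setting it to 1 recovers MQA.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GroupedQueryAttention(nn.Module):
    """Self-attention in which groups of query heads share one key/value head (illustrative sketch)."""

    def __init__(self, d_model: int = 512, num_heads: int = 8, num_kv_heads: int = 2):
        super().__init__()
        assert d_model % num_heads == 0 and num_heads % num_kv_heads == 0
        self.num_heads = num_heads
        self.num_kv_heads = num_kv_heads
        self.head_dim = d_model // num_heads
        self.q_proj = nn.Linear(d_model, num_heads * self.head_dim)
        # Smaller K/V projections: only num_kv_heads heads are materialized,
        # which is what shrinks the KV cache at inference time.
        self.k_proj = nn.Linear(d_model, num_kv_heads * self.head_dim)
        self.v_proj = nn.Linear(d_model, num_kv_heads * self.head_dim)
        self.out_proj = nn.Linear(num_heads * self.head_dim, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, _ = x.shape
        q = self.q_proj(x).view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(batch, seq_len, self.num_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(batch, seq_len, self.num_kv_heads, self.head_dim).transpose(1, 2)
        # Each K/V head serves num_heads // num_kv_heads query heads.
        group_size = self.num_heads // self.num_kv_heads
        k = k.repeat_interleave(group_size, dim=1)
        v = v.repeat_interleave(group_size, dim=1)
        scores = q @ k.transpose(-2, -1) / (self.head_dim ** 0.5)
        out = F.softmax(scores, dim=-1) @ v
        out = out.transpose(1, 2).reshape(batch, seq_len, -1)
        return self.out_proj(out)


x = torch.randn(2, 16, 512)
print(GroupedQueryAttention()(x).shape)  # torch.Size([2, 16, 512])
```

The bandwidth saving comes from the reduced k_proj/v_proj outputs: during decoding only num_kv_heads key/value heads need to be cached and streamed from memory instead of num_heads.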