
Hierarchical Reasoning Model

Reading Hierarchical Reasoning Models by Wang et al.

August 2025

Quantization (INCOMPLETE)

Mechanics of Quantization

July 2025

Swarm of Attention Variants

A comprehensive overview of attention mechanism variants, exploring multi-head attention, sparse attention, and other approaches to the attention paradigm.

April 2025

RoPE

Rotary Position Embedding for transformers. A method to encode relative positional information directly into attention computations.

April 2025

(Q?)KV Cache

I got tired of intuitively knowing how KV-Cache works without seeing it from first principles for myself, so here you go.

April 2025

Tiny Stories

(Paper Notes) On *'TinyStories: How Small Can Language Models Be and Still Speak Coherent English?'* by Eldan and Li

April 2025

Generative Adversarial Networks

just a whiteboard, dm me if you spot errors

March 2025

GRPO

Group Relative Policy Optimization for reinforcement learning. Advanced techniques for stable policy updates and improved training efficiency in RL systems.

February 2025

The Inner Mechanism of Byte Pair Encoding

BPE is a subword tokenization technique that splits words into smaller, more frequent subunits, reducing vocabulary size while still representing a large variety of words. It was used for tokenization in models such as GPT-1 and GPT-2; a toy version of the merge loop is sketched below.

January 2025
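
As a rough illustration (not code from the post), a minimal BPE training loop over a toy corpus might look like the sketch below; the corpus words, counts, and number of merges are invented for the example.

```python
from collections import Counter

def most_frequent_pair(corpus):
    # Count adjacent symbol pairs across every word, weighted by word frequency.
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(pair, corpus):
    # Rewrite every word, fusing each occurrence of the chosen pair into one symbol.
    new_corpus = {}
    for word, freq in corpus.items():
        symbols = word.split()
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        new_corpus[" ".join(merged)] = freq
    return new_corpus

# Toy corpus: words pre-split into characters, mapped to made-up counts.
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

for _ in range(5):  # learn 5 merge rules
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(pair, corpus)
    print("merged", pair)
```

Each learned merge becomes a vocabulary entry; applying the same merges in the same order is what tokenizes new text.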

Improving Language Understanding by Generative Pre-Training

Paper Notes

January 2025

Transformers

The architecture that changed everything in deep learning. Self-attention mechanisms, encoder-decoder structure, and the foundation for modern language models.

January 2025

Attention

The foundational mechanism that revolutionized deep learning. Understanding query-key-value interactions and how attention lets models focus on relevant information; the standard scaled dot-product form is sketched below.

January 2025
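
For reference, the query-key-value interaction in its standard scaled dot-product form is

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

where $d_k$ is the key dimension and the softmax weights determine how much each query attends to each key's value.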

Neural Scaling Laws

Paper Notes

January 2025

Mathematics of BPTT

From first principles

January 2025

Residuals

Here's how we always have a gradient of at least $\frac{\partial L}{\partial x_{l+2}}$ at the $l$-th layer with a Residual Connection at every other $l$, and why we need the Identity Transformation to maintain this important feature of residual networks for **deep networks**; the derivation is sketched below.

November 2024
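
A sketch of that claim, writing $F$ for the residual branch between $x_l$ and $x_{l+2}$:

$$x_{l+2} = x_l + F(x_l) \;\;\Rightarrow\;\; \frac{\partial L}{\partial x_l} = \frac{\partial L}{\partial x_{l+2}}\left(I + \frac{\partial F(x_l)}{\partial x_l}\right),$$

so the term $\frac{\partial L}{\partial x_{l+2}}$ survives unattenuated only when the skip path is the identity; replacing the skip with any other transformation multiplies the gradient by that transformation's Jacobian.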

Backprop through Convolutions

Understanding gradient flow through convolutional layers. Mathematical derivations and intuitive explanations of backpropagation in CNNs.

October 2024