Architecture
State Space Models and Mamba (beyond Transformers)
Classical SSMs, S4 (HiPPO, Long Range Arena), Mamba's selectivity and parallel scan, Mamba-2's state space duality, and Jamba hybrids. Cost linear in sequence length versus quadratic for attention.
15 min read
Mixture-of-Experts (MoE) at scale
Router, top-k routing, auxiliary load-balancing loss, expert capacity, and expert parallelism. Total versus active parameters (Switch, GLaM, Mixtral, DeepSeek-V3).
16 min read