The MAMBA Model transformer with a language modeling head on top (a linear layer whose weights are tied to the input embeddings).
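A minimal sketch of loading this model through the Hugging Face `transformers` library; the class name `MambaForCausalLM` and the checkpoint `state-spaces/mamba-130m-hf` are assumptions based on the public release, not something stated in this text.

```python
# Sketch: generating text with Mamba's causal-LM variant via `transformers`.
# Assumed names: MambaForCausalLM, state-spaces/mamba-130m-hf.
from transformers import AutoTokenizer, MambaForCausalLM

tokenizer = AutoTokenizer.from_pretrained("state-spaces/mamba-130m-hf")
model = MambaForCausalLM.from_pretrained("state-spaces/mamba-130m-hf")

inputs = tokenizer("Mamba is a state-space model that", return_tensors="pt")

# The LM head reuses the tied input-embedding weights to project hidden
# states back into vocabulary logits, so no separate output matrix is learned.
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0]))
```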
Mamba, like FlashAttention, tries to limit the number of times data has to move between DRAM and SRAM.