Compiler optimizations for 5.8ms GPT-OSS-120B inference (not on GPUs)