How to Get 1.5 TFlops of FP32 Performance on a Single M1 CPU Core