> I think you undersell the M1's SIMD abilities a bit relative to Kaby Lake.
Yes, the vector units in KBL are 256 bits wide versus 128 bits in M1, but M1 has four of them, for a total vector width of 512 bits. KBL has only 3 vector ports, for a maximum width of 768 bits, but most operations cannot execute on all three ports, so the effective total width is often lower.
For most algorithms, total vector execution width matters more than how large a single register or unit is.
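
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The unit counts and widths are the ones quoted above. The two-port and one-port cases are illustrative assumptions about an operation restricted to a subset of KBL's ports, not measurements, and the helper name `total_vector_width` is made up for this example.

```python
def total_vector_width(units: int, bits_per_unit: int) -> int:
    """Peak bits of vector work issued per cycle: units times width of each unit."""
    return units * bits_per_unit


# Figures taken from the comment above.
M1_UNITS, M1_BITS = 4, 128
KBL_PORTS, KBL_BITS = 3, 256

m1_peak = total_vector_width(M1_UNITS, M1_BITS)      # all four 128-bit units busy
kbl_peak = total_vector_width(KBL_PORTS, KBL_BITS)   # only if an op can use all 3 ports

# Assumption for illustration: an operation that can issue on only
# two, or only one, of KBL's three vector ports.
kbl_two_ports = total_vector_width(2, KBL_BITS)
kbl_one_port = total_vector_width(1, KBL_BITS)

print(f"M1 peak:          {m1_peak} bits/cycle")
print(f"KBL, 3 ports:     {kbl_peak} bits/cycle")
print(f"KBL, 2 ports:     {kbl_two_ports} bits/cycle")
print(f"KBL, 1 port only: {kbl_one_port} bits/cycle")
```

Under those assumptions, an operation limited to two KBL ports lands at the same 512 bits per cycle as M1, and one limited to a single port falls to 256, which is the sense in which the effective width is often lower than the 768-bit maximum.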