Improve pack
Reimplements hsimd_packh and hsimd_packl in terms of the uzp1 and uzp2 NEON instructions. The commit history shows the other approaches I tried; of those, the uzp implementation appears to perform the best.
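For reviewers less familiar with uzp1/uzp2, here is a minimal scalar model of what the two instructions do on a 128-bit register viewed as 16 byte lanes. It is only a sketch: `Bytes16`, `uzp_model`, `uzp1_model` and `uzp2_model` are hypothetical names for illustration, not code from this PR, and the mapping onto hsimd_packl/hsimd_packh (including operand order) is my assumption for the fw = 16 case on a little-endian lane layout.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a 128-bit NEON register viewed as 16 x 8-bit lanes.
using Bytes16 = std::array<std::uint8_t, 16>;

// Scalar model of UZP1 (part == 0) and UZP2 (part == 1) at 8-bit element size.
// The lower half of the result is taken from n, the upper half from m.
Bytes16 uzp_model(const Bytes16& n, const Bytes16& m, std::size_t part) {
    assert(part < 2);
    Bytes16 r{};
    for (std::size_t i = 0; i < 8; ++i) {
        r[i]     = n[2 * i + part];  // even (part 0) or odd (part 1) lanes of n
        r[i + 8] = m[2 * i + part];  // even (part 0) or odd (part 1) lanes of m
    }
    return r;
}

// Even-numbered lanes: the low byte of each 16-bit field (assumed packl, fw = 16).
Bytes16 uzp1_model(const Bytes16& n, const Bytes16& m) { return uzp_model(n, m, 0); }

// Odd-numbered lanes: the high byte of each 16-bit field (assumed packh, fw = 16).
Bytes16 uzp2_model(const Bytes16& n, const Bytes16& m) { return uzp_model(n, m, 1); }
```

Neither instruction inspects the discarded half of a field, so both are plain truncating packs.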
I think the saturating versions of hsimd_packl (hsimd_packus and hsimd_packss) should also be given a once-over, but I don't believe we'll be able to use the uzp instructions there, since they truncate rather than saturate. Instead, we should use uqxtn and sqxtn. The existing implementation of hsimd_packus does this already, but it could be expanded to cover fw >= 16 && fw <= 64. I will include that when I make a list of functions to update, but thought it was worth mentioning here.
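To illustrate why uzp cannot stand in for the saturating packs, here is a small per-lane scalar model of the 16-bit to 8-bit case. The function names `truncate_lane`, `uqxtn_lane` and `sqxtn_lane` are hypothetical and only mirror the documented per-element behaviour of the instructions; this is a sketch, not the code in the repository.

```cpp
#include <algorithm>
#include <cstdint>

// What a uzp1-style pack does to one 16-bit field: keep the low byte only.
std::uint8_t truncate_lane(std::uint16_t x) {
    return static_cast<std::uint8_t>(x & 0xFF);
}

// Model of one UQXTN lane: unsigned 16-bit input clamped to [0, 255].
std::uint8_t uqxtn_lane(std::uint16_t x) {
    return static_cast<std::uint8_t>(std::min<std::uint16_t>(x, 0xFF));
}

// Model of one SQXTN lane: signed 16-bit input clamped to [-128, 127].
std::int8_t sqxtn_lane(std::int16_t x) {
    const int clamped = std::max(-128, std::min(127, static_cast<int>(x)));
    return static_cast<std::int8_t>(clamped);
}
```

For an input such as 0x1234, truncation yields 0x34 while the unsigned saturating narrow yields 0xFF, which is the difference the saturating packs depend on.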