Implement `simd_popcount` using NEON `cnt`.
The cnt
instruction gives a population count for bytes in a vector,
i.e. for <8 x i8> or <16 x i8>. This implmentation counts set bits in
the bytes of a vector and uses NEON's addp
(pairwise add) to reduce
the vector to the appropriate field width and element count.
Performance gains from this change are pretty good. The popcount kernel shows a reduction of ~10% in cycles used on average.