How to convert between uint64_t and poly64_t on ARM?

  arm, arm64, c++

I’d like to perform polynomial multiplication of two uint64_t values on ARM, where the least significant bit (the one obtained by w&1) is the least significant coefficient (the a_0 in w(x) = Σ_i a_i·x^i), and get the least significant 64 coefficients (a_0 … a_63) of the result as a uint64_t (so that (result>>i)&1 is a_i).
It’s not clear to me, however, what the standard-compliant way is to convert uint64_t to poly64_t and (the least significant part of) poly128_t to uint64_t.
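
The intended operation can be pinned down with a plain C++ reference sketch that uses no NEON types at all. The name `clmul_low_reference` is hypothetical (not part of ACLE), and the sketch assumes exactly the convention described above: bit i of each operand is the coefficient of x^i.

```cpp
#include <cstdint>

// Carry-less (GF(2)[x]) multiplication keeping only coefficients a_0..a_63.
// Bit i of each operand is taken to be the coefficient of x^i.
inline std::uint64_t clmul_low_reference(std::uint64_t v, std::uint64_t w) {
    std::uint64_t result = 0;
    for (int i = 0; i < 64; ++i) {
        if ((w >> i) & 1u) {
            // Add (XOR) v * x^i; coefficients above x^63 fall off the top.
            result ^= v << i;
        }
    }
    return result;
}
```

For example, (x+1)·(x+1) = x^2+1 over GF(2), so `clmul_low_reference(3, 3)` is 5. A slow reference like this is only meant as something to compare an intrinsic-based version against.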

poly8_t, poly16_t, poly64_t and poly128_t are defined as unsigned integer types. It is unspecified whether these are the same type as uint8_t, uint16_t, uint64_t and uint128_t for overloading and mangling purposes.

ACLE does not define whether int64x1_t is the same type as int64_t, or whether uint64x1_t is the same type as uint64_t, or whether poly64x1_t is the same as poly64_t for example for C++ overloading purposes.


The quotes above open up some scary possibilities in my head: perhaps the bit order is flipped, or there’s some padding, or, who knows, maybe these are structs.

So far I’ve come up with these two:

poly64_t uint64_t_to_poly64_t(uint64_t x) {
  return vget_lane_p64(vcreate_p64(x), 0);
}

uint64_t less_sinificant_half_of_poly128_t_to_uint64_t(poly128_t big) {
  return vgetq_lane_u64(vreinterpretq_u64_p128(big), 0);
}

But they seem cumbersome (as they go through intermediaries like poly64x1_t), and they still make some assumptions: that poly128_t can be treated as a vector of two uint64_t, that the 0-th uint64_t contains the "less significant coefficients", and that the least significant polynomial coefficient is at the least significant bit of that uint64_t.

OTOH it seems that I can simply "ignore" the whole issue, and just pretend that integers are polynomials as the two functions produce the same assembly:

uint64_t polynomial_mul_low(uint64_t v,uint64_t w) {
    const poly128_t big = vmull_p64(uint64_t_to_poly64_t(v),
                                    uint64_t_to_poly64_t(w));
    return less_sinificant_half_of_poly128_t_to_uint64_t(big);
}

uint64_t polynomial_mul_low_naive(uint64_t v,uint64_t w) {
    return vmull_p64(v,w);
}

that is:

        fmov    d0, x0
        fmov    d1, x1
        pmull   v0.1q, v0.1d, v1.1d
        fmov    x0, d0

Also, the assembly for uint64_t_to_poly64_t and less_sinificant_half_of_poly128_t_to_uint64_t seems to be a no-op, which supports the hypothesis that no real steps are involved in the conversion.
(See above in action:


uint64_t polynomial_mul_low_naive(uint64_t v,uint64_t w) {
    return (uint64_t)vmull_p64(poly64_t{v},poly64_t{w});
}

seems to compile, and while the {..}s give me the soothing confidence that no narrowing occurred, I’m still unsure if the order of the bits and order of the coefficients are guaranteed to be consistent, and thus have some worries about the final (uint64_t) cast.
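
One way to at least detect a bit-order or lane-order mismatch on a given platform is a known-answer check. The sketch below is an assumption-laden illustration, not a proof of standards conformance: `check_pmull_bit_order`, `PmullCheck` and `POLY_CHECK_HAVE_PMULL` are hypothetical names, and the feature-macro guard (`__ARM_FEATURE_AES` or the older `__ARM_FEATURE_CRYPTO`) is assumed to be how the toolchain signals that vmull_p64 is available. It uses only the intrinsics already shown above. The test vector is (x^63 + 1)·(x + 1) = x^64 + x^63 + x + 1, which exercises both halves of the 128-bit result.

```cpp
#include <cstdint>

#if defined(__aarch64__) && \
    (defined(__ARM_FEATURE_AES) || defined(__ARM_FEATURE_CRYPTO))
#include <arm_neon.h>
#define POLY_CHECK_HAVE_PMULL 1
#else
#define POLY_CHECK_HAVE_PMULL 0
#endif

enum class PmullCheck { Ok, Mismatch, Unavailable };

// Known-answer test: multiplies (x^63 + 1) by (x + 1) with vmull_p64 and
// checks that lane 0 holds coefficients a_0..a_63 and lane 1 holds a_64..a_127.
inline PmullCheck check_pmull_bit_order() {
#if POLY_CHECK_HAVE_PMULL
    static_assert(sizeof(poly64_t) == sizeof(std::uint64_t),
                  "poly64_t is assumed to be 64 bits wide");
    static_assert(sizeof(poly128_t) == 2 * sizeof(std::uint64_t),
                  "poly128_t is assumed to be 128 bits wide");

    const std::uint64_t a = 0x8000000000000001ULL;  // x^63 + 1
    const std::uint64_t b = 0x3ULL;                 // x + 1

    const poly64_t pa = vget_lane_p64(vcreate_p64(a), 0);
    const poly64_t pb = vget_lane_p64(vcreate_p64(b), 0);
    const poly128_t product = vmull_p64(pa, pb);

    const uint64x2_t halves = vreinterpretq_u64_p128(product);
    const std::uint64_t low = vgetq_lane_u64(halves, 0);
    const std::uint64_t high = vgetq_lane_u64(halves, 1);

    // Expected: x^64 + x^63 + x + 1.
    return (low == 0x8000000000000003ULL && high == 0x1ULL)
               ? PmullCheck::Ok
               : PmullCheck::Mismatch;
#else
    // No PMULL available at compile time; nothing to check.
    return PmullCheck::Unavailable;
#endif
}
```

A result of `PmullCheck::Mismatch` at startup or in a unit test would mean the conversion helpers above do not behave as assumed on that target; `PmullCheck::Ok` only says they behaved as assumed for this one input.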

I want my code to be correct w.r.t. the standards, as opposed to just working by accident, as it has to be written once and run on many ARM64 platforms. Hence my question:

How does one perform a proper conversion between polyXXX_t and uintXXX_t, and how does one extract "lower half of coefficients" from polyXXX_t?

Source: Windows Questions C++