OK, this is tricky. The finite element interface from dune-localfunctions expects a std::vector to fill. This is hardcoded.
So, I can get rid of the extra allocations by using some kind of cache structure, but that will still leave uses of std::vector on the device later.
I suppose in the long run we'll have to reimplement the Q1/P1 basis such that it accepts std::arrays, and exports the number of basis functions in a constexpr way.
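
To make that concrete, here is a rough sketch of what such a basis could look like. It is not the dune-localfunctions interface: the name `StaticQ1Basis`, its `evaluate` signature, and the vertex ordering are assumptions made up for illustration, and it assumes C++17 so that `std::array` can be filled inside a constexpr function.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for a Q1 local basis on the reference cube [0,1]^dim.
// This is NOT the dune-localfunctions API, only an illustration of the idea.
template <class Field, int dim>
struct StaticQ1Basis {
  // Number of shape functions, known at compile time: 2^dim.
  static constexpr std::size_t numFunctions = std::size_t(1) << dim;

  using Domain = std::array<Field, dim>;
  using Range = std::array<Field, numFunctions>;

  static constexpr std::size_t size() { return numFunctions; }

  // Evaluate all shape functions at x without any heap allocation.
  // Bit d of the function index i selects x[d] (bit set) or 1 - x[d]
  // (bit clear) as the factor in direction d.
  static constexpr Range evaluate(const Domain& x) {
    Range out{};
    for (std::size_t i = 0; i < numFunctions; ++i) {
      Field v = 1;
      for (int d = 0; d < dim; ++d)
        v *= ((i >> d) & 1) ? x[d] : (Field(1) - x[d]);
      out[i] = v;
    }
    return out;
  }
};
```

With this, `StaticQ1Basis<double, 2>::size()` is usable as a template argument, and `evaluate` returns its values by value in a fixed-size array instead of filling a caller-provided std::vector.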
An alternative would be to use ReservedVector, but that still needs an upper bound on the number of entries.
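
For illustration, a fixed-capacity container along those lines could look roughly like the sketch below. `BoundedVector` and `maxLocalSize` are hypothetical names, and the class is a simplified stand-in for the idea, not the actual Dune::ReservedVector. The bound 2^dim is an assumption that covers Q1 (2^dim functions) and P1 (dim+1 functions) only.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical fixed-capacity vector: storage lives inside the object,
// the size can vary at run time up to Capacity. Not Dune::ReservedVector.
template <class T, std::size_t Capacity>
class BoundedVector {
public:
  static constexpr std::size_t capacity() { return Capacity; }
  std::size_t size() const { return size_; }

  // Changing the size never allocates; it only has to respect the bound.
  void resize(std::size_t n) {
    assert(n <= Capacity);
    size_ = n;
  }

  void push_back(const T& value) {
    assert(size_ < Capacity);
    data_[size_++] = value;
  }

  T& operator[](std::size_t i) {
    assert(i < size_);
    return data_[i];
  }

  const T& operator[](std::size_t i) const {
    assert(i < size_);
    return data_[i];
  }

private:
  std::array<T, Capacity> data_{};
  std::size_t size_ = 0;
};

// Assumed upper bound on the local basis size: 2^dim (Q1), which is also
// at least dim + 1 (P1) for every dim >= 0.
constexpr std::size_t maxLocalSize(int dim) {
  return std::size_t(1) << dim;
}
```

A `BoundedVector<double, maxLocalSize(3)>` can then hold the 4 values of a P1 basis on a tetrahedron or the 8 values of a Q1 basis on a hexahedron, at the price of always reserving space for 8.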