[Timing] Synchronize, report from rank 0 only
Timings are synchronized across ranks by taking the maximum over all ranks. The max is used because every other rank has to wait for the slowest one, so the slowest rank determines the elapsed time. The timings are not summed over ranks because we do not sum over the individual threads running on the device either.
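
A minimal sketch of the idea, not the code from this change: it simulates the ranks with a plain list instead of using MPI, and the names `reduce_timing_max`, `report`, and the region label `solve` are hypothetical. In an actual MPI code the reduction would typically be an `MPI_Reduce` or `MPI_Allreduce` with `MPI_MAX`.

```python
def reduce_timing_max(per_rank_seconds):
    """Combine per-rank timings by taking the slowest rank's time."""
    return max(per_rank_seconds)


def report(rank, label, seconds):
    """Return the report line on rank 0 and None on every other rank."""
    if rank != 0:
        return None
    return f"{label}: {seconds:.3f} s"


# Hypothetical timings of one timed region on four ranks, in seconds.
per_rank = [0.8, 1.2, 0.9, 1.0]

# Every rank ends up with the same synchronized value (the max).
synced = reduce_timing_max(per_rank)

# Only rank 0 produces output; the other ranks stay silent.
lines = [report(rank, "solve", synced) for rank in range(len(per_rank))]
print([line for line in lines if line is not None])
```

Summing the same four values would give 3.9 s, which overstates the 1.2 s the job actually had to wait for this region.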
Addresses: #88 (closed)