Try to reduce the number of uploads per iteration; change the order of execution for the scatter kernel.