While benchmarking PCI Express transfers with CUDA, I stumbled across some odd behaviour: a 4 MB block seems to achieve the best sustained bandwidth, at least when writing to the host.
However, transferring more than 4 MB in total while still splitting the data into 4 MB chunks (let's call it blocked copy) leaves a gap in performance.
Performance recovers towards the end, when the transfer almost fills the whole GPU memory, but the question is what causes it to drop to 2 GB/s in the first place.
Another interesting question is the jump in performance at 1e6 bytes, possibly due to a switch in protocols.
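
For anyone who wants to reproduce the blocked copy, here is a minimal host-side sketch of how such a measurement could be structured. It is not the benchmark code behind the numbers above. The names chunk_count, blocked_copy_gbps and CopyFn are made up for illustration, and the single transfer is passed in as a callback so the loop can be tried without a GPU. In a real CUDA benchmark the callback would presumably wrap cudaMemcpy with cudaMemcpyDeviceToHost, assuming the call blocks until the transfer has finished, so that wall-clock timing is meaningful.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <functional>
#include <vector>

// One transfer: copy `bytes` bytes from src to dst.
// Assumption: in the CUDA case this wraps a blocking
// cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToHost).
using CopyFn = std::function<void(void* dst, const void* src, std::size_t bytes)>;

// Number of transfers needed to move `total` bytes in chunks of at most `chunk` bytes.
std::size_t chunk_count(std::size_t total, std::size_t chunk) {
    assert(chunk > 0);
    return (total + chunk - 1) / chunk;
}

// Copy `total` bytes in chunks of at most `chunk` bytes (the "blocked copy")
// and return the achieved bandwidth in GB/s (1 GB = 1e9 bytes).
double blocked_copy_gbps(char* dst, const char* src,
                         std::size_t total, std::size_t chunk,
                         const CopyFn& copy) {
    assert(chunk > 0);
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t offset = 0; offset < total; offset += chunk) {
        // The last chunk may be shorter than `chunk`.
        const std::size_t n = std::min(chunk, total - offset);
        copy(dst + offset, src + offset, n);
    }
    const auto stop = std::chrono::steady_clock::now();
    const double seconds = std::chrono::duration<double>(stop - start).count();
    return static_cast<double>(total) / seconds / 1e9;
}
```

Calling blocked_copy_gbps with chunk set to 4 MB (4 << 20 bytes) and a growing total gives the blocked-copy curve, while setting chunk equal to total gives the plain single-transfer curve for comparison. Averaging over several repetitions per size is advisable, since a single short transfer is dominated by timer resolution and launch overhead.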