
Conclusion

We have developed the DSNIC with one design aim constantly in mind: optimise for low-latency, high-throughput communication at little CPU load. To accomplish this, the CSP-based API is extended with facilities for asynchronous, non-blocking, and zero-copy communication. Furthermore, the communication protocol splits messages into packets of limited size, which avoids continuous wormhole blocking and reduces network latency. The protocol supports adaptive routing, and achieves full bandwidth utilisation through a sliding window protocol and a non-restricting acknowledgement scheme. Reliability is ensured by the very low bit error rate (BER) of DS link networks and by end-to-end flow control.
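As an illustration, the sketch below shows what such an extended CSP-style interface could look like in C. All names and signatures here are hypothetical; they merely indicate how asynchronous, non-blocking, and zero-copy primitives can coexist with the blocking CSP operations, and are not the DSNIC's actual interface.

    /* Hypothetical sketch of a CSP-style channel API extended with
       asynchronous, non-blocking, zero-copy primitives. */
    #include <stddef.h>

    typedef struct Channel Channel;   /* one virtual link */
    typedef struct Request Request;   /* handle for an in-flight transfer */

    /* Classic blocking CSP semantics. */
    void chan_send(Channel *c, const void *buf, size_t len);
    void chan_recv(Channel *c, void *buf, size_t len);

    /* Asynchronous extension: start a zero-copy transfer directly from
       the user buffer and return immediately with a request handle. */
    Request *chan_send_async(Channel *c, const void *buf, size_t len);
    Request *chan_recv_async(Channel *c, void *buf, size_t len);

    int  request_done(Request *r);   /* non-blocking completion test */
    void request_wait(Request *r);   /* block until the transfer completes */

The zero-copy invariant in such a design is that the caller must not modify (or, on receive, read) the buffer between the asynchronous call and its completion, since the hardware transfers to and from the user buffer directly.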

In this way, we have obtained a system that reaches 8.3 Mbytes/s unidirectional process-to-process throughput over a single virtual link; for bidirectional communication, a maximum throughput of 16.6 Mbytes/s is reached. In both cases, up to 90% of the maximum theoretical bandwidth can be exploited. The DSNIC offers a 1-byte message latency of 67 µs, which is fast considering that this is the process-to-process latency. During 8.3 Mbytes/s unidirectional communication with a packet size of 4096 bytes, the DSNIC requires 7% CPU load; for the same packet size, 16.6 Mbytes/s bidirectional communication incurs a CPU load of 19%.

We have observed that high throughput and low CPU load can only be obtained with large packets, e.g., 4096 bytes. On the other hand, we have shown that the ideal packet size for the 512-node Clos DS link network is small: 28 bytes. Closing this gap completely would require dedicated hardware.
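A back-of-the-envelope calculation, using only figures quoted in this paper (8.3 Mbytes/s throughput and 32.5 µs of CPU time per packet handled by interrupt), illustrates the size of this gap:

    /* Rough estimate of host CPU load as a function of packet size,
       using the measured figures from this paper. */
    #include <stdio.h>

    int main(void)
    {
        const double throughput = 8.3e6;        /* bytes/s, unidirectional */
        const double cost_per_packet = 32.5e-6; /* s of CPU time per packet */
        const double sizes[] = { 4096.0, 28.0 };

        for (int i = 0; i < 2; i++) {
            double rate = throughput / sizes[i];  /* packets per second */
            double load = rate * cost_per_packet; /* fraction of one CPU */
            printf("%5.0f-byte packets: %8.0f packets/s, %7.1f %% CPU\n",
                   sizes[i], rate, 100.0 * load);
        }
        return 0;
    }

At 4096 bytes this predicts roughly 2000 packets/s and a CPU load just under 7%, matching the measurement above; at 28 bytes it predicts almost 300 000 packets/s and a load of over 900%, far beyond what any software implementation can sustain. Hence the need for dedicated hardware.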

We have shown that protocol off-loading has the potential to yield a significant gain in available CPU power, e.g., up to 53% for a packet size of 1024 bytes. Furthermore, we have shown that this method requires a substantial amount of on-board computing power.

In general-purpose operating systems, context switches are a source of overhead. We have shown the overhead of one form of context switch: the interrupt. An interrupt that handles the communication of a single packet takes 32.5 µs, whereas the same operation on the same CPU can also be performed in 11.5 µs, an overhead of 65%. Furthermore, we estimated the performance loss caused by context switches when processes, rather than threads, are used to implement concurrent communication and computation.
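The 65% figure follows directly from these two measurements:

    (32.5 - 11.5) / 32.5 ≈ 0.65

i.e., roughly two thirds of the interrupt-driven handling time is pure context-switch overhead rather than useful work.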

A major performance improvement can be obtained by avoiding the OS facilities that require context switches, i.e., kernel calls, task switches, and interrupts, and by providing adequate alternatives for them.
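One such alternative, sketched below, is to replace the per-packet interrupt with user-level polling of a memory-mapped status word, so that packet arrival is detected without any kernel call, task switch, or interrupt. The register name and layout are hypothetical; the sketch only illustrates the principle, not the DSNIC driver itself.

    #include <stdint.h>

    /* Assumed: the NIC maps a status register into user space. */
    volatile uint32_t *rx_status;  /* set by the hardware on packet arrival */

    static void wait_for_packet(void)
    {
        /* Busy-wait in user space: spinning costs CPU cycles, but it
           avoids the ~21 µs of pure context-switch overhead measured
           above for interrupt-driven packet handling. */
        while (*rx_status == 0)
            ;
        *rx_status = 0;            /* acknowledge by clearing the flag */
    }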


