DSNIC: Brief Firmware Description.

Marcel Boosten

CERN, EP/ATD, Geneva, Switzerland

December 12 1997

Introduction
Firmware design
Architecture
Detailed design
Resource usage
Performance

Introduction

The DSNIC firmware has been designed to fulfil two requirements:

It should provide access to the DS link control network, so that a PCI-based system can act as the DS network controller.

It should support a high speed message passing protocol.

The DSNIC card contains two C101 DS link interfaces. The first C101 is used to provide register I/O based network access meant for network control. The second C101 is used to support the high speed message communication protocol.

We have designed a high speed message passing protocol with the following characteristics:

To avoid long term blocking of a wormhole route, the protocol should split messages into packets.

The protocol should support adaptive routing, because adaptive routing has advantages for both throughput and reliability. A consequence of supporting adaptive routing is out-of-order packet delivery. The protocol should therefore reorder packets on the receiving side.

To avoid network congestion, the message receiver should be capable of receiving data as soon as it is offered. Since the buffering capabilities of each end-node are limited, the protocol uses end-to-end flow control to avoid packet loss due to buffer overflow.
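The three characteristics above can be illustrated with a minimal Python sketch. It is not the DSNIC protocol itself: the packet representation (a byte offset plus payload), the credit-based form of flow control, and the names `split_message`, `reassemble` and `CreditSender` are all assumptions made for illustration, since the packet format is not described here.

```python
import random

def split_message(msg: bytes, packet_size: int):
    """Split a message into (offset, payload) packets of at most packet_size bytes."""
    return [(off, msg[off:off + packet_size])
            for off in range(0, len(msg), packet_size)]

def reassemble(packets, total_len: int) -> bytes:
    """Place each payload at its offset, so the arrival order does not matter."""
    buf = bytearray(total_len)
    for off, payload in packets:
        buf[off:off + len(payload)] = payload
    return bytes(buf)

class CreditSender:
    """Hypothetical end-to-end flow control: one credit per free receive buffer."""
    def __init__(self, credits: int):
        self.credits = credits

    def can_send(self) -> bool:
        return self.credits > 0

    def on_send(self) -> None:
        if self.credits <= 0:
            raise RuntimeError("sending now could overflow the receiver")
        self.credits -= 1

    def on_ack(self) -> None:
        # An acknowledge from the receiver frees one buffer.
        self.credits += 1

msg = bytes(range(256)) * 10           # 2560-byte message
packets = split_message(msg, 1024)     # payloads of 1024, 1024 and 512 bytes
random.Random(0).shuffle(packets)      # simulate out-of-order delivery
restored = reassemble(packets, len(msg))
```

Because every packet carries enough information to locate its payload in the message, reordering reduces to writing each payload at the right place, which is what makes DMA directly into the destination buffer attractive.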

Firmware design

The design of the firmware is the result of the following arguments:

Packet communication interface: Measurements have shown that the maximum link throughput cannot be reached if the host CPU transfers every single byte of a message via register I/O. Therefore, the DSNIC firmware offers a packet communication interface that allows DMA transfers of packet data.

DMA to user space: A well-known problem in interfacing is memory-to-memory copying, an expensive and non-essential operation that causes both CPU load and latency. To keep packet reordering cheap, the software should be able to determine from the packet header where the packet data is to be stored, so that the data can be transferred via DMA directly to user space. The firmware supports this by splitting the packet header from the packet data and generating an interrupt on header reception.

Low latency acknowledges: Because of end-to-end flow control, the latency of acknowledge packets is critical for the message latency. The firmware reduces the acknowledge latency by offering two separate outgoing FIFOs, one for acknowledge packets and one for data packets, which allows acknowledges to overtake data packets.

Support different packet sizes: The packet size determines the number of packets that need to be handled by the host CPU, and therefore strongly influences the CPU load. To investigate this influence, the firmware supports packet sizes of 2 to the power n, where 2 < n < 13 (that is, from 8 up to 4096).

Maximise throughput: To allow maximum throughput communication, the firmware should hide the latency of the controlling software. The firmware does this by providing data buffering via FIFOs.
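To make the packet-size argument concrete, the following small Python sketch enumerates the supported sizes and counts how many packets the host CPU would have to handle for a given message. It assumes that the packet size refers to the payload carried per packet and that the message length is given in the same unit; the helper name `packets_per_message` is hypothetical.

```python
# Supported packet sizes: 2**n for 2 < n < 13, i.e. n = 3 .. 12.
PACKET_SIZES = [2 ** n for n in range(3, 13)]

def packets_per_message(message_len: int, packet_size: int) -> int:
    """Number of packets needed for a message (ceiling division)."""
    return -(-message_len // packet_size)

# Example: a message of 65536 units, sent with the smallest and largest size.
small = packets_per_message(65536, PACKET_SIZES[0])    # 8192 packets
large = packets_per_message(65536, PACKET_SIZES[-1])   # 16 packets
```

The factor of 512 between the two cases shows why the packet size is expected to have a strong influence on the per-message CPU load.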

Architecture

Figure 1 shows a block diagram of the firmware. The functions are described in AHDL and compiled to run on the -3 version of the Altera 10K50.

Figure 1: AHDL functional block diagram.

Detailed design

COMIN takes care of packet reception: it splits packet headers from packet data. The headers are sent to the HEADLENFIFO, and the data is sent to the INFIFO. Apart from splitting, COMIN also counts the packet length, which is known as soon as the end-of-packet character has been received. This length is also sent to the HEADLENFIFO. Data in the HEADLENFIFO is delivered to an AMCC mailbox, which can generate an interrupt. On a header reception interrupt, the host CPU can set up the receiving DMA.
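The behaviour of COMIN can be modelled in software roughly as follows. This is only a sketch under stated assumptions: the header size `HEADER_LEN`, the `EOP` marker object, the tagging of HEADLENFIFO entries, and the choice to count only data bytes as the packet length are all hypothetical, since the actual formats are not given here.

```python
from collections import deque

HEADER_LEN = 4      # assumption: the real header size is not stated
EOP = object()      # stand-in for the end-of-packet character

def comin(tokens, headlen_fifo: deque, infifo: deque) -> None:
    """Split each packet into header and data, and count the data length."""
    header = bytearray()
    data_len = 0
    for tok in tokens:
        if tok is EOP:
            # The length is only known once end-of-packet arrives.
            headlen_fifo.append(("length", data_len))
            header = bytearray()
            data_len = 0
        elif len(header) < HEADER_LEN:
            header.append(tok)
            if len(header) == HEADER_LEN:
                # Header complete: forward it before any data has arrived.
                headlen_fifo.append(("header", bytes(header)))
        else:
            infifo.append(tok)
            data_len += 1

headlen_fifo, infifo = deque(), deque()
stream = [1, 2, 3, 4, 10, 11, 12, EOP, 5, 6, 7, 8, 20, EOP]
comin(stream, headlen_fifo, infifo)
```

In this model the header entry becomes available before the packet data has been received, which corresponds to the point at which the host CPU can be interrupted to set up the receiving DMA.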

COMOUT takes care of sending data and acknowledge packets. If data is available in both the ACKFIFO and the OUTFIFO, the data in the ACKFIFO is selected. This way, acknowledges can overtake queued data packets.
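The selection rule of COMOUT amounts to a fixed-priority arbiter between two queues. The Python sketch below models it at packet granularity, which is an assumption: it treats each FIFO entry as one complete packet and does not model what happens when an acknowledge arrives while a data packet is being transmitted. The function name `comout_select` is hypothetical.

```python
from collections import deque

def comout_select(ackfifo: deque, outfifo: deque):
    """Return the next packet to send, giving the ACKFIFO strict priority."""
    if ackfifo:
        return ackfifo.popleft()
    if outfifo:
        return outfifo.popleft()
    return None  # nothing to send

ackfifo = deque(["ACK1"])
outfifo = deque(["DATA1", "DATA2"])

sent = [comout_select(ackfifo, outfifo),   # ACK1 goes first
        comout_select(ackfifo, outfifo)]   # then DATA1
ackfifo.append("ACK2")                     # an acknowledge is queued late
sent.append(comout_select(ackfifo, outfifo))  # ACK2 overtakes DATA2
sent.append(comout_select(ackfifo, outfifo))  # DATA2
```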

To hide the latency of the controlling software, there are FIFOs on both the receiving and the transmitting side. The buffer size of 1 Kbyte was chosen because it fits well into the internal resources of the Altera FPGA. It can hide a latency of up to 100 usec, which should be sufficient to hide software latency.
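The relation between buffer size and hidden latency is simple arithmetic: a FIFO can cover software latency for as long as it takes the link to fill (or drain) it. The Python sketch below works backwards from the two figures quoted above to the data rate they imply. The function names are hypothetical, and the resulting rate is only what the quoted figures imply, not a measured value.

```python
def hidden_latency_us(buffer_bytes: int, rate_bytes_per_us: float) -> float:
    """Time in microseconds for which a buffer can absorb a stream at the given rate."""
    return buffer_bytes / rate_bytes_per_us

def implied_rate_bytes_per_us(buffer_bytes: int, latency_us: float) -> float:
    """Data rate implied by a buffer size and the latency it is said to hide."""
    return buffer_bytes / latency_us

BUFFER_BYTES = 1024   # 1 Kbyte FIFO
LATENCY_US = 100.0    # latency the FIFO is stated to hide

rate = implied_rate_bytes_per_us(BUFFER_BYTES, LATENCY_US)
# rate is 10.24 bytes per microsecond, i.e. roughly 10 Mbytes/s per direction
```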

To avoid a store-and-forward system, the receiving DMA must be initialised as soon as the header of the packet is received. However, the length of a DMA transfer must be known beforehand. This is not the case at that point, since the packet length is determined by COMIN via counting and is only known at the end of the packet. Therefore FLUSH pads out each packet to the full packet length. DMAs can now be set up immediately on header reception, using the full packet length.
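The effect of FLUSH can be sketched in a few lines of Python. The fill value and the function name `flush_pad` are assumptions; the text above only states that packets are padded to the full packet length.

```python
def flush_pad(data: bytes, packet_size: int, fill: int = 0) -> bytes:
    """Pad packet data to exactly packet_size bytes (fill value is an assumption)."""
    if len(data) > packet_size:
        raise ValueError("packet data exceeds the configured packet size")
    return data + bytes([fill]) * (packet_size - len(data))

PACKET_SIZE = 1024
short = bytes(range(100))                 # a short final packet of a message
padded = flush_pad(short, PACKET_SIZE)

# The DMA can be programmed for PACKET_SIZE bytes on header reception;
# the valid byte count is the length that COMIN reports separately.
valid = padded[:len(short)]
```

The trade-off is that a short packet still occupies a full-length DMA transfer, in exchange for being able to start the transfer before the packet has been completely received.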

AMCCMUX multiplexes the AMCC to all the FIFOs. PASSTHRU provides access to the registers of the C101 and the reset links.

Resource usage

Figure 2: Resource usage of the firmware.

Figure 2 shows the FPGA resource usage. The Altera 10K50 contains a total of 2880 Logic Cells (LCs). Due to limitations in logic placement, only about 80% of these resources can actually be used. Up to 20% of the LCs in the firmware can be made available for additional purposes by reducing the size of the ACKFIFO and the HEADLENFIFO. Currently, these FIFOs are sized to just fill the Altera 10K50.

Figure 2 contains more functional blocks than Figure 1. The extra blocks FASTSLOW and SLOWFAST take care of data transfer from one clock domain to the other. Furthermore, there are IO and BUF blocks, which ensure that the timing constraints of the C101 interface are met.

Performance

Figure 3 shows the throughput versus message length for bidirectional message exchange (Comm2) for several packet sizes (8, 1024, and 4096). The results show the end-to-end data bandwidth between two Linux processes, each running on a 200 MHz Pentium PC.

Figure 3: Bidirectional throughput versus message length for different packet sizes.

This graph shows that the DSNIC can reach a bidirectional process-to-process throughput of 16.6 Mbytes/s, which is 90% of the theoretical bandwidth.