Appendix A:

Atlas Communication Benchmarks Report for the GPMIMD Machine

Further details of the work on IEEE 1355 DS links can be found on the CERN IEEE 1355 home page.

These benchmarks are described in "Communication Benchmarks for Trigger Applications".

Configuration 
Name GPMIMD Machine 
Machine 64 T9000 Transputer nodes using 58 C104 switches 
OS NONE 
Compiler SGS-Thomson, oc (occam
Options -na -b -t9000 -h -y -GAMMAE -n 
Library SGS-Thomson toolset libraries 
Time Measurement Transputer clock (1 microsecond resolution) 
Reported by Roger Heeley 
Benchmark version 0.1 
Date August, 1997 

Basic Benchmarks

NOTE: All nodes are 20 MHz, 16 Mbyte, GAMMA REVE03 HTRAMs, running with 8K cache and 8K internal memory. All links run at 100 Mbits/s. The C104s are REV BetaB02. All measurements use only one of the four independent T9000 networks. The shortest message used is a single integer, i.e. 4 bytes on the T9000.
 
Com-1.1 (one-way) 
Packet Size  Time (in microseconds) 
4 6.6 
64 16.9 
256 56.9 
1024 216.9 
Note 1.1: Times are for the sender side. A single virtual link connects the two processors. Processors are on the same motherboard, i.e. one C104 between them.
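As a reading aid, the following Python sketch derives the effective sender-side throughput and the incremental per-byte cost from the Com-1.1 figures above. The function and variable names are illustrative only and are not part of the benchmark suite. Packet sizes are taken to be in bytes, as implied by the general note.

```python
# Com-1.1 (one-way) results copied from the table above:
# packet size in bytes -> sender-side time in microseconds.
COM_1_1 = {4: 6.6, 64: 16.9, 256: 56.9, 1024: 216.9}


def throughput_mbytes_per_s(size_bytes, time_us):
    """Effective throughput. One byte per microsecond equals 10^6 bytes per second."""
    return size_bytes / time_us


def incremental_cost_us_per_byte(results, small, large):
    """Extra time per extra byte between two packet sizes (a simple secant slope)."""
    return (results[large] - results[small]) / (large - small)


if __name__ == "__main__":
    for size, time_us in sorted(COM_1_1.items()):
        print(size, round(throughput_mbytes_per_s(size, time_us), 2))
    print(round(incremental_cost_us_per_byte(COM_1_1, 256, 1024), 4))
```

For 1024-byte packets this gives roughly 4.7 Mbytes/s as seen by the sender.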
Com-1.2 (two-way) 
Packet Size  Time (in microseconds) 
4 4.9 
64 10.1 
256 33.6 
1024 127.2 
Note 1.2: Time is total wall clock time on sender divided by two. A single virtual link is connected in each direction. Processors are on the same motherboard, i.e. one C104 between them.
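Because the reported two-way figure is half of the sender's wall clock time, the full exchange time can be recovered by doubling it. The sketch below does that and lists it next to the Com-1.1 one-way sender time for the same packet size. The names are hypothetical, and no interpretation of the difference between the two benchmarks is implied.

```python
# Com-1.2 (two-way) reported times and Com-1.1 (one-way) times, in microseconds,
# keyed by packet size in bytes. Values copied from the tables in this report.
COM_1_2_REPORTED = {4: 4.9, 64: 10.1, 256: 33.6, 1024: 127.2}
COM_1_1_ONE_WAY = {4: 6.6, 64: 16.9, 256: 56.9, 1024: 216.9}


def total_wall_clock_us(reported_us):
    """Undo the division by two described in Note 1.2."""
    return 2.0 * reported_us


def exchange_summary():
    """Return (size, total two-way time, one-way sender time) tuples."""
    return [
        (size, total_wall_clock_us(COM_1_2_REPORTED[size]), COM_1_1_ONE_WAY[size])
        for size in sorted(COM_1_2_REPORTED)
    ]


if __name__ == "__main__":
    for row in exchange_summary():
        print(row)
```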
Com-1.3 (all-to-all) 
Processors  Packet Size  Time (in microseconds) 
4 4 17.2 
4 64 42.7 
4 256 146.9 
4 1024 565.5 
8 4 38.5 
8 64 101.1 
8 256 368.3 
8 1024 1427.6 
16 4 79.2 
16 64 228.4 
16 256 799.5 
16 1024 3185.8 
62 4 320.3 
62 64 1707.3 
62 256 6267.8 
62 1024 23167.0 
Note 1.3: Time is total wall clock time for all processors to send a single packet to all other processors. Every processor has a virtual link to every other processor and uses them all in parallel. All processors in the machine are used (including slot 0s), except one for multiplexing host I/O.
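Since every processor sends one packet to every other processor, a run with P processors moves P x (P - 1) packets in total. The following sketch, with hypothetical names, turns one row of the Com-1.3 table into a message count, an average time per message, and an aggregate throughput.

```python
def all_to_all_stats(processors, packet_bytes, total_time_us):
    """Summarise one Com-1.3 row; each processor sends to all the others."""
    messages = processors * (processors - 1)
    total_bytes = messages * packet_bytes
    return {
        "messages": messages,
        # Average share of the wall clock time per message (messages overlap).
        "time_per_message_us": total_time_us / messages,
        # Bytes per microsecond is the same as 10^6 bytes per second.
        "aggregate_mbytes_per_s": total_bytes / total_time_us,
    }


if __name__ == "__main__":
    # Row taken from the table above: 62 processors, 1024-byte packets.
    print(all_to_all_stats(62, 1024, 23167.0))
```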
Com-1.4 (pairs) 
Total Processors  Packet Size  Time (in microseconds) 
4 4 13.1 
4 64 23.4 
4 256 84.7 
4 1024 329.7 
8 4 13.3 
8 64 23.6 
8 256 84.9 
8 1024 329.8 
16 4 13.3 
16 64 23.6 
16 256 84.9 
16 1024 330.1 
52 4 13.3 
52 64 23.6 
52 256 85.0 
52 1024 330.3 
Note 1.4: Time is average wall clock time across all senders. There is one virtual link between each source and its destination. Sources and destinations are always on different motherboards, i.e. 3 C104s between them. The 8 slot 0 T9000s are not used; at worst they have 5 C104s between them. When a single virtual link is in use, the extra C104s greatly increase latency.
Com-1.5 (outfarming) 
Receivers  Packet Size  Time (in microseconds) 
4 4 13.5 
4 64 30.6 
4 256 127.3 
4 1024 513.9 
8 4 21.2 
8 64 61.6 
8 256 256.1 
8 1024 1034.6 
16 4 41.6 
16 64 122.9 
16 256 512.5 
16 1024 2228.1 
53 4 138.8 
53 64 407.1 
53 256 1853.0 
53 1024 7664.1 
Note 1.5: The time is the total wall clock time to send one packet to EVERY destination, with different data for each destination. There is a single virtual link to every destination, and all of them are used in parallel on the source. Slot 0 T9000s are not used, and wherever possible the source is on a different motherboard from all destinations. This was not possible for the 53-destination measurements, where 4 destinations were on the same motherboard as the source. Note that the total number of processors is 'Receivers' + 1.
Com-1.6 (multicast) 
Receivers  Packet Size  Time (in microseconds) 
4 4 13.2 
4 64 30.4 
4 256 126.3 
4 1024 510.4 
8 4 18.78 
8 64 59.9 
8 256 249.5 
8 1024 1007.9 
16 4 37.1 
16 64 119.8 
16 256 500.0 
16 1024 2020.6 
53 4 122.8 
53 64 397.0 
53 256 1656.5 
53 1024 6694.2 
Note 1.6: The time is the total wall clock time to send one packet to EVERY destination. Here the same data is sent to all destinations. There is a single virtual link to every destination, and all of them are used in parallel on the source. Slot 0 T9000s are not used, and wherever possible the source is on a different motherboard from all destinations. This was not possible for the 53-destination measurements, where 4 destinations were on the same motherboard as the source. Note that the total number of processors is 'Receivers' + 1.
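Com-1.5 and Com-1.6 differ only in whether each destination receives different data or the same data, so their rows can be compared directly. The sketch below uses the 1024-byte rows of both tables to compute the outfarming-to-multicast time ratio and the average time per receiver. The names are illustrative and not part of the benchmark code.

```python
# Times in microseconds for 1024-byte packets, keyed by number of receivers.
OUTFARMING_1024 = {4: 513.9, 8: 1034.6, 16: 2228.1, 53: 7664.1}  # Com-1.5
MULTICAST_1024 = {4: 510.4, 8: 1007.9, 16: 2020.6, 53: 6694.2}   # Com-1.6


def outfarming_to_multicast_ratio(receivers):
    """How much longer distinct data takes than identical data."""
    return OUTFARMING_1024[receivers] / MULTICAST_1024[receivers]


def time_per_receiver_us(results, receivers):
    """Total wall clock time divided evenly over the receivers."""
    return results[receivers] / receivers


if __name__ == "__main__":
    for n in sorted(MULTICAST_1024):
        print(n,
              round(outfarming_to_multicast_ratio(n), 3),
              round(time_per_receiver_us(MULTICAST_1024, n), 1))
```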
Com-1.7 (funnel) 
Senders  Packet Size  Time (in microseconds) 
4 4 19.8 
4 64 76.3 
4 256 287.6 
4 1024 1132.4 
8 4 19.8 
8 64 76.3 
8 256 288.0 
8 1024 1133.2 
16 4 39.9 
16 64 152.6 
16 256 575.0 
16 1024 2264.7 
53 4 132.2 
53 64 499.6 
53 256 1898.8 
53 1024 7582.1 
Note 1.7: The time is the total wall clock time to receive one packet from EVERY source. There is a single virtual link to every source, and all of them are used in parallel on the destination. Slot 0 T9000s are not used, and wherever possible the destination is on a different motherboard from all sources. This was not possible for the 53-source measurements, where 4 sources were on the same motherboard as the destination. Note that the total number of processors is 'Senders' + 1.


Application Benchmarks

Com-2.1 (push farm with supervisor) 
Senders  Receivers  Time (in microseconds) 
2 2 41.13 
4 2 50.91 
6 2 60.28 
8 2 69.38 
16 2 107.52 
26 2 155.06 
2 4 40.88 
4 4 50.65 
6 4 60.01 
8 4 69.20 
16 4 107.37 
26 4 154.96 
2 6 40.80 
4 6 50.57 
6 6 59.93 
8 6 69.15 
16 6 107.33 
26 6 154.93 
2 8 40.76 
4 8 50.52 
6 8 59.91 
8 8 69.11 
16 8 107.29 
26 8 154.90 
2 16 40.70 
4 16 50.48 
6 16 59.84 
8 16 69.05 
16 16 107.23 
26 16 154.84 
2 26 36.78 
4 26 45.65 
6 26 54.38 
8 26 63.12 
16 26 97.90 
26 26 143.13 
Note 2.1: The total number of processors is 'Senders' + 'Receivers' + 1. The message size is always 16 bytes. The time reported is for a single event on the supervisor. The results improve for 26 destinations (the 'Receivers' column) because the supervisor resides on the same motherboard as some of the destinations when all 26 are in use.
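The Com-2.1 time per event grows steadily with the number of senders. As a rough illustration, the sketch below fits a straight line by ordinary least squares to the rows with 2 receivers and also encodes the processor-count rule from the note. The linear model is an assumption made here for illustration, not something the benchmark report claims, and all names are hypothetical.

```python
# Com-2.1 rows with 2 receivers: number of senders and time per event (microseconds).
SENDERS = [2, 4, 6, 8, 16, 26]
TIMES_US = [41.13, 50.91, 60.28, 69.38, 107.52, 155.06]


def total_processors(senders, receivers):
    """Senders plus receivers plus one supervisor, as stated in the note."""
    return senders + receivers + 1


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    slope, intercept = least_squares_line(SENDERS, TIMES_US)
    print("microseconds per extra sender:", round(slope, 2))
    print("fixed cost in microseconds:", round(intercept, 2))
    print("processors for 26 senders, 26 receivers:", total_processors(26, 26))
```

Under this assumed model, each additional sender adds a little under 5 microseconds to the time per event.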
Com-2.2 (pull farm with supervisor) 
Senders  Receivers  Time (in microseconds) 
2 2 51.85 
4 2 62.93 
6 2 73.85 
8 2 86.89 
16 2 127.40 
26 2 192.84 
2 4 51.70 
4 4 62.77 
6 4 73.76 
8 4 86.48 
16 4 127.26 
26 4 192.51 
2 6 51.62 
4 6 62.69 
6 6 73.71 
8 6 86.34 
16 6 127.22 
26 6 192.40 
2 8 51.59 
4 8 62.65 
6 8 73.69 
8 8 86.36 
16 8 127.19 
26 8 192.38 
2 16 51.54 
4 16 62.60 
6 16 73.64 
8 16 86.26 
16 16 127.12 
26 16 192.33 
2 26 50.78 
4 26 61.82 
6 26 72.90 
8 26 85.50 
16 26 126.39 
26 26 191.72 
Note 2.2: The total number of processors is 'Senders' + 'Receivers' + 1. The message size is always 16 bytes. The time reported is for a single event on the supervisor. The results improve for 26 destinations (the 'Receivers' column) because the supervisor resides on the same motherboard as some of the destinations when all 26 are in use.
 
Com-3.1 (active ROBs and push-farm)
Total Processors  Message length  Delay (microseconds)  Time (microseconds) 
3 64 0 37.1 
3 64 100 143.7 
3 64 400 490.2 
3 64 1600 1873.3 
3 256 0 159.8 
3 256 100 205.6 
3 256 400 511.7 
3 256 1600 1875.0 
3 1024 0 650.1 
3 1024 100 650.4 
3 1024 400 767.9 
3 1024 1600 1880.1 
5 64 0 58.2 
5 64 100 169.0 
5 64 400 518.1 
5 64 1600 1882.1 
5 256 0 219.0 
5 256 100 289.9 
5 256 400 607.6 
5 256 1600 1902.9 
5 1024 0 860.2 
5 1024 100 859.4 
5 1024 400 1023.9 
5 1024 1600 1959.9 
9 64 0 142.7 
9 64 100 239.6 
9 64 400 584.9 
9 64 1600 1841.5 
9 256 0 633.0 
9 256 100 669.0 
9 256 400 1024 
9 256 1600 2089.2 
9 1024 0 2593.6 
9 1024 100 2593.9 
9 1024 400 2687.9 
9 1024 1600 3148.2 
 
Com-3.2 (passive ROBs and pull farm) 
Total Processors  Delay (microseconds)  Time (in microseconds) 
3 0 470.3 
3 100 561.2 
3 400 767.87 
3 1600 1891.7 
5 0 938.9 
5 100 1029.4 
5 400 1279.7 
5 1600 2302.7 
9 0 2815.4 
9 100 2906.4 
9 400 3071.9 
9 1600 3920.2 