Appendix A:
Atlas Communication Benchmarks Report for the GPMIMD Machine
Further details of work on IEEE 1355 DS links can be found at the
CERN IEEE 1355 home page.
These benchmarks are described in "Communication
Benchmarks for Trigger Applications".
Configuration
Name | GPMIMD Machine |
Machine | 64 T9000 Transputer nodes using 58 C104 switches |
OS | NONE |
Compiler | SGS-Thomson, oc (occam) |
Options | -na -b -t9000 -h -y -GAMMAE -n |
Library | SGS-Thomson toolset libraries |
Time Measurement | Transputer clock (1 microsecond resolution) |
Reported by | Roger Heeley |
Benchmark version | 0.1 |
Date | August, 1997 |
Basic Benchmarks
NOTE: All nodes are 20 MHz, 16 Mbyte, GAMMA REVE03 HTRAMS, running with 8K
cache and 8K internal memory. All links run at 100 Mbits/s. The C104s are REV
BetaB02. All measurements use only one of the four independent T9000 networks.
The shortest message used is a single integer, i.e. 4 bytes on the T9000.
Com-1.1 (one-way)
Packet Size (bytes) | Time (in microseconds) |
4 | 6.6 |
64 | 16.9 |
256 | 56.9 |
1024 | 216.9 |
Note 1.1: Times are for the sender side. A single virtual link connects
the two processors. Processors are on the same motherboard, i.e. one C104
between them.
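As a reading aid, the Com-1.1 times can be converted into an effective sender-side throughput by dividing packet size by time. The sketch below does this in Python rather than in the occam used for the measurements; the function name `throughput_mbytes_per_s` is an illustrative assumption, and 1 Mbyte is taken as 10^6 bytes.

```python
# Com-1.1 (one-way) results copied from the table above:
# packet size in bytes -> sender-side time in microseconds.
COM_1_1 = {4: 6.6, 64: 16.9, 256: 56.9, 1024: 216.9}


def throughput_mbytes_per_s(size_bytes, time_us):
    """Effective throughput; bytes per microsecond equals 10^6 bytes per second."""
    return size_bytes / time_us


for size, time_us in sorted(COM_1_1.items()):
    rate = throughput_mbytes_per_s(size, time_us)
    print(f"{size:5d} bytes: {rate:.2f} Mbyte/s")
```

For 1024-byte packets this works out to roughly 4.7 Mbyte/s, while the 4-byte case stays below 1 Mbyte/s because the fixed per-message cost dominates.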
Com-1.2 (two-way)
Packet Size (bytes) | Time (in microseconds) |
4 | 4.9 |
64 | 10.1 |
256 | 33.6 |
1024 | 127.2 |
Note 1.2: Time is total wall clock time on sender divided by two. A
single virtual link is connected in each direction. Processors are on the
same motherboard, i.e. one C104 between them.
Com-1.3 (all-to-all)
Processors | Packet Size (bytes) | Time (in microseconds) |
4 | 4 | 17.2 |
4 | 64 | 42.7 |
4 | 256 | 146.9 |
4 | 1024 | 565.5 |
8 | 4 | 38.5 |
8 | 64 | 101.1 |
8 | 256 | 368.3 |
8 | 1024 | 1427.6 |
16 | 4 | 79.2 |
16 | 64 | 228.4 |
16 | 256 | 799.5 |
16 | 1024 | 3185.8 |
62 | 4 | 320.3 |
62 | 64 | 1707.3 |
62 | 256 | 6267.8 |
62 | 1024 | 23167.0 |
Note 1.3: Time is total wall clock time for all processors to send a
single packet to all other processors. Every processor has a virtual link
to every other processor and uses them all in parallel. All processors
in the machine are used (including slot 0s), except one for multiplexing
host I/O.
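Because every processor sends one packet to every other processor, an all-to-all run with N processors moves N x (N - 1) packets. The following Python sketch uses that count to turn a Com-1.3 time into an aggregate data rate; the function names are illustrative assumptions, and 1 Mbyte is taken as 10^6 bytes.

```python
def all_to_all_messages(processors):
    """Each processor sends one packet to every other processor."""
    return processors * (processors - 1)


def aggregate_mbytes_per_s(processors, size_bytes, time_us):
    """Total bytes moved divided by total wall clock time.

    Bytes per microsecond is numerically equal to 10^6 bytes per second.
    """
    return all_to_all_messages(processors) * size_bytes / time_us


# Largest Com-1.3 measurement: 62 processors, 1024-byte packets, 23167.0 us.
print(all_to_all_messages(62))
print(f"{aggregate_mbytes_per_s(62, 1024, 23167.0):.1f} Mbyte/s aggregate")
```

For 62 processors and 1024-byte packets this gives 3782 packets and roughly 167 Mbyte/s summed over the whole machine.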
Com-1.4 (pairs)
Total Processors | Packet Size (bytes) | Time (in microseconds) |
4 | 4 | 13.1 |
4 | 64 | 23.4 |
4 | 256 | 84.7 |
4 | 1024 | 329.7 |
8 | 4 | 13.3 |
8 | 64 | 23.6 |
8 | 256 | 84.9 |
8 | 1024 | 329.8 |
16 | 4 | 13.3 |
16 | 64 | 23.6 |
16 | 256 | 84.9 |
16 | 1024 | 330.1 |
52 | 4 | 13.3 |
52 | 64 | 23.6 |
52 | 256 | 85.0 |
52 | 1024 | 330.3 |
Note 1.4: Time is average wall clock time across all senders. One virtual
link between sources and destinations. Sources and destinations are always
on different motherboards, i.e. 3 C104s between them. The 8 slot 0 T9000s
are not used; at worst they have 5 C104s between them. When a single virtual
link is in use, the extra C104s greatly increase latency.
Com-1.5 (outfarming)
Receivers | Packet Size (bytes) | Time (in microseconds) |
4 | 4 | 13.5 |
4 | 64 | 30.6 |
4 | 256 | 127.3 |
4 | 1024 | 513.9 |
8 | 4 | 21.2 |
8 | 64 | 61.6 |
8 | 256 | 256.1 |
8 | 1024 | 1034.6 |
16 | 4 | 41.6 |
16 | 64 | 122.9 |
16 | 256 | 512.5 |
16 | 1024 | 2228.1 |
53 | 4 | 138.8 |
53 | 64 | 407.1 |
53 | 256 | 1853.0 |
53 | 1024 | 7664.1 |
Note 1.5: The time is the total wall clock time for one packet to EVERY
destination, with different data sent to each destination. There is a single
virtual link to every destination, and all of them are used in parallel on
the source. Slot 0 T9000s are not used, and wherever possible the source is
on a different motherboard to all destinations. This was not possible for the
53 destination measurements, where 4 destinations were on the same motherboard
as the source. Note that the total number of processors is 'Receivers' + 1.
Com-1.6 (multicast)
Receivers | Packet Size (bytes) | Time (in microseconds) |
4 | 4 | 13.2 |
4 | 64 | 30.4 |
4 | 256 | 126.3 |
4 | 1024 | 510.4 |
8 | 4 | 18.78 |
8 | 64 | 59.9 |
8 | 256 | 249.5 |
8 | 1024 | 1007.9 |
16 | 4 | 37.1 |
16 | 64 | 119.8 |
16 | 256 | 500.0 |
16 | 1024 | 2020.6 |
53 | 4 | 122.8 |
53 | 64 | 397.0 |
53 | 256 | 1656.5 |
53 | 1024 | 6694.2 |
Note 1.6: The time is the total wall clock time for one packet to EVERY
destination. Here the same data is sent to all destinations. There is a single
virtual link to every destination, and all of them are used in parallel on
the source. Slot 0 T9000s are not used, and wherever possible the source is
on a different motherboard to all destinations. This was not possible for the
53 destination measurements, where 4 destinations were on the same motherboard
as the source. Note that the total number of processors is 'Receivers' + 1.
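Com-1.5 and Com-1.6 differ only in whether each destination receives different data or the same data, so the two tables can be compared row by row. The Python sketch below does this for the 1024-byte rows; the helper name `saving_percent` is an illustrative assumption, and the figures are copied from the two tables.

```python
# 1024-byte results copied from the Com-1.5 and Com-1.6 tables:
# receivers -> time in microseconds.
OUTFARMING_1024 = {4: 513.9, 8: 1034.6, 16: 2228.1, 53: 7664.1}
MULTICAST_1024 = {4: 510.4, 8: 1007.9, 16: 2020.6, 53: 6694.2}


def saving_percent(outfarming_us, multicast_us):
    """Relative time saved by multicast compared with outfarming, in percent."""
    return 100.0 * (outfarming_us - multicast_us) / outfarming_us


for receivers in sorted(OUTFARMING_1024):
    saving = saving_percent(OUTFARMING_1024[receivers], MULTICAST_1024[receivers])
    print(f"{receivers:2d} receivers: multicast takes {saving:.1f}% less time")
```

With these figures the saving grows with the number of receivers, from under 1% at 4 receivers to about 12.7% at 53.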
Com-1.7 (funnel)
Senders | Packet Size (bytes) | Time (in microseconds) |
4 | 4 | 19.8 |
4 | 64 | 76.3 |
4 | 256 | 287.6 |
4 | 1024 | 1132.4 |
8 | 4 | 19.8 |
8 | 64 | 76.3 |
8 | 256 | 288.0 |
8 | 1024 | 1133.2 |
16 | 4 | 39.9 |
16 | 64 | 152.6 |
16 | 256 | 575.0 |
16 | 1024 | 2264.7 |
53 | 4 | 132.2 |
53 | 64 | 499.6 |
53 | 256 | 1898.8 |
53 | 1024 | 7582.1 |
Note 1.7: The time is the total wall clock time for one packet from
EVERY source. There is a single virtual link to every source, and all of
them are used in parallel on the destination. Slot 0 T9000s are not used,
and wherever possible the destination is on a different motherboard to all
sources. This was not possible for the 53 source measurements, where 4
sources were on the same motherboard as the destination. Note that the total
number of processors is 'Senders' + 1.
Application Benchmarks
Com-2.1 (push farm with supervisor)
Senders | Receivers | Time (in microseconds) |
2 | 2 | 41.13 |
4 | 2 | 50.91 |
6 | 2 | 60.28 |
8 | 2 | 69.38 |
16 | 2 | 107.52 |
26 | 2 | 155.06 |
2 | 4 | 40.88 |
4 | 4 | 50.65 |
6 | 4 | 60.01 |
8 | 4 | 69.20 |
16 | 4 | 107.37 |
26 | 4 | 154.96 |
2 | 6 | 40.80 |
4 | 6 | 50.57 |
6 | 6 | 59.93 |
8 | 6 | 69.15 |
16 | 6 | 107.33 |
26 | 6 | 154.93 |
2 | 8 | 40.76 |
4 | 8 | 50.52 |
6 | 8 | 59.91 |
8 | 8 | 69.11 |
16 | 8 | 107.29 |
26 | 8 | 154.90 |
2 | 16 | 40.70 |
4 | 16 | 50.48 |
6 | 16 | 59.84 |
8 | 16 | 69.05 |
16 | 16 | 107.23 |
26 | 16 | 154.84 |
2 | 26 | 36.78 |
4 | 26 | 45.65 |
6 | 26 | 54.38 |
8 | 26 | 63.12 |
16 | 26 | 97.90 |
26 | 26 | 143.13 |
Note that the total number of processors is 'Senders' + 'Receivers'
+ 1. Message size is always 16 bytes. The time reported is for a single
event on the supervisor. The results improve for 26 destinations; this
is due to the supervisor residing on the same motherboard as some destinations
when all 26 destinations are in use.
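The Com-2.1 times depend mainly on the number of senders and only slightly on the number of receivers. As a rough illustration, the Python sketch below estimates the average cost of each additional sender from the 2-receiver rows; the helper name `cost_per_extra_sender` is an illustrative assumption, and a simple two-point slope is used rather than a fitted model.

```python
# Com-2.1 results for 2 receivers, copied from the table above:
# (senders, time in microseconds).
COM_2_1_TWO_RECEIVERS = [(2, 41.13), (4, 50.91), (6, 60.28),
                         (8, 69.38), (16, 107.52), (26, 155.06)]


def cost_per_extra_sender(points):
    """Average increase in time per additional sender, first to last point."""
    (first_n, first_t), (last_n, last_t) = points[0], points[-1]
    return (last_t - first_t) / (last_n - first_n)


slope = cost_per_extra_sender(COM_2_1_TWO_RECEIVERS)
print(f"{slope:.2f} microseconds per extra sender")
```

For the 2-receiver rows this comes to about 4.7 microseconds per additional sender.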
Com-2.2 (pull farm with supervisor)
Senders | Receivers | Time (in microseconds) |
2 | 2 | 51.85 |
4 | 2 | 62.93 |
6 | 2 | 73.85 |
8 | 2 | 86.89 |
16 | 2 | 127.40 |
26 | 2 | 192.84 |
2 | 4 | 51.70 |
4 | 4 | 62.77 |
6 | 4 | 73.76 |
8 | 4 | 86.48 |
16 | 4 | 127.26 |
26 | 4 | 192.51 |
2 | 6 | 51.62 |
4 | 6 | 62.69 |
6 | 6 | 73.71 |
8 | 6 | 86.34 |
16 | 6 | 127.22 |
26 | 6 | 192.40 |
2 | 8 | 51.59 |
4 | 8 | 62.65 |
6 | 8 | 73.69 |
8 | 8 | 86.36 |
16 | 8 | 127.19 |
26 | 8 | 192.38 |
2 | 16 | 51.54 |
4 | 16 | 62.60 |
6 | 16 | 73.64 |
8 | 16 | 86.26 |
16 | 16 | 127.12 |
26 | 16 | 192.33 |
2 | 26 | 50.78 |
4 | 26 | 61.82 |
6 | 26 | 72.90 |
8 | 26 | 85.50 |
16 | 26 | 126.39 |
26 | 26 | 191.72 |
Note that the total number of processors is 'Senders' + 'Receivers'
+ 1. Message size is always 16 bytes. The time reported is for a
single event on the supervisor. The results improve for 26 destinations;
this is due to the supervisor residing on the same motherboard as some
destinations when all 26 destinations are in use.
Com-3.1 (active ROBs and push-farm)
Total Processors | Message length | Delay (microseconds) | Time (microseconds) |
3 | 64 | 0 | 37.1 |
3 | 64 | 100 | 143.7 |
3 | 64 | 400 | 490.2 |
3 | 64 | 1600 | 1873.3 |
3 | 256 | 0 | 159.8 |
3 | 256 | 100 | 205.6 |
3 | 256 | 400 | 511.7 |
3 | 256 | 1600 | 1875.0 |
3 | 1024 | 0 | 650.1 |
3 | 1024 | 100 | 650.4 |
3 | 1024 | 400 | 767.9 |
3 | 1024 | 1600 | 1880.1 |
5 | 64 | 0 | 58.2 |
5 | 64 | 100 | 169.0 |
5 | 64 | 400 | 518.1 |
5 | 64 | 1600 | 1882.1 |
5 | 256 | 0 | 219.0 |
5 | 256 | 100 | 289.9 |
5 | 256 | 400 | 607.6 |
5 | 256 | 1600 | 1902.9 |
5 | 1024 | 0 | 860.2 |
5 | 1024 | 100 | 859.4 |
5 | 1024 | 400 | 1023.9 |
5 | 1024 | 1600 | 1959.9 |
9 | 64 | 0 | 142.7 |
9 | 64 | 100 | 239.6 |
9 | 64 | 400 | 584.9 |
9 | 64 | 1600 | 1841.5 |
9 | 256 | 0 | 633.0 |
9 | 256 | 100 | 669.0 |
9 | 256 | 400 | 1024 |
9 | 256 | 1600 | 2089.2 |
9 | 1024 | 0 | 2593.6 |
9 | 1024 | 100 | 2593.9 |
9 | 1024 | 400 | 2687.9 |
9 | 1024 | 1600 | 3148.2 |
Com-3.2 (passive ROBs and pull farm)
Total Processors | Delay (microseconds) | Time (in microseconds) |
3 | 0 | 470.3 |
3 | 100 | 561.2 |
3 | 400 | 767.87 |
3 | 1600 | 1891.7 |
5 | 0 | 938.9 |
5 | 100 | 1029.4 |
5 | 400 | 1279.7 |
5 | 1600 | 2302.7 |
9 | 0 | 2815.4 |
9 | 100 | 2906.4 |
9 | 400 | 3071.9 |
9 | 1600 | 3920.2 |