Performance tests JGroups 2.6.4


Author: Bela Ban
Date: August 2008

Configuration


The tests were conducted in the JBoss lab in Atlanta. Each box was:


The tests were run only on 64-bit systems.

Switches

The following switch was used:

Network

Note that neither the network nor the test machines were reserved exclusively for the performance tests, so some of
the fluctuations are explained by other users or processes using the machines and/or generating traffic on the network. This was mitigated somewhat by running all tests during the European morning and early afternoon (8am - 3pm), when the servers in
Atlanta, GA were mostly idle. We also used machines which have no cron jobs (such as cruisecontrol) scheduled that could interfere with the tests.


JVM configuration

The following JVMs were used:

The options used for starting the JVMs were:
-server -Xmx600M -Xms400M -XX:+UseParallelGC -XX:+AggressiveHeap -XX:CompileThreshold=100 -XX:SurvivorRatio=8 -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=31 -Dlog4j.configuration=file:/home/bela/log4j.properties -Djgroups.bind_addr=${MYTESTIP_1} -Dcom.sun.management.jmxremote -Dresolve.dns=false


Design of the tests

The goal of the test is to measure the time it takes to reliably send N messages to all nodes in a cluster. (We are currently not interested in measuring remote group RPCs; this may be done in a future performance test.) The test works as follows.

The test is part of JGroups: org.jgroups.tests.perf.Test. The driver is JGroups-independent and can also be used to measure JMS performance or raw UDP/TCP performance; all that needs to be done is to write an implementation of the Transport interface, with a send() and a receive() method.
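
For illustration only, a minimal transport might look like the sketch below. The class and method signatures here are assumptions made for this example and are not copied from the actual org.jgroups.tests.perf.Transport interface; see the JGroups source for the real contract.

// Illustrative sketch: signatures are assumed, not taken from the real
// org.jgroups.tests.perf.Transport interface.
public class LoopbackTransport {

    // hypothetical callback the test driver would register to get messages
    public interface Receiver {
        void receive(byte[] payload);
    }

    private Receiver receiver;

    public void setReceiver(Receiver r) {
        this.receiver = r;
    }

    // send(): a real implementation would put the payload on the wire
    // (UDP datagram, TCP connection, JGroups channel, JMS topic, ...)
    public void send(byte[] payload) {
        // loopback "network": hand the payload straight back to the receiver
        if (receiver != null)
            receiver.receive(payload);
    }
}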

The configuration file (config.txt) is shown below (only the JGroups configuration is shown):

# Class implementing the org.jgroups.tests.perf.Transport interface

transport=org.jgroups.tests.perf.transports.JGroupsTransport
#transport=org.jgroups.tests.perf.transports.UdpTransport
#transport=org.jgroups.tests.perf.transports.TcpTransport

# Number of messages a sender sends
num_msgs=1000000

# Message size in bytes
msg_size=1000

# Expected number of group members
num_members=2

# Number of senders in the group. Min 1, max num_members
num_senders=2

# dump stats every n msgs
log_interval=100000



This file must be the same for all nodes, so it is suggested to place it on a shared file system, e.g. an NFS-mounted directory.
The following parameters are used:

transport

An implementation of Transport

num_msgs

Number of messages to be sent by a sender

msg_size

Number of bytes of a single message

num_members

Number of members. When members are started, they wait until num_members nodes have joined the cluster, and then the senders start sending messages

num_senders

Number of senders (must be less than or equal to num_members). This allows each receiver to compute the total number of messages to be received: num_senders * num_msgs. In the example, each receiver will receive 2 million 1K messages (see the sketch after this list)

log_interval

Statistics on sent and received messages are written to stdout (and to a file if -f is used, see below) every log_interval messages

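Since config.txt is a plain key=value file, it can be read with java.util.Properties. The following sketch is not part of the test driver; it merely illustrates how the expected total per receiver follows from num_senders and num_msgs (the file path is an example):

import java.io.FileInputStream;
import java.util.Properties;

public class ConfigExample {
    public static void main(String[] args) throws Exception {
        Properties config = new Properties();
        // '#' lines in config.txt are treated as comments by Properties.load()
        config.load(new FileInputStream("/home/bela/config.txt"));

        long numMsgs    = Long.parseLong(config.getProperty("num_msgs"));
        int  numSenders = Integer.parseInt(config.getProperty("num_senders"));

        // every receiver expects num_senders * num_msgs messages in total
        long expected = numSenders * numMsgs;
        System.out.println("Each receiver expects " + expected + " messages");
    }
}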

The options for the test driver are listed below (an example invocation follows the option descriptions):
bela@laptop /cygdrive/c
$ java org.jgroups.tests.perf.Test -help
Test [-help] ([-sender] | [-receiver]) [-config <config file>] [-props <stack config>] [-verbose] [-jmx] [-dump_stats] [-f <filename>]

-sender / -receiver

Whether this process is a sender or a receiver (a sender is always a receiver as well)

-config <file>

Points to the configuration file, e.g. -config /home/bela/config.txt

-props <props>

The JGroups protocol stack configuration. Example: -props c:\fc-fast-minimalthreads.xml. Can be any URL

-verbose

Verbose output

-jmx

Enables JMX instrumentation (requires a JVM with a JMX MBeanServer, e.g. JDK 5). This will cause the VM not to terminate when done. To access the process via JMX, the -Dcom.sun.management.jmxremote system property has to be defined; jconsole can then be used. For more details see http://wiki.jboss.org/wiki/Wiki.jsp?page=JMX.

-dump_stats

Dumps some JMX statistics after the run, e.g. number of messages sent, number of times blocked, etc.

-f <filename>

Dumps the number of messages sent, message rate, throughput, current time, and free and total memory to <filename> every log_interval milliseconds. This is the main tool to generate charts on memory behavior or message rate variance

-help

This help

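With the sample config.txt shown earlier (num_members=2, num_senders=2), a two node test could be started as follows. The bind addresses, file locations and stack file are illustrative; any of the shipped stack configs (e.g. JGroups/conf/udp.xml) can be used:

# node 1
java -Djgroups.bind_addr=192.168.1.1 org.jgroups.tests.perf.Test -sender \
     -config /home/bela/config.txt -props /home/bela/udp.xml

# node 2
java -Djgroups.bind_addr=192.168.1.2 org.jgroups.tests.perf.Test -sender \
     -config /home/bela/config.txt -props /home/bela/udp.xml -f node2.dat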

The files needed to run the tests are included with JGroups (source distribution) as well: JGroups/conf/config.txt is the test configuration file, and JGroups/conf/{udp,tcp}.xml are the JGroups protocol stack config files. Note that all the test runs will have the JGroups stack configuration file included so that they can be reproduced.

For additional details on how to run the tests refer to http://wiki.jboss.org/wiki/Wiki.jsp?page=PerfTests.


Parameters

The parameters that are varied are:



Results

The plain results are available for the 4 node, 6 node and 8 node clusters, and the OpenOffice spreadsheet containing all results can also be obtained.
The performance tests were run on the cluster, measuring message rate (messages received/sec) and throughput (MB received/sec), as described in http://wiki.jboss.org/wiki/Wiki.jsp?page=PerfTests.
The JGroups configurations used are:



Message rate and throughput

The Y axes define message rate and throughput, and the X axis defines the message size. The message rate is computed as <number of messages sent> * <number of senders> / <time to receive all messages>. For example, a message rate of 35000 for 1K means that a receiver received 35000 1K messages per second on average. On average means that, at the end of a test run, all nodes post their results, which are collected, summed up and divided by the number of nodes. So if we have 2 nodes, and they post message rates of 10000 and 20000, the average message rate will be 15000.
The throughput is simply the average message rate times the message size.
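To make the relationship concrete with the figures above: an average message rate of 35,000 1K messages per second corresponds to a per-node throughput of roughly 35,000 * 1,000 bytes = 35 MB/sec.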


4 node cluster




6 node cluster




8 node cluster





10 node cluster





Scalability per node

Scalability per cluster

Interpretation of results

General

All of the tests were conducted with a relatively conservative configuration which allowed the tests to run for an extended period of time. For example, the number of credits in flow control was relatively small. Had we set them to a higher value, we would probably have gotten better results, but the tests would not have been able to run for an extended period of time without running out of memory. For instance, the number of credits should be a function of the group size (more credits for more nodes in the cluster); however, we used (more or less) the same configuration for all tests. We will introduce an enhanced version of the flow control protocol which dynamically adapts credits, e.g. based on free memory, loss rate etc, in http://jira.jboss.com/jira/browse/JGRP-2.
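
For illustration, the credits are defined in the protocol stack configuration (the XML file passed via -props). A fragment could look like the following; the attribute values are illustrative only, and the exact set of attributes depends on the JGroups version:

<FC max_credits="2000000"
    min_threshold="0.10"/>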

Compared to the previous tests (2007), the message rates and throughput for UDP are now considerably higher than before, in most cases even higher than for the TCP based configuration. This is caused by performance modifications in JGroups and by setting net.core.rmem_max to 20000000 (from 131'000). The latter change allows the UDP stack to buffer more datagrams, triggering fewer retransmissions and therefore increasing performance. Note that https://jira.jboss.org/jira/browse/JGRP-806 certainly also made a big difference.

Note that for this test, for administrative reasons, we could not make use of jumbo frames, and the MTU was 1500 instead of the 9000 used for the 2007 test. This likely affects throughput for larger messages.

The throughput of UDP has little variance for 1K messages and ranges from 56 MB/sec on 4 nodes to 51 MB/sec on 10 nodes. This is significantly better than TCP, which starts at 49 MB/sec on 4 nodes and drops off to 34 MB/sec on 10 nodes. We can also see that the drop in throughput for UDP is worse for 2.5K messages (from 94 MB/sec on 4 nodes down to 65 MB/sec on 10 nodes) and for 5K messages (from 107 MB/sec on 4 nodes down to 77 MB/sec on 10 nodes). We attribute this to the fact that jumbo frames were not used: large UDP datagrams are fragmented into more IP packets (that need to get retransmitted on a single loss) with an MTU of 1500 than with 9000. TCP doesn't have this issue as it generates IP packets which stay below the MTU size.
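
As a rough illustration of the fragmentation effect: with an MTU of 1500, a 5K message ends up as a UDP datagram that is split into 4 IP fragments of roughly 1,480 payload bytes each, and the loss of any single fragment means the entire datagram has to be retransmitted; with an MTU of 9000 the same datagram fits into a single frame.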


Scalability

As can be seen in “Scalability per node”, message rates and throughput degrade slightly as we increase the cluster size. Reasons could be that the switch starts dropping packets when overloaded or flow control (FC) has too few credits (which are statically set, but should be allocated as a function of the cluster size). This will be investigated in one of the next releases.

Note that all numbers for message rates and throughput are shown per node, i.e. the number of MBs a single node receives on average per second. If we aggregate those numbers and show the total traffic over the entire cluster, we can see (in “Scalability per cluster”) that we scale pretty well, although not 100% linearly. The figure shows that we can deliver ca. 500'000 1K messages over a cluster of 10 with UDP, and achieve an aggregate cluster-wide throughput of over 800 MB/sec.
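
As a sanity check using the per-node figures above: roughly 51 MB/sec per node for 1K messages corresponds to about 51,000 messages per second per node, which aggregated over 10 nodes gives approximately 500,000 messages per second cluster-wide.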

Processing of messages and scalability

The test program currently doesn't do any processing at the receiver when a message is received. Instead, a message counter is increased and the message size is added to the total number of bytes received so far. This is of course an ideal situation, and not realistic for normal use, where receivers might unmarshal the message's byte buffer into something meaningful (e.g. an HTTP session modification) and process it (e.g. by updating an HTTP session). (The test does, by the way, allow processing time at the receiver to be taken into account by setting the processing_delay value.)
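
For example, a fixed per-message processing time could be simulated by adding a line such as the following to config.txt. The property name is taken from the text above, but the value and its unit (assumed here to be milliseconds) are illustrative:

# simulated processing time per received message (assumed unit: milliseconds)
processing_delay=10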



Fluctuation of results (reproducibility)

Compared to the 2.4 performance tests, we didn't run the tests multiple times. Instead, we ran each test only once, so the numbers shown are not averaged over multiple runs of a given test. We also didn't adjust the JGroups configs for the different cluster sizes. For example, the number of credits in flow control (FC and SFC) should have been increased with increasing cluster size. However, we wanted to see whether we get reasonable numbers for the same configs run on different clusters.

If you run the perf tests yourself and get even better numbers, we'd be interested to hear from you!

Conclusion and outlook

We showed that performance of JGroups is good for clusters ranging from 4 to 10 nodes. Compared to the 2007 tests, the UDP stack now has better performance than the TCP stack.