Throughput Performance: Comparing Apples to Apples

If you’re serious about the performance of your distributed system, you probably read with interest the performance claims made by network middleware vendors. And if you’re a network middleware vendor, you’ve probably published your share of performance claims. (RTI has comprehensive performance numbers available for both our DDS and JMS APIs.) But in order to know which claims are meaningful — and more importantly, which are useful to you — it’s important to understand what you’re reading. In the words of one of my coworkers, “many apples are compared to rhinoceroses.”

First of all, there are three primary axes along which people tend to measure network performance:

  • Latency: end-to-end, round-trip, and loaded latency; the amount of time it takes to send a certain amount of data from somewhere to somewhere else under various conditions
  • Latency jitter: the amount of variation in latency measurements
  • Throughput: the number of data quanta (either raw bytes or fixed-size samples/messages) transmitted over a given amount of time

The whitepaper “The Data-Centric Future” on the RTI website has a good overview of general performance considerations, so I won’t reiterate that material here. What I’ll focus on today is throughput, and specifically some of the subtleties you’ll want to keep in mind when you’re evaluating a networking middleware product.

Size Matters.
The size of the data samples you send in a throughput test has a huge impact on the throughput you measure. At small data sizes (dozens to hundreds of bytes), performance is dominated by the expense of traversing the layers of software in between your application and the network: your middleware, if any; your operating system’s network stack, including any system calls; and your network driver. As the data size increases, these fixed costs become less significant relative to the cost of actually copying the data (including the “copy” of the data across the network).

This difference in the throughput profile among different data sizes means that if you’re reading a performance report, and you see “bytes” but your application deals with “samples” or “messages,” it’s important to understand the sample size(s) used to generate that report. It’s not sound to generate some data with one sample size, add up the total number of bytes, and then divide by another sample size, because you’re not correctly accounting for per-sample constant factors.
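To make that concrete, here is a toy cost model of a single send path. It is purely illustrative: the 12-microsecond per-send overhead and the per-byte copy cost below are made-up round numbers, not measurements of any product, but the shape of the resulting curve is the point.

    #include <cstdio>

    // Toy throughput model: every sample pays a fixed per-send cost (stack
    // traversal, system calls, middleware bookkeeping) plus a per-byte copy
    // cost. The constants are illustrative assumptions, not measured values.
    int main() {
        const double fixed_cost_us    = 12.0;  // assumed cost per send, in microseconds
        const double per_byte_cost_us = 0.01;  // assumed copy cost per byte

        const int sizes[] = {16, 64, 256, 1024, 8192, 65536};
        for (int size : sizes) {
            double us_per_sample   = fixed_cost_us + per_byte_cost_us * size;
            double samples_per_sec = 1e6 / us_per_sample;
            double mbytes_per_sec  = samples_per_sec * size / 1e6;
            std::printf("%6d-byte samples: %9.0f samples/s  (%6.1f MB/s)\n",
                        size, samples_per_sec, mbytes_per_sec);
        }
        return 0;
    }

With numbers like these, the sample rate barely changes between 16 and 256 bytes (the fixed cost dominates), while the byte rate keeps climbing with size until the copy cost takes over. That is also why you cannot convert between sample sizes with simple division, as the vendor story below illustrates.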

One vendor — an enterprise software vendor newly entering the messaging market, who shall remain nameless — made a series of throughput measurements for 256-byte samples. This vendor then declared that what their customers really cared about was 16-byte samples, and so immediately multiplied their measurements by 16 (= 256 / 16) and published those extrapolated results! Depending on how they were planning on sending those 16-byte samples, they were assuming either that (a) the cost of performing 16x the number of network sends is zero or (b) the cost of packing and unpacking 16-byte samples into and out of 256-byte chunks is zero. Of course, both of these costs are emphatically non-zero. But that brings me to my next point:

Understand data batching.
Because of the high fixed cost of a network send, especially relative to the cost of copying a small data sample, it is a common practice to batch multiple data samples together and send them together as a unit.

Suppose you’re publishing 64-byte samples. Remember that each time you send a packet on the network, you’re also sending a couple dozen bytes of IP header data and whatever meta-data your middleware requires. That adds up to a 30-100% space penalty — added to the time penalty discussed above. But if you can amortize these costs over many samples, they become much less important. In fact, batching data can increase your effective throughput by more than an order of magnitude in some cases.

In our experience, you can send 50,000-80,000 smallish packets per second using commodity OS, computing, and networking components. When you see samples-per-second-style throughput numbers much higher than that, it means that those samples are being batched under the hood.
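Here is a rough sketch of that amortization. The 28-byte UDP/IP header is real; the 64-byte sample size, the per-sample middleware meta-data, the 60,000-sends-per-second budget, and the gigabit link speed are illustrative assumptions of mine, not measurements of any particular product.

    #include <algorithm>
    #include <cstdio>

    // Sketch of how batching amortizes per-packet overhead. The constants are
    // illustrative assumptions; only the 28-byte UDP/IP header is a given.
    int main() {
        const int    sample_size        = 64;     // application payload per sample
        const int    udp_ip_header      = 28;     // 8-byte UDP + 20-byte IP header
        const int    meta_per_sample    = 16;     // assumed middleware meta-data per sample
        const double max_sends_per_sec  = 60000;  // assumed packet-rate budget (see above)
        const double link_bytes_per_sec = 125e6;  // 1 Gbit/s expressed in bytes

        const int batch_sizes[] = {1, 8, 32, 128};
        for (int batch : batch_sizes) {
            int    wire_bytes   = udp_ip_header + batch * (sample_size + meta_per_sample);
            double overhead_pct = 100.0 * (wire_bytes - batch * sample_size)
                                        / (batch * sample_size);
            double packets_per_sec = std::min(max_sends_per_sec,
                                              link_bytes_per_sec / wire_bytes);
            std::printf("batch=%3d: %5d bytes/packet, %5.1f%% overhead, %9.0f samples/s\n",
                        batch, wire_bytes, overhead_pct, packets_per_sec * batch);
        }
        return 0;
    }

With these assumed numbers, going from one sample per packet to 32 drops the space penalty from roughly 69% to about 26% and raises the effective rate from 60,000 to about 1.5 million samples per second, at which point the gigabit link, rather than the packet rate, becomes the limit.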

Note that data batching is an intrinsic part of the TCP protocol, so any middleware implementation that relies on TCP batches data all the time.

Differentiate between one-to-n and aggregate throughput.
There are two ways to look at throughput: an application-centric view and a network-centric view. Which one of these a given community cares about governs which one gets measured and reported by vendors that market into that community. That means you need to be aware of what you’re reading when you see “n samples per second,” especially when dealing with a new vendor.

  • Applications typically care about the number of samples that can be sent/received to/from particular destinations using particular data producing and consuming objects. For example, suppose I’m publishing sensor data. I know that my device has a new value available 10,000 times every second. If I try to send that much data, will it work? This viewpoint is relevant to applications with individual data streams that place a significant burden on the network all by themselves. Streaming media, high-rate sensor data, real-time command and control, and other similar domains are in this category. Throughput data is typically reported from one (the data producer) to n (the number of data consumers).
  • Other systems take a network-centric view, measuring the total number of samples in flight across an entire system. This view is relevant when both of the following are true: (a) individual data streams may not be demanding by themselves, but there are many of them, and (b) all of those streams have a common choke point. Enterprise integration and web services often fall into this category, as services are invoked on human time scales, and middleware implementations typically include central message broker components. Network-centric throughput is typically reported in aggregate, from n (producers) to m (consumers), where all (n + m) entities bottleneck through a common broker. The goal of the test, in such a case, is to measure the limits of the broker itself, not of the applications that use it.

When you’re evaluating a throughput claim, be sure you know which one of these scenarios you’re looking at! I can tell you that there was a flurry of activity at RTI when a competing vendor started touting 6 million samples per second — until we read further into that vendor’s testing methodology and discovered that the result was an aggregate across 60 applications. For the record, the throughput numbers you will find on RTI’s website — showing over 1 million samples per second — are measured 1-to-n. That’s one publisher on one box publishing to one destination.
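To put numbers on that: if those 6 million samples per second were spread evenly across the 60 applications (my assumption; the vendor’s breakdown may differ), each individual stream would be carrying only 6,000,000 / 60 = 100,000 samples per second, an order of magnitude below the 1-to-n figure.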

Understand the architecture.
To me, 1-to-n numbers are the honest numbers when you have a peer-to-peer solution, as RTI does. That’s because, assuming your switch can keep up, there’s nothing to test other than the so-called “client” applications themselves. We can saturate a gigabit link for data sizes not much over 100 bytes, and come close even for very small sizes. Do you have more data to send than can fit over a single link? Then add another link. At that point, you’re no longer testing the performance of the RTI infrastructure; you’re testing the performance of your switch.
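A quick sanity check on that claim: gigabit Ethernet carries roughly 125 million bytes per second. If a batched sample a little over 100 bytes costs somewhere around 120-125 bytes on the wire once per-sample meta-data is included (my rough estimate), the link tops out at about 125,000,000 / 125 ≈ 1,000,000 samples per second, which lines up with the 1-to-n numbers quoted above.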

Knowledge is Power, or, Forewarned is Forearmed.
Now, hopefully, you have a better understanding of how throughput performance data is measured and reported, what to expect from that data, and what to look out for. You’re ready to enter the wide and wild world of performance evaluation. (Care to try your own hand? Pick up a copy of RTI Data Distribution Service or RTI Message Service and run the comprehensive performance test you can find in RTI’s Knowledge Base — search for “Example Performance Test.”)

Of course, there’s a lot more to network performance besides throughput. No doubt I (and/or someone else) will be returning to talk about latency, loaded latency, jitter, competitive analysis, and other topics in the future — stay tuned.

2 comments

  1. I am conducting some performance tests on DDS with a toy application. I am trying to understand your statement above about 1 million samples being sent from 1 publisher to 1 subscriber, since I am not getting anywhere near that, and other information I found indicates that a rate of about 25,000 samples (UDP packets) is more typical for a point-to-point test environment (two PCs running RT Linux connected by 100 Mb Ethernet; each PC has a single-core 1 GHz processor). I have two nodes which circulate one sample around a two-node network. I have a publisher and subscriber on each node. One publisher sends Topic1 to the remote subscriber, which turns it around and publishes Topic2 to the other node’s subscriber, which turns it back to Topic1, etc. I know this does not match the test you describe above, but it is more realistic for our environment than one where one node does nothing but publish samples and another receives.

     If I am interpreting your statement of 1 million samples per second as 1 million UDP packets, that would mean it takes 1 microsecond to run up or down the TCP/IP stack. This is much faster than anything I have measured. Is my interpretation of your data correct?


  2. Hi George,
    Thanks for your question.

    The message-rate throughput numbers we post at http://www.rti.com/products/dds/benchmarks-cpp-linux.html, to which I was referring, are measured in units of DDS samples per second. (A “sample” is the object you pass to the DDS write() call.) You’re correct that a packet send rate in the tens-of-thousands is typical. However, a DDS sample may or may not correspond one-to-one to a UDP packet. For example:

    –> If you’ve configured RTI Data Distribution Service to batch data, many samples will be combined into a single UDP packet to reduce the number of trips through the IP stack. This is the configuration we use to generate our published results.

    –> If your sample size is greater than the maximum UDP datagram size (64 KB, or smaller on some embedded OSes), RTI Data Distribution Service will fragment a single logical sample over multiple UDP packets and then reassemble them on the receiving side.

    Given that, there are a number of factors that can explain the differences between our published performance numbers and what you’re seeing.

     1. The platform and networks are different. As explained on our performance results page (URL above), as of this writing, we generate our published numbers using 2.4 GHz dual-core machines connected to a gigabit ethernet network. That hardware is significantly more capable than what you’ve indicated you’re using. In particular, your 100-Mbit network has 10x less capacity than our testing environment, so you can expect a theoretical maximum throughput closer to 100K samples/sec than to 1M samples/sec.

    2. The batching configuration may be different. By default, RTI Data Distribution Service does not batch samples, because doing so can impact latency. Based on the results you’re seeing, you may not have altered this default setting. Look in your code or XML QoS profile file for the BatchQosPolicy, which is configured on the DataWriter entity. If you’re not familiar with this policy, you can find more information in the API documentation that’s included in your product distribution.

    3. The sample sizes may be different. Note on the graph how the sample rate falls off as the sample size increases. At very small sizes, this reflects the increasing cost of copying larger amounts of data. Once the network becomes saturated, it reflects the purely mathematical relationship (number of samples) ~= (network capacity) / (sample size).

     There are a couple of things you can do to address #2 and #3 above.

    –> If you’d like to run our test configurations on your hardware directly, you can download our example performance test. Visit our Knowledge Base at https://www.rti.com/kb/index.html, click “Performance,” then click “Example Performance Test for RTI Data Distribution Service.” This is the exact code we use to generate our published numbers.

    –> Even easier: if you’re using a recent version of RTI Data Distribution Service, you can find an example QoS profile configuration in your product distribution. Look in example/QoS/high_throughput.xml.
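     For what it’s worth, the batching-related portion of the DataWriter QoS looks roughly like the sketch below in the classic C++ API. I’m writing the member names from memory, and the setup code (participant, topic, publisher) is omitted, so please treat this as a sketch and double-check it against the API reference for your version:

         // Sketch only: enable batching on a DataWriter (classic C++ API).
         // Member names are from memory; the values are examples, not tuned
         // recommendations. Participant/topic/publisher creation is omitted.
         DDS_DataWriterQos writer_qos;
         publisher->get_default_datawriter_qos(writer_qos);   // 'publisher' assumed to exist

         writer_qos.batch.enable         = DDS_BOOLEAN_TRUE;  // batching is off by default
         writer_qos.batch.max_samples    = 32;                // flush after this many samples...
         writer_qos.batch.max_data_bytes = 8192;              // ...or this many payload bytes

         DDSDataWriter *writer = publisher->create_datawriter(
             topic /* assumed to exist */, writer_qos, NULL, DDS_STATUS_MASK_NONE);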

    I hope this was helpful.
    Best regards,
    – Rick

