Thursday, April 03, 2008

Shared Data Considered Harmful

One of the big reasons people have been interested in JRuby lately is that it's currently the only Ruby implementation that can both host multiple instances in a single process and run threads *actually* in parallel. Ruby 1.8 is entirely green-threaded. Rubinius has support for multiple parallel in-process VMs, but each one is still green-threaded. Ruby 1.9 uses native threads, but they're almost always blocking on a giant lock. IronRuby uses native CLR threads, but can't host multiple instances in a single process. JRuby FTW.

Of course true parallel execution is not without its pitfalls, and we stumbled across one this week. No, it wasn't a lock contention issue or data corruption or anything you're thinking of. It was smaller, sneakier, and much lower-level.

Have a look at this code:

public class Trouble {
    // Two shared static fields: a counter that every thread increments,
    // and a count of how many times the every-256th check fired.
    public static int i = 1;
    public static int firedCount = 0;

    public static int totalSize = 1000000000;
    public static int maxLoops = 5;

    public static void main(String[] args) {
        final int threadCount = Integer.parseInt(args[0]);
        Thread[] threads = new Thread[threadCount];

        try {
            for (int loops = maxLoops; loops > 0; loops--) {
                firedCount = 0;
                long start = System.currentTimeMillis();

                // Each thread takes an equal share of the total count,
                // but all of them hammer the same static field 'i'.
                for (int j = 0; j < threadCount; j++) {
                    threads[j] = new Thread() {
                        public void run() {
                            int total = totalSize / threadCount;
                            for (int k = 0; k < total; k++) {
                                if (((i += 1) & 0xFF) == 0) doNothing();
                            }
                        }
                    };
                }

                for (int j = 0; j < threadCount; j++) {
                    threads[j].start();
                }

                for (int j = 0; j < threadCount; j++) {
                    threads[j].join();
                }

                System.out.println("time: " + (System.currentTimeMillis() - start));
                System.out.println("fired: " + firedCount);
            }
        } catch (Exception e) {
            e.printStackTrace();
            System.exit(1);
        }
    }

    public static void doNothing() {
        firedCount++;
    }
}
It looks pretty straightforward, right? Given a number on the command line, it fires up that many threads, divides the overall loop count by the number of threads, and starts them all running. Each thread increments the same static field 'i', and whenever i % 256 == 0, it calls doNothing, which increments a hit counter. Simple!

This is obviously a pretty dumb microbenchmark, but it mirrors a similar piece of code in JRuby. Because Ruby 1.8 supports various "unsafe" threading operations, we must support them too. But since we can't kill a target thread or arbitrarily raise an exception in a target thread, we must checkpoint periodically: at various points during execution, every thread checks whether one of those events has fired. The full list is described on the RubySpec wiki page on Ruby threading, but the bit we're interested in is the check that happens every 256 method calls. In JRuby, this was implemented with a static counter that was incremented and checked on every call.
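
Sketched in Java, that scheme looked something like this (a simplified illustration with hypothetical names, not JRuby's actual source):

public class CheckpointGlobal {
    // One counter shared by every thread in the runtime. Note that
    // it's static, which turns out to be exactly the problem.
    private static int callCount = 0;

    // Invoked on every dynamic method call.
    public static void callThreadPoll() {
        // Every 256th call anywhere in the process, poll for
        // cross-thread events (kill, raise, critical, etc.).
        if ((++callCount & 0xFF) == 0) {
            pollThreadEvents();
        }
    }

    private static void pollThreadEvents() {
        // handle any pending cross-thread events here
    }
}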

So because JRuby has the ability to *actually* run in parallel, a group at Sun has been working on some microbenchmarks to test our threaded scalability. The benchmark is fairly simple: given an array of strings, reverse each string and replace it in the array. When running with n threads, divide the array into n segments to parallelize the work. We'd naturally expect performance to improve as threads are added, up to the number of cores in the system. But we discovered a problem we never expected to see: threaded scaling completely tanked.
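
The shape of that benchmark, sketched in Java (the real benchmark was Ruby code, and these names are mine):

public class ReverseBench {
    // Reverse every string in the array, giving each thread one
    // contiguous segment of the work.
    public static void run(final String[] strings, final int threadCount)
            throws InterruptedException {
        Thread[] threads = new Thread[threadCount];
        final int chunk = strings.length / threadCount;
        for (int t = 0; t < threadCount; t++) {
            final int start = t * chunk;
            final int end = (t == threadCount - 1) ? strings.length : start + chunk;
            threads[t] = new Thread() {
                public void run() {
                    for (int i = start; i < end; i++) {
                        strings[i] = new StringBuilder(strings[i]).reverse().toString();
                    }
                }
            };
        }
        for (Thread thread : threads) thread.start();
        for (Thread thread : threads) thread.join();
    }
}

Note that each thread writes a disjoint segment of the array, so the benchmark itself shares nothing; the contention, as we'll see, lurked in JRuby's own checkpoint counter underneath.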

A Surprising Effect

Here's the result of running the Java benchmark on Java 5 with one and then two threads:
~/NetBeansProjects/jruby ➔ java -server Trouble 1
time: 3924
fired: 3906250
time: 3945
fired: 3906250
time: 1841
fired: 3906250
time: 1882
fired: 3906250
time: 1896
fired: 3906250
~/NetBeansProjects/jruby ➔ java -server Trouble 2
time: 3243
fired: 4090645
time: 3245
fired: 4100505
time: 1173
fired: 3906049
time: 1233
fired: 3906188
time: 1173
fired: 3906134
This is on my Core Duo MacBook Pro. With one thread, overall CPU usage sits at around 60-65%. Spinning up the second thread consumes the remaining 35-40%, and so we see an improvement in overall execution time. Yay parallel execution.

But now have a look at the Java 6 numbers on the same machine:
~/NetBeansProjects/jruby ➔ java -server Trouble 1
time: 1772
fired: 3906250
time: 1973
fired: 3906250
time: 2748
fired: 3906250
time: 2114
fired: 3906250
time: 2294
fired: 3906250
~/NetBeansProjects/jruby ➔ java -server Trouble 2
time: 3402
fired: 3848648
time: 3805
fired: 3885471
time: 4145
fired: 3866850
time: 4140
fired: 3839130
time: 3658
fired: 3880202
Whoa, what the hell happened here? Not only is overall performance worse, the two-thread run is significantly slower than the single-thread run. And we saw the exact same effect in JRuby: two threads running a given benchmark performed markedly worse than a single thread...exactly the opposite of the effect we'd like to see.

Caching 101

Source: CPU Cache article on Wikipedia

Accessing main memory during execution is slow...so slow that all modern processors have multiple levels of cache in which to keep memory locations that are used repeatedly. There's cache on the processor itself, cache in fast memory chips located near the processor, and so on. The cache, in whatever form, is a smaller, faster memory store. Within the cache, data is organized into cache lines, segments of data that map to specific locations in main memory. So if your processor uses cache lines of 16 bytes, the cache will hold some number of 16-byte entries for the most recently or most frequently used data from memory.

Writes to memory also benefit from caching. Unless a cache is write-through, writes do not immediately go to main memory, for the same performance reason: main memory is slow. So if you have a single processor accessing a small range of memory, rapidly reading and writing data, it may operate almost entirely from the cache. To help this process along, most compilers try to place data likely to be accessed around the same time at nearby memory locations; the more you can pack into each cache line, the better use you make of the available cache.
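
To make the locality point concrete, here's a textbook illustration (not related to the benchmark above): traversing a 2D array along its rows consumes each loaded cache line completely, while traversing down its columns touches a different line on almost every access.

public class Locality {
    public static void main(String[] args) {
        int[][] grid = new int[4096][4096];
        long sum = 0;

        // Cache-friendly: consecutive ints share a cache line, so each
        // line fetched from memory is fully used before the next fetch.
        for (int row = 0; row < 4096; row++)
            for (int col = 0; col < 4096; col++)
                sum += grid[row][col];

        // Cache-hostile: every step jumps to a different row's array,
        // so most accesses miss the cache and evict a line.
        for (int col = 0; col < 4096; col++)
            for (int row = 0; row < 4096; row++)
                sum += grid[row][col];

        System.out.println(sum);
    }
}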

Of course at some point the cache becomes stale: writes must be flushed and reads must be refreshed. This can happen (for example) when there's a cache miss and the program accesses a memory location not already cached; one of the other cache entries must be evicted to make room for the new one. It will also happen naturally as the program progresses or as new programs and threads are scheduled to execute...eventually the cache must be flushed simply to ensure that main memory has been updated and all lines of execution see the change. But there's another point at which the cache, or (more likely) an individual cache line, might be flushed.

Cache Line Ping-Pong

In a multiprocessor/multicore machine, you will usually have two or more threads actually running in parallel. Depending on the architecture, each core usually has its own cache, allowing the two threads to run independently without influencing each other's work. Of course, things start to get interesting when multiple threads access the *same* memory locations, as we have in this case.

This brings us to the reason for Java 6's poor performance on the benchmark above. John Rose helped me out by disassembling the resulting native code for this benchmark on Java 5 and Java 6. It turned out that while Java 5 located the two static fields in different quadwords (16 bytes), Java 6 located both fields in the same quadword. This is an optimization for better cache utilization: since the two fields are immediately next to each other in memory, they're almost certain to share a cache line. But it also has a less pleasant side effect.

Cache line ping-pong happens when two threads executing against different caches need to read and write the same memory locations. Because the two threads need to see each others' changes, the cache line ends up bouncing back and forth between the two caches, generally causing a flush or refresh from main memory in the process.
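
You can provoke the effect deliberately: give two threads completely independent counters, but let the counters sit next to each other so that they almost certainly share a cache line. A minimal sketch, assuming adjacent statics do end up on the same line:

public class PingPong {
    // Independent counters, adjacent in memory. Neither thread ever
    // reads the other's field, yet each write invalidates the shared
    // cache line in the other core's cache.
    static volatile long a;
    static volatile long b;

    public static void main(String[] args) throws Exception {
        Thread one = new Thread() {
            public void run() { for (int n = 0; n < 50000000; n++) a++; }
        };
        Thread two = new Thread() {
            public void run() { for (int n = 0; n < 50000000; n++) b++; }
        };
        long start = System.currentTimeMillis();
        one.start(); two.start();
        one.join(); two.join();
        System.out.println("time: " + (System.currentTimeMillis() - start));
    }
}

Move one of those fields to another class, or pad them far enough apart, and the two loops should run at nearly full speed again.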

Main memory is slow, remember? Because of the cache line ping-pong, adding threads to this benchmark actually caused things to slow down. Every new thread increased the amount of back-and-forth between caches and main memory, up to the number of cores...at which point performance remained about constant for additional threads (because the amount of ping-pong got no worse).

It's interesting to note that single-thread performance was also considerably worse on Java 6. I don't yet have a good explanation for this.

The Fix

For the Java benchmark above, given what we now know about cache lines, a simple fix that Vladimir Sizikov found was to move one of the two statics to a different class. Java 6 only packed the two fields into a single quadword when they lived on the same class; moving one elsewhere presumably puts them on different cache lines, and the ping-ponging is reduced.
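
Applied to the benchmark above, the fix is as small as it sounds:

// Before: both hot fields on Trouble, packed into one quadword by
// Java 6 and therefore sharing a cache line:
//     public static int i = 1;
//     public static int firedCount = 0;

// After: the hit counter lives on its own class, so the two fields
// no longer sit next to each other (and, presumably, no longer share
// a line). doNothing() then increments Counters.firedCount instead.
public class Counters {
    public static int firedCount = 0;
}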

In JRuby, where we had only a single counter but a lot of other stuff happening around it, the answer was much clearer: we split the single global counter into per-thread counters, and the effect was entirely erased. The string-reversal benchmark started to scale nearly linearly with additional threads, while Ruby 1.8, Ruby 1.9, and Rubinius all showed a slight performance degradation as threads were added. JRuby FTW.
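
In sketch form (again simplified, not the actual JRuby source), the counter moves off the static and onto a per-thread context object:

public class ThreadCheckpoint {
    // Each thread owns one of these, so each counter lives in its own
    // object and no cache line is contended between threads.
    private int callCount = 0;

    public void callThreadPoll() {
        if ((++callCount & 0xFF) == 0) {
            pollThreadEvents();
        }
    }

    private void pollThreadEvents() {
        // handle kill/raise/critical events targeted at this thread
    }
}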

The Moral

I suppose there are a few things we can learn from this:
  1. Don't use statics - Gilad Bracha probably said it best in his article "Cutting out Static":
    Static variables are bad for concurrency. Of course, any shared state is bad for concurrency, but static state is one more subtle time bomb that can catch you by surprise.
    Touché, Gilad.
  2. Don't share state unless you absolutely have to - This is perhaps a lesson we all believe we know, but until you're bitten by something sneaky like this, you may not realize what bad habits you have.
  3. One man's optimization is another man's deoptimization - Java 6 is significantly faster than Java 5, presumably because of optimizations like this. But every optimization has a tradeoff, and this one caught me by surprise.