Sunday, July 13, 2014

Long vs Short Benchmarks (also, I'm a true idiot)


For the second day in a row, I have to come clean on a dumb mistake. Benchmarks have a particular knack for doing me in, it would seem. And, as hard as I might try to rationalize the numbers that do not fit my world view, eventually the numbers get me. Which is why I like them so darn much.

This time around, it was Lasse Reichstein Holst Nielsen who pointed out a flaw in my numbers. Specifically, my exponentiation. While mucking with the number of times that my benchmarks get run, I tried one million times, or 1e+6. Only that is not what I put in my code:
class VisitorBenchmark extends BenchmarkBase {
  // ...
  void run() {
    for (var i=0; i<10^6; i++) {
      visitor.totalPrice = 0.0;
      nodes.accept(visitor);
    }
  }
}
As Lasse was kind enough to point out, the hat operator in Dart is “bitwise or.” So I was combining the bits in 1010 and 0110, giving me 1110… or 12. That's just slightly less than the ONE MILLION I was going for... Speaking of which, I was wondering why increasing that number to 10^20 didn't bring things to a grinding halt. Generally, running a loop 30 times does not do that. Sigh.

I should have noticed that the numbers did not line up, but in my defense, benchmark_harness ensures that the code being executed runs for at least 2 seconds. It will run the code a thousand or a billion times if needed to ensure that the code is executed for that 2 seconds. So I assumed that the lack of significant increase was due to fewer iterations needed.

Yup… I assumed.

Anyhow, I think this is the last of my blinders involved in this bit of benchmarking, which tries different implementations of the Visitor Pattern for eventual use in Design Patterns in Dart. Since I had already gotten useful data from the 12 iteration approach to benchmarking, I had even planned on moving on to other territory tonight.

And then I thought, “what happens if I do run the code ONE MILLION times?” Well, the answer is that it takes too long for me to wait. So instead, I try ten thousand for each of my three implementation benchmarks. The “classic” implementation that I will likely use in the book gets a benchmark like:
class VisitorBenchmark extends BenchmarkBase {
  const VisitorBenchmark() :
    super(
      "Classic Visitor Pattern",
      emitter: const ProperPrecisionScoreEmitter()
    );

  static void main() { new VisitorBenchmark().report(); }

  void run() {
    for (var i=0; i<1e+5; i++) {
      visitor.totalPrice = 0.0;
      nodes.accept(visitor);
    }
  }
}
This results in:
$ ./tool/benchmark.dart; \
    echo '--'; \
    ./tool/benchmark_single_dispatch_iteration.dart; \
    echo '--'; \
    ./tool/benchmark_visitor_traverse.dart
Classic Visitor Pattern (RunTime): 1.444e+7 µs.
Classic Visitor Pattern (RunTime): 1.447e+7 µs.
Classic Visitor Pattern (RunTime): 1.449e+7 µs.
--
Nodes iterate w/ single dispatch (RunTime): 1.509e+7 µs.
Nodes iterate w/ single dispatch (RunTime): 1.507e+7 µs.
Nodes iterate w/ single dispatch (RunTime): 1.509e+7 µs.
--
Visitor Traverses (RunTime): 1.121e+7 µs.
Visitor Traverses (RunTime): 1.113e+7 µs.
Visitor Traverses (RunTime): 1.121e+7 µs.
What is interesting about these numbers is that they are almost the complete opposite of what I found last night. In this case, making the Visitor object responsible for traversing the data structure is the clear winner. Also as interesting is that hand-optimizing the code with a single-dispatch call is the loser. Last night, with only 12 stinking runs, I found:
/tool/benchmark.dart; \
    echo '--'; \
    ./tool/benchmark_single_dispatch_iteration.dart; \
    echo '--'; \
    ./tool/benchmark_visitor_traverse.dart
Classic Visitor Pattern (RunTime): 1176 µs.
Classic Visitor Pattern (RunTime): 1167 µs.
Classic Visitor Pattern (RunTime): 1136 µs.
--
Nodes iterate w/ single dispatch (RunTime): 1097 µs.
Nodes iterate w/ single dispatch (RunTime): 1074 µs.
Nodes iterate w/ single dispatch (RunTime): 1078 µs.
--
Visitor Traverses (RunTime): 1179 µs.
Visitor Traverses (RunTime): 1142 µs.
Visitor Traverses (RunTime): 1118 µs.
Obviously the numbers are much smaller, but the winner for smaller sample sizes is the single dispatch approach. These were not just 12 runs and that is it. The default benchmark_harness implementation runs the benchmark code 10 times and will do so for at least 2 seconds. There are even a few warm-up runs for good measure. In other words, those numbers are legitimate benchmarks.

In terms of what to use in the book, I am not quite sure what to make of this. Clearly, the longer that code is executed (10+ seconds vs 2 seconds) gives the Dart VM more time to optimize calls. I do not have a good explanation for why making the visitor responsible for traversing the node structure wins—and maybe I do not need one.

My takeaway is that it would behoove me to investigate alternate approaches for each pattern and to include discussions for limited use (~1000 calls) vs heavy usage (10k+ calls).


Day #121

2 comments:

  1. I'm not sure a design pattern like this is really worth benchmarking. The numbers will change depending on application and browser, so any advice you come up with will go stale quickly.

    ReplyDelete
    Replies
    1. You're probably right. Even though there are slight differences between the numbers (and they change depending on the number of runs), the differences probably are not enough to warrant favoring one approach over the other.

      Still, I find the exercise fascinating due to the number of ways that I can mess up. More importantly, I am figuring out how to test and what I need to look for in the patterns that have more significant variations.

      Delete