Thursday, July 10, 2014

Precision vs Accuracy in Benchmarking


I continue to explore benchmarking design patterns for Design Patterns in Dart. I am still in the preliminary stages of research for the book so the actual benchmarks are not as important as how I want to benchmark. The simple answer is to use benchmark_harness—the official benchmark harness for Dart. As I have found over the past couple of days, there is more involved than simply using it.

I have already determined that it is best to run implementation comparisons in separate files. Advice from Vyacheslav Egorov (aka @mraleph) further suggests that I need to run the benchmark code longer than I currently am.

Taking Vyacheslav up on his advice, I work through each of my three Visitor Pattern implementations (in the visitor/tool directory of the Design Patterns public repository). To each, I add a loop to the run() method of the harness:
class NodesDoubleBenchmark extends BenchmarkBase {
  const NodesDoubleBenchmark() : super("Nodes iterate w/ double dispatch ");
  static void main() { new NodesDoubleBenchmark().report(); }

  void run() {
    // Note: 10^6 is bitwise XOR in Dart (it evaluates to 12), so the
    // million-iteration bound has to be written out in full.
    for (var i = 0; i < 1000000; i++) {
      // visitor and nodes are defined elsewhere in the benchmark file.
      visitor.totalPrice = 0.0;
      nodes.accept(visitor);
    }
  }
}
In each of the three benchmarks, I loop over the visit a million times. For good measure, I have each benchmark report its timing three times, along the lines of the sketch below.
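Presumably (a sketch; the actual entrypoints in the repository may differ) each benchmark file does something like:
// Hypothetical entrypoint: produce three timing lines per benchmark file
// by invoking the harness report three times.
void main() {
  for (var i = 0; i < 3; i++) {
    NodesDoubleBenchmark.main();
  }
}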

When I run the code, I find:
$ ./tool/benchmark_for_single_dispatch.dart; \
  echo "--"; \
  ./tool/benchmark_for_double_dispatch.dart; \
  echo "--"; \
  ./tool/benchmark_traversing_visitor.dart 
Nodes iterate w/ single dispatch (RunTime): 1373.6263736263736 us.
Nodes iterate w/ single dispatch (RunTime): 1331.5579227696405 us.
Nodes iterate w/ single dispatch (RunTime): 1344.0860215053763 us.
--
Nodes iterate w/ double dispatch (RunTime): 1381.9060773480662 us.
Nodes iterate w/ double dispatch (RunTime): 1322.7513227513227 us.
Nodes iterate w/ double dispatch (RunTime): 1345.8950201884254 us.
--
Visitor Traverses(RunTime): 1452.4328249818445 us.
Visitor Traverses(RunTime): 1417.4344436569809 us.
Visitor Traverses(RunTime): 1411.4326040931546 us.
I notice two things here. First, the difference between single and double dispatching is negligible. Second, the default reporter has a precision vs. accuracy problem.

When the object structure is responsible for traversing itself, double dispatching surprisingly has no measurable cost. I can only speculate that the VM recognizes the double dispatch target method and is able to optimize calls to it. Oddly (and unlike last night's non-looping benchmark), making the visitor responsible for traversing the object structure is the loser.

Since this is an exercise more focused on how to benchmark than on the actual results, I will let those slide for now. Besides, I am far more bothered by benchmark_harness's incredible precision (13 decimal places!) when the runs vary by 40µs. The reported precision implies an accuracy that is simply not possible.

If the numbers are varying at the third significant digit, it makes no sense to report more than 4 significant digits. This means that I need a new benchmark reporter. The benchmark_harness package defines a simple interface for its “emitters”:
abstract class ScoreEmitter {
  void emit(String testName, double value);
}
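For comparison, the default emitter that ships with benchmark_harness is (roughly, as I recall the package source) a straight print of the raw double:
// Approximate reconstruction of the package's default emitter.
class PrintEmitter implements ScoreEmitter {
  const PrintEmitter();

  void emit(String testName, double value) {
    print("$testName(RunTime): $value us.");
  }
}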
I implement this locally as ProperPrecisionScoreEmitter:
class ProperPrecisionScoreEmitter implements ScoreEmitter {
  const ProperPrecisionScoreEmitter();

  void emit(String testName, double value) {
    print("$testName (RunTime): ${value.toStringAsPrecision(4)} µs.");
  }
}
The key difference between the default implementation and mine is that I explicitly request a sane precision with double's toStringAsPrecision() method.
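For instance, feeding the first raw value from the runs above through toStringAsPrecision() shows the effect:
// Rounds to significant digits, not decimal places.
print(1373.6263736263736.toStringAsPrecision(4)); // prints: 1374
print(1373.6263736263736.toStringAsPrecision(6)); // prints: 1373.63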

To use this sane-precision emitter, I update the constructor in my benchmark class. The constructor in BenchmarkBase accepts an optional named emitter:
class NodesDoubleBenchmark extends BenchmarkBase {
  const NodesDoubleBenchmark() :
    super(
      "Nodes iterate w/ double dispatch",
      emitter: const ProperPrecisionScoreEmitter()
    );
  static void main() { new NodesDoubleBenchmark().report(); }

  void run() {
    // run the benchmarked code here ...
  }
}
After doing that for each of my implementation benchmarks, I have:
$ ./tool/benchmark_for_single_dispatch.dart; \
  echo "--"; \
  ./tool/benchmark_for_double_dispatch.dart; \
  echo "--"; \
  ./tool/benchmark_traversing_visitor.dart 
Nodes iterate w/ single dispatch (RunTime): 1371 µs.
Nodes iterate w/ single dispatch (RunTime): 1322 µs.
Nodes iterate w/ single dispatch (RunTime): 1328 µs.
--
Nodes iterate w/ double dispatch (RunTime): 1377 µs.
Nodes iterate w/ double dispatch (RunTime): 1324 µs.
Nodes iterate w/ double dispatch (RunTime): 1316 µs.
--
Visitor Traverses (RunTime): 1476 µs.
Visitor Traverses (RunTime): 1417 µs.
Visitor Traverses (RunTime): 1445 µs.
Ahhh. Much better.

The deviation between runs remains an open question. Certainly other activity on my machine plays some part in it, but it would be nice if the runs were stable enough to justify a little more precision. I will investigate the actual benchmark code tomorrow to see if this is possible.
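One possibility (a sketch, assuming that calling the harness's measure() directly returns the same microsecond value that report() prints) would be to quantify the deviation explicitly:
import 'dart:math' show sqrt;

// Hypothetical helper: run a benchmark several times and report the mean
// and standard deviation of the measured times (in microseconds).
void reportSpread(BenchmarkBase benchmark, {int runs: 3}) {
  var times = [];
  for (var i = 0; i < runs; i++) {
    times.add(benchmark.measure());
  }
  var mean = times.reduce((a, b) => a + b) / runs;
  var variance =
      times.map((t) => (t - mean) * (t - mean)).reduce((a, b) => a + b) / runs;
  print("mean: ${mean.toStringAsPrecision(4)} µs, "
      "std dev: ${sqrt(variance).toStringAsPrecision(4)} µs");
}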


Day #118
