Benchmarking in C++

Intro

 We’ll be creating a framework that attempts to provide a generic, cross platform (standard compliant), non dependent to third party libraries solution to the benchmarking problem. The end product will look like this :

benchmark<time_type, clock_type> bm;

bm. run(
    "Experiment Name"
    , sampleSize
    , { /* code to time */ }
    , "factor name"
    , factors. begin(), factors. end() );

bm. serialize(
    "benchmark name" , "output file name" , mode);

 The main features of this design are :

  • Parametrizable time and clock type per benchmark.
  • Multiple experiments can run on the same benchmark (hence we can group and compare results as we wish).
  • Adjustable sampling size (number of times each experiment will run).
  • A factor can be passed to the code under benchmark (simply put, when we graph the results, the factor will be the variable along the x axis).
  • The benchmarks are printable and serializable in a “Python friendly” form.

 As a complementary tool, a Python script will be developed, to read the output of a serialization and make graphs using matplotlib.
Finally, it will become apparent that the framework is mostly concerned (without excluding the possibility of expanding) with code runtime performance. The reason why this post isn’t titled “Timing C++ code” is because a sampling procedure and postprocessing of the results is incorporated as well.

Know why

 If you think something is correct, prove it with a test. If you  think something  is  fast, time it.  If  you think one algorithm  is   better  than  another,  find  some way to make a measurement to support your argument.

Do not depend  only upon your “common sense.” Do not depend only on theory. It is good to use theory to help guide you  to an  optimum  solution. But without measuring,   you   have   no   way   of   knowing  (nor convincing others)  of the  correctness of your theory.  Without  measurement,  you  have  a  faith-based system, not a scientific theory. [3]

Know how – techniques

 The dominant technique used here is instrumentation. The way we’ll be applying it is by mandating code under timing be called through a measuring structure. A simple version of this emerged after several design iterations :

template<typename TimeT = std::chrono::milliseconds>
struct measure
{
     template<typename F, typename ...Args>
     static typename TimeT::rep execution(
     F func, Args&&... args)
     {
         auto start = std::chrono::system_clock::now();
         func(std::forward<Args>(args)...);
         auto duration = std::chrono::duration_cast<
            TimeT>(
             std::chrono::system_clock::now() - start);
         return duration.count();
     }
};

 Things to note here are :

  • variadic templates are used to forward as many arguments as needed to the callable object (function, function pointer, functor, lambda or a block of code surrounded by a lambda).
  • The code we time won’t be rearranged by the optimizer and will always lie between those start / end calls to now(), so we can guarantee our timing will be valid.
  • There’s minimum overhead by this calling layer; in other words if passing the callable and the arguments (by &&) to the instrumentation structure disturbs the results, then there’s no point in the measurement to begin with.

 This has a very clean use pattern and minimum code pollution when it comes to timing :

auto executionTime = measure<>::execution(f0);

Know how – std::chrono

 C++11 introduced “a flexible collection of types that track time with varying degrees of precision”. In what would be a brief introduction to this collections, we can identify three workhorses : clocks, durations and time points.

 A clock is a class that contains time information. Things to know about clocks:

  • We call now() to get the current time of a clock
  • There are three types of clocks: steady_clock, system_clock and high_resolution_clock.
  • A clock’s tick period is the number of times it ticks per second (specified by  std::ratio)

 Durations are what we would call time types. The related class template std::chrono::duration<> can “create” time types in its instantiations. To give an example, this is how we define milliseconds :

std::chrono::duration< double, std::ratio<1,1000> >

 Finally a time point is the return type of clock::now().  It is implemented as if it stores a value of type Duration indicating the time interval from the start of the Clock‘s epoch.

Implementation

 A draft description of the entity design is shown in the diagram below :

class_diagram

 I may be providing more details on the implementation techniques on another post, but for now I’ll justify (for anyone reading the code) the usage of type erasure with the fact that I had to use function templates when implementing the experiments’ functionality, and those can’t be virtual.

Using the toolset

We’ll be using the toolset to fill in the mising graphs in Bjarne’s presentation. Therein Mr. Stroustrup proposes the following experiment :

  • Generate N random integers and insert them into a sequence so that each is inserted in its proper position in the numerical order.
  • Remove elements one at a time by picking a random position in the sequence and removing the element there.
  • For which N is it better to use a linked list than a vector (or an array) to represent the sequence ?

This is all it takes (assuming the correct headers are included) :

template<class Cont>
void CommonUse(int nElems)
{
	Cont cont;
	random_device rd;
	mt19937 eng{ rd() };
	uniform_int_distribution<int> distr(0, nElems);

	for (int i = 0, val; i < nElems; i++) {
		val = distr(eng);
		cont.insert(lower_bound(
			begin(cont), end(cont), val), val);
	}

	while (nElems) {
		auto it = std::begin(cont);
		advance(it, distr(eng) % nElems);
		cont.erase(it);
		--nElems;
	}
}

int main()
{
	bmk::benchmark<> bm;
	// 1. Run an experiment for vectors --------------
	bm.run("vector", 10, CommonUse< vector<int> >,
	    "number of elements", 
            { 10, 100, 1000, 10000, 100000, 1000000 });
	// 2. Run an experiment for lists ----------------
	bm.run("list", 10, CommonUse< list<int> >,
	    "number of elements", 
            { 10, 100, 1000, 10000, 100000, 1000000 });
	// 3. Record the results -------------------------
	bm.serialize("Why you should avoid Linked Lists",
		"vector_vs_list.txt");
}

 Here we average the results of 10 samples per experiment and use an initializer list to provide the factors. Factors must be the same type as the argument to the lambda and can also be given as an iterator delimited range (begin / end). If you think an operation should be excluded from the timing you can surround it with a timeout’s .tic() / .toc() as long as the timeout has the same clock and time type as the experiment :

bmk::timeout_ptr<> to = make_unique < bmk::timeout<> >();

to->tic();
val = distr(eng); // time excluded from the experiment
to->toc(); 

/* rest of the experiment */

return to;

 Experiments that return timeouts (actually timeout unique pointers) are handled differently (thank you expression SFINAE) and have the total duration stored in timeout subtracted from the experiment’s time (hopefully you won’t use timeout’s for trivial tasks that will take 0 nanoseconds anyway).

 If we were to open “vector_vs_list.txt” (just the results stored as a Python dictionary) with the visualizer python script this is what we get :

vector_vs_list

 Yes I only actually tried up to 10,000 elements; it takes waaaay too long when I add the last too factors as well (you can run the code yourself if you don’t believe me).

 If we were to run factorless experiments then the sequence size couldn’t be an input. So let’s choose for example’s sake the size 10,000; the visualization we get is a boxplot per experiment :

bxplt_vector_vs_list

Code Kata

  • Add the option to have each experiment run in its own thread (I’ll be posting a take on this but it’s good to get some ideas first).
  • I kind of cheated (not that it changes the results) in favor of vector. Can you spot where ?

Post scriptum

 tl;dr ? You can check the code in action here (it’s coliru so I’m only console printing the results).

References

  1. My post on SO
  2. cppreference.com on chrono
  3. C++ coding guidelines
  4. The chrono proposal
  5. The benchmark class and visualizer at the blog’s repository

Compile-time integer sequences

Intro

 We’ll be discussing a mechanism that can, during compilation, generate a sequence [0, N] given an integer N. So for example, the following mappings can be generated :

int_seq<3> -> { 0, 1, 2, 3 }
int_seq<2> -> { 0, 1, 2    }
int_seq<1> -> { 0, 1       }
int_seq<0> -> { 0          }

 We choose to use a closed interval in the examples so that the last case won’t produce an empty set (and an awkward compilation).

Know why

 Being a compile time thing, this sequence can be embeeded in type information or used in contexts that demand compile time constants. Typical examples include :

  • Filling a braced initializer list. This has numerous uses, consider eg:
    template<N>
    void fU() 
    {   // int_seq<N> would come in handy here
        int N[] = { 0, 1 ... N-1, N };  
    }
    
  • Pack expansion contexts; more on this on the next post, but for now just consider getting all tuple elements when all you know is its size and get<> demands a compile time constant.
  • In complex compile time computations (I’ll be posting about a compile time sqrt function, stay tuned)

Know how

 We present an implementation based on [3] tweaked to provide the closed interval :

// a sequence container -------------------------
template<int ...>
struct ints 
{ 
};
// integer sequence -----------------------------
template<int N, int... Is>
struct int_seq : int_seq<N - 1, N, Is...> 
{ 
};

template<int... Is>
struct int_seq<0, Is...> 
{
	using type = ints<0, Is...>;
};
// convenience alias -----------------------------
template<int N>
using IS = typename int_seq<N>::type; 

 What happens here is that int_seq inherits from “another version” of itself; Inheriting from an incomplete type, as myself during class definition, is forbidden, but since the template arguments differ, each int_seq inherits from a different class.

 This compile time (expanding) recursion, causes the integer sequence to be embeeded in the variadic integer pack. We can mentally trace an example of the numbers during instantiation :

int_seq<3>       inherits from int_seq<2, 3>
int_seq<2, 3>    inherits from int_seq<1, 2, 3>
int_seq<1, 2, 3> inherits from int_seq<0, 1, 2, 3>

int_seq<0, 1, 2, 3> contains { 0, 1, 2, 3 }

 So how do we use this thing ? The “secret” is to infer the integer sequence. Let’s showcase this by returning to our first use case: 

template<int... Is>
void fU(ints<Is...>&& s)
{
	for (auto i : { Is... }) cout << i << " ";
	cout << endl;
}

int main()
{
    fU(IS<5>());
}

 Since the sequence can’t “live” on its own, it’s used through a pack type container ints. Yet inside func we can refer to the sequence Is since it has been inferred; all it took was placing it in a deducable context!

CodeKata

 Now make a reverse integer list. I was tempted to submit a proposal for an STL extension (make_reverse_integer_sequence) but after solving this discovered (not to my surprise I admit) that it has been done before (just look at [2]).

So here’s my take at this:

// reverse integer sequence ---------------------
template<int C, int N, int... Is>
struct rev_seq : rev_seq<C - 1, N, N - C, Is...>
{
};

template<int N, int... Is>
struct rev_seq<0, N, Is...>
{
	using type = ints<N, Is...>;
};
// convenience alias ----------------------------
template<int N>
using RS = typename rev_seq<N, N>::type;

Post scriptum

 Related code can be found in the blog’s repository. A demo can be found here. Lastly, here’s a list of problems, in ascending order of difficulty :

  1. Trace the instantiation for RS<3> in the CodeKata.
  2. Tweak the code to provide the open sets (most implementations out there work this way, so that’s an easy one).
  3. Make a reverse integer list by reversing the front one.
  4. Make [N1, N2] ranges, ie ranges that can have their starting point specified eg
     [2,6] -> { 2, 3, 4, 5, 6 }
  5. Make [N1, N2, step] ranges, ie ranges that can have their starting point and step specified eg
     [2,6,2] -> { 2, 4, 6 }

 Hopefully I’ll get some responses on the above and compile the highlights (along with some commenting) in an upcoming post.

References

  1. Proposal N3658 
  2. reversing a list by Xeo
  3. the earliest implementaion I could find, by Johannes Schaub