As an avid C++ supporter, I have frequently had to face objections to the language, mostly based on its (supposed) performance issues. The two most frequent objections have been:
- C++ is too slow. Better stick to plain old C.
- C++ is not worth the effort. Java provides an easier environment and it’s almost as fast as C++.
There’s some level of truth in both statements: C++ has many pitfalls which can make it perform (almost) as slowly as a Java program.
In this series of articles I will expose some common mistakes which I have made or seen, measure their impact, and show how they can be avoided.
Hopefully I can show that, with the savvy of an experienced programmer, C++ is as fast as C, while exposing a much richer programming paradigm.
First practical example in C
Let’s start from a simple example: a program which represents various types of geometric entities (circle, square, rectangle) and can compute the area of each.
We will not focus on how such entities can be created interactively, nor will we care about accurate computations or elegant data structures. On the contrary, computations will be done in fixed point (where PI is just 3, not 3.1415…) to avoid the overhead of floating point calculations, and data structures will be simple (to the point of naive) so that we can see, in the finest detail, the impact of the language implementation.
enum Type
{
    CIRCLE,
    SQUARE,
    RECTANGLE
};

struct GeometricEntity
{
    enum Type type;
    union
    {
        struct
        {
            int radius;
        } Circle;
        struct
        {
            int side;
        } Square;
        struct
        {
            int width, height;
        } Rectangle;
    };
};
In this example we represent geometric entities with a struct containing the entity’s type and a union, with a sub-struct for each type we can represent.
We then define a function to compute the area of a given GeometricEntity, switching on the type:
int Area(struct GeometricEntity *ent)
{
    switch(ent->type)
    {
        case CIRCLE:
            return 3 * ent->Circle.radius * ent->Circle.radius;
        case SQUARE:
            return ent->Square.side * ent->Square.side;
        case RECTANGLE:
            return ent->Rectangle.width * ent->Rectangle.height;
    }
}
(Yes, I know there's no default case. That's intentional, to show what happens in the generated code.)
Before the main function we define a helper to compute the duration between two timestamps:
double DeltaTime(struct timeval before, struct timeval after)
{
    return (((double)after.tv_sec * 1000000 + (double)after.tv_usec) -
            ((double)before.tv_sec * 1000000 + (double)before.tv_usec)) / 1000000.0;
}
And finally our main, initializing three GeometricEntities and computing their area 100 million times:
int total;

int main()
{
    struct timeval before, after;
    struct GeometricEntity circle, square, rectangle;
    int i;

    circle.type = CIRCLE;
    circle.Circle.radius = 1;
    square.type = SQUARE;
    square.Square.side = 1;
    rectangle.type = RECTANGLE;
    rectangle.Rectangle.height = 1;
    rectangle.Rectangle.width = 1;

    total = 0;
    gettimeofday(&before, NULL);
    for(i = 0; i < 100000000; i++)
    {
        total += Area(&circle);
        total += Area(&square);
        total += Area(&rectangle);
    }
    gettimeofday(&after, NULL);
    printf("Time: %0.6f total=%d\n", DeltaTime(before, after), total);
    return 0;
}
First practical example in C: assembly implementation
Let’s see how the “Area” function is implemented in the assembly generated by the compiler. This will help us later on to evaluate different alternatives in terms of performance and assess C/C++ differences.
Area:
        cmpl    $1, %eax            ; type == 1 (SQUARE) ?
        je      .L3                 ; yes, jump to .L3
        jb      .L4                 ; type lower than 1 (CIRCLE) ?
                                    ; then jump to .L4
        cmpl    $2, %eax            ; type == 2 (RECTANGLE) ?
        jne     .L9                 ; no (default), jump to .L9
        movl    4(%rdi), %eax       ; code for RECTANGLE:
                                    ; load width into %eax
        imull   8(%rdi), %eax       ; multiply %eax by height
        ret                         ; return value in %eax
.L9:                                ; default (missing in switch):
        rep ret                     ; just return an undefined value
.L4:                                ; code for CIRCLE:
        movl    4(%rdi), %edx       ; load radius into %edx
        leal    (%rdx,%rdx,2), %eax ; %eax = radius * 3 (an optimized,
                                    ; hard-to-read multiplication by 3)
        imull   %edx, %eax          ; %eax = radius * radius * 3
        ret                         ; return %eax
.L3:                                ; code for SQUARE:
        movl    4(%rdi), %eax       ; load "side" into %eax
        imull   %eax, %eax          ; multiply side by itself
        ret                         ; and return
The snippet above shows that the compiler generated a series of comparisons on ent->type, each followed by a jump to the piece of code that handles the matching entity type.
First practical example in C: considerations
The example above is simple and functional, but it suffers from a severe architectural fault: to support new geometric entities we cannot simply "add" new definitions and implementations. We must hack into the existing structure, adding entries to the enum and cases to the switch statement inside the "Area" function, which grows heavier, uglier, and slower as the number of cases the compiler must compare "type" against increases. We must also manually make sure we use the right sub-structure of the union for each data type.
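To illustrate this, consider adding a hypothetical TRIANGLE entity (not part of the original example). At least three existing pieces of code must be edited, as the comments below mark:

```cpp
#include <cstddef>

enum Type { CIRCLE, SQUARE, RECTANGLE, TRIANGLE };   // 1. extend the enum

struct CircleData    { int radius; };
struct SquareData    { int side; };
struct RectangleData { int width, height; };
struct TriangleData  { int base, height; };          // 2. add a sub-struct...

struct GeometricEntity
{
    enum Type type;
    union                                            //    ...and extend the union
    {
        CircleData    Circle;
        SquareData    Square;
        RectangleData Rectangle;
        TriangleData  Triangle;
    };
};

int Area(struct GeometricEntity *ent)
{
    switch(ent->type)
    {
        case CIRCLE:    return 3 * ent->Circle.radius * ent->Circle.radius;
        case SQUARE:    return ent->Square.side * ent->Square.side;
        case RECTANGLE: return ent->Rectangle.width * ent->Rectangle.height;
        case TRIANGLE:  return ent->Triangle.base * ent->Triangle.height / 2;
                                                     // 3. extend the switch
    }
    return 0;
}
```

(The sub-structs are named here, unlike in the original listing, only so the snippet compiles as both C and C++.)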
A better example in C: almost a class
In an object-oriented world, we would define Circle, Square and Rectangle as subclasses of a base class. We are in C, where there is no such thing, so we can implement an almost-object-oriented approach using structures instead of classes, composition instead of inheritance, and function pointers instead of member functions:
struct AbstractEntity
{
    int (*Area)(const struct AbstractEntity *_this);
};
The “AbstractEntity” structure is the “base class” from which we will “inherit” Circle, Rectangle and Square. It contains a single member, a function pointer to a function which will compute the entity’s area. In more complex examples it could contain information which is common to all subclasses, such as the entity’s instance number or name.
To “derive” (well, almost) from this base structure we will use composition, here is an example for Circle:
struct Circle
{
    struct AbstractEntity base;
    int radius;
};
In this way “Circle” contains everything that “AbstractEntity” does, including the function pointer.
To initialize a “Circle” instance we proceed as follows:
struct Circle circle;
circle.base.Area = CircleArea;
circle.radius = 1;
CircleArea is a function which we define to compute the area of circles:
int CircleArea(const struct AbstractEntity *_this)
{
    return 3 * ((const struct Circle *)_this)->radius *
               ((const struct Circle *)_this)->radius;
}
Notice that the parameter of CircleArea must be a pointer to (const) AbstractEntity, not to Circle, because the function must match the signature of the "Area" pointer declared inside AbstractEntity. Hence the explicit cast in CircleArea's body.
The implementations of Rectangle and Square are straightforward:
struct Square
{
    struct AbstractEntity base;
    int side;
};

int SquareArea(const struct AbstractEntity *_this)
{
    return ((const struct Square *)_this)->side *
           ((const struct Square *)_this)->side;
}

struct Rectangle
{
    struct AbstractEntity base;
    int width, height;
};

int RectangleArea(const struct AbstractEntity *_this)
{
    return ((const struct Rectangle *)_this)->height *
           ((const struct Rectangle *)_this)->width;
}
Usage is equally straightforward in the main function:
{
    struct Circle circle;
    struct Square square;
    struct Rectangle rectangle;

    circle.base.Area = CircleArea;
    circle.radius = 1;
    square.base.Area = SquareArea;
    square.side = 1;
    rectangle.base.Area = RectangleArea;
    rectangle.height = 1;
    rectangle.width = 1;

    total = 0;
    gettimeofday(&before, NULL);
    for(i = 0; i < 100000000; i++)
    {
        /* the casts are safe: "base" is the first member of each struct */
        total += circle.base.Area((struct AbstractEntity *)&circle);
        total += square.base.Area((struct AbstractEntity *)&square);
        total += rectangle.base.Area((struct AbstractEntity *)&rectangle);
    }
    gettimeofday(&after, NULL);
    printf("Time: %0.6f\n", DeltaTime(before, after));
}
Please notice the usage of the function pointer when computing areas:
circle.base.Area((struct AbstractEntity *)&circle);
This is very similar to calling a C++ method, except that we must pass the instance pointer explicitly (whereas C++ would pass it automatically as "this") and cast it to the base type (a safe cast, since "base" is the first member of the struct).
Let's compare the performance of the first naive implementation (the one based on the switch on the type field) with the current one, based on function pointers. The new implementation is significantly faster than the previous one: about 0.40 seconds versus 0.53.
Why does this happen? Let's have a look at the assembly code to find out.
A better example in C: assembly implementation
Let’s take this simple code snippet and check what the compiler does:
total += circle.base.Area((struct AbstractEntity *)&circle);
The generated assembly is as follows:
        movq    %r14, %rdi      ; %r14 contains the address of "circle",
                                ; move it to %rdi to pass it to the callee
        call    *32(%rsp)       ; indirect call through circle.base.Area
        movl    (%rbx), %edx    ; %rbx contains the address of "total",
                                ; move the value of "total" to %edx
        addl    %edx, %eax      ; add the value returned by circle.base.Area
        movl    %eax, (%rbx)    ; store the result in "total"
Code generated for “CircleArea” is as simple as:
CircleArea:
        movl    8(%rdi), %edx
        leal    (%rdx,%rdx,2), %eax
        imull   %edx, %eax
        ret
We can recognize a code fragment from the previous example, namely the fragment of the “Area” function which computes areas for circles.
A better example in C: considerations
Why does the new version of our example perform better than the first one? Mostly because it contains no conditional instructions that behave differently depending on the data type.
Conditional instructions, and especially switch statements, are bad for runtime performance because:
- The more cases they contain, the slower they become. Some compilers mitigate this with jump tables, but that is possible only when the case constants are dense.
- They break the CPU’s pipeline. Branch prediction may provide a partial cure.
Our second example does not generate any conditional instructions, as the difference in handling each data type is isolated behind the function pointer. We pay a small cost for the indirect function call, but the overhead is negligible compared with the performance gained by not switching.
As a final note, our new example is now more modular and readable.
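The modularity shows when extending the design. Taking the same hypothetical Triangle used nowhere in the original example, a new shape now needs only its own struct and area function; no existing code is touched (the AbstractEntity definition is repeated here only to keep the snippet self-contained):

```cpp
struct AbstractEntity
{
    int (*Area)(const struct AbstractEntity *_this);
};

// A new shape is just a new struct plus a new function.
// Nothing in the existing shapes or dispatch code changes.
struct Triangle
{
    struct AbstractEntity base;
    int width, height;
};

int TriangleArea(const struct AbstractEntity *_this)
{
    return ((const struct Triangle *)_this)->width *
           ((const struct Triangle *)_this)->height / 2;
}
```

Creating and using a Triangle follows exactly the same pattern as the other shapes: set `base.Area = TriangleArea`, fill in the fields, and call through the function pointer.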
Moving to C++
In C++ all the concepts we had to emulate manually (inheritance, methods) are available natively. As we will see, this has only a small runtime impact.
Let’s rewrite our last example in C++:
class AbstractEntity
{
public:
    virtual int Area() const = 0;
};

class Rectangle: public AbstractEntity
{
public:
    int width, height;

    virtual int Area() const override final
    {
        return height * width;
    }
};

class Circle: public AbstractEntity
{
public:
    int radius;

    virtual int Area() const override final
    {
        return 3 * radius * radius;
    }
};

class Square: public AbstractEntity
{
public:
    int side;

    virtual int Area() const override final
    {
        return side * side;
    }
};
Here we don't implement constructors, protected fields, or getters and setters, to keep the example as close as possible to its C counterpart.
The main program can then be changed as follows:
Circle circle;
Square square;
Rectangle rectangle;

circle.radius = 1;
square.side = 1;
rectangle.height = 1;
rectangle.width = 1;

for(i = 0; i < 100000000; i++)
{
    total += circle.Area();
    total += square.Area();
    total += rectangle.Area();
}
If we run this code we get a rather surprising result: the measured time is zero. What happened here? Looking at the generated assembly we see that the entire loop has been removed. The compiler performed the following optimizations:
- Virtual calls can be replaced by non-virtual calls, as the data type of the variables can be inferred at compile time
- Values of members in circle, square and rectangle can be inferred at compile time
- “Area” functions are declared as “const”, so they don’t modify the object on which they are invoked
- Calls to “Area” function can be inlined
- The only purpose of “Area” functions is to contribute to summing up “total”
- “total” is computed but never read, so it can be optimized out
As a consequence, the entire loop is removed.
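Incidentally, modern C++ lets us request this folding explicitly. The sketch below is our own, not part of the benchmark: it adds constexpr to the area computations (C++14) and uses a smaller iteration count to stay within compile-time evaluation limits, but it reproduces exactly the same loop-to-constant collapse the optimizer performed above:

```cpp
// constexpr variants of the three shapes (assumed names, our own sketch).
struct Circle    { int radius;        constexpr int Area() const { return 3 * radius * radius; } };
struct Square    { int side;          constexpr int Area() const { return side * side; } };
struct Rectangle { int width, height; constexpr int Area() const { return width * height; } };

constexpr int TotalAfterLoop(int iterations)
{
    Circle c{1};
    Square s{1};
    Rectangle r{1, 1};
    int total = 0;
    for (int i = 0; i < iterations; i++)
        total += c.Area() + s.Area() + r.Area();   // 3 + 1 + 1 per iteration
    return total;
}

// Evaluated entirely at compile time: the "loop" is gone before the
// program even runs, mirroring what the optimizer did to our benchmark.
static_assert(TotalAfterLoop(1000) == 5000, "loop folded to a constant");
```

With 100 million iterations the folded result would be 500000000, which is exactly the constant we will see the compiler emit below.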
We can try to force the compiler to keep the loop by using “total” in our final print:
printf("Time: %0.6f total=%d\n", DeltaTime(before, after), total);
However, the output shows that while the computation is now correct (total=500000000), the time is still zero.
Looking at the assembly we can see that the optimizer has unrolled the loop and replaced everything with the final result. A pretty amazing performance:
        movq    total@GOTPCREL(%rip), %rbx  ; load the address of "total"
                                            ; into %rbx
        movl    $500000000, (%rbx)          ; move the result of the whole
                                            ; computation to "total"
CAVEAT: if we need to make sure a value is actually read and written each time it's used, and never optimized out, we must declare the variable as “volatile”. This is the case when a variable represents a hardware register in an embedded system or in a driver. (Note that “volatile” is not a synchronization mechanism: for variables shared among threads, std::atomic or proper locking should be used instead.)
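A minimal sketch of the effect (our own example, with a made-up variable name, not from the benchmark code):

```cpp
// 'volatile' forces every read and write to actually happen. Without it,
// the compiler is free to fold this whole loop into a single store of 1000.
volatile int hw_register_like = 0;   // e.g. a memory-mapped hardware register

void Spin()
{
    for (int i = 0; i < 1000; i++)
        hw_register_like = hw_register_like + 1;   // load + add + store,
                                                   // emitted on every iteration
}
```

(The explicit `x = x + 1` form is used instead of `x++` because compound operations on volatile variables are deprecated in C++20.)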
Declaring “total” as volatile produces the following code, which updates “total” at each step but still does not call the virtual methods, because the compiler knows the calls are unnecessary for all the reasons listed above:
        movl    (%rbx), %eax    ; move "total" to %eax
        addl    $3, %eax        ; add 3 * 1 * 1 to %eax
                                ; (1 is the circle's radius)
        movl    %eax, (%rbx)    ; store "total"
        movl    (%rbx), %eax    ; load "total" again (why? because
                                ; it's declared volatile)
        addl    $1, %eax        ; add 1 * 1 (square) to %eax
        movl    %eax, (%rbx)    ; store "total"
        movl    (%rbx), %eax    ; load "total" again
        addl    $1, %eax        ; add 1 * 1 (rectangle) to %eax
        movl    %eax, (%rbx)    ; store "total"
To force the compiler to use virtual method calls we pass the following flags to GCC:
-fno-devirtualize -fno-devirtualize-speculatively
We also need to modify the main function to make life harder for the optimizer and prevent it from playing clever tricks:
Circle circle;
Square square;
Rectangle rectangle;

circle.radius = 1;
square.side = 1;
rectangle.height = 1;
rectangle.width = 1;

gettimeofday(&before, NULL);

AbstractEntity *e1, *e2, *e3;
e1 = &circle;
e2 = &square;
e3 = &rectangle;

total = 0;
for(i = 0; i < 100000000; i++)
{
    total += e1->Area();
    total += e2->Area();
    total += e3->Area();
}
In this way the optimizer will leave the method calls untouched. If we run the program, we finally get a measurement: the result is similar to, but slightly slower than, the result from the C example based on function pointers (0.42 seconds versus 0.40). Why?
Let’s look at this line
total += e1->Area();
The call (not including the addition to total) gets translated as follows:
        movq    32(%rsp), %rax
        movq    %r14, %rdi
        call    *(%rax)
This looks very similar to the code generated in our previous C example based on function pointers:
        movq    %r14, %rdi
        call    *32(%rsp)
Why does the C++ version include one more instruction than the C version? After all, a virtual method call is just a hidden function pointer call…
The difference resides in a hidden data structure called the vtable, generated by the C++ compiler.
In our C example, the Circle structure itself contained, as its first member, the pointer to the “CircleArea” function to be called; no further indirection was involved. The two assembly instructions we have seen perform the following:
        movq    %r14, %rdi      ; load the pointer to the struct into %rdi
                                ; (parameter to CircleArea)
        call    *32(%rsp)       ; call the "Area" function pointer
In C++, however, there is an additional structure, the vtable, which holds a table of method pointers shared by all instances of the class.
This saves memory when a class has more than one virtual method, since each instance stores a single pointer to the vtable instead of one pointer per method; the vtable can also support run-time type information.
It does, however, add one level of indirection to method calls compared with the pure function pointer solution:
        movq    32(%rsp), %rax  ; load the address of the vtable into %rax
        movq    %r14, %rdi      ; load the address of the instance into %rdi
                                ; ("this" in the called method)
        call    *(%rax)         ; call the method pointer from the vtable
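The space saving mentioned above can be observed directly. In this sketch of ours (the type names, and the Perimeter/Diameter operations, are hypothetical, added only to make the comparison visible), the C-style struct carries one pointer per operation while the C++ class carries a single hidden vtable pointer, regardless of how many virtual methods it has:

```cpp
// C-style dispatch: every instance carries one pointer per operation.
struct WithFunctionPointers
{
    int (*Area)(const WithFunctionPointers *);
    int (*Perimeter)(const WithFunctionPointers *);
    int (*Diameter)(const WithFunctionPointers *);
    int radius;
};

// C++ dispatch: every instance carries a single hidden vtable pointer,
// shared lookup table, however many virtual methods the class declares.
struct WithVtable
{
    virtual int Area() const      { return 3 * radius * radius; }
    virtual int Perimeter() const { return 6 * radius; }
    virtual int Diameter() const  { return 2 * radius; }
    int radius;
};
```

On a typical 64-bit ABI the first struct needs three 8-byte function pointers per instance, while the second needs only one vtable pointer, so `sizeof(WithVtable) < sizeof(WithFunctionPointers)` (exact sizes are implementation-defined).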
If we look at the C++ method implementations we will see they are identical to the functions in the C example.
Implementation of C++ Circle::Area:
_ZNK6Circle4AreaEv:
        movl    8(%rdi), %edx
        leal    (%rdx,%rdx,2), %eax
        imull   %edx, %eax
        ret
Implementation of C CircleArea:
CircleArea:
        movl    8(%rdi), %edx
        leal    (%rdx,%rdx,2), %eax
        imull   %edx, %eax
        ret
Conclusions
We have seen that the implementation of C++ virtual methods is essentially equivalent, in terms of performance, to a C implementation based on function pointers (which is, in turn, more efficient than a C implementation based on explicit type checking).
There is a small overhead (about 5% in our example) due to the usage of vtables. This overhead becomes negligible in real applications (where real methods will not consist of a single line performing a multiplication).
We have also seen that the C++ optimizer is much more aggressive than the C optimizer, often achieving better results than can be obtained by writing the same code in C. This is not black magic but the result of the finer-grained control C++ allows, for instance through modifiers on method declarations (such as const, override and final) that give the optimizer strong hints.
Virtual methods carry a slight performance overhead compared to non-virtual methods (due to the automatically generated vtable lookup and the indirect call through a pointer), so it's advisable to declare methods virtual only when strictly needed. Not only does this make the code faster, it also signals clearly which methods are meant to be resolved at run time.
Next articles
In the following articles we will go through other aspects of the C++ language, look in detail at their impact on runtime performance, and discuss strategies to avoid unwanted and unseen bottlenecks.