How do I determine the number of x86 machine instructions executed in a C program?
Solution 1:
Terminology: what you're asking for is dynamic instruction count. e.g. counting an instruction inside a loop every time it's executed. This is usually roughly correlated with performance, but instructions-per-cycle can vary wildly.
- How many CPU cycles are needed for each assembly instruction?
- What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
Something people also look at is static instruction count (or more usually just code-size, because that's what really matters for instruction-cache footprint, and disk-load times). For variable-length instruction sets like x86, those are correlated but not the same thing. On a RISC with fixed-length instructions, like MIPS or AArch64, it's closer but you still have padding for alignment of the start of functions, for example. That's a totally separate metric. gcc -Os
optimizes for code-size while trying not to sacrifice to much speed.
If you're on Linux, use gcc -O2 foo.c
to compile your code. -O2
doesn't enable auto-vectorization for gcc. (It does for clang). It's probably a good baseline level of optimization that will get rid of stuff in your C code that doesn't actually need to happen, to avoid silly differences between using more or fewer tmp variables to break up a big expression. Maybe use -Og
if you want minimal optimization, or -O0
if you want really dumb braindead code that compiles each statement separately and never keeps anything in registers between statements. (Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?).
Yes, it matters a huge amount how you compile. gcc -O3 -march=native -ffast-math
might use a lot fewer instructions, if it auto-vectorizes a loop.
To stop your code from optimizing away, take an input from a command-line arg, or read it from a volatile
variable. Like volatile int size_volatile = 1234;
int size = size_volatile;
. And return or print a result, because if the program has no side-effects then the most efficient implementation is to just exit immediately.
Then run perf stat ./a.out
. That will use hardware performance counters to give you total instructions executed on behalf of your process. (Along with other counters, like CPU core clock cycles, and some software counters like page-faults
and time in microseconds.)
To count only user-space instructions, use perf stat -e instructions:u ./a.out
. (Or in recent perf versions, perf stat --all-user ./a.out
to apply :u to all events, even the default set.) Each hardware event counter has 2 bits that indicate whether it should be counting events in user, supervisor, or both, so the kernel's perf code doesn't have to run an instruction to stop counters for :u
events or anything like that.
That will still be a very big number even for a simple "hello world" program, like 180k if built normally, because that includes dynamic-linker startup and all the code that runs inside library functions. And CRT startup code that calls your main
, and that makes an exit
system call with main
's return value, if you return instead of calling exit(3)
.
You might statically link your C program to reduce that startup overhead, by compiling with gcc -O2 -static -fno-stack-protector -fno-pie -no-pie
perf
counting instructions:u
seems to be pretty accurate on my Skylake CPU. A statically-linked x86-64 binary that contains only 2 instructions gets 3 counts. Apparently there's one extra instruction being counted in the transition between kernel and user mode in one direction, but that's pretty minor.
$ cat > exit.asm <<EOF
global _start ; hand-written asm to check perf overhead
_start:
mov eax, 231 ; _NR_exit_group
syscall ; exit_group(EDI) (in practice zero)
EOF
$ nasm -felf64 exit.asm && ld -o exit exit.o # static executable, no CRT or libc
$ perf stat -e instructions:u ./exit
Performance counter stats for './exit':
3 instructions:u
0.000651529 seconds time elapsed
# for this 2-instruction hand-written program
Using ld
on its own is somewhat similar to linking with gcc -nostdlib -static
(which also implies -no-pie
; static-pie is a separate thing)
Minimal instruction count for a C program with CRT and libc: about 33k
A statically-linked binary made by the C compiler that calls puts
twice counts 33,202 instructions:u
. I compiled with gcc -O2 -static -fno-stack-protector -fno-pie -no-pie hello.c
.
Seems reasonable for glibc init functions, including stdio, and CRT startup stuff before calling main
. (main
itself only has 8 instructions, which I checked with objdump -drwC -Mintel a.out | less
).
If main just exited without printing, or especially if it called _exit(0)
or exit_group(0)
(the raw system calls, bypassing atexit stuff), you'd have fewer instructions from not using stdio.
Other resources:
-
Number of executed Instructions different for Hello World program Nasm Assembly and C
@MichaelPetch's answer shows how to use an alternate libc (MUSL) that doesn't need startup code to run for its
printf
to work. So you can compile a C program and set itsmain
as the ELF entry point (and call_exit()
instead of returning). -
How can I profile C++ code running on Linux? There are tons of profiling tools for finding hotspots, and expensive functions (including the time spent in functions they call, i.e. stack backtrace profiling). Mostly this isn't about counting instructions, though.
Binary instrumentation tools:
These are the heavy duty tools for counting instructions, including counting only specific kinds of instructions.
-
Intel Pin - A Dynamic Binary Instrumentation Tool
-
Intel® Software Development Emulator (SDE) This is based on PIN, and is handy for things like testing AVX512 code on a dev machine that doesn't support AVX512. (It dynamically recompiles so most instructions run natively, but unsupported instructions call an emulation routine.)
For example,
sde64 -mix -- ./my_program
will print an instruction-mix for your program, with total counts for each different instruction, and breakdowns by categories. See libsvm compiled with AVX vs no AVX for an example of the kind of output.It also gives you a table of total dynamic instruction counts per-function, as well as per-thread and global. SDE mix output doesn't work well on PIE executable, though: it thinks the dynamic linker is the executable (because it is), so compile with
gcc -O2 -no-pie -fno-pie prog.c -o prog
. It still doesn't see theputs
calls ormain
itself in the profile output for a hello world test program, though, and I don't know why not. -
Calculating “FLOP” using Intel® Software Development Emulator (Intel® SDE) An example of using SDE to count certain kinds of instructions, like
vfmadd231pd
.Intel CPUs have HW perf counters for events like
fp_arith_inst_retired.256b_packed_double
, so you can use those to count FLOPs instead. They actually count FMA as 2 events. So if you have an Intel CPU that can run your code natively, you can do that instead withperf stat -e -e fp_arith_inst_retired.256b_packed_double,fp_arith_inst_retired.128b_packed_double,fp_arith_inst_retired.scalar_double
. (And/or the events for single-precision.)But there aren't events for most other specific kinds of instructions, only FP math.
This is all Intel stuff; IDK what AMD has, or any stuff for ISAs other than x86. These are just the tools I've heard of; I'm sure there are lots of things I'm leaving out.
Solution 2:
As I mentioned in my top comments, one way to do this is to write a program that feeds commands to gdb
.
Specifically, the si
command (step ISA instruction).
I couldn't get this to work with pipes, but I was able to get it to work by putting gdb
under a pseudo-tty.
Edit: After thinking about it, I came up with a version that uses ptrace
directly on the target program instead of sending commands to gdb
. It is much faster [100x faster] and [probably] more reliable
So, here's the gdb
based control program. Note that this must be linked with -lutil
.
// gdbctl -- gdb control via pseudo tty
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <time.h>
#include <fcntl.h>
#include <errno.h>
#include <poll.h>
#include <pty.h>
#include <utmp.h>
#include <sys/types.h>
#include <sys/wait.h>
int opt_d; // 1=show debug output
int opt_e; // 1=echo gdb output
int opt_f; // 1=set line buffered output
int opt_x; // si repetition factor
int zpxlvl; // current trace level
int ptypar; // parent PTY fd
int ptycld; // child PTY fd
char name[100]; // child PTY device name
unsigned long long sicount; // single step count
const char *gdb = "(gdb) "; // gdb's prompt string
const char *waitstr; // currently active "wait for" string
char *waitstop[8] = { NULL }; // string that shows run is done
int stopflg; // 1=waitstop seen
char sicmd[100];
char waitbuf[10000]; // large buffer to scan for strings
char *waitdst = waitbuf; // current end position
pid_t pidgdb; // gdb's pid
pid_t pidfin; // stop pid
int status; // gdb's final status
double tvelap; // start time
#ifndef _USE_ZPRT_
#define _USE_ZPRT_ 1
#endif
static inline int
zprtok(int lvl)
{
return (_USE_ZPRT_ && (opt_d >= lvl));
}
#define dbg(_lvl,_fmt...) \
do { \
if (zprtok(_lvl)) \
printf(_fmt); \
} while (0)
// tvgetf -- get high precision time
double
tvgetf(void)
{
struct timespec ts;
double sec;
clock_gettime(CLOCK_REALTIME,&ts);
sec = ts.tv_nsec;
sec /= 1e9;
sec += ts.tv_sec;
return sec;
}
// xstrcat -- concatenate a string
char *
xstrcat(char *dst,const char *src)
{
int chr;
for (chr = *src++; chr != 0; chr = *src++)
*dst++ = chr;
*dst = 0;
return dst;
}
// gdbexit -- check for gdb termination
void
gdbexit(void)
{
// NOTE: this should _not_ happen
do {
if (pidgdb == 0)
break;
pidfin = waitpid(pidgdb,&status,WNOHANG);
if (pidfin == 0)
break;
pidgdb = 0;
printf("gdbexit: WAITPID status=%8.8X\n",status);
exit(8);
} while (0);
}
// gdbwaitpoll -- wait for prompt string
void
gdbwaitpoll(const char *buf)
{
char *cp;
char **wstr;
do {
gdbexit();
if (waitstr == NULL)
break;
// concatenate to big buffer
dbg(2,"BUF '%s'\n",buf);
waitdst = xstrcat(waitdst,buf);
// check for final termination string (e.g. "exited with")
for (wstr = waitstop; *wstr != NULL; ++wstr) {
cp = *wstr;
dbg(2,"TRYSTOP '%s'\n",cp);
cp = strstr(waitbuf,cp);
if (cp != NULL) {
stopflg = 1;
waitstop[0] = NULL;
}
}
// check for the prompt (e.g. "(gdb) ")
cp = strstr(waitbuf,waitstr);
if (cp == NULL)
break;
dbg(1,"HIT on '%s'\n",waitstr);
// got it reset things
waitbuf[0] = 0;
waitdst = waitbuf;
waitstr = NULL;
} while (0);
}
// gdbrecv -- process input from gdb
void
gdbrecv(void)
{
struct pollfd fds[1];
struct pollfd *fd = &fds[0];
int xlen;
char buf[1000];
fd->fd = ptypar;
fd->events = POLLIN;
while (1) {
gdbexit();
#if 1
int nfd = poll(fds,1,1);
if (nfd <= 0) {
if (waitstr != NULL)
continue;
break;
}
#endif
// get a chunk of data
xlen = read(ptypar,buf,sizeof(buf) - 1);
dbg(1,"gdbrecv: READ xlen=%d\n",xlen);
if (xlen < 0) {
printf("ERR: %s\n",strerror(errno));
break;
}
// wait until we've drained every bit of data
if (xlen == 0) {
if (waitstr != NULL)
continue;
break;
}
// add EOS char
buf[xlen] = 0;
dbg(1,"ECHO: ");
if (opt_e)
fwrite(buf,1,xlen,stdout);
// wait for our prompt
gdbwaitpoll(buf);
}
}
// gdbwaitfor -- set up prompt string to wait for
void
gdbwaitfor(const char *wstr,int loopflg)
{
waitstr = wstr;
if (waitstr != NULL)
dbg(1,"WAITFOR: '%s'\n",waitstr);
while ((waitstr != NULL) && loopflg && (pidgdb != 0))
gdbrecv();
}
// gdbcmd -- send command to gdb
void
gdbcmd(const char *str,const char *wstr)
{
int rlen = strlen(str);
int xlen = 0;
#if 0
printf("CMD/%d: %s",rlen,str);
#endif
gdbwaitfor(wstr,0);
for (; rlen > 0; rlen -= xlen, str += xlen) {
gdbexit();
xlen = write(ptypar,str,rlen);
if (xlen <= 0)
break;
dbg(1,"RET: rlen=%d xlen=%d\n",rlen,xlen);
gdbrecv();
}
dbg(1,"END/%d\n",xlen);
}
// gdbctl -- control gdb
void
gdbctl(int argc,char **argv)
{
// this is the optimal number for speed
if (opt_x < 0)
opt_x = 100;
if (opt_x <= 1) {
opt_x = 1;
sprintf(sicmd,"si\n");
}
else
sprintf(sicmd,"si %d\n",opt_x);
// create pseudo TTY
openpty(&ptypar,&ptycld,name,NULL,NULL);
pidgdb = fork();
// launch gdb
if (pidgdb == 0) {
//sleep(1);
login_tty(ptycld);
close(ptypar);
char *gargs[8];
char **gdst = gargs;
*gdst++ = "gdb";
*gdst++ = "-n";
*gdst++ = "-q";
*gdst++ = *argv;
*gdst = NULL;
execvp(gargs[0],gargs);
exit(9);
}
// make input from gdb non-blocking
#if 1
int flags = fcntl(ptypar,F_GETFL,0);
flags |= O_NONBLOCK;
fcntl(ptypar,F_SETFL,flags);
#endif
// wait
char **wstr = waitstop;
*wstr++ = "exited with code";
*wstr++ = "Program received signal";
*wstr++ = "Program terminated with signal";
*wstr = NULL;
printf("TTY: %s\n",name);
printf("SI: %d\n",opt_x);
printf("GDB: %d\n",pidgdb);
#if 1
sleep(2);
#endif
gdbwaitfor(gdb,1);
// prevent kill or quit commands from hanging
gdbcmd("set confirm off\n",gdb);
// set breakpoint at earliest point
#if 1
gdbcmd("b _start\n",gdb);
#else
gdbcmd("b main\n",gdb);
#endif
// skip over target program name
--argc;
++argv;
// add extra arguments
do {
if (argc <= 0)
break;
char xargs[1000];
char *xdst = xargs;
xdst += sprintf(xdst,"set args");
for (int avidx = 0; avidx < argc; ++avidx, ++argv) {
printf("XARGS: '%s'\n",*argv);
xdst += sprintf(xdst," %s",*argv);
}
xdst += sprintf(xdst,"\n");
gdbcmd(xargs,gdb);
} while (0);
// run the program -- it will stop at the breakpoint we set
gdbcmd("run\n",gdb);
// disable the breakpoint for speed
gdbcmd("disable\n",gdb);
tvelap = tvgetf();
while (1) {
// single step an ISA instruction
gdbcmd(sicmd,gdb);
// check for gdb aborting
if (pidgdb == 0)
break;
// check for target program exiting
if (stopflg)
break;
// advance count of ISA instructions
sicount += opt_x;
}
// get elapsed time
tvelap = tvgetf() - tvelap;
// tell gdb to quit
gdbcmd("quit\n",NULL);
// wait for gdb to completely terminate
if (pidgdb != 0) {
pidfin = waitpid(pidgdb,&status,0);
pidgdb = 0;
}
// close PTY units
close(ptypar);
close(ptycld);
}
// main -- main program
int
main(int argc,char **argv)
{
char *cp;
--argc;
++argv;
for (; argc > 0; --argc, ++argv) {
cp = *argv;
if (*cp != '-')
break;
switch (cp[1]) {
case 'd':
cp += 2;
opt_d = (*cp != 0) ? atoi(cp) : 1;
break;
case 'e':
cp += 2;
opt_e = (*cp != 0) ? atoi(cp) : 1;
break;
case 'f':
cp += 2;
opt_f = (*cp != 0) ? atoi(cp) : 1;
break;
case 'x':
cp += 2;
opt_x = (*cp != 0) ? atoi(cp) : -1;
break;
}
}
if (argc == 0) {
printf("specify target program\n");
exit(1);
}
// set output line buffering
switch (opt_f) {
case 0:
break;
case 1:
setlinebuf(stdout);
break;
default:
setbuf(stdout,NULL);
break;
}
gdbctl(argc,argv);
// print statistics
printf("%llu instructions -- ELAPSED: %.9f -- %.3f insts / sec\n",
sicount,tvelap,(double) sicount / tvelap);
return 0;
}
Here's a sample test program:
// tgt -- sample slave/test program
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>
int opt_S;
int glob;
void
dumb(int x)
{
glob += x;
}
int
spin(int lim)
{
int x;
for (x = 0; x < lim; ++x)
dumb(x);
return x;
}
int
main(int argc,char **argv)
{
char *cp;
int lim;
int *ptr;
int code;
--argc;
++argv;
for (; argc > 0; --argc, ++argv) {
cp = *argv;
if (*cp != '-')
break;
switch (cp[1]) {
case 'S':
opt_S = cp[2];
break;
}
}
switch (opt_S) {
case 'f': // cause segfault
ptr = NULL;
*ptr = 23;
code = 91;
break;
case 'a': // abort
abort();
code = 92;
break;
case 't': // terminate us
signal(SIGTERM,SIG_DFL);
#if 0
kill(getpid(),SIGTERM);
#else
raise(SIGTERM);
#endif
code = 93;
break;
default:
code = 0;
break;
}
if (argc > 0)
lim = atoi(argv[0]);
else
lim = 10000;
lim = spin(lim);
lim &= 0x7F;
if (code == 0)
code = lim;
return code;
}
Here's a version that uses ptrace
that is much faster than the version that uses gdb
:
// ptxctl -- control via ptrace
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <time.h>
//#include <fcntl.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/ptrace.h>
#include <sys/user.h>
int opt_d; // 1=show debug output
int opt_e; // 1=echo progress
int opt_f; // 1=set line buffered output
unsigned long long sicount; // single step count
int stopflg; // 1=stop seen
pid_t pidtgt; // gdb's pid
pid_t pidfin; // stop pid
int status; // target's final status
char statbuf[1000]; // status buffer
int coredump; // 1=core dumped
int zpxlvl; // current trace level
int regsidx; // regs index
struct user_regs_struct regs[2]; // current regs
#define REGSALL(_cmd) \
_cmd(r15) \
_cmd(r14) \
_cmd(r13) \
_cmd(r12) \
_cmd(rbp) \
_cmd(rbx) \
_cmd(r11) \
_cmd(r10) \
_cmd(r9) \
_cmd(r8) \
_cmd(rax) \
_cmd(rcx) \
_cmd(rdx) \
_cmd(rsi) \
_cmd(rdi) \
_cmd(orig_rax) \
/*_cmd(rip)*/ \
_cmd(cs) \
_cmd(eflags) \
_cmd(rsp) \
_cmd(ss) \
_cmd(fs_base) \
_cmd(gs_base) \
_cmd(ds) \
_cmd(es) \
_cmd(fs) \
_cmd(gs)
#define REGSDIF(_reg) \
if (cur->_reg != prev->_reg) \
printf(" %16.16llX " #_reg "\n",cur->_reg);
double tvelap; // start time
#ifndef _USE_ZPRT_
#define _USE_ZPRT_ 1
#endif
static inline int
zprtok(int lvl)
{
return (_USE_ZPRT_ && (opt_d >= lvl));
}
#define dbg(_lvl,_fmt...) \
do { \
if (zprtok(_lvl)) \
printf(_fmt); \
} while (0)
// tvgetf -- get high precision time
double
tvgetf(void)
{
struct timespec ts;
double sec;
clock_gettime(CLOCK_REALTIME,&ts);
sec = ts.tv_nsec;
sec /= 1e9;
sec += ts.tv_sec;
return sec;
}
// ptxstatus -- decode status
char *
ptxstatus(int status)
{
int zflg;
int signo;
char *bp;
bp = statbuf;
*bp = 0;
// NOTE: do _not_ use zprtok here -- we need to force this on final
zflg = (opt_d >= zpxlvl);
do {
if (zflg)
bp += sprintf(bp,"%8.8X",status);
if (WIFSTOPPED(status)) {
signo = WSTOPSIG(status);
if (zflg)
bp += sprintf(bp," WIFSTOPPED signo=%d",signo);
switch (signo) {
case SIGTRAP:
break;
default:
stopflg = 1;
break;
}
}
if (WIFEXITED(status)) {
if (zflg)
bp += sprintf(bp," WIFEXITED code=%d",WEXITSTATUS(status));
stopflg = 1;
}
if (WIFSIGNALED(status)) {
signo = WTERMSIG(status);
if (zflg)
bp += sprintf(bp," WIFSIGNALED signo=%d",signo);
if (WCOREDUMP(status)) {
coredump = 1;
stopflg = 1;
if (zflg)
bp += sprintf(bp," -- core dumped");
}
}
} while (0);
return statbuf;
}
// ptxcmd -- issue ptrace command
long
ptxcmd(enum __ptrace_request cmd,void *addr,void *data)
{
long ret;
dbg(zpxlvl,"ptxcmd: ENTER cmd=%d addr=%p data=%p\n",cmd,addr,data);
ret = ptrace(cmd,pidtgt,addr,data);
dbg(zpxlvl,"ptxcmd: EXIT ret=%ld\n",ret);
return ret;
}
// ptxwait -- wait for target to be stopped
void
ptxwait(const char *reason)
{
dbg(zpxlvl,"ptxwait: %s pidtgt=%d\n",reason,pidtgt);
pidfin = waitpid(pidtgt,&status,0);
// NOTE: we need this to decide on stop status
ptxstatus(status);
dbg(zpxlvl,"ptxwait: %s status=(%s) pidfin=%d\n",
reason,statbuf,pidfin);
}
// ptxwhere -- show where we are
void
ptxwhere(int initflg)
{
struct user_regs_struct *cur;
struct user_regs_struct *prev;
do {
prev = ®s[regsidx];
if (initflg) {
ptxcmd(PTRACE_GETREGS,NULL,prev);
break;
}
regsidx = ! regsidx;
cur = ®s[regsidx];
ptxcmd(PTRACE_GETREGS,NULL,cur);
printf("RIP: %16.16llX (%llu)\n",cur->rip,sicount);
if (opt_e < 2)
break;
REGSALL(REGSDIF);
} while (0);
}
// ptxctl -- control ptrace
void
ptxctl(int argc,char **argv)
{
pidtgt = fork();
// launch target program
if (pidtgt == 0) {
pidtgt = getpid();
ptxcmd(PTRACE_TRACEME,NULL,NULL);
execvp(argv[0],argv);
exit(9);
}
#if 0
sleep(1);
#endif
zpxlvl = 1;
#if 0
ptxwait("SETUP");
#endif
// attach to tracee
// NOTE: we do _not_ need to do this because child has done TRACEME
#if 0
dbg(zpxlvl,"ptxctl: PREATTACH\n");
ptxcmd(PTRACE_ATTACH,NULL,NULL);
dbg(zpxlvl,"ptxctl: POSTATTACH\n");
#endif
// wait for initial stop
#if 1
ptxwait("INIT");
#endif
if (opt_e)
ptxwhere(1);
dbg(zpxlvl,"ptxctl: START\n");
tvelap = tvgetf();
zpxlvl = 2;
while (1) {
dbg(zpxlvl,"ptxctl: SINGLESTEP\n");
ptxcmd(PTRACE_SINGLESTEP,NULL,NULL);
ptxwait("WAIT");
sicount += 1;
// show where we are
if (opt_e)
ptxwhere(0);
dbg(zpxlvl,"ptxctl: STEPCOUNT sicount=%lld\n",sicount);
// stop when target terminates
if (stopflg)
break;
}
zpxlvl = 0;
ptxstatus(status);
printf("ptxctl: STATUS (%s) pidfin=%d\n",statbuf,pidfin);
// get elapsed time
tvelap = tvgetf() - tvelap;
}
// main -- main program
int
main(int argc,char **argv)
{
char *cp;
--argc;
++argv;
for (; argc > 0; --argc, ++argv) {
cp = *argv;
if (*cp != '-')
break;
switch (cp[1]) {
case 'd':
cp += 2;
opt_d = (*cp != 0) ? atoi(cp) : 1;
break;
case 'e':
cp += 2;
opt_e = (*cp != 0) ? atoi(cp) : 1;
break;
case 'f':
cp += 2;
opt_f = (*cp != 0) ? atoi(cp) : 1;
break;
}
}
if (argc == 0) {
printf("specify target program\n");
exit(1);
}
// set output line buffering
switch (opt_f) {
case 0:
break;
case 1:
setlinebuf(stdout);
break;
default:
setbuf(stdout,NULL);
break;
}
ptxctl(argc,argv);
// print statistics
printf("%llu instructions -- ELAPSED: %.9f -- %.3f insts / sec\n",
sicount,tvelap,(double) sicount / tvelap);
return 0;
}