Disable AVX-optimized functions in glibc (LD_HWCAP_MASK, /etc/ld.so.nohwcap) for valgrind & gdb record

Modern x86_64 linux with glibc will detect that CPU has support of AVX extension and will switch many string functions from generic implementation to AVX-optimized version (with help of ifunc dispatchers: 1, 2).

This feature can be good for performance, but it prevents several tool like valgrind (older libVEXs, before valgrind-3.8) and gdb's "target record" (Reverse Execution) from working correctly (Ubuntu "Z" 17.04 beta, gdb 7.12.50.20170207-0ubuntu2, gcc 6.3.0-8ubuntu1 20170221, Ubuntu GLIBC 2.24-7ubuntu2):

$ cat a.c
#include <string.h>
#define N 1000
int main(){
        char src[N], dst[N];
        memcpy(dst, src, N);
        return 0;
}
$ gcc a.c -o a -fno-builtin
$ gdb -q ./a
Reading symbols from ./a...(no debugging symbols found)...done.
(gdb) start
Temporary breakpoint 1 at 0x724
Starting program: /home/user/src/a

Temporary breakpoint 1, 0x0000555555554724 in main ()
(gdb) record
(gdb) c
Continuing.
Process record does not support instruction 0xc5 at address 0x7ffff7b60d31.
Process record: failed to record execution log.

Program stopped.
__memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:416
416             VMOVU   (%rsi), %VEC(4)
(gdb) x/i $pc
=> 0x7ffff7b60d31 <__memmove_avx_unaligned_erms+529>:   vmovdqu (%rsi),%ymm4

There is error message "Process record does not support instruction 0xc5" from gdb's implementation of "target record", because AVX instructions are not supported by the record/replay engine (sometimes the problem is detected on _dl_runtime_resolve_avx function): https://sourceware.org/ml/gdb/2016-08/msg00028.html "some AVX instructions are not supported by process record", https://bugs.launchpad.net/ubuntu/+source/gdb/+bug/1573786, https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=836802, https://bugzilla.redhat.com/show_bug.cgi?id=1136403

Solution proposed in https://sourceware.org/ml/gdb/2016-08/msg00028.html "You can recompile libc (thus ld.so), or hack __init_cpu_features and thus __cpu_features at runtime (see e.g. strcmp)." or set LD_BIND_NOW=1, but recompiled glibc still has AVX, and ld bind-now doesn't help.

I heard that there are /etc/ld.so.nohwcap and LD_HWCAP_MASK configurations in glibc. Can they be used to disable ifunc dispatching to AVX-optimized string functions in glibc?

How does glibc (rtld?) detects AVX, using cpuid, with /proc/cpuinfo (probably not), or HWCAP aux (LD_SHOW_AUXV=1 /bin/echo |grep HWCAP command gives AT_HWCAP: bfebfbff)?

It looks like there is a nice workaround for this implemented in recent versions of glibc: a "tunables" feature that guides selection of optimized string functions. You can find a general overview of this feature here and the relevant code inside glibc in ifunc-impl-list.c.

Here's how I figured it out. First, I took the address being complained about by gdb:

Process record does not support instruction 0xc5 at address 0x7ffff75c65d4.

I then looked it up in the table of shared libraries:

(gdb) info shared
From                To                  Syms Read   Shared Object Library
0x00007ffff7fd3090  0x00007ffff7ff3130  Yes         /lib64/ld-linux-x86-64.so.2
0x00007ffff76366b0  0x00007ffff766b52e  Yes         /usr/lib/x86_64-linux-gnu/libubsan.so.1
0x00007ffff746a320  0x00007ffff75d9cab  Yes         /lib/x86_64-linux-gnu/libc.so.6
...

You can see that this address is within glibc. But what function, specifically?

(gdb) disassemble 0x7ffff75c65d4
Dump of assembler code for function __strcmp_avx2:
   0x00007ffff75c65d0 <+0>:     mov    %edi,%eax
   0x00007ffff75c65d2 <+2>:     xor    %edx,%edx
=> 0x00007ffff75c65d4 <+4>:     vpxor  %ymm7,%ymm7,%ymm7

I can look in ifunc-impl-list.c to find the code that controls selecting the avx2 version:

  IFUNC_IMPL (i, name, strcmp,
          IFUNC_IMPL_ADD (array, i, strcmp,
                  HAS_ARCH_FEATURE (AVX2_Usable),
                  __strcmp_avx2)
          IFUNC_IMPL_ADD (array, i, strcmp, HAS_CPU_FEATURE (SSE4_2),
                  __strcmp_sse42)
          IFUNC_IMPL_ADD (array, i, strcmp, HAS_CPU_FEATURE (SSSE3),
                  __strcmp_ssse3)
          IFUNC_IMPL_ADD (array, i, strcmp, 1, __strcmp_sse2_unaligned)
          IFUNC_IMPL_ADD (array, i, strcmp, 1, __strcmp_sse2))

It looks like AVX2_Usable is the feature to disable. Let's rerun gdb accordingly:

GLIBC_TUNABLES=glibc.cpu.hwcaps=-AVX2_Usable gdb...

On this iteration it complained about __memmove_avx_unaligned_erms, which appeared to be enabled by AVX_Usable - but I found another path in ifunc-memmove.h enabled by AVX_Fast_Unaligned_Load. Back to the drawing board:

GLIBC_TUNABLES=glibc.cpu.hwcaps=-AVX2_Usable,-AVX_Fast_Unaligned_Load gdb ...

On this final round I discovered a rdtscp instruction in the ASAN shared library, so I recompiled without the address sanitizer and at last, it worked.

In summary: with some work it's possible to disable these instructions from the command line and use gdb's record feature without severe hacks.

There does not seem a straightforward runtime method to patch feature detection. This detection happens rather early in the dynamic linker (ld.so).

Binary patching the linker seems the easiest method at the moment. @osgx described one method where a jump is overwritten. Another approach is just to fake the cpuid result. Normally cpuid(eax=0) returns the highest supported function in eax while the manufacturer IDs are returned in registers ebx, ecx and edx. We have this snippet in glibc 2.25 sysdeps/x86/cpu-features.c:

__cpuid (0, cpu_features->max_cpuid, ebx, ecx, edx);

/* This spells out "GenuineIntel".  */
if (ebx == 0x756e6547 && ecx == 0x6c65746e && edx == 0x49656e69)
  {
      /* feature detection for various Intel CPUs */
  }
/* another case for AMD */
else
  {
    kind = arch_kind_other;
    get_common_indeces (cpu_features, NULL, NULL, NULL, NULL);
  }

The __cpuid line translates to these instructions in /lib/ld-linux-x86-64.so.2 (/lib/ld-2.25.so):

172a8:       31 c0                   xor    eax,eax
172aa:       c7 44 24 38 00 00 00    mov    DWORD PTR [rsp+0x38],0x0
172b1:       00 
172b2:       c7 44 24 3c 00 00 00    mov    DWORD PTR [rsp+0x3c],0x0
172b9:       00 
172ba:       0f a2                   cpuid

So rather than patching branches, we could as well change the cpuid into a nop instruction which would result in invocation of the last else branch (as the registers will not contain "GenuineIntel"). Since initially eax=0, cpu_features->max_cpuid will also be 0 and the if (cpu_features->max_cpuid >= 7) will also be bypassed.

Binary patching cpuid(eax=0) by nop this can be done with this utility (works for both x86 and x86-64):

#!/usr/bin/env python
import re
import sys

infile, outfile = sys.argv[1:]
d = open(infile, 'rb').read()
# Match CPUID(eax=0), "xor eax,eax" followed closely by "cpuid"
o = re.sub(b'(\x31\xc0.{0,32}?)\x0f\xa2', b'\\1\x66\x90', d)
assert d != o
open(outfile, 'wb').write(o)

An equivalent Perl variant, -0777 ensures that the file is read at once instead of separating records at line feeds:

perl -0777 -pe 's/\x31\xc0.{0,32}?\K\x0f\xa2/\x66\x90/' < /lib/ld-linux-x86-64.so.2 > ld-linux-x86-64-patched.so.2
# Verify result, should display "Success"
cmp -s /lib/ld-linux-x86-64.so.2 ld-linux-x86-64-patched.so.2 && echo 'Not patched' || echo Success

That was the easy part. Now, I did not want to replace the system-wide dynamic linker, but execute only one particular program with this linker. Sure, that can be done with ./ld-linux-x86-64-patched.so.2 ./a, but the naive gdb invocations failed to set breakpoints:

$ gdb -q -ex "set exec-wrapper ./ld-linux-x86-64-patched.so.2" -ex start ./a
Reading symbols from ./a...done.
Temporary breakpoint 1 at 0x400502: file a.c, line 5.
Starting program: /tmp/a 
During startup program exited normally.
(gdb) quit
$ gdb -q -ex start --args ./ld-linux-x86-64-patched.so.2 ./a
Reading symbols from ./ld-linux-x86-64-patched.so.2...(no debugging symbols found)...done.
Function "main" not defined.
Temporary breakpoint 1 (main) pending.
Starting program: /tmp/ld-linux-x86-64-patched.so.2 ./a
[Inferior 1 (process 27418) exited normally]
(gdb) quit

A manual workaround is described in How to debug program with custom elf interpreter? It works, but it is unfortunately a manual action using add-symbol-file. It should be possible to automate it a bit using GDB Catchpoints though.

An alternative approach that does not binary linking is LD_PRELOADing a library that defines custom routines for memcpy, memove, etc. This will then take precedence over the glibc routines. The full list of functions is available in sysdeps/x86_64/multiarch/ifunc-impl-list.c. Current HEAD has more symbols compared to the glibc 2.25 release, in total (grep -Po 'IFUNC_IMPL \(i, name, \K[^,]+' sysdeps/x86_64/multiarch/ifunc-impl-list.c):

memchr, memcmp, __memmove_chk, memmove, memrchr, __memset_chk, memset, rawmemchr, strlen, strnlen, stpncpy, stpcpy, strcasecmp, strcasecmp_l, strcat, strchr, strchrnul, strrchr, strcmp, strcpy, strcspn, strncasecmp, strncasecmp_l, strncat, strncpy, strpbrk, strspn, strstr, wcschr, wcsrchr, wcscpy, wcslen, wcsnlen, wmemchr, wmemcmp, wmemset, __memcpy_chk, memcpy, __mempcpy_chk, mempcpy, strncmp, __wmemset_chk,

I encountered this problem recently as well, and ended up solving it using dynamic CPUID faulting to interrupt execution of the CPUID instruction and override its result, which avoids touching glibc or the dynamic linker. This requires processor support for CPUID faulting (Ivy Bridge+) as well as Linux kernel support (4.12+) for exposing it to userspace through the ARCH_GET_CPUID and ARCH_SET_CPUID subfunctions of arch_prctl(). When this feature is enabled, a SIGSEGV signal will be delivered on each execution of CPUID, allowing a signal handler can emulate execution of the instruction and override the result.

The full solution is a bit involved since I also need to interpose the dynamic linker, because hardware capability detection was moved there starting with glibc 2.26+. I've uploaded the full solution online at https://github.com/ddcc/libcpuidoverride .

Not the best or complete solution, just a smallest bit-editing kludge to allow valgrind and gdb record for the my task.

Lekensteyn asks:

how to mask out AVX/SSE without recompiling glibc

I did full rebuild of unmodified glibc, which is rather easy in debian and ubuntu: just sudo apt-get source glibc, sudo apt-get build-dep glibc and cd glibc-*/; dpkg-buildpackage -us -uc (manual to get the ld.so without stripped debugging information.

Then I did binary (bit) patching of the output ld.so file, in the function used by __get_cpu_features. Target function was compiled from get_common_indeces of source file sysdeps/x86/cpu-features.c under the name of get_common_indeces.constprop.1 (it is just next after the __get_cpu_features in the binary code). It has several cpuids, first one is cpuid eax=1 "Processor Info and Feature Bits"; and later there is check "jle 0x6" and jump down around the code "cpuid eax=7 ecx=0 Extended Features" just to get AVX2 status. There is the code which was compiled into this logic:

get_common_indeces (struct cpu_features *cpu_features,
            unsigned int *family, unsigned int *model,
            unsigned int *extended_model, unsigned int *stepping)
{ ...
  if (cpu_features->max_cpuid >= 7)
    __cpuid_count (7, 0,
           cpu_features->cpuid[COMMON_CPUID_INDEX_7].eax,
           cpu_features->cpuid[COMMON_CPUID_INDEX_7].ebx,
           cpu_features->cpuid[COMMON_CPUID_INDEX_7].ecx,
           cpu_features->cpuid[COMMON_CPUID_INDEX_7].edx);

The cpu_features->max_cpuid was filled in init_cpu_features of the same file in __cpuid (0, cpu_features->max_cpuid, ebx, ecx, edx); line. It was easier to disable the if statement by replacing jle after cmp 0x6 with jg (byte 0x7e to 0x7f). (Actually this binary patch was reapplied manually to the __get_cpu_features function of real system ld-linux.so.2 - first jle before mov 7 eax; xor ecx,ecx; cpuid changed into jg.)

Recompiled package and modified ld.so were not installed into the system; I used commandline syntax of ld.so ./my_program (or mv ld.so /some/short/path.so and patchelf --set-interpreter ./my_program).

Disable AVX-optimized functions in glibc (LD_HWCAP_MASK, /etc/ld.so.nohwcap) for valgrind & gdb record

Related

Recent Posts