How to access a char array and change lower case letters to upper case, and vice versa
Solution 1:
Variations of this question get asked all the time. This version of the problem (requiring conditional behaviour beyond just if(isalpha(c)) c|=0x20;
)) made the problem complex enough that it wasn't immediately obvious how to do it efficiently.
It turns out that xor
wasn't hard to think of, and converting this code to unconditionally upcase or downcase only requires a simple change from xor 0x20
to and ~0x20
or or 0x20
. (Simplifying a bit more is possible, too.)
Here's how I'd do it with an attempt at optimally efficient asm. I even included a version with SIMD vectors, and another version of the byte loop using the branchless idea I got from vectorizing it.
Reading this answer is probably only useful once you understand the basic principles involved in solving this with not-so-optimized code. OTOH, there are very few operations actually needed, so there's not much code to grok. And I did comment it heavily. There are many helpful links in the x86 tag wiki, from tutorials to reference guides to performance tuning.
Converting between lower and upper case alphabetic ASCII characters only requires setting or clearing the 0x20
bit, because the ASCII character set is laid out with the ranges 32 from each other, and not crossing a mod32 boundary.
For each byte:
- make a copy and unconditionally OR it with 0x20
- check if it's between
'a'
and'z'
- if so, flip the ASCII alphabetic case bit using
xor
and store the result back into the array.
Doing the ASCII isalpha(3)
test this way is safe: The only source bytes that end up in the 'a'
..'z'
range from setting that bit are the upper-case alphabetic characters. It's just math that works for any two equal-sized ranges that don't cross a %32
boundary. (Or a %64
boundary if the relevant bit was 0x40
, for example).
To do the compare even more efficiently, I use the unsigned-compare trick so there's only one conditional branch inside the loop (other than the loop condition itself). See the comments in the code for an explanation.
One byte at a time branching on an efficient range-check for alphabetic char detection
/******** Untested. ************/
// ASCII characters are flipped to the opposite case (upper <-> lower)
// non-ASCII characters are left unchanged
void changeCase (char char_array[], int array_size ) {
__asm{
// BEGIN YOUR CODE HERE
mov esi, char_array; // MSVC inline asm requires these potentially-redundant copies :(
mov ecx, array_size;
test ecx,ecx; // return if(size <= 0)
jle early_out;
next_char:
movzx eax, byte ptr [esi]; // load the current character
mov edx, eax; // save a copy to maybe flip + store
// check if the character is alphabetic or not
// there are two equal-size ranges of characters: one with 0x20 set, and one without
or al, 0x20; // set 0x20 and then just check that lowercase range
// unsigned compare trick: 0 <= n < high can be done with one unsigned compare instead of two signed compares
// low < n < high can be done by shifting the range first
sub al, 'a'; // if al is less than 'a', it will become a large unsigned number
cmp al, 'z'-'a';
ja non_alpha; // conditionally skip the flip & store
xor dl, 0x20; // toggle the ASCII case bit
mov [esi], dl;
// xor [esi], 0x20 // saves the mov earlier, but is otherwise slower
non_alpha:
inc esi;
dec ecx;
jz next_char;
early_out:
// END YOUR CODE HERE
}
}
This code might be more readable if some of the "design doc" stuff was in a block outside the code. It clutters things up a lot, and makes it look like there's a lot of code, but really there are very few instructions. (They're just hard to explain with short comments. Commenting code is tricky: comments that are too obvious are just clutter and take time away from reading the code and the useful comments.)
Vectorized
Actually for x86 I'd use SSE or AVX to do 16B at a time, doing the same algorithm, but doing the comparisons with two pcmpgtb
. And of course unconditionally storing the results, so an array of all non-alphabetic characters would still be dirtied in the cache, using more memory bandwidth.
There's no unsigned SSE compare, but we can still range-shift the range we're looking for down to the bottom. There are no values less than -128
, so in a signed compare it works the way 0
does in an unsigned compare.
To do this, subtract 128
. (or add, or xor (carryless add); there's nowhere for the carry / borrow to go). This can be done in the same operation as subtracting 'a'
.
Then use the compare result as a mask to zero out bytes in a vector of 0x20
, so only the alphabetic characters get XORed with 0x20. (0 is the identity element for XOR/add/sub, which is often really handy for SIMD conditionals).
See also a strtoupper
version that has been tested, and code to call it in a loop, including handling of non-multiple-of-16 inputs, on implicit-length C strings (searching for the terminating 0 on the fly).
#include <immintrin.h>
// Call this function in a loop, with scalar cleanup. (Not implemented, since it's the same as any other vector loop.)
// Flip the case of all alphabetic ASCII bytes in src
__m128i inline flipcase(__m128i src) {
// subtract 'a'+128, so the alphabetic characters range from -128 to -128+25 (-128+'z'-'a')
// note that adding 128 and subtracting 128 are the same thing for 8bit integers.
// There's nowhere for the carry to go, so it's just xor (carryless add), flipping the high bit
__m128i lcase = _mm_or_si128(src, _mm_set1_epi8(0x20));
__m128i rangeshift= _mm_sub_epi8(lcase, _mm_set1_epi8('a'+128));
__m128i non_alpha = _mm_cmpgt_epi8(rangeshift, _mm_set1_epi8(-128 + 25)); // 0:alphabetic -1:non-alphabetic
__m128i flip = _mm_andnot_si128(non_alpha, _mm_set1_epi8(0x20)); // 0x20:alpha 0:non-alpha
return _mm_xor_si128(src, flip);
// just mask the XOR-mask so non-alphabetic elements are XORed with 0 instead of 0x20
// XOR's identity value is 0, same as for addition
}
This compiles to nice code, even without AVX, with only one extra movdqa
to save a copy of a register. See the godbolt link for two earlier versions (one using two compares to keep it simple, another using pblendvb
before I remembered to mask the vector of 0x20
s instead of the result.)
flipcase:
movdqa xmm2, XMMWORD PTR .LC0[rip] ; 0x20
movdqa xmm1, xmm0
por xmm1, xmm2
psubb xmm1, XMMWORD PTR .LC1[rip] ; -31
pcmpgtb xmm1, XMMWORD PTR .LC2[rip] ; -103
pandn xmm1, xmm2
pxor xmm0, xmm1
ret
section .rodata
.LC0: times 16 db 32
.LC1: times 16 db -31
.LC2: times 16 db -103
This same idea of using a branchless test would also work for the byte loop:
mov esi, char_array;
mov ecx, array_size;
test ecx,ecx; // return if(size <= 0)
jle .early_out;
ALIGN 16 ; really only need align 8 here, since the next 4 instructions are all 2 bytes each (because op al, imm8 insns have a special encoding)
.next_char:
movzx eax, byte ptr [esi]; // load the current character
mov edx, eax;
// check if the character is alphabetic or not
or al, 0x20;
sub al, 'a';
cmp al, 'z'-'a'; // unsigned compare trick: 'a' <= al <= 'z'
setna al; // 0:non-alpha 1:alpha (not above)
shl al, 5; // 0:non-alpha 0x20:alpha
xor dl, al; // conditionally toggle the ASCII case bit
mov [esi], dl; // unconditionally store
inc esi;
dec ecx; // for AMD CPUs, or older Intel, it would be better to compare esi against an end pointer, since cmp/jz can fuse but dec can't. This saves an add ecx, esi outside the loop
jz .next_char;
.early_out:
For 64bit code, just use rsi
instead of esi
. Everything else is the same.
Apparently MSVC inline asm doesn't allow .label
local-symbol names. I changed them for the first version (with conditional branch), but not this one.
Using movzx eax, byte [esi]
is better than mov al, [esi]
, avoiding a loop-carried false dependency on AMD, and Intel Haswell and later, and Silvermont-family. movzx
isn't quite as cheap as a load on older AMD. (It is on Intel, and AMD Ryzen at least, one uop that only uses a load port, not an ALU port). Why doesn't GCC use partial registers?
Operating on al
after that is still ok. There's no partial-register stall (or extra instructions to avoid it) because we aren't reading eax
after setcc
writes al
. (There is no setcc r/m32
, only r/m8
, unfortunately).
I have to wonder what a professor would think if anyone handed in code like this for an assignment like that. :P I doubt even a smart compiler would use that setcc
/ shift
trick unless you led the compiler towards it. (Maybe unsigned mask = (tmp>='a' && tmp<='z'); mask <<= 5; a[i] ^= mask;
or something.) Compilers do know about the unsigned-compare trick, but gcc doesn't use it in some cases for non-compile-time-constant range checks, even when it can prove that the range is small enough.
Solution 2:
For clarity's sake, I'll just use pure assembly and assume that...
-
char_array
is a 32-bit pointer at[ebp+8]
. -
array_size
is a two's complement 32-bit number at[ebp+12]
. - For your platform (it is this way for most anyway),
char
's encoding is ASCII.
You should be able to deduce this yourself into inline assembly. Now, if you look at the table everyone is supposed to remember but barely anyone does, you'll notice some important details...
- Uppercase letters
A
throughZ
map into codes0x41
through0x5A
, respectively. - Lowercase letters
a
throughz
map into codes0x61
through0x7A
, respectively. - Everything else is not a letter, and thus need no case conversion.
- If you look at the binary representation of the upper and lowercase letter ranges, you'll notice that they are exactly the same, with the sole exception that uppercase letters have bit 6 cleared, and lowercase ones have it set.
As a result, the algorithm would be...
while array_size != 0
byte = *char_array
if byte >= 0x41 and byte <= 0x5A
*char_array |= 0x20 // Turn it lowercase
else if byte >= 0x61 and byte <= 0x7A
*char_array &= 0xDF // Turn it uppercase
array_size -= 1
char_array += 1
Now, let's translate this into assembly...
mov eax, [ebp+8] # char *eax = char_array
mov ecx, [ebp+12] # int ecx = array_size
.loop:
or ecx, ecx # Compare ecx against itself
jz .end_loop # If ecx (array_size) is zero, we're done
mov dl, [eax] # Otherwise, store the byte at *eax (*char_array) into `char dl`
cmp dl, 'A' # Compare dl (*char_array) against 'A' (lower bound of uppercase letters)
jb .continue # If dl` (*char_array) is lesser than `A`, continue the loop
cmp dl, 'Z' # Compare dl (*char_array) against 'Z' (upper bound of uppercase letters)
jbe .is_uppercase # If dl (*char_array) is lesser or equal to 'Z', then jump to .is_uppercase
cmp dl, 'a' # Compare dl (*char_array) against 'a' (lower bound of lowercase letters)
jb .continue # If dl (*char_array) is lesser than 'a', continue the loop
cmp dl, 'z' # Compare dl (*char_array) against 'z' (upper bound of lowercase letters)
jbe .is_lowercase # If dl (*char_array) is lesser or equal to 'z', then jump to .is_lowercase
jmp .continue # All tests failed, so continue the loop
.is_uppercase:
or dl, 20h # Set the 6th bit
mov [eax], dl # Send the byte back to where it came from
jmp .continue # Continue the loop
.is_lowercase:
and dl, DFh # Clear the 6th bit
mov [eax], dl # Send the byte back to where it came from
jmp .continue # Continue the loop
.continue:
inc eax # Increment `eax` (`char_array`), much of like a pointer increment
dec ecx # Decrement `ecx` (`array_size`), so as to match the previous pointer increment
jmp .loop # Continue
.end_loop:
Once code reaches .end_loop
, you're done.
I hope this has led a light on you!
Solution 3:
in ASCII 'a'-'z' and 'A'-'Z' are equivalent except one bit, 0x20
your friend here is XOR.
if you have a char ( either 'A'-'Z' or 'a'-'z'), XORing it with 0x20 will toggle the case;
before XORing, doing a range check makes sense. (to see if the value is really a letter)
You can simplify this range check by ORing the value to check with 0xef, which will make 'a' to 'A' and 'z' to 'Z', and then do the range check only once
(if you only compare to <'a' and >'Z' you will miss the characters inbetween ('[', ']', etc...)
Solution 4:
Courtesy of @KemyLand for the helpful breakdown of assembly code, I have figured out how to convert Uppercase to Lowercase and vice-versa.
void changeCase (char char_array[], int array_size ) {
//this function is designed to change lowercase letters to uppercase, and vice-versa, from a char-array given the array and its size.
__asm{
// BEGIN YOUR CODE HERE
mov eax, [ebp + 8]; //move to register value parameter 1 (the array)
mov ecx, [ebp + 12]; //likewise parameter 2 (the array size)
START:
or ecx, ecx; //check if pointer is 0
cmp ecx, 0;
je endloop; //go to end loop
mov dl,byte ptr [eax]; //not sure if needed, but reassurance
cmp dl, 0x41; // is char an A?
jl cont;
cmp dl, 0x5A; // is char a Z?
jle convertUP;
cmp dl, 0x61; // is char an a?
jl cont;
cmp dl, 0x7A; // is char a z?
jle convertDOWN;
jmp cont;
convertUP:
or dl, 0x20; //Yes! Finally got it working!
mov byte ptr [eax], dl;
jmp cont;
convertDOWN:
and dl, 0xdf; //this will work for sure.
mov[eax], dl;
jmp cont
cont:
inc eax;
dec ecx;
jmp START;
endloop:
}
}
Feel free to help explain what I might have missed! Thank you all for helping me understand the x86 assembly processor better.