Most efficient way to iterate over all the chars in an NSString

Solution 1:

I think it's important that people understand how to deal with unicode, so I ended up writing a monster answer, but in the spirit of tl;dr I will start with a snippet that should work fine. If you want to know details (which you should!), please continue reading after the snippet.

NSUInteger len = [str length];
unichar buffer[len+1];

[str getCharacters:buffer range:NSMakeRange(0, len)];

NSLog(@"getCharacters:range: with unichar buffer");
for(int i = 0; i < len; i++) {
  NSLog(@"%C", buffer[i]);
}

Still with me? Good!

The current accepted answer seem to be confusing bytes with characters/letters. This is a common problem when encountering unicode, especially from a C background. Strings in Objective-C are represented as unicode characters (unichar) which are much bigger than bytes and shouldn't be used with standard C string manipulation functions.

(Edit: This is not the full story! To my great shame, I'd completely forgotten to account for composable characters, where a "letter" is made up of multiple unicode codepoints. This gives you a situation where you can have one "letter" resolving to multiple unichars, which in turn are multiple bytes each. Hoo boy. Please refer to this great answer for the details on that.)

The proper answer to the question depends on whether you want to iterate over the characters/letters (as distinct from the type char) or the bytes of the string (what the type char actually means). In the spirit of limiting confusion, I will use the terms byte and letter from now on, avoiding the possibly ambigious term character.

If you want to do the former and iterate over the letters in the string, you need to exclusively deal with unichars (sorry, but we're in the future now, you can't ignore it anymore). Finding the amount of letters is easy, it's the string's length property. An example snippet is as such (same as above):

NSUInteger len = [str length];
unichar buffer[len+1];

[str getCharacters:buffer range:NSMakeRange(0, len)];

NSLog(@"getCharacters:range: with unichar buffer");
for(int i = 0; i < len; i++) {
  NSLog(@"%C", buffer[i]);
}

If, on the other hand, you want to iterate over the bytes in a string, it starts getting complicated and the result will depend entirely upon the encoding you choose to use. The decent default choice is UTF8, so that's what I will show.

Doing this you have to figure out how many bytes the resulting UTF8 string will be, a step where it's easy to go wrong and use the string's -length. One main reason this very easy to do wrong, especially for a US developer, is that a string with letters falling into the 7-bit ASCII spectrum will have equal byte and letter lengths. This is because UTF8 encodes 7-bit ASCII letters with a single byte, so a simple test string and basic english text might work perfectly fine.

The proper way to do this is to use the method -lengthOfBytesUsingEncoding:NSUTF8StringEncoding (or other encoding), allocate a buffer with that length, then convert the string to the same encoding with -cStringUsingEncoding: and copy it into that buffer. Example code here:

NSUInteger byteLength = [str lengthOfBytesUsingEncoding:NSUTF8StringEncoding];
char proper_c_buffer[byteLength+1];
strncpy(proper_c_buffer, [str cStringUsingEncoding:NSUTF8StringEncoding], byteLength);

NSLog(@"strncpy with proper length");
for(int i = 0; i < byteLength; i++) {
  NSLog(@"%c", proper_c_buffer[i]);
}

Just to drive the point home as to why it's important to keep things straight, I will show example code that handles this iteration in four different ways, two wrong and two correct. This is the code:

#import <Foundation/Foundation.h>

int main() {
  NSString *str = @"буква";
  NSUInteger len = [str length];

  // Try to store unicode letters in a char array. This will fail horribly
  // because getCharacters:range: takes a unichar array and will probably
  // overflow or do other terrible things. (the compiler will warn you here,
  // but warnings get ignored)
  char c_buffer[len+1];
  [str getCharacters:c_buffer range:NSMakeRange(0, len)];

  NSLog(@"getCharacters:range: with char buffer");
  for(int i = 0; i < len; i++) {
    NSLog(@"Byte %d: %c", i, c_buffer[i]);
  }

  // Copy the UTF string into a char array, but use the amount of letters
  // as the buffer size, which will truncate many non-ASCII strings.
  strncpy(c_buffer, [str UTF8String], len);

  NSLog(@"strncpy with UTF8String");
  for(int i = 0; i < len; i++) {
    NSLog(@"Byte %d: %c", i, c_buffer[i]);
  }

  // Do It Right (tm) for accessing letters by making a unichar buffer with
  // the proper letter length
  unichar buffer[len+1];
  [str getCharacters:buffer range:NSMakeRange(0, len)];

  NSLog(@"getCharacters:range: with unichar buffer");
  for(int i = 0; i < len; i++) {
    NSLog(@"Letter %d: %C", i, buffer[i]);
  }

  // Do It Right (tm) for accessing bytes, by using the proper
  // encoding-handling methods
  NSUInteger byteLength = [str lengthOfBytesUsingEncoding:NSUTF8StringEncoding];
  char proper_c_buffer[byteLength+1];
  const char *utf8_buffer = [str cStringUsingEncoding:NSUTF8StringEncoding];
  // We copy here because the documentation tells us the string can disappear
  // under us and we should copy it. Just to be safe
  strncpy(proper_c_buffer, utf8_buffer, byteLength);

  NSLog(@"strncpy with proper length");
  for(int i = 0; i < byteLength; i++) {
    NSLog(@"Byte %d: %c", i, proper_c_buffer[i]);
  }
  return 0;
}

Running this code will output the following (with NSLog cruft trimmed out), showing exactly HOW different the byte and letter representations can be (the two last outputs):

getCharacters:range: with char buffer
Byte 0: 1
Byte 1: 
Byte 2: C
Byte 3: 
Byte 4: :
strncpy with UTF8String
Byte 0: Ð
Byte 1: ±
Byte 2: Ñ
Byte 3: 
Byte 4: Ð
getCharacters:range: with unichar buffer
Letter 0: б
Letter 1: у
Letter 2: к
Letter 3: в
Letter 4: а
strncpy with proper length
Byte 0: Ð
Byte 1: ±
Byte 2: Ñ
Byte 3: 
Byte 4: Ð
Byte 5: º
Byte 6: Ð
Byte 7: ²
Byte 8: Ð
Byte 9: °

Solution 2:

While Daniel's solution will probably work most of the time, I think the solution is dependent on the context. For example, I have a spelling app and need to iterate over each character as it appears onscreen which may not correspond to the way it is represented in memory. This is especially true for text provided by the user.

Using something like this category on NSString:

- (void) dumpChars
{
    NSMutableArray  *chars = [NSMutableArray array];
    NSUInteger      len = [self length];
    unichar         buffer[len+1];

    [self getCharacters: buffer range: NSMakeRange(0, len)];
    for (int i=0; i<len; i++) {
        [chars addObject: [NSString stringWithFormat: @"%C", buffer[i]]];
    }

    NSLog(@"%@ = %@", self, [chars componentsJoinedByString: @", "]);
}

And feeding it a word like mañana might produce:

mañana = m, a, ñ, a, n, a

But it could just as easily produce:

mañana = m, a, n, ̃, a, n, a

The former will be produced if the string is in precomposed unicode form and the later if it's in decomposed form.

You might think this could be avoided by using the result of NSString's precomposedStringWithCanonicalMapping or precomposedStringWithCompatibilityMapping, but this is not necessarily the case as Apple warns in Technical Q&A 1225. For example a string like e̊gâds (which I totally made up) still produces the following even after converting to a precomposed form.

 e̊gâds = e, ̊, g, â, d, s

The solution for me is to use NSString's enumerateSubstringsInRange passing NSStringEnumerationByComposedCharacterSequences as the enumeration option. Rewriting the earlier example to look like this:

- (void) dumpSequences
{
    NSMutableArray  *chars = [NSMutableArray array];

    [self enumerateSubstringsInRange: NSMakeRange(0, [self length]) options: NSStringEnumerationByComposedCharacterSequences
        usingBlock: ^(NSString *inSubstring, NSRange inSubstringRange, NSRange inEnclosingRange, BOOL *outStop) {
        [chars addObject: inSubstring];
    }];

    NSLog(@"%@ = %@", self, [chars componentsJoinedByString: @", "]);
}

If we feed this version e̊gâds then we get

e̊gâds = e̊, g, â, d, s

as expected, which is what I want.

The section of documentation on Characters and Grapheme Clusters may also be helpful in explaining some of this.

Note: Looks like some of the unicode strings I used are tripping up SO when formatted as code. The strings I used are mañana, and e̊gâds.