NSArray from NSCharacterSet
Currently I can build an array of the letters of the alphabet like this:
[[NSArray alloc]initWithObjects:@"A",@"B",@"C",@"D",@"E",@"F",@"G",@"H",@"I",@"J",@"K",@"L",@"M",@"N",@"O",@"P",@"Q",@"R",@"S",@"T",@"U",@"V",@"W",@"X",@"Y",@"Z",nil];
Knowing that these letters are available via [NSCharacterSet uppercaseLetterCharacterSet], how can I make an array out of that character set?
Solution 1:
The following code creates an array containing all characters of a given character set. It also works for characters outside the "basic multilingual plane" (characters > U+FFFF, e.g. U+10400 DESERET CAPITAL LETTER LONG I).
NSCharacterSet *charset = [NSCharacterSet uppercaseLetterCharacterSet];
NSMutableArray *array = [NSMutableArray array];
for (int plane = 0; plane <= 16; plane++) {
    if ([charset hasMemberInPlane:plane]) {
        UTF32Char c;
        for (c = plane << 16; c < (plane+1) << 16; c++) {
            if ([charset longCharacterIsMember:c]) {
                UTF32Char c1 = OSSwapHostToLittleInt32(c); // To make it byte-order safe
                NSString *s = [[NSString alloc] initWithBytes:&c1 length:4 encoding:NSUTF32LittleEndianStringEncoding];
                [array addObject:s];
            }
        }
    }
}
For the uppercaseLetterCharacterSet this gives an array of 1467 elements. But note that characters > U+FFFF are stored as a UTF-16 surrogate pair in NSString, so for example U+10400 is actually stored in NSString as 2 characters "\uD801\uDC00".
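The surrogate-pair point can be checked with a short Swift sketch. It only illustrates the UTF-16 encoding of U+10400 and is not part of the solution itself. The variable names are made up for illustration, and `utf16.count` is used here as a stand-in for the length an NSString would report.

```swift
import Foundation

// U+10400 lies outside the basic multilingual plane.
let scalar = UnicodeScalar(0x10400 as UInt32)!
let s = String(Character(scalar))

// UTF-16 code units, as NSString stores them
let utf16Units = Array(s.utf16)
print(utf16Units.map { String($0, radix: 16) }) // ["d801", "dc00"]

// One user-visible character, but two UTF-16 code units
print(s.count)       // 1
print(s.utf16.count) // 2
```

So an array element created for such a character is a string of length 2 when measured in UTF-16 code units.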
Swift 2 code can be found in other answers to this question. Here is a Swift 3 version, written as an extension method:
extension CharacterSet {
    func allCharacters() -> [Character] {
        var result: [Character] = []
        for plane: UInt8 in 0...16 where self.hasMember(inPlane: plane) {
            for unicode in UInt32(plane) << 16 ..< UInt32(plane + 1) << 16 {
                if let uniChar = UnicodeScalar(unicode), self.contains(uniChar) {
                    result.append(Character(uniChar))
                }
            }
        }
        return result
    }
}
Example:
let charset = CharacterSet.uppercaseLetters
let chars = charset.allCharacters()
print(chars.count) // 1521
print(chars) // ["A", "B", "C", ...]
(Note that some characters may not be present in the font used to display the result.)
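If only the 26 letters from the question are wanted, one possible approach is to restrict the scan to the ASCII range instead of walking all planes. The following is a minimal sketch under that assumption. The name `asciiUppercase` is hypothetical, and the result is an array of strings, as in the question, rather than of Character values.

```swift
import Foundation

let upper = CharacterSet.uppercaseLetters

// Test only the code points 0..<128 (ASCII) for membership in the set.
let asciiUppercase: [String] = (UInt32(0)..<128).compactMap { value -> String? in
    guard let scalar = UnicodeScalar(value), upper.contains(scalar) else { return nil }
    return String(Character(scalar))
}

print(asciiUppercase.count) // 26
print(asciiUppercase)
```

The same filtering could be applied to the result of allCharacters(), at the cost of scanning every plane first.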
Solution 2:
Inspired by Satachito's answer, here is a performant way to make an Array from a CharacterSet using bitmapRepresentation:
extension CharacterSet {
    func characters() -> [Character] {
        // A Unicode scalar is any Unicode code point in the range U+0000 to U+D7FF inclusive or U+E000 to U+10FFFF inclusive.
        return codePoints().compactMap { UnicodeScalar($0) }.map { Character($0) }
    }
    func codePoints() -> [Int] {
        var result: [Int] = []
        var plane = 0
        // following documentation at https://developer.apple.com/documentation/foundation/nscharacterset/1417719-bitmaprepresentation
        for (i, w) in bitmapRepresentation.enumerated() {
            let k = i % 0x2001
            if k == 0x2000 {
                // plane index byte
                plane = Int(w) << 13
                continue
            }
            let base = (plane + k) << 3
            for j in 0 ..< 8 where w & 1 << j != 0 {
                result.append(base + j)
            }
        }
        return result
    }
}
Example for uppercaseLetters
let charset = CharacterSet.uppercaseLetters
let chars = charset.characters()
print(chars.count) // 1733
print(chars) // ["A", "B", "C", ...]
Example for discontinuous planes
// Two characters from non-adjacent planes: U+1D6A8 (plane 1) and U+CC791 (plane 12)
let charset = CharacterSet(charactersIn: "𝚨\u{CC791}")
let codePoints = charset.codePoints()
print(codePoints) // [120488, 837521]
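To see how the index arithmetic in codePoints() maps a bitmap position back to a code point, here is a small worked sketch for the value 120488 (U+1D6A8) from the example above. It runs the computation in reverse first, deriving the plane index byte, the byte offset and the bit position, and then recombines them the same way the loop does. The variable names are chosen for illustration only.

```swift
// Worked arithmetic for one code point; no Foundation needed.
let codePoint = 0x1D6A8              // 120488
let planeIndex = codePoint >> 16     // value of the plane index byte
let k = (codePoint & 0xFFFF) >> 3    // byte offset inside that plane's 8192-byte bitmap
let j = codePoint & 7                // bit position inside that byte

// Same recombination as in codePoints()
let plane = planeIndex << 13
let base = (plane + k) << 3
let decoded = base + j

print(planeIndex, k, j, decoded) // 1 6869 0 120488
```

Shifting the plane index left by 13 expresses the plane as a byte offset (8192 bytes per plane), so that adding k and shifting left by 3 yields the code point of the first bit in that byte.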
Performance
Very good, depending on the data and usage: built in release mode, this bitmapRepresentation solution seems 2 to 10 times faster than Martin R's solution with contains or Oliver Atkinson's solution with longCharacterIsMember.
Be sure to measure against your own needs. Performance is best compared in a non-debug build, so avoid comparing it in a Playground.
Solution 3:
Since characters have a limited, finite (and not too wide) range, you can just test which characters are members of a given character set (brute force):
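#include <limits.h> // for CHAR_BIT
// a UNICHAR_MAX constant doesn't seem to be available, so define it (65536 for a 16-bit unichar)
#define UNICHAR_MAX (1ull << (CHAR_BIT * sizeof(unichar)))
NSData *data = [[NSCharacterSet uppercaseLetterCharacterSet] bitmapRepresentation];
const uint8_t *ptr = [data bytes];
NSMutableArray *allCharsInSet = [NSMutableArray array];
// following from Apple's sample code
// The counter must be wider than unichar: a 16-bit counter wraps to 0 and never reaches UNICHAR_MAX
for (uint32_t i = 0; i < UNICHAR_MAX; i++) {
    if (ptr[i >> 3] & (1u << (i & 7))) {
        unichar c = (unichar)i;
        [allCharsInSet addObject:[NSString stringWithCharacters:&c length:1]];
    }
}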
Remark: Due to the size of a unichar and the structure of the additional segments in bitmapRepresentation, this solution only works for characters <= 0xFFFF and is not suitable for higher planes.