Find the first un-repeated character in a string

It has to be at least O(n) because you don't know if a character will be repeated until you've read all characters.

So you can iterate over the characters and append each character to a list the first time you see it, and separately keep a count of how many times you've seen it (in fact the only values that matter for the count is "0", "1" or "more than 1").

When you reach the end of the string you just have to find the first character in the list that has a count of exactly one.

Example code in Python:

def first_non_repeated_character(s):
    counts = defaultdict(int)
    l = []
    for c in s:
        counts[c] += 1
        if counts[c] == 1:
            l.append(c)

    for c in l:
        if counts[c] == 1:
            return c

    return None

This runs in O(n).

I see that people have posted some delightful answers below, so I'd like to offer something more in-depth.

An idiomatic solution in Ruby

We can find the first un-repeated character in a string like so:

def first_unrepeated_char string
   string.each_char.tally.find { |_, n| n == 1 }.first
end

How does Ruby accomplish this?

Reading Ruby's source

Let's break down the solution and consider what algorithms Ruby uses for each step.

First we call each_char on the string. This creates an enumerator which allows us to visit the string one character at a time. This is complicated by the fact that Ruby handles Unicode characters, so each value we get from the enumerator can be a variable number of bytes. If we know our input is ASCII or similar, we could use each_byte instead.

The each_char method is implemented like so:

rb_str_each_char(VALUE str)
{
    RETURN_SIZED_ENUMERATOR(str, 0, 0, rb_str_each_char_size);
    return rb_str_enumerate_chars(str, 0);
}

In turn, rb_string_enumerate_chars is implemented as:

rb_str_enumerate_chars(VALUE str, VALUE ary)
{
    VALUE orig = str;
    long i, len, n;
    const char *ptr;
    rb_encoding *enc;


    str = rb_str_new_frozen(str);
    ptr = RSTRING_PTR(str);
    len = RSTRING_LEN(str);
    enc = rb_enc_get(str);


    if (ENC_CODERANGE_CLEAN_P(ENC_CODERANGE(str))) {
    for (i = 0; i < len; i += n) {
        n = rb_enc_fast_mbclen(ptr + i, ptr + len, enc);
        ENUM_ELEM(ary, rb_str_subseq(str, i, n));
    }
    }
    else {
    for (i = 0; i < len; i += n) {
        n = rb_enc_mbclen(ptr + i, ptr + len, enc);
        ENUM_ELEM(ary, rb_str_subseq(str, i, n));
    }
    }
    RB_GC_GUARD(str);
    if (ary)
    return ary;
    else
    return orig;
}

From this we can see that it calls rb_enc_mbclen (or its fast version) to get the length (in bytes) of the next character in the string so that it can iterate the next step. By lazily iterating over a string, reading just one character at a time, we end up doing just one full pass over the input string as tally consumes the iterator.

Tally is then implemented like so:

static void
tally_up(VALUE hash, VALUE group)
{
    VALUE tally = rb_hash_aref(hash, group);
    if (NIL_P(tally)) {
        tally = INT2FIX(1);
    }
    else if (FIXNUM_P(tally) && tally < INT2FIX(FIXNUM_MAX)) {
        tally += INT2FIX(1) & ~FIXNUM_FLAG;
    }
    else {
        tally = rb_big_plus(tally, INT2FIX(1));
    }
    rb_hash_aset(hash, group, tally);
}


static VALUE
tally_i(RB_BLOCK_CALL_FUNC_ARGLIST(i, hash))
{
    ENUM_WANT_SVALUE();
    tally_up(hash, i);
    return Qnil;
}

Here, tally_i uses RB_BLOCK_CALL_FUNC_ARGLIST to call repeatedly to tally_up, which updates the tally hash on every iteration.

Rough time & memory analysis

The each_char method doesn't allocate an array to eagerly hold the characters of the string, so it has a small constant memory overhead. When we tally the characters, we allocate a hash and put our tally data into it which in the worst case scenario can take up as much memory as the input string times some constant factor.

Time-wise, tally does a full scan of the string, and calling find to locate the first non-repeated character will scan the hash again, each of which carry O(n) worst-case complexity.

However, tally also updates a hash on every iteration. Updating the hash on every character can be as slow as O(n) again, so the worst case complexity of this Ruby solution is perhaps O(n^2).

However, under reasonable assumptions, updating a hash has an O(1) complexity, so we can expect the average case amortized to look like O(n).

My old accepted answer in Python

You can't know that the character is un-repeated until you've processed the whole string, so my suggestion would be this:

def first_non_repeated_character(string):
  chars = []
  repeated = []
  for character in string:
    if character in chars:
      chars.remove(character)
      repeated.append(character)
    else:
      if not character in repeated:
        chars.append(character)
  if len(chars):
    return chars[0]
  else:
    return False

Edit: originally posted code was bad, but this latest snippet is Certified To Work On Ryan's Computer™.

Why not use a heap based data structure such as a minimum priority queue. As you read each character from the string, add it to the queue with a priority based on the location in the string and the number of occurrences so far. You could modify the queue to add priorities on collision so that the priority of a character is the sum of the number appearances of that character. At the end of the loop, the first element in the queue will be the least frequent character in the string and if there are multiple characters with a count == 1, the first element was the first unique character added to the queue.

Find the first un-repeated character in a string

An idiomatic solution in Ruby

Reading Ruby's source

Rough time & memory analysis

My old accepted answer in Python

Related

Recent Posts