Finding the phone numbers in 50,000 HTML pages

How do you find the phone numbers in 50,000 HTML pages?

Jeff Atwood posted five questions for programmers applying for jobs:

In an effort to make life simpler for phone screeners, I've put together this list of Five Essential Questions that you need to ask during an SDE screen. They won't guarantee that your candidate will be great, but they will help eliminate a huge number of candidates who are slipping through our process today.

1) Coding The candidate has to write some simple code, with correct syntax, in C, C++, or Java.

2) OO design The candidate has to define basic OO concepts, and come up with classes to model a simple problem.

3) Scripting and regexes The candidate has to describe how to find the phone numbers in 50,000 HTML pages.

4) Data structures The candidate has to demonstrate basic knowledge of the most common data structures.

5) Bits and bytes The candidate has to answer simple questions about bits, bytes, and binary numbers.

Please understand: what I'm looking for here is a total vacuum in one of these areas. It's OK if they struggle a little and then figure it out. It's OK if they need some minor hints or prompting. I don't mind if they're rusty or slow. What you're looking for is candidates who are utterly clueless, or horribly confused, about the area in question.

>>> The Entirety of Jeff's Original Post <<<


Note: Steve Yegge originally posed the Question.


Solution 1:

egrep "(([0-9]{1,2}[-. ])?[0-9]{3}[-. ][0-9]{3}[-. ][0-9]{4})" . -R --include='*.html'

Solution 2:

Made this in Java. The regex was borrowed from this forum.

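    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.regex.Pattern;

    public class PhoneFinder {
        public static void main(String[] args) throws IOException {
            // First alternative: whitespace, then xxx-xxx-xxxx style numbers
            // (area code optionally in parentheses). Second: +xx-xxx-xxx-xxxx.
            final String regex = "[\\s](\\({0,1}\\d{3}\\){0,1}" +
                    "[- \\.]\\d{3}[- \\.]\\d{4})|" +
                    "(\\+\\d{2}-\\d{2,4}-\\d{3,4}-\\d{3,4})";
            final Pattern phonePattern = Pattern.compile(regex);

            /* The result set: files that contain at least one phone number */
            Set<File> files = new HashSet<File>();

            File dir = new File("/initDirPath");
            File[] entries = dir.listFiles(); // null if dir is not a readable directory
            if (entries == null) return;

            for (File file : entries) {
                if (file.isDirectory()) continue; // subdirectories are not searched

                // try-with-resources closes the reader even if reading fails
                try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        if (phonePattern.matcher(line).find()) {
                            files.add(file);
                            break; // one match is enough for this file
                        }
                    }
                }
            }

            for (File file : files) {
                System.out.println(file.getAbsolutePath());
            }
        }
    }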

Performed some tests and it went OK! :) Remember, I'm not trying to use the best design here; I just implemented the algorithm.
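The loop above only looks at files directly inside the directory, and it records which files contain a match rather than the numbers themselves. If the pages are spread over subdirectories and the numbers are what you want, a variant along these lines might work. This is only a sketch under assumptions: the class name `PhoneScanner`, its method names, and the `.html` suffix filter are made up for illustration, the leading-whitespace requirement is dropped from the pattern, and every file is read as UTF-8, which may not hold for real pages.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PhoneScanner {
    // Same two alternatives as above, without the leading-whitespace requirement.
    static final Pattern PHONE = Pattern.compile(
            "(\\(?\\d{3}\\)?[- .]\\d{3}[- .]\\d{4})|"
            + "(\\+\\d{2}-\\d{2,4}-\\d{3,4}-\\d{3,4})");

    // Returns every match found in the given text, in order of appearance.
    static List<String> findNumbers(String text) {
        List<String> numbers = new ArrayList<>();
        Matcher m = PHONE.matcher(text);
        while (m.find()) {
            numbers.add(m.group());
        }
        return numbers;
    }

    // Walks root recursively and scans every regular file ending in ".html".
    static List<String> scanDirectory(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(".html"))
                    .flatMap(p -> findNumbers(read(p)).stream())
                    .collect(Collectors.toList());
        }
    }

    // Reads a whole file as UTF-8 (an assumption about the pages' encoding).
    static String read(Path p) {
        try {
            return new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // "/initDirPath" is the placeholder directory from the snippet above.
        Path root = Paths.get(args.length > 0 ? args[0] : "/initDirPath");
        scanDirectory(root).forEach(System.out::println);
    }
}
```

Reading whole files instead of line by line keeps the code short, and with 50,000 ordinary HTML pages the per-file memory cost should be small. The matches still include anything that merely looks like a phone number, such as digits inside attributes or scripts.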

Solution 3:

Here is an improved regex pattern:

\(?\d{3}\)?[-\s\.]?\d{3}[-\s\.]?\d{4}

It is able to identify several number formats

  1. xxx.xxx.xxxx
  2. xxx.xxxxxxx
  3. xxx-xxx-xxxx
  4. xxxxxxxxxx
  5. (xxx) xxx xxxx
  6. (xxx) xxx-xxxx
  7. (xxx)xxx-xxxx
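One way to check that list is to run the pattern against a sample of each format. The following is a minimal Java sketch; the class name `PhonePatternCheck`, the helper `isPhone`, and the sample digits are made up for illustration and are not part of the original answer.

```java
import java.util.regex.Pattern;

public class PhonePatternCheck {
    // The pattern from this answer, with backslashes doubled for a Java string literal.
    static final Pattern PHONE =
            Pattern.compile("\\(?\\d{3}\\)?[-\\s\\.]?\\d{3}[-\\s\\.]?\\d{4}");

    // True if the whole input is a phone number according to PHONE.
    static boolean isPhone(String s) {
        return PHONE.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "555.123.4567", "555.1234567", "555-123-4567", "5551234567",
            "(555) 123 4567", "(555) 123-4567", "(555)123-4567",
            "555-123-456" // only nine digits, so this one should be rejected
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isPhone(s));
        }
    }
}
```

Because the two parentheses are optional independently of each other, the pattern also accepts unbalanced input such as `(555 123 4567`. When scanning page text rather than validating a single string, `find()` would be used instead of `matches()`.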

Solution 4:

Borrowing 2 things from the C# answer from sieben, here's a little F# snippet that will do the job. All it's missing is a way to call processDirectory, which is left out intentionally :)


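open System
open System.IO
open System.Text.RegularExpressions

// xxx-xxx-xxxx style numbers (area code optionally in parentheses), or +xx-xxx-xxx-xxxx.
let rgx = Regex(@"(\({0,1}\d{3}\){0,1}[- \.]\d{3}[- \.]\d{4})|(\+\d{2}-\d{2,4}-\d{3,4}-\d{3,4})", RegexOptions.Compiled)

// Return every phone-number match found in a file's contents.
// Seq.cast needs the explicit Match type so that m.Value resolves.
let processFile (contents: string) =
    contents |> rgx.Matches |> Seq.cast<Match> |> Seq.map (fun m -> m.Value)

// Collect the matches from every *.html file under path, recursively.
let processDirectory path =
    Directory.GetFiles(path, "*.html", SearchOption.AllDirectories)
    |> Seq.collect (File.ReadAllText >> processFile)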