Random string that matches a regexp [duplicate]

How would you go about creating a random alpha-numeric string that matches a certain regular expression?

This is specifically for creating initial passwords that fulfill regular password requirements.


Solution 1:

Welp, just musing, but the general question of generating random inputs that match a regex sounds doable to me for a sufficiently relaxed definition of random and a sufficiently tight definition of regex. I'm thinking of the classical formal definition, which allows only ()|* and alphabet characters.

Regular expressions can be mapped to formal machines called finite automata. Such a machine is a directed graph with a particular node called the final state, a node called the initial state, and a letter from the alphabet on each edge. A word is accepted by the regex if it's possible to start at the initial state and traverse one edge labeled with each character through the graph and end at the final state.

One could build the graph, then start at the final state and traverse random edges backwards, keeping track of the path. In a standard construction, every node in the graph is reachable from the initial state, so you do not need to worry about making irrecoverable mistakes and needing to backtrack. If you reach the initial state, stop, and read off the path going forward. That's your match for the regex.

There's no particular guarantee about when or if you'll reach the initial state, though. One would have to figure out in what sense the generated strings are 'random', and in what sense you are hoping for a random element from the language in the first place.

Maybe that's a starting point for thinking about the problem, though!

Now that I've written that out, it seems to me that it might be simpler to repeatedly resolve choices to simplify the regex pattern until you're left with a simple string. Find the first non-alphabet character in the pattern. If it's a *, replicate the preceding item some number of times and remove the *. If it's a |, choose which of the OR'd items to preserve and remove the rest. For a left paren, do the same, but looking at the character following the matching right paren. This is probably easier if you parse the regex into a tree representation first that makes the paren grouping structure easier to work with.

To the person who worried that deciding if a regex actually matches anything is equivalent to the halting problem: Nope, regular languages are quite well behaved. You can tell if any two regexes describe the same set of accepted strings. You basically make the machine above, then follow an algorithm to produce a canonical minimal equivalent machine. Do that for two regexes, then check if the resulting minimal machines are equivalent, which is straightforward.

Solution 2:

String::Random in Perl will generate a random string from a subset of regular expressions:

#!/usr/bin/perl

use strict;
use warnings;

use String::Random qw/random_regex/;

print random_regex('[A-Za-z]{3}[0-9][A-Z]{2}[!@#$%^&*]'), "\n";

Solution 3:

If you have a specific problem, you probably have a specific regular expression in mind. I would take that regular expression, work out what it means in simple human terms, and work from there.

I suspect it's possible to create a general regex random match generator, but it's likely to be much more work than just handling a specific case - even if that case changes a few times a year.

(Actually, it may not be possible to generate random matches in the most general sense - I have a vague memory that the problem of "does any string match this regex" is the halting problem in disguise. With a very cut-down regex language you may have more luck though.)

Solution 4:

I have written Parsley, which consist of a Lexer and a Generator.

  • Lexer is for converting a regular expression-like string into a sequence of tokens.
  • Generator is using these tokens to produce a defined number of codes.
$generator = new \Gajus\Parsley\Generator();

/**
 * Generate a set of random codes based on Parsley pattern.
 * Codes are guaranteed to be unique within the set.
 *
 * @param string $pattern Parsley pattern.
 * @param int $amount Number of codes to generate.
 * @param int $safeguard Number of additional codes generated in case there are duplicates that need to be replaced.
 * @return array
 */
$codes = $generator->generateFromPattern('FOO[A-Z]{10}[0-9]{2}', 100);

The above example will generate an array containing 100 codes, each prefixed with "FOO", followed by 10 characters from "ABCDEFGHKMNOPRSTUVWXYZ23456789" haystack and 2 numbers from "0123456789" haystack.

Solution 5:

This PHP library looks promising: ReverseRegex

Like all of these, it only handles a subset of regular expressions but it can do fairly complex stuff like UK Postcodes:

([A-PR-UWYZ]([0-9]([0-9]|[A-HJKSTUW])?|[A-HK-Y][0-9]([0-9]|[ABEHMNPRVWXY])?) ?[0-9][ABD-HJLNP-UW-Z]{2}|GIR0AA)

Outputs

D43WF
B6 6SB
MP445FR
P9 7EX
N9 2DH
GQ28 4UL
NH1 2SL
KY2 9LS
TE4Y 0AP