What is the best format for a customer number, order number?
A large international company deploys a new web and MOTO (Mail Order and Telephone Order) handling system. Among other things you are tasked to design format for both order and customer identification numbers.
What would be the best format in your opinion? Please list any assumptions and considerations.
Accepted Answer
Michael Haren's answer selected due to the most up votes, but please do read other answers and comments as they make Michael's answer more complete.
Solution 1:
Go with all numbers or all letters. If you must mix it up, then make sure there are no ambiguous characters (Il1m, O0, etc.).
When displayed/printed, put spaces in every 3-4 characters but make sure your systems can handle inputs without the spaces.
Edit: Another thing to consider is having a built in way to distinguish orders, customers, etc. e.g. customers always start with 10, orders always start with 20, vendors always start with 30, etc.
Solution 2:
DON'T encode ANY mutable customer/order information into the numbers! And you have to assume that everything is mutable!
Some of the above suggestions include a region code. Companies can move. Your own company might reorganize and change its own definition of regions. Customer/company names can change as well.
Customer/order information belongs in the customer/order record. Not in the ID. You can modify the customer/order record later. IDs are generally written in stone.
Even just encoding the date on which the number was generated into the ID might seem safe, but that assumes that the date is never wrong on the systems generating the numbers. Again, this belongs in the record. Otherwise it can never be corrected.
Will more than one system be generating these numbers? If so, you have the potential for duplication if you use only date-based and/or sequential numbers.
Without knowing much about the company, I'd start down this path:
- A one-character code identifying the type of number. C for customers, R for orders (don't use "O" as it could be confused with zero), etc.
- An identifier of the system that generated the number. The length of this identifier depends on how many of these systems there will be.
- A sequence number, unique to the system generating it. Just a counter.
- A random number, to prevent guessable order/customer numbers. Make this as long as your paranoia requires.
- A simple checksum. Not for security, but for error checking.
Breaking this up into segments makes it more human-readable as others have pointed out.
CX5-0000758-82314-12 is a possible number generated by this approach . This consists of:
- C: it's a customer number.
- X5: the station that generated the number.
- 0000758: this is the 758th number generated by X5. We can generate 10 million before retiring this station ID or the station itself. Or don't pad with zeros and there's no limit.
- 82314: this was randomly generated and results in a 1/100,000 chance of guessing a customer ID.
- 12: checksum.
Solution 3:
A primary advantage of using only numbers is that they can be entered much more efficiently using 10-key.
The length of that number should be as short as possible while still encompassing the entire entity space you expect to catalog with room to spare. This can be tricky and should be given a bit of thought. A little set theory can give you the number of unique keys you will have access to, given a group of elements.
It is natural when speaking, to break numbers up into sets of two to four digits. By inserting dashes in some pattern, you can "force" the customer to repeat them in a more efficient and unambiguous manner.
For instance, 323-23-5344, which, of course, is social security number format, helps to inform the speaker where to pause when vocalizing the number. It also provides a visual delineation when writing the number and makes it easy to compare when copying the number.
I second the recommendation that the ordering system masks the input correctly so that no dashes need to be entered at any time. This should be carried through to printed forms to provide a clear expectation of what should be entered. For instance, a printed box for each digit separated by printed dashes.
I disagree that too much information should be embedded in this number especially if those attributes might change. For instance, say we give "323" the meaning of "is a nice customer" but then they call in four times with an attitude. Are we then going to change their customer key to "324", "is a jerk"? What if they are in region 04 and move their company to region 05?
If that happens, your options will be to update that primary key throughout the database or live with the ambiguity that the information embedded in that key is no longer reliable, thus rendering all of the information embedded in the keys of questionable utility.
It is better to store attributes that may change as separate fields in the database and have the customer number be a unique, unchanging key for that customer.
Solution 4:
To build on Daniel and Michael's questions: it's even better if the separated numbers MEAN something else. For example, I worked for a company where account numbers were like this:
xxxx-xxxx-xxxxxxxx
The first set of numbers represented the region and the second set represented the market within that region. Once you got used to knowing what numbers were from were, it made it really easy to tell what area an account was in without even having to look at the customer's account.
Solution 5:
There are several assumptions that I make when answering this question; some are based on the fact that it is a large international organization, and some are based on the fact that the format is for two separate table types.
Assumptions based on the fact that it's an international organization:
- It is probable that each region will need to operate independently -- that is, region A must be able to add customer numbers independently from region B
- Each region probably uses a different language so to make the identifiers easily type-able by users around the world, it is best to stick to numbers and spaces only.
Assumptions based on the fact that there are two tables for which this format will be used:
- This format may be used by more than the two tables listed, so it should be able to handle an arbitrarily large number of tables.
- Experienced users should be able to know what type of identifier they are looking at based on information encoded into the identifier itself.
- It would be nice if identifiers were globally unique within the entire system.
Considerations:
- For a global company, identifiers can be very long if only numerics are used. We should attempt to limit the amount of extraneous information encoded into the identifier as much as possible.
- Identifiers should be self-verifiable to a limited extent; that is a program should be able to detect a large percent of invalid identifiers without looking anything up at all. This implies a checksum.
Proposed format: SSSS0RR0TTC
The format proposed is as simple as possible, but no simpler:
- C The first (rightmost) character will be a checksum of all other characters in the identifier. A simple checksum will do. This will eliminate 90% of all typing errors. If it is decided that this is not enough, then this can be expanded to 2 digits which will eliminate 99% of all typing errors.
- TT The next N digits represent the table type number. No table type number can contain the digit zero.
- The next digit is a zero. This zero separates the table type number from the region number.
- RR The next N digits are the region number. No region numbers can contain a zero.
- The next digit is a zero. This zero separates the region from the sequence number.
- SSSS The next N digits are the sequence number. This number can contain zeros.
- Each set of four numbers are separated by spaces when printed or typed in by convention. Internally they are not separated, but this helps the user transfer them correctly.
Examples
-
Assuming:
- Customer table type=1
- Order table table type=2
- Region code for US-Alabama=1
- Region code for CA-Alberta=43
- Region code for Ethopia=924
- 10 1013 - Customer #1 in Alabama (3 is the checksum: 1 +1 + 1)
- 10 1024 - Order #1 in Alabama
- 9259 0304 3016 - customer # 925903 in Alberta, Canada
- 20 3043 4092 4023 - order number 2030434 in Ethopia
Advantages of this approach:
- 90% of mistyped numbers will be caught
- There are an unlimited number of table types
- There are an unlimited number of regions
- There are an unlimited number of sequential numbers for each table
- Identifier numbers are globally unique to the system. This is important - a customer number cannot be mistaken for an order number and visa versa.
- Each region can independently add sequence numbers without a global key
Disadvantages
- Each identifier is at least six characters
- table types numbers and region numbers cannot contain a zero because the zero is used to separate the sequence number from the region number from the table type number.