Implementing trigraphs in a C89 compiler

I am attempting to write a simplistic C89 --> x86_64 compiler, based on this C89 standard draft, in C89, for learning's sake. So far, I am implementing translation phase 1. My understanding is that this consists of

  1. Reading the code into a string.
  2. Replacing the trigraph sequences.

I have tried to implement this with a program (please forgive any style errors I have made):

char *trigraph_replacement(char *code)
{
        char *temp = calloc(1, strlen(code));
        char *temp_1 = temp;
        char *code_1 = code;
        for (; *code_1; code_1++)
        {
                if (strncmp(code_1, "??", 2) == 0)
                {
                        code_1 += 2;
                        switch (*code_1)
                        {
                        case '<':
                                *(temp_1) = '{';
                                break;
                        case '>':
                                *(temp_1) = '}';
                                break;
                        case '(':
                                *(temp_1) = '[';
                                break;
                        case ')':
                                *(temp_1) = ']';
                                break;
                        case '=':
                                *(temp_1) = '#';
                                break;
                        case '/':
                                *(temp_1) = '\\';
                                break;
                        case '\'':
                                *(temp_1) = '^';
                                break;
                        case '!':
                                *(temp_1) = '!';
                                break;
                        case '-':
                                *(temp_1) = '~';
                                break;
                        default:
                                break;
                        }
                }
                else
                {
                        *temp_1 = *code_1;
                }
                temp_1++;
        }

        free(code);
        return temp;
}

Now, intuitively, it seems that this should do what is is supposed to do, replace all the trigraphs. However, the gcc docs say that "Trigraphs are not popular and many compilers implement them incorrectly". It goes on to state that "portable code should not rely on trigraphs being either converted or ignored"

As a result, I am wondering

  • Is my implementation sufficient or did I err in some way?
  • Are trigraphs worth implementing in the first place, or are they not used, even in the most ancient legacy programs?

Solution 1:

Reading the code into a string.

Yeah well, translating from the symbol table used by the source file to the one used by the compiler. Ideally they would use the same one.

Is my implementation sufficient or did I err in some way?

  • calloc(1, strlen(code)); is almost certainly a bug because you didn't allocate room for the null terminator.
  • The implementation with heap allocation, branching switch etc is very naive for a parser, this code will be very slow. Use look-up tables when possible, prioritize execution speed over memory consumption.
  • There's no need for all the icky pointer arithmetic since you already decided to call strlen (a call which too gives overhead execution time). Store the result of that call and change the whole loop to for(size_t i=0; i<length; i++). You should be able to shave down this function considerably.

Are trigraphs worth implementing in the first place, or are they not used, even in the most ancient legacy programs?

They are barely used at all, save for in those ancient programs. Their presence in C is a heavily criticized language flaw. This is why gcc emits a warning whenever it encounters them.

However, if you wish to claim some manner of ISO C90 compliance, you need to support them. They are not (for reasons unknown) even flagged as obsolescent by the current ISO C17 standard, so it would seem they are here to stay.

Solution 2:

Leaving aside a number of technical issues and inefficiencies, which I'll get to later, the answer to your question:

did I err in some way?

is, "Yes, you did".

1. The bug in this code

Here's a highly abbreviated view of the loop in trigraph_replacement (you might hate my brace formatting; I do it this way so as to reduce vertical space wastage):

for (; *code_1; code_1++) {
    if (strncmp(code_1, "??", 2) == 0) {
        code_1 += 2;
        switch (*code_1) {
            case '<': *(temp_1) = '{'; break;
            /* ...  */
            default:                   break;
        }
    } /* end if (strncmp(...)) */
    else { *temp_1 = *code_1; }
    temp_1++;
}

If the strncmp returns a non-zero value, there's no problem; the effect is:

    *temp_1 = *code_1;
    temp_1++;
    code_1++;  /* From the for loop */

If the strncmp returns 0 and the first (or any other specified) case matches, again no problem:

    code_1 += 2; 
    *temp_1 = '{';   /* The parentheses around temp_1 were unnecessary */
    temp_1++;
    code_1++;  /* From the for loop */

But what if the strncmp returns 0 but the next character was, say, a space:

    printf("Can this be correct?? N has the value %d\n", N);

The default case involves the following sequence:

    code_1 += 2; 
    /* default case does nothing */
    temp_1++;
    code_1++;  /* From the for loop */

So the code skips over the ?? and the following space, and fails to store anything into the corresponding position in the output buffer. The result will be:

    printf("Can this be correctØN has the value %d\n", N);

(The Ø represents an actual NUL character; since code was allocated with calloc, the unset character position will be 0. That's not likely to play well with the next phase of the translation; it might prematurely terminate the source code, or produce an invalid source character error, or just get dropped on the floor, but none of those are going to be what the user intended.)

Perhaps worse, though, is the following example (§5.2.1.1para3 in the standard):

    printf("Eh???/n");

The trigraph sequence in that example starts at the second ?, which your code will never notice (and might not notice even if you fix the default case, depending on how you fix it). The intended translation is printf("Eh?\n");, after the trigraph replaces ??/ with a single backslash, which clearly must happen before the backslash escape is interpreted.

That is indeed one of the classic implementation errors which the GCC docs were alluding to.

2. Other implementation difficulties

Another common error, which your code won't suffer from if you perform the translation phases in order, is to fail to handle trigraphs in certain contexts. This typically happens in lexers which attempt to handle trigraphs (and line splicing) in parallel with the lexical decomposition, which I think was mentioned in a comment somewhere.

The argument, which I find reasonable, is that since neither trigraphs nor splices are very common at all [Note 1], even a very slow implementation in the lexer which only happens if necessary is probably a lot faster than wasting cycles on a translation pass which is practically never needed.

But some of the pathological cases, which no sensible programmer would ever actually employ, are very easy to get wrong. Consider the following:

/??/
* This is a comment
*/

In case it wasn't obvious, that's a comment because trigraph replacement is in phase 1; line splicing is in phase2; and comment recognition is in phase 3. After phase 1, the input is

/\
* This is a comment
*/

and after phase 2, it's

/* This is a comment
*/

That's easy to do if phase 1 and phase 2 are independent of the lexer. But the token regex which recognises /??/<newline>* as a comment starter and *??/<newline>/ as a comment terminator, along with all the other wacky variants, is not very pleasant to write [Note 2].

Finally, I know that it's tempting to slurp the entire input file into memory in order to parse it as a single contiguous string, but it's not really a very nice pattern. The very common antipattern of "seek to the end of the file and call tell to see how long the file is" is both an invitation to buffer overflow (the file could be longer when you actually read it, because it's still being written to) and a way of making it impossible to use a Unix pipe as an input source. But even if you read the file more carefully, you'll end up using a lot more memory than necessary, and you won't be able to start processing the source until it's all available. You might not care, and these days I'd probably accept that argument, but it's something to think about.

On the other hand, if you read the file buffer by buffer, you need to cope with the annoying case where the start of the trigraph is at the end of the buffer but the last character hasn't yet been read. Fortunately, both trigraphs and splices are short, so you can solve this issue with a small setaside buffer or even a very small state machine. But it's extra code which needs to be written and debugged.

3. Not exactly a code review

As written, the prototype

char* trigraph_replacement(char* tmp);

requires the argument to trigraph_replacement to be a mutable dynamically-allocated buffer whose ownership is transferred to the function, which ends up by freeing it and returning a new dynamically-allocated buffer.

That kind of ownership transfer is a recipe for disaster (particularly when it is undocumented). Unless you have a really good reason, functions which take string arguments should either leave them alone, or they should modify them in place. In-place modification is an efficiency hack; it's not always possible --indeed, it is often not possible-- but in this particular case, it is possible so you might want to consider it. Since the actual use of trigraphs is probably limited to the test cases you write yourself, there's really not much point burdening production code with a pointless of double allocation and copy of the entire input.

In-place modification is possible because all trigraph replacements replace three bytes with a single byte [Note 3]. Line-splicing is another process which always makes the output shorter, and with a bit of care you can do both of them at the same time. So if you document the fact that you are planning on modifying (but not freeing!) the input, it's completely reasonable to write the function as something like this:

void trigraph_replacement(char* code) {
  /* Trigraphs in `code` are modified in-place. */

  /* Use a look-up table rather than a massive case statement. This is
   * not necessarily faster, but the code is shorter.}
   */
  /* Designated initializers are C99; in C89 you'd need to write this out */
  static const char trigraphs[128] = {
    ['='] = '#',  ['('] = '[', ['/'] = '\\', [')'] = ']',
    ['\''] = '^', ['<'] = '{', ['!'] = '|',  ['>'] = '}',
    ['-'] = '~'
  };
  for (char* in = code, *out = code; ; ++in, ++out) {
    if (in[0] == '?' &&
        in[1] == '?' &&  /* Safe because code must be nul-terminated */
        in[2] > 0    &&  /* Ditto */
        in[2] < 128  &&
        trigraphs[in[2]] != 0) {
      *out = trigraphs[in[2]];
      in += 2;
    }
    else {
      *in = *out;
    }
    if (*in == 0) break;
  }
}

That loop is careful to only advance the input pointer over the ?? when it has verified that the ?? starts a trigraph. I made no attempt to special case the various possibilities in which you could copy and advance one or two characters, on the grounds that they don't happen often enough to justify the additional code complexity.

4. What if in-place modification isn't acceptable?

Not everyone like in-place modification (including me, sometimes), so it's worth thinking about an alternative. Whatever alternative you choose, it's important to think about (and document) the dynamic storage allocation strategy.

For example, it would be tempting to simply return the argument unmodified in the common case that no trigraphs were encountered. But that makes life difficult for the caller because they won't know whether the output string is an additional dynamically-allocated memory region, or the same one. So there is a good argument for unconditionally copying the input to a newly allocated output. (It's possible that the caller could take advantage of the fact that the copy is unconditional, for example by eliminating an unnecessary copy of the input buffer.)

However, the only mechanism C provides for computing the length of a copy is strlen, which requires a complete scan of the buffer (recalling that the buffer is the entire program, which could be quite large). It's actually highly likely that the caller knows how long the input it, since they must have acquired it from somewhere, and most input functions tell you how much data was read [Note 4]. So it's entirely reasonable to require that the caller tell you how long the input is, resulting in a prototype like

char* trigraph_replacement(const char* code, size_t codelen);

If for whatever reason, the caller doesn't know how long code is, they can call strlen, but there's no need to penalize calls which do know. Regardless, it's vital that enough space be allocated for the entire string including the terminating NUL character. So you will want to use malloc(codelen + 1) or calloc(1, codelen + 1). (If you're careful, you don't really need calloc.)

Notes

  1. The vast majority of line splices are surrounded by whitespace, and those can be trivially handled with a small modification to the whitespace regex pattern. The same goes for splices in string literals. Much more problematic are the splices which are in the middle of a token, particularly complicated tokens like pp-numbers, or the horrific case described below of a splice composed from a trigraph.

  2. It's not impossible, and I know some readers will take it as a challenge. That's cool. But are you sure that you got all the cases covered? I expect to see a lot of test cases :-)

  3. The characters into which the trigraphs are converted are all in the basic character set, which must be encoded into a single byte. See §5.2.1.2.

  4. fgets should not be used to read multiline input, since it involves an additional scan for newline characters. Aside from the fact that it doesn't tell you how many characters it read, leaving the necessity to do yet another scan for the null character.