Batch script to replace PHP short open tags with <?php

I have a large collection of PHP files written over the years, and I need to replace all the short open tags with explicit open tags.

change "<?" into "<?php"

I think this regular expression will select them:

<\?(\s|\n|\t|[^a-zA-Z])

which takes care of cases like

<?//
<?/*

But I am not sure how to process a whole folder tree: detect the files with a .php extension, apply the regular expression, and save each file after it has been changed.

I have the feeling this can be pretty straightforward if you master the right tools. (There is an interesting hack in the sed manual: 4.3 Example/Rename files to lower case).

Maybe I'm wrong.
Or maybe this could be a one-liner?


Don't use regexps for parsing formal languages; you'll always run into inputs you did not anticipate, like:

<?
$bla = '?> now what? <?';
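To make the problem concrete, here is a minimal sketch that runs the regular expression from the question over exactly this snippet. The variable names and the replacement string '<?php$1' are assumptions for illustration; the question does not say which replacement it had in mind.

```php
<?php
// The problematic snippet from above, held in a nowdoc so nothing is interpolated.
$source = <<<'PHP'
<?
$bla = '?> now what? <?';
PHP;

// The pattern from the question; the replacement string is an assumption.
$fixed = preg_replace('/<\?(\s|\n|\t|[^a-zA-Z])/', '<?php$1', $source);

echo $fixed, "\n";
// Prints:
// <?php
// $bla = '?> now what? <?php';
```

The real open tag is converted, but so is the "<?" inside the string literal, which silently changes the program's data.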

It's safer to use a processor that knows about the structure of the language. For HTML, that would be an XML processor; for PHP, the built-in tokenizer extension. It has the T_OPEN_TAG parser token, which matches <?php, <? or <% (the ASP-style <% tags only exist before PHP 7), and T_OPEN_TAG_WITH_ECHO, which matches <?= or <%=. To replace all short open tags, find these tokens and replace T_OPEN_TAG with "<?php " and T_OPEN_TAG_WITH_ECHO with "<?php echo " (note the trailing space in both).
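As a quick way to see what the tokenizer reports, the following sketch lists only the open-tag tokens in a small sample. The sample string and the variable names are made up for illustration; it deliberately uses only <?php and <?=, which are recognized regardless of the short_open_tag setting on PHP 5.4 and later.

```php
<?php
// Illustrative sample source; not taken from the question.
$source = "<?php echo 1; ?>\n<?= 2 ?>\n";

$found = array();
foreach (token_get_all($source) as $token) {
    // Array tokens are array(token id, source text, line number).
    if (is_array($token) && in_array($token[0], array(T_OPEN_TAG, T_OPEN_TAG_WITH_ECHO), true)) {
        $found[] = token_name($token[0]);
        printf("line %d: %s\n", $token[2], token_name($token[0]));
    }
}
```

Because the tokenizer tracks strings and comments, a "<?" inside a string literal never shows up as one of these tokens.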

The implementation is left as an exercise for the reader :)

EDIT 1: ringmaster was kind enough to provide one.

EDIT 2: On systems with short_open_tag turned off in php.ini, <?, <%, and <?= won't be recognized by a replacement script. To make the script work on such systems, enable short_open_tag via a command-line option:

php -d short_open_tag=On short_open_tag_replacement_script.php

P.S. The manual page for token_get_all() and googling for creative combinations of tokenizer, token_get_all, and the parser token names might help.

P.P.S. See also "Regex to parse define() contents, possible?" here on SO.


If you're using the tokenizer option, this might be helpful:

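// Returns the source of $file with short open tags rewritten.
// short_open_tag must be enabled for the run, or "<?" is not tokenized as an open tag.
function replace_short_tags($file) {
 $content = file_get_contents($file);
 $tokens = token_get_all($content);
 $output = '';

 foreach($tokens as $token) {
  if(is_array($token)) {
   // Array tokens are array(token id, source text, line number).
   list($index, $code, $line) = $token;
   switch($index) {
    case T_OPEN_TAG_WITH_ECHO:
     $output .= '<?php echo ';
     break;
    case T_OPEN_TAG:
     // Keep existing "<?php" tags as they are, including their trailing whitespace.
     $output .= (stripos($code, '<?php') === 0) ? $code : '<?php ';
     break;
    default:
     $output .= $code;
     break;
   }
  }
  else {
   // Single-character tokens such as ";" or "{" come back as plain strings.
   $output .= $token;
  }
 }
 return $output;
}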

Note that the tokenizer will not properly tokenize short tags if short tags aren't enabled. That is, this code won't work under a configuration where short_open_tag is off. Either enable the setting for that run on the command line (-d short_open_tag=On), or run the conversion on a system where short tags work.
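For the "whole folder tree" part of the question, the tokenizer loop could be wrapped in a small driver along these lines. This is only a sketch: the function names convert_open_tags and convert_tree and the script name convert.php are hypothetical, and it overwrites files in place, so try it on a copy or under version control first.

```php
<?php
// Hypothetical helper: rewrite open tags in one PHP source string.
function convert_open_tags($source)
{
    $output = '';
    foreach (token_get_all($source) as $token) {
        if (!is_array($token)) {
            $output .= $token;          // single-character token such as ";"
            continue;
        }
        list($id, $text) = $token;
        if ($id === T_OPEN_TAG_WITH_ECHO) {
            $output .= '<?php echo ';
        } elseif ($id === T_OPEN_TAG && stripos($text, '<?php') !== 0) {
            $output .= '<?php ';        // short (or ASP-style) open tag
        } else {
            $output .= $text;           // everything else is copied verbatim
        }
    }
    return $output;
}

// Hypothetical driver: convert every *.php file below $root.
// Returns the paths of the files that were actually changed.
function convert_tree($root)
{
    $changed = array();
    $files = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($files as $file) {
        if (!$file->isFile() || strtolower($file->getExtension()) !== 'php') {
            continue;
        }
        $path = $file->getPathname();
        $old = file_get_contents($path);
        $new = convert_open_tags($old);
        if ($new !== $old) {
            file_put_contents($path, $new);
            $changed[] = $path;
        }
    }
    return $changed;
}

// Command-line entry point: the first argument is the root directory.
if (PHP_SAPI === 'cli' && isset($argv[1])) {
    foreach (convert_tree($argv[1]) as $path) {
        echo "converted: $path\n";
    }
}
```

Assuming the script is saved as convert.php, it could be run as: php -d short_open_tag=On convert.php /path/to/project. Files that contain no short tags are left untouched, since they are only written back when the converted source differs.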