How to iterate over non-English file names in PHP
This is not possible; it's a limitation of PHP (tested below with PHP 5.3.0). PHP calls the multibyte (non-Unicode) versions of the Windows APIs, so you're limited to the characters your code page can represent.
See this answer.
Directory contents:
D:\Users\Cataphract\Desktop\teste2>dir
 Volume in drive D is GRANDEDISCO
 Volume Serial Number is 945F-DB89

 Directory of D:\Users\Cataphract\Desktop\teste2

01-06-2010  17:16                   .
01-06-2010  17:16                   ..
01-06-2010  17:15                 0 coptic small letter shima follows ϭ.txt
01-06-2010  17:18                86 teste.php
               2 File(s)             86 bytes
               2 Dir(s)  12.178.505.728 bytes free
Test file contents:
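<?php
// Pause before the directory is read (for example, to attach a debugger).
exec('pause');
// Print each entry name exactly as PHP receives it from the file system.
foreach (new DirectoryIterator(".") as $v) {
    echo $v . "\n";
}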
Test file results:
.
..
coptic small letter shima follows ?.txt
teste.php
Debugger output:
Call stack (PHP 5.3.0):
> php5ts_debug.dll!readdir_r(DIR * dp=0x02f94068, dirent * entry=0x00a7e7cc, dirent * * result=0x00a7e7c0) Line 80 C
  php5ts_debug.dll!php_plain_files_dirstream_read(_php_stream * stream=0x02b94280, char * buf=0x02b9437c, unsigned int count=260, void * * * tsrm_ls=0x028a15c0) Line 820 + 0x17 bytes C
  php5ts_debug.dll!_php_stream_read(_php_stream * stream=0x02b94280, char * buf=0x02b9437c, unsigned int size=260, void * * * tsrm_ls=0x028a15c0) Line 603 + 0x1c bytes C
  php5ts_debug.dll!_php_stream_readdir(_php_stream * dirstream=0x02b94280, _php_stream_dirent * ent=0x02b9437c, void * * * tsrm_ls=0x028a15c0) Line 1806 + 0x16 bytes C
  php5ts_debug.dll!spl_filesystem_dir_read(_spl_filesystem_object * intern=0x02b94340, void * * * tsrm_ls=0x028a15c0) Line 199 + 0x20 bytes C
  php5ts_debug.dll!spl_filesystem_dir_open(_spl_filesystem_object * intern=0x02b94340, char * path=0x02b957f0, void * * * tsrm_ls=0x028a15c0) Line 238 + 0xd bytes C
  php5ts_debug.dll!spl_filesystem_object_construct(int ht=1, _zval_struct * return_value=0x02b91f88, _zval_struct * * return_value_ptr=0x00000000, _zval_struct * this_ptr=0x02b92028, int return_value_used=0, void * * * tsrm_ls=0x028a15c0, long ctor_flags=0) Line 645 + 0x11 bytes C
  php5ts_debug.dll!zim_spl_DirectoryIterator___construct(int ht=1, _zval_struct * return_value=0x02b91f88, _zval_struct * * return_value_ptr=0x00000000, _zval_struct * this_ptr=0x02b92028, int return_value_used=0, void * * * tsrm_ls=0x028a15c0) Line 658 + 0x1f bytes C
  php5ts_debug.dll!zend_do_fcall_common_helper_SPEC(_zend_execute_data * execute_data=0x02bc0098, void * * * tsrm_ls=0x028a15c0) Line 313 + 0x78 bytes C
  php5ts_debug.dll!ZEND_DO_FCALL_BY_NAME_SPEC_HANDLER(_zend_execute_data * execute_data=0x02bc0098, void * * * tsrm_ls=0x028a15c0) Line 423 C
  php5ts_debug.dll!execute(_zend_op_array * op_array=0x02b93888, void * * * tsrm_ls=0x028a15c0) Line 104 + 0x11 bytes C
  php5ts_debug.dll!zend_execute_scripts(int type=8, void * * * tsrm_ls=0x028a15c0, _zval_struct * * retval=0x00000000, int file_count=3, ...) Line 1188 + 0x21 bytes C
  php5ts_debug.dll!php_execute_script(_zend_file_handle * primary_file=0x00a7fad4, void * * * tsrm_ls=0x028a15c0) Line 2196 + 0x1b bytes C
  php.exe!main(int argc=2, char * * argv=0x028a14c0) Line 1188 + 0x13 bytes C
  php.exe!__tmainCRTStartup() Line 555 + 0x19 bytes C
  php.exe!mainCRTStartup() Line 371 C
Is it really a question mark?
dp->fileinfo {dwFileAttributes=32 ftCreationTime={...} ftLastAccessTime={...} ...}
    dwFileAttributes: 32
    ftCreationTime: {dwLowDateTime=2784934701 dwHighDateTime=30081445 }
    ftLastAccessTime: {dwLowDateTime=2784934701 dwHighDateTime=30081445 }
    ftLastWriteTime: {dwLowDateTime=2784934701 dwHighDateTime=30081445 }
    nFileSizeHigh: 0
    nFileSizeLow: 0
    dwReserved0: 3435973836
    dwReserved1: 3435973836
    cFileName: 0x02f9409c "coptic small letter shima follows ?.txt"
    cAlternateFileName: 0x02f941a0 "COPTIC~1.TXT"
dp->fileinfo.cFileName[34] 63 '?'
Yes! It's character #63.
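Since '?' is not a legal character in a Windows file name, a literal '?' in a name returned by DirectoryIterator is a strong hint that the real name could not be represented in the code page. The following is only a rough sketch of how a script might flag such entries. The helper names looksMangled() and findMangledNames() are made up for this illustration, and the check will miss names where an unrepresentable character was replaced by something other than '?'.

```php
<?php
// Hypothetical helper: true when a name read from the file system contains '?',
// which Windows does not allow in a real file name.
function looksMangled($name)
{
    return strpos($name, '?') !== false;
}

// Hypothetical helper: returns the entries of $dir whose names look mangled.
function findMangledNames($dir)
{
    $suspect = array();
    foreach (new DirectoryIterator($dir) as $entry) {
        if ($entry->isDot()) {
            continue; // skip "." and ".."
        }
        $name = $entry->getFilename();
        if (looksMangled($name)) {
            $suspect[] = $name;
        }
    }
    return $suspect;
}
```

Running findMangledNames(".") in the test directory above would be expected to report the Coptic file, but the script still has no way to open it under that name; the check only tells you which entries not to trust.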
Short reply:
Under Windows, you cannot access arbitrary file names with PHP; you are limited to file names that can be represented in the currently selected "code page" (see "Regional and Language Options": the "Format" panel and, on the "Administrative" tab, "Language for non-Unicode programs").
Longer reply:
Windows has used UTF-16 for file names since Windows 2000, but PHP communicates with the underlying file system as a "non-Unicode aware program". This means that a current "code page table" translates PHP strings to UTF-16 strings and vice versa. From PHP, the current code page can be retrieved with setlocale(), which returns a string of the form "language_country.codepage", for example:
setlocale(LC_CTYPE, 0) ==> "english_United States.1252"
where 1252 is the Windows code page currently selected in the Control Panel. File names retrieved from the file system are encoded using that code page, and file names generated by PHP must be encoded according to it. Things are further complicated by the fact that UTF-16 file names are translated to PHP strings using the "best-fit" code page mapping, that is, an approximate representation of the actual characters. You therefore cannot trust file names and paths retrieved from the file system, as they may be arbitrarily mangled.
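As a rough illustration of the point above, the sketch below pulls the code page number out of a locale string and tries to convert a UTF-8 name to that code page with iconv(). The helper names codePageFromLocale() and toCodePage() are invented for this example, and it assumes the iconv extension is available and accepts the "CP" prefix for the code page in question. It is not what PHP does internally.

```php
<?php
// Hypothetical helper: extracts the code page number from a locale string
// such as "english_United States.1252"; returns null if none is present.
function codePageFromLocale($locale)
{
    if (is_string($locale) && preg_match('/\.(\d+)$/', $locale, $m)) {
        return (int) $m[1];
    }
    return null;
}

// Hypothetical helper: converts a UTF-8 file name to the given Windows code page.
// Returns the converted bytes, or whatever iconv() reports on failure
// (normally false when a character has no equivalent in the target code page).
function toCodePage($utf8Name, $codePage)
{
    // '@' hides the notice iconv() raises for unconvertible input.
    return @iconv('UTF-8', 'CP' . $codePage, $utf8Name);
}
```

A script could call codePageFromLocale(setlocale(LC_CTYPE, '0')) and then pass each name it wants to create through toCodePage() before handing it to the file system. Note that a strict iconv() conversion is more conservative than the "best-fit" mapping Windows applies when reading names, so this only helps for names going from PHP to the file system, not for names coming back.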
References:
http://en.wikipedia.org/wiki/Windows_code_page What "Windows code pages" are.
https://bugs.php.net/bug.php?id=47096 More details about this issue.