Batch convert encoding in files
How can I batch-convert the encoding of all files in a directory (e.g. ANSI → UTF-8) with a command or tool?
For a single file an editor is enough, but how can I handle a large number of files at once?
Solution 1:
Cygwin or GnuWin32 provide Unix tools like iconv and dos2unix (and unix2dos). Under Unix/Linux/Cygwin, you'll want to use "windows-1252" as the encoding instead of ANSI (see below), unless you know your system uses a default codepage other than 1252, in which case you'll need to tell iconv the right codepage to translate from.
Convert from one encoding (-f) to the other (-t) with:
$ iconv -f windows-1252 -t utf-8 infile > outfile
Or in a find-all-and-conquer form:
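## this will replace the original files!
## find does not interpret '>' itself, so the redirection has to run inside a shell,
## and the output goes to a temporary file so the input is not truncated while it is being read
$ find . -name '*.txt' -exec sh -c 'iconv --verbose -f windows-1252 -t utf-8 "$1" > "$1.tmp" && mv "$1.tmp" "$1"' sh {} \;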
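Alternatively, if your iconv supports the -o option (the GNU one does):
## this will clobber the original files!
## reading and writing the same file is not guaranteed to be safe with every iconv version, so test on copies first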
$ find . -name '*.txt' -exec iconv --verbose -f windows-1252 -t utf-8 -o {} {} \;
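Since these commands replace the original files, a more defensive variant writes each converted file to a temporary name first and only replaces the original when iconv succeeds. The following is a minimal sketch, not a definitive implementation: the function name convert_tree, its argument order, and the .iconv.tmp suffix are illustrative assumptions, and it assumes every matching file really is in the source encoding.

```shell
#!/bin/sh
# convert_tree DIR PATTERN FROM TO
# Hypothetical helper: re-encode every regular file under DIR whose name
# matches PATTERN from encoding FROM to encoding TO.
convert_tree() {
    dir=$1 pattern=$2 from=$3 to=$4
    find "$dir" -type f -name "$pattern" -exec sh -c '
        from=$1 to=$2
        shift 2
        for f do
            tmp="$f.iconv.tmp"
            # Write to a temporary file so the input is never truncated mid-read.
            if iconv -f "$from" -t "$to" "$f" > "$tmp"; then
                mv "$tmp" "$f"
            else
                # Leave the original untouched if iconv reports an error.
                echo "skipped (conversion failed): $f" >&2
                rm -f "$tmp"
            fi
        done
    ' sh "$from" "$to" {} +
}
```

For example, `convert_tree . '*.txt' windows-1252 utf-8` does the same job as the commands above. Note that mv gives the converted file default permissions, and a file that is already UTF-8 may end up double-encoded, so try it on a copy of the directory first.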
This question has been asked many times on this site, so here's some additional information about "ANSI". In an answer to a related question, CesarB mentions:
There are several encodings which are called "ANSI" in Windows. In fact, ANSI is a misnomer. iconv has no way of guessing which you want.
The ANSI encoding is the encoding used by the "A" functions in the Windows API (the "W" functions use UTF-16). Which encoding it corresponds to usually depends on your Windows system language. The most common is CP 1252 (also known as Windows-1252). So, when your editor says ANSI, it means "whatever the API functions use as the default ANSI encoding", which is the default non-Unicode encoding used in your system (and thus usually the one which is used for text files).
The page he links to gives this historical tidbit (quoted from a Microsoft PDF) on the origins of CP 1252 and ISO-8859-1, another oft-used encoding:
[...] this comes from the fact that the Windows code page 1252 was originally based on an ANSI draft, which became ISO Standard 8859-1. However, in adding code points to the range reserved for control codes in the ISO standard, the Windows code page 1252 and subsequent Windows code pages originally based on the ISO 8859-x series deviated from ISO. To this day, it is not uncommon to have the development community, both within and outside of Microsoft, confuse the 8859-1 code page with Windows 1252, as well as see "ANSI" or "A" used to signify Windows code page support.
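The practical effect of that divergence can be checked with iconv itself. As a small sketch (the function name show_utf8_bytes is only illustrative), the snippet below interprets the single byte 0x80 in each encoding and prints the resulting UTF-8 bytes in hex. In Windows-1252 that byte is the euro sign, while ISO-8859-1 reserves it as a control code.

```shell
# show_utf8_bytes ENCODING
# Interpret the byte 0x80 as ENCODING and print its UTF-8 form as hex digits.
show_utf8_bytes() {
    printf '\200' | iconv -f "$1" -t utf-8 | od -An -tx1 | tr -d ' \n'
    echo
}

show_utf8_bytes windows-1252   # e282ac (U+20AC, euro sign)
show_utf8_bytes iso-8859-1     # c280   (U+0080, control code)
```

This is why picking ISO-8859-1 for a file that was really saved as Windows-1252 silently turns characters such as the euro sign into invisible control codes instead of raising an error.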
Solution 2:
With PowerShell you can do something like this:
Get-Content IN.txt | Out-File -encoding ENC -filepath OUT.txt
Here ENC is one of the encoding names Out-File accepts, such as unicode, ascii, utf8, or utf32. See 'help out-file' for the full list.
To convert all the *.txt files in a directory to UTF-8, do something like this:
foreach($i in ls -name DIR/*.txt) {
    Get-Content DIR/$i |
    Out-File -encoding utf8 -filepath DIR2/$i
}
which creates a converted version of each .txt file in DIR2.
To replace the files in all subdirectories, use:
foreach($i in ls -recurse -filter "*.java") {
$temp = Get-Content $i.fullname
Out-File -filepath $i.fullname -inputobject $temp -encoding utf8 -force
}
Solution 3:
The Wikipedia page on newlines has a section on conversion utilities.
Note that this converts line endings (LF to CRLF), not the character encoding. For that job, this seems to be your best bet using only tools that ship with Windows:
TYPE unix_file | FIND "" /V > dos_file
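If you are working in Cygwin or another Unix-like shell and dos2unix/unix2dos are not installed, the same line-ending conversion can be approximated with awk. This is only a sketch: the function names to_crlf and to_lf are made up for illustration, and both functions add a final newline if the input lacks one.

```shell
# to_crlf: read LF or CRLF text on stdin, write CRLF-terminated lines on stdout.
to_crlf() {
    awk '{ sub(/\r$/, ""); printf "%s\r\n", $0 }'
}

# to_lf: read LF or CRLF text on stdin, write LF-terminated lines on stdout.
to_lf() {
    awk '{ sub(/\r$/, ""); print }'
}
```

Usage would look like `to_crlf < unix_file > dos_file`, with the output written to a different file than the input.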