Repair encoding of ID3 tags
You want Ex Falso, the tag editor included in the Quod Libet project. Picard (the MusicBrainz tagger) may use the same tagging library, but QL originated it.
In particular, you want the Mutagen tagging library, which supports ID3v2.4 (and by "support" I mean "enforce" ...militarily...). It is also excellent with character encodings, and it includes a basic scriptable command-line tagger (mid3v2). As far as your normalization step goes, Mutagen only ever saves tags as ID3v2.4. It is certainly capable of converting all text to UTF-8, but you may need to script that yourself (I believe the mid3v2 tool's default is to keep the current encoding where possible, and I don't know whether it can be told to re-save everything in a particular encoding). Mutagen is written in Python.
Ex Falso is a nice, clean GUI, and it supports most of the major retag-multiple-files features you'd expect. I don't think it does much in the way of internet lookups, and I don't know how it handles album artwork -- Quod Libet may support that, and Ex Falso could do it through a plugin if one exists, though I'm not sure one does. I've never needed that functionality myself -- I use Ex Falso and mid3v2 in concert to handle my retagging needs.
I don't think you're going to find a standalone application that will fix up your particular mixture of incorrectly-tagged encodings. Having cp1252, UTF-16 and GB18030 all in one collection is quite unusual, and I don't think any existing software will solve that automatically.
So I'd download Mutagen and write a custom Python script to automate your own decisions about how to fix up unknown encodings. For example:
musicroot = ur'C:\music\wonky'
tryencodings = 'gb18030', 'cp1252'

import os
import mutagen.id3

def findMP3s(path):
    for child in os.listdir(path):
        child = os.path.join(path, child)
        if os.path.isdir(child):
            for mp3 in findMP3s(child):
                yield mp3
        elif child.lower().endswith(u'.mp3'):
            yield child

for path in findMP3s(musicroot):
    id3 = mutagen.id3.ID3(path)
    for key, value in id3.items():
        # Only touch text frames that aren't already UTF-8 (encoding 3).
        if value.encoding != 3 and isinstance((getattr(value, 'text', None) or [None])[0], unicode):
            if value.encoding == 0:
                # Encode back to ISO-8859-1 to recover the original bytes,
                # then find the first candidate encoding that fits them.
                rawbytes = '\n'.join(value.text).encode('iso-8859-1')
                for encoding in tryencodings:
                    try:
                        rawbytes.decode(encoding)
                    except UnicodeError:
                        pass
                    else:
                        break
                else:
                    raise ValueError('None of the tryencodings work for %r key %r' % (path, key))
                for i in range(len(value.text)):
                    value.text[i] = value.text[i].encode('iso-8859-1').decode(encoding)
            value.encoding = 3
    id3.save()
The above script makes a few assumptions:
- Only the tags marked as being in encoding 0 are wrong. (Ostensibly encoding 0 is ISO-8859-1, but in practice it is often a Windows default code page.)
- If a tag is marked as UTF-16 (encoding 1 or 2) or UTF-8 (encoding 3), it's assumed to be correct, and is simply converted to UTF-8 if it isn't already. Personally I haven't seen ID3 tags marked as UTF (encodings 1-3) in error before. Luckily, encoding 0 is easy to recover to its original bytes, since ISO-8859-1 maps each of the 256 byte values directly onto the Unicode code point with the same ordinal.
- When an encoding 0 tag is met, the script attempts to decode it as GB18030 first, then falls back to code page 1252 if that isn't valid. Single-byte encodings like cp1252 will match almost any byte sequence, so it's best to put them at the end of the list of encodings to try.
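Both of the last two points are easy to demonstrate in plain Python; the bytes below are made-up examples, and guess_decode is an illustrative helper of mine, not part of Mutagen:

```python
# -*- coding: utf-8 -*-

def guess_decode(raw, encodings=('gb18030', 'cp1252')):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeError:
            pass
    raise ValueError('none of the candidate encodings fit %r' % (raw,))

# Encoding 0 is recoverable: ISO-8859-1 maps bytes 0x00-0xFF straight onto
# code points U+0000-U+00FF, so the round trip is lossless.
raw = u'你好'.encode('gb18030')       # b'\xc4\xe3\xba\xc3'
mojibake = raw.decode('iso-8859-1')   # garbage text, but reversible
assert mojibake.encode('iso-8859-1') == raw

# GB18030 is the stricter candidate, so it goes first...
assert guess_decode(raw) == (u'你好', 'gb18030')
# ...and bytes that are not valid GB18030 fall through to cp1252.
assert guess_decode(u'café'.encode('cp1252')) == (u'café', 'cp1252')
```

Trying the single-byte encoding first would never reach GB18030 at all, since cp1252 decodes nearly any byte sequence without complaint.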
If you have other encodings in the mix, such as cp1251 Cyrillic, or a lot of cp1252 strings with several accented characters in a row that get mistaken for GB18030, you'll need a cleverer guessing algorithm of some sort. Maybe look at the filename to guess what sort of characters are likely to be present?
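One crude way to use that filename hint is to compare the "CJK-ness" of each candidate decoding against the filename, so a run of accented cp1252 letters that happens to be valid GB18030 doesn't win just because GB18030 is tried first. This is a sketch under my own assumptions -- cjk_ratio and pick_decoding are invented names, and the filename is passed in already decoded:

```python
# -*- coding: utf-8 -*-

def cjk_ratio(text):
    """Fraction of characters in the main CJK ideograph block (a rough net)."""
    if not text:
        return 0.0
    cjk = sum(1 for ch in text if u'\u4e00' <= ch <= u'\u9fff')
    return float(cjk) / len(text)

def pick_decoding(raw, filename, encodings=('gb18030', 'cp1252')):
    """Prefer the candidate whose script matches the filename's."""
    want_cjk = cjk_ratio(filename) > 0.0
    candidates = []
    for enc in encodings:
        try:
            candidates.append((raw.decode(enc), enc))
        except UnicodeError:
            pass
    if not candidates:
        raise ValueError('no candidate encoding fits %r' % (raw,))
    for text, enc in candidates:
        if (cjk_ratio(text) > 0.5) == want_cjk:
            return text, enc
    return candidates[0]  # nothing matched the hint; fall back to list order

# b'\xe9\xe9' ('éé' in cp1252) is also a valid GB18030 byte pair, but an
# ASCII filename suggests Latin text, so cp1252 wins here:
assert pick_decoding(u'éé'.encode('cp1252'), u'blues.mp3') == (u'éé', 'cp1252')
# A CJK filename tips the balance the other way:
assert pick_decoding(u'音乐'.encode('gb18030'), u'音乐.mp3') == (u'音乐', 'gb18030')
```

A real heuristic would want a wider net of scripts (Cyrillic for cp1251, and so on), but the shape of the solution is the same: decode with every candidate, score each result for plausibility, and keep the best.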