Character encoding problem with filenames - find broken filenames

Assuming you are using utf-8 encoding (the default in Ubuntu), this script should hopefully identify the broken filenames and rename them for you.

It works by running find in the C locale (plain ASCII) to locate filenames that contain unprintable characters. It then checks whether those bytes are valid utf-8. If they are not, it shows you the filename decoded with each of the encodings listed in the enc array, so you can pick the one that looks right and rename the file.
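
For example, a lone 0xE9 byte (é in latin1) is not valid utf-8, so the same iconv test the script uses rejects it, while decoding it as latin1 works. A quick sketch you can paste into a terminal:

printf '\xe9' | iconv -f utf8 -t utf8 >/dev/null 2>&1 || echo "not valid utf-8"
printf '\xe9\n' | iconv -f latin1 -t utf8    # prints: é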

latin1 was commonly used on older Linux systems, and windows-1252 is commonly used by Windows nowadays (I think). iconv -l will show you the full list of encodings it supports.
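
If you are not sure what to put in the enc array, you can filter the (long) output of iconv -l. Assuming GNU iconv, something like this narrows it down to the usual latin/windows suspects:

iconv -l | tr ' ' '\n' | grep -iE 'latin|8859|1252'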

#!/bin/bash

# List of encodings to try (max 10, since you pick one with a single digit below).
enc=( latin1 windows-1252 )

# read the filenames from fd 3 so that "read -p" further down can still prompt on stdin
while IFS= read -rd '' file <&3; do
    base=${file##*/} dir=${file%/*}

    # if converting from utf8 to utf8 succeeds, we'll assume the filename is ok.
    iconv -f utf8 <<< "$base" >/dev/null 2>&1 && continue

    # display the filename converted from each enc to utf8
    printf 'In %s:\n' "$dir/"
    for i in "${!enc[@]}"; do
        name=$(iconv -f "${enc[i]}" <<< "$base")
        printf '%2d - %-12s: %s\n' "$i" "${enc[i]}" "$name"
    done
    printf ' s - Skip\n'

    while true; do
        read -p "? " -n1 ans
        printf '\n'
        if [[ $ans = [0-9] && ${enc[ans]} ]]; then
            name=$(iconv -f "${enc[ans]}" <<< "$base")
            mv -iv "$file" "$dir/$name"
            break
        elif [[ $ans = [Ss] ]]; then
            break
        fi
    done
done 3< <(LC_ALL=C find . -depth -name "*[![:print:][:space:]]*" -print0)
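
If you want to preview which filenames the script will flag before it touches anything, you can run its find command on its own (the same pattern as on the last line above), from the directory you want to clean up:

LC_ALL=C find . -depth -name "*[![:print:][:space:]]*"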

Try this:

find / | grep -P "[\x80-\xFF]"

This will locate all non-ASCII characters in file and folder names and help you find the guilty culprits :P
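
On some systems grep -P refuses input that is not valid UTF-8 when run in a UTF-8 locale; forcing the C locale makes it match raw bytes instead. A variation of the same idea (assuming GNU find and grep), restricted to the current directory:

LC_ALL=C find . | LC_ALL=C grep -P "[\x80-\xFF]"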