bytes to string conversion with invalid characters

Solution 1:

The root problem with your approach is that the result of type converting []byte to string does not have any U+FFFDs in it: this type-conversion only copies bytes from the source to the destination, verbatim.
Just like byte slices, strings in Go are not obliged to contain UTF-8-encoded text; they can contain any data, including opaque binary data which has nothing to do with text.

But some operations on strings (namely type-converting them to []rune and iterating over them using range) do interpret strings as UTF-8-encoded text. That is precisely where you got tripped up: your range debugging loop attempted to interpret the string, and each time an attempt at decoding a properly encoded code point failed, range yielded a replacement character, U+FFFD.
To reiterate, the string obtained by the type-conversion does not contain the characters you wanted to get replaced by your regexp.
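The difference can be checked directly. The following is a minimal sketch, assuming a made-up five-byte input; countReplacementRunes is a hypothetical helper name, not something from the question:

```go
package main

import (
	"fmt"
	"strings"
)

// countReplacementRunes (hypothetical helper) reports how many times
// ranging over s yields U+FFFD.
func countReplacementRunes(s string) int {
	n := 0
	for _, r := range s {
		if r == '\uFFFD' {
			n++
		}
	}
	return n
}

func main() {
	// Assumed sample input: three bytes that are not valid UTF-8.
	b := []byte{'a', 0xff, 0xaf, 'b', 0xbf}
	s := string(b)

	fmt.Println(len(s))                        // 5: the bytes were copied verbatim
	fmt.Println(strings.Contains(s, "\uFFFD")) // false: no encoded U+FFFD is stored in s
	fmt.Println(countReplacementRunes(s))      // 3: range reports one U+FFFD per invalid byte
}
```

A regexp or strings.Replace that searches for U+FFFD looks at the stored bytes, which is why it finds nothing here.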

As to how to actually make a valid UTF-8-encoded string out of your data, you might employ a two-step process:

1. Type-convert your byte slice to a string, as you already do.
2. Interpret the string as UTF-8 by any means (for example, by iterating over it with range), replacing each U+FFFD that appears during decoding.

Something like this:

var sb strings.Builder
for _, c := range string(b) {
  if c == '\uFFFD' {
    sb.WriteByte('.')
  } else {
    sb.WriteRune(c)
  }
}
return sb.String()

A note on performance: since type-converting a []byte to string copies memory—because strings are immutable while slices are not—the first step with type-conversion might be a waste of resources for code dealing with large chunks of data and/or working in tight processing loops.
In this case, it may be worth using the DecodeRune function of the unicode/utf8 package, which works on byte slices. An example from its docs can easily be adapted to work with the loop above.
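One possible adaptation is sketched below. It assumes DecodeRune from the standard unicode/utf8 package; sanitize is a hypothetical name chosen for illustration. Checking the returned size lets the loop tell a decoding error (size 1) apart from a U+FFFD that is genuinely encoded in the input (size 3), which the range-based loop above cannot do.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sanitize (hypothetical helper) decodes b as UTF-8 without first
// converting it to a string, writing '.' for every byte that does not
// start a valid encoding.
func sanitize(b []byte) string {
	var sb strings.Builder
	for len(b) > 0 {
		r, size := utf8.DecodeRune(b)
		if r == utf8.RuneError && size == 1 {
			// Decoding failed: one invalid byte is consumed.
			sb.WriteByte('.')
		} else {
			// Valid rune, including a properly encoded U+FFFD.
			sb.WriteRune(r)
		}
		b = b[size:]
	}
	return sb.String()
}

func main() {
	fmt.Println(sanitize([]byte{'a', 0xff, 0xaf, 'b', 0xbf})) // a..b.
}
```

Each invalid byte produces its own dot here, so the output keeps one character per offending byte.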

See also: Remove invalid UTF-8 characters from a string

Solution 2:

@kostix's answer is correct and explains the issue with scanning Unicode runes from a string very clearly.

Just adding the following remark: if your intention is to view characters only in the ASCII range (printable characters < 127) and you don't really care about other Unicode code points, you can be more blunt:

// create a byte slice with the same byte length as s
var bs = make([]byte, len(s))

// scan s byte by byte :
for i := 0; i < len(s); i++ {
    switch {
    case 32 <= s[i] && s[i] <= 126:
        bs[i] = s[i]

    // depending on your needs, you may also keep characters in the 0..31 range,
    // like 'tab' (9), 'linefeed' (10) or 'carriage return' (13) :
    // case s[i] == 9, s[i] == 10, s[i] == 13:
    //   bs[i] = s[i]

    default:
        bs[i] = '.'
    }
}


fmt.Printf("rs: %s\n", bs)

This function will give you something close to the "text" part of hexdump -C.

Solution 3:

You may want to use strings.ToValidUTF8() for this:

ToValidUTF8 returns a copy of the string s with each run of invalid UTF-8 byte sequences replaced by the replacement string, which may be empty.

It "seemingly" does exactly what you need. Testing it:

a := []byte{'a', 0xff, 0xaf, 'b', 0xbf}
s := strings.ToValidUTF8(string(a), ".")
fmt.Println(s)

Output (try it on the Go Playground):

a.b.

I wrote "seemingly" because, as you can see, there's only a single dot between a and b: the two invalid bytes 0xff and 0xaf form a single run of invalid bytes, and each run is replaced only once.

Note that you may avoid the []byte => string conversion, because there's a bytes.ToValidUTF8() equivalent that operates on and returns a []byte:

a := []byte{'a', 0xff, 0xaf, 'b', 0xbf}
a = bytes.ToValidUTF8(a, []byte{'.'})
fmt.Println(string(a))

Output will be the same. Try this one on the Go Playground.

If it bothers you that multiple invalid bytes may be collapsed into a single dot, read on.

Also note that to inspect arbitrary byte slices that may or may not contain text, you may simply use hex.Dump(), which generates output like this:

a := []byte{'a', 0xff, 0xaf, 'b', 0xbf}
fmt.Println(hex.Dump(a))

Output:

00000000  61 ff af 62 bf                                    |a..b.|

There's your expected output a..b. with other (useful) data like the hex offset and hex representation of bytes.

To get a "better" picture of the output, try it with a slightly longer input:

a = []byte{'a', 0xff, 0xaf, 'b', 0xbf, 50: 0xff}
fmt.Println(hex.Dump(a))

00000000  61 ff af 62 bf 00 00 00  00 00 00 00 00 00 00 00  |a..b............|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000030  00 00 ff                                          |...|

Try it on the Go Playground.
