bytes to string conversion with invalid characters
Solution 1:
The root problem with your approach is that the result of type converting []byte
to string
does not have any U+FFFD
s in it: this type-conversion only copies bytes from the source to the destination, verbatim.
Just as byte slices, strings in Go are not obliged to contain UTF-8-encoded text; they can contain any data, including opaque binary data which has nothing to do with text.
But some operations on strings—namely type-converting them to []rune
and iterating over them using range
—do interpret strings as UTF-8-encoded text.
That is precisely where you got tripped: your range
debugging loop attempted to interpret the string, and each time another attempt at decoding a properly encoded code point failed, range
yielded a replacement character, U+FFFD
.
To reiterate, the string obtained by the type-conversion does not contain the characters you wanted to get replaced by your regexp.
As to how to actually make a valid UTF-8-encoded string out of your data, you might employ a two-step process:
- Type-convert your byte slice to a string—as you already do.
- Use any means of interpreting a string as UTF-8—replacing U+FFFD which will dynamically appear during this process—as you're iterating.
Something like this:
var sb strings.Builder
for _, c := range string(b) {
if c == '\uFFFD' {
sb.WriteByte('.')
} else {
sb.WriteRune(c)
}
}
return sb.String()
A note on performance: since type-converting a []byte
to string
copies memory—because strings are immutable while slices are not—the first step with type-conversion might be a waste of resources for code dealing with large chunks of data and/or working in tight processing loops.
In this case, it may be worth using the DecodeRune
function of the encoding/utf8
package which works on byte slices.
An example from its docs can be easily adapted to work with the loop above.
See also: Remove invalid UTF-8 characters from a string
Solution 2:
@kostix answer is correct and explains very clearly the issue with scanning unicode runes from a string.
Just adding the following remark : if your intention is to view characters only in the ASCII range (printable characters < 127) and you don't really care about other unicode code points, you can be more blunt :
// create a byte slice with the same byte length as s
var bs = make([]byte, len(s))
// scan s byte by byte :
for i := 0; i < len(s); i++ {
switch {
case 32 <= s[i] && s[i] <= 126:
bs[i] = s[i]
// depending on your needs, you may also keep characters in the 0..31 range,
// like 'tab' (9), 'linefeed' (10) or 'carriage return' (13) :
// case s[i] == 9, s[i] == 10, s[i] == 13:
// bs[i] = s[i]
default:
bs[i] = '.'
}
}
fmt.Printf("rs: %s\n", bs)
playground
This function will give you something close to the "text" part of hexdump -C
.
Solution 3:
You may want to use strings.ToValidUTF8()
for this:
ToValidUTF8 returns a copy of the string s with each run of invalid UTF-8 byte sequences replaced by the replacement string, which may be empty.
It "seemingly" does exactly what you need. Testing it:
a := []byte{'a', 0xff, 0xaf, 'b', 0xbf}
s := strings.ToValidUTF8(string(a), ".")
fmt.Println(s)
Output (try it on the Go Playground):
a.b.
I wrote "seemingly" because as you can see, there's a single dot between a
and b
: because there may be 2 bytes, but a single invalid sequence.
Note that you may avoid the []byte
=> string
conversion, because there's a bytes.ToValidUTF8()
equivalent that operates on and returns a []byte
:
a := []byte{'a', 0xff, 0xaf, 'b', 0xbf}
a = bytes.ToValidUTF8(a, []byte{'.'})
fmt.Println(string(a))
Output will be the same. Try this one on the Go Playground.
If it bothers you that multiple (invalid sequence) bytes may be shrinked into a single dot, read on.
Also note that to inspect arbitrary byte slices that may or may not contain texts, you may simply use hex.Dump()
which generates an output like this:
a := []byte{'a', 0xff, 0xaf, 'b', 0xbf}
fmt.Println(hex.Dump(a))
Output:
00000000 61 ff af 62 bf |a..b.|
There's your expected output a..b.
with other (useful) data like the hex offset and hex representation of bytes.
To get a "better" picture of the output, try it with a little longer input:
a = []byte{'a', 0xff, 0xaf, 'b', 0xbf, 50: 0xff}
fmt.Println(hex.Dump(a))
00000000 61 ff af 62 bf 00 00 00 00 00 00 00 00 00 00 00 |a..b............|
00000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000030 00 00 ff |...|
Try it on the Go Playground.