Problem uploading a latin1 file in a utf-8 SAS

On a Linux system running SAS with ENCODING=UTF-8, I have this file:

more provab.txt
▒00CC00   ▒00CC00   S012016-10-04
▒00CC00   ▒00CC00   S012021-10-20

xxd provab.txt
0000000: b130 3043 4330 3020 2020 b130 3043 4330  .00CC00   .00CC0
0000010: 3020 2020 5330 3132 3031 362d 3130 2d30  0   S012016-10-0
0000020: 3420 0ab1 3030 4343 3030 2020 20b1 3030  4 ..00CC00   .00
0000030: 4343 3030 2020 2053 3031 3230 3231 2d31  CC00   S012021-1
0000040: 302d 3230 200a                           0-20 .
 
file -i provab.txt
provab.txt: text/plain; charset=iso-8859-1

If I load it in SAS without setting the encoding:

Filename inp "/mypath/provab.txt" ;
Data work.current_file;
Infile inp lrecl=33 DSD MISSOVER PAD firstObs=1;
Attrib campo_1 length=$10
format=$char10. informat=$char10. ;
Attrib campo_2 length=$10
format=$char10. informat=$char10.  ;
Attrib campo_3 length=$3
format=$char3. informat=$char3.   ;
Attrib chr_data length=$10
format=$char10. informat=$char10.      ;
  Input
        @1 campo_1 $char10.
        @11 campo_2 $char10.
        @21 campo_3 $char3.
        @24 chr_data $char10.
        ;
Run;

I get:

[screenshot of the resulting dataset, image not available]

If I set ENCODING= LATIN1:

Filename inp "/mypath/provab.txt" 
ENCODING= LATIN1
;
Data work.current_file;
Infile inp lrecl=33 DSD MISSOVER PAD firstObs=1;
Attrib campo_1 length=$10
format=$char10. informat=$char10. ;
Attrib campo_2 length=$10
format=$char10. informat=$char10.  ;
Attrib campo_3 length=$3
format=$char3. informat=$char3.   ;
Attrib chr_data length=$10
format=$char10. informat=$char10.      ;
  Input
        @1 campo_1 $char10.
        @11 campo_2 $char10.
        @21 campo_3 $char3.
        @24 chr_data $char10.
        ;
Run;

I get:

[screenshot of the resulting dataset, image not available]

As you can see, campo_1 and campo_2 are displayed correctly, but every "strange" character shifts the fields that follow it.

The chr_data field, for example, is shifted by 2 characters.

I get the same result with encoding='iso-8859-1'.

Converting the file with iconv -f ISO-8859-1 -t UTF-8 provab.txt >provau.txt and then loading provau.txt gives the same anomaly.

How can I solve it?

After the comment from @GiacomoCatenazzi, I guess that when you specify an encoding, the loader first does the charset conversion and only afterwards splits the data into fields. But since each non-ASCII Latin1 character becomes 2 bytes in UTF-8, the conversion breaks the fixed-width field layout.
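To make the shift concrete, here is a small Python sketch using the first record from the xxd dump in the question. It only illustrates the byte arithmetic; it assumes nothing about SAS internals beyond the guess above that columns are counted in bytes after conversion.

```python
# First record from the question's hex dump, without the trailing newline.
# 0xB1 is the "plus-minus" sign in ISO-8859-1.
line = b"\xb100CC00   \xb100CC00   S012016-10-04 "

# Same text re-encoded as UTF-8: each 0xB1 becomes the two bytes 0xC2 0xB1.
utf8 = line.decode("latin-1").encode("utf-8")

print(len(line), len(utf8))   # 34 36

# Column @24 with width 10 corresponds to the byte slice [23:33].
print(line[23:33])            # b'2016-10-04'
print(utf8[23:33])            # b'012016-10-'
```

The two extra bytes push the date two positions to the right, which matches the 2-character shift reported for chr_data.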

That means that you will have to do some non-trivial pre-processing:

1. split every line into fixed-size fields
2. convert each field to the UTF-8 charset
3. join the fields back together, padding each one so that it keeps its original width.

That is easy to do with a Python script and can probably be done with awk, but IMHO you will not find a ready-made product for it.
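A minimal sketch of such a script follows. The field widths come from the INPUT statement in the question; the function names `realign_line` and `convert_file` are made up for this example. It assumes that every field containing non-ASCII characters has enough trailing blanks to absorb the extra UTF-8 bytes, and it raises an error when that is not the case rather than silently truncating.

```python
# Field widths taken from the INPUT statement in the question: 10, 10, 3, 10.
WIDTHS = (10, 10, 3, 10)


def realign_line(raw: bytes, widths=WIDTHS) -> bytes:
    """Re-encode one Latin-1 record as UTF-8, keeping each field's byte width."""
    out = []
    pos = 0
    for width in widths:
        # 1. Cut the field out of the Latin-1 record (1 byte = 1 character).
        field = raw[pos:pos + width].decode("latin-1")
        pos += width
        # 2. Convert the field to UTF-8, dropping its trailing blanks first.
        encoded = field.rstrip(" ").encode("utf-8")
        if len(encoded) > width:
            raise ValueError(
                f"field {field!r} needs {len(encoded)} bytes, only {width} available"
            )
        # 3. Pad with blanks back to the original width, counted in bytes.
        out.append(encoded.ljust(width, b" "))
    return b"".join(out)


def convert_file(src: str, dst: str, widths=WIDTHS) -> None:
    """Rewrite every record of src into dst with byte-aligned UTF-8 fields."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for raw in fin:
            fout.write(realign_line(raw.rstrip(b"\r\n"), widths) + b"\n")
```

For example, `convert_file("provab.txt", "provau.txt")` would write a UTF-8 file in which the date still starts at byte 24. That file would then be read without the ENCODING= option, so that no second conversion takes place. Note that a field containing a non-ASCII character ends up with one fewer trailing blank per such character.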
