How to ignore all whitespace characters and punctuations in snowflake

The below query working for one string. However when I run at whole table data it's not working

select
   lower( regexp_replace( nvl(column1,':'), '\\s+|[][!"#$%&\'()*+,.\\\\/:;<=>?@\^_`{|}~-]+', '')) as addres_line_1,
   column1 
from values('122 E 7th Street ');

output: 122e7thstreet

when I run similar query for the table, the white spaces are not fully removed. output: 122e7th street

table level query:

select concat(
        column1, ':',
        column2, ':',
        lower(regexp_replace(regexp_replace(nvl(ADDRESS_LINE_1,':'),'\\s|[][!"#$%&\'()*+,.\\\\/:;<=>?@\^_`{|}~-]+',''),'[ \t\r\n\v\f]+','')), ':',
        lower(regexp_replace(nvl(ADDRESS_LINE_2,':'),'\\s|[][!"#$%&\'()*+,.\\\\/:;<=>?@\^_`{|}~-]+','')), ':',
        lower(regexp_replace(nvl(ADDRESS_LINE_3,':'),'\\s|[][!"#$%&\'()*+,.\\\\/:;<=>?@\^_`{|}~-]+','')), ':',
        lower(regexp_replace(nvl(PRIMARY_TOWN,':'),'\\s|[][!"#$%&\'()*+,.\\\\/:;<=>?@\^_`{|}~-]+','')), ':',
        lower(regexp_replace(nvl(COUNTRY,':'),'\\s|[][!"#$%&\'()*+,.\\\\/:;<=>?@\^_`{|}~-]+','')), ':',
        lower(regexp_replace(nvl(TERRITORY_CODE,':'),'\\s|[][!"#$%&\'()*+,.\\\\/:;<=>?@\^_`{|}~-]+','')), ':',
        lower(regexp_replace(nvl(POSTAL_CODE,':'),'\\s|[][!"#$%&\'()*+,.\\\\/:;<=>?@\^_`{|}~-]+|\\s+','')), ':',
        lower(regexp_replace(nvl(COUNTRY_CODE,':'),'\\s|[][!"#$%&\'()*+,.\\\\/:;<=>?@\^_`{|}~-]+',''))
    ) as  ROLE_PLAYER_ADDRESS_HASH_KEY1 
from address

Solution 1:

Does using regex expressions like this one work work:

REGEXP_REPLACE(REGEXP_REPLACE(ADDRESS_LINE_1, '[^\\w]'),'_')

\w is any digit, letter, or underscore - hence the need for an outer REGEXP_REPLACE to remove underscores

Solution 2:

So you have three different regex,

you single demostraition regex is:

'\\s+|[][!"#$%&\'()*+,.\\\\/:;<=>?@\^_`{|}~-]+'

then you use:

'\\s|[][!"#$%&\'()*+,.\\\\/:;<=>?@\^_`{|}~-]+'

for the table except for POSTAL_CODE which uses:

'\\s|[][!"#$%&\'()*+,.\\\\/:;<=>?@\^_`{|}~-]+|\\s+'

Thus do they work equally well on you example input, or other inputs you see failure on?

select 
    '\\s+|[][!"#$%&\'()*+,.\\\\/:;<=>?@\^_`{|}~-]+' as reg_1,
    '\\s|[][!"#$%&\'()*+,.\\\\/:;<=>?@\^_`{|}~-]+' as reg_2,
    '\\s|[][!"#$%&\'()*+,.\\\\/:;<=>?@\^_`{|}~-]+|\\s+' as reg3,
    lower( regexp_replace( nvl(column1,':'), reg_1, '')) as addres_1,
    lower( regexp_replace( nvl(column1,':'), reg_2, '')) as addres_2,
    lower( regexp_replace( nvl(column1,':'), reg3, '')) as addres_3,
    column1 
from values
    ('122 E 7th Street '),
    ('122 E 7th Street  '),
    ('  122 E 7th Street  ')
    -- etc, etc
;   

alrighty, so the token you have is C2 A0 \u00a0 the no-break-space token.

select 
    '\\s+|[][!"#$%&\'()*+,.\\\\/:;<=>?@\^_`{|}~-]+' as reg_1,
    column1,
    regexp_replace( column1, reg_1, '') as out
from values
('122 e 7th\u00a0street')

gives:

REG_1: \s+|[][!"#$%&'()*+,.\\/:;<=>?@^_`{|}~-]+ 
COLUMN1: 122 e 7th street
OUT: 122e7th street

and if you copy that output the space has converted to a normal 0x20 space.

So now we know what the input data is we just need to match it.

so you can use the TRANSLATE function to remove the unicode character via

select 
    column1
    ,regexp_replace( column1, '\\s+', '') as r1
    ,translate(column1,'\u00a0',' ') as t1 as t1
    ,regexp_replace( t1, '\\s+', '') as r
from values ('122 e 7th\u00a0street');   

which means we can just put \u00a0 into the regex

regexp_replace( column1, '\\s+|\u00a0+', '')

works a charm!