How to ignore all whitespace characters and punctuations in snowflake
The below query working for one string. However when I run at whole table data it's not working
select
lower( regexp_replace( nvl(column1,':'), '\\s+|[][!"#$%&\'()*+,.\\\\/:;<=>?@\^_`{|}~-]+', '')) as addres_line_1,
column1
from values('122 E 7th Street ');
output: 122e7thstreet
when I run similar query for the table, the white spaces are not fully removed.
output: 122e7th street
table level query:
select concat(
column1, ':',
column2, ':',
lower(regexp_replace(regexp_replace(nvl(ADDRESS_LINE_1,':'),'\\s|[][!"#$%&\'()*+,.\\\\/:;<=>?@\^_`{|}~-]+',''),'[ \t\r\n\v\f]+','')), ':',
lower(regexp_replace(nvl(ADDRESS_LINE_2,':'),'\\s|[][!"#$%&\'()*+,.\\\\/:;<=>?@\^_`{|}~-]+','')), ':',
lower(regexp_replace(nvl(ADDRESS_LINE_3,':'),'\\s|[][!"#$%&\'()*+,.\\\\/:;<=>?@\^_`{|}~-]+','')), ':',
lower(regexp_replace(nvl(PRIMARY_TOWN,':'),'\\s|[][!"#$%&\'()*+,.\\\\/:;<=>?@\^_`{|}~-]+','')), ':',
lower(regexp_replace(nvl(COUNTRY,':'),'\\s|[][!"#$%&\'()*+,.\\\\/:;<=>?@\^_`{|}~-]+','')), ':',
lower(regexp_replace(nvl(TERRITORY_CODE,':'),'\\s|[][!"#$%&\'()*+,.\\\\/:;<=>?@\^_`{|}~-]+','')), ':',
lower(regexp_replace(nvl(POSTAL_CODE,':'),'\\s|[][!"#$%&\'()*+,.\\\\/:;<=>?@\^_`{|}~-]+|\\s+','')), ':',
lower(regexp_replace(nvl(COUNTRY_CODE,':'),'\\s|[][!"#$%&\'()*+,.\\\\/:;<=>?@\^_`{|}~-]+',''))
) as ROLE_PLAYER_ADDRESS_HASH_KEY1
from address
Solution 1:
Does using regex expressions like this one work work:
REGEXP_REPLACE(REGEXP_REPLACE(ADDRESS_LINE_1, '[^\\w]'),'_')
\w is any digit, letter, or underscore - hence the need for an outer REGEXP_REPLACE to remove underscores
Solution 2:
So you have three different regex,
you single demostraition regex is:
'\\s+|[][!"#$%&\'()*+,.\\\\/:;<=>?@\^_`{|}~-]+'
then you use:
'\\s|[][!"#$%&\'()*+,.\\\\/:;<=>?@\^_`{|}~-]+'
for the table except for POSTAL_CODE
which uses:
'\\s|[][!"#$%&\'()*+,.\\\\/:;<=>?@\^_`{|}~-]+|\\s+'
Thus do they work equally well on you example input, or other inputs you see failure on?
select
'\\s+|[][!"#$%&\'()*+,.\\\\/:;<=>?@\^_`{|}~-]+' as reg_1,
'\\s|[][!"#$%&\'()*+,.\\\\/:;<=>?@\^_`{|}~-]+' as reg_2,
'\\s|[][!"#$%&\'()*+,.\\\\/:;<=>?@\^_`{|}~-]+|\\s+' as reg3,
lower( regexp_replace( nvl(column1,':'), reg_1, '')) as addres_1,
lower( regexp_replace( nvl(column1,':'), reg_2, '')) as addres_2,
lower( regexp_replace( nvl(column1,':'), reg3, '')) as addres_3,
column1
from values
('122 E 7th Street '),
('122 E 7th Street '),
(' 122 E 7th Street ')
-- etc, etc
;
alrighty, so the token you have is C2 A0 \u00a0
the no-break-space token.
select
'\\s+|[][!"#$%&\'()*+,.\\\\/:;<=>?@\^_`{|}~-]+' as reg_1,
column1,
regexp_replace( column1, reg_1, '') as out
from values
('122 e 7th\u00a0street')
gives:
REG_1: \s+|[][!"#$%&'()*+,.\\/:;<=>?@^_`{|}~-]+
COLUMN1: 122 e 7th street
OUT: 122e7th street
and if you copy that output the space has converted to a normal 0x20 space.
So now we know what the input data is we just need to match it.
so you can use the TRANSLATE function to remove the unicode character via
select
column1
,regexp_replace( column1, '\\s+', '') as r1
,translate(column1,'\u00a0',' ') as t1 as t1
,regexp_replace( t1, '\\s+', '') as r
from values ('122 e 7th\u00a0street');
which means we can just put \u00a0 into the regex
regexp_replace( column1, '\\s+|\u00a0+', '')
works a charm!