Ruby: How to decompose only typographic ligatures in Unicode text?

I am looking for a way to normalize Unicode input text that includes typographic ligatures such as:

# Things to replace, for instance:
U+FB00 (ﬀ): ff
U+FB01 (ﬁ): fi
U+FB02 (ﬂ): fl
U+FB03 (ﬃ): ffi
U+FB04 (ﬄ): ffl
U+FB05 (ﬅ): st
U+FB06 (ﬆ): st

I want to keep all diacritics, punctuation, and other marks that could be decomposed but aren't typographic ligatures.

For instance, I would like to keep the trademark symbol or the ellipsis mark.

# Things to keep, for instance:
U+2122 (™): TM
U+2026 (…): ...
U+2120 (℠): SM
U+2121 (℡): TEL
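
To illustrate why plain compatibility normalization is too broad here, this minimal sketch uses String#unicode_normalize (available since Ruby 2.2). NFKC expands the trademark and ellipsis characters along with the ligature:

```ruby
# NFKC applies every compatibility decomposition, not only the ligature ones.
text = "\uFB01ne\u2122\u2026"   # "fi" ligature + "ne" + trademark + ellipsis
normalized = text.unicode_normalize(:nfkc)
puts normalized   # prints: fineTM...
```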

I have searched for a solution and found some related answers:

  • https://superuser.com/questions/669130/double-latin-letters-in-unicode-ligatures
  • Separating Unicode ligature characters

Is there a Ruby-specific way?


Solution 1:

My current hackish solution:

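  # Expands each listed ligature via NFKC compatibility normalization;
  # characters outside the list are left untouched.
  def self.remove_ligatures(input)
    # Build the character-class regex once and cache it.
    @@ligature_char_regex ||= /[#{ligature_chars.join}]/

    input.gsub(@@ligature_char_regex) { |c| c.unicode_normalize(:nfkc) }
  end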

This works, but it relies on a long, manually defined list of characters (see below) and might not be the fastest approach where performance is a concern.

  # Return the list of all characters which decompose
  # into multiple ASCII/accented characters
  #
  # Manually commented out those that are not typographic 
  # ligatures such as Trademark, Medical Doctor, CD
  #
  #  List from: https://superuser.com/questions/669130/double-latin-letters-in-unicode-ligatures
  def self.ligature_chars

    return [
      "\u0132", # (Ĳ): IJ
      "\u0133", # (ĳ): ij
      "\u01C7", # (Ǉ): LJ
      "\u01C8", # (ǈ): Lj
      "\u01C9", # (ǉ): lj
      "\u01CA", # (Ǌ): NJ
      "\u01CB", # (ǋ): Nj
      "\u01CC", # (ǌ): nj
      "\u01F1", # (Ǳ): DZ
      "\u01F2", # (ǲ): Dz
      "\u01F3", # (ǳ): dz
      "\u20A8", # (₨): Rs
      "\u2116", # (№): No
      # "\u2120", # (℠): SM
      # "\u2121", # (℡): TEL
      # "\u2122", # (™): TM
      "\u213B", # (℻): FAX
      "\u2161", # (Ⅱ): II
      "\u2162", # (Ⅲ): III
      "\u2163", # (Ⅳ): IV
      "\u2165", # (Ⅵ): VI
      "\u2166", # (Ⅶ): VII
      "\u2167", # (Ⅷ): VIII
      "\u2168", # (Ⅸ): IX
      "\u216A", # (Ⅺ): XI
      "\u216B", # (Ⅻ): XII
      "\u2171", # (ⅱ): ii
      "\u2172", # (ⅲ): iii
      "\u2173", # (ⅳ): iv
      "\u2175", # (ⅵ): vi
      "\u2176", # (ⅶ): vii
      "\u2177", # (ⅷ): viii
      "\u2178", # (ⅸ): ix
      "\u217A", # (ⅺ): xi
      "\u217B", # (ⅻ): xii
      "\u3250", # (㉐): PTE
      "\u32CC", # (㋌): Hg
      "\u32CD", # (㋍): erg
      "\u32CE", # (㋎): eV
      "\u32CF", # (㋏): LTD
      "\u3371", # (㍱): hPa
      "\u3372", # (㍲): da
      "\u3373", # (㍳): AU
      "\u3374", # (㍴): bar
      "\u3375", # (㍵): oV
      "\u3376", # (㍶): pc
      "\u3377", # (㍷): dm
      "\u337A", # (㍺): IU
      "\u3380", # (㎀): pA
      "\u3381", # (㎁): nA
      "\u3383", # (㎃): mA
      "\u3384", # (㎄): kA
      "\u3385", # (㎅): KB
      "\u3386", # (㎆): MB
      "\u3387", # (㎇): GB
      "\u3388", # (㎈): cal
      "\u3389", # (㎉): kcal
      "\u338A", # (㎊): pF
      "\u338B", # (㎋): nF
      "\u338E", # (㎎): mg
      "\u338F", # (㎏): kg
      "\u3390", # (㎐): Hz
      "\u3391", # (㎑): kHz
      "\u3392", # (㎒): MHz
      "\u3393", # (㎓): GHz
      "\u3394", # (㎔): THz
      "\u3396", # (㎖): ml
      "\u3397", # (㎗): dl
      "\u3398", # (㎘): kl
      "\u3399", # (㎙): fm
      "\u339A", # (㎚): nm
      "\u339C", # (㎜): mm
      "\u339D", # (㎝): cm
      "\u339E", # (㎞): km
      "\u33A9", # (㎩): Pa
      "\u33AA", # (㎪): kPa
      "\u33AB", # (㎫): MPa
      "\u33AC", # (㎬): GPa
      "\u33AD", # (㎭): rad
      "\u33B0", # (㎰): ps
      "\u33B1", # (㎱): ns
      "\u33B3", # (㎳): ms
      "\u33B4", # (㎴): pV
      "\u33B5", # (㎵): nV
      "\u33B7", # (㎷): mV
      "\u33B8", # (㎸): kV
      "\u33B9", # (㎹): MV
      "\u33BA", # (㎺): pW
      "\u33BB", # (㎻): nW
      "\u33BD", # (㎽): mW
      "\u33BE", # (㎾): kW
      "\u33BF", # (㎿): MW
      "\u33C3", # (㏃): Bq
      "\u33C4", # (㏄): cc
      "\u33C5", # (㏅): cd
      "\u33C8", # (㏈): dB
      "\u33C9", # (㏉): Gy
      "\u33CA", # (㏊): ha
      "\u33CB", # (㏋): HP
      "\u33CC", # (㏌): in
      "\u33CD", # (㏍): KK
      "\u33CE", # (㏎): KM
      "\u33CF", # (㏏): kt
      "\u33D0", # (㏐): lm
      "\u33D1", # (㏑): ln
      "\u33D2", # (㏒): log
      "\u33D3", # (㏓): lx
      "\u33D4", # (㏔): mb
      "\u33D5", # (㏕): mil
      "\u33D6", # (㏖): mol
      "\u33D7", # (㏗): PH
      "\u33D9", # (㏙): PPM
      "\u33DA", # (㏚): PR
      "\u33DB", # (㏛): sr
      "\u33DC", # (㏜): Sv
      "\u33DD", # (㏝): Wb
      "\u33FF", # (㏿): gal
      "\uFB00", # (ﬀ): ff
      "\uFB01", # (ﬁ): fi
      "\uFB02", # (ﬂ): fl
      "\uFB03", # (ﬃ): ffi
      "\uFB04", # (ﬄ): ffl
      "\uFB05", # (ﬅ): st
      "\uFB06", # (ﬆ): st
      # Code points above U+FFFF need the braced \u{...} escape in Ruby.
      # "\u{1F12D}", # (🄭): CD
      # "\u{1F12E}", # (🄮): WZ
      # "\u{1F14A}", # (🅊): HV
      # "\u{1F14B}", # (🅋): MV
      # "\u{1F14C}", # (🅌): SD
      # "\u{1F14D}", # (🅍): SS
      # "\u{1F14E}", # (🅎): PPV
      # "\u{1F14F}", # (🅏): WC
      # "\u{1F16A}", # (🅪): MC
      # "\u{1F16B}", # (🅫): MD
      "\u{1F190}", # (🆐): DJ
      "\u01C4", # (Ǆ): DŽ
      "\u01C5", # (ǅ): Dž
      "\u01C6", # (ǆ): dž
    ]

  end
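
One way to avoid the hand-maintained list, assuming only the Latin ligatures U+FB00 through U+FB06 matter, is to match that code point range directly. This is a minimal sketch rather than a drop-in replacement; LATIN_LIGATURES and expand_latin_ligatures are illustrative names:

```ruby
# Assumption: only the Latin ligatures in the Alphabetic Presentation Forms
# block (U+FB00..U+FB06) should be expanded; everything else is kept as-is.
LATIN_LIGATURES = /[\uFB00-\uFB06]/

def expand_latin_ligatures(input)
  # NFKC is applied only to the matched ligature, never to the whole string.
  input.gsub(LATIN_LIGATURES) { |c| c.unicode_normalize(:nfkc) }
end

puts expand_latin_ligatures("o\uFB03ce\u2122 \u2026")   # prints: office™ …
```

Since the regex is a constant, it is compiled once, and the trademark and ellipsis characters never reach unicode_normalize.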

Solution 2:

You can do that as follows.

h = { "\uFB00"=>"ff", "\uFB01"=>"fi", "\uFB02"=>"fl", "\uFB03"=>"ffi",
      "\uFB04"=>"ffl", "\uFB05"=>"st",  "\uFB06"=>"st" }
  #=> {"ﬀ"=>"ff", "ﬁ"=>"fi", "ﬂ"=>"fl", "ﬃ"=>"ffi",
  #    "ﬄ"=>"ffl", "ﬅ"=>"st", "ﬆ"=>"st"}
s = "A ﬀ or ﬀwas ﬁ seen™ ﬂ before ﬃ had ﬄ go ﬅ before ﬆ"
r = /\b(?:#{h.keys.join('|')})\b/
  #=> /\b(?:ﬀ|ﬁ|ﬂ|ﬃ|ﬄ|ﬅ|ﬆ)\b/
s.gsub(r, h)
  #=> "A ff or ﬀwas fi seen™ fl before ffi had ffl go st before st"

Notice that "ﬀ" in "ﬀwas" was not matched because of the word boundaries (\b) in the regular expression.
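
If ligatures inside words should be expanded too, the word-boundary anchors can be dropped. The following is a minimal sketch under that assumption; LIGATURES, LIGATURE_REGEX and expand_ligatures are illustrative names, not part of the answer above:

```ruby
# Mapping from each Latin ligature to its plain-letter expansion.
LIGATURES = {
  "\uFB00" => "ff", "\uFB01" => "fi", "\uFB02" => "fl", "\uFB03" => "ffi",
  "\uFB04" => "ffl", "\uFB05" => "st", "\uFB06" => "st"
}.freeze

# Regexp.union builds an alternation of the keys, with no \b anchors.
LIGATURE_REGEX = Regexp.union(LIGATURES.keys)

def expand_ligatures(text)
  # With a hash argument, gsub replaces each match with the hash value for it.
  text.gsub(LIGATURE_REGEX, LIGATURES)
end

puts expand_ligatures("\uFB00was o\uFB03ce\u2122")   # prints: ffwas office™
```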