Get Unicode characters with charcode values greater hex `FFFF`
Issue
The ChrW
charcode argument is a Long
that identifies a character, but doesn't allow values greater than 65535 (hex value &HFFFF
) -
see MS Help.
For instance Miscellaneous symbols and pictographs can be found at Unicode hex block 1F300-1F5FF
. So I didn't find any way to represent the proposed hex values of ►1F512
and 1F513
for a opened or closed padlock symbol
precisely in this charcode block, as of Course ChrW(&H1F512)
would result in an invalid procedure/argument call.
A recent answer found an possibly erratic alternative referring to a lower charcode
(via ChrW(&HE1F7)
and ChrW(&HE1F6)
), but I'm searching for a way to get the higher charcode representation.
Question
Is there a systematic way to express Unicode characters found in hexadecimal code blocks greater than FFFF
by means of VBA or a work around?
Solution 1:
Something like this should work. Most code I didn't write, but I knew what to look for. Basically map the Hex to the byte array equivalent, then get the string back.
Option Explicit
'Pulled from https://www.di-mgt.com.au/howto-convert-vba-unicode-to-utf8.html
''' Maps a character string to a UTF-16 (wide character) string
Private Declare PtrSafe Function MultiByteToWideChar Lib "kernel32" ( _
ByVal CodePage As Long, _
ByVal dwFlags As Long, _
ByVal lpMultiByteStr As LongPtr, _
ByVal cchMultiByte As Long, _
ByVal lpWideCharStr As LongPtr, _
ByVal cchWideChar As Long _
) As Long
' CodePage constant for UTF-8
Private Const CP_UTF8 = 65001
''' Return length of byte array or zero if uninitialized
Private Function BytesLength(abBytes() As Byte) As Long
' Trap error if array is uninitialized
On Error Resume Next
BytesLength = UBound(abBytes) - LBound(abBytes) + 1
End Function
''' Return VBA "Unicode" string from byte array encoded in UTF-8
Public Function Utf8BytesToString(abUtf8Array() As Byte) As String
Dim nBytes As Long
Dim nChars As Long
Dim strOut As String
Utf8BytesToString = ""
' Catch uninitialized input array
nBytes = BytesLength(abUtf8Array)
If nBytes <= 0 Then Exit Function
' Get number of characters in output string
nChars = MultiByteToWideChar(CP_UTF8, 0&, VarPtr(abUtf8Array(0)), nBytes, 0&, 0&)
' Dimension output buffer to receive string
strOut = String(nChars, 0)
nChars = MultiByteToWideChar(CP_UTF8, 0&, VarPtr(abUtf8Array(0)), nBytes, StrPtr(strOut), nChars)
Utf8BytesToString = Left$(strOut, nChars)
End Function
'Grabbed from https://stackoverflow.com/questions/28798759/how-convert-hex-string-into-byte-array-in-vb6
Private Function HexToBytes(ByVal HexString As String) As Byte()
'Quick and dirty hex String to Byte array. Accepts:
'
' "HH HH HH"
' "HHHHHH"
' "H HH H"
' "HH,HH, HH" and so on.
Dim Bytes() As Byte
Dim HexPos As Integer
Dim HexDigit As Integer
Dim BytePos As Integer
Dim Digits As Integer
ReDim Bytes(Len(HexString) \ 2) 'Initial estimate.
For HexPos = 1 To Len(HexString)
HexDigit = InStr("0123456789ABCDEF", _
UCase$(Mid$(HexString, HexPos, 1))) - 1
If HexDigit >= 0 Then
If BytePos > UBound(Bytes) Then
'Add some room, we'll add room for 4 more to decrease
'how often we end up doing this expensive step:
ReDim Preserve Bytes(UBound(Bytes) + 4)
End If
Bytes(BytePos) = Bytes(BytePos) * &H10 + HexDigit
Digits = Digits + 1
End If
If Digits = 2 Or HexDigit < 0 Then
If Digits > 0 Then BytePos = BytePos + 1
Digits = 0
End If
Next
If Digits = 0 Then BytePos = BytePos - 1
If BytePos < 0 Then
Bytes = "" 'Empty.
Else
ReDim Preserve Bytes(BytePos)
End If
HexToBytes = Bytes
End Function
Example call
Public Sub ExampleLock()
Dim LockBytes() As Byte
LockBytes = HexToBytes("F0 9F 94 92") ' Lock Hex representation, found by -->http://www.ltg.ed.ac.uk/~richard/utf-8.cgi
Sheets(1).Range("A1").Value = Utf8BytesToString(LockBytes) ' Output
End Sub
Here's what is outputting to A1.
Solution 2:
The function that works for Unicode characters outside the basic multilingual plane (BMP) is WorksheetFunction.Unichar()
. This example converts cells containing hexadecimal into their Unicode equivalent:
Sub Convert()
For i = 1 To Selection.Cells.Count
n = WorksheetFunction.Hex2Dec(Selection.Cells(i).Text)
Selection.Cells(i) = WorksheetFunction.Unichar(n)
Next
End Sub
Original selection before running macro:
After running macro:
If your Excel is older and WorksheetFunction
is not available, building UTF-16 surrogates manually works, too:
Sub Convert()
For i = 1 To Selection.Cells.Count
n = CLng("&H" + Selection.Cells(i).Text) 'Convert hexadecimal text to integer
If n < &H10000 Then 'BMP characters
Selection.Cells(i) = ChrW(n)
Else
'UTF-16 hi/lo surrogate conversion
'Algorithm:
'1. Code point - 10000h (max U+10FFFF give 9FFFF...20 bits)
'2. In binary, but 10 bits in first surrogate (x) and 10 in 2nd surrogate (y)
' 110110xxxxxxxxxx 110111yyyyyyyyyy
tmp = n - &H10000
h = &HD800 + Int(tmp / (2 ^ 10)) 'bitwise right shift by 10
l = &HDC00 + (tmp And &H3FF) 'bitwise AND of last 10 bits
Selection.Cells(i) = ChrW(h) + ChrW(l)
End If
Next
End Sub
Solution 3:
as an alternative to T.M.
Don't forget to add a reference to 'Microsoft HTML Object Library'
Function GetUnicode(CharCodeString As String) As String
Dim Doc As New HTMLDocument
Doc.body.innerHTML = "&#x" & CharCodeString & ";"
GetUnicode = Doc.body.innerText
End Function
Solution 4:
Following is a VBScript code that I use in ASP Classic.
As you'll notice, there's no type declarations, everything is variant.
What I intended to do with it was having a ChrU
and AscU
functions supporting characters beyond UCS-2 (Basic Multilingual Plane).
Since this is written for VBScript first, I think it's quite host-agnostic. Don't have one, but it should work on MACs too. Hope it helps.
Private Function RightShift(ByVal pVal, shift)
Dim i, nVal
For i = 1 To shift
nVal = (pVal And &H7FFFFFFF) \ 2
If nVal And &H80000000 Then nVal = nVal Or &HC0000000
pVal = nVal
Next
RightShift = pVal
End Function
Private Function LeftShift(ByVal pVal, shift)
Dim i, nVal
For i = 1 To shift
nVal = (pVal And &H3FFFFFFF) * 2
If pVal And &H40000000 Then
nVal = nVal Or &H80000000
End If
pVal = nVal
Next
LeftShift = nVal
End Function
Public Function ChrU(ByVal code)
Dim lo, hi ' to hold 16-bit surrogate pairs
code = Int(code)
If code <= 65535 Then
' code is in the UCS-2 range (a.k.a. Basic Multilingual Plane) which ChrW (and AscW) relies on.
' falling back to ChrW
ChrU = ChrW(code)
ElseIf code <= 1114111 Then ' code is in the Unicode range beyond UCS-2
code = code - &H10000
lo = ChrW(&HD800& Or RightShift(code, 10))
hi = ChrW(&HDC00& Or (code And &H3FF))
ChrU = Join(Array(lo, hi), "")
Else
Err.Raise 9, "ChrU", "Code point was out of range."
End If
End Function
Public Function AscU(str)
Dim lo, hi ' to hold 16-bit surrogate pairs
If Len(str) = 1 Then
AscU = AscW(str) And &HFFFF&
Else
Dim txt
txt = Left(str, 2)
lo = AscW(Mid(txt, 1, 1)) And &HFFFF&
hi = AscW(Mid(txt, 2, 1)) And &HFFFF&
If &HDC00& > hi Or hi > &HDFFF& Then
' hi surrogate is not valid
' assuming "str" is a Unicode (UCS-2) string of at least 2 characters
' returning first character's codepoint
' as Asc and AscW do
AscU = lo
Exit Function
End If
AscU = &H10000 + LeftShift(lo And &H3FF, 10) + (hi And &H3FF)
End If
End Function
Solution 5:
Work around via HTML
Just in addition to the valid solutions above: I found an easy work around using IE HTML content, as HTML is not distinguishing between lower and higher code block sets; the function below simply returns the interpreted inner html:
Example call writing a padlock symbol e.g. to cell A1
[A1] = GetUnicode("1F512")
[1] Function GetUnicode() - via InternetExplorer
Function GetUnicode$(ByVal CharCodeString$)
' Purpose: get Unicode character via any valid unprefixed hex code string
' Note: late bound InternetExplorer reference
Dim Ie As Object
Set Ie = CreateObject("InternetExplorer.Application")
With Ie
.Visible = False
.Navigate "about:blank"
.document.body.innerhtml = "&#x" & CharCodeString & ";" ' prefixing HTML code
GetUnicode = .document.body.innerhtml
.Quit
End With
End Function
[2a] Alternative Function GetUnicode() - via XMLDom (Edit 5/12 2019)
This represents a host agnostic approach using XMLDom. Citing Wikipedia
"The Document Object Model (DOM) is a cross-platform and language-independent application programming interface that treats an XML document as a tree structure wherein each node is an object representing a part of the document. "
Similar to the IE approach the Unicode entity consists of the numeric (hex) prefix &#x
+ num + ;
. Generally I love XML as it allows generally a more flexible coding via its individual node and sub-node references; this example only demonstrates the simplest way to give an idea.
Function getUnicode$(ByVal CharCodeString$)
' Purpose: get Unicode character via any valid unprefixed hex code string
' Note: late bound MSXML2 reference using XMLDom
Dim XmlString$
XmlString = "<?xml version=""1.0"" encoding=""UTF-8""?><root><symbol>&#x" _
& CharCodeString & ";</symbol></root>"
With CreateObject("MSXML2.DOMDocument.6.0")
.ValidateOnParse = True
.Async = False
If .LoadXML(XmlString) Then
getUnicode = .DocumentElement.SelectSingleNode("symbol").Text
End If
End With
End Function
[2b] Further approach using FilterXML
- late edit as of 12/29 2019
The WorksheetFunction FilterXML
added in Excel 2013 allows to reformulate and shorten the above code as follows:
Function getUnicode$(ByVal CharCodeString$)
' Purpose: get Unicode character via any valid unprefixed hex code string
' Note: the FilterXML() function was introduced by Version 2013
' Help: https://docs.microsoft.com/de-de/office/vba/api/excel.worksheetfunction.filterxml
Dim XmlString As String
XmlString = "<?xml version=""1.0"" encoding=""UTF-8""?><root><symbol>&#x" _
& CharCodeString & ";</symbol></root>"
getUnicode = Application.WorksheetFunction.FilterXML(XmlString, "//symbol")
End Function
Addendum (5/2 2021)
Using the above FilterXML()
function you can also omit the XML declaration <?xml version=""1.0"" encoding=""UTF-8""?>
:
Excel and generally any XML parser assumes that the encoding is UTF-8 or UTF-16 if an XML document lacks encoding specification (unless the encoding has already been determined by a higher protocol). So you could even truncate the above function to
Function getUnicode$(ByVal hex$)
getUnicode = Application.WorksheetFunction.FilterXML("<r><s>&#x" & hex & ";</s></r>", "//s")
End Function
Note that any node names can be used instead of the above tags <r>
(for <root>
) or s
(for <symbol>
).