String concatenation containing Arabic and Western characters

I'm trying to concatenate several strings containing both Arabic and Western characters (mixed in the same string). The problem is that the result is a String that is most likely semantically correct but differs from what I want, because the Unicode Bidirectional Algorithm alters the displayed order of the characters. Basically, I just want to concatenate them as if they were all LTR, ignoring the fact that some are RTL: a sort of "agnostic" concatenation.

I'm not sure if I was clear in my explanation, but I don't think I can do it any better.

Hope someone can help me.

Kind regards,

Carlos Ferreira

BTW, the strings are being obtained from the database.

EDIT

[Image: the two input strings and the concatenated result]

The first 2 Strings are the strings I want to concatenate and the third is the result.

EDIT 2

Actually, the concatenated String is a little different from the one in the image; it got altered during the copy+paste. The 1 is after the first A, not immediately before the second A.


Solution 1:

You can embed bidi regions using Unicode format control code points:

  • Left-to-right embedding (U+202A)
  • Right-to-left embedding (U+202B)
  • Pop directional formatting (U+202C)

So in Java, to embed an RTL language like Arabic in an LTR language like English, you would do

myEnglishString + "\u202B" + myArabicString + "\u202C" + moreEnglish

and to do the reverse

myArabicString + "\u202A" + myEnglishString + "\u202C" + moreArabic

See Bidirectional General Formatting for more details, or the Unicode specification chapter on "Directional Formatting Codes" for the source material.
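
The two concatenations above can be wrapped in small helpers so the control characters are not scattered through the code. This is only a minimal sketch: the class name BidiEmbed and the methods ltr and rtl are made up for illustration, and the sample strings are placeholders rather than the data from the question.

```java
public final class BidiEmbed {
    // Directional formatting characters listed above.
    private static final char LRE = '\u202A'; // left-to-right embedding
    private static final char RLE = '\u202B'; // right-to-left embedding
    private static final char PDF = '\u202C'; // pop directional formatting

    private BidiEmbed() {}

    /** Wraps text so it is laid out as an embedded left-to-right run. */
    public static String ltr(String text) {
        return LRE + text + PDF;
    }

    /** Wraps text so it is laid out as an embedded right-to-left run. */
    public static String rtl(String text) {
        return RLE + text + PDF;
    }

    public static void main(String[] args) {
        // Placeholder Arabic word written with escapes; real data would come from the database.
        String arabic = "\u0645\u0631\u062D\u0628\u0627";
        String result = "Name: " + rtl(arabic) + " (id 1)";
        System.out.println(result);
    }
}
```

Each wrapped piece adds two invisible characters to the string, so strip them again if the result is later compared with or written back to the database values.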

Solution 2:

It's very likely that you need to insert Unicode directional formatting codes into your string to get it to display correctly. For details, see the Directional Formatting Codes section of the Unicode Bidirectional Algorithm specification.

The java.text.Bidi class may also help you determine the correct sequence, as it implements the Unicode Bidirectional Algorithm.
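
As a rough illustration of what java.text.Bidi reports, the sketch below lists the directional runs the algorithm finds in a string. The class name BidiInspect and its methods are hypothetical helpers; only the Bidi calls themselves come from the JDK. It assumes the base direction should default to left-to-right when the text has no strong directional character.

```java
import java.text.Bidi;

public final class BidiInspect {
    private BidiInspect() {}

    /** Returns true if the text contains both LTR and RTL runs. */
    public static boolean isMixed(String text) {
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        return bidi.isMixed();
    }

    /** Describes each run as "start-limit:direction", in logical order. */
    public static String describeRuns(String text) {
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bidi.getRunCount(); i++) {
            // Even embedding levels are left-to-right, odd levels are right-to-left.
            String direction = (bidi.getRunLevel(i) % 2 == 0) ? "LTR" : "RTL";
            sb.append(bidi.getRunStart(i)).append('-')
              .append(bidi.getRunLimit(i)).append(':')
              .append(direction).append(' ');
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Placeholder mixed string: Latin letters followed by Arabic letters.
        String mixed = "abc \u0645\u0631\u062D";
        System.out.println(isMixed(mixed));
        System.out.println(describeRuns(mixed));
    }
}
```

Knowing where the runs start and end shows which pieces of the concatenated string would need to be wrapped in directional formatting codes.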

Solution 3:

Concatenation is not changing the order of the code points. What's happening is that when the string is displayed, the renderer sees that it starts with a right-to-left script, so it lays it out right-to-left.
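
One way to check this is to print the code points of the concatenated string instead of looking at how it is rendered. The sketch below uses a hypothetical helper class, LogicalOrder, and placeholder strings; it only demonstrates that the stored (logical) order matches the order of concatenation.

```java
public final class LogicalOrder {
    private LogicalOrder() {}

    /** Formats every code point of s as U+XXXX, in stored (logical) order. */
    public static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String first = "A\u0645";  // Latin A followed by an Arabic letter
        String second = "1";
        // The dump lists the code points of first, then those of second,
        // regardless of how a bidi-aware display would arrange them.
        System.out.println(codePoints(first + second));
    }
}
```

If the dump shows the expected sequence, the data is intact and only the presentation needs adjusting, for example with the directional formatting codes described in the other answers.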