Why is the zh (ʒ) sound so infrequent in English?
Solution 1:
I would say it is a combination of two factors, each of which shows up separately in other sounds with low token frequency.
/ʒ/ never developed in vocabulary inherited from Proto-Germanic
English is categorized as a Germanic language, which is a group of languages that are all thought to be descended from a common ancestor that we call Proto-Germanic. (The modern language called German is another one of these languages, but despite the name, Proto-Germanic has no special connection to German compared to English, Dutch, Swedish, or any other Germanic language.)
Many English words come from other sources, such as borrowing, but the inherited vocabulary—vocabulary that was transmitted from Proto-Germanic continuously, without passing between separate languages—constitutes a large portion of the most frequent and basic words in English.
There is no regular source of /ʒ/ in vocabulary inherited from Proto-Germanic. This is also the case for /oɪ/, the least common vowel sound.
The sound /dʒ/, another relatively infrequent consonant sound, is partly similar in that it has no regular source in word-initial position in inherited English vocabulary. (Word-initial /dʒ/ occurs in borrowed words, and also in words whose etymology is unclear or involves irregular or sporadic sound changes.)
The main source of /ʒ/ is a sequence, /zj/
Even in borrowed vocabulary, /ʒ/ almost always originates historically from a sequence /zj/. (There are exceptions to this generalization, but they occur in a small number of words of relatively low frequency such as garage and genre that can likely be ignored.) A sequence of sounds cannot be more frequent than either of its components, and so if the components were originally no more frequent than other single consonant sounds such as /n/, /l/, /v/, we would expect /zj/ (and therefore /ʒ/) to be less frequent than most single consonant sounds, and only as common as sequences like /nj/, /lj/, /vj/.
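The counting argument above (a sequence can never be more frequent than either of its parts) can be checked mechanically. Here is a minimal Python sketch of that check. The corpus is a made-up toy list of segment strings, not real frequency data, and the names `toy_corpus`, `count_segments`, and `count_sequence` are my own placeholders.

```python
from collections import Counter

# Hypothetical toy corpus: each "word" is a list of segment symbols.
# These strings are invented for illustration; they are not real English data.
toy_corpus = [
    ["v", "i", "z", "j", "o", "n"],
    ["z", "i", "l"],
    ["j", "u", "n"],
    ["n", "i", "l"],
    ["l", "i", "v"],
    ["m", "e", "z", "j", "u", "r"],
]

def count_segments(corpus):
    """Count how often each single segment occurs in the corpus."""
    counts = Counter()
    for word in corpus:
        counts.update(word)
    return counts

def count_sequence(corpus, first, second):
    """Count how often `first` is immediately followed by `second`."""
    total = 0
    for word in corpus:
        for a, b in zip(word, word[1:]):
            if a == first and b == second:
                total += 1
    return total

segments = count_segments(toy_corpus)
zj = count_sequence(toy_corpus, "z", "j")

# Every /zj/ token uses up one /z/ and one /j/, so this can never fail.
assert zj <= min(segments["z"], segments["j"])
```

In this toy data /z/ and /j/ are each as common as /n/, yet the sequence /zj/ is rarer than any of them, which is the situation the paragraph above describes.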
Furthermore, the /zj/ sequence that developed into /ʒ/ only occurred in the middle of words, so /ʒ/ is generally absent from monosyllables.
Other sounds in English that historically derive mostly from sequences, such as /ŋ/ (from the sequence /ng/, except for when it comes immediately before /g/ or /k/) or /ʃ/ (from *sk in native Germanic words, or from /sj/ in non-native Latinate words) are fairly low in frequency; however, these are probably not as infrequent as /ʒ/ because they do occur in native vocabulary, and they can occur in monosyllables (the *sk sequence that developed to /ʃ/ could occur word-initially, and the /ng/ sequence that developed to /ŋ/ could occur word-finally). Furthermore, ruakh pointed out in the comments that /ŋ/ occurs in the common suffix -ing, which will raise its frequency.
Solution 2:
I am by no means an expert on answering such questions, but here is my best shot:
In Old English, the voiced fricatives [z v ð] only occurred as allophones of /s f θ/ respectively. (I am not sure whether Old English /x/ had a voiced allophone, but that is not relevant here, so I will skip it.) Those voiced fricatives ([z v ð]) only occurred between vowels, or between a vowel and a liquid consonant (l or r), as @Foobie explained in their answer to my question. For example, sē stæf (“the staff”) was pronounced [seː ˈstæf], but þā stafas (“the staves”) was [θaː ˈsta.vas] (from Foobie's answer).
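To make the allophone rule concrete, here is a rough Python sketch of it. It is a deliberate simplification and my own assumption about how to model the rule: it ignores vowel length, stress, and syllable boundaries, treats each character as one segment, and the names `surface_form`, `VOWELS`, and `LIQUIDS` are just placeholders.

```python
# Simplified segment classes (an assumption; vowel length is ignored).
VOWELS = set("aeiouyæ")
LIQUIDS = set("lr")
VOICED_COUNTERPART = {"f": "v", "s": "z", "θ": "ð"}

def surface_form(underlying):
    """Voice /f s θ/ when both neighbours are vowels or liquids."""
    segments = list(underlying)
    voicing_context = VOWELS | LIQUIDS
    result = []
    for i, seg in enumerate(segments):
        # Word-initial and word-final fricatives have no segment on one side,
        # so they stay voiceless.
        is_medial = 0 < i < len(segments) - 1
        if (
            seg in VOICED_COUNTERPART
            and is_medial
            and segments[i - 1] in voicing_context
            and segments[i + 1] in voicing_context
        ):
            result.append(VOICED_COUNTERPART[seg])
        else:
            result.append(seg)
    return "".join(result)

print(surface_form("stæf"))    # final f stays voiceless
print(surface_form("stafas"))  # medial f is voiced, edge s is not
```

Because the voiced variant is fully predictable from its surroundings in a model like this, the voiced and voiceless sounds never contrast, which is what makes them allophones rather than separate phonemes.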
Old English did have the phoneme /ʃ/, but it did not become voiced when it occurred in the same environment (between vowels) as the other fricative consonants. Much of the core English vocabulary comes from Old English, and Old English did not have the phoneme /ʒ/, so there are no "native" English words with /ʒ/. That might be one of the reasons it is somewhat rare.
Later, in Middle English, the borrowing of Norman French words meant that the voiceless fricatives [s f θ] and the voiced [z v ð] came to appear in the same environments, where swapping one for the other could change the meaning of a word. For example, suppose sit and zit meant different things. This did not happen in Old English, so there the voiced sounds were only allophones. Once the voiced and voiceless sounds could appear in the same environment and distinguish one word from another, the voiced sounds became separate phonemes.
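The test being described here is the minimal pair: two words that differ in exactly one sound. Below is a small Python sketch of that test. The mini-lexicon and its rough Modern English transcriptions are my own illustrative choices, and `minimal_pairs` is a hypothetical helper, not a standard tool.

```python
# Hypothetical mini-lexicon: spelling -> rough phonemic transcription.
lexicon = {
    "feel": ("f", "iː", "l"),
    "veal": ("v", "iː", "l"),
    "seal": ("s", "iː", "l"),
    "zeal": ("z", "iː", "l"),
    "fill": ("f", "ɪ", "l"),
}

def minimal_pairs(lexicon, sound_a, sound_b):
    """Return word pairs that differ only by sound_a versus sound_b."""
    pairs = []
    words = sorted(lexicon)
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            p1, p2 = lexicon[w1], lexicon[w2]
            if len(p1) != len(p2):
                continue
            # Collect the positions where the two transcriptions disagree.
            diffs = [(a, b) for a, b in zip(p1, p2) if a != b]
            if len(diffs) == 1 and set(diffs[0]) == {sound_a, sound_b}:
                pairs.append((w1, w2))
    return pairs

print(minimal_pairs(lexicon, "f", "v"))
print(minimal_pairs(lexicon, "s", "z"))
```

Note that veal and zeal are both borrowings from French, which fits the point above: borrowed words put the voiced sounds into positions where they could contrast with the voiceless ones.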
/ʒ/ mostly occurs in words that have been borrowed from French. The fact that /ʒ/ is a separate phoneme in Modern English suggests that, at some point, it came to occur in the same environment as another sound and to distinguish one word from another.