relative size of most factors of semiprimes, close?

when chatting about RSA a cohort just asserted something like

"most prime factors of semiprimes are roughly the same size" measured in bits.

ie "bits" is the number of digits in the base2 representations of the two primes. am skeptical myself. in other words, eg if a program filters the semiprimes from $1..n$ and then looks at the size of factors, the assertion is most will be approximately equal. his "proof" sketch was brief/hazy and unconvincing. am skeptical. have a different picture of this. (think it might be closer to "many"...)

is this true? what is a proof sketch? is this connected to any number theory or area of study? if incorrect is there some other similar assertion that is correct?


Asymptotically, the proportion of semiprimes $\leqslant N$ where the two factors have roughly the same size is negligible, if we say that two primes $p \leqslant q$ have roughly the same size if there is a constant $\alpha > 1$ such that $q \leqslant p^\alpha$.

The number of semiprimes $\leqslant N$ is

$$\pi_{1/2}(N) = \sum_{p_k \leqslant \sqrt{N}} \left(\pi\left(\frac{N}{p_k}\right) - (k-1)\right),\tag{1}$$

and the number of semiprimes with two factors of $\alpha$-roughly the same size is

$$\pi_{1/2;\,\alpha}(N) = \sum_{p_k \leqslant N^{1/(\alpha+1)}} \left(\pi(p_k^\alpha) - (k-1)\right) + \sum_{N^{1/(\alpha+1)} \leqslant p_k \leqslant \sqrt{N}} \left(\pi\left(\frac{N}{p_k}\right) - (k-1)\right).\tag{2}$$

The sum of the subtracted terms is the same in both,

$$\frac{\pi(\sqrt{N})(\pi(\sqrt{N}-1)}{2} \sim \frac{2N}{(\log N)^2}.$$

In the first sum of $(2)$, we can estimate the contribution of the $\pi(p_k^\alpha)$ as

$$\sum_{p_k \leqslant N^{1/(\alpha+1)}} \pi(p_k^\alpha) \leqslant \pi(N^{1/(\alpha+1)})\cdot \pi(N^{\alpha/(\alpha+1)}) \sim \frac{(\alpha+1)^2 N}{\alpha(\log N)^2},$$

so it remains to consider

$$\sum_{A \leqslant p_k \leqslant B} \pi\left(\frac{N}{p_k}\right),\tag{3}$$

where here $B = \sqrt{N}$, and $A$ is either $2$ (for $(1)$), or $N^{1/(\alpha+1)}$ (for $(2)$).

To obtain the asymptotic behaviour, we don't need a too good approximation, $\pi(x) \sim x/\log x$ is sufficient. So

$$\begin{align} \sum_{A \leqslant p_k \leqslant B} \pi\left(\frac{N}{p_k}\right) &\approx \sum_{A\leqslant p_k\leqslant B} \frac{N}{p_k(\log N - \log p_k)}\\ &= \frac{N}{\log N} \sum_{A\leqslant p_k\leqslant B} \frac{1}{p_k\left(1 - \frac{\log p_k}{\log N}\right)}. \end{align}$$

We have $0 < \log p_k \leqslant \frac12\log N$, and for $0 \leqslant x \leqslant \frac12$, we have $1+x \leqslant \frac{1}{1-x}\leqslant 1+x+2x^2$. Using further $p_k \sim k\cdot \log k$, the major term in the sum is

$$\begin{align} \sum_{A \leqslant p_k \leqslant B} \frac{1}{p_k} &\sim \sum_{A\leqslant p_k\leqslant B} \frac{1}{k\log k}\\ &\sim \log \log \pi(B) - \log \log \pi(A). \end{align}$$

For $(1)$, the sum is $\log \log N + O(1)$, while for $(2)$, the sum is $\log \frac{\alpha+1}{2} + O(1)$. For the next largest term, we get

$$\frac{1}{\log N}\sum_{A\leqslant p_k\leqslant B} \frac{\log p_k}{p_k} \approx \frac{1}{\log N}\sum_{A\leqslant p_k\leqslant B} \frac{1}{k} \approx \frac{\log (\pi(B)/\pi(A))}{\log N},$$

which is bounded, and similar for the $x^2$ term. So we have

$$\pi_{1/2}(N) \sim \frac{N\log \log N}{\log N},\tag{4}$$

and

$$\pi_{1/2;\,\alpha}(N) \sim c_\alpha\frac{N}{\log N} \in \Theta\left(\frac{N}{\log N}\right).\tag{5}$$

With more careful estimates, one can determine the constant $c_\alpha$, but the above is sufficient to see that asymptotically, almost all semiprimes $\leqslant N$ have factors of very different size.

However, the factor $\log \log N$ grows very slowly, so one needs to consider the semiprimes up to a really large limit for the proportion of semiprimes with two close factors to become negligible.

Obviously, I have neither the time nor the resources to compute the proportion for really large numbers, but the counts for small limits ($\alpha = 2$) already show a trend:

   Limit         Primes          Semiprimes        Close semiprimes
      100:       25 (1.1513)       34 (1.0253)      14 (0.6447; 41.1765)
      200:       46 (1.2186)       62 (0.9851)      22 (0.5828; 35.4839)
      500:       95 (1.1808)      153 (1.0409)      47 (0.5842; 30.7190)
     1000:      168 (1.1605)      299 (1.0687)      87 (0.6010; 29.0970)
     2000:      303 (1.1515)      577 (1.0811)     167 (0.6347; 28.9428)
     5000:      669 (1.1396)     1365 (1.0855)     377 (0.6422; 27.6190)
    10000:     1229 (1.1320)     2625 (1.0889)     693 (0.6383; 26.4000)
    20000:     2262 (1.1201)     5081 (1.0973)    1317 (0.6521; 25.9201)
    50000:     5133 (1.1108)    12110 (1.1004)    3115 (0.6741; 25.7225)
   100000:     9592 (1.1043)    23378 (1.1015)    5938 (0.6836; 25.3999)
   200000:    17984 (1.0976)    45230 (1.1033)   11339 (0.6920; 25.0696)
   500000:    41538 (1.0902)   108326 (1.1044)   26616 (0.6985; 24.5703)
  1000000:    78498 (1.0845)   210035 (1.1051)   50983 (0.7044; 24.2736)
  2000000:   148933 (1.0804)   407284 (1.1046)   97052 (0.7040; 23.8291)
  5000000:   348513 (1.0752)   979274 (1.1042)  230514 (0.7111; 23.5393)
 10000000:   664579 (1.0712)  1904324 (1.1041)  442445 (0.7131; 23.2337)
 20000000:  1270607 (1.0680)  3704340 (1.1034)  851642 (0.7159; 22.9904)
 50000000:  3001134 (1.0641)  8940570 (1.1025) 2024347 (0.7177; 22.6423)
100000000:  5761455 (1.0613) 17427258 (1.1019) 3904631 (0.7193; 22.4053)
200000000: 11078937 (1.0588) 33992717 (1.1011) 7534855 (0.7201; 22.1661)

The numbers in parentheses are

  1. $\dfrac{\pi(x)\log x}{x}$,
  2. $\dfrac{\pi_{1/2}(x)\log x}{x\log\log x}$,
  3. $\dfrac{\pi_{1/2;\,2}(x)\log x}{x}$,
  4. $\dfrac{100\cdot\pi_{1/2;\,2}(x)}{\pi_{1/2}(x)}$.

We see that $(4)$ is a decent asymptotic even for smallish numbers, while $(5)$ still has considerable fluctuation in the checked range, but we see that the proportion of semiprimes with two close factors drops steadily.