Histogram and Normal distribution
I was studying histograms and the normal distribution. As far as I know, they are two different tools used in probability and statistics. More specifically, both help to visualize data and are an effective way to summarize a large amount of it.
The main difference is in their math and in the way they visualize data. To calculate the probability of an event from a histogram, we use ordinary arithmetic. But to calculate a probability from a normal distribution, we need calculus and geometry. I am adding screenshots so that everyone can understand what I mean.
Could anyone help me understand their use cases? In which cases is it better to use a histogram, and in which a normal distribution? Is there any condition I should check before deciding which one to use?
Histogram of a small sample.
Suppose you have a population of high school women, you sample 100 women at random from the population, measure their heights (to the nearest inch), and make a histogram of these 100 heights.
Using R statistical software, I can emulate this process to get fictitious data for an example. The vector x contains the heights in inches of 100 women.
set.seed(2021) # for reproducibility
x = round(rnorm(100, 64, 3.5)) # draw sample, round; see Note at end
From the following summary I can see that the tallest woman was 71" tall and the shortest was 56" tall. Also, I can see that the average height is $\bar X = 63.36"$.
summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
56.00 61.00 63.50 63.36 66.00 71.00
The histogram below has labels atop its bars, indicating how many women are represented in each bar. So, I can say that $8+10+1 = 19$ of the $100$ women are taller than 66". [In this style of histogram the intervals contain the top boundary, but not the bottom boundary.] From this I might guess that roughly $0.19 = 19\%$ of the women in the _population_ are taller than 66". But this is only a rough estimate based on a sample of 100. Perhaps it is more appropriate to give a 95% confidence interval for the probability as $(0.113, 0.267)$ or $0.19 \pm 0.077.$
hist(x, col="skyblue2", labels=TRUE) # labels=TRUE prints counts atop bars
p.est = 0.19
CI = p.est + qnorm(c(.025,.975))*sqrt(p.est*(1-p.est)/100)
CI
[1] 0.1131104 0.2668896
sum(x > 66)
[1] 19
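The counting and the interval above can also be done from the data instead of typing in 0.19 by hand. What follows is only a sketch: the explicit 2-inch bins on even boundaries are my assumption (chosen so that 66 is certain to be a bin boundary), and wald_ci is a hypothetical helper name, not a built-in R function.

```r
set.seed(2021)                      # same seed as in the example above
x = round(rnorm(100, 64, 3.5))      # fictitious heights, nearest inch

# Explicit 2-inch bins on even boundaries (an assumption, so that 66
# is guaranteed to be a bin boundary).
brk = seq(2 * floor(min(x) / 2), 2 * ceiling(max(x) / 2), by = 2)
h = hist(x, breaks = brk, plot = FALSE)   # right-closed bins (a, b]

# Bars whose lower boundary is 66 or more hold the heights above 66".
k = sum(h$counts[head(h$breaks, -1) >= 66])
n = length(x)

# Hypothetical helper: Wald (normal-approximation) CI for a proportion.
wald_ci = function(k, n, level = 0.95) {
  p = k / n
  z = qnorm(1 - (1 - level) / 2)
  p + c(-1, 1) * z * sqrt(p * (1 - p) / n)
}

k / n                                     # sample proportion above 66"
wald_ci(k, n)                             # same formula as the CI above
prop.test(k, n, correct = FALSE)$conf.int # Wilson interval, for comparison
```

The Wilson interval from prop.test is shown only for comparison; it tends to behave better than the Wald interval when the sample is small or the proportion is near 0 or 1.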
Exact distribution of population.
By contrast, if I am told that the population distribution of such female student heights is $\mathsf{Norm}(\mu = 64, \sigma=3.5),$ then I have more knowledge about the population than I can deduce from a sample of $100$ women.
Then I can find a z-score and use printed normal CDF tables to find the exact proportion of high school women in the population who are taller than 66". For the best result, I should use $66.5$ because women taller than that will be rounded to 67" or more. (This adjustment is called the 'continuity correction'.)
Then $Z = \frac{66.5 - 64}{3.5} = 0.714.$ And from the printed table you get approximately the proportion $0.238.$ [Usually, using printed tables involves some rounding, with a small loss of accuracy.] You can use the normal CDF function pnorm in R, to get the slightly more accurate value $0.2376.$
z = (66.5-64)/3.5; z
[1] 0.7142857
1 - pnorm(0.714)
[1] 0.2376136
1 - pnorm(66.5, 64, 3.5)
[1] 0.2375253
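To see why the cutoff $66.5$ is the right one for heights rounded to the nearest inch, here is a small simulation sketch. The seed and the simulated sample size of one million are arbitrary choices of mine, not part of the original example.

```r
mu = 64; sigma = 3.5

# Standardizing with the unrounded z gives the same tail area as
# calling pnorm with the mean and SD directly.
z = (66.5 - mu) / sigma
p.z = 1 - pnorm(z)
p.direct = 1 - pnorm(66.5, mu, sigma)

# Simulated check of the continuity correction.
set.seed(1)                              # arbitrary seed
big = round(rnorm(10^6, mu, sigma))      # a million rounded heights
p.sim = mean(big > 66)                   # proportion recorded above 66"

p.nocorr = 1 - pnorm(66, mu, sigma)      # cutoff without the correction
c(p.z, p.direct, p.sim, p.nocorr)
```

The simulated proportion should land close to the value computed with cutoff $66.5,$ and noticeably below the value computed with cutoff $66.$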
Of course, the answer $0.238$ from the exact population distribution is much better than the approximate answer $0.19\pm 0.077$ estimated from a sample of only 100 women. But you try to do your best with the information you have.
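There is also a middle case: you are willing to assume the population is normal, but you do not know $\mu$ and $\sigma.$ Then one rough option is to plug the sample mean and SD into the normal CDF. This is only a sketch under that normality assumption, which in practice would have to be checked; the names m, s, p.plug, and p.count are mine.

```r
set.seed(2021)                      # same seed as in the example above
x = round(rnorm(100, 64, 3.5))      # fictitious heights, nearest inch

m = mean(x); s = sd(x)              # estimates of mu and sigma

# Plug-in estimate of the tail probability, assuming normality.
p.plug = 1 - pnorm(66.5, m, s)

# Plain counting estimate from the sample, no normality assumed.
p.count = mean(x > 66)

c(p.plug, p.count)
```

Both numbers are estimates from 100 observations, so neither should be expected to match the exact population value.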
The probability $0.238$ is the area under the density curve to the right of the vertical line.
hdr = "Density of NORM(64, 3.5)"
curve(dnorm(x, 64, 3.5), 50, 75, lwd=2, ylab="Density", main=hdr)
abline(h = 0, col="green2"); abline(v = 66.5, lwd=2)
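As a check that the probability really is an area, the density can be integrated numerically to the right of $66.5.$ This sketch uses R's integrate function; the variable name area is mine.

```r
# Area under the NORM(64, 3.5) density to the right of 66.5,
# found by numerical integration.
area = integrate(dnorm, lower = 66.5, upper = Inf, mean = 64, sd = 3.5)
area$value

# The normal CDF gives the same area without explicit integration.
1 - pnorm(66.5, 64, 3.5)
```

So the calculus is hidden inside pnorm (or the printed table); in practice you rarely have to do the integration yourself.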
Note: The information in the line of R code
x = round(rnorm(100, 64, 3.5))
would never be known in a practical situation. This was used only to make a fictitious sample of 100. [I don't happen to have a huge population of high school women in my office to use for taking the sample.]