Explain ggplot2 warning: "Removed k rows containing missing values"

Solution 1:

The behavior you're seeing is due to how ggplot2 deals with data that are outside the axis ranges of the plot. You can change this behavior depending on whether you use scale_y_continuous (or, equivalently, ylim) or coord_cartesian to set axis ranges, as explained below.

library(ggplot2)

# All points are visible in the plot
ggplot(mtcars, aes(mpg, hp)) + 
  geom_point()

In the code below, one point with hp = 335 is outside the y-range of the plot. Also, because we used scale_y_continuous to set the y-axis range, this point is not included in any other statistics or summary measures calculated by ggplot, such as the linear regression line.

ggplot(mtcars, aes(mpg, hp)) + 
  geom_point() +
  scale_y_continuous(limits=c(0,300)) +  # Change this to limits=c(0,335) and the warning disappars
  geom_smooth(method="lm")

Warning messages:
1: Removed 1 rows containing missing values (stat_smooth). 
2: Removed 1 rows containing missing values (geom_point).

In the code below, the point with hp = 335 is still outside the y-range of the plot, but this point is nevertheless included in any statistics or summary measures that ggplot calculates, such as the linear regression line. This is because we used coord_cartesian to set the y-axis range, and this function does not exclude points that are outside the plot ranges when it does other calculations on the data.

If you compare this and the previous plot, you can see that the linear regression line in the second plot has a slightly steeper slope, because the point with hp=335 is included when calculating the regression line, even though it's not visible in the plot.

ggplot(mtcars, aes(mpg, hp)) + 
  geom_point() +
  coord_cartesian(ylim=c(0,300)) +
  geom_smooth(method="lm")

Solution 2:

Just for the shake of completing the answer given by eipi10.

I was facing the same problem, without using scale_y_continuous nor coord_cartesian.

The conflict was coming from the x axis, where I defined limits = c(1, 30). It seems such limits do not provide enough space if you want to "dodge" your bars, so R still throws the error

Removed 8 rows containing missing values (geom_bar)

Adjusting the limits of the x axis to limits = c(0, 31) solved the problem.

In conclusion, even if you are not putting limits to your y axis, check out your x axis' behavior to ensure you have enough space