Outliers are observed data points that are far from the least squares line. They have unusual values and need to be examined carefully. Though an outlier may result from erroneous data, at other times, it may hold valuable information about the population under study and should be included in the data. Hence, it is crucial to examine what causes a data point to be an outlier.
The z-score can be used to find outliers or unusual values. As a rule of thumb, a value whose z-score is below -2 or above +2, meaning it lies more than two standard deviations from the mean, is considered unusual or a potential outlier because it is far away from the other data values.
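The z-score rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the sample data, and the choice of the sample standard deviation (dividing by n - 1) are not from the original text.

```python
import statistics

def z_scores(values):
    """Return the z-score of each value: (value - mean) / standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample standard deviation (n - 1)
    return [(v - mean) / sd for v in values]

def unusual_values(values, threshold=2.0):
    """Return the values whose z-score is below -threshold or above +threshold."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]

# Hypothetical data: six similar measurements and one much larger value.
data = [10, 12, 11, 13, 12, 11, 30]
print(unusual_values(data))  # [30]
```

Here the value 30 has a z-score of roughly 2.2, so it is flagged, while every other value stays well within two standard deviations of the mean.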
We could guess at outliers by looking at a scatterplot with the best-fit line drawn on it. However, we need a guideline for how far from the line a point must be to be considered an outlier. As a rough rule of thumb, we can flag any point that lies more than two standard deviations above or below the best-fit line as a potential outlier. The standard deviation used here is the standard deviation of the residuals, or errors, where a residual is the vertical distance between an observed point and the line.
We can do this visually on the scatterplot by drawing an extra pair of lines two standard deviations above and below the best-fit line. Any data points outside this pair of lines are flagged as potential outliers. We can also identify outliers numerically by calculating each residual and comparing its absolute value to twice the standard deviation of the residuals.
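The numerical check can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function names and the data are hypothetical, and the standard deviation of the residuals is assumed to be computed as the square root of SSE / (n - 2), the convention commonly used for simple linear regression.

```python
import math

def fit_line(xs, ys):
    """Least-squares slope and intercept for a simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

def flag_outliers(xs, ys, k=2.0):
    """Return the (x, y) points whose residual exceeds k residual standard deviations."""
    slope, intercept = fit_line(xs, ys)
    # Residual = observed y minus the y predicted by the best-fit line.
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sse = sum(r ** 2 for r in residuals)
    s = math.sqrt(sse / (len(xs) - 2))  # assumption: divide SSE by n - 2
    return [(x, y) for x, y, r in zip(xs, ys, residuals) if abs(r) > k * s]

# Hypothetical data: y = 2x everywhere except at x = 5, where y is 20 instead of 10.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2, 4, 6, 8, 20, 12, 14, 16, 18, 20]
print(flag_outliers(xs, ys))  # [(5, 20)]
```

A flagged point is only a candidate: as noted at the start of this section, it still needs to be examined to decide whether it reflects erroneous data or genuine information about the population.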
This text is adapted from Openstax, Introductory Statistics, Section 12.5 Outliers