Extreme values: Winsorize, trim, or retain?
August 1, 2018
“Trimming” data excludes the outlier values from your analysis. “Winsorizing” retains the responses in your basis but caps numeric outliers so they fall at the edge of the main distribution.
A common request is to bound the data at the 5th and 95th percentiles. In practice, however, survey data is often highly asymmetric, so clipping the data at just the high end may be reasonable.
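As a rough sketch of percentile-based bounds in Python, the standard library can estimate the 5th and 95th percentiles, and the cap can then be applied at the high end only. The function name `percentile_bounds` and the sample values are illustrative assumptions, not Protobi's implementation or the survey data in this example.

```python
from statistics import quantiles

def percentile_bounds(values, n=20):
    """Return the lowest and highest of the n-quantile cut points.

    With n=20 these estimate the 5th and 95th percentiles.
    """
    cuts = quantiles(values, n=n, method="inclusive")
    return cuts[0], cuts[-1]

# Hypothetical monthly patient counts, skewed toward the high end.
patients = [5, 8, 10, 12, 15, 20, 25, 30, 30, 35,
            40, 45, 50, 60, 70, 80, 90, 95, 250, 800]

p5, p95 = percentile_bounds(patients)

# Asymmetric clipping: cap only the high end, leave low values untouched.
capped_high = [min(v, p95) for v in patients]

print(p5, p95)
print(max(capped_high) <= p95)
```

With a small sample like this one, the percentile estimates depend on the interpolation method, so the exact cut points should be read as approximate.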
In the example below, most physicians report under 100 patients per month, but a few, 4%, report much higher numbers.
The screener termination criteria already bound the responses to be at least 5, so we might clip answers above 100 as shown by the gold line.
We can cap those answers to within a defined range by setting the
Here the data is now bounded to the range [5,100]. The outlying values are not dropped but are now counted as if they were equal to 100 and thus fall in the range “81 to 100” which has increased from 8% to 12%. The N size is still 100, but the mean is a bit lower now.
Note that the median did not change at all. In all but the most extreme cases, the median is robust to outliers and unaffected by Winsorizing, because the capped values stay on the same side of the median.
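Outside of Protobi, the same idea can be sketched in a few lines of Python. The helper name `winsorize` and the sample values below are hypothetical and are not the physician data from the example; the bounds 5 and 100 follow the range used above.

```python
from statistics import mean, median

def winsorize(values, lower, upper):
    """Cap each value to the closed range [lower, upper] without dropping any."""
    return [min(max(v, lower), upper) for v in values]

# Hypothetical patient counts with two extreme answers.
patients = [5, 10, 20, 30, 30, 40, 60, 90, 250, 800]

capped = winsorize(patients, lower=5, upper=100)

print(len(patients), len(capped))        # N is unchanged
print(mean(patients), mean(capped))      # the mean drops once outliers are capped
print(median(patients), median(capped))  # the median is the same before and after
```

The two extreme answers are counted as 100 rather than removed, so the basis stays the same while the mean moves toward the bulk of the distribution.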
Another approach is to ignore responses outside the main range. To do this, we can set a filter that includes only responses that fall within the range [5, 100].
Here the basis is lower, N=96, reflecting that the outliers are excluded from the distribution. The mean is a little lower still. The median happens to stay at 30, but trimming may change the median if more values are removed from one end than the other.
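A minimal Python sketch of trimming follows, again with hypothetical values rather than the survey data. The helper name `trim` is an assumption, as is treating both bounds as inclusive.

```python
from statistics import mean, median

def trim(values, lower, upper):
    """Keep only values inside [lower, upper]; everything else is dropped."""
    return [v for v in values if lower <= v <= upper]

# Hypothetical patient counts with two extreme answers.
patients = [5, 10, 20, 30, 30, 40, 60, 90, 250, 800]

trimmed = trim(patients, lower=5, upper=100)

print(len(patients), len(trimmed))        # the basis shrinks
print(mean(patients), mean(trimmed))      # the mean drops
print(median(patients), median(trimmed))  # the median can shift too
```

In this made-up sample both removed values sit at the high end, so the median shifts downward, which illustrates how one-sided trimming can move the median.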
Sometimes responses are entered honestly but in error. For instance, a respondent may write they purchased their Tesla in the year “2081”.
We might prefer to believe they meant to write “2018” rather than time traveled from the future to complete this survey. We could thus recode “2081” to “2018”.
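One way to keep such corrections explicit and auditable is a small recode map, sketched here in Python. The list of years and the variable names are hypothetical, and the mapping from 2081 to 2018 is an assumption about what the respondent meant.

```python
# Hypothetical purchase years; 2081 is assumed to be a transposition of 2018.
purchase_years = [2016, 2017, 2081, 2018, 2015]

# Explicit map from implausible entries to their presumed intended values.
recodes = {2081: 2018}

# Values not listed in the map pass through unchanged.
cleaned = [recodes.get(year, year) for year in purchase_years]

print(cleaned)  # [2016, 2017, 2018, 2018, 2015]
```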
Retain outliers and use a log scale
Just because numbers are atypical doesn’t mean they are unreasonable. Here it’s possible a few physicians really do treat many more patients with this condition than most doctors do.
Many phenomena yield “long-tail” distributions where a few outliers legitimately exist. For instance, in economics most people have modest wealth but a few have very high net worth, and to exclude them from analysis would be misleading.
“Long-tail” distributions often look normal, or at least more reasonable, when shown on a log scale.
Here the distribution is shown on a log scale, with small bin ranges for smaller numbers and larger bin ranges for larger numbers.
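Logarithmic binning of this kind can be sketched in Python by assigning each value to a bin whose edges are successive powers of two. The helper name `log_bin`, the choice of base 2, and the sample values are assumptions for illustration and are not the bins used in the chart.

```python
import math
from collections import Counter

def log_bin(value):
    """Return the (low, high) power-of-two bin with low <= value < high."""
    exponent = math.floor(math.log2(value))
    low = 2 ** exponent
    return (low, low * 2)

# Hypothetical patient counts with a long right tail (all values positive).
patients = [5, 10, 20, 30, 30, 40, 60, 90, 250, 800]

counts = Counter(log_bin(v) for v in patients)

# Bins are narrow for small numbers and wide for large ones.
for (low, high), n in sorted(counts.items()):
    print(f"{low:>4} to {high - 1:<4} {'#' * n}")
```

Every response is retained, and the extreme answers simply land in the widest bins at the right of the distribution.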
This tutorial shows three approaches to handling extreme values: trimming, Winsorizing, and retaining but plotting on a logarithmic scale.
Trimming makes a lot of sense when you simply don’t believe the answers, e.g. a traveler who says he makes 999 commercial flights per year.
Retaining the data makes sense when there legitimately may be high values, e.g. a few business travelers may actually take 100+ flights per year. A log scale may be useful.
Winsorizing makes sense when we want to retain the high-value responses but not take them too literally, such as when weighting physicians by self-reported patient volumes.
See how to do each of these in Protobi in this tutorial