














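In any case, I have never come across a standardized approach; it is always data-specific. You can examine which percentiles of your data (the outliers) are causing a given share of the volatility/st. dev., and then find a balance between reducing that volatility and retaining as much of the data as possible.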
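As a rough illustration of that trade-off, the sketch below tabulates the standard deviation and the share of data kept for a few trim fractions. The data values and the helper names `trimmed` and `volatility_by_trim` are made up for this example, and symmetric trimming of each tail is an assumption rather than a recommendation.

```python
import statistics

def trimmed(values, fraction):
    """Drop the given fraction of points from each tail (for this calculation only)."""
    s = sorted(values)
    k = int(len(s) * fraction)
    return s[k:len(s) - k] if k else s

def volatility_by_trim(values, fractions):
    """Map each trim fraction to (population standard deviation, share of data kept)."""
    result = {}
    for f in fractions:
        kept = trimmed(values, f)
        result[f] = (statistics.pstdev(kept), len(kept) / len(values))
    return result

# Hypothetical data: mostly similar values plus two extreme ones.
data = [9, 10, 10, 11, 10, 9, 11, 10, 50, -30]
table = volatility_by_trim(data, [0.0, 0.1, 0.2])
for f, (sd, share) in table.items():
    print(f"trim {f:.0%} per tail: sd={sd:.2f}, kept={share:.0%}")
```

Reading down such a table shows how quickly the spread falls as more of each tail is set aside, which is the balance to judge for your own data.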

As in my comment above, "removing them from the data set" is too strong here. Trimming or Winsorizing just means what it does, ignoring or replacing as may be, for a certain calculation. You are not obliged to remove the tail values from the dataset, as if you were throwing out rotten fruit. For example, faced with possible outliers, you might do an analysis of the data as they come and an analysis based on trimming and see what difference it makes.
Nick Cox



It's a good question, but you don't answer it. You just say that truncating or Winsorizing can help visualization.
Nick Cox


One advantage of Winsorizing is that the calculation may be more efficient. To calculate a true truncated mean, the straightforward approach is to sort all of the data elements, which is typically O(n log n). However, there are efficient ways of finding just the 25th and 75th percentiles using the quickselect algorithm, which is typically O(n). Once you know these end points, you can loop over the data once more, replace values below the 25th percentile with the 25th-percentile value and values above the 75th percentile with the 75th-percentile value, and average. The result is identical to the Winsorized mean. But looping over the data and averaging only the values between the 25th- and 75th-percentile values is NOT identical to the truncated mean, because the 25th- or 75th-percentile value may not be unique. Consider the data sequence (1,2,3,4,4), with cut points 2 and 4. The Winsorized mean is (2+2+3+4+4)/5. The correct truncated mean should be (2+3+4)/3. The "quick-select" optimized truncated mean will be (2+3+4+4)/4.
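The three calculations can be compared with a minimal sketch. The function names are hypothetical, and the cut points are taken as the k-th smallest and k-th largest values (k = 1 for the five-element example), which is an assumption about how the percentiles are defined.

```python
import random

def quickselect(values, k):
    """Return the k-th smallest element (0-based) in expected O(n) time."""
    items = list(values)
    while True:
        pivot = random.choice(items)
        lows = [x for x in items if x < pivot]
        pivots = [x for x in items if x == pivot]
        if k < len(lows):
            items = lows
        elif k < len(lows) + len(pivots):
            return pivot
        else:
            k -= len(lows) + len(pivots)
            items = [x for x in items if x > pivot]

def winsorized_mean(values, k):
    """Clamp the k smallest and k largest values to the cut points, then average."""
    lo = quickselect(values, k)
    hi = quickselect(values, len(values) - 1 - k)
    return sum(min(max(x, lo), hi) for x in values) / len(values)

def truncated_mean_sorted(values, k):
    """Drop exactly k values from each end after a full sort, then average."""
    kept = sorted(values)[k:len(values) - k]
    return sum(kept) / len(kept)

def truncated_mean_naive(values, k):
    """Average everything between the cut points; ties at a cut point are all kept."""
    lo = quickselect(values, k)
    hi = quickselect(values, len(values) - 1 - k)
    kept = [x for x in values if lo <= x <= hi]
    return sum(kept) / len(kept)

data = [1, 2, 3, 4, 4]
print(winsorized_mean(data, 1))        # (2+2+3+4+4)/5
print(truncated_mean_sorted(data, 1))  # (2+3+4)/3
print(truncated_mean_naive(data, 1))   # (2+3+4+4)/4
```

The naive version keeps both 4s because it filters by value rather than by rank, which is where it departs from the truncated mean.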

It is not the case that you need to sort all the data to compute a median (as true a median as you like), nor is it true that it's an O(nlogn) calculation to find it. There are algorithms for finding the median that are O(n) (worst case). [Further, if quick select could find the 25th and 75th percentiles in O(n) as you say, why would quick select be unable to find the 50th percentile in the same order?]
Glen_b -Reinstate Monica

You are correct. I mistyped my original post. Sometimes the typing fingers and brain are not in sync. I meant to say that to correctly calculate a true truncated mean, you need to sort all of the data elements. I believe this is still true. I've updated my answer.
Mark Lakata

This seems to imply that Winsorizing means Winsorizing 25% in each tail. You can Winsorize as much or as little as seems appropriate.
Nick Cox
Licensed under cc by-sa 3.0 with attribution required.