How do we describe the distribution of time intervals when some aren’t yet complete?

The Kaplan-Meier Survival Estimator is a non-parametric curve that describes the empirical survival function given the intervals observed to date.

Importantly, it is designed to handle “censored” data, where intervals are observed before they are known to be complete.

Background

Survival data arises in many scenarios:

  • Lifespan of mutual funds, given many are still active
  • Duration of therapy when many patients are still being treated
  • Production span of cars when many are still produced

The challenge is that in many samples we have start dates, but many intervals are still continuing. Observed durations are clipped at the time of the survey.
So a simple average, or even the median, of observed durations may seriously underestimate the total time interval.
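To make the underestimate concrete, here is a minimal sketch with made-up numbers. The five intervals, their "true" durations, and the survey year are all hypothetical; in real data the true durations of ongoing intervals are of course unknown.

```python
from statistics import mean, median

# Hypothetical data: (start_year, true_duration_in_years) for five intervals.
intervals = [(2000, 3), (2005, 12), (2010, 20), (2015, 4), (2018, 15)]
SURVEY_YEAR = 2020  # assumed date at which we look at the data


def observed_duration(start, true_duration, survey_year=SURVEY_YEAR):
    """Duration visible at survey time; clipped if the interval is still running."""
    return min(true_duration, survey_year - start)


observed = [observed_duration(s, d) for s, d in intervals]
true = [d for _, d in intervals]

print("observed:", observed, "mean", mean(observed), "median", median(observed))
print("true:    ", true, "mean", mean(true), "median", median(true))
```

In this toy sample the clipped durations give both a lower mean and a lower median than the true durations, because the longest intervals are exactly the ones most likely to be still running at survey time.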

Survival analysis is well established in clinical trials and other settings.

Kaplan-Meier Estimator

The Kaplan-Meier Survival Estimator is a simple, robust approach to reporting empirical survival distributions with censored data.

The estimator is given by:

$$\hat{S}(t) = \prod_{i:\, t_i \le t} \left(1 - \frac{d_i}{n_i}\right)$$

where:

  • n_i is the number of intervals known to have survived to time t_i
  • d_i is the number of intervals known to have stopped at time t_i
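The calculation can be sketched in a few lines of Python: at each distinct stop time, multiply the running survival by (1 - d_i / n_i). This is an illustrative sketch, not Protobi's implementation. The function name `kaplan_meier`, its arguments, and the sample data are assumptions, and intervals censored at an event time are assumed to still count as at risk at that time.

```python
from collections import Counter


def kaplan_meier(durations, ended):
    """Return [(t_i, n_i, d_i, survival)] for each distinct stop time t_i.

    durations: observed length of each interval
    ended:     True if the interval is known to have stopped,
               False if it is still continuing (right-censored)
    """
    stops = Counter(t for t, e in zip(durations, ended) if e)
    curve = []
    survival = 1.0
    for t in sorted(stops):
        n_i = sum(1 for x in durations if x >= t)  # survived up to time t
        d_i = stops[t]                             # stopped at time t
        survival *= 1 - d_i / n_i
        curve.append((t, n_i, d_i, survival))
    return curve


# Hypothetical sample: five intervals, two of them still continuing.
durations = [3, 12, 10, 4, 2]
ended = [True, True, False, True, False]
curve = kaplan_meier(durations, ended)
for t, n_i, d_i, s in curve:
    print(f"t={t:>2}  n_i={n_i}  d_i={d_i}  S(t)={s:.3f}")
```

Censored intervals never trigger a step in the curve, but they do count in n_i for every event time up to the point they were last observed, which is how the estimator uses their partial information.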

Kaplan-Meier Estimator in Protobi

Protobi can calculate Kaplan-Meier survival curves from start dates and end dates.
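For readers preparing their own data, the sketch below shows one way start and end dates map to the two inputs a survival curve needs: an observed duration and a flag for whether the interval ended. It does not describe Protobi's internals. The helper `to_duration`, the `as_of` survey date, and the convention that a missing end date means "still continuing" are all assumptions made for illustration.

```python
from datetime import date


def to_duration(start, end, as_of):
    """Return (days_observed, ended) for one interval.

    end=None is assumed to mean the interval is still continuing at `as_of`.
    """
    if end is None or end > as_of:
        return (as_of - start).days, False  # right-censored at the survey date
    return (end - start).days, True         # known to have stopped


# Hypothetical rows: (start_date, end_date or None)
rows = [
    (date(2010, 1, 1), date(2014, 1, 1)),
    (date(2016, 6, 1), None),
]
as_of = date(2020, 6, 1)
records = [to_duration(s, e, as_of) for s, e in rows]
print(records)
```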

Above is the distribution of production run lengths for specific auto models. One curve shows all models overall and one shows just Toyota models.

Closed circles show termination events. Open circles show right-censored events where an interval is observed but still continuing.

Below the chart is a table showing the median survival time, here defined as the time of the first event where empirical survival falls below 50%.
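That definition can be sketched as a scan over the curve's event times. The function name `median_survival`, the (time, survival) pair layout, and the sample values are hypothetical. The comparison is strict ("below"), following the wording above; some references use "at or below" instead.

```python
def median_survival(curve, threshold=0.5):
    """Time of the first event where survival drops below `threshold`.

    curve: list of (time, survival) pairs sorted by time.
    Returns None if survival never drops below the threshold, which
    happens when too many intervals are still continuing.
    """
    for t, s in curve:
        if s < threshold:
            return t
    return None


# Hypothetical curve values for illustration.
curve = [(3, 0.9), (4, 0.6), (7, 0.45), (12, 0.2)]
result = median_survival(curve)
never = median_survival([(3, 0.9), (4, 0.6)])
print(result, never)
```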
