Skip to contents

Formulas for the Kaplan–Meier estimator

Let t1<t2<<tDt_1 < t_2 < \cdots < t_D represent the distinct event times. For each event time j=1,,Dj=1,\ldots,D, let YjY_j be the size of the risk set (the number of surviving observations) just prior to tjt_j.

Survival data is expected when outcome.type = "survival". Let djd_j be the number of failures at tjt_j. The Kaplan–Meier estimator of the survival function S(t)S(t) is as below.

Kaplan–Meier estimator

Ŝ(t)=tjt(1djYj) \hat{S}(t) = \prod_{t_j \le t} \left(1 - \frac{d_j}{Y_j} \right) Note that the estimator is defined to be right-continuous, so the events at tjt_j are included in the estimate of Ŝ(tj)\hat{S}(t_j).

The variance or standard error of the Kaplan–Meier estimator is often calculated with the Greenwood formula. This formula is derived from a binomial argument, so extension to the weighted case is ad hoc. Alternatively, Tsiatis (1981) proposes a slightly different formula based on a counting process argument which includes the weighted case.

Greenwood variance Var{Ŝ(t)}=Ŝ(t)2tjtdjYj(Yjdj) \textrm{Var}\!\left\{\hat{S}(t)\right \} = \hat{S}(t)^2\sum_{t_j \le t} \frac{d_j}{Y_j\,(Y_j - d_j)}

Tsiatis variance Var{Ŝ(t)}=Ŝ(t)2tjtdjYj2 \textrm{Var}\!\left\{\hat{S}(t)\right \} = \hat{S}(t)^2\sum_{t_j \le t} \frac{d_j}{Y_j^2}

Suppose that the survival data consists of {Ti,di,wi}\{T_i, d_i, w_i\}, independent sample of right-censored survival data with weights (i=1,...,Ni=1,...,N). Let djwd_j^w and YjwY_j^w be the weighted number of failures and the weighted number at risk, respectively, at time tjt_j. The weighted Kaplan–Meier estimator of the survival function S(t)S(t) is

Ŝ(t)=tjt(1djwYjw). \hat{S}(t) = \prod_{t_j \le t} \left( 1 - \frac{d_j^w}{Y_j^w} \right). Xie and Liu (2005) proposed the Greenwood-type variance for the weighted Kaplan–Meier estimator.

Greenwood variance for weighted Kaplan–Meier Var{Ŝ(t)}=Ŝ(t)2tjtdjwYjMjYjw(Yjwdjw), \textrm{Var}\!\left\{\hat{S}(t)\right \} = \hat{S}(t)^2\sum_{t_j \le t} \frac{d_j^w Y_j}{M_j Y_j^w(Y_j^w - d_j^w)}, where MjM_j is an adjustment factor defined as Mj=(titjwi)2titjwi2. M_j = \frac{\left(\sum_{t_i \geq t_j} w_i \right)^2}{\sum_{t_i \geq t_j} w_i^2}. The Tsiatis-type variance is calculated as follows in the same spirits.

Tsiatis variance for weighted Kaplan–Meier Var{Ŝ(t)}=Ŝ(t)2tjtdjwYjMj(Yjw)2 \textrm{Var}\!\left\{\hat{S}(t)\right \} = \hat{S}(t)^2\sum_{t_j \le t} \frac{d_j^w Y_j}{M_j (Y_j^w)^2}

Formulas for the Aalen-Johansen estimator

Competing risks (outcome.type = "competing-risk") arise in studies in which individuals are exposed to two or more mutually exclusive failure events. When a failure occurs, we observe the time to event TT and the cause of failure ϵ\epsilon. Suppose that ϵ=1\epsilon=1 and ϵ=2\epsilon=2 represent the event of interest and the competing risk, respectively. Let djkd_{jk} be the number of failures of cause kk at time tjt_j, and now the total number of failures at tjt_j is dj=dj1+dj2d_j = d_{j1} + d_{j2}.

The Aalen-Johansen estimator of CIF for cause ϵ=k\epsilon=k is as below.

Aalen-Johansen estimator

F̂k(t)=tjtdjkYjŜ(tj1). \hat{F}_k(t) = \sum_{t_j \le t} \frac{d_{jk}}{Y_j}\,\hat{S}(t_{j-1}).

where Ŝ(t)\hat{S}(t) is the overall survival function.

Two variance estimators of the Aalen-Johansen estimator are commonly used: one based on counting process theory (Aalen, 1978) and the other based on the delta method.

Aalen variance Var{F̂k(t)}=tjt[F̂k(t)F̂k(tj)]2dj(Yj1)(Yjdj)+tjtŜ2(tj1)djk(Yjdjk)Yj2(Yj1)2tjt[F̂k(t)F̂k(tj)]Ŝ(tj1)djk(Yjdjk)Yj(Yjdj)(Yj1). \begin{aligned} \textrm{Var}\!\left\{\hat{F}_k(t)\right \} &= \sum_{t_j \le t} \bigl[\hat{F}_k(t) - \hat{F}_k(t_j)\bigr]^2 \frac{d_j}{(Y_j-1)(Y_j - d_j)} \\[2pt] &\quad + \sum_{t_j \le t} \hat{S}^2(t_{j-1}) \frac{d_{jk}\,(Y_j - d_{jk})}{Y_j^2\,(Y_j-1)} \\[2pt] &\quad - 2 \sum_{t_j \le t} \bigl[\hat{F}_k(t) - \hat{F}_k(t_j)\bigr]\, \hat{S}(t_{j-1}) \frac{d_{jk}\,(Y_j - d_{jk})}{Y_j\,(Y_j - d_j)\,(Y_j-1)}. \end{aligned}

Delta method variance Var{F̂k(t)}=tjt[F̂k(t)F̂k(tj)]2djYj(Yjdj)+tjtŜ2(ti1)djk(Yjdjk)Yj32tjt[F̂k(t)F̂k(tj)]Ŝ(tj1)djkYj2. \begin{aligned} \textrm{Var}\!\left\{\hat{F}_k(t)\right \} &= \sum_{t_j \le t} \bigl[\hat{F}_k(t) - \hat{F}_k(t_j)\bigr]^2 \frac{d_j}{Y_j\,(Y_j - d_j)} \\[2pt] &\quad + \sum_{t_j \le t} \hat{S}^2(t_{i-1}) \frac{d_{jk}\,(Y_j - d_{jk})}{Y_j^3} \\[2pt] &\quad - 2 \sum_{t_j \le t} \bigl[\hat{F}_k(t) - \hat{F}_k(t_j)\bigr]\, \hat{S}(t_{j-1}) \frac{d_{jk}}{Y_j^2}. \end{aligned}

Variance based on influence functions

Under regularity conditions, Aalen-Johansen estimator can be expanded as
n1/2{F̂k(t)Fk(t)}=n1/2i=1nIFik(t)+op(1) n^{1/2}\{\hat F_k(t) - F_k(t)\} = n^{-1/2} \sum_{i=1}^n IF_{ik}(t) + o_p(1) and the process n1/2{F̂ik(t)Fik(t)}n^{1/2}\{\hat F_{ik}(t) - F_{ik}(t)\} converges weakly to a tight Gaussian process. Here IFik(t)IF_{ik}(t) is the influence function, the contribution of ii-th observation to the Aalen-Johansen estimator, and may be written as IFik(t)=0tnS(u)Y(u)dMi(u)0tnFk(u)Y(u)dMik(u), IF_{ik}(t) = \int_0^t \frac{n S(u^-)}{Y(u)}\,dM_i(u) - \int_0^t \frac{n F_k(u^-)}{Y(u)}\,dM_{ik}(u),

where Mi(t)M_i(t) and Mik(t)M_{ik}(t) is the Martingale process of the total count and the count of cause kk of ii-th observation, respectively, and Y(t)Y(t) is the at-risk process. A consistent variance estimator for n1/2{F̂k(t)Fk(t)}n^{1/2}\{\hat F_k(t) - F_k(t)\} is n1i=1n{IF̂ik(t)}2n^{-1} \sum_{i=1}^n \{\widehat{IF}_{ik}(t)\}^2.

Confidence interval options

Standard errors The default in cifcurve() with weights=NULL is the Greenwood SE when outcome.type="survival" and the delta SE when outcome.type="competing-risk". The default in cifcurve() with weights is the SE based on influence functions. By default cifcurve() rescales the Greenwood/Tsiatis quantities so that std.err is reported on the probability scale; set report.survfit.std.err = TRUE to return the conventional log-survival SEs from survival::survfit().

Confidence intervals cifcurve() constructs intervals on the probability scale using the requested transformation: "arcsine-square root"/"arcsin"/"a" (default), "plain, "log", "log-log", or "logit". Passing "none"/"n" skips interval computation entirely. The function exponentiates back to the probability scale, clips bounds to [0, 1], and replaces undefined values with NA so that interval endpoints remain well behaved in plots and summaries.