温莎日记 34

Relative survival analysis in R | Accepted 12 January 2006

Relative survival techniques are used to compare the survival experience in a study cohort with the one expected should they follow the background population mortality rates. The techniques are especially useful when the cause-specific death information is not accurate or not available since they provide a measure of excess mortality in a group of patients with a certain disease. There are several approaches to modeling relative survival, but there is no widely used statistical package that would incorporate the relevant techniques. The existing software was mostly written by the authors of different methods, in different computer languages and with different requirements for the data input, which makes it almost impossible for a user to choose between available models. We describe our R package relsurv that provides functions for easy and flexible fitting of several relative survival regression models.

Introduction

Survival analysis is concerned with studying the random variable T, representing the time between entry to a study and some event of interest (e.g. death, recurrence of disease . . . ). Instead of the cumulative distribution function F(t), we usually estimate the cumulative survival function S(t), which is defined as S(t) = P(T > t) = 1 − F(t). In the case of the final event being death, S(t) is the probability of being alive at time t. When the final event of interest is death due to a certain disease, but the causes of death are not known, it is not possible to directly estimate the proportion of dead due to the disease in question. We then resort to the methods of relative survival. The cumulative relative survival function r(t) is defined

r(t):=\frac{S_o(t)}{S_p(t)}

where S_o(t) denotes observed survival and S_p(t) stands for population or expected survival, which is estimated on the basis of  population life tables. Obviously, r(t) can be any non-negative number, although the methods are most often applied to data where r(t) is less than 1.

Relative survival regression models

The relative survival literature most frequently refers to the additive models, dominating especially in cancer research. The main assumption is that the hazard of every individual can be split into two additive components:

\lambda _o(t)=\lambda _p(t)+\lambda _E(t)

where \lambda _p(t) is the hazard every patient takes because of his age, sex and cohort year (or any other combination of covariates included in the population mortality data). \lambda _E(t) denotes the excess hazard, specific for the disease in question and \lambda _o(t) is used for the observed hazard - the one we can estimate from our data. Using the equality

S(t)=e^{-\int \lambda (t)dt} , and we write S_o(t)=S_p(t) \times S_E(t) \implies S_E(t)=\frac{S_o(t)}{S_p(t)} .

The multiplicative models assume the hazard components to be multiplicative:

\lambda _o(t)=\lambda _p(t) \cdot  \lambda _E(t).

Such a model does not assume that the observed hazard is always greater than the population hazard, but has a less obvious interpretation as the additive model. Our programs provide for fitting the transformation model, and three different approaches to fitting the additive model. Only the last four have been widely used in the past, while transformation approach has only recently been published and given its simplicity, it might become popular in the future, especially in studies where long-term observations are common.

i) Hakulinen–Tenkanen additive survival method: in order to fit the model, the patients must first be grouped into K strata, indexed by k, with one stratum for each combination of the relevant predictor variables (age, sex, cohort year, stage . . . ) and a life table must be estimated for each stratum, with intervals indexed by i. The \lambda _E(t) is assumed to be a multiplicative function of the covariates and constant within each time interval i.

ln[-ln(\frac{p_{ki}}{p_{ki}^*} )]= β‘z+\gamma _i

which is a generalized linear model with a binomial error structure and a complementary log–log link function combined with a division by p_{ki}^* .

ii) The glm model with the Poisson error structure: the method is similar to Hakulinen–Tenkanen and it requires the same grouping of the data. The number of deaths d_{ki} is assumed to be distributed as Poisson(\mu _{ki}), where \mu _{ki} =\lambda _{E,ki} \cdot  y_{ki} and y_{ki} is the person-time at risk for group k of observations on time interval i.

\frac{\mu _{ki}}{y_{ki}} =\frac{d_{ki}^* }{y_{ki}} +exp(β‘z + \gamma _i)  and  ln(\mu _{ki}-d_{ki}^* ) = β‘z + \gamma _i + ln(y_{ki}).

The model can be rewritten to yield the same likelihood function as the Est`eve model, meaning that the grouping of the variables is not crucial.

iii) The Est`eve additive survival model: the method uses individual data and estimates the coefficients using the maximum likelihood approach. The likelihood for the additive model is

L(\beta )=\prod_{j=1}^n (\lambda _o(t_j))^{d_j} \cdot exp\left\{ -\int_{0}^{t_j} \lambda _o(s)ds  \right\} =\prod_{j=1}^n (\lambda _p(t_j))+exp[β'z+\gamma ' fu(t_j)] d_j \cdot exp[ -\int_{0}^{t_j} \lambda _p(s)ds  + exp(β'z+\gamma 'fu(s))ds]

where t_j is the event time for person j and d_j the status indicator. The log-likelihood function:

l(\beta )= \sum_{j=1}^n d_jln(\lambda _p(t_j)) + exp[β'z+γ‘fu(t_j)] - \sum_{j=1}^n  \int_{0}^{t_j}  exp [β'z+γ'fu(s)]ds - \sum_{j=1}^n  \int_{0}^{t_j} \lambda _p(s)ds.

iv) The Andersen multiplicative model: the model assumes \lambda _E(t)=\lambda _0(t)exp{β'z}, which is the same as in the Cox model. The observed hazard thus becomes

\lambda _o(t)=\lambda _p(t) \lambda _0(t) e^{\beta /z} .

This model can be rewritten in the form

\lambda _o(t)=\lambda _0(t)e^{\beta /(z+1)ln(\lambda _p(t))}

from which it is obvious that this is a Cox model with an additional time-dependent variable with the coefficient fixed to 1. Though \lambda _p(t) is a continuous process, the population mortality tables are usually organised in yearly intervals. In order to fit the model using the Cox model procedures, data can be split in yearly intervals, with the \lambda _p(t) updated on each of the intervals.

v) The transformation approach: the individual survival times t are first transformed as

y=F_p(t)

where F_p is the cumulative distribution function of a person of a certain age, sex and cohort year (or any other combination included in the population tables) that would apply if the person is a typical representative of the general population. This distribution function is calculated from the general population mortality data. The values y can thus be interpreted as the achieved values on the expected cumulative distribution function for each individual. By transforming to the new scale, the population hazard is automatically taken into account, consequently all what is left is precisely the disease-specific hazard, which we can thus directly model. One way is to use the Cox model

\lambda (y)=\lambda _0(y)e^{\beta /z}.

The relsurv package

The core of the package are three functions that fit the models described in the previous section:

+ rsadd fits an additive model. The default is the Est`eve method (iii) which is specified by "max.lik", the other two options are method="glm.bin" and method="glm.poi" for models. When using one of the glm methods, the observed and expected survival proportions for each group are returned as object groups.

+ rsmul fits the multiplicative model as described in (iv). A more computationally intensive alternative is chosen by specifying method="mul1"that splits the intervals at every time point in which the population mortality changes—for example on 1 January and on the individual’s birthday.

+ rstrans fits the transformation model as described in (v). If only the transformation times are needed, this can be done directly by the survexp function (survival package) or by function rstrans, where transformed times are returned:y (fit<-rstrans(...), y<-fit$y).

The package also includes the functions for testing goodness-of-fit and plotting relevant graphs for all the above models with methods. Additionally, two data sets are included in the package. One is called rdata and contains survival data that can be used as an example. The slopop data set contains the population mortality tables for Slovenia.

All the functions are called in the same way, for example:

rstrans(formula, data, ratetable, int, na.action, init, control)

Two data sets are required for any relative survival model. One is our observed data set, which we will pass to the function as argument data. The other is the population mortality table that we want to compare our observed data to, and this is specified in the argument ratetable. The population mortality tables have to be organised as an object of class ratetable (defined in package survival), default is the survexp.us table that contains the US data (also in the survival package).

The model is specified within the argument formula. It is organised following the syntactic rules of an R language formula and thus similar to any other survival package formula. The left-hand side must be a Surv object. For example, if time and status are the survival time variable and the censoring indicator, and x is a covariate, then the command may be

rstrans(Surv(time,status)∼x)

If the variables that are used for the expected survival calculation (for example age, sex and year) are not organised and not named in the same way as in the population tables, a special term ratetable is to be included in the formula, for example:

Surv(time,status)∼x+ratetable(age=age,sex="male", year=diag)

Argument int specifies the number of yearly intervals to be used in rsadd function. int can either be a single number or a vector. The latter is used when the intervals are not all one year long, for example int=c(0,0.5,1,5,10) means a couple of short intervals at the beginning and longer intervals thereafter with the maximum follow-up time of 10 years. Each interval includes the right endpoint, but not the left one, except for the first interval which also includes the left endpoint. In this case the third interval would therefore be (1, 5 ].

This option can also be used in rsmul and rstrans functions, serving simply as the maximum follow-up time after which all the data are censored, enabling in this way a more straightforward comparison of models. Missing data are handled in the usual way through the argument na.action and the initial values for the model can be specified via the argument init. The number of iterations and other fitting specifications can be specified through the control argument, which follows the glm.control function in the rsadd function and the coxph.control function in rsmul and rstrans.

The user, however, does not have to change the data set—all the matching can be done by the help of a special term in the formula named ratetable. In order to construct this term, first check the names of the ratetable variables. For the slopop rate table the names are

> attributes(slopop)$dimid

"age" "year" "sex"

These names are now to be matched with the variable names, for example:

Surv(time,status)∼x+ratetable(age=age*365.24,

sex="male",year=diag)

The names on the left of each equality sign are the dimension names of the ratetable names, while the names on the right are the variable names in the observed data set. This example is for a data set in which no variable sex is included as all the patients are male, the calendar year variable is stored as diag (diagnosis year), and age is in years and must therefore be multiplied by 365.24 to match the data in the rate table (note that .24 is used in all R calculations as there are 24 leap years per century).

rsadd(Surv(time,status)∼x,ratetable=slopop,

data=my.data)

The rsadd function returns an object of class rsadd, while rsmul and rstrans return a coxph object. All the returned values can be viewed as such or by the help of summary function. Apart from the coefficients, their standardized values and the other usually returned values, the common object returned by all the functions is named y and containts the survival and censoring times for each individual, organised as a Surv object. In the case of rstrans function, these are already the transformed times and this function can thus also be used just for making the transformation.

The max.lik method of rsadd function is based on maximum likelihood and therefore the output also contains the log-likelihood values stored as loglik. The other two methods are based on the glm function and the returned objects preserve all its values. The original and the grouped data are both stored in the output object as data and groups, respectively. The original data set contains the demographic variables in the format that matches the rate table (stored as X1, X2,. . .) and the covariates in the format in which they were used in the model.

Example

To illustrate the usage of the program, we use a subset of data from the study of survival of patients after acute myocardial infarction that is included in the package as the file rdata. The data were collected in the study carried out at the University Clinical Center in Ljubljana and contain 1040 patients diagnosed between 1982 and 1986 and followed up until 1997. During this time 547 deaths occurred and as the causes of death are not given, this is a good example of the need of the relative survival methodology. The organisation of the data is as follows:

We fit all the models to the data in the same way:

Time and censoring are already in the right format, so we put them in the formula as Surv(time,cens);

• sex and agegr (“under 54” is the reference group) are taken as covariates; agegr is forced to be a factor variable =⇒ sex+as.factor(agegr);

• We concentrate on the first 5 years of follow-up (and assume the hazard is constant in yearly intervals for all the additive models) =⇒ int=5;

• The name of the observed data set is rdata, the study was performed in Slovenia, therefore we use Slovenian population tables =⇒ data=rdata, ratetable=slopop;

• Variables sex and year are in the same format as in the rate table slopop, age must be put into days =⇒ ratetable(age=age*365.24, sex=sex, year=year).

The estimates obtained by the three additive model fittingoptions are compared in Table 1.

The command lines for our five models therefore look very similar:

> rsadd(Surv(time,cens)∼sex+as.factor(agegr)+

+ ratetable(age=age*365.24,sex=sex,year=year),

+ data=rdata,ratetable=slopop,int=5,method="glm.bin")

> rsadd(Surv(time,cens)∼sex+as.factor(agegr)+

+ ratetable(age=age*365.24,sex=sex,year=year),

+ data=rdata,ratetable=slopop,int=5,method="glm.poi")

> rsadd(Surv(time,cens)∼sex+as.factor(agegr)+

+ ratetable(age=age*365.24,sex=sex,year=year),

+ data=rdata,ratetable=slopop,int=5)

> rsmul(Surv(time,cens)∼sex+as.factor(agegr)+

+ ratetable(age=age*365.24,sex=sex,year=year),

+ data=rdata,ratetable=slopop,int=5)

> rstrans(Surv(time,cens)∼sex+as.factor(agegr)+

+ ratetable(age=age*365.24,sex=sex,year=year),

+ data=rdata,ratetable=slopop,int=5)

The outputs of all these functions are organized in the same way, the computed values are stored in objects that closely match those of other standard regression model functions in R (e.g. coxph function). As an example of the most important part of the output (obtained with function summary), we report the estimates of the β and γ coefficients, their standard errors and the results of the Wald test for each covariate (z and p values) for the Est`eve additive model (the default of the rsadd function):

Covariate agegr has four possible values, the youngest age group is automatically taken as the reference group. The output therefore consists of nine coefficients, the last five representing the follow-up interval indicators. The positive sex coefficient (γ = 0.903) implies that the survival of men is relatively better. The age however does not seem to be a very important factor, with even the oldest group coefficient not differing significantly from the youngest (p = 0.052). This means that the differences in survival between the age groups that we would get from any classical survival method are nearly wholly attributable to the population risks. The coefficients for the follow-up years are similar as well, only the first year seems to have a larger hazard (exp(fu[0, 1]) = 0.015).

To conclude

The usage of any statistical method depends mainly on the availability of adequate software, and relative survival techniques are no exception. For now, the choice of the model, and the method to fit it, were mostly influenced by software at hand. Different methods were programmed in different languages, and their usage was hindered by considerable effort needed to learn the special requirements of such programs. We believe that by incorporating all the different methods into an R package, now being widely used by statisticians and other professionals, such obstacles are mostly overcome. All the programs use the standard R format for commands, and their usage is as easy as that of any other regression program in R. Users, experienced in relative survival analysis, and in programming in R, will find it easy to adjust the programs to their needs if necessary.

The package was written by the first author and is publicly available at CRAN. The rest of the work means specifying the attributes. The easiest way to do it is to check the attributes of an existing rate table and imitate them (as is done in the following example for an array we named my.rate). The compulsary attributes are the following:

++ If we have a four-dimensional array with dimensions age, sex, year and race, we can specify this attribute as:

> attributes(my.rate)$dimid <- c("age","sex", "year","race")

++ dim. The dimensions of the array, for example

> attributes(my.rate)$dim <- c(100,2,4,3)

++ dimnames. The names of each of the groups, i.e. a vector of names for each of the dimid variables. The race names could be specified as (race is the fourth dimension of the array)

> attributes(my.rate)$dimnames[[4]]

      <- c("white","black","other")

++ factor. This attribute specifies which of the varibles are constant in time - the values for sex and race will therefore be equal to 1, the values for age and calendar year will be 0. Other values are allowed and are used in the US population tables, but special functions must of course be programmed when using those. In our example the value would be

> attributes(my.rate)$factor <- c(0,0,1,1)

++ cutpoints. A list of vectors denoting the cut points for each of the dimid variables (the constant ones like sex must have value NULL). If our population tables are estimated every ten years for four decades, vector of cut points will equal to (calendar year is the third dimension)

> attributes(my.rate)$cutpoints[[3]]

    <- as.date(c("1Jan70",

+ "1Jan80","1Jan90","1Jan2000"))

++ class. Set the class of the object to

> attributes(my.rate)$class <- "ratetable"

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 206,602评论 6 481
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 88,442评论 2 382
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 152,878评论 0 344
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 55,306评论 1 279
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 64,330评论 5 373
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 49,071评论 1 285
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 38,382评论 3 400
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 37,006评论 0 259
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 43,512评论 1 300
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 35,965评论 2 325
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 38,094评论 1 333
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 33,732评论 4 323
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 39,283评论 3 307
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 30,286评论 0 19
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 31,512评论 1 262
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 45,536评论 2 354
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 42,828评论 2 345

推荐阅读更多精彩内容