Function to prepare dataset for multi-state modeling in long format from dataset in wide format

This function converts a dataset in wide format (one subject per line, with multiple columns indicating time and status for the different states) into a dataset in long format (one line for each transition for which a subject is at risk). Selected covariates are replicated per subject.

Usage

msprep(time, status, data, trans, start, id, keep)

Arguments

time: Either 1) a matrix or data frame of dimension n x S (n being the number of individuals and S the number of states in the multi-state model), containing the times at which the states are visited or the last follow-up time, or 2) a character vector of length S containing the column names indicating these times. In the latter case, some elements of time may be NA, see Details
status: Either 1) a matrix or data frame of dimension n x S, containing, for each of the states, event indicators taking the value 1 if the state is visited or 0 if it is not (censored), or 2) a character vector of length S containing the column names indicating these status variables. In the latter case, some elements of status may be NA, see Details
data: Data frame (not a tibble) in wide format in which to interpret time, status, id or keep, if appropriate
trans: Transition matrix describing the states and transitions in the multi-state model. If S is the number of states in the multi-state model, trans should be an S x S matrix, with (i,j)-element a positive integer if a transition from i to j is possible in the multi-state model, NA otherwise. In particular, all diagonal elements should be NA. The integers indicating the possible transitions in the multi-state model should be sequentially numbered, 1,...,K, with K the number of transitions
start: List with elements state and time, containing the starting states and times of the subjects in the data. Default is NULL, in which case all subjects start in state 1 at time 0. If a single state and time are given, they are used for all subjects; otherwise the lengths of state and time should both equal the number of subjects in data
id: Either 1) a vector of length n containing the subject identifications, or 2) a character string indicating the column name containing these subject ids. If not provided, "id" will be assigned with values 1,...,n
keep: Either 1) a data frame or matrix with n rows or a numeric or factor vector of length n containing covariate(s) that need to be retained in the output dataset, or 2) a character vector containing the column names of these covariates in data
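
As an illustration of the format expected for trans, a three-state illness-death matrix can be written by hand in base R. This is only a sketch: the state names are assumptions chosen for the example, and the numbering follows the transitions shown in the Examples output (1: state 1 to 2, 2: state 1 to 3, 3: state 2 to 3).

```r
# Hand-built transition matrix for an illness-death model (sketch).
# State names are illustrative assumptions, not something msprep requires.
states <- c("healthy", "illness", "death")
tmat_manual <- matrix(NA_integer_, nrow = 3, ncol = 3,
                      dimnames = list(from = states, to = states))
tmat_manual["healthy", "illness"] <- 1L
tmat_manual["healthy", "death"]   <- 2L
tmat_manual["illness", "death"]   <- 3L

# Checks implied by the description of trans:
K <- sum(!is.na(tmat_manual))               # number of transitions
diag_ok <- all(is.na(diag(tmat_manual)))    # diagonal must be NA
# transitions must be numbered sequentially 1,...,K
seq_ok <- identical(sort(tmat_manual[!is.na(tmat_manual)]), seq_len(K))
```

Running such checks before calling msprep can catch a mis-specified matrix early.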

Value

An object of class "msdata", which is a data frame in long (counting process) format containing the subject id, the covariates (replicated per subject), and

from: the starting state
to: the receiving state
trans: the transition number
Tstart: the starting time of the transition
Tstop: the stopping time of the transition
time: the length of the interval, Tstop - Tstart (shown in the output of the Examples)
status: status variable, with 1 indicating an event (transition), 0 a censoring

The "msdata" object has the transition matrix as "trans" attribute.

Details

For msprep, the transition matrix should correspond to an irreversible acyclic Markov chain. In particular, on the diagonals only NAs are allowed.

The transition matrix, if irreversible and acyclic, will have starting states, i.e. states into which no transitions are possible. For these starting states NAs are allowed in the time and status arguments, either as columns, when specified as matrix or data frame, or as elements of the character vector when specified as character vector.
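
The definition of a starting state can be made concrete with a few lines of base R. The sketch below writes the illness-death matrix out by hand and is not taken from the msprep source; the variable names are assumptions.

```r
# Illness-death transition matrix written out by hand (NA = no transition)
tmat <- matrix(c(NA,  1,  2,
                 NA, NA,  3,
                 NA, NA, NA), nrow = 3, byrow = TRUE)

# A starting state has no incoming transitions: its column is all NA
starting <- which(colSums(!is.na(tmat)) == 0)
# An absorbing state has no outgoing transitions: its row is all NA
absorbing <- which(rowSums(!is.na(tmat)) == 0)

starting   # state 1
absorbing  # state 3
```

For this model only state 1 is a starting state, so only the first element of time and status may be NA.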

The function msprep uses a recursive algorithm through calls to the recursive function msprepEngine. First, with the current transition matrix, all starting states are detected (defined as states into which there are no transitions). For each of these starting states, all subjects starting from that state are selected and for each subject the next visited state is detected by looking at all transitions from that starting state and determining the smallest transition time with status=1. The recursive msprepEngine is called again with the starting states deleted from the transition matrix and with subjects deleted that either reached an absorbing state or that were censored. For the remaining subjects the starting states and times are updated in the next call. Datasets returned from the msprepEngine calls are appended to the current dataset in long format and finally sorted.
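
The step that determines the next visited state can be sketched in base R, using the wide-format values from the Examples. The helper next_state is hypothetical and written for illustration only; it is not msprepEngine.

```r
# Candidate transitions out of state 1 in the illness-death example:
# columns are the receiving states 2 (illness) and 3 (death)
times  <- cbind(`2` = c(1, 1, 6, 6, 8, 9), `3` = c(5, 1, 9, 7, 8, 12))
status <- cbind(`2` = c(1, 0, 1, 1, 0, 1), `3` = c(1, 1, 1, 1, 1, 1))

# Hypothetical helper: receiving state with the smallest event time,
# or NA when no transition has status 1 (subject censored in this state)
next_state <- function(t, s) {
  if (!any(s == 1)) return(NA_integer_)
  t[s != 1] <- Inf                     # ignore transitions without an event
  as.integer(names(t)[which.min(t)])   # which.min picks the first minimum
}

nxt <- vapply(seq_len(nrow(times)),
              function(i) next_state(times[i, ], status[i, ]),
              integer(1))
nxt
```

Subjects 2 and 5 move directly to state 3, while the others visit state 2 first, which agrees with the rows having status 1 in the Examples output.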

A warning is issued for a subject if multiple transitions exist with the same smallest transition time (and status=1). In such cases the next transition cannot be determined unambiguously, and the state with the smallest number is chosen. In our experience, occasionally the shortest transition time has status=0, while a higher transition time has status=1. In that case the larger transition time and the corresponding transition are selected. No warning is issued for these data inconsistencies.
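
The two situations in this paragraph can be mimicked with a small base R helper. Both the function pick_transition and its warning text are hypothetical, and ties are assessed here among transitions with an event, which is an assumption of this sketch rather than a statement about the msprep source.

```r
# Hypothetical helper mimicking the two situations described above;
# ties are assessed among transitions with an event (an assumption)
pick_transition <- function(t, s, states) {
  ev <- which(s == 1)
  if (length(ev) == 0) return(NA_integer_)   # no event: censored
  tmin <- min(t[ev])
  tied <- ev[t[ev] == tmin]
  if (length(tied) > 1)
    warning("tied smallest transition times; lowest state number chosen")
  min(states[tied])
}

# Tie: both transitions occur at time 3, so state 2 is chosen (with a warning)
a <- suppressWarnings(pick_transition(c(3, 3), c(1, 1), states = c(2L, 3L)))
# Inconsistency: the shortest time has status 0, so the later event is
# selected and no warning is raised
b <- pick_transition(c(2, 5), c(0, 1), states = c(2L, 3L))
c(tie = a, inconsistent = b)
```

Checking the wide data for such patterns before conversion may be worthwhile, since the second situation passes silently.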

References

Putter H, Fiocco M, Geskus RB (2007). Tutorial in biostatistics: Competing risks and multi-state models. Statistics in Medicine 26, 2389–2430.

Author

Hein Putter <H.Putter@lumc.nl> and Marta Fiocco

Examples


# transition matrix for illness-death model
tmat <- trans.illdeath()
# some data in wide format
tg <- data.frame(stt=rep(0,6),sts=rep(0,6),
        illt=c(1,1,6,6,8,9),ills=c(1,0,1,1,0,1),
        dt=c(5,1,9,7,8,12),ds=c(1,1,1,1,1,1),
        x1=c(1,1,1,2,2,2),x2=c(6:1))
tg$x1 <- factor(tg$x1,labels=c("male","female"))
tg$patid <- factor(2:7,levels=1:8,labels=as.character(1:8))
# define time, status and covariates also as matrices
tt <- matrix(c(rep(NA,6),tg$illt,tg$dt),6,3)
st <- matrix(c(rep(NA,6),tg$ills,tg$ds),6,3)
keepmat <- data.frame(gender=tg$x1,age=tg$x2)
# data in long format using msprep
msprep(time=tt,status=st,trans=tmat,keep=as.matrix(keepmat))
#> An object of class 'msdata'
#> 
#> Data:
#>    id from to trans Tstart Tstop time status  keep1 keep2
#> 1   1    1  2     1      0     1    1      1   male     6
#> 2   1    1  3     2      0     1    1      0   male     6
#> 3   1    2  3     3      1     5    4      1   male     6
#> 4   2    1  2     1      0     1    1      0   male     5
#> 5   2    1  3     2      0     1    1      1   male     5
#> 6   3    1  2     1      0     6    6      1   male     4
#> 7   3    1  3     2      0     6    6      0   male     4
#> 8   3    2  3     3      6     9    3      1   male     4
#> 9   4    1  2     1      0     6    6      1 female     3
#> 10  4    1  3     2      0     6    6      0 female     3
#> 11  4    2  3     3      6     7    1      1 female     3
#> 12  5    1  2     1      0     8    8      0 female     2
#> 13  5    1  3     2      0     8    8      1 female     2
#> 14  6    1  2     1      0     9    9      1 female     1
#> 15  6    1  3     2      0     9    9      0 female     1
#> 16  6    2  3     3      9    12    3      1 female     1
msprep(time=c(NA,"illt","dt"),status=c(NA,"ills","ds"),data=tg,
    id="patid",keep=c("x1","x2"),trans=tmat)
#> An object of class 'msdata'
#> 
#> Data:
#>    patid from to trans Tstart Tstop time status     x1 x2
#> 1      2    1  2     1      0     1    1      1   male  6
#> 2      2    1  3     2      0     1    1      0   male  6
#> 3      2    2  3     3      1     5    4      1   male  6
#> 4      3    1  2     1      0     1    1      0   male  5
#> 5      3    1  3     2      0     1    1      1   male  5
#> 6      4    1  2     1      0     6    6      1   male  4
#> 7      4    1  3     2      0     6    6      0   male  4
#> 8      4    2  3     3      6     9    3      1   male  4
#> 9      5    1  2     1      0     6    6      1 female  3
#> 10     5    1  3     2      0     6    6      0 female  3
#> 11     5    2  3     3      6     7    1      1 female  3
#> 12     6    1  2     1      0     8    8      0 female  2
#> 13     6    1  3     2      0     8    8      1 female  2
#> 14     7    1  2     1      0     9    9      1 female  1
#> 15     7    1  3     2      0     9    9      0 female  1
#> 16     7    2  3     3      9    12    3      1 female  1
# Patients 5 and 6 now start in state 2, at times t=4 and t=10 respectively
msprep(time=tt,status=st,trans=tmat,keep=keepmat,
        start=list(state=c(1,1,1,1,2,2),time=c(0,0,0,0,4,10)))
#> An object of class 'msdata'
#> 
#> Data:
#>    id from to trans Tstart Tstop time status gender age
#> 1   1    1  2     1      0     1    1      1   male   6
#> 2   1    1  3     2      0     1    1      0   male   6
#> 3   1    2  3     3      1     5    4      1   male   6
#> 4   2    1  2     1      0     1    1      0   male   5
#> 5   2    1  3     2      0     1    1      1   male   5
#> 6   3    1  2     1      0     6    6      1   male   4
#> 7   3    1  3     2      0     6    6      0   male   4
#> 8   3    2  3     3      6     9    3      1   male   4
#> 9   4    1  2     1      0     6    6      1 female   3
#> 10  4    1  3     2      0     6    6      0 female   3
#> 11  4    2  3     3      6     7    1      1 female   3
#> 12  5    2  3     3      4     8    4      1 female   2
#> 13  6    2  3     3     10    12    2      1 female   1