Multiple endpoints are very common in clinical trials, often reflecting the complexity of the treatment effect that a study aims to estimate. In Parkinson’s Disease, for instance, the endpoint favored by regulators is often the Unified Parkinson’s Disease Rating Scale (UPDRS) Motor Score, but other measures of drug activity are of paramount importance to both the clinician and the patient, such as the amount of Good Quality ON Time (where ON time refers to whether the patient has received a symptomatic treatment such as Levodopa). A clinical trial might then investigate the treatment effect on both endpoints in order to further support efficacy claims for the drug being studied.
Testing multiple endpoints is not a problem per se; issues arise when one tries to make a claim based on the results. One is free to perform 2 tests at the 5% level, but the overall type I error rate (also known as the Family-Wise Error Rate, FWER) is then inflated: for two independent tests it rises to nearly 10% (1 − 0.95² = 9.75%), i.e. nearly twice the level of each individual test, which makes any claim of a treatment effect less robust and thus questionable. To circumvent this issue many procedures have been proposed and used at length in the clinical trial setting, such as the simple Bonferroni correction or the more powerful Holm’s method (with its variants). What these methods have in common, though, is that they draw a separate conclusion for each endpoint and do not allow one to make a ‘global’ statement on the set of variables being tested. Endpoint-by-endpoint inference is very appealing in a pivotal trial, where we want to pinpoint the specific effect on each variable and how significant it is, but there are contexts (e.g. early trials or exploratory studies) where we are more interested in whether a subset of variables is jointly suggestive of a treatment effect (e.g. when many correlated biomarkers are being evaluated and only the key ones are to be used for further development and/or drug characterization).
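As a rough illustration of the inflation described above, the following Python sketch computes the family-wise error rate for independent tests and applies the Bonferroni and Holm adjustments. The function names are illustrative, independence between the tests is an assumption, and the p-values are invented for the example.

```python
def fwer_independent(alpha: float, n_tests: int) -> float:
    """Family-wise error rate for n_tests independent tests, each at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni(pvalues: list[float]) -> list[float]:
    """Bonferroni-adjusted p-values: each p-value times the number of tests."""
    m = len(pvalues)
    return [min(1.0, m * p) for p in pvalues]


def holm(pvalues: list[float]) -> list[float]:
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for step, idx in enumerate(order):
        candidate = min(1.0, (m - step) * pvalues[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


if __name__ == "__main__":
    print(fwer_independent(0.05, 2))   # two independent tests at the 5% level
    raw = [0.012, 0.030]               # made-up p-values for two endpoints
    print(bonferroni(raw))
    print(holm(raw))
```

With these made-up p-values, Bonferroni would reject only the first endpoint at the 5% level, whereas Holm would reject both, which is the sense in which Holm’s method is more powerful.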
The Global Statistical Test (GST) introduced by O’Brien in 1984 serves this purpose by mapping a multivariate problem onto a univariate scale, so that subsets of variables can be assessed as a whole and a single probability statement (a p-value) can be obtained for each subset. The method is very flexible: in its simplest formulation it is completely non-parametric, since it is based on ranks, but parametric versions based on Ordinary or Generalized Least Squares (OLS and GLS/Modified GLS) have also been proposed [1, 2].
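A minimal Python sketch of the rank-based version may help fix ideas: each endpoint is ranked across all subjects, the ranks are summed per subject, and the two arms are compared on that single score. This sketch assumes two arms and that larger values are better on every endpoint. For the p-value it swaps in a permutation test on the pooled t-statistic (rather than a t or ANOVA reference distribution) so that only the standard library is needed. All function names and data are hypothetical.

```python
import math
import random
from statistics import mean


def midranks(values):
    """Ranks 1..n, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def rank_sum_scores(rows):
    """One score per subject: the sum over endpoints of that subject's rank."""
    n_endpoints = len(rows[0])
    ranked = [midranks([row[k] for row in rows]) for k in range(n_endpoints)]
    return [sum(ranked[k][i] for k in range(n_endpoints)) for i in range(len(rows))]


def pooled_t(a, b):
    """Two-sample t-statistic with pooled variance."""
    diff = mean(a) - mean(b)
    ss = sum((x - mean(a)) ** 2 for x in a) + sum((x - mean(b)) ** 2 for x in b)
    if ss == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    pooled_var = ss / (len(a) + len(b) - 2)
    return diff / math.sqrt(pooled_var * (1 / len(a) + 1 / len(b)))


def rank_sum_gst(arm_a, arm_b, n_perm=2000, seed=0):
    """Return (t-statistic on rank-sum scores, two-sided permutation p-value)."""
    scores = rank_sum_scores(arm_a + arm_b)   # rank over the pooled sample
    n_a = len(arm_a)
    observed = pooled_t(scores[:n_a], scores[n_a:])
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        shuffled = scores[:]
        rng.shuffle(shuffled)                 # re-assign subjects to arms at random
        if abs(pooled_t(shuffled[:n_a], shuffled[n_a:])) >= abs(observed) - 1e-12:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    placebo = [[1.0, 2.1], [2.0, 1.7], [3.0, 2.9], [2.5, 2.0]]   # made-up data
    active = [[3.5, 3.1], [4.0, 2.8], [5.0, 4.2], [4.5, 3.9]]
    print(rank_sum_gst(placebo, active))
```

Ranking over the pooled sample is what makes the score comparable across endpoints measured on different scales.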
Notation-wise, let g be the number of treatments, m the number of endpoints, and μjk, σjk and λjk the mean, standard deviation and effect size for the j-th treatment and k-th endpoint, where λjk = μjk/σjk. If the treatment effects are the same across endpoints, the problem can be reduced to a univariate one by creating convenient subject-specific scores, which can then be analyzed with common ANOVA models. Table 1 summarizes what these scores look like for each approach, following the notation in . An interesting feature of both GLS and MGLS, which makes them particularly appealing in a number of applications, is that the correlation between endpoints is factored into the score, so that the contribution of an endpoint that is correlated with the others is down-weighted.
Table 1 Summary of GST methods
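To make the down-weighting concrete, here is a small Python sketch of the GLS endpoint weights, which are proportional to the row sums of the inverse correlation matrix (the J'R⁻¹ term in O’Brien’s GLS statistic, with J a vector of ones and R the correlation matrix of the endpoints). The correlation matrix below is invented for illustration, and `solve` and `gls_weights` are hypothetical helper names, not part of any published macro.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]


def gls_weights(corr):
    """GLS endpoint weights, proportional to R^{-1} J and rescaled to sum to 1."""
    raw = solve(corr, [1.0] * len(corr))
    total = sum(raw)
    return [w / total for w in raw]


if __name__ == "__main__":
    # Made-up example: endpoints 1 and 2 are strongly correlated,
    # endpoint 3 is independent of both.
    corr = [
        [1.0, 0.8, 0.0],
        [0.8, 1.0, 0.0],
        [0.0, 0.0, 1.0],
    ]
    print(gls_weights(corr))
    # With uncorrelated endpoints the weights are equal, as in the OLS score.
    print(gls_weights([[1.0, 0.0], [0.0, 1.0]]))
```

In the first example the two correlated endpoints each receive a smaller weight than the independent one, since they carry partly redundant information.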
The characteristics of these methods (type I error control and power) were assessed via simulations (as in ). For type I error, we simulated data for 3 variables with a common, constant correlation ρ (0 or 0.3), equal means and varying sample size (N = 20, 30, 40, 50), sampling from either a multivariate normal or a multivariate log-normal distribution. The results, displayed in Table 2 below, show that overall the type I error is controlled at the 5% level; the rank-sum method is the most conservative across many scenarios, and MGLS outperforms the other methods for positively correlated log-normal samples.
Table 2: Type I error for GST methods, varying sample size and constant correlation
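A simulation of this kind can be sketched in a few lines of Python. This is not the code behind Table 2: it covers only the OLS score, takes N = 50 to mean 25 subjects per arm (an assumption), and uses a hard-coded t critical value with an approximate t reference distribution for the score comparison. All names are illustrative.

```python
import math
import random
from statistics import mean

N_PER_ARM = 25     # assumption: N = 50 split equally between two arms
T_CRIT = 2.0106    # two-sided 5% critical value of Student's t with 48 d.f.


def simulate_arm(rng, n_endpoints, rho, lognormal):
    """Equicorrelated endpoints with equal means, via a shared random factor."""
    sample = []
    for _ in range(N_PER_ARM):
        shared = rng.gauss(0.0, 1.0)
        row = [math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
               for _ in range(n_endpoints)]
        sample.append([math.exp(x) for x in row] if lognormal else row)
    return sample


def pooled_ss(a, b):
    """Within-arm sum of squares, pooled over the two arms."""
    return sum((x - mean(a)) ** 2 for x in a) + sum((x - mean(b)) ** 2 for x in b)


def pooled_t(a, b):
    """Two-sample t-statistic with pooled variance."""
    pooled_var = pooled_ss(a, b) / (len(a) + len(b) - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / len(a) + 1 / len(b)))


def ols_scores(arm_a, arm_b):
    """OLS score: sum of endpoints, each divided by its pooled within-arm SD."""
    n_endpoints = len(arm_a[0])
    df = len(arm_a) + len(arm_b) - 2
    sds = [math.sqrt(pooled_ss([r[k] for r in arm_a], [r[k] for r in arm_b]) / df)
           for k in range(n_endpoints)]

    def score(row):
        return sum(row[k] / sds[k] for k in range(n_endpoints))

    return [score(r) for r in arm_a], [score(r) for r in arm_b]


def type_one_error(n_sims=400, n_endpoints=3, rho=0.3, lognormal=False, seed=1):
    """Share of simulated null trials in which the OLS GST rejects at 5%."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        arm_a = simulate_arm(rng, n_endpoints, rho, lognormal)
        arm_b = simulate_arm(rng, n_endpoints, rho, lognormal)  # same distribution
        scores_a, scores_b = ols_scores(arm_a, arm_b)
        if abs(pooled_t(scores_a, scores_b)) > T_CRIT:
            rejections += 1
    return rejections / n_sims


if __name__ == "__main__":
    for rho in (0.0, 0.3):
        for lognormal in (False, True):
            print(rho, lognormal, type_one_error(rho=rho, lognormal=lognormal))
```

Shifting the means of one arm in `simulate_arm` would turn the same loop into an empirical power calculation.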
In terms of power, the results in Figure 1 suggest that the non-parametric GST is more powerful when the data follow a strictly non-Normal distribution (e.g. log-normal) or in the presence of outliers, whereas MGLS is more powerful than all other parametric options across all scenarios investigated (and than the non-parametric one when the data were drawn from a pure multivariate Normal distribution). Also, the higher the correlation, the lower the power.
Figure 1: Empirical power over 1000 simulations, N = 50, µA = (13, 15, 16) and µB = (13, 14, 13) and varying scenarios
To illustrate how the GST can add value compared to standard multivariate approaches such as multivariate ANOVA (MANOVA) or Hotelling’s T² test, we considered the mtept dataset contained in the R package multcomp, using its 4 endpoints and 2 treatments. Tables 3 and 4 and Figure 2 report some basic univariate information, including pairwise correlations and ANOVA F tests. Whilst there seems to be some evidence that the treatment has an effect on 3 out of 4 variables, the direction of the effect is not consistent: E4 shows a decrease in Placebo compared to the active drug, unlike the other endpoints.
If we now analyze all possible subsets of endpoints to identify the most promising ones, Hotelling’s test returns significant results for nearly all combinations, whereas according to the MGLS test only E1 and E2 jointly are suggestive of a treatment effect (see Table 5). The reason for this difference is that the MANOVA approach identifies ‘any’ difference, so that two endpoints moving in completely opposite directions would still jointly suggest a treatment effect, even though the result is uninterpretable. The GST, on the other hand, assumes that the ‘benefit’ of one treatment versus the other has the same direction for all endpoints, and as such only variables with e.g. a positive difference in the Active – Placebo comparison would be flagged as jointly significant.
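The contrast between the two approaches can be reproduced on a toy example. The Python sketch below uses two invented endpoints whose treatment effects have the same size but opposite sign: Hotelling’s T² is large, while the t-statistic on the OLS GST score is essentially zero because the two effects cancel in the sum. The data and function names are made up, and the OLS score is used here in place of MGLS to keep the code short.

```python
import math
from statistics import mean


def pooled_cov(arm_a, arm_b):
    """Pooled within-arm 2x2 covariance matrix for two endpoints."""
    df = len(arm_a) + len(arm_b) - 2
    cov = [[0.0, 0.0], [0.0, 0.0]]
    for arm in (arm_a, arm_b):
        centre = [mean(row[k] for row in arm) for k in range(2)]
        for row in arm:
            dev = [row[k] - centre[k] for k in range(2)]
            for i in range(2):
                for j in range(2):
                    cov[i][j] += dev[i] * dev[j] / df
    return cov


def hotelling_t2(arm_a, arm_b):
    """Two-sample Hotelling's T-squared statistic for two endpoints."""
    n_a, n_b = len(arm_a), len(arm_b)
    s = pooled_cov(arm_a, arm_b)
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det], [-s[1][0] / det, s[0][0] / det]]
    diff = [mean(r[k] for r in arm_a) - mean(r[k] for r in arm_b) for k in range(2)]
    quad = sum(diff[i] * inv[i][j] * diff[j] for i in range(2) for j in range(2))
    return n_a * n_b / (n_a + n_b) * quad


def ols_gst_t(arm_a, arm_b):
    """t-statistic comparing the arms on the OLS score (sum of standardized endpoints)."""
    s = pooled_cov(arm_a, arm_b)
    sds = [math.sqrt(s[k][k]) for k in range(2)]

    def score(row):
        return sum(row[k] / sds[k] for k in range(2))

    a = [score(r) for r in arm_a]
    b = [score(r) for r in arm_b]
    ss = sum((x - mean(a)) ** 2 for x in a) + sum((x - mean(b)) ** 2 for x in b)
    pooled_var = ss / (len(a) + len(b) - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / len(a) + 1 / len(b)))


# Made-up data: the active arm is shifted by +2 on endpoint 1 and -2 on endpoint 2.
placebo = [(-1.0, 0.0), (0.0, -1.0), (1.0, 0.0), (0.0, 1.0)]
active = [(x + 2.0, y - 2.0) for x, y in placebo]

if __name__ == "__main__":
    print("Hotelling T2:", hotelling_t2(placebo, active))   # large: 'any' difference
    print("OLS GST t:  ", ols_gst_t(placebo, active))       # about 0: effects cancel
```

Flipping the sign of the second shift (so that both endpoints improve) leaves T² unchanged but makes the GST statistic large, which is the directional behaviour described above.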
Table 5: mtept analysis example with GST and MANOVA techniques
In conclusion, the GST is a flexible and powerful method that can be coupled with standard multiplicity correction techniques to provide further evidence of an overall treatment effect. On top of this, it can be used as a standalone method when the goal is e.g. the derivation of a composite endpoint or some kind of feature selection for further development. A SAS macro is available in  to perform the actual test, although it can be programmed rather easily using other software such as R.
- O’Brien PC. Procedures for comparing samples with multiple endpoints. Biometrics 1984; 40(4): 1079-1087
- Tang DI, Geller NL, Pocock SJ. On the design and analysis of randomized clinical trials with multiple endpoints. Biometrics 1993; 49: 23-30
- Dmitrienko A, Molenberghs G, Chuang-Stein C, Offen W. Analysis of Clinical Trials Using SAS – A Practical Guide. Cary, NC: SAS Press; 2007.