get_T2_two function

Hotelling's statistics (for two independent (small) samples)

The function get_T2_two() estimates the parameters for Hotelling's two-sample $T^2$ statistic for small samples.

get_T2_two(m1, m2, signif, na_rm = FALSE)

Arguments

  • m1: A matrix with the data of the reference group, e.g. a matrix representing dissolution profiles, i.e. with rows for the different dosage units and columns for the different time points, or a matrix for the different model parameters (columns) of different dosage units (rows).

  • m2: A matrix with the same dimensions as matrix m1 with the data of the test group, having the same characteristics as the data of matrix m1.

  • signif: A positive numeric value between 0 and 1 that specifies the significance level. The default value is 0.05.

  • na_rm: A logical value that indicates whether observations containing NA (or NaN) values should be removed (na_rm = TRUE) or not (na_rm = FALSE). The default is na_rm = FALSE.
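The effect of removing observations that contain NA or NaN values can be illustrated with base R. This is a minimal sketch on a made-up matrix; whether get_T2_two() uses complete.cases() internally is an assumption, and the object names m, keep and m_clean are hypothetical.

```r
# Hypothetical data: 4 dosage units (rows) x 3 time points (columns)
m <- matrix(c(10, 35, 80,
              12, NA, 82,
              11, 36, 79,
               9, 33, NaN),
            nrow = 4, byrow = TRUE,
            dimnames = list(NULL, c("t.5", "t.15", "t.30")))

# complete.cases() flags the rows that contain neither NA nor NaN
keep <- complete.cases(m)

# Keep only the complete observations; drop = FALSE preserves the matrix shape
m_clean <- m[keep, , drop = FALSE]
m_clean
```

Only the first and the third row remain, because the second row contains NA and the fourth contains NaN.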

Returns

A list with the following elements is returned:

  • Parameters: Parameters determined for the estimation of Hotelling's $T^2$.

  • S.pool: Pooled variance-covariance matrix.

  • covs: A list with the elements S.b1 and S.b2, i.e. the variance-covariance matrices of the reference and the test group, respectively.

  • means: A list with the elements mean.b1, mean.b2 and mean.diff, i.e. the average dissolution profile values (for each time point) or the average model parameters of the reference and the test group and the corresponding differences, respectively.

  • CI: A list with the elements Hotelling and Bonferroni, i.e. data frames with columns LCL and UCL for the lower and upper $(1 - \alpha)100\%$ confidence limits, respectively, and rows for each time point or model parameter.

The Parameters element contains the following information:

  • dm: Mahalanobis distance of the samples.

  • df1: Degrees of freedom (number of variables or time points).

  • df2: Degrees of freedom (number of rows - number of variables - 1).

  • alpha: Provided significance level.

  • K: Scaling factor for $F$ to account for the distribution of the $T^2$ statistic.

  • k: Scaling factor for the squared Mahalanobis distance to obtain the $T^2$ statistic.

  • T2: Hotelling's $T^2$ statistic ($F$-distributed).

  • F: Observed $F$ value.

  • F.crit: Critical $F$ value.

  • t.crit: Critical $t$ value.

  • p.F: $p$ value for Hotelling's $T^2$ test statistic.

Details

The two-sample Hotelling's $T^2$ test statistic is given by

$$T^2 = \frac{n_T n_R}{n_T + n_R} \left( \bm{x}_T - \bm{x}_R \right)^{\top} \bm{S}_{pooled}^{-1} \left( \bm{x}_T - \bm{x}_R \right) ,$$

where $\bm{x}_T$ and $\bm{x}_R$ are the vectors of the sample means of the test ($T$) and reference ($R$) group, e.g. vectors of the average dissolution per time point or of the average model parameters, $n_T$ and $n_R$ are the numbers of observations of the test and the reference group, respectively (i.e. the number of rows in matrices m2 and m1 handed over to the get_T2_two() function), and $\bm{S}_{pooled}$ is the pooled variance-covariance matrix, which is calculated by

$$\bm{S}_{pooled} = \frac{(n_R - 1) \bm{S}_R + (n_T - 1) \bm{S}_T}{n_R + n_T - 2} ,$$

where $\bm{S}_R$ and $\bm{S}_T$ are the estimated variance-covariance matrices which are calculated from the matrices of the two groups being compared, i.e. m1 and m2. The matrix $\bm{S}_{pooled}^{-1}$ is the inverse of the pooled variance-covariance matrix. As the number of columns of matrices m1 and m2 increases, and especially as the correlation between the columns increases, the risk increases that the pooled variance-covariance matrix $\bm{S}_{pooled}$ is ill-conditioned or even singular and thus cannot be inverted. The term

$$D_M = \sqrt{ \left( \bm{x}_T - \bm{x}_R \right)^{\top} \bm{S}_{pooled}^{-1} \left( \bm{x}_T - \bm{x}_R \right) }$$

is the Mahalanobis distance, which is used to measure the difference between two multivariate means. For large samples, $T^2$ is approximately chi-square distributed with $p$ degrees of freedom, where $p$ is the number of variables, i.e. the number of dissolution profile time points or the number of model parameters. In terms of the Mahalanobis distance, Hotelling's $T^2$ statistic can be expressed as

$$\frac{n_T n_R}{n_T + n_R} \; D_M^2 = k \; D_M^2 .$$
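The quantities defined so far can be reproduced with base R. The following is a minimal sketch on simulated data, not the implementation used by get_T2_two(); the objects m_R, m_T, rc and the seed are hypothetical choices made for illustration. Note that stats::mahalanobis() returns the squared distance $D_M^2$.

```r
# Simulated, purely illustrative data: 6 dosage units x 2 time points per group
set.seed(123)
m_R <- matrix(rnorm(12, mean = 50, sd = 5), nrow = 6)  # reference group
m_T <- matrix(rnorm(12, mean = 56, sd = 5), nrow = 6)  # test group

n_R <- nrow(m_R)
n_T <- nrow(m_T)

# Pooled variance-covariance matrix
S_pooled <- ((n_R - 1) * cov(m_R) + (n_T - 1) * cov(m_T)) / (n_R + n_T - 2)

# Reciprocal condition number: values close to 0 indicate that S_pooled is
# ill-conditioned or singular, so that inverting it is unreliable
rc <- rcond(S_pooled)

# Squared Mahalanobis distance between the two mean vectors
D_M2 <- mahalanobis(x = colMeans(m_T), center = colMeans(m_R), cov = S_pooled)
D_M <- sqrt(D_M2)

# Scaling factor k and Hotelling's T2 statistic
k <- n_T * n_R / (n_T + n_R)
T2 <- k * D_M2
```

Checking rcond() before inverting is one possible safeguard against the singularity problem described above. A threshold for "too small" is a judgment call and is not specified on this page.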

To transform Hotelling's $T^2$ statistic into an $F$-statistic, a conversion factor is necessary, i.e.

$$K = k \; \frac{n_T + n_R - p - 1}{\left( n_T + n_R - 2 \right) p} .$$

With this transformation, the following test statistic can be applied:

$$K \; D_M^2 \leq F_{p, n_T + n_R - p - 1, \alpha} .$$

Under the null hypothesis, $H_0: \mu_T = \mu_R$, this $F$-statistic is $F$-distributed with $p$ and $n_T + n_R - p - 1$ degrees of freedom. $H_0$ is rejected at significance level $\alpha$ if the $F$-value exceeds the critical value from the $F$-table evaluated at $\alpha$, i.e. $F > F_{p, n_T + n_R - p - 1, \alpha}$. The null hypothesis is satisfied if, and only if, the population means are identical for all variables. The alternative is that at least one pair of these means is different.
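The conversion and the test decision can be retraced with base R. The sketch below starts from the Mahalanobis distance and the significance level reported for res1 in the Examples section of this page. The group sizes of six observations each are not stated there; they are inferred from the reported k = 3 and df2 = 9 and should be read as an assumption. The variable names are hypothetical.

```r
# Values reported for res1 in the Examples section
dm <- 10.44045   # Mahalanobis distance
alpha <- 0.1     # significance level
p <- 2           # number of variables (time points t.15 and t.90)

# Assumed group sizes, inferred from k = 3 and df2 = 9
n_T <- 6
n_R <- 6

df1 <- p
df2 <- n_T + n_R - p - 1

# Scaling factors
k <- n_T * n_R / (n_T + n_R)
K <- k * df2 / ((n_T + n_R - 2) * p)

# Hotelling's T2 statistic and the observed F value
T2 <- k * dm^2
F_obs <- K * dm^2

# Critical F value and p value from the F distribution
F_crit <- qf(1 - alpha, df1 = df1, df2 = df2)
p_F <- pf(F_obs, df1 = df1, df2 = df2, lower.tail = FALSE)

# H0 is rejected if the observed F value exceeds the critical value
reject_H0 <- F_obs > F_crit
```

Up to the rounding of dm, these values agree with K, T2, F, F.crit and p.F listed for res1.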

The following assumptions concerning the data are made:

  • The data from population $i$ is a sample from a population with mean vector $\mu_i$. In other words, it is assumed that there are no sub-populations.
  • The data from both populations have common variance-covariance matrix $\Sigma$.
  • The elements from both populations are independently sampled, i.e. the data values are independent.
  • Both populations are multivariate normally distributed.

Confidence intervals:

Confidence intervals for the mean differences at each time point, or for the mean differences between the parameter estimates of the reference and the test group, are calculated using the formula

$$\left( \bm{x}_T - \bm{x}_R \right) \pm \sqrt{\frac{1}{K} \; F_{p, n_T + n_R - p - 1, \alpha} \; \bm{s}_{pooled}} ,$$

where $\bm{s}_{pooled}$ is the vector of the diagonal elements of the pooled variance-covariance matrix $\bm{S}_{pooled}$. With $(1 - \alpha)100\%$ confidence, this interval covers the respective linear combination of the differences between the means of the two sample groups. If the individual variables are of interest rather than the linear combination of the variables, the Bonferroni-corrected confidence intervals should be used instead, which are given by the expression

$$\left( \bm{x}_T - \bm{x}_R \right) \pm t_{n_T + n_R - 2, \frac{\alpha}{2 p}} \; \sqrt{\frac{1}{k} \; \bm{s}_{pooled}} .$$
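Both interval types can be sketched in base R directly from the two expressions above. The diagonal of the pooled variance-covariance matrix and the significance level are taken from the res1 example on this page; the group sizes (six each) are inferred from the reported k and df2, and the mean differences in mean_diff are invented for illustration, so the resulting limits are not results of get_T2_two().

```r
# Diagonal of S.pool as reported for res1 in the Examples section
s_pooled <- c(t.15 = 3.395808, t.90 = 4.434833)

# Assumed settings: group sizes inferred from k = 3 and df2 = 9
n_T <- 6
n_R <- 6
p <- 2
alpha <- 0.1

# Hypothetical mean differences (test minus reference), illustration only
mean_diff <- c(t.15 = -5, t.90 = 2)

df2 <- n_T + n_R - p - 1
k <- n_T * n_R / (n_T + n_R)
K <- k * df2 / ((n_T + n_R - 2) * p)

# Critical values of the F and the t distribution
F_crit <- qf(1 - alpha, df1 = p, df2 = df2)
t_crit <- qt(1 - alpha / (2 * p), df = n_T + n_R - 2)

# Half-widths of the Hotelling and the Bonferroni intervals
half_H <- sqrt(F_crit / K * s_pooled)
half_B <- t_crit * sqrt(s_pooled / k)

CI <- list(
  Hotelling = data.frame(LCL = mean_diff - half_H, UCL = mean_diff + half_H),
  Bonferroni = data.frame(LCL = mean_diff - half_B, UCL = mean_diff + half_B)
)
CI
```

With these settings the Hotelling intervals are wider than the Bonferroni intervals, and t_crit agrees with the t.crit value listed for res1.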

Examples

# Estimation of the parameters for Hotelling's two-sample T2 statistic
# (for small samples)
res1 <-
  get_T2_two(m1 = as.matrix(dip1[dip1$type == "R", c("t.15", "t.90")]),
             m2 = as.matrix(dip1[dip1$type == "T", c("t.15", "t.90")]),
             signif = 0.1)
res1$S.pool
res1$Parameters

# Results in res1$S.pool
#          t.15     t.90
# t.15 3.395808 1.029870
# t.90 1.029870 4.434833

# Results in res1$Parameters
#           dm          df1          df2       signif            K
# 1.044045e+01 2.000000e+00 9.000000e+00 1.000000e-01 1.350000e+00
#            k           T2            F       F.crit       t.crit
# 3.000000e+00 3.270089e+02 1.471540e+02 3.006452e+00 2.228139e+00
#          p.F
# 1.335407e-07

# The results above correspond to the values that are shown in Tsong (1996)
# (see reference of dip1 data set) under paragraph "DATA1 data (Comparing
# the 15- and 90-minute sample time points only)".

# For the second assessment shown in Tsong (1996) (see reference of dip1 data
# set) under paragraph "DATA2 data (Comparing all eight time points)", the
# following results are obtained.
res2 <-
  get_T2_two(m1 = as.matrix(dip1[dip1$type == "R", 3:10]),
             m2 = as.matrix(dip1[dip1$type == "T", 3:10]),
             signif = 0.1)
res2$Parameters

# Results in res2$Parameters
#           dm          df1          df2       signif            K
# 2.648562e+01 8.000000e+00 3.000000e+00 1.000000e-01 1.125000e-01
#            k           T2            F       F.crit       t.crit
# 3.000000e+00 2.104464e+03 7.891739e+01 5.251671e+00 3.038243e+00
#          p.F
# 2.116258e-03

# In Tsong (1997) (see reference of dip7), the model-dependent approach is
# illustrated with an example data set of alpha and beta parameters obtained
# by fitting the Weibull curve function to a data set of dissolution profiles
# of three reference batches and one new batch (12 profiles per batch).
res3 <-
  get_T2_two(m1 = as.matrix(dip7[dip7$type == "ref", c("alpha", "beta")]),
             m2 = as.matrix(dip7[dip7$type == "test", c("alpha", "beta")]),
             signif = 0.05)
res3$Parameters

# Results in res3$Parameters
#           dm          df1          df2       signif            K
# 3.247275e+00 2.000000e+00 4.500000e+01 5.000000e-02 4.402174e+00
#            k           T2            F       F.crit       t.crit
# 9.000000e+00 9.490313e+01 4.642001e+01 3.204317e+00 2.317152e+00
#          p.F
# 1.151701e-11

# In Sathe (1996) (see reference of dip8), the model-dependent approach is
# illustrated with an example data set of alpha and beta parameters obtained
# by fitting the Weibull curve function to a data set of dissolution profiles
# of one reference batch and one new batch with minor modifications and another
# new batch with major modifications (12 profiles per batch). Note that the
# assessment is performed on the (natural) logarithm scale.
res4.minor <-
  get_T2_two(m1 = log(as.matrix(dip8[dip8$type == "ref", c("alpha", "beta")])),
             m2 = log(as.matrix(dip8[dip8$type == "minor", c("alpha", "beta")])),
             signif = 0.1)
res4.major <-
  get_T2_two(m1 = log(as.matrix(dip8[dip8$type == "ref", c("alpha", "beta")])),
             m2 = log(as.matrix(dip8[dip8$type == "major", c("alpha", "beta")])),
             signif = 0.1)
res4.minor$Parameters
res4.minor$CI$Hotelling
res4.major$Parameters
res4.major$CI$Hotelling

# Expected results in res4.minor$Parameters
#           dm          df1          df2       signif            K
#  1.462603730  2.000000000 21.000000000  0.100000000  2.863636364
#            k           T2            F       F.crit       t.crit
#  6.000000000 12.835258028  6.125918604  2.574569390  2.073873068
#          p.F
#  0.008021181

# Results in res4.minor$CI$Hotelling
#              LCL         UCL
# alpha -0.2553037 -0.02814098
# beta  -0.1190028  0.01175691

# Expected results in res4.major$Parameters
# (signif corrected to 0.1 to match the call and the listed F.crit and t.crit)
#           dm          df1          df2       signif            K
# 4.508190e+00 2.000000e+00 2.100000e+01 1.000000e-01 2.863636e+00
#            k           T2            F       F.crit       t.crit
# 6.000000e+00 1.219427e+02 5.819992e+01 2.574569e+00 2.073873e+00
#          p.F
# 2.719240e-09

# Expected results in res4.major$CI$Hotelling
#              LCL        UCL
# alpha -0.4864736 -0.2360966
# beta   0.1954760  0.3035340

References

Hotelling, H. The generalisation of Student's ratio. Ann Math Stat. 1931; 2 (3): 360-378.

Hotelling, H. (1947) Multivariate quality control illustrated by air testing of sample bombsights. In: Eisenhart, C., Hastay, M.W., and Wallis, W.A., Eds., Techniques of Statistical Analysis, McGraw Hill, New York, 111-184.

See Also

get_T2_one, get_sim_lim, mimcr.

  • Maintainer: Pius Dahinden
  • License: GPL (>= 2)
  • Last published: 2025-03-24