
Project Whitesheet


Meta-Student Statistical Methods Whitesheet

Version: 1.0

Date: May 2026

Platform: Meta-Student — Cumulative Science Platform for Undergraduate Research

R Engine: lme4/lmerTest + metafor packages, running under R >= 4.4

This document describes every statistical calculation performed by the platform, the assumptions underlying each, and how to interpret every output shown on the results page. It is intended as a transparent audit trail for researchers, supervisors, and reviewers.


1. Overview of the Analysis Pipeline

When a study administrator triggers a meta-analysis, the platform:

1. Fetches all submissions with status VALIDATED or INCLUDED from the database, in chronological order of submission.

2. Downloads the raw CSV data file from each submission (individual participant data).

3. Sends the complete participant-level data to the R engine via a POST request to /run.

4. Has the R engine stack participant data across studies, fit a linear mixed model appropriate to the study design, and compute per-study estimates, diagnostics, and the full analytics suite.

5. Stores the results in the database and displays them on the results page.

A minimum of 2 submissions with at least 4 total participants is required before analysis can run.

1.1 Why IPD Meta-Analysis?

The platform uses Individual Participant Data (IPD) meta-analysis via linear mixed models rather than summary-statistics meta-analysis.

Advantages of the IPD approach:

  • No manual summary statistic entry — students upload raw data CSVs, eliminating transcription errors.
  • Greater statistical power — participant-level data retains more information than summary statistics.
  • Correct handling of complex designs — crossover trials, nested structures, and pre-post correlations are modelled directly rather than approximated.
  • Natural variance decomposition — the mixed model simultaneously estimates between-study variance (τ²) and within-study residual variance (σ²).
  • Flexibility — covariates and design features (period effects, sequence effects) are handled within the model rather than requiring external corrections.

2. Study Designs and Required CSV Columns

The platform supports five study designs. The design is set at study creation and determines the CSV template students receive.

2.1 Randomised Controlled Trial (RCT)

Each row is one participant.

| Column | Meaning |
|---|---|
| participant_id | Unique participant identifier |
| group | Group assignment: treatment / control (or 1 / 0) |
| outcome_value | Measured outcome in original units |
| age (optional) | Participant age |
| sex (optional) | Participant sex |
| notes (optional) | Free text |

2.2 Pre-Post Study (PRE_POST)

Each row is one participant measured before and after an intervention.

| Column | Meaning |
|---|---|
| participant_id | Unique participant identifier |
| score_1 | Pre-intervention score |
| score_2 | Post-intervention score |
| age, sex, notes | Optional |

2.3 RCT with Pre-Post Measures (RCT_PRE_POST)

Each row is one participant in a two-group design with baseline and follow-up.

| Column | Meaning |
|---|---|
| participant_id | Unique participant identifier |
| group | treatment / control |
| pre_score | Baseline measurement |
| post_score | Follow-up measurement |
| age, sex, notes | Optional |

2.4 Crossover Trial (CROSSOVER)

Each row is one participant who received both treatments in a defined sequence.

| Column | Meaning |
|---|---|
| participant_id | Unique participant identifier |
| sequence | Allocation sequence: AB or BA |
| period_1_value | Outcome measured in period 1 |
| period_2_value | Outcome measured in period 2 |
| age, sex, notes | Optional |

2.5 Cross-Sectional Correlational (CROSS_SECTIONAL)

Each row is one participant with two measured variables.

| Column | Meaning |
|---|---|
| participant_id | Unique participant identifier |
| variable_1 | Predictor / independent variable |
| variable_2 | Outcome / dependent variable |
| age, sex, notes | Optional |

3. Data Preparation

The stack_participants() function in the R engine:

1. Stacks all participant rows from all submissions into a single data frame, adding study_id and label columns to identify which submission each row belongs to.

2. Coerces outcome columns to numeric (non-numeric values become NA).

3. Drops rows with missing outcome data (listwise deletion within each study).

4. Converts study_id, group, and sequence to factors for use in the mixed model.

For crossover designs, participant IDs are prefixed with study_id to prevent identity collisions when different studies happen to use the same ID scheme (e.g., "1", "2", "3").
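The preparation steps above can be sketched in a few lines. This is an illustrative Python sketch, not the platform's R implementation of stack_participants(): the input layout (a list of dicts with study_id, label, and rows keys), the prefix_ids flag, and the underscore separator used for crossover IDs are all assumptions made for the example.

```python
import math


def stack_participants(submissions, outcome_cols, prefix_ids=False):
    """Stack rows from all submissions, tagging each with study_id and label.

    `submissions` is a hypothetical structure: a list of dicts with keys
    "study_id", "label", and "rows" (dicts parsed from the uploaded CSV).
    """
    stacked = []
    for sub in submissions:
        for row in sub["rows"]:
            rec = dict(row, study_id=sub["study_id"], label=sub["label"])
            # Coerce outcome columns to float; unparseable values become None.
            for col in outcome_cols:
                try:
                    value = float(rec.get(col))
                except (TypeError, ValueError):
                    value = None
                if value is not None and math.isnan(value):
                    value = None
                rec[col] = value
            # Listwise deletion: drop the row if any outcome is missing.
            if any(rec[col] is None for col in outcome_cols):
                continue
            if prefix_ids:
                # Crossover designs: make participant IDs unique across studies.
                # The "_" separator is an assumption of this sketch.
                rec["participant_id"] = f'{sub["study_id"]}_{rec["participant_id"]}'
            stacked.append(rec)
    return stacked


subs = [
    {"study_id": "s1", "label": "Lab A", "rows": [
        {"participant_id": "1", "period_1_value": "5.1", "period_2_value": "6.0"},
        {"participant_id": "2", "period_1_value": "n/a", "period_2_value": "5.5"},
    ]},
    {"study_id": "s2", "label": "Lab B", "rows": [
        {"participant_id": "1", "period_1_value": "4.8", "period_2_value": "5.2"},
    ]},
]
rows = stack_participants(subs, ["period_1_value", "period_2_value"], prefix_ids=True)
print([r["participant_id"] for r in rows])
```

In this toy input both studies use the ID "1"; after prefixing they remain distinct, and the row with a non-numeric outcome is dropped.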


4. Model Fitting — Design-Specific IPD Mixed Models

All models are fitted using lme4::lmer() with REML = TRUE (Restricted Maximum Likelihood). p-values are obtained via the Satterthwaite approximation provided by lmerTest.

4.1 RCT — Two Independent Groups

Model:


outcome_value ~ group + (1 | study_id) + (0 + group | study_id)
  • group is relevelled so that the control group is the reference category. The platform identifies the control group by checking for labels containing "control", "0", "placebo", "con", or "ctrl" (case-insensitive). If none match, the first alphabetical level is used.
  • The fixed effect for the treatment group is the mean difference (treatment minus control) — this is the primary unstandardised effect.
  • (1 | study_id) captures between-study baseline variance (nuisance parameter — how much labs differ in their control-condition measurements).
  • (0 + group | study_id) captures between-study treatment effect variance — the meta-analytic τ². This is the quantity reported as heterogeneity.
  • The two terms are independent (zero correlation assumed). This allows τ² to be estimated stably at k ≥ 4 by eliminating the need to estimate an intercept-slope correlation, which requires k > 20 to identify reliably.
  • Falls back to (1 | study_id) intercept-only if k < 4 or if the slope term causes a convergence failure.
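The control-level rule in the first bullet can be illustrated as follows. This is a Python sketch under stated assumptions, not the platform's R code: it assumes "containing" means a case-insensitive substring match and that candidate levels are scanned in alphabetical order. The name pick_control_level is hypothetical.

```python
CONTROL_MARKERS = ("control", "0", "placebo", "con", "ctrl")


def pick_control_level(levels):
    """Return the level to use as the reference (control) category."""
    # Sort case-insensitively so the alphabetical fallback is well defined.
    ordered = sorted(levels, key=lambda level: str(level).lower())
    for level in ordered:
        name = str(level).lower()
        # Assumption: a marker anywhere inside the label counts as a match.
        if any(marker in name for marker in CONTROL_MARKERS):
            return level
    # No marker matched: fall back to the first alphabetical level.
    return ordered[0]


print(pick_control_level(["treatment", "Control"]))
print(pick_control_level([1, 0]))
print(pick_control_level(["exercise", "rest"]))
```

Under this substring assumption a label such as "10" would also match the "0" marker, so explicit treatment / control labels are the safer choice in uploaded CSVs.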

Primary output: Mean Difference (in original measurement units)

Secondary output: Hedges' g (see Section 5)

4.2 Pre-Post — Within-Subjects Change

Model:


change ~ 1 + (1 | study_id)

where: change = score_2 - score_1
  • The change score is computed for each participant.
  • The intercept is the mean change across all participants, accounting for study-level clustering.
  • This design tests whether the mean change is significantly different from zero.

Primary output: Mean Change (in original units)

Secondary output: Hedges' g on the d_pre scale (primary SMD); Cohen's d_z reported as a context scale (see §5.6)

4.3 RCT with Pre-Post — ANCOVA Model

Model:


post_score ~ pre_score + group + (1 | study_id) + (0 + group | study_id)
  • This is an ANCOVA (Analysis of Covariance) formulation: baseline (pre_score) is included as a covariate.
  • The treatment group coefficient represents the ANCOVA-adjusted group difference at post-test, controlling for each participant's baseline score.
  • This is at least as efficient as analysing change scores, and strictly more efficient unless the regression slope of post_score on pre_score is exactly 1, so the confidence interval is typically narrower for the same data. Under the standard assumption of parallel regression slopes across groups, the two approaches estimate the same quantity.
  • The independent random slope (0 + group | study_id) gives the meta-analytic τ² (between-study treatment effect variance). Falls back to intercept-only if k < 4 or convergence fails.

Primary output: Adjusted Mean Difference — ANCOVA (in original units)

Secondary output: Hedges' g (see Section 5)

4.4 Crossover — Senn's Mixed-Model Approach

Data reshaping: Each participant contributes two rows (one per period). The data is reshaped from wide to long format:

| Row | treatment | period | outcome |
|---|---|---|---|
| Participant 1 (AB), period 1 | A | 1 | period_1_value |
| Participant 1 (AB), period 2 | B | 2 | period_2_value |
| Participant 2 (BA), period 1 | B | 1 | period_1_value |
| Participant 2 (BA), period 2 | A | 2 | period_2_value |
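The wide-to-long reshaping shown in the table can be sketched as below. This is an illustrative Python version (the platform performs the reshape in R); the function name crossover_to_long and the dict-based row layout are assumptions of the example.

```python
def crossover_to_long(rows):
    """Turn one wide row per participant into two long rows (one per period)."""
    long_rows = []
    for row in rows:
        # The sequence string gives the treatment in each period: "AB" -> A then B.
        first, second = row["sequence"][0], row["sequence"][1]
        periods = (
            (1, first, "period_1_value"),
            (2, second, "period_2_value"),
        )
        for period, treatment, column in periods:
            long_rows.append({
                "participant_id": row["participant_id"],
                "treatment": treatment,
                "period": period,
                "outcome": row[column],
            })
    return long_rows


wide = [
    {"participant_id": "s1_1", "sequence": "AB", "period_1_value": 5.0, "period_2_value": 6.0},
    {"participant_id": "s1_2", "sequence": "BA", "period_1_value": 4.0, "period_2_value": 7.0},
]
long_rows = crossover_to_long(wide)
for r in long_rows:
    print(r)
```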

Model:


outcome ~ treatment + period + (1 | participant_within_study) + (0 + treatment | study_id)
  • treatment (A vs B) captures the treatment effect, with B as the reference (control).
  • period absorbs any period effect (a systematic difference between period 1 and period 2 that is unrelated to treatment).
  • (1 | participant_within_study) accounts for within-person correlation (each person measured twice).
  • (0 + treatment | study_id) captures the meta-analytic τ² — between-study variability in the treatment effect.
  • Falls back to (1 | study_id) if convergence fails.

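This follows Senn's (2002) recommended approach for crossover meta-analysis with individual data, separating the treatment effect from the period effect. Carry-over is not estimated separately: in a two-period, two-sequence design it is aliased with the sequence effect, so the model assumes it is negligible (for example, after an adequate washout).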

Primary output: Treatment Effect, A vs B (in original units)

Secondary output: Hedges' g (see Section 5)

4.5 Cross-Sectional — Rank Pipeline Primary with lmer Context

Primary effect: Kendall's τ via the rank pipeline (see §5.7). No lmer model is fitted for the primary effect.

Context lmer model (k ≥ 4 studies, final pooled call only):


variable_2 ~ variable_1 + (1 | study_id) + (0 + variable_1 | study_id)
  • (1 | study_id) captures between-study differences in the intercept (baseline level of variable_2).
  • (0 + variable_1 | study_id) captures between-study variability in the regression slope — the meta-analytic τ² for the slope.
  • The two terms are independent (zero correlation assumed), enabling stable estimation at k ≥ 4.

Falls back to intercept-only if k < 4 or convergence fails:


variable_2 ~ variable_1 + (1 | study_id)

The standardised β from this context model is reported as the secondary effect. The raw slope and Somers' D_xy are reported as additional context panels. None of these context-lmer outputs drive the stopping rule.

Primary output: Kendall's τ (pooled via rank pipeline)

Secondary output: Standardised β (β × sd_x / sd_y from context lmer model)

Context outputs: Raw regression slope, Somers' D_xy


5. Standardised Effect Sizes (Secondary Output)

Each study design uses a design-specific within-group or pre-test SD as the SMD denominator. The denominator is constructed so that a given SMD (say, 0.2) corresponds to the same underlying effect magnitude regardless of design, which is what makes SESOI calibration design-invariant. The Hedges J small-sample correction uses the degrees of freedom of the SD estimator (Hedges 1981), the same edf that is passed to eff_size (§5.2).

5.1 Denominators by design

| Design | Denominator | Source model | edf for J |
|---|---|---|---|
| RCT | sqrt(var.residual) of the main model (pooled within-group SD of the outcome) | Main: `outcome ~ group + (1 \| study_id) + (0 + group \| study_id)` | df.residual(main) |
| PRE_POST | sqrt(var.residual) of an auxiliary intercept-only model fit to pre-test scores | Auxiliary: `score_1 ~ 1 + (1 \| study_id)` | df.residual(aux) |
| RCT_PRE_POST | sqrt(var.residual) of an auxiliary intercept-only model fit to pre-test scores | Auxiliary: `pre_score ~ 1 + (1 \| study_id)` | df.residual(aux) |
| CROSSOVER | sqrt(var.intercept[uid] + var.residual) of the main model (between-participant SD within studies) | Main: `outcome ~ treatment + period + (1 \| uid) + (0 + treatment \| study_id)` | Satterthwaite composite (§5.3) |
| CROSS_SECTIONAL | Primary: Kendall's τ, no SMD denominator (rank statistic). Secondary: standardised β = β × sd_x / sd_y from context lmer model. Context: raw slope, Somers' D_xy | Primary: per-study τ with jackknife SE, pooled via metafor::rma. Secondary/context: `variable_2 ~ variable_1 + (1 \| study_id) + (0 + variable_1 \| study_id)` | n/a for τ; SE scales as se_β × sd_x / sd_y |

All four SMD denominators (CROSS_SECTIONAL has none) are within-source SDs free of the random-slope variance τ̂², which makes the SMD stable across study sets and aligned with the dominant within-group SD convention in published meta-analyses (Becker 1988; Morris & DeShon 2002).

5.2 SMD computation (factor-contrast designs)

For RCT, RCT_PRE_POST, and CROSSOVER:


SMD = eff_size(emmeans_contrast, sigma = sd_denom, edf = edf) × J

where:
  sd_denom    = design-specific SD from §5.1
  edf         = df of the SD estimator (also passed to eff_size as edf)
  J           = exp( lgamma(edf/2) - log(sqrt(edf/2)) - lgamma((edf-1)/2) )
                  (Calin-Jageman exact Hedges' J; numerically equivalent to
                   the asymptotic 1 - 3/(4*edf - 1) at large edf but more
                   accurate at small edf. Source: doi:10.31234/osf.io/s2597)

J is computed from edf (the df of the SD estimator). Hedges (1981) derives J as a correction for bias in the SD estimator; the df of the effect estimator and the df of the SD estimator can differ substantially at low k, so the correct quantity is the df of the SD estimator. The scalar SE × J propagation is exact when J is treated as a known constant (the standard convention; Hedges 1981; Calin-Jageman).
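To make the J formula concrete, the sketch below evaluates the exact expression next to the asymptotic approximation. It is written in Python for illustration only (the platform computes J in R); the function names are hypothetical and edf is assumed to be greater than 1.

```python
import math


def hedges_j_exact(edf):
    """Exact Hedges' J: gamma(edf/2) / (sqrt(edf/2) * gamma((edf-1)/2))."""
    return math.exp(
        math.lgamma(edf / 2)
        - 0.5 * math.log(edf / 2)   # same as log(sqrt(edf / 2))
        - math.lgamma((edf - 1) / 2)
    )


def hedges_j_approx(edf):
    """Asymptotic approximation 1 - 3 / (4 * edf - 1)."""
    return 1 - 3 / (4 * edf - 1)


for edf in (3, 10, 50, 500):
    exact, approx = hedges_j_exact(edf), hedges_j_approx(edf)
    print(f"edf={edf:>3}  exact={exact:.5f}  approx={approx:.5f}  diff={approx - exact:+.5f}")
```

The two versions agree closely at large edf and drift apart as edf shrinks, which is the reason for preferring the exact form at low k.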

For PRE_POST, the SMD is computed directly from the mean change and the pre-test SD:


SMD = (mean_change / sd_pre) × J
SE  ≈ sqrt( (1/sd_pre²) × Var(mean_change) + (mean_change² / sd_pre⁴) × Var(sd_pre) ) × J
Var(sd_pre) ≈ sd_pre² / (2 × edf_pre)  (asymptotic variance of the chi-distributed SD estimator)

The PRE_POST denominator is the within-study pre-test SD, not the change-score SD. The result is classical Hedges' g, compatible with RCT-style meta-analyses.
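The PRE_POST formulas above translate directly into code. The following Python sketch is illustrative rather than the platform's R implementation; the function name and argument names are assumptions, with var_mean_change standing for the squared SE of the change-model intercept.

```python
import math


def pre_post_smd(mean_change, var_mean_change, sd_pre, edf_pre):
    """Hedges' g on the d_pre scale with a delta-method SE.

    mean_change      pooled mean change (intercept of the change model)
    var_mean_change  sampling variance of that mean change
    sd_pre           within-study pre-test SD (auxiliary model)
    edf_pre          degrees of freedom of the pre-test SD estimator
    """
    # Exact Hedges' J evaluated at the df of the SD estimator.
    j = math.exp(
        math.lgamma(edf_pre / 2)
        - 0.5 * math.log(edf_pre / 2)
        - math.lgamma((edf_pre - 1) / 2)
    )
    # Asymptotic variance of the SD estimator.
    var_sd_pre = sd_pre ** 2 / (2 * edf_pre)
    smd = (mean_change / sd_pre) * j
    # Delta method: uncertainty in the numerator plus uncertainty in the denominator.
    se = math.sqrt(
        var_mean_change / sd_pre ** 2
        + (mean_change ** 2 / sd_pre ** 4) * var_sd_pre
    ) * j
    return smd, se


smd, se = pre_post_smd(mean_change=2.0, var_mean_change=0.25, sd_pre=4.0, edf_pre=50)
print(f"SMD = {smd:.3f}, SE = {se:.3f}")
```

The second term under the square root grows with the size of the mean change, so large effects carry extra uncertainty from the estimated denominator.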

For CROSS_SECTIONAL, the primary effect is Kendall's τ — a rank-based statistic, not an SMD. Per-study τ_i is computed via stats::cor(x, y, method = "kendall") with a leave-one-out jackknife SE, then pooled across studies via metafor::rma(method = "REML", test = "knha"). No Fisher z transformation is applied. Hedges' J is not applicable. The three lmer-based context scales (raw slope, standardised β, Somers' D_xy) are computed on the final pooled call only and reported as context panels; they do not drive the stopping rule.

5.3 Composite Satterthwaite df for CROSSOVER

The CROSSOVER denominator combines two variance components with different df:


composite_edf = (var_p + var_r)² / ( var_p² / df_p + var_r² / df_r )

where:
  var_p  = participant random-intercept variance (grouped on uid)
  var_r  = residual variance
  df_p   = max(1, n_participants - n_fixed_effects)   (rough approximation)
  df_r   = df.residual(main_model)

This Satterthwaite-style composite is passed to eff_size as edf and used in J. The df_p approximation is rough; a Kenward-Roger-style derivation would be more accurate at small n_participants and is flagged as future work. Because composite_edf depends on estimated variance components, the scalar SE × J propagation is a mild approximation for CROSSOVER specifically (additional variance from edf uncertainty is small at typical CROSSOVER df values where dJ/d(edf) is near zero).
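A small numerical sketch of the composite df may help. It is written in Python for illustration (the platform computes this in R), and the function name composite_edf and the example inputs are assumptions.

```python
def composite_edf(var_p, var_r, df_p, df_r):
    """Satterthwaite-style df for the sum of two variance components."""
    return (var_p + var_r) ** 2 / (var_p ** 2 / df_p + var_r ** 2 / df_r)


# Equal variances and equal df: the composite df is the sum of the two df.
print(composite_edf(var_p=1.0, var_r=1.0, df_p=10, df_r=10))

# One component dominates: the composite df moves toward that component's df.
print(composite_edf(var_p=9.0, var_r=1.0, df_p=12, df_r=40))
```

The composite always lies between the smaller of the two df and their sum, so a poorly estimated participant variance pulls edf down even when the residual df is large.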

5.4 Confidence intervals for SMD


SMD ± t(edf, 0.975) × SE_SMD

CIs use the t-distribution with the SD-estimator df (the edf above), not the z-distribution. For factor-contrast designs the CI bounds come directly from eff_size's output (scaled by J). For PRE_POST the bounds are constructed from the delta-method SE_SMD and qt(0.975, edf_pre).

5.5 Design-specific SMD labels

| Design | SMD label | Interpretation |
|---|---|---|
| RCT | Hedges' g | Treatment vs control in pooled within-group SD units |
| PRE_POST | Hedges' g (d_pre), primary; Cohen's d_z reported as context (see §5.6) | Mean change in within-study pre-test SD units |
| RCT_PRE_POST | Hedges' g | ANCOVA-adjusted between-group difference in pre-test SD units |
| CROSSOVER | Hedges' g | Treatment effect in between-participant within-study SD units |
| CROSS_SECTIONAL | Kendall's τ, primary; standardised β, secondary; raw slope and Somers' D_xy, context (see §5.7, §6) | Pooled Kendall's τ from per-study jackknife + REML rma |

5.6 PRE_POST dual-scale reporting

The PRE_POST primary SMD is Hedges' g on the d_pre scale.

Why d_pre. d_pre is RCT-comparable: it standardises the mean change by the within-study pre-test SD, placing PRE_POST on the same denominator as RCT Hedges' g. This means SESOI calibration (0.2 / 0.5 / 0.8 small / medium / large) is consistent across designs and the pooled SMD can be compared directly across design types. The d_z scheme (change-score SD denominator) inflates PRE_POST SMDs relative to RCT SMDs by 1/sqrt(2(1-r)) — at r = 0.8 this is ~1.6×, breaking cross-design SESOI comparisons.

Cohen's d_z as context. Cohen's d_z (change-score SD denominator) is computed alongside d_pre at every analysis run and reported as a "Single-study interpretation" context panel on the results page and in manuscript.qmd. This allows readers from sport, exercise, or clinical change-score literature — where d_z is the dominant convention — to translate the result. Both values come from the same fitted change model; only the SD denominator differs.

Empirical pre-post correlation diagnostic. With both SMDs available, the platform recovers the empirical pre-post correlation as r ≈ 1 - (d_pre / d_z)² / 2 (Morris & DeShon 2002). This approximates test-retest reliability in a no-intervention setting but absorbs treatment-induced covariance changes in an active intervention, so it is the empirical pre-post correlation rather than a strict reliability coefficient. It is a diagnostic only — not consumed by the stopping rule.

Relationship between scales. d_z = d_pre / sqrt(2(1 - r)). At r = 0.8, d_z ≈ 1.6 × d_pre; at r = 0.9, d_z ≈ 2.2 × d_pre. An SESOI of 0.20 in d_pre units corresponds to roughly 0.32 at r = 0.8 and 0.45 at r = 0.9 in d_z units.
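The conversions between the two scales are simple enough to check numerically. This Python sketch is illustrative only; the function names are hypothetical and r is assumed to be strictly below 1.

```python
import math


def dz_from_dpre(d_pre, r):
    """Convert d_pre to d_z given the pre-post correlation r."""
    return d_pre / math.sqrt(2 * (1 - r))


def r_from_smds(d_pre, d_z):
    """Recover the empirical pre-post correlation from the two SMDs."""
    return 1 - (d_pre / d_z) ** 2 / 2


for r in (0.5, 0.8, 0.9):
    d_z = dz_from_dpre(0.20, r)
    print(f"r={r}: SESOI 0.20 (d_pre) = {d_z:.2f} (d_z); recovered r = {r_from_smds(0.20, d_z):.2f}")
```

At r = 0.5 the two scales coincide, which is why the inflation only matters for outcomes with high pre-post correlation.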

5.7 CROSS_SECTIONAL primary effect and context scales

The CROSS_SECTIONAL primary effect is Kendall's τ via the rank pipeline.

Why Kendall's τ.

  • Robust to outliers and non-Gaussian outcomes — no linearity assumption.
  • Symmetric and bounded: τ(x, y) = τ(y, x); τ = 0 when x and y are independent, and +1 / −1 at perfect concordance / discordance.
  • Round-number SESOI ladder maps onto pedagogically useful benchmarks: 0.10 small · 0.20 medium · 0.30 large · 0.40 very large. Default SESOI at study creation is 0.20.
  • Avoids the τ̂-dependence of standardised β, which can make the headline effect drift as studies accumulate when between-study heterogeneity is unstable at low k.

Computation (rank pipeline).


Per-study (for each study i):
  yi = stats::cor(x_i, y_i, method = "kendall")    # Kendall's τ_i on IPD
  vi = jackknife variance                           # ((n-1)/n) Σ (τ_{-j} - τ̄_{.})²

Aggregate (across studies):
  metafor::rma(yi, vi, method = "REML", test = "knha")

The jackknife SE handles ties and small-sample effects without requiring the independence assumption of closed-form asymptotic variance formulas. Pooling is on the raw scale — no Fisher z transformation (Fisher z is designed for Pearson r and is not variance-stabilising for τ). HKSJ adjustment (test = "knha") is mandatory at small k. Per-study τ_i values are cached within the sequential trajectory, keeping the walk linear in k.
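The per-study step can be sketched without any statistics library. The Python code below is an illustration, not the platform's R code: it implements the tau-b form of Kendall's τ (which coincides with tau-a when there are no ties) and the leave-one-out jackknife variance given above. The pooling step via metafor::rma is not reproduced, and the function names are hypothetical.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b from concordant/discordant pair counts (O(n^2))."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue            # tied on both variables
            elif dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied_x = concordant + discordant + tied_y_only   # pairs not tied on x
    untied_y = concordant + discordant + tied_x_only   # pairs not tied on y
    denom = math.sqrt(untied_x * untied_y)
    return (concordant - discordant) / denom if denom > 0 else float("nan")


def jackknife_tau(x, y):
    """Return (tau, jackknife variance) for one study."""
    n = len(x)
    tau = kendall_tau_b(x, y)
    # Leave each participant out in turn and recompute tau.
    loo = [kendall_tau_b(x[:j] + x[j + 1:], y[:j] + y[j + 1:]) for j in range(n)]
    mean_loo = sum(loo) / n
    variance = (n - 1) / n * sum((t - mean_loo) ** 2 for t in loo)
    return tau, variance


tau, var = jackknife_tau([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(f"tau = {tau:.3f}, jackknife SE = {math.sqrt(var):.3f}")
```

The jackknife recomputes τ once per participant, so this naive version is cubic in study size; that is acceptable for a sketch but would call for a faster τ routine on large studies.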

Context scales (final pooled call only).

On the final pooled analysis, three additional scales are computed and reported as context panels on the results page and in manuscript.qmd:

| Scale | Pipeline | Purpose |
|---|---|---|
| Raw regression slope | lmer IPD | Natural-units interpretation; direct change per unit-x |
| Standardised β (β × sd_x / sd_y) | lmer + avg_slopes | Dimensionless, r-like; Cohen's r benchmarks (0.1 / 0.3 / 0.5) |
| Somers' D_xy | Rank (concordant/discordant) | Asymmetric: x predicts y; for binary y, D_xy = 2·AUC − 1 |

These context scales do not drive the stopping rule or SESOI evaluation. The lmer model is variable_2 ~ variable_1 + (1|study_id) + (0+variable_1|study_id) with REML.

Sequential stopping rule. The stopping rule consumes the Kendall's τ pipeline's output: pooled τ, pooled SE(τ), between-study τ̂², mean within-study sampling variance, and current k. T1 detection: SE(τ) < SESOI ÷ 2 AND pooled τ significantly different from zero (two-sided Wald test, |z| > 1.96 at the default α = 0.05). T2 and T3 apply the CP and precision-futility triggers identically to other designs.

5.8 Interpretation benchmarks (Cohen, 1988)

| SMD | Interpretation |
|---|---|
| < 0.2 | Negligible |
| 0.2 – 0.49 | Small |
| 0.5 – 0.79 | Medium |
| 0.8 – 1.19 | Large |
| >= 1.2 | Very large |

Important caveat: These benchmarks are rough population-level defaults. A "small" effect on a hard-to-move physiological variable may be highly practically significant, while a "large" effect on a trivial outcome may not be. Always interpret magnitude in relation to the specific outcome and population.


6. Cross-Sectional Association Metrics

The primary CROSS_SECTIONAL effect is Kendall's τ (see §5.7). The secondary effect is standardised β. Two further scales are reported as context panels.

6.1 Standardised β (secondary)

β × sd_x / sd_y, where β is the fixed-effect slope from the context lmer model. This rescales the slope so that both variables are in standard-deviation units, yielding a dimensionless quantity comparable to a Pearson r in magnitude. SE and CI scale by the same sd_x / sd_y factor — equivalent to marginaleffects::avg_slopes() for a linear model but computed directly to avoid the slow numerical Jacobian.

Interpretation benchmarks (Cohen, 1988): 0.10 small · 0.30 medium · 0.50 large.

6.2 Raw Regression Slope (context)

The unstandardised lmer coefficient: for each one-unit increase in the predictor, the expected change in the outcome in the original measurement units. Cannot be compared across studies using different scales.

6.3 Somers' D_xy (context)

Asymmetric rank correlation: D_xy = (concordant − discordant) / (pairs untied on x). For binary y, D_xy = 2·AUC − 1. Implemented directly via the concordant/discordant decomposition in r-engine/meta_analysis.R (no Hmisc dependency). Same jackknife + rma pipeline as Kendall's τ.

Interpretation ladder: 0.05 negligible · 0.10 small · 0.20 medium · 0.30 large · 0.40 very large.


7. Variance Components

The mixed model decomposes total variance into two levels:

7.1 τ² (tau-squared) — Between-Study Variance

The variance of the true study-level effects around the pooled mean. A larger τ² means the studies' underlying effects genuinely differ. This is the random-slope variance from the (0 + treatment | study_id) term in the mixed model — the between-study variability in the treatment effect specifically, not in baseline levels. The (1 | study_id) intercept term captures between-study baseline variance and is a nuisance parameter that is not reported as τ².

7.2 σ² (sigma-squared) — Residual (Within-Study) Variance

The variance of individual participant outcomes around their study-level mean, after accounting for the fixed effects. This captures individual-level noise.

7.3 I² — Proportion of Heterogeneity

I² is adapted for the IPD mixed-model context. Since σ² is individual-level variance while τ² is study-level, a naive τ²/(τ² + σ²) would vastly underestimate I² because σ² includes noise from every participant.

The correct formula scales σ² to the study level using the harmonic mean of per-study sample sizes:


n_typical = k / Σ(1/n_i)                     (harmonic mean of study sizes)
typical_sampling_var = σ² / n_typical         (scaled to study level)
I² = τ² / (τ² + typical_sampling_var)

This yields an I² comparable to what would be obtained from a traditional summary-statistics random-effects model.
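The three-line formula above can be checked numerically. This Python sketch is illustrative (the platform computes I² in R); the function names and the example variance components are assumptions.

```python
def harmonic_mean(study_sizes):
    """n_typical = k / sum(1 / n_i)."""
    return len(study_sizes) / sum(1 / n for n in study_sizes)


def i_squared(tau2, sigma2, study_sizes):
    """I² with the residual variance scaled to the study level."""
    typical_sampling_var = sigma2 / harmonic_mean(study_sizes)
    return tau2 / (tau2 + typical_sampling_var)


tau2, sigma2, sizes = 1.0, 10.0, [10, 10, 10]
naive = tau2 / (tau2 + sigma2)            # participant-level ratio, too small
scaled = i_squared(tau2, sigma2, sizes)   # study-level ratio
print(f"naive = {naive:.1%}, scaled I² = {scaled:.1%}")
```

The harmonic mean is pulled toward the smaller studies (for sizes 5 and 20 it is 8, not 12.5), so small studies raise the typical sampling variance and lower I² accordingly.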

Interpretation (Higgins et al., 2003):

| I² | Interpretation |
|---|---|
| < 25% | Low heterogeneity: studies are fairly consistent |
| 25% – 49% | Moderate heterogeneity: some variability |
| 50% – 74% | Substantial heterogeneity: studies differ meaningfully |
| >= 75% | High heterogeneity: interpret the pooled estimate with caution |

8. Per-Study Estimates

For the forest plot and leave-one-out analyses, the platform computes per-study effect estimates using simple within-study calculations (not BLUPs from the mixed model):

| Design | Per-study estimate | SE |
|---|---|---|
| RCT | mean(treatment) - mean(control) | Pooled independent-samples SE |
| PRE_POST | mean(change scores) | SD(change) / sqrt(n) |
| RCT_PRE_POST | mean(Δtreatment) - mean(Δcontrol) | Pooled independent-samples SE on change scores |
| CROSSOVER | (mean(diff_AB) - mean(diff_BA)) / 2 | Pooled SE / 2 |
| CROSS_SECTIONAL | Per-study Kendall's τ | Leave-one-out jackknife SE |

Weights are computed as w_i = 1/SE_i² and normalised to sum to 1, so the frontend can display them as percentages.
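As a quick illustration of the weighting rule, the Python sketch below normalises inverse-variance weights so they sum to 1. The function name and example SEs are assumptions; the platform does this in R.

```python
def normalised_weights(standard_errors):
    """Inverse-variance weights w_i = 1 / SE_i², rescaled to sum to 1."""
    raw = [1 / se ** 2 for se in standard_errors]
    total = sum(raw)
    return [w / total for w in raw]


weights = normalised_weights([1.0, 2.0])
print([f"{w:.0%}" for w in weights])
```

Halving a study's SE quadruples its raw weight, so precise studies dominate the displayed percentages.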

These per-study estimates are then used to:

  • Construct the forest plot (via metafor::forest())
  • Run leave-one-out sensitivity analysis (via metafor::leave1out())
  • Compute the funnel plot (via metafor::funnel())
  • Run publication bias tests

9. Diagnostic Plots

The platform generates six types of diagnostic output: four from the IPD mixed model (§9.1 to §9.4) and two from the per-study summary statistics (§9.5 and §9.6):

9.1 Q-Q Plot of Residuals

Plots the quantiles of the level-1 (participant) residuals against theoretical normal quantiles. Points should lie close to the diagonal line. Systematic deviations indicate non-normality of residuals, which can affect CI coverage and p-values.

9.2 Residuals vs Fitted Values

Plots residuals against predicted values. Should show a random scatter around zero. Patterns (curves, fanning) indicate model misspecification — e.g., a non-linear relationship, heteroscedasticity, or omitted variables.

A LOWESS smoother (red line) is overlaid to highlight any trends.

9.3 Q-Q Plot of Random Effects

Plots the study-level treatment slope random effects (the (0 + treatment | study_id) term — the meta-analytic τ component) against normal quantiles. Checks the assumption that true treatment effects are normally distributed across studies. With few studies (k < 10), this plot is noisy and should be interpreted cautiously.

9.4 Scale-Location Plot

Plots sqrt(|standardised residuals|) against fitted values. Checks for homoscedasticity (equal variance). If the red LOWESS line is roughly flat, variance is approximately constant. An upward or downward slope suggests variance increases or decreases with the predicted value.

9.5 Baujat Plot

Generated from the per-study summary statistics using metafor::baujat(). Each point is a study, with:

  • x-axis: contribution to overall Q (heterogeneity)
  • y-axis: influence on the pooled result

Studies in the top-right corner are both heterogeneous and influential — they are the strongest candidates for investigation.

9.6 Leave-One-Out Analysis

Each study is removed in turn, and the pooled effect is re-estimated from the remaining studies. If the result changes substantially when a single study is dropped, that study is highly influential and should be examined for data quality issues. Both a table and a forest-style plot are provided.


10. Sequential Analysis and Stopping Rules

10.1 Cumulative Re-Analysis

The sequential analysis re-fits the full design-specific IPD mixed model after each study is added, in chronological submission order. Starting from the first 2 submissions, each iteration adds one more.

For each cumulative model at study k, the platform records:

| Field | Meaning |
|---|---|
| k | Number of studies included |
| effect | Pooled unstandardised effect at that point |
| se | Standard error of the unstandardised effect |
| ci_lower, ci_upper | 95% confidence interval |
| smd, smd_se | Standardised effect and its SE |
| tau2, sigma2 | Variance components at that point |
| stability.effectChange | Absolute change in SMD (\|ΔSMD\|) from previous step |
| stability.tau2Change | Relative change in τ² (log-ratio) from previous step |
| stability.sigma2Change | Relative change in σ² (log-ratio) from previous step |

10.2 Sequential Stopping Rule — Three-Trigger Framework with 2-Consecutive Confirmation

When does a study close?

Meta-Student uses a sequential design: each new study submission is analysed as it arrives. The study closes when any one of three triggers fires on two consecutive completed studies (strict 2-consecutive same-trigger rule). All three triggers are gated by an initial burn-in of k ≥ 4 studies.

| Trigger | Condition | Action on confirmed fire |
|---|---|---|
| T1: Detection | SE(standardised effect) < SESOI ÷ 2 AND \|z\| > z<sub>α/2</sub> | Close as positive |
| T2: Conditional-power futility | Projected CP at k<sub>max</sub> below the registered CP threshold | Close as futile (CP) |
| T3: Precision futility | Projected k required to satisfy SE < SESOI/2 exceeds the registered k cap | Close as futile (precision) |

If none fire on consecutive looks, the study continues to the next k. Because the rule requires two consecutive matching fires, the practical minimum k for any close is 5.


Trigger 1 — Detection (T1). The standard error of the pooled standardised effect must be below SESOI ÷ 2 AND the pooled effect must be significantly different from zero at the two-sided alpha (default 0.05). The standardised effect is design-dependent:

  • Experimental designs (RCT, PRE_POST, RCT_PRE_POST, CROSSOVER): the standardised effect is the SMD (Hedges' g). Default SESOI = 0.20.
  • Correlational designs (CROSS_SECTIONAL): the standardised effect is Kendall's τ. Default SESOI = 0.20 (τ scale).
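The T1 condition combines a precision check with a two-sided significance check. The Python sketch below illustrates that logic only; the function name t1_fires is hypothetical, and the platform evaluates the trigger in its own codebase.

```python
from statistics import NormalDist


def t1_fires(effect, se, sesoi=0.20, alpha=0.05):
    """True when the pooled standardised effect is both precise and significant."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 at alpha = 0.05
    precise_enough = se < sesoi / 2
    significant = abs(effect / se) > z_crit
    return precise_enough and significant


print(t1_fires(effect=0.30, se=0.08))   # precise and significant
print(t1_fires(effect=0.30, se=0.12))   # significant but SE >= SESOI / 2
print(t1_fires(effect=0.10, se=0.08))   # precise but |z| = 1.25
```

Both conditions must hold at the same look; a large but imprecise effect, or a precise but non-significant one, does not count as detection.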

Trigger 2 — Conditional-power futility (T2). Conditional power at the current observed estimate, projected forward to k<sub>max</sub>, answers: "if we kept going, what is the probability we ever cross significance?" When this probability falls below the CP threshold (default 0.20), T2 fires. The CP projection assumes future studies contribute equal information (the standard Lan–Wittes simplifying assumption); this slightly overestimates CP under non-trivial heterogeneity.
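One common way to compute conditional power under the equal-information assumption is the "current trend" form, sketched below in Python. This is an assumption about the general technique, not a transcription of the platform's code: it takes the information fraction as k / k_max, projects the observed z-statistic forward along its current trend, and evaluates power in the observed direction only. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def conditional_power(z_current, k_current, k_max, alpha=0.05):
    """Current-trend conditional power at k_max, assuming equal information per study."""
    normal = NormalDist()
    z_crit = normal.inv_cdf(1 - alpha / 2)
    z = abs(z_current)
    t = k_current / k_max                 # information fraction
    if t >= 1:
        # No further studies to add: the outcome is already determined.
        return 1.0 if z > z_crit else 0.0
    # Projected final z under the current trend is z / sqrt(t);
    # the remaining increment has variance (1 - t).
    return 1 - normal.cdf((z_crit - z / math.sqrt(t)) / math.sqrt(1 - t))


for z in (0.0, 1.0, 2.0, 3.0):
    cp = conditional_power(z, k_current=15, k_max=30)
    print(f"z = {z:.1f}: CP = {cp:.3f}, T2 fires = {cp < 0.20}")
```

With kMaxForCp = 30 and a threshold of 0.20, a pooled z near zero halfway through the trajectory gives a conditional power well below the threshold, so T2 would fire at that look.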


Trigger 3 — Precision futility (T3). Projects the k required to satisfy the precision target from the current observed SE trajectory. Under the SE-scales-as-1/√k approximation:


k_required = ceiling( k_current × (SE_current / (SESOI / 2))^2 )

If k_required exceeds the registered k cap (default 50), T3 fires. T3 addresses the high-heterogeneity case: when τ̂ is large relative to SESOI/2, satisfying the precision criterion would require impractically many studies. T3 caps this explicitly without coupling the decision rule to heterogeneity directly.
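A worked example of the T3 projection, as a Python sketch (function names are hypothetical; defaults follow the values stated in this section):

```python
import math


def k_required(k_current, se_current, sesoi=0.20):
    """Studies needed for SE < SESOI / 2 if SE shrinks as 1 / sqrt(k)."""
    return math.ceil(k_current * (se_current / (sesoi / 2)) ** 2)


def t3_fires(k_current, se_current, sesoi=0.20, k_cap=50):
    """True when the projected k exceeds the registered cap."""
    return k_required(k_current, se_current, sesoi) > k_cap


# SE is twice the target at k = 10, so four times as many studies are needed.
print(k_required(10, 0.20), t3_fires(10, 0.20))
# SE is 2.5 times the target at k = 10: the projection exceeds the cap of 50.
print(k_required(10, 0.25), t3_fires(10, 0.25))
```

Because the projection is quadratic in the SE ratio, modest shortfalls in precision translate into large increases in the required number of studies.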


Strict 2-consecutive confirmation. A trigger only closes a study when the same trigger fires at look k and look k+1. Triggers do not cross-confirm one another: a study flickering between T1 at k and T2 at k+1 keeps running until two consecutive looks agree. This mitigates ordering effects, where a single look at k = 4 might fire on noise.
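The confirmation logic can be sketched as a single pass over the looks. This Python sketch rests on two assumptions that the text does not settle: each look reports at most one trigger label (or None), and the reported stopping k is the second of the two agreeing looks. The function name stopping_verdict is hypothetical.

```python
def stopping_verdict(looks, burn_in=4):
    """Return (stopReason, stoppedAtK) under the strict same-trigger rule.

    `looks` is a list of (k, trigger) pairs in chronological order, where
    trigger is "detection", "cp_futility", "precision_futility", or None.
    """
    previous = None
    for k, trigger in looks:
        if k < burn_in:
            previous = None          # triggers are ignored during burn-in
            continue
        if trigger is not None and trigger == previous:
            return trigger, k        # same trigger at two consecutive looks
        previous = trigger
    return None, None


flicker = [(4, "detection"), (5, "cp_futility"), (6, "cp_futility")]
print(stopping_verdict(flicker))

earliest = [(2, "detection"), (3, "detection"), (4, "detection"), (5, "detection")]
print(stopping_verdict(earliest))
```

In the first example the disagreement between k = 4 and k = 5 keeps the study open, and it closes only once cp_futility repeats. The second shows why the earliest possible close is at k = 5.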


Design rationale.

1. Conditional power gives a principled futility rule. CP at the current observed estimate is a standard tool in group-sequential trial design (Lan & Wittes 1988; Whitehead 1997). A threshold of 0.10 to 0.20 is well established in that literature.

2. Precision futility addresses the high-heterogeneity case. Heterogeneity influences the decision only through its effect on SE. T3 caps the projection so a study cannot drag on indefinitely chasing precision that the data-generating process will not yield.

3. Two consecutive confirmations mitigate ordering effects. A single look at k = 4 has a meaningful chance of firing on noise. Requiring agreement at consecutive looks reduces this sensitivity. Strict same-trigger matching prevents pathological close-outs where consecutive looks disagree about the exit state.


Combined verdict reported by the platform:


stopReason ∈ { "detection", "cp_futility", "precision_futility", null }
stoppedAtK = first k at which the same trigger fired at two consecutive looks

Important: A futile close (T2 or T3) is not a positive null finding. It indicates that continued accumulation under the current trajectory is unlikely to change the verdict, given the study's registered SESOI and feasibility constraints. Always interpret results using the full confidence interval and prediction interval displayed on the results page.

Parameter configuration: Each registered study has four stopping-rule parameters set at creation and locked thereafter: sesoi (default 0.20 for all designs), cpThreshold (default 0.20), kCap (default 50), kMaxForCp (default 30). Researchers planning a study should set SESOI to the minimum meaningful effect in their field on the appropriate scale. The other three default values follow the group-sequential literature and are appropriate for most studies; adjust only with a registered rationale.

10.3 Limitations

The framework's defaults are theoretically grounded but not yet verified by a full simulation study. A power and operating-characteristics study (Type I error at δ = 0, incorrect-futility rate at δ ≥ SESOI, expected stopping-k distributions, and τ sensitivity) is planned before the framework is used in confirmatory analyses. The CP projection uses the Lan–Wittes simplifying assumption that future studies contribute equal information; a τ̂-aware CP projection accounting directly for between-study heterogeneity is on the development roadmap.


11. Publication Bias Assessment

Publication bias analyses use the per-study summary statistics (Section 8) fed into metafor::rma(), not the IPD mixed model directly. This is because metafor's bias diagnostics are designed for study-level data.

11.1 Funnel Plot

A scatter plot of each study's effect size (x-axis) against its standard error (y-axis, inverted). Under no publication bias, points should scatter symmetrically around the pooled effect in an inverted funnel shape. Asymmetry suggests that small studies with effects in one direction may be missing.

Requires >= 3 studies. Most informative with >= 10 studies.

11.2 Egger's Regression Test

Formal test for funnel asymmetry. Regresses standardised effects on precision:


(effect_i / SE_i) = a + b × (1 / SE_i) + error_i

A significant non-zero intercept (a) indicates asymmetry.

| Output | Meaning |
| --- | --- |
| z | Test statistic |
| pValue | Two-tailed p-value for the intercept |
| bias | Estimated intercept; a positive value means small studies show larger effects |

Caution: Low power with few studies (k < 10). A non-significant result does not rule out publication bias.
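The regression in this section can be sketched in plain Python as an ordinary least-squares fit. This is a minimal illustration of the classical Egger formulation written out above, not the platform's R code, and its numbers need not match metafor's output exactly. The function name `egger_intercept` and the example data are hypothetical, and the p-value uses a normal approximation because the standard library has no t-distribution.

```python
import math
from statistics import NormalDist
from typing import Sequence, Tuple


def egger_intercept(
    effects: Sequence[float], ses: Sequence[float]
) -> Tuple[float, float, float]:
    """OLS fit of effect_i / SE_i on 1 / SE_i.

    Returns (intercept a, slope b, standard error of a).
    """
    if len(effects) != len(ses) or len(effects) < 3:
        raise ValueError("need matching inputs from at least 3 studies")
    y = [e / s for e, s in zip(effects, ses)]  # standardised effects
    x = [1.0 / s for s in ses]                 # precision
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all standard errors are identical")
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    sigma2 = rss / (n - 2)                     # residual variance
    se_a = math.sqrt(sigma2 * (1.0 / n + x_bar ** 2 / sxx))
    return a, b, se_a


# Hypothetical data in which smaller studies (larger SE) report larger effects.
effects = [0.62, 0.48, 0.35, 0.30, 0.28]
ses = [0.30, 0.22, 0.15, 0.10, 0.08]
a, b, se_a = egger_intercept(effects, ses)
stat = a / se_a
p_value = 2 * (1 - NormalDist().cdf(abs(stat)))  # normal approximation
print(f"intercept={a:.3f}, statistic={stat:.2f}, p={p_value:.3f}")
```

With only five studies the test has little power, so a sketch like this is useful for understanding the calculation rather than for drawing conclusions.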

11.3 Trim-and-Fill

Non-parametric method (Duval & Tweedie, 2000) that detects and corrects funnel asymmetry by:

1. Identifying extreme studies on one side

2. Temporarily removing them

3. Re-estimating the centre

4. Imputing mirror-image studies

5. Recomputing the pooled effect

| Output | Meaning |
| --- | --- |
| nImputed | Studies imputed to restore symmetry |
| adjustedEffectSize | Corrected pooled effect |
| adjustedCiLower, adjustedCiUpper | Corrected CI |

Uses the R0 estimator. If nImputed = 0, the method detected no asymmetry to correct, and the adjusted estimate equals the unadjusted one. Caution: Can over-correct when heterogeneity is high.

11.4 Fail-Safe N (Rosenthal)

The number of unpublished null-result studies that would need to exist to reduce the pooled effect to non-significance.


Conventionally robust if FSN > 5k + 10

A large FSN suggests the result is resilient to publication bias. A small FSN (< 10) means only a few null studies could invalidate the finding.
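Rosenthal's (1979) calculation is short enough to write out. The sketch below is an illustrative Python version, not the platform's R code: the function name `rosenthal_fsn` is hypothetical, it assumes a one-tailed α of 0.05, and it returns the unrounded value (whether the platform rounds is not stated here).

```python
from statistics import NormalDist
from typing import Sequence, Tuple


def rosenthal_fsn(
    effects: Sequence[float], ses: Sequence[float], alpha: float = 0.05
) -> Tuple[float, bool]:
    """Return (fail-safe N, meets the 5k + 10 convention).

    FSN = (sum of z_i)^2 / z_alpha^2 - k, where z_i = effect_i / SE_i and
    z_alpha is the one-tailed critical value (about 1.645 for alpha = 0.05).
    """
    k = len(effects)
    if k == 0 or k != len(ses):
        raise ValueError("effects and ses must be non-empty and equal in length")
    z_sum = sum(e / s for e, s in zip(effects, ses))
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    fsn = max(0.0, z_sum ** 2 / z_alpha ** 2 - k)  # floor at zero
    return fsn, fsn > 5 * k + 10


# Hypothetical example: four studies, each with z = 2.
fsn, robust = rosenthal_fsn([0.4, 0.4, 0.4, 0.4], [0.2, 0.2, 0.2, 0.2])
print(f"FSN = {fsn:.1f}, robust by 5k + 10 rule: {robust}")
```

In this example the threshold is 5 × 4 + 10 = 30, so an FSN of roughly 20 would not count as conventionally robust.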


12. Outlier and Influence Detection

Using the per-study summary statistics and metafor::influence(), a study is flagged as a potential outlier if either criterion is met:

| Criterion | Threshold | Meaning |
| --- | --- | --- |
| Cook's distance | > 1 | Disproportionate influence on the pooled estimate |
| Studentised residual | abs(rstudent) > 2 | Effect size more than 2 SDs from the model prediction, in either direction |

Requires >= 3 studies. Flagging is a signal for manual review, not automatic exclusion. Flagged submission IDs are stored in outlierIds and highlighted on both the admin panel and results page.
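The flagging rule reduces to a simple either-or check once the diagnostics are available. The Python sketch below is illustrative only: the function name `flag_outliers` and the input shape are hypothetical, the diagnostic values are taken as given (in the platform they come from metafor::influence()), and the residual threshold is assumed to apply to the absolute value.

```python
from typing import Dict, List, Tuple

COOK_THRESHOLD = 1.0       # Cook's distance cut-off from the table above
RSTUDENT_THRESHOLD = 2.0   # studentised residual cut-off from the table above


def flag_outliers(diagnostics: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return submission IDs flagged for manual review.

    diagnostics maps submission ID -> (Cook's distance, studentised residual).
    Assumption: the residual criterion is applied to the absolute value.
    """
    if len(diagnostics) < 3:
        return []  # fewer than 3 studies: nothing is flagged
    return [
        sid
        for sid, (cook, rstudent) in diagnostics.items()
        if cook > COOK_THRESHOLD or abs(rstudent) > RSTUDENT_THRESHOLD
    ]


# Hypothetical diagnostics for three submissions.
example = {"s1": (0.10, 0.5), "s2": (1.40, 1.0), "s3": (0.20, -2.6)}
print(flag_outliers(example))  # ['s2', 's3']
```

The returned list corresponds to what the text calls outlierIds; nothing in this check removes a study from the analysis.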


13. Confidence Interval Construction

All confidence intervals for primary and secondary effects use the t-distribution with Satterthwaite degrees of freedom from lmerTest, not the z-distribution. This provides more accurate coverage with small samples and few studies.

The Satterthwaite approximation estimates the effective degrees of freedom for each fixed-effect coefficient by accounting for the variance-component structure of the model. When the approximation is unavailable (rare), the engine falls back to the z-distribution (qnorm(0.975)).

Per-study estimates and the pooled publication-bias analyses use normal-theory CIs (estimate ± 1.96 × SE), consistent with standard meta-analytic practice via metafor. The exception is per-study rank estimates (CROSS_SECTIONAL), which use qt(0.975, n - 1) to account for small study sample sizes.


14. Completion Report Package

When a study is marked Completed, the platform automatically generates a downloadable ZIP archive containing the full reproducible analysis pipeline and a near-complete academic manuscript template. The intent is that nothing stored in the database is required for a reader to independently reproduce the analysis or write up the paper.

14.1 Archive contents


README.md                    Human-readable summary and file index
analysis.Rmd                 Reproducible R Markdown meta-analysis script
report.qmd                   Quarto technical report — every statistic explained
manuscript.qmd               Quarto academic manuscript template
references.bib               Starter BibTeX bibliography
data/
  summary_statistics.csv     Per-submission summary statistics (all included submissions)
submissions/
  <submitter>_<date>.csv     Individual participant-level CSV, one per contributor
deviations/
  <submitter>_methods.txt    Methods report and admin notes per contributor

14.2 report.qmd — comprehensive technical report

This is a self-contained technical report rendering to HTML or PDF. For every statistic it shows, it also documents (a) what the statistic estimates, (b) the assumptions it depends on, (c) how to read it, and (d) when it misleads. It includes:

  • Full summary(res) output from the fitted metafor model
  • Pooled effect on both standardised and raw (unstandardised) scales
  • Knapp-Hartung small-sample adjustment (t with k−1 df, inflated SE) alongside the standard Wald test — recommended when k < 20
  • 95% prediction interval for the true effect of a new sample from the same population (distinct from the CI for the average)
  • Model fit statistics (logLik, deviance, AIC, BIC)
  • Per-sample best linear unbiased predictions (BLUPs) with shrinkage toward the pooled mean
  • Heterogeneity: τ², τ, I², H², Cochran's Q, with profile-likelihood 95% CIs for τ² and I²
  • Panel-by-panel interpretation of every diagnostic plot (Q–Q, residuals-vs-fitted, scale-location, leave-one-out, Baujat)
  • Influence-diagnostic table with per-column interpretation guidance (Cook's d, rstudent, DFFITS, cov. ratio, hat)
  • Publication-bias diagnostics (funnel plot, Egger's regression, trim-and-fill, Rosenthal's failsafe N)
  • Sensitivity analyses: (i) exclusion of influential samples, (ii) comparison across alternative τ² estimators (REML, DL, PM, ML, HE)

14.3 manuscript.qmd — academic paper template

This is a Quarto manuscript pre-populated with project metadata (title, design, primary outcome, description) and all meta-analytic results, structured as an academic paper and designed to substantially reduce manuscript-writing overhead. It renders to HTML, Word (.docx), or PDF via quarto render manuscript.qmd --to docx.

Structure:

| Section | Content |
| --- | --- |
| Abstract | Structured (Background / Methods / Results / Conclusions) with Methods and Results pre-populated |
| Introduction | Three [AUTHOR: ...] paragraph prompts + auto-filled "The present study" paragraph |
| Methods | Fully auto-populated: design-specific blurbs for Design, Participants, Procedure, Outcomes, Effect Size, Statistical Analysis (including assumption checks, publication-bias procedures, information accrual) |
| Results | Sample characteristics → Assumption checks and sensitivity analyses (first) → Primary analysis (second) |
| Discussion | Entirely [AUTHOR: ...] placeholders (intentionally left blank) |
| Data/code availability, CRediT, Funding, COI | Short boilerplate paragraphs |

The Methods and Results narrative adapts automatically to the study design (RCT, PRE_POST, RCT_PRE_POST, CROSSOVER, or CROSS_SECTIONAL), so the same template works across studies of the same design with different topics.

14.4 Mixed-model reporting detail

Both report.qmd and manuscript.qmd present the fitted random-effects linear mixed model in substantially more detail than the website's summary cards. Each file includes a mixed-model table reporting:

  • Fixed-effect point estimate μ̂, SE, z (Wald), p (Wald), 95% CI
  • Knapp-Hartung adjusted t, p, and 95% CI
  • 95% prediction interval
  • τ², its square root τ, profile-likelihood 95% CI for τ²
  • I² with profile-likelihood 95% CI, H²
  • Cochran's Q, df, p-value
  • Per-sample BLUPs (observed vs shrunken estimate, shrinkage magnitude)
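Several of the heterogeneity quantities listed above have closed-form definitions that are easy to check by hand. The Python sketch below computes Cochran's Q, the DerSimonian-Laird τ², and the Q-based I² and H² (Higgins et al., 2003) from per-study effects and standard errors. It is an illustration under stated assumptions, not the report's code: the function name `heterogeneity` is hypothetical, and DL is used because it has a closed form, so values may differ somewhat from those metafor reports under REML.

```python
from typing import Dict, Sequence


def heterogeneity(effects: Sequence[float], ses: Sequence[float]) -> Dict[str, float]:
    """Cochran's Q, DerSimonian-Laird tau^2, and Q-based I^2 (%) and H^2."""
    k = len(effects)
    if k < 2 or k != len(ses):
        raise ValueError("need matching inputs from at least 2 studies")
    w = [1.0 / s ** 2 for s in ses]            # inverse-variance weights
    sw = sum(w)
    mu_fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    q = sum(wi * (yi - mu_fixed) ** 2 for wi, yi in zip(w, effects))
    df = k - 1
    c = sw - sum(wi ** 2 for wi in w) / sw     # DL scaling constant
    tau2 = max(0.0, (q - df) / c)              # truncated at zero
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    h2 = q / df
    return {"Q": q, "df": float(df), "tau2": tau2, "tau": tau2 ** 0.5,
            "I2": i2, "H2": h2}


# Hypothetical per-study estimates.
result = heterogeneity([0.10, 0.35, 0.50, 0.22], [0.12, 0.10, 0.15, 0.09])
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

When Q is no larger than its degrees of freedom, both τ² and I² are truncated to zero, which is why a reported I² of 0% does not by itself show that the studies are homogeneous.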

14.5 Raw files included

  • data/summary_statistics.csv — one row per submission with its computed summary statistics (means, SDs, correlations, n) exactly as supplied to the R engine.
  • submissions/<submitter>_<date>.csv — the individual participant-level CSV originally uploaded by each contributor. These allow independent re-execution of the full pipeline.
  • deviations/<submitter>_methods.txt — the submitter's methods-report text and any admin validation notes recorded at the time the submission was reviewed.

Generation is implemented in [lib/reportGenerator.ts](../lib/reportGenerator.ts); the archive is produced once, on the transition to COMPLETED status, and written to uploads/reports/<studyId>.zip.


15. Data Flow Summary


Student uploads CSV of individual participant data; Supervisor approves it
           |
           v
Submission stored in database (status: PENDING)
           |
           v (admin reviews and validates)
Status set to VALIDATED or INCLUDED
           |
           v (admin triggers analysis)
POST /api/analysis/run - the system:
  -> downloads CSV for each submission
  -> parses to participant-level rows
  -> POST to R engine at R_API_URL/run
           |
           v (R engine)
stack_participants()  — combine all CSVs into one data frame
analyze_*()          — design-specific lmer() model
  -> primary:    unstandardised effect (MD, slope, etc.)
  -> secondary:  standardised effect (Hedges' g, Kendall's τ, etc.)
  -> variance:   τ², σ², I²
compute_per_study()  — within-study estimates for plots
forest() / funnel()  — base64 PNG plots
compute_diagnostics() — Q-Q, residuals, scale-location, Baujat, LOO
run_sequential()     — cumulative models (k=2..K) + stopping rules
run_publication_bias() — Egger's, fail-safe N, trim-fill, outliers
           |
           v
Results returned as JSON
           |
           v
Stored in MetaResult table
           |
           v
Displayed on /dashboard/results/[studyId]
and /studies/[studyId] (public view)

16. Methodological References

  • Becker, B. J. (1988). Synthesizing standardized mean-change measures. British Journal of Mathematical and Statistical Psychology, 41(2), 257–278.
  • Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Erlbaum.
  • Duval, S., & Tweedie, R. (2000). Trim and fill: A simple funnel-plot-based method of testing and adjusting for publication bias in meta-analysis. Biometrics, 56(2), 455–463.
  • Hedges, L. V. (1981). Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational Statistics, 6(2), 107–128.
  • Higgins, J. P. T., Thompson, S. G., Deeks, J. J., & Altman, D. G. (2003). Measuring inconsistency in meta-analyses. BMJ, 327(7414), 557–560.
  • Kendall, M. G. (1970). Rank Correlation Methods (4th ed.). Griffin.
  • Lan, K. K. G., & Wittes, J. (1988). The B-value: A tool for monitoring data. Biometrics, 44(2), 579–585.
  • Morris, S. B., & DeShon, R. P. (2002). Combining effect size estimates in meta-analysis with repeated measures and independent-groups designs. Psychological Methods, 7(1), 105–125.
  • Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86(3), 638–641.
  • Senn, S. (2002). Cross-over Trials in Clinical Research (2nd ed.). Wiley.
  • Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36(3), 1–48.
  • Whitehead, J. (1997). The Design and Analysis of Sequential Clinical Trials (2nd ed.). Wiley.

This whitesheet is version-controlled alongside the codebase. For questions or feedback, contact joe.warne@tudublin.ie.