Within preclinical research, attention has focused on experimental design and how current practices can lead to poor reproducibility. There are numerous decision points when designing experiments. Ethically, when working with animals we need to conduct a harm–benefit analysis to ensure the animal use is justified for the scientific gain. Experiments should be robust, not use more or fewer animals than necessary, and truly add to the knowledge base of science. Using case studies to explore these decision points, we consider how individual experiments can be designed in several different ways. We use the Experimental Design Assistant (EDA) graphical summary of each experiment to visualise the design differences and then consider the strengths and weaknesses of each design. Through this format, we explore key and topical experimental design issues such as pseudo-replication, blocking, covariates, sex bias, inference space, standardisation fallacy and factorial designs. There are numerous articles discussing these critical issues in the literature, but here we bring together these topics and explore them using real-world examples allowing the implications of the choice of design to be considered. Fundamentally, there is no perfect experiment; choices must be made which will have an impact on the conclusions that can be drawn. We need to understand the limitations of an experiment’s design and when we report the experiments, we need to share the caveats that inherently exist.

Concerns over reproducibility, the ability to replicate results of scientific studies, have been raised across virtually all disciplines of research.

The issues impacting reproducibility arise from all stages of the research pipeline including the design, analysis, interpretation and reporting. Within preclinical research, the environment the animals are in, the severity of the procedures they experience, and the interactions they have with animal care staff and experimenters may all affect the quality of the data obtained. The utilisation of in vivo protocol guidelines such as PREPARE

Within this article, we will draw on the current thinking of how to achieve a robust experiment and use two real-world scenarios to explore how the experiments could be run. These two scenarios, involving a laboratory and a non-laboratory in vivo experiment, allow us to explore the most commonly used designs. We have limited the exploration to in vivo research where the model and outcome measure have been selected. This will allow the implications of the choice of designs to be explored as we consider the pros and cons of various approaches. Understanding how different designs vary, and how this affects the analysis and subsequent conclusions, is critical for recognising the implications of the choices made and the limitations that arise, and for providing a critical context for the conclusions. We have started the exploration with the classic completely randomised design and then explored the impact of including different design features. We add these features relative to the completely randomised design for simplicity in exploring their impact. In reality, many experiments can encompass several of these design features simultaneously.

Experiments can become quite complex, particularly with time course studies or hierarchical designs. The statistical analysis implemented is a function both of the hypothesis of interest (the biological question) and of the experimental design. Consequently, it is important to consider both, to ensure that the appropriate statistical analysis is selected for the study. For example, breeding issues often lead to multiple batches of animals within an experiment and in these situations each batch should be considered a block. In this situation, the correct analysis will increase the power. When a design is complex, for example, one that includes a time course or hierarchical structure, it becomes harder to ensure the analysis is appropriate. If unsure, we recommend seeking support from a professional statistician at the planning stage. This manuscript will help you communicate the planned design to the statistician, which will then ensure that the analysis is appropriate and optimal for the biological question of interest.

Throughout the manuscript, we have explored the number of animals to use for the research question of interest. Without loss of generality, and to aid comparison, the exploration of power (sensitivity) is performed assuming the underlying within-animal and between-animal variability is constant across the scenarios. We have therefore selected the sample size (n) by exploring the power as a function of standardised effect (ie, an effect of interest relative to the variability in the data). In addition, we have included guidance on the impact on the power of the different approaches, but accurate comparison is not possible as variability estimates will differ as a function of the design. We have used the Experimental Design Assistant (EDA) as a tool to communicate and visualise the various designs, and to enable us to convey their differences.

The EDA is a freely available web-based tool that was developed by the UK National Centre for the Replacement, Refinement and Reduction of Animals in Research (NC3Rs) to guide in vivo researchers through the design and analysis planning of an experiment.

Glossary. A central feature of the Experimental Design Assistant (EDA) was the development of an ontology, a standardised language to communicate experiments

Term | Definition |

Bias | The overestimation or underestimation of the true effect of an intervention. Bias is caused by inadequacies in the design, conduct or analysis of an experiment, resulting in the introduction of error |

Biological unit* | The entity (eg, mouse, cell line) that we would like to draw a conclusion about |

Confounder* | A confounder is a nuisance variable that is distributed non-randomly with respect to the independent (treatment) or outcome measure and subsequently can mask an actual association or falsely demonstrate an apparent association |

Covariate* | A covariate is a continuous variable that is measurable and considered to have a statistical relationship with the outcome measure |

Effect size | Quantitative measure of differences between groups, or strength of relationships between variables |

Experimental unit | Biological entity subjected to an intervention independently of all other units, such that it is possible to assign any two experimental units to different treatment groups. Sometimes known as unit of randomisation |

External validity | Extent to which the results of a given study enable application or generalisation to other studies, study conditions, animal strains/species or humans |

False negative | Statistically non-significant result obtained when the alternative hypothesis is true. In statistics, it is known as the type II error |

False positive | Statistically significant result obtained when the null hypothesis is true. In statistics, it is known as the type I error |

Independent variable | Variable that the researcher either manipulates (treatment, condition, time), or is a property of the sample (sex) or a technical feature (batch, cage, sample collection) that can potentially affect the outcome measure. Independent variables can be scientifically interesting or can be nuisance variables. Also known as predictor variable |

Inference space* | Inference space is the population from which the samples in an experiment were drawn and the population to which results of an experiment can be applied |

Internal validity | Extent to which the results of a given study can be attributed to the effects of the experimental intervention, rather than some other, unknown factor(s) (eg, inadequacies in the design, conduct, or analysis of the study introducing bias) |

Nuisance variable | Variables that are not of primary interest but should be considered in the experimental design or the analysis because they may affect the outcome measure and add variability. They become confounders if, in addition, they are correlated with an independent variable of interest, as this introduces bias. Nuisance variables should be considered in the design of the experiment (to prevent them from becoming confounders) and in the analysis (to account for the variability and sometimes to reduce bias). For example, nuisance variables can be used as blocking factors or covariates |

Observation unit* | The entity on which measurements are made |

Outcome measure | Any variable recorded during a study to assess the effects of a treatment or experimental intervention. Also known as dependent variable, response variable |

Power | For a predefined, biologically meaningful effect size, the probability that the statistical test will detect the effect if it exists (ie, the null hypothesis is rejected correctly) |

Sample size (n) | Number of experimental units per group, also referred to as n |

N | Total number of animals used within an experiment |

We have therefore used the EDA terminology and definitions

Understanding the EDA diagram. Example of the Experimental Design Assistant (EDA) visualisation. The diagrams are composed of nodes (shapes) and links (arrows). The nodes represent different aspects of an experiment and the links clarify the relationship between different nodes. The diagram consists of three elements: experiment detail (grey nodes), practical steps (blue and purple nodes) and the analysis plan (green and red nodes). Each node can contain additional information which can be accessed within the EDA tool by clicking on the specific node. A two-group completely randomised design is illustrated here.

The goal of an experiment is to explore cause and effect. The resulting experimental process relies on us manipulating a variable (the independent variable of interest) to see what effect it has on a second variable (the outcome measure). We seek to design experiments that have high internal validity, meaning we have confidence that we have eliminated alternative explanations for a finding. Pivotal to achieving high internal validity is the use of randomisation, where test subjects are randomly and independently assigned to ‘treatment’ groups. The need for randomisation should also be considered in the practical steps of the experiment (application of treatment, sample processing, measurement, etc) to minimise potential order effects. Randomisation minimises the risk of nuisance variables impacting the ability to draw conclusions about cause and effect by ensuring that they are, on average, equally distributed across the test groups. Without randomisation, confounding can occur. An experiment is confounded when the test groups unintentionally differ (accidental bias) on a variable that also alters the outcome of interest. When we are trying to isolate a treatment effect, this accidental bias can mask a treatment effect of interest or it can erroneously imply a treatment effect.
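The randomisation process described above can be sketched in a few lines of Python; the pool of 16 animals and the group labels are hypothetical. A seeded generator makes the allocation reproducible and auditable while remaining random with respect to the animals' characteristics:

```python
import random

def randomise(subject_ids, groups, seed=None):
    """Randomly and independently allocate subjects to treatment groups,
    and return a separately randomised processing order to avoid order effects."""
    rng = random.Random(seed)
    ids = list(subject_ids)
    rng.shuffle(ids)                              # random allocation to groups
    n_per_group = len(ids) // len(groups)
    allocation = {g: ids[i * n_per_group:(i + 1) * n_per_group]
                  for i, g in enumerate(groups)}
    order = list(subject_ids)
    rng.shuffle(order)                            # randomised processing order
    return allocation, order

allocation, order = randomise(range(1, 17), ["vehicle", "compound X"], seed=42)
print({g: len(ids) for g, ids in allocation.items()})   # 8 animals per group
```

Note that the processing order is randomised independently of the group allocation, mirroring the point above that randomisation applies to the practical steps as well as the assignment itself.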

The importance of randomisation can be seen in meta-analyses which found that studies that did not randomise were more likely to detect a treatment effect (OR 3.4)

A component of the randomisation process is the experimental unit (EU), which is the entity which can be assigned independently, at random, to a treatment.

Standardisation is another tool used alongside randomisation to minimise the effect of nuisance variables on studies. Here, a researcher controls potential nuisance variables by standardising all aspects of the research experiment. Within in vivo research, it is common to standardise the animals (eg, inbred animals of one sex), the environment (eg, standard housing and husbandry conditions), the testing procedure and time (eg, conducting the experiment in one batch at the same time). The driver behind standardisation is not just to manage potential confounders but also to reduce the variation in the data, with the argument that this will result in fewer animals being needed to detect a defined effect size of interest. The reality of experimental design is that we have to simplify a complex world into a testing space where we have isolated cause and effect such that there is confidence that the effect seen is arising from the treatment. This process generates an inference or testing space, which relates to the population sampled. Following these experiments, we then draw conclusions and generalise the results to a broader population. External validity is the extent to which the results of a study can be generalised to another population. The complexity in biology means we do have to make hard choices when we design experiments; we cannot explore the impact of all sources of variation on the treatment effect. To progress our scientific understanding, we therefore need to generate ‘do-able problems’

Different designs also differ in the statistical analysis that is suitable. Statistical analysis is an essential tool to query the data and assess whether the differences seen are likely to arise from sampling effects or an underlying population difference. The majority of animal experiments use hypothesis-testing (with statistical tests such as Student’s

Experimental design strategies can be used to manage variation in the data with the goal of increasing the statistical power. The statistical power is the probability that the statistical test will detect the treatment effect if it exists for a predefined effect size of interest. Statistical power is therefore lower when the false-negative error rate is higher. Historically, we have focused on minimising false positives rather than ensuring we had sufficient statistical power. This approach fails to consider that a series of experiments with low power results in false-positive errors dominating.

The key biological question is to explore the effect of compound X on apoptosis in the liver. The compound of interest has been found to produce aggression in rodents, so the animals will be singly housed to avoid welfare issues. Practically, after treatment, the rats will be euthanised and the liver harvested. The biological question will be studied by quantifying histological effects in the liver. A cross-section of each liver will be prepared for histological assessment and the number of apoptotic cells counted. Using this case study, we explore a variety of experimental designs and consider their pros and cons.

In a CRD, the treatments are assigned completely at random so that each experimental unit has the same chance of receiving any one treatment. For this type of design, any difference among experimental units receiving the same treatment is considered as experimental error. This is the simplest design that can be implemented and could be considered the building block for other designs. Critical to this design is the randomisation process. Through the random allocation of experimental units to the treatment groups, the experimenter can assume that, on average, the nuisance variables will affect treatment conditions equally; so, any significant differences between conditions can fairly be attributed to the treatment of interest.

Consider the liver case study: with a completely randomised design (

Schematic of a completely randomised design (CRD) of a rat liver study exploring the effect of compound X on the histological score either as a classic experiment or as an experiment which includes a covariate. The purple section highlights the common practical steps: In a complete randomised design experiment, there is one factor of interest, and in this scenario it is treatment which has two possible levels (vehicle or compound X). The animals are the experimental units and form a pool which are randomly allocated to the two treatment groups prior to exposure. In this design, the male Wistar rat is the biological unit, experimental unit and observation unit. The upper black section details the experimental and analysis details for a classic CRD while the lower black section details the CRD with a covariate of pre-treatment body weight and hence includes in the analysis the nuisance variable body weight. The inclusion of the pre-treatment body weight will increase the power if body weight is correlated to the outcome measure. The inclusion subtly impacts the conclusion in that the estimated treatment effect would be the change in means after adjusting for any differences in pre-treatment body weight.

A power analysis is a recommended strategy for determining the sample size (n, the number of animals per group) and hence the N (total number of animals) needed for an experiment. Once the design and subsequent analysis plan are established, there are four factors that affect power: the magnitude of the effect (effect size), the variability in the outcome measure, the significance level (typically set at 0.05) and the number of measures per group (n). The statistical power should be at least 0.8. For a confirmatory study (a study performed to confirm an earlier finding), a higher power should be used as the risk of a false negative has more impact on the research than in a hypothesis-generating study. For many designs, formulae exist to complete the power calculation and have been enabled in freeware such as Gpower,
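As an illustration of how these factors interact, the widely used normal-approximation formula for a two-group comparison can be computed directly. The effect sizes below are arbitrary, and dedicated tools such as Gpower or the EDA use exact t-based calculations that give slightly larger answers:

```python
import math
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.8):
    """Normal-approximation sample size per group for a two-sided,
    two-group comparison of means at standardised effect size d (Cohen's d)."""
    z_alpha = norm.ppf(1 - alpha / 2)   # quantile for the significance level
    z_beta = norm.ppf(power)            # quantile for the target power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

print(n_per_group(1.0))   # 16 animals per group for a 1 SD effect
print(n_per_group(1.5))   # 7: larger effects need fewer animals
```

The formula makes the trade-offs explicit: halving the standardised effect size quadruples the required n, while demanding higher power or a stricter significance level increases both z quantiles and hence n.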

Although strictly speaking not part of the experimental design, a covariate is a continuous variable that is measurable and has a statistically significant relationship with the outcome measure. Examples include body weight, organ weight, tumour size at the point of randomisation, pre-treatment activity levels, baseline blood pressure measures and so on. Including a covariate is a noise reduction strategy and consequently has the potential to increase the power of your experiment. When the variable is omitted from the analysis, any variation that arises from it is classed as unexplained variation, which decreases the power when the effect of interest is compared with the unexplained variation. To increase the statistical power of an experiment, experimenters typically focus on the n, as for most scenarios there is limited ability to influence the effect size or the variability in the data. Inclusion of a covariate is an example where alteration of the analysis, rather than the design, is an alternative path to increasing power. Research has found the inclusion of a covariate is most beneficial when the covariate is strongly correlated with the outcome measure in the population, and when the experimental design would have been only moderately powered (40%–60%) without including the covariate in the analysis.

Including a covariate might also be necessary to remove a confounder which could not be removed by standardisation or randomisation. The removal is key to achieve internal validity and have greater confidence in the biological conclusion being drawn. A common example in biomedical research arises when the treatment induces a body weight phenotype and the outcome measure correlates with body weight. For example, as body weight is a highly heritable trait, 30% of knockout lines have a body weight phenotype.

Consider the liver case study: with a completely randomised design, we could include the covariate of pre-treatment body weight (

In experimental design, a factor is another name for a categorical independent variable. It is a variable that is manipulated by the experimenter and will have two or more levels (eg, light levels could have three levels such as dark, 500 lux or 1000 lux). Examples could be related to animal characteristics (eg, sex, strain, age) or aspects of the environment (eg, environmental enrichment, group size) or aspects of the protocol (eg, timings of measurement, delivery route, dose level). A factorial design investigates the impact of changes in two or more factors within a single experiment. In contrast to one-variable-at-a-time studies, factorial experiments are more efficient (provide more information and/or use fewer experimental units) and allow us to explore the interactions of the factors.

A factorial design allows us to include more than one independent variable of interest. A very topical example would be sex. Within preclinical research, there is currently an embedded sex bias as researchers predominantly study only one sex; typically males.

Often researchers will argue that they will test the second sex later and that inclusion of both sexes will double the sample size of the study.

Using a factorial design for the liver case study, with treatment and sex as factors (

Schematic of a factorial design to study the effect of compound X on the histological score with sex as a second independent variable of interest. Compared with the completely randomised design (
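The structure of such a factorial experiment can be sketched as follows; the cell size, effect sizes and variability are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
n_cell = 8                                # animals per treatment-by-sex cell (invented)
# Simulated histological scores: a treatment effect of 3 in both sexes,
# a sex difference of 1 and no interaction (all effect sizes invented)
cells = {}
for trt in ("vehicle", "compound X"):
    for sex in ("male", "female"):
        mean = 10 + (3 if trt == "compound X" else 0) + (1 if sex == "female" else 0)
        cells[(trt, sex)] = rng.normal(mean, 1.0, n_cell)

# The main effect of treatment is estimated from all 32 animals, averaged over sex
trt_effect = (np.mean([cells[("compound X", s)].mean() for s in ("male", "female")])
              - np.mean([cells[("vehicle", s)].mean() for s in ("male", "female")]))
# The interaction asks whether the treatment effect differs between the sexes
interaction = ((cells[("compound X", "male")].mean() - cells[("vehicle", "male")].mean())
               - (cells[("compound X", "female")].mean() - cells[("vehicle", "female")].mean()))
print(round(trt_effect, 2), round(interaction, 2))
```

This makes the efficiency argument concrete: every animal contributes to the estimate of the treatment main effect, so including both sexes does not require doubling the total N, and the interaction term, unavailable in a one-sex study, is estimated from the same animals.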

A randomised block design (RBD) can be used to manage nuisance sources of variability by including nuisance variables in the experimental design and analysis to account for them. These sources of variation may affect the measured results but are not of direct interest to the researcher. Examples of a nuisance variable could be litter, operator, instrument, batch, laboratory, time of day the experiment was run and so on. You can consider a RBD as a series of completely randomised experiments (mini-experiments) which form the blocks of the whole experiment.

In a block design, within a block there is standardisation but between blocks there is variation. To account for this structure, the EUs are randomly assigned to treatment conditions within each block separately. The intention is that the variability within a block is less than the variability between blocks. Consequently, in the analysis, power is increased as the effect of interest is assessed ‘within block’ against the within-block variability rather than the variability in the whole experiment. This strategy will only be effective if the variability within the block is actually lower than the variability between blocks when the experiment is performed. There are two options for a RBD, to have either a single EU or multiple EUs for each treatment per block (
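Randomising within blocks, so that each block is a self-contained mini-experiment with every treatment represented, can be sketched as follows; the batch and rat labels are hypothetical:

```python
import random

def randomise_within_blocks(blocks, treatments, seed=0):
    """Randomised block design: the experimental units in each block are
    randomised to treatments separately, so every block contains all treatments."""
    rng = random.Random(seed)
    allocation = {}
    for block, units in blocks.items():
        units = list(units)
        rng.shuffle(units)                        # randomise within this block only
        k = len(units) // len(treatments)
        allocation[block] = {t: units[i * k:(i + 1) * k]
                             for i, t in enumerate(treatments)}
    return allocation

# Three batches of four rats each; two rats per treatment within every batch
blocks = {f"batch {b}": [f"rat {b}-{i}" for i in range(1, 5)] for b in (1, 2, 3)}
alloc = randomise_within_blocks(blocks, ["vehicle", "compound X"])
print(alloc["batch 1"])
```

Because every treatment appears in every block, batch-to-batch differences cancel out of the within-block treatment comparisons, which is the source of the power gain described above.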

Understanding a randomised block design (RBD). In a RBD, each block can be considered a mini-experiment. Within a block (shown in this schematic as a group of mice within a blue square), the experimental unit (EU, shown as a mouse) is randomly assigned to a treatment (shown as a syringe which is coloured orange or yellow dependent on the treatment assigned) and all treatment levels are represented. (A) represents an experiment with six blocks with only one experimental unit per treatment level per block. (B) represents an experiment with two blocks with replication within a block, resulting in multiple EUs per treatment level per block.

It has been argued that our reliance on CRD along with highly standardised environments is contributing to the reproducibility crisis.

The problem with the standardisation strategy is that living organisms are highly responsive to the environment with phenotypic changes with both long-term (eg, development) and short-term (eg, acclimation) duration.

Instead of a focus on standardisation to minimise variation, it has been argued that we should embrace variability to ensure conclusions from studies are more representative and thereby improve the generalisability and hence the reproducibility.

In the liver case study, with a RBD we can explore the effect of the drug across a blocking factor to increase the generalisability of the results. One option would be to introduce variation by including batch as a blocking factor (

Schematic of a randomised block design to study the effect of compound X on the histological score with either batch (upper black pane) or strain (lower black pane) as a blocking factor. The purple section highlights the common practical steps. In the upper black pane, the experiment is split into a number of batches and the analysis includes a blocking factor ‘batch’. The blocking factor has three levels (batch 1, batch 2 and batch 3). The diagram highlights that the experimental rats are now randomly allocated to the treatment group within each batch. In this design, the experimental unit (EU), biological unit (BU) and observation unit (OU) is the Wistar rat. The estimated effect is more reproducible than the completely randomised design (

The question is what n should be used within a block and how many blocks are needed. This topic was explored for four syngeneic tumour models

For the rat case study with batches as the blocking factor, we can determine the n to use using the Kastenbaum

An alternative approach to this experiment would be including strain as a blocking factor (

In this block design, the EU and OU is the individual rat of a certain strain while the BU is now rat and generalisability has been increased as the inference space includes multiple strains. Using the Kastenbaum

We could reduce the number of rats within a block to two (total n=16) and detect the standardised effect size of 1.6 (k treatments=2, b blocks=4, N observations per treatment group within a block=2, significance threshold=0.05 and target power=0.8), but have insufficient replicates to assess how the treatment effect interacted with strain. This approach uses the same total number of animals as the CRD but would have a significantly enhanced inference space.

Replication within an experiment is often misunderstood, especially when the design is hierarchically nested, leading to a poor focus of resources and inappropriate statistical analysis.

Hierarchical nested designs are common in biomedical research. Such a design can be considered a form of subsampling where an EU is sampled multiple times, typically to get a more accurate measure of the EU’s response. For example, blood pressure readings are very sensitive to the environment and consequently studies on rodents will typically take multiple readings per day for each animal.

An approach frequently used to manage sub-sampling is a summary statistics approach (ie, you average the readings). For example, for the blood pressure measure, you would average the readings collected for a rat, which allows you to use this summary metric as the reading that represents the EU in the analysis. This approach can be applied to an alternative version of the rat liver study where instead of working with a single histological score for each rat we could treat each reading on a histological slide as an OU replication for the rat (
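The summary statistics approach amounts to collapsing the observation units to one value per experimental unit before analysis; a minimal sketch with invented readings:

```python
import statistics

# Hypothetical histology readings (observation units) per rat; values invented
readings = {
    "rat 1": [12, 14, 13],
    "rat 2": [9, 11, 10],
    "rat 3": [15, 13, 14],
}
# Collapse to one value per experimental unit before the statistical analysis,
# so the n in the analysis is the number of rats, not the number of readings
per_rat = {rat: statistics.mean(vals) for rat, vals in readings.items()}
print(per_rat)
```

Averaging before analysis guards against pseudo-replication: the statistical test sees three rats, not nine readings, so the sample size matches the number of true experimental units.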

Schematic of a completely randomised design (CRD) to study the effect of compound X on the histological score with multiple readings per rat. Compared with the original CRD (

Using a summary metric is not optimal in terms of statistical power as it ignores intra-subject variance.
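The two variance components can be made explicit in a short sketch (the variance values are invented): the variance of a per-animal average contains both the between-animal and within-animal parts, and only the within-animal part shrinks as readings are added.

```python
import numpy as np

rng = np.random.default_rng(3)
n_rats, n_readings = 8, 4
sigma_between, sigma_within = 2.0, 1.0    # invented variance components
rat_means = rng.normal(10, sigma_between, n_rats)
readings = rat_means[:, None] + rng.normal(0, sigma_within, (n_rats, n_readings))

avg = readings.mean(axis=1)               # one value per rat enters the analysis

# Variance of a per-rat average: the between-rat variance plus the within-rat
# variance divided by the number of readings. Extra readings only shrink the
# second term, so adding rats, not readings, is what increases n and power.
expected_var = sigma_between ** 2 + sigma_within ** 2 / n_readings
print(expected_var)   # 4.25
```

With these invented components, even infinitely many readings per rat cannot push the variance of the average below 4.0, the between-rat floor, which is why replication of experimental units matters more than subsampling.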

Frequently, data are collected serially across all subjects, either by taking multiple measures of the same variable under different conditions or by measuring over two or more time periods. These within-subject designs are used to assess trends over time or to increase power, as you are able to partition the within-animal and between-animal variability that treatment effects are assessed against. Examples include tumour growth curves, or monitoring heart rate and respiratory parameters after exposure to an intervention/treatment for a period of time, or the glucose tolerance test which tracks blood glucose concentration after exposure to glucose treatment. These repeated readings differ from the replication discussed earlier in the hierarchical nested designs (

If we revisit the rat case example, the designs have directly assessed the impact of the compound X on the liver through a terminal histology assessment. If we were interested in exploring the temporal effect of the compound, this approach would need multiple groups for vehicle and treatment to sample at each time point. To enable serial sampling, we could use a proxy for liver damage and monitor the aspartate transaminase (AST) enzyme level in the circulating blood via microsampling (

Schematic of a repeat measure design to study the effect of compound X on the circulating aspartate transaminase (AST) levels with time. In comparison with the original completely randomised design (

Data from serial measure experiments can be analysed with a repeated measures mixed model or through the calculation of summary statistics, such as the area under the curve, the slope from a regression analysis or the time to peak. With a summary metric, the analysis reverts to a simpler analysis pipeline as seen with the CRD. For this case study, if we summarise the data with the area under the curve to represent the toxicity load across the time course, then the power analysis reverts to that seen with the CRD: with an n of 8 per group, we would have 0.8 power to see a change equivalent to 1.5 SD in the summary metric.
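Computing the area under the curve for one animal's time course is straightforward trapezoidal integration; the AST values and sampling times below are invented:

```python
import numpy as np

# Hypothetical AST time course for one animal (times in hours, AST in U/L)
times = np.array([0, 2, 4, 8, 24])
ast = np.array([40, 90, 120, 80, 45])

# Trapezoidal area under the curve collapses the serial measurements into a
# single toxicity-load value per experimental unit, so the analysis reverts
# to a simple two-group comparison as in the CRD
auc = np.sum((ast[:-1] + ast[1:]) / 2 * np.diff(times))
print(auc)   # 1740.0
```

One AUC value per rat then feeds a standard two-group test, exactly as with the single histological score in the original CRD.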

The alternative repeated measures mixed model analysis strategy models the correlation and accounts for the lack of randomisation within an animal, and thus should return an efficiency advantage and give a more nuanced analysis as the effects are explored over time. The power calculation is, however, more complex as a mixed model analysis accounts for the correlated structure in the data. GLIMMPSE is web-based freeware that has been developed with a mode (Guided Study Design) for researchers to calculate the statistical power of repeated measures studies.

To highlight different potential design features, we need to consider an alternative case study. The key biological question, in this case study, is to explore the effect of diet on milk production in dairy cows within a single farm. The arrangements at the farm allow the cows to feed individually.

The advantages and disadvantages of CRD have been discussed previously. These would apply to using a CRD for the milk production study, and as in the rat study the EU would be the animal. In practice, this study would involve serially repeated measurements as the cows would be milked daily, and the average daily production over a suitable time period after initial acclimatisation to the diet would be used in the analysis (

Schematic of a completely randomised design to study milk production in dairy cows. In this completely randomised design, there is one factor of interest (diet with two levels: diet A or diet B). The individual cows are randomly allocated to a diet and milk production subsequently measured three times in the last week of the following month (allowing several weeks for the animal to adjust to the new diet). In this design, the cow is the biological unit and the experimental unit. The observation unit is the measurement of milk production on a day for a cow. As shown in the diagram, to account for this repeat measure structure, the average milk production is calculated prior to the statistical analysis.

In a cross-over design, each participant will receive multiple treatments with a wash-out period between exposures and outcome measurement. The order in which the animals receive the treatments is randomised to account for potential temporal effects. This type of design relies on the measurement being non-terminal and the treatment effect being reversible. A wash-out period is a critical step to allow the participants to return to baseline readings before exposure to the next treatment. The disadvantages of this design are the risk of carryover effects confounding the estimated treatment effect and the welfare implications of an individual animal experiencing multiple procedures. Lengthy wash-out periods are recommended to ensure that carryover effects, where exposure to the first treatment impacts the response to the second treatment, are minimised. The advantage of a cross-over design is that each participant acts as its own control, and this dramatically increases the power.
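The power advantage of each animal acting as its own control can be seen by comparing a paired with an unpaired analysis of the same (invented) data:

```python
from scipy.stats import ttest_rel, ttest_ind

# Invented milk yields (kg/day) for 6 cows, each measured on both diets
# after a wash-out period between the two treatment periods
diet_a = [28.1, 31.4, 25.9, 29.7, 33.2, 27.5]
diet_b = [29.0, 32.6, 26.8, 30.9, 34.1, 28.2]

# The paired test compares within-cow differences, removing between-cow
# variability; the unpaired test is what a two-group design would give
paired = ttest_rel(diet_a, diet_b)
unpaired = ttest_ind(diet_a, diet_b)
print(paired.pvalue < unpaired.pvalue)   # pairing detects the diet effect here
```

In this invented example the diet effect is small relative to the cow-to-cow spread, so only the paired analysis detects it, which is exactly the efficiency gain the cross-over design exploits.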

The cow milk production study could use a cross-over design (

Schematic of a cross-over design to study milk production in dairy cows. Compared with the completely randomised design (CRD) (

To date, most scientists have implemented (knowingly or unknowingly) a CRD. The drivers towards this design are multifactorial and include lack of, or inadequate, experimental design training, minimal exposure to alternative designs, cultural norms, the historic 3R interpretation, misconceptions over N needed and a statistical skill gap.

The scientific process relies on a simplification of a complex biological process to generate a testing space where we can isolate cause and effect with the goal of incrementally unravelling the biological story. The focus on standardisation and the CRD leads to experiments which assess for a treatment effect within a narrow testing space and thus assess causality with limited generalisability. Furthermore, the publication process and focus on manuscripts as scientific measures of success encourage scientists to overstate their findings and generalise the results to a far wider population than that tested.

The authors wish to acknowledge Simon Bates who provided useful insights during the preparation of this manuscript.

NAK: conceptualisation, lead writing—original draft and preparation, visualisation and writing—review and editing. DF: supported writing—original draft and preparation and writing—review and editing.

The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

NAK is an employee and shareholder of AstraZeneca. AstraZeneca, however, has no commercial interest in the material presented within this manuscript. NAK was a member of the working group selected by the National Centre for the Replacement, Refinement and Reduction of Animals in Research (NC3Rs) to assist in the development of the Experimental Design Assistant.

Not commissioned; externally peer reviewed.

Data are available in a public, open access repository. All diagrams created for this manuscript can be imported into the software for future exploration by accessing the .eda files from the Zenodo open access repository (
