Objective To determine whether a change in editorial policy, including the implementation of a checklist, has been associated with improved reporting of measures which might reduce the risk of bias.
Methods The study protocol has been published at doi: 10.1007/s11192-016-1964-8.
Design Observational cohort study.
Population Articles describing research in the life sciences published in Nature journals, submitted after 1 May 2013.
Intervention Mandatory completion of a checklist during manuscript revision.
Comparators (1) Articles describing research in the life sciences published in Nature journals, submitted before May 2013; and (2) similar articles in other journals matched for date and topic.
Primary outcome The primary outcome is change in the proportion of Nature articles describing in vivo research published before and after May 2013 reporting the ‘Landis 4’ items (randomisation, blinding, sample size calculation and exclusions). We included 448 Nature Publishing Group (NPG) articles (223 published before May 2013, and 225 after) identified by an individual hired by NPG for this specific task, working to a standard procedure. An independent investigator used PubMed ‘Related Citations’ to identify 448 non-NPG articles with a similar topic and date of publication from other journals, and then redacted all articles for time-sensitive information and journal name. Redacted articles were assessed by two trained reviewers against a 74-item checklist, with discrepancies resolved by a third.
Results 394 NPG and 353 matching non-NPG articles described in vivo research. The number of NPG articles meeting all relevant Landis 4 criteria increased from 0/203 prior to May 2013 to 31/189 (16.4%) after (two-sample test for equality of proportions without continuity correction, Χ²=36.2, df=1, p=1.8×10−9). There was no change in the proportion of non-NPG articles meeting all relevant Landis 4 criteria (1/164 before, 1/189 after). There were more substantial improvements in the individual prevalences of reporting of randomisation, blinding, exclusions and sample size calculations for in vivo experiments, and less substantial improvements for in vitro experiments.
Conclusion There was an improvement in the reporting of risks of bias in in vivo research in NPG journals following a change in editorial policy, to a level that to our knowledge has not been previously observed. However, there remain opportunities for further improvement.
This is an open access article distributed in accordance with the Creative Commons Attribution 4.0 Unported (CC BY 4.0) license, which permits others to copy, redistribute, remix, transform and build upon this work for any purpose, provided the original work is properly cited, a link to the licence is given, and indication of whether changes were made. See: https://creativecommons.org/licenses/by/4.0/.
Strengths and limitations of this study
Prospectively registered study with articulated protocol and statistical analysis plan, registered with the Open Science Framework.
Study dataset and code available in the public domain.
Involvement of a large collaborative group of outcome assessors.
Articulation, in advance of analysis, of smallest effect size of interest.
Retrospective observational study.
Limited agreement between outcome assessors.
Few articles describing in vivo research report taking specific actions designed to reduce the risk that their findings are confounded by bias,1 and those that do not report such actions give inflated estimates of biological effects.2 3 Strategies and guidelines which might improve the quality of reports of in vivo research have been proposed,4 5 and while these have been endorsed by a large number of journals there is evidence that this endorsement has not been matched by a substantial increase in the quality of published reports.6 Poor replication of in vivo and in vitro research has been reported,7–9 and this has been attributed in part to poor descriptions of the experimental and analytical details.
In May 2013, Nature journals announced a change in editorial policy which required authors of submissions in the life sciences to complete a checklist, during manuscript revision, indicating whether or not they had taken certain measures which might reduce the risk of bias and to report key experimental and analytical details, and in their submission to detail where in the manuscript these issues were addressed.10 The development of this checklist was prompted in part by a consensus statement5 setting out key aspects of study design and conduct which were necessary to allow the reader to assess the validity of the findings presented; it identified these as randomisation, blinding, sample size estimation and data handling (the ‘Landis 4’). The Nature journals’ checklist also included items relating to figures and statistical representation of data; reagents used; species, strain and sex of experimental animals; reporting of relevant ethical approvals; consent (for research involving human subjects); data deposition; and availability of any bespoke computer code. The full checklist is given in online supplementary appendix 1.
The aim of this study was to determine whether the implementation of this checklist for submissions has been associated with improved reporting of measures that might reduce the risk of bias. Because the Landis consensus statement drew attention to randomisation, blinding, sample size estimation and data handling as being the most important items to report, we chose the reporting of these as our primary measure of outcome. To establish whether any observed change in quality was simply a secular trend occurring across all journals, we matched each included publication with a publication in a similar subject area published at around the same time by a different publisher.
The methods are described in detail in the published study protocol,11 and the data analysis plan and analysis code were articulated prior to database lock and registered on the Open Science Framework (DOI: 10.17605/OSF.IO/HC7FK). The complete study data set including PubMed IDentifiers (PMIDs) and data descriptors (but not, for copyright reasons, the source pdfs) of the included articles is available on Figshare (10.6084/m9.figshare.6226718).
In this observational cohort study, we aimed to determine whether the implementation of a checklist for submissions has been associated with improved reporting of measures which might reduce the risk of bias. To do this we assessed whether—in the view of trained assessors—manuscripts reported the details required by the checklist. Importantly, we did not have access to the checklists completed by the authors.
The study populations comprised (1) published articles accepted for publication in Nature journals which described research in the life sciences and which were submitted after 1 May 2013 (when the mandatory completion of a checklist at the stage of manuscript revision was introduced) and before 1 November 2014; (2) published articles accepted for publication in Nature journals in the months preceding May 2013 which describe research in the life sciences; and (3) articles from other journals matched for subject area and time of publication. We measured the change in the reporting of items included in the checklist.
Identification of relevant articles
We included studies which described in vivo (articles that contain at least one non-human animal experiment, including rodents, flies, worms, zebrafish and so on) or in vitro research.
One individual was specifically employed by the Nature Publishing Group (NPG) to select studies which (1) described in vivo or in vitro research and (2) were published in Nature, Nature Neuroscience, Nature Immunology, Nature Cell Biology, Nature Chemical Biology, Nature Biotechnology, Nature Methods, Nature Medicine or Nature Structural & Molecular Biology. First, the individual identified papers accepted for publication with an initial submission date later than 1 May 2013. Beginning with the then-current issues (volume corresponding to year 2015), they worked backwards in time, ensuring the submission date was after 1 May 2013, collecting papers with the intention of identifying 40 Nature papers and 20 from each of the other 8 titles (ie, 200 papers in total) (‘postintervention’ group). They then used a similar process to identify papers submitted for publication before 1 May 2013, matched for journal and for country of origin (based on the address of the corresponding author), starting with the May 2013 issue and working backwards, ensuring that the date of submission was after 1 May 2011 (‘preintervention’ group). We sought to match on country of origin because at that time there was considerable discussion of national and regional differences in the quality of research publications (on the basis of anecdote rather than evidence), and in an attempt to balance ‘writing in a foreign language’ between the two groups. Where no match could be found with a submission date after 1 May 2011 (ie, in a 2-year period), the non-matched postintervention publication was excluded from analysis and a replacement postintervention publication selected, as above. A matching preintervention publication was then identified, as described above. Articles describing research involving only human subjects were excluded.
A Nature editorial administrator independent of publishing decisions reviewed the selected articles against the inclusion criteria and found that some (less than 10%) had been included incorrectly; they replaced these with manuscript pairs that they selected according to the inclusion algorithm described above. The published files corresponding to the publication pdfs (including the extended methods section, extended data and other supplementary materials) were used to generate pdfs for analysis. These were provided to a member of our research team (RM) at a different institution, who used Adobe Acrobat to redact information relating to author names or affiliations, dates, volumes or page numbers, and the reference list, to minimise outcome assessors’ awareness of whether the manuscript was preintervention or postintervention.
The same member of our research team (RM) was responsible for identifying matching articles in other journals. Using PubMed, she entered the NPG publication title to retrieve the relevant record. She then added the ‘related citations for PubMed’ result to the search builder. In the second line search field of the search builder, she searched for ‘Date of publication’ in the same calendar month and year and performed the search. Starting with the first result returned, she established whether it was published in a participating NPG journal (given in bold in table 1). If it was not, she applied the study inclusion criteria (in vivo or in vitro research or both, as defined above), ensuring that there was a match on the in vivo/in vitro status between the index NPG publication and the non-NPG publication. Where these criteria were met, she selected the publication for the study and retrieved the pdf, through open access, online institutional subscription, interlibrary loan or by request from the authors. If the first related citation did not fulfil these criteria, she moved to the next, until an appropriate publication was found. If an appropriate publication was not found, she repeated these steps but with the date of publication used in the search extended by 1 month earlier and 1 month later. If this process did not identify an eligible publication, she again extended the search by a month in each direction and continued until a matching publication was found. She then recorded the difference in calendar months between the date of publication of the index NPG article and the date of publication of the matching non-NPG article. She then used Adobe Acrobat to redact information relating to author names or affiliations, dates, volumes or page numbers, and the reference list, to minimise outcome assessors’ awareness of whether the manuscript was preintervention or postintervention.
Having completed these tasks she played no further part in the study.
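The widening date-window search described above can be sketched in Python as follows. This is an illustration only, not the procedure as it was actually carried out (which was manual, in PubMed): the `Candidate` record, the `month_index` encoding and the `max_window` cap are hypothetical names and assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set, Tuple


@dataclass(frozen=True)
class Candidate:
    """A related citation; every field is a hypothetical stand-in for a PubMed record."""
    pmid: str
    journal: str
    month_index: int  # publication month encoded as year * 12 + month (assumption)
    status: str       # "in vivo", "in vitro" or "both"


def find_match(index_month: int, index_status: str,
               related: Iterable[Candidate], npg_journals: Set[str],
               max_window: int = 24) -> Optional[Tuple[Candidate, int]]:
    """Return the first eligible related citation and its offset in months.

    The window starts at the same calendar month (0) and widens by one month
    in each direction until a match is found. `max_window` is an assumed
    safety cap; the text describes continuing until a match was found.
    """
    ranked = list(related)  # keep the order in which PubMed returned the records
    for window in range(max_window + 1):
        for cand in ranked:
            if abs(cand.month_index - index_month) > window:
                continue  # outside the current date window
            if cand.journal in npg_journals:
                continue  # must come from a non-participating journal
            if cand.status != index_status:
                continue  # must match on in vivo/in vitro status
            return cand, cand.month_index - index_month
    return None
```

The returned offset corresponds to the recorded difference in calendar months between the index NPG article and its match.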
We anticipated difficulty in identifying matching articles, and in particular in matching non-NPG articles by country, so we did not seek to match on country. In total, 896 articles were selected for analysis.
The Nature checklist focused on transparency in reporting and availability of materials and code, reflected in 10 items. We designed a series of questions (online supplementary appendix 2) to establish whether a given publication met or did not meet the requirements of the checklist. We did this to aid outcome assessors, because many checklist items included more than one embedded criterion. For instance, the section on ‘Figures and Statistical Representation of Data’ was operationalised to 12 individual ‘present/absent/not applicable’ responses. The checklist relates to the reporting of experiments, and so compliance could be achieved by reporting whether or not an element was described. For instance, for assessment of outcome, a publication was considered compliant if it reported that assessment was conducted blinded to experimental group, or if it reported that assessment was conducted without blinding to experimental group. A manuscript was only considered not to fulfil the requirements of the checklist if it described neither that the assessment was performed blinded to experimental group nor that the assessment was not performed blinded to experimental group.
Where a manuscript described both in vivo and in vitro research, the series of questions was completed for each. Where there was more than one in vitro experiment or more than one in vivo experiment, the question was considered in aggregate; that is, all in vitro experiments had to meet the requirements of the checklist item for the article to be considered compliant in reporting of in vitro experiments, and all in vivo experiments had to meet the requirements of the checklist item for the article to be considered compliant in reporting of in vivo experiments. Where an item was considered only partially compliant, we considered this, for the purposes of analysis, to be non-compliant. Where a particular checklist item was not relevant for a given manuscript (randomisation in observational studies, or power calculations in explicitly exploratory studies12), this item was considered ‘not applicable’ and the manuscript was not included in the analysis of that item.
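As a minimal sketch of this aggregation rule, the following Python function combines per-experiment judgements for one checklist item into a single article-level status. The string labels and the idea of recording ‘not applicable’ per experiment are assumptions made for illustration; the study itself recorded responses through its own web platform.

```python
from typing import Iterable, Optional

# Hypothetical response labels for a single checklist item.
COMPLIANT, PARTIAL, NON_COMPLIANT, NOT_APPLICABLE = "yes", "partial", "no", "n/a"


def article_item_status(experiment_scores: Iterable[str]) -> Optional[bool]:
    """Aggregate experiment-level scores for one item and one experiment type.

    Returns True if every relevant experiment is fully compliant, False if any
    is partially or non-compliant, and None if the item is not applicable
    (in which case the article is left out of the analysis of that item).
    """
    relevant = [s for s in experiment_scores if s != NOT_APPLICABLE]
    if not relevant:
        return None
    # Partial compliance is treated as non-compliance.
    return all(s == COMPLIANT for s in relevant)
```

The function would be called once for the in vivo experiments and once for the in vitro experiments of an article, since the two were assessed separately.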
Five researchers experienced in systematic review and risk of bias annotation scored a set of 10 articles using our series of questions. Disagreements were resolved by group discussion, to arrive at a set of ‘Gold standard’ answers for these 10 articles. We also used this experience to write a training guide for outcome assessors. We then used social media platforms and mailing lists to recruit outcome assessors. We sought to recruit individuals with a background in medicine or biomedicine at a graduate or undergraduate level who we believed should have experience in the critical appraisal of published materials. However, we also recruited two senior school students on Nuffield Research Placements in our group. After outcome assessors had reviewed the training materials, they were invited to score articles from the ‘Gold standard’ pool, presented in random order, until their concordance with the gold standard responses was 80% overall, and was 100% for the components of the primary outcome measure, for three successive articles. At this point we considered them to be trained. The 10 training data sets, with their ‘gold standard’ adjudications, were included in the analysis (using the gold standard adjudications). Because of the range of expertise available, we ensured that each manuscript was reviewed by at least one assessor highly experienced in systematic review and critical appraisal. The training platform remains available for continuing professional development at https://ecrf1.clinicaltrials.ed.ac.uk/npqip/.
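The training criterion can be expressed as a simple check over an assessor's scoring history. The sketch below assumes that the 80% overall threshold applies to each article individually rather than to a pooled total, which the text does not specify; the function and parameter names are hypothetical.

```python
from typing import Iterable, Tuple


def is_trained(history: Iterable[Tuple[float, float]],
               overall_threshold: float = 0.80,
               run_length: int = 3) -> bool:
    """Decide whether an assessor has met the training criterion.

    `history` holds, for each gold standard article in the order scored, a pair
    (overall concordance, concordance on primary outcome items), each a
    fraction between 0 and 1.
    """
    run = 0
    for overall, primary in history:
        if overall >= overall_threshold and primary == 1.0:
            run += 1
            if run == run_length:
                return True
        else:
            run = 0  # the three qualifying articles must be successive
    return False
```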
PDF files of included articles were uploaded to the study website. Trained assessors were presented with articles for scoring in random order. Each manuscript was scored by two individuals, one with experience in systematic review and risks of bias annotation and one other. Disagreements between assessors were reconciled by a third, experienced individual who was not one of the original reviewers, who could see the responses previously given but not who were the initial reviewers. Each item for each manuscript was therefore scored by two (if there was agreement) or three (if there was disagreement) reviewers, except for the 10 manuscripts which served as the gold standard, which had been scored by five experienced assessors. We had intended to monitor outcome assessment after 10% of manuscripts had been scored and reconciled, but the reconciliation process lagged behind the outcome assessment, and this was not done.
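The scoring workflow for a single item can be sketched as below. The function names are hypothetical, and the raw percentage agreement shown is only one plausible way to summarise agreement between the two initial assessors; the text does not state which agreement statistic was computed.

```python
from typing import Iterable, Optional, Tuple


def final_score(first: str, second: str, reconciler: Optional[str] = None) -> str:
    """Return the agreed score, or the reconciler's score if the two disagree."""
    if first == second:
        return first
    if reconciler is None:
        raise ValueError("disagreement requires a third, reconciling assessment")
    return reconciler


def percent_agreement(pairs: Iterable[Tuple[str, str]]) -> float:
    """Raw percentage of items on which the two initial assessors agreed."""
    scored = list(pairs)
    if not scored:
        raise ValueError("no scored items")
    return 100.0 * sum(a == b for a, b in scored) / len(scored)
```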
Statistical analysis plan
Given our focus on the reporting of measures to reduce the risks of bias, we took as our primary outcome measure a composite measure of the proportion of articles meeting the relevant measures identified by Landis et al in 2012 as being most important for transparency in reporting in vivo research. These are covered by items 2, 3, 4 and 5 of the checklist and relate to the reporting of randomisation, of the blinded assessment of outcome, of sample size calculations, and of whether the manuscript described whether samples or animals were excluded from analysis. Importantly, checklist compliance did not require, for example, that the study was randomised, but rather that the authors stated whether or not it was randomised. The evaluation principle was to determine if someone with reasonable domain knowledge could understand the parameters of experimental design sufficiently to inform interpretation. It has been argued that these measures might not be as relevant for exploratory studies, and for these we recorded the item as ‘not relevant’. We defined exploratory studies as those where hypothesis testing inferential statistical analyses were not reported. Where an item was not relevant for a publication (for instance with studies using transgenic animals where group allocation had been achieved by Mendelian randomisation), we considered compliance as meeting all of the relevant criteria. Where a publication described both in vivo and in vitro experiments, we analysed each type of experiment separately.
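The composite rule can be sketched as follows. The item names and the use of `None` for ‘not relevant’ are illustrative assumptions, as is the choice to return `None` when none of the four items is relevant, a case the text does not address.

```python
from typing import Mapping, Optional

LANDIS_4 = ("randomisation", "blinding", "sample_size", "exclusions")


def meets_landis_4(item_status: Mapping[str, Optional[bool]]) -> Optional[bool]:
    """Composite compliance: every relevant Landis 4 item must be compliant.

    `item_status` maps each item name to True (compliant), False
    (non-compliant) or None (not relevant for this publication).
    """
    relevant = [item_status[name] for name in LANDIS_4 if item_status[name] is not None]
    if not relevant:
        return None  # assumption: no relevant items, so no composite judgement
    return all(relevant)
```

For example, a transgenic study for which randomisation is not relevant would count as compliant if it met the other three items.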
Our primary outcome was the proportion of articles describing in vivo experiments published by NPG after May 2013 that meet all of the relevant Landis 4 criteria. This is described in the statistical analysis plan deposited on the Open Science Framework (osf.io/hc7fk) on 7 June 2017 prior to database lock and before we had derived any outcome information. Following discussion with the NPG editorial team, we also set out in the protocol11 some predefined ‘editorially significant changes’—either reaching compliance of 80% or an increase of 15% in compliance.
We used the two-sample proportion test (prop.test) in R without the Yates continuity correction and two-sided hypothesis testing to be sensitive to the possibility that performance might have declined rather than improved. The secondary outcomes were (1) whether the proportion of articles describing in vivo experiments published by NPG after May 2013 which met all four of the Landis 4 criteria was 80% or higher (the original primary outcome; Wald test, wald.ptheor.test, RVAideMemoire in R); (2) the change in the proportion of articles describing in vitro experiments published by NPG before and after May 2013 which met all four of the Landis 4 criteria (two-sample proportion test as above); and (3) the change in the proportion of manuscripts meeting the criteria for adequate reporting of statistical analysis details, individual Landis criteria, descriptions of animals, reagents and their availability, biological sequences or structures, computer code deposition, and items relating to the involvement of human subjects or materials in included studies. 
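For readers who want to check the arithmetic, the two-sample test for equality of proportions without continuity correction can be re-derived in a few lines of Python. This is a sketch of the textbook statistic, not the study's analysis code (which used prop.test in R); the function name is hypothetical. It is applied here to the in vivo counts reported in the results (0/203 before, 31/189 after).

```python
import math
from typing import Tuple


def two_sample_prop_test(x1: int, n1: int, x2: int, n2: int) -> Tuple[float, float]:
    """Pearson chi-squared test (df=1) for two proportions, no continuity correction.

    Returns (chi-squared statistic, two-sided p value).
    """
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (x2 / n2 - x1 / n1) / se
    chi2 = z * z
    # For df=1 the chi-squared upper tail equals the two-sided normal tail.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return chi2, p_value


chi2, p = two_sample_prop_test(0, 203, 31, 189)
print(f"chi-squared = {chi2:.1f}, p = {p:.1e}")
```

With these counts the statistic is approximately 36.2 and the p value approximately 1.8×10−9.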
For the matching articles from non-NPG journals, the secondary outcomes were (1) the change in the proportion of articles describing in vivo experiments published before and after May 2013 which met all of the Landis 4 criteria (two-sample proportion test); (2) whether the proportion of articles describing in vivo experiments published after May 2013 which met all four of the Landis 4 criteria was 80% or higher (Wald test); (3) the change in the proportion of articles describing in vitro experiments published before and after May 2013 which met all four of the Landis 4 criteria (two-sample proportion test); and (4) the change in the proportion of manuscripts meeting the criteria for reporting of statistical analysis details, individual Landis criteria, descriptions of animals, reagents and their availability, biological sequences or structures, computer code deposition, and items relating to the involvement of human subjects or materials in included studies. For each of these outcomes, we compared the changes observed in NPG articles with those observed in non-NPG articles. For each secondary analysis we applied the Holm-Bonferroni correction, using the p.adjust function in R on the p values from prop.test, to account for the number of comparisons drawn, as described in appendix B of the data analysis plan. We also used interrupted time series analysis for each checklist item to distinguish a discrete ‘shift’ in performance from an upward ‘drift’, as described in the data analysis plan. Several tertiary outcomes are described in the study protocol and statistical analysis plan and are reported in the supplementary material.
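The Holm-Bonferroni step-down adjustment can be sketched in plain Python as below. This illustrates the standard procedure rather than the study's own code, and the p values in the usage line are made up for the example.

```python
from typing import List, Sequence


def holm_adjust(p_values: Sequence[float]) -> List[float]:
    """Holm-Bonferroni adjusted p values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[i])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


print(holm_adjust([0.01, 0.04, 0.03]))  # illustrative p values only
```

Because each raw p value is multiplied by the number of comparisons still outstanding, a family with many comparisons can push moderately small raw p values up to the cap of 1.00.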
Power calculations were performed in STATA (Version 13.0) prior to commencement of the study. For the primary outcome measure, we approximated required sample sizes using power calculations for a one-sided two-sample χ² test in STATA, seeking a significance level of p<0.01 and with varying estimates of compliance with the Landis 4 criteria in the preintervention group. With 200 articles in each group, we had 80% power to detect an increase from 10% to 21%, or from 20% to 34%, or from 30% to 45%, or from 40% to 56%, or from 50% to 66%. We sought to detect an absolute difference of 10% or more and expected compliance with the Landis 4 criteria in the preintervention group to be around 10%, and so we considered 200 studies in each group to be sufficient.
For the primary outcome measure proposed in the original study protocol (that compliance with the Landis 4 criteria in the postintervention group reached 80%), 200 studies in each group would be sufficient to reject the hypothesis that compliance had reached 80% if the observed compliance was 72% or lower, and again we considered this to be sufficient.
For individual checklist items, after correcting for multiple comparisons, statistical power again depends on the level of reporting in the preintervention group. Where this was between 15% and 85%, with 200 studies per group, we would have 80% power to detect an absolute increase of 15% in the reporting of each item. We considered this to be the minimal increase that would represent an important improvement in reporting. The power calculations are described in greater detail in the study protocol.11
Eight hundred and ninety-six articles were identified and uploaded for outcome ascertainment, 448 in each cohort. Two non-NPG articles were excluded because they did not meet the inclusion criteria, and we identified four NPG and nine non-NPG articles which had been included more than once. Four hundred and forty-four NPG articles and 437 non-NPG articles underwent outcome assessment. One NPG publication and one non-NPG publication were adjudged at the time of outcome assessment to report neither in vivo nor in vitro research, and so were excluded. The analysis is therefore based on 443 NPG articles (219 before and 224 after 1 May 2013) and 436 non-NPG articles (194 before and 242 after 1 May 2013) (figure 1). The difference in numbers for NPG and non-NPG before and after 1 May 2013 is because some of the NPG ‘before’ articles matched best with articles in other journals published in the few months following May 2013. Specifically, 26 NPG preintervention articles were matched with other papers published an average of 3.2 months after May 2013 (maximum of 8 months), and 6 NPG postintervention articles were matched with other papers published 1, 2, 9, 11, 12 and 215 months before May 2013. Overall, 43% of matched pairs had dates of publication within 1 month, 54% within 2 months, 64% within 3 months and 81% within 6 months of each other (range −11 to +22 months). Two hundred and thirty-nine articles described only in vivo research, 133 described only in vitro research and 507 described both. Four hundred and ninety-four papers were completely matched for in vivo and in vitro status, 276 were partially matched (one member of matched pair reporting in vivo and in vitro research, the other reporting only in vitro or only in vivo research) and 36 were mismatched (one reporting only in vivo research, the other reporting only in vitro research). 
The source journals are given in table 1; in total 198 different titles contributed matching articles (median of 1 article per source journal, range 1–47). The PMIDs of included articles are listed in the data supplement.
We had intended to perform subgroup analyses in groups defined by country of origin, categorisation of research, and whether the study was predominantly in silico, in vitro, in vivo or involved human subjects. However, because for some countries the number of included manuscripts was low, the categorisation of research was not available for all manuscripts in the matching non-NPG group (except by inference from the matched NPG papers), the number of predominantly in silico and predominantly human studies was low, and we were not confident that we could operationalise a judgement as to whether a paper was predominantly in vivo or in vitro, we elected not to pursue these analyses. The country of origin for papers in each cohort is shown in table 2.
Two hundred and five individuals registered with the project, of whom 109 started at least one training manuscript, 38 completed their training and 35 assessed at least one manuscript. Of these 35, 12 individuals also reconciled conflicting outcome assessments, and the web interface was programmed to ensure that they were not offered for reconciliation articles that they had previously adjudicated. Including reconciliation, the median number of articles scored was 13 (range 1–441). The agreement between outcome assessors ranged from being no better than chance at 50% (in vivo studies; implementation of statistical methods and measures: ‘Is the variance similar (difference less than two-fold) between the groups that are being statistically compared?’) to 98% (in vivo studies; ‘Does the study report the species?’). The median agreement was 82% (IQR 68%–89%). Two articles were identified during manuscript preparation as having been incorrectly recorded at data lock as reporting both in vivo and in vitro research, where in fact they only reported in vitro research, and one article had been incorrectly recorded as reporting both in vivo and in vitro research, where in fact it only reported in vivo research.
Reporting of the Landis 4 items
The proportion of NPG in vivo studies reaching full compliance with the Landis 4 criteria increased from 0% (0/203) to 16.4% (31/189) (Χ²=36.2, df=1, p=1.8×10−9), but remained significantly lower than the target of 80% (95% CI 11.6% to 22.6%, Wald test versus 80% z=−15.4, p<2.2×10−16). In the tables the denominator number of studies (‘N’) differs according to whether that criterion is relevant to the work presented; for instance, in transgenic studies randomisation may not be appropriate (tables 3–5).
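The comparison against the 80% target can be approximated with a one-sample Wald test. The sketch below assumes the test is carried out on the logit scale, as an intercept-only binomial regression would do; whether wald.ptheor.test in RVAideMemoire works in exactly this way is an assumption here, and the function name is hypothetical.

```python
import math
from typing import Tuple


def wald_logit_test(successes: int, n: int, p0: float) -> Tuple[float, float]:
    """One-sample Wald test of a proportion against p0 on the logit scale.

    Returns (z statistic, two-sided p value). Undefined when the observed
    proportion is exactly 0 or 1.
    """
    if successes <= 0 or successes >= n:
        raise ValueError("logit-scale Wald test needs 0 < successes < n")
    p_hat = successes / n
    logit_hat = math.log(p_hat / (1 - p_hat))
    logit_null = math.log(p0 / (1 - p0))
    se = math.sqrt(1 / successes + 1 / (n - successes))  # SE of the estimated logit
    z = (logit_hat - logit_null) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


z, p = wald_logit_test(31, 189, 0.80)
print(f"z = {z:.2f}, p = {p:.1e}")
```

Under this assumption the statistic for 31/189 against 80% comes out close to the reported value, and the p value falls far below 2.2×10−16, the smallest value that R prints by default.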
Because the number of manuscripts with a country of origin other than the USA and the number of studies which were predominantly in silico (we used studies which used computer code as a surrogate for the upper extent of this) were small, we did not analyse these further. There were no differences in compliance with the primary outcome measure dependent on whether the study included human research, or whether they included both in vivo and in vitro research or in vivo research alone.
The preferred standard is that the manuscript describes which method of randomisation was used to determine how samples or animals were allocated to experimental groups, although articles were also compliant if they included a statement about randomisation even if no randomisation was used. The proportion of NPG in vivo studies reporting the method of randomisation was 1.8% before and 11.2% after (χ²=12.4, df=1, adjusted p=0.054). Of the remainder, the proportion of studies mentioning randomisation increased from 8.3% to 64.2% (χ²=110.2, df=1, adjusted p<3.2×10−14); overall, 68% of studies discussed randomisation in some way and so were judged compliant. Figure 2A shows change in the proportion of studies meeting these criteria before and after the change in editorial policy.
The preferred standard is that the manuscript describes whether the investigator was blinded to the group allocation during the experiment and/or when assessing the outcome, although articles were also compliant if they included a statement about blinding even if no blinding was done. The proportion of NPG in vivo studies reporting blinding during group allocation or outcome assessment or both increased from 4% to 22.8% (Χ²=29.6, df=1, adjusted p=7.6×10−6). Of the remainder, the proportion of studies mentioning blinding increased from 1.6% to 55.3% (Χ²=120.1, df=1, adjusted p<3.2×10−14); overall, 63% of studies discussed blinding in some way and so were judged compliant. Figure 2B shows change in the proportion of studies meeting these criteria before and after the change in editorial policy.
The proportion of studies reporting animals excluded from analysis increased from 13.9% to 30.7% (Χ²=16.1, df=1, adjusted p=0.008). Figure 2C shows change in the proportion of studies meeting these criteria before and after the change in editorial policy.
Sample size calculations
The preferred standard is that the manuscript describes how the sample size was chosen to ensure adequate power to detect a prespecified effect size, although articles were also compliant if they included a statement about sample size estimate even if no statistical methods were used. The proportion of studies reporting an a priori sample size calculation increased from 2.0% to 14.8% (Χ²=20.5, df=1, adjusted p=0.0008). Of the remainder, the proportion of studies mentioning sample size calculations increased from 1.6% to 58.4% (Χ²=140.7, df=1, adjusted p<3.2×10−14); overall, 64% of studies discussed sample size calculations in some way and so were judged compliant. Figure 2D shows change in the proportion of studies meeting these criteria before and after the change in editorial policy.
For NPG in vitro studies, the proportion reaching full compliance with the Landis 4 criteria was 0% (0/159) before and 3.3% (6/176) after (Χ²=6.8, df=1, Holm-Bonferroni adjusted p=1.00). The proportion of studies reporting randomisation was 0% before and 2.9% after (Χ²=4.4, df=1, adjusted p=1.00). The proportion of studies mentioning randomisation even where it was not reported increased from 0% to 15.6% (Χ²=25.3, df=1, p=6.9×10−5). The proportion of studies reporting blinding during group allocation or outcome assessment or both was 3.9% before and 8.9% after (Χ²=3.5, df=1, p=1.00). The proportion of studies mentioning blinding even where it was not reported increased from 0.7% to 15.9% (Χ²=23.0, df=1, p=0.0002). The proportion of studies reporting exclusions from analysis was 8.2% before and 15.9% after (Χ²=4.7, df=1, p=1.00). The proportion of studies reporting an a priori sample size calculation was 1.3% before and 7.9% after (Χ²=8.7, df=1, p=1.00). The proportion of studies mentioning sample size even where a sample size calculation was not reported increased from 3.3% to 28.5% (Χ²=36.9, df=1, p=1.8×10−7).
The proportion of matching (non-NPG) in vivo studies reaching full compliance with the Landis 4 criteria was 1% before and 1% after (Χ²=0.01, df=1, adjusted p=1.00), and for in vitro studies the proportion of non-NPG studies reaching full compliance with the Landis 4 criteria was 0% before and 1% after (Χ²=0.8, df=1, adjusted p=1.00). There was no significant change in reporting of any of the individual Landis 4 criteria for either in vivo or in vitro research.
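The before-and-after comparisons above pair a chi-squared test for equality of two proportions (df=1, no continuity correction) with a Holm-Bonferroni adjustment for multiple testing. The following is a minimal sketch of how those two steps might be computed, not the analysis code used in the study. The function names and the counts in the usage lines are hypothetical and are not taken from the study data.

```python
from math import erfc, sqrt


def two_proportion_chi2(x1, n1, x2, n2):
    """Pearson chi-squared test (df=1, no continuity correction) for two proportions.

    x1/n1 and x2/n2 are the compliant/total counts before and after.
    Returns (chi2, p_value).
    """
    pooled = (x1 + x2) / (n1 + n2)
    variance = pooled * (1 - pooled) * (1 / n1 + 1 / n2)
    if variance == 0:
        # Both groups are all-compliant or all-non-compliant: no evidence of change.
        return 0.0, 1.0
    chi2 = (x1 / n1 - x2 / n2) ** 2 / variance
    # For df=1 the upper tail probability equals erfc(sqrt(chi2 / 2)).
    return chi2, erfc(sqrt(chi2 / 2))


def holm_bonferroni(p_values):
    """Return Holm-Bonferroni adjusted p values, in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted p never falls below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical counts, for illustration only.
chi2, p = two_proportion_chi2(10, 100, 30, 100)
adjusted = holm_bonferroni([p, 0.04, 0.03])
```

Because the adjustment multiplies each raw p value by the number of hypotheses not yet rejected, a family with many comparisons can return an adjusted p of 1.00 even when the unadjusted chi-squared statistic looks moderately large.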
For in vivo studies reported in NPG articles, there were significant improvements in the reporting of exact numbers (from 46% to 69%, Χ²=22.07, df=1, adjusted p=0.0004), of whether t-tests were defined as one-sided or two-sided (from 46% to 71%, Χ²=17.80, df=1, adjusted p=0.003), and whether the assumptions of the test had been checked (from 9% to 27%, Χ²=18.58, df=1, adjusted p=0.002). For in vitro experiments described in NPG articles, there were significant improvements in the reporting of the exact numbers (from 32% to 70%, Χ²=12.60, df=1, adjusted p=0.05), of whether data represented technical or biological replicates (from 57% to 75%, Χ²=13.29, df=1, adjusted p=0.035), and whether t-tests were defined as one-sided or two-sided (from 47% to 72%, Χ²=16.18, df=1, adjusted p=0.008). For in vivo and in vitro studies described in non-NPG articles, there was no significant change in any of the items relating to statistical reporting (table 6).
Other checklist items
For reporting of details of animals used, reporting of animal species and strain was high even before the change in editorial policy. There was no significant change in reporting any of these items in NPG and non-NPG articles, or in the reporting of details of antibodies used. For in vitro research, there was an increase in the proportion of studies in NPG articles reporting recent mycoplasma testing of the cell lines used (from 1% to 26%, Χ²=26.60, df=1, adjusted p=4×10−5) but not for non-NPG articles (1% before, 1% after). For reporting and availability of accession data (eg, DNA or protein sequence deposition) and computer code, there were no significant changes for either NPG or non-NPG articles. Finally, there were no significant changes in the reporting of items relating to human subjects or the use of human materials, but for most items the number of articles for which these were relevant was very low indeed (table 7).
In our protocol we defined the smallest effect size of editorial interest following the intervention at NPG as either achievement of compliance of 80% (transparency in figures and statistical description, data deposition, and for in vivo research description of animals used and aggregate and individual compliance with Landis 4 items) or an absolute improvement of 15% in the reporting of a checklist item (all other items). Table 8 shows, for each item, the 95% CI of both the compliance achieved and the change in compliance before and after checklist implementation. Using this approach, we were able to reliably exclude an improvement of 15% or more in NPG manuscripts for full and individual item Landis compliance for in vitro research; reporting full statistics for both in vivo and in vitro research; correction for multiple testing, and reporting measures of variation, and whether the exact age or weight was given for in vivo research; and reporting whether the central estimate was mean or median, and whether cell lines used had recently been authenticated for in vitro research.
We were also interested in whether changes in reporting had occurred as a step change at the time of the change in editorial policy; whether there was an initial improvement followed by a return to previous performance; or whether there was an ongoing improvement in reporting. To address this question we conducted an interrupted time series analysis to estimate the rate of change before the intervention, any step change at the time of the intervention and the rate of change after the intervention. We grouped articles in 3-month periods starting November 2011, and for each quarter calculated the proportional compliance with the criteria in question. Because articles were not evenly distributed across time, the analysis is of substantially reduced power, but the fitted lines for overall compliance and for each component of the Landis checklist for in vivo research are shown in figure 3. It appears that with the exception of sample size calculation, there is a continuing improvement over time in both NPG and non-NPG articles; for sample size calculations, the improvement is only seen in NPG articles. Figure 4 shows radar charts of compliance for each checklist item in NPG and non-NPG articles before and after May 2013. Figure 5 shows the 95% CIs for the change in performance for each checklist item.
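An interrupted time series of this kind is commonly fitted as a segmented regression with three quantities of interest: the slope before the intervention, the step at the intervention and the slope after it. The sketch below shows one way this could be set up, using unweighted ordinary least squares on quarterly compliance proportions. The exact model used in the study is not specified here, so the model form, the function names and the synthetic data are all assumptions made for illustration.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_interrupted_time_series(compliance, first_post_quarter):
    """Fit y = b0 + b1*t + b2*post + b3*(t - t0)*post by ordinary least squares.

    compliance: proportion compliant in each consecutive quarter (index t).
    first_post_quarter: index t0 of the first quarter after the intervention.
    """
    rows = []
    for t in range(len(compliance)):
        post = 1.0 if t >= first_post_quarter else 0.0
        rows.append([1.0, float(t), post, post * (t - first_post_quarter)])
    k = 4
    # Normal equations: (X'X) b = X'y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, compliance)) for i in range(k)]
    b0, b1, b2, b3 = solve_linear(xtx, xty)
    return {"pre_slope": b1, "step_change": b2, "post_slope": b1 + b3}


# Synthetic data (not study data): six quarters before and six after the change.
before = [0.10 + 0.01 * t for t in range(6)]
after = [0.10 + 0.01 * t + 0.20 + 0.03 * (t - 6) for t in range(6, 12)]
fit = fit_interrupted_time_series(before + after, first_post_quarter=6)
```

This sketch gives every quarter equal weight. Where articles are unevenly distributed across quarters, as noted above, a weighted fit or an article-level logistic model might be preferable.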
The change in editorial policy at NPG was associated with improvements in reporting of randomisation, blinding, exclusions from analysis and sample size calculations. For the highly challenging primary outcome measure, full compliance increased from 0% to 16%. This falls short of the target compliance of 80%, but should be seen in the context, first, that only 1 of 1073 articles from 2009 to 2010 from leading UK institutions achieved this standard,1 and second, that overall compliance of 80% would require compliance with individual items of around 95%. Since these results were first communicated at the Peer Review Congress, in bioRxiv and in the peer review process, it has been suggested that the observed change is small. It certainly falls well short of the target of 80% compliance, but this needs to be seen in the context, first, of the generally poor reporting of biomedical research (as evidenced in the non-NPG cohort) and, second, of the challenge involved in achieving compliance for a composite outcome. If the probabilities of compliance with each of the Landis 4 items were independent of each other, then achieving 80% compliance overall would require compliance with each individual item of around 95% (the fourth root of 0.8 is approximately 0.946).
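The arithmetic behind the composite outcome can be checked directly. Under the independence assumption stated above, overall compliance is the product of the per-item compliance rates, so the per-item rate needed for a given overall target is the n-th root of that target. The function names below are hypothetical and the sketch applies only under that independence assumption.

```python
def required_item_compliance(target_overall, n_items):
    """Per-item compliance needed to reach target_overall if items are independent."""
    return target_overall ** (1.0 / n_items)


def overall_compliance(item_rates):
    """Overall compliance implied by independent per-item compliance rates."""
    product = 1.0
    for rate in item_rates:
        product *= rate
    return product


# Four Landis items and an 80% overall target: roughly 0.946 per item.
per_item = required_item_compliance(0.80, 4)

# Conversely, four items each at 90% would give only about 66% overall.
overall = overall_compliance([0.90, 0.90, 0.90, 0.90])
```

In practice, compliance with one item is unlikely to be fully independent of compliance with another, so these figures are a guide to the difficulty of a composite target rather than a prediction.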
Prior to the study we identified achievement of 80% compliance, or an absolute improvement of 15% in the reporting of an item, as being the minimal change which would represent an important effect of an editorial intervention. In the NPG cohort, of 62 items, the 95% CIs of the observed change fell below 15% for 11 items, included 15% for 40 items and were above 15% for 5 items. For three items there were insufficient data to calculate 95% CIs, and for three items baseline performance already exceeded 85%.
Power calculations in primary research are often considered unfeasible, on the basis that prior to doing the study the effect size is not known. Our approach here, of identifying a smallest effect size of interest, is increasingly widely used, and has allowed us to demonstrate whether any change observed might be as large as the smallest effect size of interest, is definitely that large or is definitely not that large. We hope that those using our findings to guide their own improvements will find this helpful and recommend the approach for use in future studies.
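The three-way classification described above can be expressed as a comparison of a confidence interval with the smallest effect size of interest. The sketch below assumes a simple Wald interval for the difference between two proportions; the interval method used for table 8 is not stated here, so that choice, the function names and the example counts are all assumptions.

```python
from math import sqrt


def change_ci(x_before, n_before, x_after, n_after, z=1.96):
    """Wald 95% CI (by default) for the change in proportion, after minus before."""
    p1 = x_before / n_before
    p2 = x_after / n_after
    diff = p2 - p1
    se = sqrt(p1 * (1 - p1) / n_before + p2 * (1 - p2) / n_after)
    return diff - z * se, diff + z * se


def classify_against_sesoi(lower, upper, sesoi=0.15):
    """Place a CI for the change relative to the smallest effect size of interest."""
    if upper < sesoi:
        return "below"      # an improvement of this size can be excluded
    if lower > sesoi:
        return "above"      # the improvement is at least this large
    return "includes"       # an improvement of this size cannot be excluded


# Hypothetical counts, for illustration only.
lower, upper = change_ci(10, 100, 12, 100)
label = classify_against_sesoi(lower, upper)
```

A Wald interval behaves poorly when a proportion is close to 0 or 1, which is common for rarely reported items, so a score-based interval may be a better choice for real data.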
It is notable that even with considerable investment in designing and implementing a checklist, and working with authors to encourage its completion, compliance remains so low. This stands rather in contrast to the belief that ‘all’ that is required to ensure transparency in reporting is that journals ‘insist’ that authors do the right thing. Securing transparency in research reports is a complex challenge, and experience in other fields (MM is also clinical lead for a clinical neurology service) suggests such challenges require a range of complementary approaches with commitment from all stakeholders, might best be achieved through formal improvement activity, and often take multiple attempts to achieve and sustain change.
The checklist relates to transparency in reporting, and articles were judged to be compliant if they either reported measures to address that risk of bias or reported that such measures were not taken. For each of the Landis 4 criteria, compliance was most often achieved by the authors reporting that they had not taken measures to reduce the risk of bias. While this is not ideal, we believe this represents an improvement, in terms of the usefulness of the research to those who wish to use it, from a situation where these issues are not reported at all.
For reports of in vivo research, compliance for randomisation, blinding, reporting of exclusions and sample size calculations in NPG articles reached 68%, 63%, 31% and 64%, respectively. For non-NPG articles the performance was 12%, 5%, 12% and 3%. The figures for NPG articles are similar to those recently reported for in vivo research published in the journal ‘Stroke’,13 14 which began requiring reporting of such details following the publication of good practice guidelines in 2009,15 and where performance was found to be substantially higher than for in vivo research published in other American Heart Association journals.14
While we saw improvements in the transparency of reporting, the observed improvements in experimental design were much more modest. However, peer review may not ensure the quality of published work,16 as evidenced for in vivo research by poor reporting of measures to reduce risks of bias.1 We believe that the ultimate responsibility for assessing research quality (and therefore the validity of the findings presented) rests with the reader, and transparency in reporting is fundamental to this assessment.
For reports of in vitro research, compliance was substantially lower. There have been few systematic attempts to measure the quality of reporting of measures to reduce the risks of bias in in vitro research, and our findings suggest that, both in NPG and non-NPG journals, this remains low. There were improvements in reporting randomisation, blinding and sample size calculations in NPG descriptions of in vitro research, but only to 18%, 23% and 34%, respectively. For non-NPG the equivalent figures were 3%, 1% and 1%. There were no significant changes in the reporting of exclusion of in vitro data, with postintervention compliance of 16% in NPG articles and 6% in non-NPG articles.
For other checklist items, changes in performance were less dramatic, but there appeared to be incremental improvements across most of the items measured, although few of these breached our rather parsimonious adjustment for multiple testing. In spite of substantial attention given to the importance of reporting the sex of experimental animals, this was only done in 52% of postintervention NPG studies and in 36% of non-NPG studies.
Our assessment of compliance with the checklist was based on the resulting manuscript, and not on the completed checklists submitted by the study authors, which were not available to us for analysis. Therefore, we do not know whether these submitted checklists were incomplete but the requirement for compliance was not completely enforced, or if the authors and editors considered that manuscripts were compliant but our outcome assessors disagreed with those judgements. Knowing the relative contribution of these two explanations would inform refinements to checklist-based strategies to improve reporting. Of note, for the Consolidated Standards of Reporting Trials checklist, Blanco and colleagues recently showed that the checklist as submitted was concordant with the manuscript as published for only one of six studies.1
Ours is an observational study, and it is possible that other (related or unrelated) changes were responsible for much, if not all, of the differences seen. These changes were not observed in other journals (at least not when taken in aggregate), and so it is likely that alternative causal factors would relate to NPG editorial policy and practice. While we are not aware of any other relevant changes in editorial policy occurring at a relevant time, it is likely that this change in editorial policy was accompanied by increased attention given to the importance of the quality of reporting by both inhouse editorial staff and external peer reviewers. It is not possible to determine whether these might have caused the changes seen. However, a randomised controlled study of the effect of the Animal Research: Reporting of In Vivo Experiments (ARRIVE) checklist completion on the quality of reporting of in vivo research at PLoS One will report shortly.
While our primary outcome measure was unchanged, when writing our data analysis plan (and prior to any data inspection or analysis), we did change our criterion for measuring success, from ‘whether compliance (with the Landis 4 criteria, for in vivo research) in the postintervention group of articles reached 80%’ to ‘the change in proportion of articles describing in vivo research meeting the 4 Landis criteria’. This was because our primary intention had been to observe any effect of a change in publication policy, and with the benefit of hindsight this was not captured in our original primary outcome, but we recognise this as a limitation in our findings. We note, however, that the primary outcome used reflects better the title of the study protocol than does the primary outcome measure proposed in that protocol.
For our comparator group we chose similar articles with a similar date of publication identified using the PubMed ‘related citations’ tool. The journals in which these works were published will vary in the attention which they have given to transparency in reporting, and it may be that for some journals there have been changes similar to those observed in the NPG articles. While we might have restricted our comparator group to journals more similar to NPG articles (for instance by impact factor or extent of editorial intervention), this would have meant lower fidelity of matching by subject area or date of publication or both, and we considered these factors to be more important. For this reason, our findings for NPG articles cannot be interpreted as showing improved reporting compared with similar articles in similar journals. The representation of such ‘similar’ journals in the comparator group is too small to allow meaningful conclusions to be drawn.
During the study we encountered some difficulties that we had not expected. We had thought that it would be straightforward to distinguish between an in vivo experiment and an in vitro experiment, but we had to develop an operational approach which defined that experiment on the basis of the subject at the time that the experimental intervention occurred; so a tissue slice experiment involving tissues from animals exposed to treatment or control we considered in vivo, while a similar experiment applying drugs directly to the slice we considered to be an in vitro experiment.
Our matching on whether studies reported in vitro or in vivo research or both was also reasonable in most cases. Differences will have emerged where, as described above, articles were initially categorised with one set of characteristics (in vitro, in vivo or both) and matched accordingly, but later judged to have different characteristics. Our matching for date of publication worked reasonably well, apart from the inclusion of one comparator article published in 1995, 215 months before its ‘matching’ NPG article. We had not anticipated that matching articles would be so difficult to identify, so our matching rules did not have an upper limit of difference in the date of publication. An alternative approach would have been to prioritise matching on date of publication rather than manuscript content, but each approach has its weaknesses. Because one group (non-NPG studies published before May 2013) is substantially smaller, this will have limited, to an extent, the statistical power of these contrasts; however, since power changes with the square root of the number of studies, we estimate this loss of power only to be around 10%. Since the comparator (non-NPG) group does not contribute to our primary outcome, and the matching is generally good, we do not think that these mismatches devalue our findings to any appreciable extent.
Our matching by country of origin in the NPG cohort of publications may have introduced a bias in that manuscripts from countries with fewer publications may have been excluded because of the lack of an appropriate match. However, the included manuscripts had a country of origin matching more than 85% of Nature papers published between 2010 and 2016, and so it is unlikely that this has introduced major bias.
Further, there were some checklist items where agreement between outcome assessors was very low—for instance, for the question of whether for in vivo research the difference in variance between groups being compared was less than twofold, the agreement was no better than would be expected by chance alone. We recommend that the future development of publication checklists should include an assessment of interobserver variation by potential users of the checklist for each checklist item; low agreement might indicate that the item should be rephrased or reframed, or that more explanatory text is required.
We encountered a further unexpected problem when assessing compliance with reporting of blinding, randomisation and with sample size calculations. These were assessed with pairs of questions: first did the study report doing it (yes/no/not relevant); and second did they at least mention it (yes/no/not relevant). If a study was ‘yes’ for the first question, assessors were instructed to score the second as not relevant. Therefore, the number scored as ‘not relevant’ for the second question should represent the sum of those scored as ‘yes’ and as ‘not relevant’ for the first. This was not always the case (for in vivo research occurring in 0.1%, 0.8% and 6% of assessments for sample size calculation, randomisation and blinding, respectively), but we did not become aware of this problem until after database lock. Any impact of this shortcoming is likely to be small.
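The scoring rule for these paired questions lends itself to an automated consistency check that could be run before database lock. The sketch below is one hypothetical way to express it. The answer labels follow the yes/no/not relevant options described above, but the function names and the example records are assumptions.

```python
def is_consistent(reported, mentioned):
    """Check one assessment of a paired question.

    reported:  answer to 'did the study report doing it'.
    mentioned: answer to 'did they at least mention it'.
    The second answer should be 'not relevant' exactly when the first
    answer is 'yes' or 'not relevant'.
    """
    expects_not_relevant = reported in ("yes", "not relevant")
    return (mentioned == "not relevant") == expects_not_relevant


def inconsistency_rate(assessments):
    """Fraction of (reported, mentioned) pairs that break the scoring rule."""
    if not assessments:
        return 0.0
    bad = sum(1 for reported, mentioned in assessments
              if not is_consistent(reported, mentioned))
    return bad / len(assessments)


# Hypothetical assessments, for illustration only.
records = [
    ("yes", "not relevant"),           # consistent
    ("no", "yes"),                     # consistent: not reported but mentioned
    ("no", "not relevant"),            # inconsistent
    ("not relevant", "not relevant"),  # consistent
]
rate = inconsistency_rate(records)
```

Running a check of this kind at data entry, rather than after reconciliation, would allow assessors to correct the second answer while the article is still in front of them.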
Finally, our work shows the challenge of assessing even a relatively limited number of articles against a relatively straightforward checklist. We are delighted that so many collaborators (from six continents) agreed to participate and are very grateful to them. However, even with their help the outcome assessment and reconciliation took 17 months. This is too slow to be useful, for instance, for quality improvement activity, where more rapid feedback would allow more rapid adjustments in response to performance. We have tested the use of text analytics using regular expressions to automatically ascertain reporting of measures to reduce the risk of bias, and for some such risks of bias the approach achieves sensitivities and specificities above 80%.17 For more complex items it is likely that machine learning approaches using, for instance, convolutional neural networks may be more successful, and this is a current focus of our research. We hope that, by making the data set for this study available, this might be used, for instance, for distant supervised learning in such systems. However, the extent of disagreement between our trained assessors suggests that the language used to describe experiments in biomedicine is not altogether clear, and both machines and humans may require greater clarity in reporting to fully understand published research.
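As a rough illustration of the regular expression approach mentioned above, the sketch below flags whether a passage of text mentions randomisation, blinding or a sample size calculation. The patterns are hypothetical and deliberately simple. They are not the expressions evaluated in the work cited above, and no sensitivity or specificity is claimed for them.

```python
import re

# Hypothetical patterns, for illustration only.
PATTERNS = {
    "randomisation": re.compile(
        r"\brandom(?:is|iz)(?:ed|ation)\b|\brandomly\s+(?:assigned|allocated)\b",
        re.IGNORECASE,
    ),
    "blinding": re.compile(r"\bblind(?:ed|ing)?\b|\bmasked\b", re.IGNORECASE),
    "sample_size": re.compile(
        r"\bsample[- ]size\b|\bpower\s+(?:calculation|analysis)\b",
        re.IGNORECASE,
    ),
}


def flag_reporting(text):
    """Return, for each item, whether the text contains a matching phrase."""
    return {item: bool(pattern.search(text)) for item, pattern in PATTERNS.items()}


methods = ("Mice were randomly assigned to groups; "
           "outcome assessment was blinded.")
flags = flag_reporting(methods)
```

A match shows only that the topic is mentioned. A statement that no sample-size calculation was performed would also be flagged, so distinguishing a report that a measure was taken from a report that it was not would need further rules or a trained classifier.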
Introduction of a checklist was associated with substantial improvements in the quality of reporting in NPG articles that were not seen in matched articles from other publishers, and these improvements appear to be ongoing. However, there is still substantial room for improvement, which suggests that measures such as mandatory author checklists need to be supplemented by other approaches.
Review history and Supplementary material
Prepublication and Review History is available online at http://dx.doi.org/10.1136/bmjos-2017-000035.
A preprint version of this article is available at https://www.biorxiv.org/content/early/2017/09/12/187245
Handling editor Chris Chambers
Funding The study was funded by a grant from the Laura and John Arnold Foundation, who played no role in the design, conduct or analysis of the study or in decisions regarding publication or dissemination.
Competing interests None declared.
Patient consent Not required.
Provenance and peer review Not commissioned, externally peer reviewed.
Collaborators Study steering committee: Malcolm Macleod (University of Edinburgh, Chief Investigator and Chair), Emily Sena (University of Edinburgh), David Howells (University of Tasmania). Study management committee: Malcolm Macleod (University of Edinburgh, Chief Investigator and Chair), Emily Sena (University of Edinburgh), David Howells (University of Tasmania), Veronique Kiermer (Nature, until mid 2015), Sowmya Swaminathan (Nature, from mid 2015). Redaction and identification of articles: Hugh Ash, Rosie Moreland (Imperial College, London). Authoring and testing of training materials: Cadi Irvine, Paula Grill, Monica Dingwall, Emily Sena, Gillian Currie, Malcolm Macleod (University of Edinburgh). Programming and data management: Jing Liao, Chris Sena (University of Edinburgh). Outcome assessors: Paula Grill, Monica Dingwall, Malcolm Macleod, Cadi Irvine, Cilene Lino de Oliveira, Daniel-Cosmin Marcu, Fala Cramond, Sulail Rajani, Andrew Ying, Hanna Vesterinen, Roncon Paolo, Kaitlyn Hair, Marie Soukupova, Devon C Crawford, Kimberley Wever, Mahajabeen Khatib, Ana Antonic, Thomas Ottavi, Xenios Milidonis, Klara Zsofia Gerlei, Thomas Barrett, Ye Liu, Chris Choi, Evandro Araújo De-Souza, Alexandra Bannach-Brown, Peter-Paul Zwetsloot, Kasper Jacobsen Kyng, Sarah McCann, Emily Wheater, Aaron Lawson McLean, Marco Casscella, Alice Carter, Privjyot Jheeta, Emma Eaton. Reconciliation: Alexandra Bannach-Brown, Malcolm Macleod, Monica Dingwall, Paula Grill, Kaitlyn Hair, Cilene Lino de Oliveira, Sulail Rajani, Daniel-Cosmin Marcu, Cadi Irvine, Fala Cramond. Data analysis: Paula Grill, Jing Liao, Malcolm Macleod. Writing Committee: Malcolm Macleod, David Howells, Jing Liao, Paula Grill, Emily Sena.
Open data Raw data are available at https://figshare.com/articles/NPQIP_final_analysis_set/5375275/1.
Open materials All materials have been made publicly available: https://osf.io/HC7FK/.