The New England Journal of Medicine (NEJM) announced new guidelines for authors for statistical reporting yesterday*. The ASA describes the change as “in response to the ASA Statement on P-values and Statistical Significance and subsequent The American Statistician special issue on statistical inference” (ASA I and II,^{(note)} in my abbreviation). If so, it seems to have backfired. I don’t know all the differences in the new guidelines, but those explicitly noted appear to me to move in the reverse direction from where the ASA I and II^{(note)} guidelines were heading.
The most notable point is that the NEJM highlights the need for error control, especially for constraining the Type I error probability, and pays a lot of attention to adjusting P-values for multiple testing and post hoc subgroups. ASA I included an important principle (#4) that P-values are altered and may be invalidated by multiple testing, but they do not call for adjustments for multiplicity, nor do I find a discussion of Type I or II error probabilities in the ASA documents. NEJM gives strict requirements for controlling familywise error rate or false discovery rates (understood as the Benjamini and Hochberg frequentist adjustments).
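For readers who want to see what the two kinds of adjustment do to a set of P-values, here is a minimal sketch in Python. It is my own illustration, not anything taken from the NEJM guidelines: the function names and the five raw P-values are hypothetical, and the Benjamini–Hochberg version shown is the standard step-up adjustment, which assumes independent (or positively dependent) tests.

```python
def bonferroni_adjust(pvalues):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (standard step-up procedure)."""
    m = len(pvalues)
    # Indices of the p-values, smallest p first.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for five endpoints (illustrative only).
raw = [0.001, 0.008, 0.02, 0.03, 0.30]
bonf = bonferroni_adjust(raw)
bh = benjamini_hochberg_adjust(raw)

for p, b, h in zip(raw, bonf, bh):
    print(f"raw={p:.3f}  Bonferroni={b:.3f}  BH={h:.4f}")
```

With these made-up numbers, two endpoints stay below 0.05 after the Bonferroni adjustment (familywise control) and four stay below 0.05 after the Benjamini–Hochberg adjustment (false discovery rate control), which is the usual pattern: FDR control is the less stringent of the two.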
They do not go along with the ASA II^{(note)} call for ousting thresholds, ending the use of the words “significance/significant”, or banning “p ≤ 0.05”. In the associated article, we read:
“Clinicians and regulatory agencies must make decisions about which treatment to use or to allow to be marketed, and P values interpreted by reliably calculated thresholds subjected to appropriate adjustments have a role in those decisions”.
When it comes to confidence intervals, the recommendations of ASA II^{(note)}, to the extent they were influential on the NEJM, seem to have had the opposite effect to what was intended–or is this really what they wanted?
 When no method to adjust for multiplicity of inferences or controlling false discovery rate was specified in the protocol or SAP of a clinical trial, the report of all secondary and exploratory endpoints should be limited to point estimates of treatment effects with 95% confidence intervals. In such cases, the Methods section should note that the widths of the intervals have not been adjusted for multiplicity and that the inferences drawn may not be reproducible. No P values should be reported for these analyses.
Significance levels and P-values, in other words, are terms to be reserved for contexts in which their error statistical meaning is legitimate. This is a key strong point of the NEJM guidelines. Confidence levels, for the NEJM, lose their error statistical or “coverage probability” meaning, unless they follow the adjustments that legitimate P-values call for. But they must be accompanied by a sign that warns the reader the intervals were not adjusted for multiple testing and thus “the inferences drawn may not be reproducible.” The P-value, but not the confidence interval, remains an inferential tool with control of error probabilities. Now CIs are inversions of tests, and strictly speaking should also have error control. Authors may be allowed to forfeit this, but then CIs can’t replace significance tests and their use may even (inadvertently, perhaps) signal lack of error control. (In my view, that is not a good thing.) Here are some excerpts:
For all studies:

Significance tests should be accompanied by confidence intervals for estimated effect sizes, measures of association, or other parameters of interest. The confidence intervals should be adjusted to match any adjustment made to significance levels in the corresponding test.
For clinical trials:

Original and final protocols and statistical analysis plans (SAPs) should be submitted along with the manuscript, as well as a table of amendments made to the protocol and SAP indicating the date of the change and its content.

The analyses of the primary outcome in manuscripts reporting results of clinical trials should match the analyses prespecified in the original protocol, except in unusual circumstances. Analyses that do not conform to the protocol should be justified in the Methods section of the manuscript. …

When comparing outcomes in two or more groups in confirmatory analyses, investigators should use the testing procedures specified in the protocol and SAP to control overall type I error — for example, Bonferroni adjustments or prespecified hierarchical procedures. P values adjusted for multiplicity should be reported when appropriate and labeled as such in the manuscript. In hierarchical testing procedures, P values should be reported only until the last comparison for which the P value was statistically significant. P values for the first nonsignificant comparison and for all comparisons thereafter should not be reported. For prespecified exploratory analyses, investigators should use methods for controlling false discovery rate described in the SAP — for example, Benjamini–Hochberg procedures.

When no method to adjust for multiplicity of inferences or controlling false discovery rate was specified in the protocol or SAP of a clinical trial, the report of all secondary and exploratory endpoints should be limited to point estimates of treatment effects with 95% confidence intervals. In such cases, the Methods section should note that the widths of the intervals have not been adjusted for multiplicity and that the inferences drawn may not be reproducible. No P values should be reported for these analyses.
As noted earlier, since P-values would be invalidated in such cases, it’s entirely right not to give them. CIs are permitted, yes, but are required to sport an alert warning that, even though multiple testing was done, the intervals were not adjusted for this and therefore “the inferences drawn may not be reproducible.” In short their coverage probability justification goes by the board.
I wonder if practitioners can opt out of this weakening of CIs, and declare in advance that they are members of a subset of CI users who will only report confidence levels with a valid error statistical meaning, dual to statistical hypothesis tests.
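To make the duality concrete, here is a small sketch of my own (not from the NEJM documents) using a Normal approximation. The estimate, standard error, and number of endpoints are hypothetical, and the adjusted interval simply applies a Bonferroni correction to the confidence level, which is only one of several ways an interval could be matched to an adjusted test.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal, mean 0 and sd 1


def two_sided_p(estimate, se, null=0.0):
    """Two-sided P-value for a z-test of `estimate` against `null`."""
    z = abs(estimate - null) / se
    return 2 * (1 - STD_NORMAL.cdf(z))


def conf_interval(estimate, se, alpha=0.05):
    """(1 - alpha) confidence interval obtained by inverting the z-test."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return estimate - z_crit * se, estimate + z_crit * se


# Hypothetical treatment effect on one of m = 5 secondary endpoints.
estimate, se, m = 0.25, 0.10, 5

p = two_sided_p(estimate, se)
unadjusted = conf_interval(estimate, se, alpha=0.05)
adjusted = conf_interval(estimate, se, alpha=0.05 / m)  # Bonferroni-matched

print(f"P = {p:.4f}")
print(f"unadjusted 95% CI: ({unadjusted[0]:.3f}, {unadjusted[1]:.3f})")
print(f"adjusted CI:       ({adjusted[0]:.3f}, {adjusted[1]:.3f})")
```

In this made-up case the unadjusted 95% interval excludes 0 exactly because the raw P-value is below 0.05, while the interval matched to the adjusted level includes 0 exactly because the P-value is above 0.05/5. A reader handed only the unadjusted interval can read off the very “significance” claim the guidelines forbid stating.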
The NEJM guidelines continue:

…When the SAP prespecifies an analysis of certain subgroups, that analysis should conform to the method described in the SAP. If the study team believes a post hoc analysis of subgroups is important, the rationale for conducting that analysis should be stated. Post hoc analyses should be clearly labeled as post hoc in the manuscript.

Forest plots are often used to present results from an analysis of the consistency of a treatment effect across subgroups of factors of interest. …A list of P values for treatment by subgroup interactions is subject to the problems of multiplicity and has limited value for inference. Therefore, in most cases, no P values for interaction should be provided in the forest plots.

If significance tests of safety outcomes (when not primary outcomes) are reported along with the treatment-specific estimates, no adjustment for multiplicity is necessary. Because information contained in the safety endpoints may signal problems within specific organ classes, the editors believe that the type I error rates larger than 0.05 are acceptable. Editors may request that P values be reported for comparisons of the frequency of adverse events among treatment groups, regardless of whether such comparisons were prespecified in the SAP.

When possible, the editors prefer that absolute event counts or rates be reported before relative risks or hazard ratios. The goal is to provide the reader with both the actual event frequency and the relative frequency. Odds ratios should be avoided, as they may overestimate the relative risks in many settings and be misinterpreted.

Authors should provide a flow diagram in CONSORT format. The editors also encourage authors to submit all the relevant information included in the CONSORT checklist. …The CONSORT statement, checklist, and flow diagram are available on the CONSORT
Detailed instructions to ensure that observational studies retain control of error rates are given.
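The hierarchical reporting rule quoted in the excerpts above is easy to misread, so here is a sketch of one reading of it: a fixed-sequence procedure in which each prespecified comparison is tested in order at the full level, and reporting stops at the first nonsignificant one. The endpoint names, P-values, and the function `fixed_sequence_report` are all hypothetical, and other hierarchical (e.g., gatekeeping) schemes would look different.

```python
def fixed_sequence_report(ordered_tests, alpha=0.05):
    """Return the (name, p) pairs whose P-values may be reported.

    `ordered_tests` must be in the order prespecified in the SAP.
    Testing proceeds down the list at level `alpha` and stops at the
    first comparison that is not significant; that comparison and all
    later ones are dropped from the report, whatever their P-values.
    """
    reported = []
    for name, p in ordered_tests:
        if p >= alpha:
            break  # chain is broken: nothing from here on is reported
        reported.append((name, p))
    return reported


# Hypothetical prespecified sequence of endpoints (illustrative only).
tests = [
    ("primary", 0.003),
    ("secondary A", 0.021),
    ("secondary B", 0.090),
    ("secondary C", 0.010),
]

reported = fixed_sequence_report(tests)
print(reported)
```

Note that “secondary C” goes unreported despite its small P-value, because it sits after the first nonsignificant comparison. That is what preserves the overall Type I error probability without splitting the level across comparisons.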
In the associated article:
P values indicate how incompatible the observed data may be with a null hypothesis; “P<0.05” implies that a treatment effect or exposure association larger than that observed would occur less than 5% of the time under a null hypothesis of no effect or association and assuming no confounding. Concluding that the null hypothesis is false when in fact it is true (a type I error in statistical terms) has a likelihood of less than 5%. [i]…
The use of P values to summarize evidence in a study requires, on the one hand, thresholds that have a strong theoretical and empirical justification and, on the other hand, proper attention to the error that can result from uncritical interpretation of multiple inferences.^{5} This inflation due to multiple comparisons can also occur when comparisons have been conducted by investigators but are not reported in a manuscript. A large array of methods to adjust for multiple comparisons is available and can be used to control the type I error probability in an analysis when specified in the design of a study.^{6,7} Finally, the notion that a treatment is effective for a particular outcome if P<0.05 and ineffective if that threshold is not reached is a reductionist view of medicine that does not always reflect reality. [ii]
… A well-designed randomized or observational study will have a primary hypothesis and a prespecified method of analysis, and the significance level from that analysis is a reliable indicator of the extent to which the observed data contradict a null hypothesis of no association between an intervention or an exposure and a response. Clinicians and regulatory agencies must make decisions about which treatment to use or to allow to be marketed, and P values interpreted by reliably calculated thresholds subjected to appropriate adjustments have a role in those decisions.
Finally, the current guidelines are limited to studies with a traditional frequentist design and analysis, since that matches the large majority of manuscripts submitted to the Journal. We do not mean to imply that these are the only acceptable designs and analyses. The Journal has published many studies with Bayesian designs and analyses^{8-10} and expects to see more such trials in the future. When appropriate, our guidelines will be expanded to include best practices for reporting trials with Bayesian and other designs.
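The “inflation due to multiple comparisons” mentioned in the article can be illustrated with a back-of-the-envelope calculation. This sketch is mine, and it rests on two simplifying assumptions that real studies rarely satisfy: the tests are independent, and every null hypothesis is true. Under those assumptions the probability of at least one Type I error among m tests, each at level α, is 1 − (1 − α)^m.

```python
def familywise_error_rate(alpha, m):
    """P(at least one Type I error) for m independent tests at level alpha,
    assuming every null hypothesis is true."""
    return 1 - (1 - alpha) ** m


# Number of unadjusted tests -> chance of at least one false rejection.
rates = {m: familywise_error_rate(0.05, m) for m in (1, 5, 10, 20)}

for m, rate in rates.items():
    print(f"{m:2d} tests at 0.05: familywise error rate = {rate:.3f}")
```

With ten unadjusted tests the chance of at least one nominally significant result is already about 40%, and it applies equally to comparisons that were run but never reported.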
What do you think?
I will update this with corrections and thoughts using (i), (ii), etc.
The author guidelines:
https://www.nejm.org/authorcenter/newmanuscripts
The associated article:
https://www.nejm.org/doi/full/10.1056/NEJMe1906559
*I meant to thank Nathan Schachtman for notifying me and sending links; also Stuart Hurlbert.
[i] It would be better, it seems to me, if the term “likelihood” was used only for its technical meaning in a document like this.
[ii] I don’t see it as a matter of “reductionism” but simply a matter of the properties of the test and the discrepancies of interest in the context at hand.
Nathan Schachtman has blogged on this today, I see. He writes:
“The editors seem to be saying that if authors fail to prespecify or even postspecify methods for controlling error probabilities, then they cannot declare statistical significance, or use p-values, but they can use confidence intervals in the same way they have been using them, and with the same misleading interpretations supplied by their readers.”
http://schachtmanlaw.com/blog/
Mayo, I’ll opt out of this one. There are so many things wrong or poorly stated in the NEJM documents that dealing with them here would be awkward and unproductive. The core error underlying all the problems is their preoccupation with trying to control or correct for multiplicities, which cannot be done in any logical, objective way and which is, in fact, unnecessary. This is a fetish largely restricted to statisticians working for clinical trials, and not one affecting most other disciplines. And of course, they are half a century out of date in insisting on dichotomizing the P scale.
Justification for those opinions may be found in, inter alia: Hurlbert, S.H. and C.M. Lombardi. 2012. Lopsided reasoning on lopsided tests and multiple comparisons. Australian and New Zealand Journal of Statistics 54:23–42.
https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1467842X.2012.00652.x
Stuart: I know what you’ve written on this, but have never understood why, as a good Fisherian, you’d opt out of error control. Have you read Excursions 1 and 3 of SIST? Fisher insisted on the p-values having an error probability interpretation, even if, finding his fiducial intervals didn’t always allow this, he started to revise sentences later in life. Fisherian Cox, who looks down on behavioristic NP, and rejects rigid predesignated thresholds, insists that for p-values to retain their physical meaning, they must be adjusted for selection effects. See Mayo and Cox 2006.
https://www.nejm.org/doi/full/10.1056/NEJMe1906559
Mayo:
I don’t “opt out of error control.” In an experiment this is achieved routinely by:
1) various formal types of error control, e.g. replication, randomization, blocking, use of covariates, etc.;
2) judicious interpretation of P values, etc., taking into account all aspects of the experimental design, experimental execution, data handling, and data analysis;
3) reporting all these aspects clearly to the reader of my report; and
4) prepublication scrutiny of my report by colleagues, referees and editors.
Effective error control does not require fixing alphas (or critical P values) either for individual tests or for families or sets of tests.
It will be apparent to anyone who examines the fulminating literature on “controlling for multiplicities” as thoroughly as Hurlbert and Lombardi (2012) did that it is all a giant house of cards.
I looked at your Excursions 1 and 3 again, as well as Mayo & Cox (2006), but find no indication, aside from your positive mention of FDRs, as to how you would analyze the study presented in Hurlbert & Lombardi’s (2012) table 1. Or any refutation of the nine arguments listed for NOT trying to formally adjust for multiplicities.
If anyone has seen critiques of that paper I’d be interested to know of them. The titles of the 43 papers that have cited it so far give no outward indication of criticism, but who knows.
Stuart:
You wrote: “Effective error control does not require fixing alphas (or critical P values)”. I never said it did, but I seem to recall that in your paper, I think the 2009 with Lombardi, you say, in advocating doing away with fixed alphas, that you therefore give up on error control. I’ve never known why you link them. I don’t think I’ve read the 2012, only the 2009, but I may have. It will have to be after my Summer Seminar.
Why is it a house of cards?
Well, it’s 20 pages and, I believe, the most thorough critique of setwise error rate procedures and FDRPs out there, and highly relevant to the NEJM issues.
Here’s a quote from p. 32. Maybe Nathan could comment on this too:
“The SWERP ‘cottage industry’ (Tukey 1991) seems to have generated large amounts of mathematics divorced from the real needs of researchers, physicians, regulators and the public. The longstanding, strong arguments against fixing setwise αS are not rebutted by SWERP enthusiasts – they are mostly just ignored.
“These arguments include: (i) the irrelevance of the probability of one or more type I errors for the rare or unrealistic situation in which all nulls in a set being assessed are true; (ii) the dependence of SWERPs on the notion that it is rational to fix alphas for both individual tests and sets of tests in order to generate dichotomized conclusions (‘significant’/‘nonsignificant’, or ‘positive study’/‘negative study’); (iii) the lack of any objective grounds for specifying αS, for which reason it is usually blindly set at the familiar 0.05; (iv) the decrease in power of individual comparisons if αS is set low; (v) the lack of objective grounds for defining the size or scope of a set, as discussed above; (vi) the consequently inconsistent way in which sets are defined in practice; (vii) the penalization of studies designed to answer many questions that results from requiring stronger evidence (lower P-values) before their individual null hypotheses can be rejected, as compared with smaller studies using the same αS; (viii) the inconstancy of the evidentiary standard for ‘significance’ from one test to another within a set, from one set to another within a study, and from one study to another, thus greatly diminishing comparability of analyses and studies (not that ‘significant’ is a term that should be used to describe P-values – see e.g. Cox 1958; Eysenck 1960; Skipper et al. 1967; Altman 1991; Hurlbert & Lombardi 2009); and (ix) neglect of the fact that subject matter interpretations logically are made on a test-by-test basis and should be strongly influenced by estimated individual effect sizes, especially in any sort of applied research.”
Stuart,
I can’t respond now to all of this, but I will add that the “(vii) the penalization of studies” you note is not always a serious problem for observational epidemiologic studies, given the proliferation of meta-analyses that will very quickly follow an interesting finding in a single study. To be sure, there are problems with many meta-analyses, but the quality has improved, as has the standardization of their conduct and reporting, in consensus statements such as PRISMA, MOOSE, etc.
Nathan
Stuart is, of course, correct. P-values should not be adjusted for the presence or absence of other P-values. It is pretty unusual for discussions on this type of issue to lead to a change of mind, but see here for one case where it actually happened: https://stats.stackexchange.com/questions/139311/usingmultiplecomparisonscorrectionsinfisherpvalueframework/139685#139685
Actually they make it very clear in the joint paper for the {world without thresholds} issue that they recognize that in rejecting error control for each test, they will not get overall control of type I, II errors. I’m not denying a more quasi-formal or informal notion of error control enters at the level of what I call “big picture” inference. That’s what SEV does. But the pieces on which such assessments are built can’t go off the rails as they can if it is declared that we aren’t to point out* thresholds are violated, at least in the land of doing formal statistics.
*correction: “pointing out” is too weak. What ASA II says is “Whether a p-value passes any arbitrary threshold should not be considered at all when deciding” how to interpret/use results.
Hurlbert is driven by some strange longing to kill NP tests to make the world neo-Fisherian. He has decided the only way to do this is to kill the word “significance” “significant” so he spends time getting signatures. I think this is childish and wrongheaded. When people kill significance tests, they’re not going to run to Fisherian tests–but to methods Hurlbert likes even less than NP.
Trying to chum me back in again, eh, Mayo?
I will bite but only to say:
1) I have never advocated the killing of the word “significance” or to “kill significance tests”, though I have argued that (neo-Fisherian) “significance assessments” is a more appropriate label for them
2) anyone can establish and “point out thresholds” even if there are no objective grounds for defining them or for interpreting or labeling differentially, e.g., P values of .049 and .051
3) even if a threshold is established, the label “statistically significant” is unneeded and superfluous
4) Many people are lemmings. No point in trying to tell them not to “run” off the cliff into the ocean
5) My and my students’ experience over the last 3-4 decades is that doing things the “neo-Fisherian way” creates no obstacles to getting research published, though occasionally there were pauses while we re-educated editors.
6) Have said all these things before!
Eh Hurlbert? Glad to have you back, but the truth is, just proving how little I understand of blogs even after 8 years, I assumed you’d probably never come back to this particular post and was just kvetching into the void. (I didn’t hear back on my last email with you, by the way.) I know that a few years ago you were calling out “significant”, and I told you then that “significance” as in Fisherian significance would be taken out at the same time, and this turns out to be true. Nobody can remember which one they’re not supposed to say, and it’s just a distraction from the real issues. Moreover, in your paper for the special issue on a world beyond p < .05, you speak of “forgoing ‘statistically significant’ and related terms” and acknowledge:
“We are heartened that the three editors of this major special issue …have come out in full support of our thesis. Their introductory editorial states: ‘The ASA Statement on P-values and Statistical Significance stopped just short of recommending that declarations of “statistical significance” be abandoned. We take that step here.’”
So now your word ban is wrapped up in theirs and can never be separated. And when you urge editors to accept your ban, you ask them to accept the larger word ban in ASA II. But I did replace “significance” with “significant” in my comment. I'm very opposed to all such word bans as you know.
Based on 2 papers I wrote with Cox, I have a section in SIST (3.3) on how to remain within the Fisherian tribe and do all that an inferential construal of NP tests can do. Have you read it? It starts on p. 146.
A "moderate" (i.e., non-small) P-value can be used to set upper bounds, or deny upper bounds are warranted–as Cox agrees. It uses exactly the same reasoning as Fisherian significance assessments. It's the lack of the specific alternative that leads to just about all of the problems and abuses Meehl brought out long ago, especially moving from a low p-value to a substantive theory and thinking nonrejections are uninformative.
I've corrected my remark on point (2), indicating it's a correction. It's not being barred from "pointing out" but being barred from taking account of whether the P-value reaches a prespecified level "at all" in deciding how to interpret results.
Too late to write more now.
Mayo,
Yes, our posts crossed in the ether. I agree that the NEJM’s guidelines are largely a rejection of the recent editorial by Wasserstein et al. on statistical significance, in the 2019 special issue of American Statistician. There are, however, a couple sentences in the Harrington, et al., editorial about the new guidelines, in this week’s NEJM, which bend toward the ASA position. You quote one of them:
“[T]he notion that a treatment is effective for a particular outcome if P 0.05. In the past, the NEJM has been fairly strict in prohibiting such claims when p > 5%, but it has been rather libertine in allowing claims of statistical significance when p < 5%, in the face of multiple comparisons without any adjustments. In other words, authors were permitted to declare having found a "statistically significant association" in an observational study, even though the authors conducted multiple comparisons and analyses of exposures and outcomes. The studies I am thinking of may have had a caveat in the discussion section about multiple testing and the inflated rate of Type I error, but the "statistically significant" language was allowed to stand. As you might imagine, such practice confuses judges, lawyers, and regulators, who must interpret and act on studies.
The new guidelines should end this practice that I have described, but the concern I expressed, and which you quoted above from my post, is based upon the belief that expert witnesses or regulators will take the 95% unadjusted confidence interval and provide their own declaration of statistical significance on the basis of the interval's exclusion of a risk ratio of 1.0 (or risk difference of 0.0). As you noted, the new NEJM guidelines do not require adjustments to the confidence interval, no matter how many comparisons are conducted or reported.
Nathan Schachtman
Nathan: I’m missing the sentence with “libertine” in it; I’m going too fast obviously, and am traveling. Is it in the associated article?
Rather than fear they will take the confidence level reported (despite multiple testing) as genuine, I’d worry that all confidence levels will be tainted as failing to have error probability guarantees. This is just one journal, I realize, but CI users should be concerned. Frankly, the CI unitarians–those who not only promote CIs as replacing statistical tests altogether, but utterly reject and derogate stat hyp tests (never minding the duality between CIs and tests)–invite this because, in denouncing tests, one might rightly glean they are denouncing error probabilities as qualifying inferences. They also slide into assigning the confidence level onto the interval, and some of them don’t seem concerned about selection effects. But other users of CIs don’t feel this way, and now their method might be looked at as noninferential. I don’t know. But the “new statistics” might want to reassert its inferential creds (and opt out of ignoring multiple testing).
Uggh; I tried to paste the sentence into the comment field, and it did not go well. I was trying to quote the following sentence from the Harrington, et al., editorial:
“[T]he notion that a treatment is effective for a particular outcome if P 5%, but it has been rather libertine ….
Sorry for how I uploaded that comment.
Back to your concern about the effect on all confidence intervals.
I think that concern is well justified. In my post, I suggested that the NEJM’s new approach is not entirely coherent. They set out a recommendation for prespecification of end points and adjustments for multiplicity, but then tell prospective authors who report multiple end points, without prespecified adjustments, to give the point estimates with the same 95% confidence intervals. The only “penalty” for the looser practice is that the authors cannot use the phrase “statistically significant” and that they must include a caveat that their results “may not replicate” or something to that effect.
So there are now two “classes” of confidence intervals, ones that qualify for declaring associations and effects, and others that will be reported, with uncertain meaning.
There will be confusion, despite the caveats the guidelines call for.
I warned leaders of the CI “crusade” (Hurlbert’s term), that they were shooting themselves in the foot by inventing a radical distinction between CIs and tests, even though CIs were developed as inversions of tests (by Neyman ~1930). This is in a recent presentation, slide #57.
https://errorstatistics.com/2019/07/06/thestatisticswarserrorsandcasualties/
I lost the end quotation mark. The sentence I was calling your attention to was:
“[T]he notion that a treatment is effective for a particular outcome if P 0.05.”
The text following is my commentary, and not part of the quote.
OK; I give up. That time I know I pasted the sentence in from the article, but it is not showing up in my browser as I composed it. I suspect I am doing something wrong here. I will try one more time. The sentence I was quoting was:
“[T]he notion that a treatment is effective for a particular outcome if P < 0.05 and ineffective if that threshold is not reached is a reductionist view of medicine that does not always reflect reality.”
The libertine comment was mine about past NEJM practice, not Harrington's about the new guidelines.
Nathan: Oh, I thought I just kept missing the “libertine” sentence, despite rereading the article. You were saying that in practice NEJM allowed low pvalues to stand even if there was multiple testing? Anyway, I’ll come back to this tomorrow. I’m sorry you had trouble with the comments.
Yes; that was what I was trying to say. In the medical journals, there is typically not a reported p-value, but there are point estimates and 95% C.I.s. What I have seen (under the previous guidelines, not the new ones) was observational studies conducted in administrative databases (such as the Danish, or Swedish, national healthcare datasets), with many tests, none adjusted. These studies were permitted to declare “statistical significance” when the C.I. excluded 1.0 (for risk ratios), with a statement in the Discussion section that multiple testing can inflate the Type I error rates. [Lesser journals, especially in the fields of occupational and environmental medicine, do not even carry such a caveat.] My experience has been that the FDA has not been impressed with such findings, but they have helped fuel specious claiming in court.