Sunday, 17 January 2016

Gunpowder Plot: foiled

Just a week ago I hailed the new king, and already there was an assassination attempt. A new paper claims that the statistical significance of the 750 GeV diphoton excess is merely 2 sigma local. The  story is being widely discussed in the corridors and comment sections because we all like to watch things die...  The assassins used this plot:

The Standard Model prediction for the diphoton background at the LHC is difficult to calculate from first principles. Therefore, the ATLAS collaboration assumes a theoretically motivated functional form for this background as a function of the diphoton invariant mass. The ansatz contains a number of free parameters, which are then fitted using the data in the entire analyzed range of invariant masses. This procedure leads to the prediction represented by the dashed line in the plot (but see later). The new paper assumes a slightly more complicated functional form with more free parameters, such that the slope of the background is allowed to change. The authors argue that their more general ansatz provides a better fit to the entire diphoton spectrum, and moreover predicts a larger background at large invariant masses. As a result, the significance of the 750 GeV excess decreases to an insignificant 2 sigma.
     
There are several problems with this claim. First, I'm confused why the blue line is described as the ATLAS fit, since it is clearly different from the background curve in the money-plot provided by ATLAS (Fig. 1 in ATLAS-CONF-2015-081). The true ATLAS background is above the blue line, and much closer to the black line in the peak region (edit: it seems now that the background curve plotted by ATLAS corresponds to a1=0 and one more free parameter for an overall normalization, while the paper assumes fixed normalization). Second, I cannot reproduce the significance quoted in the paper. Taking the two ATLAS bins around 750 GeV, I find a 3.2 sigma excess using the true ATLAS background, and 2.6 sigma using the black line (edit: this is because my estimate is too simplistic, and the paper also takes into account the uncertainty on the background curve). Third, the postulated change of slope is difficult to justify theoretically. It would mean there is a new background component kicking in at ~500 GeV, but this does not seem to be the case in this analysis.

Finally, the problem with the black line is that it grossly overshoots the high mass tail, which is visible even to the naked eye. To be more quantitative, in the range 790-1590 GeV there are 17 diphoton events observed by ATLAS, the true ATLAS background predicts 19 events, and the black line predicts 33 events. Therefore, the background shape proposed in the paper is inconsistent with the tail at the 3 sigma level! While the alternative background choice decreases the significance at the 750 GeV peak, it simply moves (and amplifies) the tension to another place.

So, I think the plot is foiled and the claim does not stand up to scrutiny. The 750 GeV peak may well be just a statistical fluctuation that will go away when more data is collected, but it's unlikely to be a stupid error on the part of ATLAS. The king will live at least until summer.

42 comments:

Albertz said...

As an alternative to "the King" I propose the "Fat Bastard".

Anonymous said...

I tried to reproduce the reduced significance claimed in that paper, and I find the higher significance claimed by ATLAS.

Anonymous said...

Jester, why do you (and the paper) plot an unbinned theory line and binned data? If I've understood correctly, it would help if we saw the binned theory predictions too.

Anonymous said...

Is it possible that the change of slope in the background curve is a detector effect?!

Jester said...

Anon 11:39: I'm not an expert in this topic, but I don't think so. The reconstruction efficiency for photons at this energy is close to 100%, so there's not much room for abrupt changes. If there was another background kicking in (e.g. jets faking photons) they would see the quality of photons dropping, but this is not the case. A change of slope can also be induced by experimental cuts, but the cuts used in this analysis cannot do anything dramatic to photons at several hundred GeV.

Jester said...

Anon 11:19: If you plot a binned background in the interesting invariant mass range, the points will basically lie on the respective background curve.

Anonymous said...

But in the high-energy tail, the problem you mention

the range 790-1590 GeV there are 17 diphoton events observed by ATLAS, the true ATLAS backgrounds predicts 19 events, and the black line predicts 33 events. Therefore, the background shape proposed in the paper is inconsistent with the tail at the 3 sigma level!

would be glaringly obvious with a binned bg prediction, wouldn't it? The plot is misleading as it is.

Jester said...

I think it's glaringly obvious as it is :) But I agree: if you plot binned predictions it makes it clearer that there are a dozen bins with no events where they predict ~1 event.

mfb said...

Concerning the shape for ATLAS: figure 1 in the paper has two functions, the green dashed one matches the ATLAS result, the blue one (which matches the blue one in the plot you copied) is an alternative function with fewer parameters.

33 expected events with 17 observed shows their model is clearly wrong, and I expect more data to show that even better.

Jester said...

That's right, the green line in Fig.1 of the theory paper almost overlaps with the background line drawn by ATLAS (it's a bit lower at the tail). But it corresponds to a different functional form than what's described in the ATLAS note (eq 3). That's another confusing thing: either the theorists' fit to the blue curve is not correct, or ATLAS is using a different functional form than what's written in the note.

Mario said...

"The king will live at least until summer.". When is the data-taking time in 2016? Are there some dates that we should keep in mind, where potential follow-up results are presented? Like, are there conferences where ATLAS and CMS will present intermediate 2016 results? Like the ICHEP conference in 2012. Given the luminosity is as expected/planned, when will they be able to make concrete statements (not only publicly but also internally)?

mfb said...

First collisions are planned for late April. It is expected that a few inverse femtobarns will be available for the summer conferences - hopefully more than the whole 2015 dataset. ICHEP is in August. I would expect that we can see some results there - if the excess appears again with the same strength, it should get really significant. If there is absolutely nothing, we need a new king.

"not only publicly but also internally"?

Mario said...

mfb: Thanks! That was exactly my question, what are the "summer conferences" and when are they? Is it only ICHEP (which is 3-10 August), or are there other "summer conferences"? What does "few inverse fb" mean, can you give the expected numbers? Also, what does "late April" mean exactly, do you have some CERN reference for that? Would be great. Thanks again.

By "not only publicly but also internally" I wonder when the ATLAS and CMS teams know that the 750GeV excess remained there/went away. Of course they know before the public announcement.

mfb said...

ICHEP is the most important one, but there are a couple of others. As the diphoton excess got significant attention from theorists (much more significant than the experimental data: ~10-15 events over about the same background, >150 theory papers), I would guess results are shown there.

I don't know the precise current estimate, but those are never reliable. See last year: the original plan was 12/fb, the result 4/fb. There will also be some difference between "data delivered when the conference starts" and "data analyzed at that point".

'By "not only publicly but also internally" I wonder when the ATLAS and CMS teams know that the 750GeV excess remained there/went away.'
Depends on the analysis strategy, but even with a rushed publication process I guess at least 2 weeks before the results become public. Would be early/mid July if they want to have results for ICHEP.

Mario said...

mfb: Thanks, this is interesting information. You say they expected 12/fb last year but only got 4/fb (precisely, ATLAS shows 3.2/fb and CMS 2.6/fb for the 750GeV plots). Do you have some sources for the expectation of 12/fb? Thanks!

"even with a rushed publication process" - does that mean you expect they will have a publication by that time? That is different from what happened with the Higgs.

mfb said...

The lumi estimate varied a lot over time, here is a talk from Moriond (March 2015) quoting 10/fb or 12/fb depending on which numbers you look at: https://indico.in2p3.fr/event/10819/session/3/contribution/109/material/slides/0.pdf slides 17 and 18.

The 4/fb are delivered luminosity (to both experiments, approximately), the recorded value is lower, and analyses use an even lower one where all necessary detector components were fully operational. The difference between ATLAS and CMS is mainly due to the magnet.


I meant "publication process" as in "something shown in public", not as journal article. Those will take more time.

Mario said...

Great, thanks for that source, looks very interesting (so I might spend some time tomorrow looking at other Moriond talks :) ).

Do you know the reason why much lower than 10/fb was recorded? Was it an accelerator-problem, or because of an unexpected shutdown, or were there some problems with the detectors? Thanks!

Jonathan Davis said...

Hi Jester, thanks for the interest!

What ATLAS plot as their background and what they actually use is a bit confusing. If you look at the note they say explicitly that they set k = 0 in their background fit. However when we fit their function to the data we got what we showed in figure 1 i.e. that the background with k=1 looks like the plot in the ATLAS note.

So it depends on if you believe their plot or the text of the note. I have a feeling they just accidentally plotted the fit with the k=1 component, but who knows.

Also remember the function we picked is arbitrary, just as the ATLAS one is. We could have picked one which didn't have such a steep tail. Though I do not think the best-fit form of our function over-shoots the tail, since it still falls within the error bars. In any case this is reflected in the quality of the fit and in the likelihoods we plot in figure 2. What are the errors bars you used for your claim of 3 sigma tension? By eye this seems wrong.

So our point is a more general one than just picking a new function. We were trying to show that a result which depends so sensitively on the choice of empirical background function should not be trusted. We were not necessarily suggesting that we had invented a better function for the background. For example this would not have happened for the Higgs search, since the background is well constrained both above and below the resonance.

Also, if you would like to email me the parameter ranges for nuisance parameters you used in your profile likelihood fit I would be happy to help understand why you could not reproduce our significances.

Thanks,
Jonathan Davis

Jester said...

Hi Jonathan, thanks for the comments. It'll be indeed great to sort out the details of the background fitting in ATLAS.
For our discussion, let's start with the tail. The number I quoted is purely statistical error. Integrating your best fit curve from 790 to 1590 GeV I'm getting the prediction of 33 events (up to digitization error). The number of observed events in that part of the tail is 17. The Poissonian probability of 33 fluctuating down to 17 is 0.16%, which corresponds to 3.2 sigma. My claim is that your best fit screws up the tail more than it improves the 750 GeV region.
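For anyone who wants to check this arithmetic, here is a minimal Python sketch of the tail estimate using only the standard library. The function name `poisson_cdf` and the variable names are mine, and since the thread does not say whether the quoted sigma is one-sided or two-sided, the sketch prints both conversions.

```python
import math
from statistics import NormalDist

def poisson_cdf(k, mean):
    """P(X <= k) for X ~ Poisson(mean), with each term computed in log space."""
    return sum(
        math.exp(-mean + i * math.log(mean) - math.lgamma(i + 1))
        for i in range(k + 1)
    )

expected = 33  # integral of the paper's best-fit curve over 790-1590 GeV (digitized)
observed = 17  # ATLAS events observed in the same range

# Probability that a Poisson mean of 33 fluctuates down to 17 events or fewer.
p_down = poisson_cdf(observed, expected)

# Convert the p-value to Gaussian sigmas under both common conventions.
z_one_sided = -NormalDist().inv_cdf(p_down)
z_two_sided = -NormalDist().inv_cdf(p_down / 2)

print(f"P(N <= {observed} | mean = {expected}) = {p_down:.3%}")
print(f"one-sided: {z_one_sided:.2f} sigma, two-sided: {z_two_sided:.2f} sigma")
```

This only accounts for the statistical fluctuation of the count; it ignores any uncertainty on the predicted 33 events.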

dhrou said...

I guess there is a basic misunderstanding in this paper, which is that it ignores the fact that empty bins (in the plot shown by Jester) are significant.
Bottom right of page 4 says "BIC = −2 ln L + k ln n, where k is the number of parameters in the model and n = 27 is the number of data-points." n=27 is indeed the number of points visible on the plot, however one should also consider the empty bins with zero events. By eye, there are actually 36 bins, so 36 data points. This is the reason why the proposed function clearly overshoots the data at high mass, as Jester spotted by quoting the expected and observed integrals.
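To see how much the choice of n matters for the penalty term alone, here is a small Python sketch of the formula quoted above. The helper `bic` is hypothetical, and no real fit values are used, because the -2 ln L values and parameter counts of the two fits are not given in this thread.

```python
import math

def bic(neg2_log_l, k, n):
    """Bayesian information criterion: -2 ln L + k ln n."""
    return neg2_log_l + k * math.log(n)

# Penalty per free parameter for the two bin counts discussed here.
penalty_27 = math.log(27)  # only the bins with at least one event
penalty_36 = math.log(36)  # all bins, including the empty ones
extra_per_parameter = penalty_36 - penalty_27

print(f"ln 27 = {penalty_27:.3f}, ln 36 = {penalty_36:.3f}")
print(f"extra penalty per parameter with n = 36: {extra_per_parameter:.3f}")
```

The shift in the penalty is modest. If the empty bins are also left out of the likelihood itself, the larger effect would come through the -2 ln L term.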

Jonathan Davis said...

Ah yes I see your reasoning. However the low-energy bins have uncertainties which are at least twice the size of the Poisson ones (for example the events with one event, have an upper error bar going up to 3.2, while it would reach only 2 with Poisson errors). So I think you should use the error bars on the ATLAS plot instead, in which case you get something closer to 1.8 sigma tension.

Obviously some tension remains, but I don't think this is entirely a surprise, especially since the points are discrete here. Indeed any fit through discrete points like this would have some tension.

Also just to respond to dhrou as well. We also tried the BIC with n = 36, and you get a similar conclusion regarding the BIC. Originally in fact we did use that number, but it's not clear from the definition of the BIC which one to use. Rest assured that we spent several weeks just on the numerics for this paper, and that we have done a lot of tests on our results, including cross checking them using two completely different pieces of code.

In any case it is up to you to decide what conclusions you take from our paper. Personally I do not trust a result which varies so much just with a small change in the empirical background. However this is of course my own view.

Jonathan

Ervin Goldfain said...

Jonathan,

Your point makes a lot of sense, as a matter of principle. If the significance is over-sensitive to the details of the processing algorithms, the signal is likely to be inconclusive.

The hope is that the uncertainty will start to clear up by the end of the year, although this may not be a sure bet at this point. More data in uncharted territory may also boost the "noise".

Anonymous said...

ohh, someone just got owned!

RBS said...

Jester,
please forgive me what must be a stupid layman's question as I'm trying to understand the conundrum. So, because we cannot estimate the background reliably from the theory (to which we must compare the observations to see if anything could be there that is not a part of the theory), we try to simulate it somehow. So we take a curve with multiple parameters, resolve the parameters by fitting it to the regions where we know there's nothing and see what it would yield (i.e. extrapolate) to the regions where we don't know for sure. This approach, taken naively causes two immediate questions:
1. For the "fitting" regions - do we know for sure that nothing is there?
2. And for the curve - there should be something about it that could still derive from the theory, or general truths - e.g. no sudden twists, etc. Otherwise, couldn't we find a parametrized curve to fit pretty much any observed data?
Thanks!

Anonymous said...

Jonathan,

Can you elaborate on the ambiguity in the definition of the BIC? To me, it's quite clear that n=36 is correct. Why would it be correct to remove data from the chi-squared-like term in the BIC (-2\ln L) and reduce n in the term that penalizes parameters (k\ln n)?

Please can you clarify whether this is what you did, and if so, why? What are the values of -2\ln L and BIC with all 36 data points?

In the LLR test-statistic, q_\sigma, do you consider all 36 data points or only your smaller set of 27?

Jester said...

Jonathan, the ATLAS error bars are just Poissonian statistical errors. More precisely, they are not plotting the variance Sqrt[N] (thank god!), but instead they show a sort of 1 sigma confidence interval. The upper limit of the error bar is the mean value for which the Poissonian probability of observing N or fewer events is ~16%, and the lower limit of the error bar is the mean value for which the probability of observing N or more events is ~16%. For large N this reduces to the usual Sqrt[N] error bars. I guess this procedure has a clever-sounding name... maybe someone who knows statistics theory better than me could comment....
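The interval construction described above can be sketched in a few lines of Python. This is a minimal illustration under my own assumptions, not the ATLAS code: the names `poisson_cdf`, `solve_mean` and `central_interval` are hypothetical, the tail probability is taken as exactly 0.16, and the limits are found by simple bisection.

```python
import math

def poisson_cdf(k, mean):
    """P(X <= k) for X ~ Poisson(mean)."""
    if k < 0:
        return 0.0
    if mean == 0.0:
        return 1.0
    return sum(
        math.exp(-mean + i * math.log(mean) - math.lgamma(i + 1))
        for i in range(k + 1)
    )

def solve_mean(k, target):
    """Find the mean for which P(X <= k) equals target (the CDF falls as the mean grows)."""
    lo, hi = 0.0, k + 10.0 * math.sqrt(k + 1.0) + 10.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if poisson_cdf(k, mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def central_interval(n, tail=0.16):
    """Error bar for n observed events, with probability `tail` left on each side."""
    # Lower limit: mean for which P(X >= n) = tail, i.e. P(X <= n - 1) = 1 - tail.
    lower = 0.0 if n == 0 else solve_mean(n - 1, 1.0 - tail)
    # Upper limit: mean for which P(X <= n) = tail.
    upper = solve_mean(n, tail)
    return lower, upper

for n in (0, 1, 100):
    low, up = central_interval(n)
    print(f"N = {n}: [{low:.2f}, {up:.2f}]")
```

For N = 1 this should land close to the interval read off the plot, and for large N the half-width approaches Sqrt[N].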

I'm not sure what you mean by "use the error bars on the ATLAS plot". Do you use their measured values and error bars to define a Gaussian chi^2? This is OK for high-N bins, but not for low statistics ones. What do you do for the bins where there are zero events and errors are not displayed? Do you ignore them, as suggested by dhrou? I don't think that "any fit through discrete points like this would have some tension" once the statistical procedure is correctly defined. I think the correct procedure is to define the usual Poisson likelihood for all 36 bins. With this procedure, the ATLAS curve leads to only 1 sigma tension at the tail, and your green curve to even less than that.
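To illustrate why the zero-event bins matter in a binned Poisson likelihood, here is a minimal Python sketch. The bin contents and background predictions below are hypothetical numbers chosen only for illustration; they are not the ATLAS data or either fitted curve, and the function name is mine.

```python
import math

def neg2_log_likelihood(observed, expected):
    """-2 ln L for independent Poisson bins with the given expected counts."""
    total = 0.0
    for n, mu in zip(observed, expected):
        total += 2.0 * (mu - n * math.log(mu) + math.lgamma(n + 1))
    return total

# HYPOTHETICAL high-mass tail: event counts and a background prediction per bin.
observed = [2, 1, 0, 1, 0, 0, 0, 0]
expected = [1.5, 1.0, 0.7, 0.5, 0.3, 0.2, 0.15, 0.1]

all_bins = neg2_log_likelihood(observed, expected)

# Same calculation after discarding the bins with zero events.
kept = [(n, mu) for n, mu in zip(observed, expected) if n > 0]
nonempty_only = neg2_log_likelihood([n for n, _ in kept], [mu for _, mu in kept])

# Each empty bin contributes 2 * mu, so a curve that predicts events
# where none are seen is penalised only if the empty bins are kept.
missing = sum(2.0 * mu for n, mu in zip(observed, expected) if n == 0)

print(f"-2 ln L, all bins:        {all_bins:.3f}")
print(f"-2 ln L, non-empty bins:  {nonempty_only:.3f}")
print(f"penalty lost by dropping: {missing:.3f}")
```

A fit that skips the empty bins never pays the 2 * mu term, so it has no reason to keep the predicted background low in the tail.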

I agree that the background fitting is not very carefully described in the ATLAS note, and that they should provide more details in the paper. However, I think there is no hint in the data (and no theoretical motivation) for a more complicated background shape. Based on your explanations, it seems to me that your conclusions are due to ignoring zero-event bins in your fit.

Anonymous said...

I agree re the error bars - they show a two-tail 68% confidence interval for the expected number of events from Poisson statistics.

In the example,

for example the events with one event, have an upper error bar going up to 3.2, while it would reach only 2 with Poisson errors

With n=1 events observed, the two-tailed 68% upper limit for the expected is 3.2. See where 0.32/2=0.16 and x=1 occur in this look-up table:

https://onlinecourses.science.psu.edu/stat414/sites/onlinecourses.science.psu.edu.stat414/files/lesson12/SmallPoisson01.gif


it's at ~3.2. Or calculate it explicitly.

Anonymous said...

"they are not plotting the variance Sqrt[N]"

I guess you mean

"they are not plotting the square root of the variance (Sqrt[N])"?

Jester said...

sure

mfb said...

@Mario: mainly accelerator issues. More radiation damage than expected, more heating of elements during injection and operation, magnets not as trained as expected, ...

@Jester: It is an application of the Feldman-Cousins method ( http://arxiv.org/abs/physics/9711021 ).

@Jonathan: Okay thanks, your comments here clarified that you do not understand the ATLAS error bars, or statistics with low event numbers in general.

Anonymous said...

@mfb I don't think the ordering rule is FC, actually - it seems to be symmetric two-tail.

For n=1 observed, we see an interval of about 0.2 to 3.2 (see eg the right-most point on the plot).

Use the table I linked above

https://onlinecourses.science.psu.edu/stat414/sites/onlinecourses.science.psu.edu.stat414/files/lesson12/SmallPoisson01.gif

which is for cumulative Poisson, p(X<=x; \lambda), we find an upper limit for \lambda with x=1 and p=0.32/2=0.16, at \lambda=~3.2 and a lower limit at x=0 and p= 1. - 0.32/2 = 0.84 at about \lambda=0.2. This agrees well with the numbers from the plot.

See slide 20 of vuko.home.cern.ch/vuko/teaching/stat09/CL.pdf
for the relevant formulas.

Summarizing
===========

Quite a helpful thread, if I may summarise, we found 2 ostensibly big problems/questions about Davis et al (as well as issues in original post):

* Misinterpreted error bars, which show confidence intervals from Poisson statistics (68% symmetric two-tail). These were possibly interpreted somehow as 1\sigma errors in Gaussian likelihoods. This will be alright for large n by the CLT, but we aren't in that regime.

* Omitted empty bins, leaving only 27 data points rather than 36 with 9 empty. This occurred for at least the BIC, and possibly for the LLR (this is unconfirmed). I find this quite alarming.

Jonathan, are these big issues? How do they impact your quoted significances? Do they impact your conclusions?

Anonymous said...

If we take only the data points that are below 1 TeV, then I wonder what happens to the significance of the excess if we fit them using:

a) The Atlas function

b) The function from Jonathan's paper.

Jester said...

Probably there would not be a significant preference for either hypothesis

Ervin Goldfain said...

Jester,

Could you elaborate on your latest thoughts regarding Jonathan's paper and the updated ATLAS curves?

Thanks!

Jester said...

There will be a separate post next week. Brace for a perfect storm :)

Ervin Goldfain said...

Enough already with perfect (snow...) storms :-)

tulpoeid said...

"Partly changed my mind about http://arxiv.org/abs/1601.03153 While v1 has serious errors, valid concerns about ATLAS diphoton background are raised"

Hey Jester, how about an update on that in the comments section?

Jester said...

As of today, I think the tweet was premature. The background fitting procedure is not described carefully in the ATLAS conf note, which caused some misinterpretation by theorists. Once this is fixed, there is no big discrepancy between ATLAS and theorists' estimates.

kkumer said...

Concerning zero-event bins, wouldn't it be appropriate to
plot one-sided error bars there (going to 1.8 or so?)?
That would help avoiding this confusion with bin counting.

Jester said...

I agree this would help to guide the eye

mfb said...

It could help in this particular case, but there are searches where it would be annoying to have all those bins. Consider this plot for example: https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/CONFNOTES/ATLAS-CONF-2015-043/fig_01f.png
Would it help to have tens of long black lines in the plot?

mfb said...

An arxiv entry from Bradley J Kavanagh, testing the ATLAS excess with various background models: http://arxiv.org/abs/1601.07330
He gets consistent local significances close to the ATLAS value, with all parametrizations, including the one from Davis et al.