Can you help me with statistics?

Posted by Nathaniel.

I have a statistics question that I need some help with. It’s really a question of “what is the statistical importance of a deviation from a fit?”

Let me illustrate with plots:

1. Here we have some data and a line fit to the data. Everyone I’m sure agrees that this is a good fit.

2. Now, the same data and fit, but with one point that’s off by 1-sigma. 1-sigma events happen all the time (well, roughly 1/3 of the time) so we’d still assume that the fit matches the data well.

3. Now we have a point that’s 2-sigma from the fit. Assuming a normal distribution, that should only happen by chance ~5% of the time, so we start wondering if the deviation of that point is actually a significant event.

4. Now, the real question. Instead of a single deviant point, we have two points. Both of them are 1.3-sigma from the fit. Taken individually, each one has a ~20% probability of sitting that far from the fit purely by chance. However, we “know” that they’re correlated in that the depression of both points is related to something physically going on. How would I determine how statistically significant this depression is?
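The probabilities quoted in the examples above follow from the normal distribution, and a short sketch makes the arithmetic explicit. The last step combines the two 1.3-sigma points into a single z-score with Stouffer's method. That is only one possible approach, not something established in this post, and it assumes that the two points would scatter independently if the fit were correct and that both sit on the same side of it. The function names are made up for illustration.

```python
import math

def two_sided_p(z):
    """Chance that a normal variate lands at least |z| sigma from the mean, either side."""
    return math.erfc(abs(z) / math.sqrt(2))

def one_sided_p(z):
    """Chance that a normal variate lands at least z sigma away on one chosen side."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def stouffer_z(zs):
    """Combine z-scores assumed independent under the 'fit is right' hypothesis."""
    return sum(zs) / math.sqrt(len(zs))

# Single-point cases from examples 2, 3 and 4.
for z in (1.0, 2.0, 1.3):
    print(f"{z:.1f} sigma: two-sided p = {two_sided_p(z):.3f}")

# Two points, both 1.3 sigma low, treated as one combined deviation.
combined = stouffer_z([1.3, 1.3])
print(f"combined z = {combined:.2f}, one-sided p = {one_sided_p(combined):.3f}")
```

Under these assumptions the pair counts as a single deviation of roughly 1.8 sigma, which is more significant than either point alone but still short of the 2-sigma case.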

Needless to say, my real data isn’t faked and is more complex than the example, but I need to figure out the same sort of answer. I’d appreciate any help that anyone can offer. And I apologize for using my astronomer’s imprecise statistical descriptions.

  

9 Responses to “Can you help me with statistics?”

  1. Brian Says:

    In the actual problem you’re working on are you using a linear, least-squares regression, or a non-linear, least-squares regression, or some other regression?

    For a linear least squares regression, the simplest statistic that provides some estimate of the goodness of the fit is the coefficient of determination (for a straight-line fit, the square of Pearson’s correlation coefficient), generally denoted by R^2 (“R-squared”; sorry, too lazy to go into math mode right now).
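For reference, here is a minimal sketch of how R^2 could be computed for an unweighted straight-line fit. The data values and function names are invented for illustration and have nothing to do with the real data set.

```python
def linear_fit(xs, ys):
    """Ordinary (unweighted) least-squares straight line: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys):
    """Coefficient of determination: 1 - (residual scatter) / (total scatter)."""
    slope, intercept = linear_fit(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Invented example data lying close to y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
print(f"R^2 = {r_squared(xs, ys):.4f}")
```

Note that R^2 takes no account of the error bars on the points, so by itself it says nothing about how many sigma a given point is from the fit.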

  2. Brian Says:

    Another question (apologies if I’m too basic in my wording here): if you do the regression, then take the difference between each data point y(x_m) and the value predicted by the regression for x_m, are those differences normally distributed? If they are (and if the relationship between y and x is a linear one, as in the example on the blog), then I think R^2 is the coefficient that you want. If the errors are not normally distributed, I think you want some other coefficient, like chi-squared, but it’s been a while since I’ve done any statistics, so my memory on that point might be flawed.
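A rough sketch of that residual check, assuming each point comes with a known 1-sigma uncertainty (an assumption, since the comment does not say so). It scales the residuals by their uncertainties, counts how many fall within 1 and 2 sigma (roughly 68% and 95% are expected for normal errors), and computes a reduced chi-squared. All names and numbers are invented.

```python
def standardized_residuals(ys, model_ys, sigmas):
    """Residuals in units of each point's own 1-sigma uncertainty."""
    return [(y - m) / s for y, m, s in zip(ys, model_ys, sigmas)]

def fraction_within(residuals, k):
    """Fraction of standardized residuals no more than k sigma from zero."""
    return sum(abs(r) <= k for r in residuals) / len(residuals)

def reduced_chi_squared(ys, model_ys, sigmas, n_params):
    """Chi-squared per degree of freedom; values near 1 suggest a sensible fit."""
    chi2 = sum(r * r for r in standardized_residuals(ys, model_ys, sigmas))
    return chi2 / (len(ys) - n_params)

# Invented data, model predictions, and uncertainties.
ys = [10.2, 11.9, 14.3, 15.8, 18.1, 19.7]
model_ys = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
sigmas = [0.3, 0.3, 0.3, 0.3, 0.3, 0.3]

res = standardized_residuals(ys, model_ys, sigmas)
print("within 1 sigma:", fraction_within(res, 1.0))
print("within 2 sigma:", fraction_within(res, 2.0))
print("reduced chi^2 :", reduced_chi_squared(ys, model_ys, sigmas, n_params=2))
```

With only a handful of points the 68%/95% comparison is crude; it is meant as a sanity check rather than a formal normality test.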

  3. Brian Says:

    One more word: if the variance in the data is not constant, e.g. for some physics reason, the variance is a function of the independent variable, then the data is “heteroscedastic,” and you might want to look up techniques for dealing with heteroscedasticity.
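One common way of handling heteroscedastic data is a weighted least-squares fit, where each point is weighted by the inverse of its variance. The sketch below assumes the per-point uncertainties are known; the function name and the numbers are made up.

```python
def weighted_linear_fit(xs, ys, sigmas):
    """Straight-line fit with weights 1/sigma^2: returns (slope, intercept)."""
    ws = [1.0 / s ** 2 for s in sigmas]
    sum_w = sum(ws)
    # Weighted means of x and y.
    mean_x = sum(w * x for w, x in zip(ws, xs)) / sum_w
    mean_y = sum(w * y for w, y in zip(ws, ys)) / sum_w
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Invented data: the last point is badly off but also has a large error bar,
# so it barely moves the fit away from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.1, 4.9, 7.0, 15.0]
sigmas = [0.1, 0.1, 0.1, 0.1, 5.0]
slope, intercept = weighted_linear_fit(xs, ys, sigmas)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```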

  4. Brian Says:

    Sorry, just had another thought. You could remove the two points from the data set and take the R^2 value of the remaining points (call this A). Then put them back in and take the R^2 value (call this B). The difference between the first number and the second (A-B) will be a measure (in some sense) of how significant the depression is. I don’t know that this procedure would be a well respected statistical technique, but I think it would provide me with a meaningful idea about the significance of the depression if I were reading a paper about it. For good measure you could divide the difference by A: foo=abs[(A-B)/A], and then foo can only vary between 0 and 1.
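The remove-and-compare idea could be sketched as follows. As the comment says, this is an informal measure rather than a standard test, and the data, the indices of the suspect points, and the function names are all invented.

```python
def r_squared(xs, ys):
    """R^2 of an unweighted least-squares straight-line fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

def depression_measure(xs, ys, suspect):
    """Return (A, B, foo): R^2 without the suspect points, with them, and |A - B| / A."""
    kept_x = [x for k, x in enumerate(xs) if k not in suspect]
    kept_y = [y for k, y in enumerate(ys) if k not in suspect]
    a = r_squared(kept_x, kept_y)
    b = r_squared(xs, ys)
    return a, b, abs((a - b) / a)

# Invented data: y = 2x + 1, with points 3 and 4 pushed down by 1.5.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.0, 3.0, 5.0, 5.5, 7.5, 11.0, 13.0, 15.0]
a, b, foo = depression_measure(xs, ys, suspect={3, 4})
print(f"A = {a:.4f}, B = {b:.4f}, foo = {foo:.4f}")
```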

  5. Nathaniel Says:

    Brian, thanks. The real problem that I’m working on is a mass function for one of my globular clusters. It’s basically a function of number of stars at a given mass. For some unknown reason, there are stars missing in a certain mass range even after I’ve accounted for completeness. If you look at the color-magnitude diagram for the cluster, it’s even visible to the eye.

    The problem is that the referee on the paper is the main competitor to the group that I’m working with and he’s refusing to accept “look, you can see it by eye” (which is what his previous papers have done). The real issue is that if it were just a single point that didn’t lie along the fit, I could easily say “this point is 1.7 sigma (or whatever) from the fit and therefore has an x% chance of being due to the Poisson noise.” Instead though, I have a nice line with two points off the line.

    Haha, that however raises a few issues… first of all, the fit is actually a power law and therefore only a straight line in a log-log plot. The errors are Poisson errors from the counting statistics and, since it’s a log-log plot, are not symmetrical. The number of stars in each bin is large though, so the error, regardless of what type of distribution it follows, is fairly small.
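To illustrate why the error bars become asymmetric, here is a small sketch that takes a bin with N stars, uses sqrt(N) as the Poisson 1-sigma error, and converts N - sqrt(N) and N + sqrt(N) to log10 space. The bin counts are invented and the function name is hypothetical.

```python
import math

def log10_poisson_errors(n):
    """Lower and upper 1-sigma error bars on log10(N) for a Poisson count N > 1."""
    sigma = math.sqrt(n)
    lower = math.log10(n) - math.log10(n - sigma)  # bar extending downward
    upper = math.log10(n + sigma) - math.log10(n)  # bar extending upward
    return lower, upper

def symmetric_approx(n):
    """First-order error propagation: sigma_log10(N) ~ 1 / (sqrt(N) * ln 10)."""
    return 1.0 / (math.sqrt(n) * math.log(10))

for n in (25, 100, 10000):
    lower, upper = log10_poisson_errors(n)
    print(f"N = {n:6d}: -{lower:.4f} / +{upper:.4f} dex "
          f"(symmetric approx {symmetric_approx(n):.4f})")
```

The downward bar is always the longer one, and the asymmetry shrinks as N grows, which is why well-populated bins have small and nearly symmetric errors in log space.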

  6. Brian Says:

    I see.
    Have you heard of RANSAC (RANdom SAmple Consensus; http://en.wikipedia.org/wiki/RANSAC)? Not sure if it would be useful here, but it’s a cool name for a statistical technique in any case: “After we RANSACed the data, we determined the following parameters…”

  7. Nathaniel Says:

    RANSAC might be the way to go. If I use that sort of algorithm, I get a very good power-law fit to the bins at higher and lower masses compared to the dip. Compared to that resulting fit, the points in the dip are some 10-sigma low on an individual basis without having to worry about how they correlate to each other. Hopefully that will satisfy the referee.
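A minimal sketch of that kind of procedure, not the actual fit used for the paper: a bare-bones RANSAC line fit in log-log space, followed by the deviation of each rejected bin from the fitted power law in units of the Poisson error on the expected count. The mass bins, counts, tolerance, and function names are all invented, and the good bins are noise-free to keep the example simple.

```python
import math
import random

def fit_line(xs, ys):
    """Unweighted least-squares straight line: returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def ransac_line(xs, ys, tol, n_iter=200, seed=0):
    """Bare-bones RANSAC: try lines through random pairs of points, keep the
    largest set of points within tol of such a line, then refit to that set.
    Assumes all x values are distinct."""
    rng = random.Random(seed)
    indices = range(len(xs))
    best = []
    for _ in range(n_iter):
        i, j = rng.sample(indices, 2)
        slope = (ys[j] - ys[i]) / (xs[j] - xs[i])
        intercept = ys[i] - slope * xs[i]
        inliers = [k for k in indices
                   if abs(ys[k] - (slope * xs[k] + intercept)) <= tol]
        if len(inliers) > len(best):
            best = inliers
    slope, intercept = fit_line([xs[k] for k in best], [ys[k] for k in best])
    return slope, intercept, best

# Invented mass function: N = 1000 * m^-2, with two bins artificially depleted.
masses = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
counts = [1000.0 * m ** -2 for m in masses]
counts[3], counts[4] = 3400.0, 2350.0

log_m = [math.log10(m) for m in masses]
log_n = [math.log10(n) for n in counts]
slope, intercept, inliers = ransac_line(log_m, log_n, tol=0.02)

# Judge the rejected bins in linear (count) space, not log space.
deviations = []
for k, mass in enumerate(masses):
    if k not in inliers:
        expected = 10 ** (intercept + slope * log_m[k])
        n_sigma = (expected - counts[k]) / math.sqrt(expected)
        deviations.append((mass, n_sigma))
        print(f"m = {mass}: expected {expected:.0f}, observed {counts[k]:.0f}, "
              f"{n_sigma:.1f} sigma low")
```

The inlier tolerance (0.02 dex here) is a free choice, and the result can depend on it, which is part of why this sort of approach invites the objection raised in the next paragraph.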

    Of course, I think that Michael will take serious issue with the method since it’s really an “ignore everything that doesn’t fit what you want it to” sort of thing. This is astronomy though, we’re not as rigorous as we probably should be.

    Not to mention the fact that this paper was planned to be a 4-page letter saying “look, it’s an odd cluster” and due to the requirements put on us by the referee it’s now morphed into a 20 page monster.

  8. Michael Says:

    Dude, we’ve already established that you’re allowed to make shit up in Astronomy. :) Nothing new there and I certainly understand pain-in-the-ass referees. I only just got a paper published last week after eight months of back-and-forth with referees.

    More seriously, my first concern after glancing through the post and Brian’s comments concerned the error in a log-log plot. As long as you’re sure that “fairly small” error really is “small enough”, then ok, but log plots can do some goofy things to stats packages. A random sampling method is a nice and direct approach but only if you have a big enough N. (Ignoring “that’s what she said” jokes for now.) To me, a globular cluster sounds like it should have a crapload of stars and therefore the large N limit is happily satisfied, but I know that you often have to throw out a lot, right?

  9. Nathaniel Says:

    Michael, that’s a very good point that I hadn’t really thought of. Since it’s a log-log plot, the errors are really weird. You really need to flip out of log space to see how close you actually are rather than say “the plot looks like it’s 1-sigma”. It turns out that the deviation I’m seeing is actually more like 10-sigma once you get out of log space.
