Wednesday, March 12, 2014


In the analytics world, arguments sometimes flare up around statistics, and there is a worrying tone in some of them that seems to talk down to, or skip over, the main tenets of the science. We see phrases such as "statistics assumes blah", making it sound like the entire science is based on weak assumptions.

It is not.

Let's say we are sampling from a crazy-ass distribution which represents the entirety of whatever data we are looking at. Say, the shapes of rocks in 3D. So the data is everything there is, and let's say the distribution that this data came from is freaky. Who created it? For the sake of the argument, let's say God did. We can't even work with the math of this distribution, it's that crazy (even if we knew it), but you know what? God can.

Anyway. Now let's say all we have are small, tiny samples from this big-ass data coming from the crazy-ass distribution, and we do something simple with them, like averaging them. We are lowly humans down here, we have small data, and simple little operations. We have to make do with what we have.

Here is the kicker though: A theorem (the Law of Large Numbers) says that, as the sample grows, this simple average will approach the "true" mean of the unknown, crazy-ass distribution that would need crazy-ass math to compute. So we humans with our simple little operations and bitchy little data can have a peek at the creation of God. Like, WOW.
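To see the kicker in action, here is a minimal sketch in Python. The "crazy" distribution is entirely made up for illustration (a hypothetical mixture of an exponential, a uniform, and a flipped Pareto piece), chosen only because its true mean can be worked out by hand as 4.7. The names `draw_weird` and `sample_mean` are mine, not from any library.

```python
import random
import statistics

# Assumed, made-up mixture standing in for the "crazy" distribution.
# True mean = 0.5 * 1.0 + 0.3 * 15.0 + 0.2 * (-1.5) = 4.7
TRUE_MEAN = 0.5 * 1.0 + 0.3 * 15.0 + 0.2 * (-1.5)


def draw_weird(rng):
    """Return one draw from the odd three-part mixture."""
    u = rng.random()
    if u < 0.5:
        return rng.expovariate(1.0)      # exponential, mean 1
    if u < 0.8:
        return rng.uniform(10, 20)       # uniform on [10, 20], mean 15
    return -rng.paretovariate(3.0)       # flipped Pareto (alpha = 3), mean -1.5


def sample_mean(n, seed=0):
    """Plain average of n independent draws, seeded for repeatability."""
    rng = random.Random(seed)
    return statistics.fmean(draw_weird(rng) for _ in range(n))


if __name__ == "__main__":
    for n in (10, 1_000, 100_000):
        print(n, round(sample_mean(n), 3), "true mean:", round(TRUE_MEAN, 3))
```

With a handful of draws the average can land pretty far from 4.7, but as n grows it should settle close to it, even though nothing in `sample_mean` knows what the distribution looks like.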

This is not some made-up concoction, or based on weak assumptions. It is Iron Law. In geek, this is a theoretical relation between the small and the big. Another theorem (the Central Limit Theorem) states that the distribution of such averages will approach the Normal distribution, but that's another topic. All in all, there is really only one assumption here, which is that the entire data comes from "a" distribution (with samples drawn independently, and a mean that actually exists), and that isn't entirely unrealistic. The big / small relation is the trick, the add-on, the lever here, and based on these theorems (which are proven), many tools can be developed, such as confidence intervals, p-values etc. These tools are essentially what is called Statistics.
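Here is a rough sketch of one of those tools, the confidence interval, leaning on the Normal behavior of averages. The setup is an assumption for illustration: the "unknown" distribution is a skewed exponential with true mean 1.0, and the interval is the textbook average plus or minus 1.96 standard errors. The function names are hypothetical.

```python
import math
import random
import statistics

TRUE_MEAN = 1.0  # mean of an exponential distribution with rate 1 (assumed setup)


def interval_covers_truth(rng, n):
    """Draw n skewed samples, build a rough 95% interval, check if it holds the true mean."""
    xs = [rng.expovariate(1.0) for _ in range(n)]
    m = statistics.fmean(xs)
    # Standard error of the average, estimated from the sample itself.
    half_width = 1.96 * statistics.stdev(xs) / math.sqrt(n)
    return m - half_width <= TRUE_MEAN <= m + half_width


def coverage(trials=2000, n=100, seed=0):
    """Fraction of repeated experiments whose interval contains the true mean."""
    rng = random.Random(seed)
    hits = sum(interval_covers_truth(rng, n) for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    print("fraction of intervals containing the true mean:", coverage())
```

If the theorems hold up, the printed fraction should come out near 0.95, typically a bit under it at this sample size because the exponential is skewed. The point is that the interval is built only from the small sample, yet it says something reliable about the big thing.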

[geek] Intuitively it should make sense that even a simple average of the sample approaches the true mean, because the sample points are more likely to have been "generated" near where the true distribution puts its weight, but it is good to see this proven mathematically [/geek].

There is no hand-waving necessary while explaining these concepts. They are as mathematical as any other, such as geometry or calculus, which also start from certain axioms and make certain (realistic) assumptions.