This topic may seem a bit strange, but I'm running out of ideas here! Still, it's interesting how often it comes up in mentoring sessions, especially when dealing with A/B testing. So, if you can't answer the question "when are two numbers equal enough?" in a simple sentence, perhaps you'll have something to learn from this article.
First of all, the rationale behind all this. Sometimes we need to make an executive decision about whether to apply one function or another to the data at hand. In A/B testing, this is usually something like "should we go for the equal variances or the unequal variances variant of the t-test?" Of course, when you have two samples, the chances of their variances being exactly equal are minuscule, so why did those old sages of Stats whom we revere so much decide to have two variants of the t-test, based on the equality of the variances involved? Well, the two variants use different formulas: if the variances are the same, they can be pooled into a single estimate and the underlying math is much simpler. But then the question becomes "when are these two variances equal?" Keep in mind that we are talking Stats here, so the rigidity of Math as we know it doesn't apply. We are comfortable with approximations; otherwise, we'd have to abandon the whole idea of Statistics altogether!
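To make the "different formula" point concrete, here is a minimal sketch of the two test statistics using only Python's standard library. The function names pooled_t and welch_t are made up for illustration, and the sketch computes only the t statistic, not the degrees of freedom or the p-value.

```python
import math
import statistics


def pooled_t(a, b):
    """t statistic of the equal-variances variant (pooled variance)."""
    n1, n2 = len(a), len(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)
    # Both sample variances are merged into one pooled estimate.
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    diff = statistics.mean(a) - statistics.mean(b)
    return diff / math.sqrt(sp2 * (1 / n1 + 1 / n2))


def welch_t(a, b):
    """t statistic of the unequal-variances variant (Welch)."""
    n1, n2 = len(a), len(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)
    # Each sample keeps its own variance estimate.
    diff = statistics.mean(a) - statistics.mean(b)
    return diff / math.sqrt(v1 / n1 + v2 / n2)


if __name__ == "__main__":
    a = [1, 2, 3, 4]
    b = [2, 4, 6, 8, 10, 12]
    print(pooled_t(a, b), welch_t(a, b))
```

When the two samples have the same size, the two statistics coincide and only the degrees of freedom differ; with unequal sizes and unequal variances, as in the usage lines above, they diverge.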
In Engineering, two numbers are equal when their difference is within a tolerance margin. We usually express this tolerance as a threshold th, written as a negative power of ten. So, often we have something like th = 10^(-3), which is a fancy way of saying th = 0.001. This kind of approximation, although very handy, may not apply to the problem at hand, since a fixed threshold ignores the scale of the numbers being compared. Besides, few disciplines have the scientific reasoning and discipline that Engineering exhibits, and Stats is not one of them. Also, let's not forget that traditional Computer Science is akin to Engineering, so the approx()-style functions found in many languages tend to follow a similar motif, which makes them a poor fit for the problem mentioned previously.
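As a rough illustration of the tolerance approach, and of why a fixed threshold can mislead with variances, here is a small Python sketch. The function name equal_within_tolerance and the example values are assumptions made for this illustration only.

```python
import math


def equal_within_tolerance(a, b, th=1e-3):
    """Engineering-style check: equal if the absolute difference is at most th."""
    return abs(a - b) <= th


if __name__ == "__main__":
    # Two tiny variances: one is more than double the other,
    # yet the fixed threshold calls them equal.
    print(equal_within_tolerance(0.0004, 0.0009))

    # Two large variances that differ by only 0.05%,
    # yet the fixed threshold calls them different.
    print(equal_within_tolerance(1000.0, 1000.5))

    # The standard library offers the same absolute check via math.isclose
    # when the relative tolerance is switched off.
    print(math.isclose(0.0004, 0.0009, rel_tol=0.0, abs_tol=1e-3))
```

The point of the sketch is that an absolute threshold is blind to scale, which is exactly what matters when comparing variances.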
In Physics, things are a bit different, which is why we often talk about orders of magnitude. So, it's often the case that if two quantities A and B differ by at least an order of magnitude, they are considered very different. This is another way of saying that one is at least ten times bigger than the other. This is something we can apply to our problem, since it gives us a relative rule of thumb to work with. Of course, an order of magnitude is quite a lot when we talk about variances, but we can adapt this to something that makes more sense in Analytics work.
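The order-of-magnitude rule can be sketched in a few lines of Python. The function name differ_by_order_of_magnitude is hypothetical, and the sketch assumes both quantities are strictly positive.

```python
def differ_by_order_of_magnitude(a, b):
    """True if the larger quantity is at least ten times the smaller one.

    Assumes a and b are strictly positive.
    """
    if a <= 0 or b <= 0:
        raise ValueError("both quantities must be strictly positive")
    small, large = sorted((a, b))
    return large >= 10 * small


if __name__ == "__main__":
    print(differ_by_order_of_magnitude(2.0, 25.0))  # 25 is more than 10 * 2
    print(differ_by_order_of_magnitude(2.0, 15.0))  # 15 is less than 10 * 2
```

Unlike a fixed tolerance, this check is relative: scaling both inputs by the same factor does not change the answer.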
So, what about a fixed percentage, maybe one order of magnitude less than 1? This would translate into 10% (since 1 = 100%), something that's not too much but not negligible either. So, if v1 and v2 are the two variances at hand, we can say that if v1 <= (1+10%)v2 and v2 <= (1+10%)v1, we can presume v1 and v2 to be more or less equal. Note that this rule never holds if exactly one of them is 0, in which case the two variances are always considered different from each other. Then again, this makes intuitive sense, since we'd be comparing a static variable with one that varies at least a bit. Also, things get simpler if we use the smaller variance as the reference point: we can just do a single comparison and be done with it. After all, if v2 is the smaller one and v1 <= 1.1*v2, then v2 <= 1.1*v1 holds automatically.
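A minimal Python sketch of this rule follows. It is not the author's script; the function name variances_roughly_equal and the parameter name th are assumptions chosen to mirror the text.

```python
def variances_roughly_equal(v1, v2, th=0.1):
    """True if the larger variance exceeds the smaller by at most th (10% by default).

    Only one comparison is needed: with the smaller variance as the
    reference point, the reverse inequality holds automatically.
    """
    if v1 < 0 or v2 < 0:
        raise ValueError("variances cannot be negative")
    small, large = sorted((v1, v2))
    # If small is 0 and large is not, this is always False,
    # matching the "static variable" case described above.
    return large <= (1 + th) * small


if __name__ == "__main__":
    print(variances_roughly_equal(4.0, 4.3))          # within 10%
    print(variances_roughly_equal(4.0, 4.5))          # more than 10% apart
    print(variances_roughly_equal(4.0, 4.5, th=0.2))  # looser threshold
    print(variances_roughly_equal(0.0, 0.2))          # one variance is zero
```

In this sketch, two variances that are both exactly 0 count as equal, which is a design choice of the illustration rather than something the rule above prescribes.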
In other words, we can use a script like the one attached to this article and not have to worry about this matter much (note that this script allows us to use a different threshold too, other than 0.1). Cheers!
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.