Everybody knows the A/B test: it has become an essential tool for exploring user preferences and patterns, and a way to innovate systematically through a chain of (managed) experiments. But an A/A test, seriously? Yes, this is not a typo (after all, on a QWERTY keyboard A and B are quite far apart 😉 ) and there are good reasons for this test to exist (and be used). But let’s get to it step by step …
Oops, it happened again …
Imagine you had a glitch in the system that led to a sudden (perhaps even undesired) A/B test of the customer experience. Some people received the service as expected, while others were cut off from this element of the experience. You do a post-mortem and see that the outage positively influenced their shopping behavior. However, as this was a complicated glitch, you can’t easily replicate the test to find out whether switching off the experience element completely would actually be a better long-term strategy. It was still a valid A/B test, but since the assignment to groups was not controlled, you are not sure whether the uptick in the B-group’s shopping behavior can be attributed to the change in experience or simply to a skewed (non-random) assignment of the user sample by the glitch itself. How would you tell?
Your existing A/B testing platform had fallen behind the curve, so you decided to shop for an alternative solution. After implementing the new tool, all of a sudden some of your experiments start to show significantly larger gaps between the test and control groups. Some of the gains are almost too good to be true. But you have already cancelled your previous software subscription, so you cannot replicate the same experiment on the old platform any more. How would you find out whether your campaigns really started to work better, or whether the new tool is simply “wired differently” in how it runs the tests?
There is a way
As you can read above, both of the depicted scenarios stem from the real life of online marketers. I bet you may have experienced some variation of them firsthand. Luckily, there is a solution for both; you don’t have to throw away the data and start all over again. And yes, the solution is indeed our mysterious A/A test. So how does it really work?
A and A (again), seriously?
The original idea of the A/B test is quite simple: if mutually comparable groups of users are subjected to different treatments (and this treatment is the only substantial difference between them), and one of the groups behaves significantly differently as a result, then there is a high probability (yes, don’t forget it is still a probabilistic finding) that the change in experience and the change in behavior are linked. The major vulnerability of this experiment, unfortunately, is the assumption that user samples A and B are really comparable. So what happens if we already have the result of an A/B test but have no proof of, or insight into, how correctly the selection into groups happened?
Well, luckily, this experimental logic also works the other way around. If two groups undergo identical treatment and end up behaving the same, there is a high probability (yes, probabilistic here again) that the groups had a similar user distribution as well.
Therefore, if we have two groups and want to find out whether they are somewhat similar in user mix, we can subject them to the same treatment and watch whether they produce statistically indistinguishable results at a high confidence level. That is exactly the essence of our A/A test. Here the “A+A” signifies not that the same group was used twice, but that the different groups (the A- and B-groups from the original experiment) are subjected to the same treatment in a second experiment. This way we can try to “learn something about the similarity” of the groups ex post, after the A/B test has already been completed.
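To make this concrete, here is a minimal sketch of how such an A/A check could be evaluated, assuming both former groups are now exposed to the identical experience and we observe a binary outcome (e.g., purchase yes/no). The group sizes and conversion counts below are illustrative assumptions, not real data; the comparison itself is a standard two-proportion z-test.

```python
# A/A check sketch: two groups, identical treatment, binary outcome.
# All numbers below are made up for illustration.
from math import sqrt
from statistics import NormalDist

def aa_test(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-proportion z-test: do the two groups behave the same under
    identical treatment? A non-significant result is consistent with the
    groups having a comparable user mix (it does not prove it)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)            # pooled conversion rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))        # two-sided p-value
    return z, p_value, p_value >= alpha                 # True -> groups look comparable

# Former A- and B-groups, now both wired to the same branch of the flow:
z, p, comparable = aa_test(conv_a=480, n_a=10_000, conv_b=495, n_b=10_000)
print(f"z = {z:.3f}, p = {p:.3f}, groups comparable: {comparable}")
```

Note the asymmetry of the conclusion: a significant difference here is strong evidence that the original assignment was skewed, while a non-significant one only fails to flag a problem.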
Using an A/A test is thus an easy way to double-check (or compensate for the lack of) proper initial user assignment. Please note that while the A/B test uses sample similarity to point out a behavior difference, its A/A cousin uses the (absence of a) behavior difference to point out sample similarity. That also means that an A/A test does not:
- say anything (additional) about the strength of the behavioral difference in the original A/B test, nor does it serve as any proof of it (if the difference in behavior between the A- and B-groups was statistically insignificant, it remains so even after a successful A/A test);
- prove general similarity of the original A and B groups. It signals similarity only for the behavior(s) relevant to the original A/B experiment;
- generate any new insights about the users (which is why opponents of A/A testing often contest it as a waste of testing capacity).
The main added value of A/A testing is that it can be run (almost) no matter what the original experiment was, and it is easy to set up. After all, you just need to wire both groups to the same branch of the process. The A/A test is therefore a quick remedy for “unusual” set-ups or hiccups in proper A/B testing.
Its simplicity, of course, comes with some controversy. Some practitioners argue against A/A tests for not being the most robust way to prove A/B group similarity (heavyweight multinomial distribution comparisons are), for not always being possible to rerun faithfully (e.g., if the original B-group condition altered long-term perception of the service), and for carrying the opportunity (or real) cost of not running other experiments instead.
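For completeness, the “heavier” alternative mentioned above can be sketched as a direct comparison of the two groups’ user-mix distributions, here via a chi-squared test of homogeneity over one categorical attribute. The segment names and counts are made-up assumptions for illustration; a real check would cover several attributes.

```python
# Compare user-mix distributions of the two groups directly.
# Segment counts below are illustrative, not real data.
A = {"new": 3000, "returning": 5000, "vip": 2000}
B = {"new": 3050, "returning": 4900, "vip": 2050}

def chi2_homogeneity(a, b):
    """Chi-squared test of homogeneity for two groups over the same
    categorical attribute. Returns the test statistic (df = categories - 1)."""
    total_a, total_b = sum(a.values()), sum(b.values())
    grand = total_a + total_b
    stat = 0.0
    for seg in a:
        col = a[seg] + b[seg]
        exp_a = col * total_a / grand        # expected count in group A
        exp_b = col * total_b / grand        # expected count in group B
        stat += (a[seg] - exp_a) ** 2 / exp_a
        stat += (b[seg] - exp_b) ** 2 / exp_b
    return stat

CRITICAL_05_DF2 = 5.991  # chi-squared critical value, df = 2, alpha = 0.05
stat = chi2_homogeneity(A, B)
print(f"chi2 = {stat:.2f}, mixes differ significantly: {stat > CRITICAL_05_DF2}")
```

Unlike the A/A test, this compares the samples themselves rather than their behavior, which is why it is considered more robust, and also why it needs richer per-user data to run.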
I am far from promoting A/A tests as a silver bullet; avoid them if any of the above-mentioned counter-arguments hold true in your situation. However, the A/A test and its proper set-up should still be part of your toolbox; the situation may turn it into the cheapest (and quickest) way to heal an improperly set-up experiment. Especially so if you have to assess the results of A/B tests conducted by somebody else before you.
Published on 18 July 2022.