Get out your phone and take a look at your Facebook app’s tab bar.


If you compare it with a friend's – chances are it looks different. This is because Facebook, like most of the world's best companies, is running thousands of tests on its product all the time. (Check out this spreadsheet from UX wizard Luke Wroblewski to see more about the Facebook tab bar!)

AB testing is an experimentation method that gives you a quantitative understanding of how changes impact product metrics. It works by showing equally sized groups of users different versions of the same product and comparing their behaviour.

You can AB test anything, from something as simple as changing the colour of a button to adding a brand new feature to a product. Knowing how and what to test is crucial, and I'll run through some of the key points to help you be a red-hot product manager.

1) Build a strong hypothesis

Before you run the test, make sure you truly understand the problem you're trying to solve and the hypothesis you've built to solve it. Here's a simple example:

Goal: The company wants to improve top-line revenue

Hypothesis: By introducing external advertising, I will drive $X ARPU from non-payers whilst not impacting their retention and engagement over the mid to long term.

When building a goal and hypothesis, make sure you:

  • Do qualitative research where practical – i.e. surveys, interviews, or focus groups
  • Be specific with your hypothesis – test something that can be proven true or false
  • Identify the key metrics and the ‘anti-metrics’ – the metrics you want to protect from going down
  • Start by testing the most basic assumption – in this case, will ads hurt retention? If they don’t, you can explore different types of ads
  • Baseline your hypothesis – try to predict the amount of impact your change will have, which will highlight where metrics are not moving normally (see the sketch below)
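As a worked example of baselining, here's a minimal sketch (in Python) of estimating the sample size you'd need to detect the uplift your hypothesis predicts. The baseline rate and expected uplift below are hypothetical placeholders – plug in your own numbers.

```python
# Minimal sketch: how many users per group you'd need to detect the
# uplift your hypothesis predicts on a conversion-style metric.
# The baseline rate and expected rate below are hypothetical.
from scipy.stats import norm

def sample_size_per_group(p_baseline: float, p_expected: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Standard two-proportion sample size, normal approximation."""
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance threshold
    z_power = norm.ppf(power)           # desired statistical power
    variance = p_baseline * (1 - p_baseline) + p_expected * (1 - p_expected)
    effect = p_expected - p_baseline
    return int((z_alpha + z_power) ** 2 * variance / effect ** 2) + 1

# e.g. a 3% baseline conversion that we predict will rise to 3.3%
print(sample_size_per_group(0.03, 0.033))  # roughly 53,000 users per group
```

If that number is bigger than your user base, that's an early signal to test a bolder change or a coarser metric.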

2) Understand when to add users to your test

As part of setup, you will need to decide at what point a user gets added to the test. A good rule of thumb is to trigger assignment as close as possible to the thing being tested. For example, if you’re testing the product listing page on your e-commerce site, don’t trigger a hit when someone first lands on your website – only when the actual page is rendered. This will make your results MUCH more accurate.

Make sure not to trigger it too late either. For example, if you’re testing a button design, triggering the hit on the button click would be too late and won’t give you any results.
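To make the trigger point concrete, here's a minimal sketch of deterministic assignment that only logs a user into the test when the tested page actually renders. The function and experiment names are illustrative, not any specific framework's API.

```python
# Minimal sketch: assign deterministically, but only log the exposure
# when the tested surface renders. Names here are illustrative.
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants=("control", "variant")) -> str:
    """Hash user_id + experiment so the same user always gets the same group."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

def log_exposure(user_id: str, experiment: str, group: str) -> None:
    """Stand-in for your real analytics pipeline."""
    print(f"exposure: {experiment} user={user_id} group={group}")

def on_listing_page_render(user_id: str) -> str:
    # Trigger the hit HERE, when the listing page renders --
    # not on site entry (too early), not on button click (too late).
    group = assign_variant(user_id, "listing_page_redesign")
    log_exposure(user_id, "listing_page_redesign", group)
    return group
```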

3) Make sure you’re using two control groups. 

The control group is the ‘A’ in an AB test: in order to measure the impact of a change, you need an equally sized group of users who see no change. However, unless a result is 100% statistically certain (which almost never happens), there is a risk of getting a false positive. You can mitigate this by running TWO control groups.

This means the test has two groups of the same size as the variant(s) that see no changes. You can then compare the two controls, and you should see no differences between them. If you do see significant differences, your test is likely producing false positive (or negative) results.
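In practice this is just an A/A check: compare the two controls on your primary metric and confirm there's no significant difference. A minimal sketch, with hypothetical conversion counts:

```python
# Minimal sketch of an A/A sanity check between the two control groups.
# The conversion counts and user counts below are hypothetical.
from statsmodels.stats.proportion import proportions_ztest

conversions = [1520, 1498]    # control A, control B
users = [50000, 50000]

z_stat, p_value = proportions_ztest(conversions, users)
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")
# A large p-value means the controls look alike, as they should.
# A small one (e.g. < 0.05) suggests a broken split or a tracking bug.
```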

4) Be careful to assess the long-term impact 

Even though AB tests can show great results in isolation, if you are running many, the gains are very rarely cumulative. There is a FANTASTIC article from Airbnb on this exact point, the ‘winner’s curse’, which I encourage you to read. You need a way to test the long-term impact of tens or hundreds of experiments.

The way we do this at Badoo is to have a ‘super control’ group: a small percentage (5%) of all our users globally, across all countries and demographics. These users are completely protected from all AB testing – they see no changes from any test, even when a variant has won and been rolled out to everyone else. At the end of the quarter (or year, etc.), you can compare the super control with the rest of your user base to see the long-term difference in your metrics, then reset and pick a different 5% for the next quarter.
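The quarter-end read-out itself is simple. A minimal sketch, assuming your user-level metrics live in a pandas DataFrame (the column names and numbers are hypothetical):

```python
# Minimal sketch: compare the untouched 'super control' holdout
# against everyone else at quarter end. Data below is hypothetical.
import pandas as pd

users = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "super_control": [True, False, False, True, False, False],
    "quarterly_revenue": [1.10, 1.45, 1.30, 0.95, 1.60, 1.25],
})

by_group = users.groupby("super_control")["quarterly_revenue"].mean()
lift = by_group[False] / by_group[True] - 1
print(f"cumulative lift of all shipped tests vs holdout: {lift:.1%}")
```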

5) Understand significance – but don’t always be held to it 

There are a lot of differences of opinion on this; this is my ten cents. The significance level tells you how likely an observed change is to be real rather than random statistical noise.

I always look for significance in order to make decisions, and I usually wait for it, but for me it’s not the be-all and end-all. Particularly for startups with low user numbers, it might be impossible or impractical to wait for significance. In these cases, I often make decisions based on p-values that are converging towards significance – particularly if the variant shows similar changes against both control groups.
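As a concrete example, here's a minimal sketch of checking a variant against both controls at once – if both p-values are trending the same way as data accumulates, that's a stronger basis for an early call than either one alone. All counts are hypothetical:

```python
# Minimal sketch: test the variant against BOTH control groups.
# Conversion and user counts are hypothetical.
from statsmodels.stats.proportion import proportions_ztest

variant = (1640, 50000)     # (conversions, users)
control_a = (1520, 50000)
control_b = (1498, 50000)

for name, ctrl in [("control A", control_a), ("control B", control_b)]:
    _, p = proportions_ztest([variant[0], ctrl[0]], [variant[1], ctrl[1]])
    print(f"variant vs {name}: p = {p:.3f}")
```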

6) Avoid test intersections

If you have multiple tests running, you must ensure that they are exclusive of each other. For example, if you run two different tests aiming to improve cart conversion and the test populations overlap, some users might be in the variant group of both tests, and you can’t know which test influenced their behaviour.

At Badoo, we usually have around 100 AB tests running at any time, but we also have the luxury of millions of active users spread globally, so we can carefully avoid overlaps by testing in certain countries or demographics. If you don’t have this, I suggest you run fewer tests at once. Sometimes, for really important experiments, we lock down entire countries to see ecosystem-level changes.
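One simple way to enforce exclusivity without millions of users is to carve a single shared hash space into disjoint bucket ranges, one range per test. A minimal sketch (the experiment names and ranges are hypothetical):

```python
# Minimal sketch: keep concurrent tests exclusive by giving each one
# a disjoint slice of a shared 0-99 hash space. Ranges are hypothetical.
import hashlib

EXPERIMENT_BUCKETS = {
    "cart_button_colour": range(0, 30),       # buckets 0-29
    "cart_one_page_checkout": range(30, 60),  # buckets 30-59
    # buckets 60-99 left free for future tests
}

def bucket_for(user_id: str) -> int:
    return int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100

def experiments_for(user_id: str) -> list:
    b = bucket_for(user_id)
    return [name for name, buckets in EXPERIMENT_BUCKETS.items() if b in buckets]

# Any given user lands in at most one of the two cart tests, never both.
print(experiments_for("user-42"))
```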

7) For community products – consider other testing methodologies


If you’re AB testing, a high percentage of your users (at least 50%) are not seeing the feature. If you’re testing a big new engagement feature, this can make the new feature/product look a lot less effective than it really is. For example, if Uber wanted to AB test the impact of launching Uber Pool, it would be expensive: with only half as many riders having Uber Pool available, the experience and revenue metrics would look much worse than at full rollout.


In these cases, it’s better to roll out the change to an entire country and run a time-series analysis against a similar country. We often use Norway vs Sweden for these types of massive tests because they’re spookily similar markets!
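The analysis itself is a difference-in-differences: compare the gap between the two countries before and after the launch. A minimal sketch with hypothetical weekly numbers:

```python
# Minimal sketch of a difference-in-differences read: Norway gets the
# launch after week 4, Sweden is the counterfactual. Numbers are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "week": pd.date_range("2019-01-07", periods=8, freq="W"),
    "norway": [100, 102, 101, 103, 110, 112, 113, 115],
    "sweden": [98, 100, 99, 101, 100, 102, 101, 103],
})

pre, post = df.iloc[:4], df.iloc[4:]
gap_pre = (pre["norway"] - pre["sweden"]).mean()
gap_post = (post["norway"] - post["sweden"]).mean()
print(f"estimated launch effect: {gap_post - gap_pre:.1f} units/week")
```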

Hopefully, this gave some good insight into building a solid AB testing strategy!