Here at Flixbus we strive to connect people across Europe. We love to support you in discovering new places, or simply return you home after visiting your friends and family. To support you with your travels, we work to improve the usability and accessibility of our service. To do so, we continuously deliver new features to our website and apps and we measure their impact on those goals. For our online products, we use several testing methodologies like A/B Testing, Multivariate Testing and Usability Testing. In this blog post we want to show you how we tested our “Nearby City Suggestions” feature.
Imagine you are crazy about cave diving. You visited the “Blautopf” (https://en.wikipedia.org/wiki/Blautopf) and stayed with friends in Bad Urach. Now you want to travel home to Leipzig. You search for a trip on our flixbus.com website, but there is no specific connection on that date. Before this new feature was rolled out, this is what you would have seen:
Here is where our “Nearby City Suggestions” feature comes into play. It offers you a list of alternative connections. Technically, we look for possible connections in a 50-km radius around the selected locations. In your case, it looks like this:
Pretty cool, right?
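For illustration, here is a minimal sketch of the kind of radius filter behind such a feature. The haversine formula, the station list and the coordinates are our own illustrative assumptions, not the production implementation:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def nearby_stations(origin, stations, radius_km=50):
    """Return all stations within radius_km of the origin."""
    return [s for s in stations
            if haversine_km(origin[0], origin[1], s["lat"], s["lon"]) <= radius_km]

# Illustrative coordinates: Bad Urach and three candidate stations.
bad_urach = (48.49, 9.40)
stations = [
    {"name": "Reutlingen", "lat": 48.49, "lon": 9.21},   # ~14 km away
    {"name": "Stuttgart",  "lat": 48.78, "lon": 9.18},   # ~36 km away
    {"name": "Munich",     "lat": 48.14, "lon": 11.58},  # far outside the radius
]
print([s["name"] for s in nearby_stations(bad_urach, stations)])
# → ['Reutlingen', 'Stuttgart']
```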
We hoped that with the introduction of this new feature our customers would be happier and able to use our service more often. Our developers had been working hard on it. Our business stakeholders were eager to see the results. The pressure was on. But how could we know whether our customers would actually be happy about it? The answer was quite easy: we had to measure it with an A/B test.
If you are unfamiliar with the A/B test methodology, here is how it works. We split the traffic on our website in two pots:
We run the test for a specific time, usually at least one business cycle, which in our case is a full week (we want to make sure we capture the differences between weekday and weekend behavior). Once enough time has passed, we compare the test results between the two groups. This way we find out whether our new feature has helped us reach our goal (e.g. a higher conversion rate, better usability…) or not. Based on the final result, we decide whether to change the website for all our customers or keep using the previous version.
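The post doesn’t name the specific statistical test we use; a common choice for comparing conversion rates between two pots is a two-proportion z-test. A minimal sketch with made-up numbers:

```python
from math import sqrt, erf

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test: returns (z statistic, two-sided p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)        # pooled conversion rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Normal CDF via the error function; two-sided p-value.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Made-up numbers: 10,000 visitors per pot, 3.0% vs 3.5% conversion.
z, p = two_proportion_z(300, 10_000, 350, 10_000)
print(z, p)  # z just below 2, p just under 0.05: barely significant
```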
So, here we were. We started our A/B test:
A couple of days later, we took a sneak peek at the first results to make sure everything was tracked correctly and we hadn’t broken the website. This was not done to actually analyze the results (doing so would have been bad practice, because it would have been too early to infer statistical significance), only to double-check that both our workaround for the splitting mechanism and the tracking were working correctly.
At our first look, the conversion rate was fantastic for Version B. We also had a statistically significant difference between the two versions, with far more views than “needed”. A typical win-win situation: our customers could choose the rides they were interested in more easily, and we improved our service.
But then we took a closer look at the traffic split: it wasn’t a 50-50 split. It was an 80-20 split. EIGHTY-TWENTY. Why was this difference so big? Why was our control group significantly larger than the variation? We couldn’t explain this difference and consequently we couldn’t validate this result without knowing if our data was corrupt. Instead of waiting for the end of this test, we were alarmed (i.e. slightly panicked) and we started working hard to find an answer to all our questions.
In a regular, “ideal” A/B test, the traffic split would occur exactly on the click of the search results when Zero Results should be displayed: here we would randomly split our customers into Version A or B. With cookies, we also make sure that this split stays valid for each customer over the whole lifetime of the test. This is how the traffic should be split:
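Such a cookie-based split can be sketched as follows. The cookie name and the dict standing in for the browser’s cookie jar are illustrative assumptions, not our real implementation:

```python
import random

COOKIE_NAME = "ab_zero_results"  # illustrative name, not the real cookie

def assign_variant(cookies, rng=random):
    """Assign the visitor to A or B once; afterwards, reuse the stored
    cookie so the same visitor sees the same version for the whole test."""
    if COOKIE_NAME not in cookies:
        cookies[COOKIE_NAME] = "A" if rng.random() < 0.5 else "B"
    return cookies[COOKIE_NAME]

# A returning visitor keeps their variant across visits:
jar = {}
first = assign_variant(jar)
assert all(assign_variant(jar) == first for _ in range(10))
```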
In our case, here was our first struggle. At the time, we couldn’t build this perfect “traffic splitter”. We also regularly use a popular A/B Testing tool which splits the traffic in the requested way, but the setup for this specific test was too complicated.
To be able to test this feature, we had to implement a workaround: we split the traffic as soon as our customers visited our webpage. As you can imagine, it might take several clicks before a customer hits the search button and performs a search that leads to a Zero Result page. This is not perfect, because the customer can also decide not to look for a ride at all.
There was also another challenge: what about those customers that landed directly on a search page? How should we assign them a cookie and make sure that they always see the same version? We decided to exclude those customers from the test.
In the end, this is what our workaround looked like:
We knew this mechanism wasn’t perfect. As you can see, the 50-50 split happens BEFORE users can even make a search that would lead to a Zero Result page. This means that the split among customers who actually see a Zero Result page might not have been exactly 50-50. However, we assumed that the likelihood of coming across a Zero Result page would be equal across the two pots, so the actual split shouldn’t have been too far from 50-50. We would have expected something like 45-55, which is okay. We never expected 80-20. Was this split mechanism the cause of the massive imbalance between the two groups?
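That assumption can be sanity-checked with a quick simulation (illustrative numbers): if the chance of hitting a Zero Result page is the same in both pots, an upstream 50-50 split stays close to 50-50 among Zero Result viewers.

```python
import random

def simulate(n_visitors, p_zero_result, rng):
    """Split visitors 50-50 on arrival; only some ever reach a Zero Result
    page. If that probability is the same in both pots, the split among
    Zero Result viewers stays close to 50-50."""
    seen = {"A": 0, "B": 0}
    for _ in range(n_visitors):
        variant = "A" if rng.random() < 0.5 else "B"
        if rng.random() < p_zero_result:  # same likelihood in both pots
            seen[variant] += 1
    return seen

seen = simulate(100_000, p_zero_result=0.05, rng=random.Random(7))
share_a = seen["A"] / (seen["A"] + seen["B"])
print(share_a)  # close to 0.5, nowhere near 0.8
```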
We launched ourselves on a quest to find out what was wrong. The following actions took place during the subsequent two weeks, in which we went through the test setup and results multiple times (did we mention that the pressure was on?).
First, we performed several checks on our web tracking tools (i.e. Webtrekk and Google Analytics) and verified that each time users were presented with a Zero Result page, the event was counted correctly as either “Zero Results with Suggestions” or “Zero Results without Suggestions”.
We then double checked whether we were pulling the right numbers from our analytics tools. Were we double counting our customers? We checked the selected time frame, the segments, the filters, the metrics…all looked good.
We checked the script we used to assign the cookie. Maybe the random function that assigned the cookie wasn’t random? Maybe we weren’t selecting the right target? Neither was the case: multiple people checked the code, but no one could find an error.
We then investigated what was different between the two pots. Device, operating system and browser splits were equal between the two pots. However, the channel distribution wasn’t: the Version B pot had much more Paid Search Brand traffic than Version A, and much less Paid Search Generic traffic. Were we onto something? Were people coming from a generic search more likely to be assigned to Version A? And if so, why?
Then, the light at the end of the tunnel. Our data science team found themselves assigned to Version B, so they should have seen some suggestions on a Zero Result page. However, they didn’t see any, and Webtrekk and Google Analytics would consequently count a “Zero Results without Suggestions” event. We then added some additional tags: we would now fire a hit-based custom dimension tracking the value of the cookie (Version A or Version B) on each page view, just to double-check whether users were being assigned two different cookies in the same session.
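A minimal sketch of the kind of consistency check this custom dimension enables. The export format (pairs of session id and cookie value) is an illustrative assumption:

```python
from collections import defaultdict

def sessions_with_mixed_cookies(hits):
    """Given (session_id, cookie_value) pairs from a hit-based custom
    dimension export, return the sessions that saw more than one value —
    a sign the split cookie is being reassigned mid-session."""
    values = defaultdict(set)
    for session_id, cookie_value in hits:
        values[session_id].add(cookie_value)
    return [s for s, v in values.items() if len(v) > 1]

# Illustrative export: session "s2" flips from B to A mid-session.
hits = [("s1", "A"), ("s1", "A"), ("s2", "B"), ("s2", "A"), ("s3", "B")]
print(sessions_with_mixed_cookies(hits))  # → ['s2']
```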
After collecting this new custom dimension for one day, we were able to confirm our hypothesis: not all users assigned to Version B actually saw suggestions on a Zero Result page. The reason? For some searches there were simply no alternative options to offer, so we had nothing to show.
We found out that this happened to roughly 50% of the users that were assigned version B. This was why our numbers showed us that only 20% of the total traffic would see some suggestions and 80% no suggestions. Basically, this happened:
We understood the cause of the traffic imbalance, and it wasn’t due to the test setup or to errors in collecting the results. It was due to the fact that in some cases we didn’t have any alternative city to suggest (perhaps the cities we could have suggested were farther than 50 km away from the original stations, for example). Our careful and ongoing analysis showed that our data wasn’t corrupt.
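With round, illustrative numbers, the accounting works out roughly like this: if about half of the Version B pot has no alternatives to show, those views land in the “without Suggestions” bucket and push the observed split toward 80-20.

```python
# Accounting for the imbalance, using round illustrative figures.
total_zero_result_views = 1000
pot_a = total_zero_result_views // 2  # 500 views, all "without Suggestions"
pot_b = total_zero_result_views // 2  # 500 views assigned to Version B

# Roughly half of pot B had no alternative cities to show, so those
# views were also tracked as "Zero Results without Suggestions":
b_without_alternatives = pot_b // 2                   # 250
b_with_suggestions = pot_b - b_without_alternatives   # 250

tracked_without = pot_a + b_without_alternatives  # 750
tracked_with = b_with_suggestions                 # 250
print(tracked_without / total_zero_result_views)
# → 0.75 — with the real traffic skewing a bit further, to roughly 80-20
```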
Finally, after three weeks of testing and troubleshooting, our data was validated.
We checked the channel distribution between the two pots once again, and we realized that even if there were some differences in the proportion of Paid Search Generic and Brand, they weren’t big enough to skew the numbers. The chart below shows the channel distribution comparison for the different segments we analyzed:
In the end, the final results were really good. We let the test run for three weeks, and we found out that:
Our customers were happier and purchased more. Our business stakeholders were satisfied, and the feature was rolled out to 100% of the users. We were exhausted, but really pleased with the work. Win-win-win.
In a nutshell, this is what happened: