Marketing Experimentation: Answers to marketers’ and entrepreneurs’ questions about marketing experiments
Here are answers to some chat questions from last week’s ChatGPT, CRO and AI: 40 Days to build a MECLABS SuperFunnel. I hope they help with your own marketing efforts. And feel free to join us on any Wednesday to get your marketing questions answered as well.
Am I understanding the message correctly that the main value at first isn’t in more conversions or $$$ but in deeper understanding of the customer mindset?
This questioner is asking about the value of marketing experimentation. And it reminds me of this great movie quote…
Alfred: Why do we fall, Master Wayne?
The (future) Batman: So we can learn to pick ourselves back up again.
Similarly, we might say…
Marketer: Why do we run experiments, Flint?
Flint (the real Batman) McGlaughlin: “The goal of a test is not to get a lift, but rather to get a learning.”
So when you see marketing experiments featured in articles or videos (and we are guilty of this as well), the focus is usually on the conversion rate increase. It is a great way to get marketers’ attention. And of course we do want to get performance improvements from our tests.
But if you’re always getting performance improvements, you’re doing it wrong. Here’s why…
Marketing is essentially communicating value to the customer in the most efficient and effective way possible so they will want to take an action (from Customer Value: The 4 essential levels of value propositions).
So if you’re always getting performance improvements, you’re probably not pushing the envelope hard enough. You’re probably not finding the most efficient and effective way; you’re only fixing the major problems in your funnel. Which, of course, is helpful as well.
In other words, don’t feel bad about getting a loss in your marketing experiment. Little Bruce Wayne didn’t become Batman by always doing the obvious, always playing it safe. He had to try new things, fall down from time to time, so he could learn how to pick himself back up.
While that immediate lift feels good, and you should get many if you keep at it, the long-term, sustainable business improvement comes from actually learning from those lifts and losses to do one of the hardest things any marketer, nay, any person can do – get into another human being’s head. We just happen to call those other human beings customers.
Which leads us to some questions about how to conduct marketing experiments…
Do we need a control number?
In the MEC300 LiveClass, we practiced using a calculator to determine whether results from advertisement tests are statistically significant.
The specific tool we practiced with for test analysis was the AB+ Test Calculator by CXL.
I’m guessing this questioner may think the ‘control number’ comes from previous performance. And when we conducted pre-test analysis, we did use previous performance to help us plan (for more on pre-test planning and why you should calculate statistical significance for your advertising experiments, you can read Factors Affecting Marketing Experimentation: Statistical significance in marketing, calculating sample size for marketing tests, and more).
But once you’ve run an advertising experiment, your ‘control number’ – really two numbers, users or sessions and conversions – will be data from your ads’ or landing pages’ performance.
You may be testing your new ideas against an ad you have run previously by splitting your budget between the old and new ads. In this case, you would usually label the incumbent ad the control, and the new ad idea would be the treatment or variation.
If both ads are new, technically they would both be treatments or variations because you do not have any standard control that you are comparing against. For practical purposes in using the test analysis tool, it is usually easier to put the lower-performing ad’s numbers in the control, so you are dealing with a positive lift.
Remember, what you are doing with the test calculator is ensuring you have enough samples to provide a high likelihood that the difference in results you are seeing is not due to random chance. So for the sake of the calculator, it does not matter which results you put in the control section.
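If you’re curious what a tool like this is doing under the hood, here is a minimal sketch of a two-proportion z-test in Python. It is my own simplified illustration of the general approach, not the CXL calculator’s actual internals, and the numbers are hypothetical:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(control_users, control_conversions,
                          variation_users, variation_conversions):
    """Return the relative lift and two-sided p-value for a simple A/B comparison."""
    p_control = control_conversions / control_users
    p_variation = variation_conversions / variation_users

    # Pooled conversion rate under the null hypothesis of no real difference
    p_pooled = ((control_conversions + variation_conversions)
                / (control_users + variation_users))
    standard_error = sqrt(p_pooled * (1 - p_pooled)
                          * (1 / control_users + 1 / variation_users))

    z = (p_variation - p_control) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided test

    lift = (p_variation - p_control) / p_control
    return lift, p_value

# Hypothetical example: 5,000 users / 150 conversions vs. 5,000 users / 190 conversions
lift, p_value = two_proportion_z_test(5000, 150, 5000, 190)
print(f"Lift: {lift:.1%}, p-value: {p_value:.3f}")  # significant at 95% confidence if p < 0.05
```

Whichever version you put in the ‘control’ boxes, the p-value comes out the same – which is why, for the calculator’s purposes, the labeling doesn’t matter.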
Labeling a version the ‘control’ is most helpful when you are actually analyzing the results – it keeps clear which ad you had originally been running and what your hypothesis was for making a change.
Which brings us to what numbers you should put in the boxes in the test calculator…
Users or sessions, would that be landing page views? I’m running on Facebook.
In this specific test calculator, it asks for two numbers for the control and variation – ‘users or sessions’ and ‘conversions.’
What the calculator is basically asking for is – how many people saw it, and how many people acted on it – to get the conversion rate.
What you fill into these boxes will depend on the primary KPI for success in the experiment (for more on primary KPI selection, you can read Marketing Experimentation: How to get real-world answers to questions about a company’s marketing efforts).
If your primary KPI is a conversion on a landing page, then yes, you could use landing page views or, even better, unique pageviews – conversion rate would be calculated by dividing the conversion actions (like a form fill or button click) by unique pageviews.
However, if your primary KPI is clickthrough on a Facebook ad, then the conversion rate would be calculated by dividing the ad’s clicks by its impressions.
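To make those two calculations concrete, here is a small worked example in Python with hypothetical numbers:

```python
# Hypothetical landing page test: primary KPI is a form fill
unique_pageviews = 2400
form_fills = 96
landing_page_conversion_rate = form_fills / unique_pageviews
print(f"Landing page conversion rate: {landing_page_conversion_rate:.1%}")  # 4.0%

# Hypothetical Facebook ad test: primary KPI is clickthrough
impressions = 50_000
clicks = 650
clickthrough_rate = clicks / impressions
print(f"Ad clickthrough rate: {clickthrough_rate:.1%}")  # 1.3%
```

In the first case, 2,400 and 96 would go into the calculator’s ‘users or sessions’ and ‘conversions’ boxes; in the second case, 50,000 and 650 would.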
Which brings us to the next question, since this tool allows you to add in a control and up to five variations of the control (so six versions in total)…
Can you confirm the definition of a variation really fast? Is it a change in copy/imagery or just different size of ad?
Remember what we are doing when we’re running a marketing experiment – we are trying to determine whether a change in performance happened because of our change in messaging. For example, a headline about our convenient location works better than a headline about our user-friendly app, so we’ll focus our messaging on our convenient location.
When there are no validity threats and we just make that one change, we can be highly confident that the one change is the reason for the difference in results.
But when there are two changes – well, which change caused the difference in results?
For this reason, every change is a variation.
That said, when you are testing in a business context with a necessarily limited budget and the need for reasonably quick action, it can make sense to group variations.
So in the question asked, each ad size should be a variation, but you can group those into Headline A ads and Headline B ads.
Then you can see the difference in performance between the two headlines. But you also have the flexibility to get more granular and see if there are any differences among the sizes themselves. There shouldn’t be, right? But by having the flexibility to zoom in and see what’s going on, you might discover that the small-space ads for Headline B are performing worst. Why? Maybe Headline B works better overall, but it is longer than Headline A, and that makes the small-space ads too cluttered.
Ad size is a change unrelated to the hypothesis. But for other changes, this is where a hypothesis helps guide your testing. Changing two unrelated things would result in multiple variations (two headlines and two images would create four variations). However, if your experimentation is guided by a hypothesis, all of the changes you make should tie into that hypothesis.
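As a quick aside, here is a tiny sketch of why two unrelated changes multiply into four variations (the headlines and images are hypothetical):

```python
from itertools import product

headlines = ["Convenient location headline", "User-friendly app headline"]
images = ["Storefront photo", "App screenshot"]

# Every combination of the unrelated changes is its own variation
for number, (headline, image) in enumerate(product(headlines, images), start=1):
    print(f"Variation {number}: {headline} + {image}")
# Two headlines x two images = four variations competing for your test traffic
```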
So if you were testing what color car is most likely to attract customers, and you tested a headline of “See our really nice red car” versus “See our really nice blue car,” it would make no sense to have a picture of a red car in both ads. In this case, if you didn’t change the image, you wouldn’t really be testing the hypothesis.
For a real-world example, see Experiment #1 (a classic test from MarketingExperiments) in No Unsupervised Thinking: How to increase conversions by guiding your audience. The team was testing a hypothesis that the original landing page had many objectives competing in an unorganized way that may have been creating friction. Testing this hypothesis necessitated making multiple changes, so they didn’t create a variation for each. However, when making a new variation would be informative (namely, how far should they go in reducing distractions), they created a new variation.
So there were three variations total: the control (original), treatment #1, which tested simplifying by making multiple changes, and treatment #2, which tested simplifying even further.
When we discuss testing, we usually talk about splitting traffic in half (or thirds, in the case above) and sending an equal amount of traffic to each variation to see how they perform. But what if your platform is fighting you on that…
One thing I’m noticing is that Google isn’t showing my ads evenly – very heavily skewed. If it continues, should I pause the high impression group to let the others have a go?
It really comes down to a business decision for you to make – how much transparency, risk, and reward are you after? Here are the factors to consider.
On the one hand, this could seem like a less risky approach. Google is likely using a statistical model (probably Bayesian) paired with artificial intelligence to skew towards your better-performing ad to make you happy – i.e., to keep you buying Google ads because you see they are working. This is similar to multi-armed bandit testing, a methodology that emphasizes the higher performers while a test is running. You can see an example in case study #1 in Understanding Customer Experience: 3 quick marketing case studies (including test results).
So you could view trusting Google to do this well as less risky. After all, you are testing in a business context (not a perfect academic environment for a peer-reviewed paper). If one ad is performing better, why pay money to get a worse-performing ad in front of people?
And you can still reach statistical significance on uneven sample sizes. The downside I can see is that Google is doing this in a black box, and you essentially just have to trust Google. It’s up to you how comfortable you are doing that.
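If you want a feel for how a bandit-style allocation works, here is a minimal Thompson sampling sketch with a Beta-Bernoulli model. It is my own simplified illustration with hypothetical numbers, not how Google actually allocates impressions:

```python
import random

# Hypothetical results observed so far for two ads
ads = {
    "Ad A": {"conversions": 30, "non_conversions": 970},
    "Ad B": {"conversions": 45, "non_conversions": 955},
}

def choose_next_ad(ads):
    """Thompson sampling: draw a plausible conversion rate for each ad from its
    Beta posterior and serve whichever ad draws the highest value."""
    draws = {
        name: random.betavariate(1 + stats["conversions"], 1 + stats["non_conversions"])
        for name, stats in ads.items()
    }
    return max(draws, key=draws.get)

# Over many impressions the better performer gets shown more often,
# but the weaker ad still gets occasional traffic, so learning continues.
served = [choose_next_ad(ads) for _ in range(10_000)]
print({name: served.count(name) for name in ads})
```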
When you go to validate your test, you could get a Sample Ratio Mismatch warning in a test calculator, alerting you that you don’t have a 50/50 split. But read the warning carefully (my emphasis added): “SRM-alert: if you intended to have a 50% / 50% split, we measured a possible Sample Ratio Mismatch (SRM). Please check your traffic distribution.”
This warning is likely meant to flag a difference that isn’t obvious to the naked eye when you intended to run a 50/50 split. This could be due to validity threats like the instrumentation effect and the selection effect. Let’s say your splitter wasn’t working properly, and some notable social media accounts shared a landing page link but it only went to one of the treatments. That could threaten the validity of the experiment. You are no longer randomly splitting the traffic.
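If you want to check for a sample ratio mismatch yourself rather than rely on a calculator’s alert, a chi-square goodness-of-fit test is one common way to do it. Here is a short sketch using SciPy, with hypothetical traffic numbers:

```python
from scipy.stats import chisquare

# Observed users in each arm vs. the counts an intended 50/50 split would produce
observed = [5210, 4790]
total = sum(observed)
expected = [total / 2, total / 2]

statistic, p_value = chisquare(f_obs=observed, f_exp=expected)
if p_value < 0.01:
    print(f"Possible sample ratio mismatch (p = {p_value:.4f}) – check your traffic split")
else:
    print(f"No strong evidence of a mismatch (p = {p_value:.4f})")
```

A very small p-value here means the uneven split is unlikely to be random noise – expected if you deliberately let Google skew delivery, but a red flag if you intended a 50/50 split.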
On the flip side, if you want more control over things, you could evenly split the impressions or traffic, use a Frequentist (Z Test) statistical methodology, and choose the winner after the test has run. That way you aren’t trusting Google to pick the right winner, and you aren’t giving up on an initially under-performing treatment too soon.
I can’t personally say that one approach is definitely right or wrong; it comes down to what you consider more risky, and how much control and transparency you would like to have.
And if you would like to get deeper into these different statistical models, you can read A/B Testing: Why do different sample size calculators and testing platforms produce different estimates of statistical significance?
Flint, do I remember right? 4 words to action in the headline?
This question is a nice reminder that we’ve been answering questions about testing methodology – the infrastructure that helps you get reliable data – but you still need to craft powerful treatments for your tests.
The teaching from Flint McGlaughlin that you are referring to is actually about four words to value in the headline, not action. Four words to value. To give you ideas for headline testing, you can watch Flint teach this concept in the free FastClass – Effective Headlines: How to write the first 4 words for maximum conversion.
Hi, how would one gain access to this? Looks fascinating.
How can I join a cohort?
How do you join a cohort on 4/26?
Are the classes in the cohort live or self-paced? I’m based in Australia so there’s a big-time difference.
Oh, I would be very interested in joining that cohort. Do I email and see if I can join it?
At the end of the LiveClass, we stayed on the Zoom and talked directly to the attendees who had questions about joining the MECLABS SuperFunnel Cohort. If you are thinking of joining, or just looking for a few good takeaways for your own marketing, RSVP now for a Wednesday LiveClass.
Here are some quick excerpt videos to give you an idea of what you can experience in a LiveClass:
Single variable testing vs variable cluster testing
What is an acceptable level of statistical confidence?
Paul Good talks about the need for the MECLABS cohort