ICE Scoring Framework: How to Prioritize CRO Test Ideas Effectively
You have 50 test ideas and can only run 3 per month. How do you pick the right ones? The ICE scoring framework is the most popular method — but most teams use it wrong. This guide shows you how to use it correctly, when to use alternatives, and how to build a prioritization system that actually works.
What Is ICE Scoring?
ICE stands for:
- Impact — How much will this improve conversion/revenue if it works?
- Confidence — How sure are we that it will work?
- Ease — How easy is it to implement and test?
Each factor is scored 1-10, and the ICE score = Impact × Confidence × Ease, giving a range of 1 to 1,000.
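To make the arithmetic concrete, here is a minimal Python sketch (the `TestIdea` class and its field names are our own illustration, not part of any standard library):

```python
from dataclasses import dataclass

@dataclass
class TestIdea:
    name: str
    impact: int      # 1-10: revenue effect if the test wins
    confidence: int  # 1-10: strength of the supporting evidence
    ease: int        # 1-10: how easy the test is to build and measure

    @property
    def ice_score(self) -> int:
        # Multiplicative ICE: Impact x Confidence x Ease, range 1-1000
        return self.impact * self.confidence * self.ease

print(TestIdea("Add express checkout", 9, 9, 6).ice_score)  # 486
```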
How to Score Each Factor
Impact (1-10)
Estimate the revenue effect if the test wins.
| Score | Impact Level | Example |
|---|---|---|
| 1-2 | Marginal | Button color change, minor copy tweak |
| 3-4 | Moderate | New social proof section, improved product descriptions |
| 5-6 | Significant | Restructured checkout flow, new pricing presentation |
| 7-8 | High | Express checkout integration, personalized homepage |
| 9-10 | Transformative | Complete funnel redesign, new business model element |
Tips for scoring Impact:
- Consider the page’s traffic volume (a small improvement on a high-traffic page has more impact than a large improvement on a low-traffic page)
- Calculate potential revenue: estimated CVR lift × monthly visitors × AOV (see the sketch after this list)
- Score based on revenue impact, not just CVR impact
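A quick sketch of that revenue calculation (the input numbers are hypothetical):

```python
def monthly_revenue_impact(cvr_lift: float, monthly_visitors: int, aov: float) -> float:
    """Estimated extra revenue per month if the test wins.

    cvr_lift is the absolute CVR change, e.g. 0.005 for +0.5 percentage points.
    """
    return cvr_lift * monthly_visitors * aov

# Hypothetical page: 100,000 visitors/month, $80 AOV, +0.5pp predicted lift
print(monthly_revenue_impact(0.005, 100_000, 80.0))  # 40000.0
```

An estimate like this is what separates a 3 from a 7 on the Impact scale: the same relative lift is worth far more on a high-traffic page.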
Confidence (1-10)
How confident are you that this change will produce the predicted improvement?
| Score | Evidence Level | Example |
|---|---|---|
| 1-2 | Gut feeling / opinion | “I think users would like a video here” |
| 3-4 | Industry best practice | “Articles say exit-intent popups increase signups” |
| 5-6 | Competitor or case study evidence | “Competitor X added this and reported a 20% lift” |
| 7-8 | Your own qualitative data | “Session recordings show users struggling with this exact element” |
| 9-10 | Your own quantitative data | “Analytics show 60% drop-off at this step; user surveys confirm the reason” |
Tips for scoring Confidence:
- Require evidence, not opinions. Ask: “What data supports this?”
- Multiple data points increase confidence
- Past test results on similar changes increase confidence
Ease (1-10)
How easy is it to build, launch, and measure this test?
| Score | Effort Level | Example |
|---|---|---|
| 1-2 | Weeks of dev + design work | Complete checkout rebuild, new payment integration |
| 3-4 | Days of dev + design work | New page layout, complex A/B test setup |
| 5-6 | 1-2 days of work | New section design, multi-element test |
| 7-8 | A few hours | Copy change, CTA button test, image swap |
| 9-10 | Minutes (visual editor change) | Headline test, button color, badge addition |
ICE Scoring Example
| Test Idea | Impact | Confidence | Ease | ICE Score |
|---|---|---|---|---|
| Add express checkout (Shop Pay, Apple Pay) | 9 | 9 | 6 | 486 |
| Redesign product page with benefit-first copy | 7 | 7 | 5 | 245 |
| Add free shipping progress bar to cart | 7 | 8 | 8 | 448 |
| Change CTA button from blue to green | 1 | 2 | 10 | 20 |
| Add exit-intent popup with 10% discount | 6 | 7 | 8 | 336 |
| Complete checkout flow redesign | 9 | 6 | 2 | 108 |
Priority order: Express checkout (486) → Shipping bar (448) → Exit popup (336) → Product page (245) → Checkout redesign (108) → Button color (20)
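Reusing the `TestIdea` sketch from earlier, sorting the backlog by ICE score reproduces this order:

```python
ideas = [
    TestIdea("Add express checkout (Shop Pay, Apple Pay)", 9, 9, 6),
    TestIdea("Redesign product page with benefit-first copy", 7, 7, 5),
    TestIdea("Add free shipping progress bar to cart", 7, 8, 8),
    TestIdea("Change CTA button from blue to green", 1, 2, 10),
    TestIdea("Add exit-intent popup with 10% discount", 6, 7, 8),
    TestIdea("Complete checkout flow redesign", 9, 6, 2),
]

for idea in sorted(ideas, key=lambda i: i.ice_score, reverse=True):
    print(f"{idea.ice_score:>4}  {idea.name}")
# 486, 448, 336, 245, 108, 20 -- the same ranking as above
```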
Common ICE Scoring Mistakes
1. Everyone scores differently
Problem: Your 7 is someone else’s 4. Without calibration, scores are meaningless. Fix: Use the scoring rubrics above. Have the team score independently, then discuss and calibrate.
2. High-Ease bias
Problem: Easy tests always bubble to the top, even if impact is low. Fix: Set a minimum Impact threshold (5 or higher) before a test enters the backlog. Easy + low-impact = waste of a test slot.
3. Confidence without evidence
Problem: Teams rate confidence based on how much they personally like the idea. Fix: Require at least one data source for any Confidence score above 5.
4. Not updating scores
Problem: Ideas scored 6 months ago based on old data. Fix: Re-score quarterly as new data becomes available.
ICE vs Alternative Frameworks
| Framework | Factors | Best For | Weakness |
|---|---|---|---|
| ICE | Impact, Confidence, Ease | Quick scoring, small teams | Subjective, no behavioral science grounding |
| PIE | Potential, Importance, Ease | Page-level prioritization | “Importance” is vague |
| PXL | Binary criteria checklist | Reducing subjectivity | Complex, requires training |
| AXR | Assumption, eXpected impact, Resource cost | Behavioral science-driven CRO | Requires heuristic analysis expertise |
| RICE | Reach, Impact, Confidence, Effort | Product teams | ”Reach” adds complexity |
The AXR Framework (acceleroi’s Approach)
AXR improves on ICE by grounding confidence in behavioral science:
- Assumption Strength — Is the hypothesis backed by a recognized behavioral principle (cognitive bias, heuristic, or UX pattern)? Scored based on evidence strength.
- eXpected Impact — Revenue impact estimated as page traffic × predicted CVR lift × AOV
- Resource Cost — Implementation effort (inverted: easy = high score)
The key difference: AXR’s “Assumption Strength” requires citing a specific behavioral science principle, not just a feeling of confidence. This produces more reliable prioritization.
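As a rough illustration only: the evidence tiers, revenue bands, and multiplicative combination in this sketch are our own assumptions for showing the shape of such a score, not acceleroi’s published scoring logic.

```python
# Hypothetical evidence tiers for Assumption Strength -- illustrative only
ASSUMPTION_STRENGTH = {
    "no cited principle": 2,
    "generic UX pattern": 5,
    "documented heuristic or cognitive bias": 7,
    "behavioral principle confirmed by own data": 9,
}

def revenue_band(revenue: float) -> int:
    """Map estimated monthly revenue impact onto 1-10 (band cutoffs are ours)."""
    cutoffs = [500, 1_000, 2_500, 5_000, 10_000, 25_000, 50_000, 100_000, 250_000]
    return 1 + sum(revenue >= c for c in cutoffs)

def axr_score(evidence: str, traffic: int, cvr_lift: float, aov: float,
              resource_score: int) -> int:
    """Hypothetical AXR-style composition, assuming an ICE-like product.

    resource_score is inverted like Ease: easy to build = high score (1-10).
    """
    expected_impact = revenue_band(traffic * cvr_lift * aov)
    return ASSUMPTION_STRENGTH[evidence] * expected_impact * resource_score

print(axr_score("documented heuristic or cognitive bias",
                traffic=100_000, cvr_lift=0.005, aov=80.0, resource_score=6))  # 294
```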
Building Your Test Backlog
Step 1: Gather all test ideas from research
Pull ideas from: analytics findings, heatmap insights, session recording observations, user survey feedback, competitor analysis, and team brainstorming.
Step 2: Score each idea with ICE (or AXR)
Have 2-3 team members score independently, then average each factor before computing the ICE product.
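For example, averaging each factor across scorers before multiplying (the individual scores here are made up):

```python
from statistics import mean

# Three scorers each submit (impact, confidence, ease) for one idea
scores = [(9, 8, 6), (8, 9, 6), (9, 9, 5)]

impact, confidence, ease = (mean(col) for col in zip(*scores))
print(round(impact * confidence * ease))  # ~426 for these inputs
```

Averaging per factor rather than per final score also makes calibration easier: a large spread on a single factor is an immediate flag for discussion.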
Step 3: Set thresholds
- Test immediately: Score above 300
- Test next quarter: Score 150-300
- Backlog / revisit: Score 50-150
- Don’t test: Score below 50
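These cutoffs translate directly into code; a minimal sketch:

```python
def priority_bucket(ice_score: float) -> str:
    """Map an ICE score to the action thresholds above."""
    if ice_score > 300:
        return "test immediately"
    if ice_score >= 150:
        return "test next quarter"
    if ice_score >= 50:
        return "backlog / revisit"
    return "don't test"

print(priority_bucket(486))  # test immediately
print(priority_bucket(108))  # backlog / revisit
```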
Step 4: Plan your testing calendar
Based on your testing velocity (tests per month), slot the top-scoring ideas into your calendar.
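A sketch of that slotting, assuming a fixed monthly velocity (the month labels, velocity, and backlog names are placeholders):

```python
from itertools import islice

def plan_calendar(ranked_ideas: list[str], velocity: int,
                  months: list[str]) -> dict[str, list[str]]:
    """Fill each month with the next `velocity` highest-scoring ideas."""
    remaining = iter(ranked_ideas)
    return {month: list(islice(remaining, velocity)) for month in months}

backlog = ["Express checkout", "Shipping bar", "Exit popup",
           "Product page copy", "Checkout redesign"]
print(plan_calendar(backlog, velocity=3, months=["Month 1", "Month 2"]))
# {'Month 1': ['Express checkout', 'Shipping bar', 'Exit popup'],
#  'Month 2': ['Product page copy', 'Checkout redesign']}
```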
Step 5: Review and re-score monthly
As you learn from test results, update scores on remaining ideas.
Skip the manual scoring. Our AI audit engine automatically scores every recommendation using the AXR framework — grounded in 40+ behavioral science heuristics and calibrated against 1,000+ historical A/B test results.