What AI Actually Recommends: 21 Questions, 4 Engines

Most conversations about AI search treat ChatGPT, Gemini, Claude and Perplexity as a single channel. They aren't. To make that concrete, we asked all four the same 21 buyer questions across SaaS, agency, marketing and e-commerce categories. We recorded the verbatim brand short list each model returned, the citations they exposed, and the language they used to justify their picks. The full dataset is published at What AI actually recommends. This post is what we found.

How the measurement worked

Each category was one buyer-style question, phrased the way a real prospect would ask it. Examples: 'What's the best CRM for a SaaS startup with a small sales team?', 'What's the best client portal software for a marketing agency?', 'What's the best Shopify app for SEO?'. Every question was sent to ChatGPT, Gemini, Claude and Perplexity on the same day, with no follow-up prompts and no steering. We captured each model's brand list, position order, citation URLs and explanatory notes. That gave us four parallel recommendation sets per category and 21 categories in total.
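As a rough illustration, one captured answer could be stored as a flat record per question and engine. This is a minimal sketch: the class name `Recommendation`, its field names and the helper `expected_answer_count` are hypothetical and are not taken from the published dataset.

```python
from dataclasses import dataclass, field
from typing import List, Optional

ENGINES = ("ChatGPT", "Gemini", "Claude", "Perplexity")
CATEGORY_COUNT = 21  # one buyer-style question per category


@dataclass
class Recommendation:
    """One engine's answer to one buyer question (hypothetical schema)."""
    category: str
    question: str
    engine: str
    brands: List[str]  # in the order the model named them
    citation_urls: List[str] = field(default_factory=list)
    notes: str = ""

    def first_pick(self) -> Optional[str]:
        """The brand named first, or None if the model named no brand."""
        return self.brands[0] if self.brands else None


def expected_answer_count(categories: int = CATEGORY_COUNT) -> int:
    """Every question goes to every engine once, with no follow-ups."""
    return categories * len(ENGINES)
```

With 21 categories and four engines, `expected_answer_count()` comes to 84 records, one per recommendation list.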

The point wasn't to crown a winner. The point was to see how often the four models agree, where they diverge, and what kind of source each one leans on.

Headline: clean cross-model consensus is rare

Out of 21 categories, only five produced a clean #1 brand named by all four models. Two of those were single-vendor markets (Shopify for direct-to-consumer e-commerce, Buzzsprout for podcast hosting). Two were categories with a small set of obvious incumbents (Zendesk and Intercom for customer support, Mixpanel and Amplitude for product analytics). One was Yotpo for product reviews.

The remaining 16 categories showed real disagreement. Client portals for agencies returned six different brands across the four models, with no single name appearing in all four lists. Rank trackers for small businesses pulled seven distinct brands, with Semrush and Ahrefs the only ones surfacing in three lists. White-label SEO and project management for agencies showed similar fragmentation.
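One way to make "clean consensus" checkable is to compare the four lists for a category directly. The sketch below assumes a clean #1 means every engine named the same brand first; the function name `consensus_summary` and the placeholder brands are illustrative, not part of the dataset.

```python
def consensus_summary(lists_by_engine):
    """Summarise agreement across engines for one category.

    `lists_by_engine` maps an engine name to its ordered brand list.
    """
    brand_lists = list(lists_by_engine.values())
    brand_sets = [set(brands) for brands in brand_lists]
    first_picks = {brands[0] for brands in brand_lists if brands}
    # Only count a unanimous #1 if every engine actually returned a list.
    all_answered = bool(brand_lists) and all(brand_lists)
    unanimous = next(iter(first_picks)) if all_answered and len(first_picks) == 1 else None
    return {
        "unanimous_first": unanimous,
        "distinct_brands": len(set().union(*brand_sets)),
        "named_by_all": sorted(set.intersection(*brand_sets)) if brand_sets else [],
    }


# Placeholder data, not real results.
fragmented = {
    "ChatGPT": ["Brand A", "Brand B"],
    "Gemini": ["Brand C", "Brand A"],
    "Claude": ["Brand B", "Brand D"],
    "Perplexity": ["Brand E", "Brand F"],
}
agreed = {
    "ChatGPT": ["Brand A", "Brand B"],
    "Gemini": ["Brand A", "Brand C"],
    "Claude": ["Brand A", "Brand B"],
    "Perplexity": ["Brand A", "Brand D"],
}
```

On the placeholder data, `fragmented` yields six distinct brands with none named by all four engines, while `agreed` yields a unanimous first pick.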

We pulled the rank-tracker results into their own breakdown of the best rank tracker for small business, showing exactly which name each of the four engines reached for.

The same fragmentation shows up in our own field too, as you can see in what the four engines recommend as the best AI visibility tool.

If you've been hoping AI search would converge on stable winners the way Google's first page eventually did, the data says no. Not yet, and not fast enough to justify planning around a single engine.

The citation gap between engines is enormous

Across the 21 measurements, Perplexity returned 186 citation URLs. Gemini returned 84. Claude returned 87. ChatGPT returned one. That isn't a fluke of a single bad run. ChatGPT exposes citations inconsistently in its public API surface, so for most categories we got brand names and reasoning notes but no source URLs at all.
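To put the gap in proportion, the four counts above can be turned into shares of all citation URLs captured. The counts come from this post; the function name `citation_shares` is just an illustrative helper.

```python
# Citation URL counts reported above, summed across the 21 questions.
CITATION_URLS = {"Perplexity": 186, "Gemini": 84, "Claude": 87, "ChatGPT": 1}


def citation_shares(counts):
    """Return each engine's share of all citation URLs captured."""
    total = sum(counts.values())
    return {engine: count / total for engine, count in counts.items()}


total = sum(CITATION_URLS.values())
others = total - CITATION_URLS["Perplexity"]
shares = citation_shares(CITATION_URLS)
```

On these counts Perplexity accounts for about 52 percent of the 358 citation URLs, more than the other three engines combined (186 against 172).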

Practically, this means Perplexity is the easiest engine to influence through editorial placement, because its citations are the recommendation. If your brand isn't in the roundup articles Perplexity is reading, it isn't on Perplexity. ChatGPT is the hardest engine to influence through any single tactic, because its judgment about your category is woven into training data plus opaque retrieval. The strategy that gets you cited on Perplexity is not the same strategy that gets you named by ChatGPT.

G2 dominates the third-party signal

When we counted the most-cited domains across the entire dataset, G2 was first with 21 citations, appearing in nearly every category. Forbes was second with 12, and Zapier's blog was third with 8. PCMag and Capterra tied at 7, and Buffer's blog appeared 5 times.
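A domain tally like this can be reproduced from any list of citation URLs. The sketch below is one plausible approach, not necessarily how the dataset was compiled: it counts by hostname with a leading "www." removed, so a vendor blog on its own subdomain is counted separately. The function name `top_domains` and the `.example` URLs are placeholders.

```python
from collections import Counter
from urllib.parse import urlparse


def top_domains(citation_urls, n=5):
    """Count citations per hostname and return the n most cited."""
    hosts = []
    for url in citation_urls:
        host = urlparse(url).hostname or ""  # hostname is already lowercased
        if host.startswith("www."):
            host = host[4:]
        if host:
            hosts.append(host)
    return Counter(hosts).most_common(n)


# Placeholder URLs, not real citations.
sample = [
    "https://www.reviews.example/categories/crm",
    "https://reviews.example/categories/help-desk",
    "https://magazine.example/best-crm-tools",
    "https://blog.vendor.example/best-crm-roundup",
    "https://www.magazine.example/best-help-desk",
    "https://reviews.example/categories/analytics",
]
```

Calling `top_domains(sample, n=2)` ranks the placeholder review site first and the placeholder magazine second.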

What that ordering tells you: review platforms with structured, queryable category pages are the single most reliable third-party signal in AI recommendations. Editorial 'best of' roundups from established publications come next. Your own blog can earn citations, but only if it's structured like a roundup rather than a brand essay.

If you're trying to get cited and you don't have a G2 profile that's current, with reviews dated within the last twelve months and category pages populated, you're leaving the highest-yield surface area in the dataset on the table.

Per-engine personality, in plain English

The models have distinct habits. Once you've read 84 recommendation lists side by side (21 questions, four engines), the patterns are hard to miss.

ChatGPT leans on incumbents. HubSpot showed up as its #1 pick in four different categories (CRM, content marketing, email, brand monitoring). It rarely surfaces niche or emerging tools, and its lists are the tightest, usually four to five brands.
Gemini reaches further. It surfaces less-obvious choices like Mavrck and CreatorIQ for influencer marketing, Contently for content, and SE Ranking for SEO. Its citation set skews toward editorial roundups.
Claude blends established and modern picks, with the longest brand lists on average (around five brands per query). It cites publisher sources with the same fluency as Perplexity but tends to add more interpretive notes about trade-offs.
Perplexity goes widest. It surfaces emerging tools that the other three miss entirely: Pylon for customer support, Localo for rank tracking, Sitechecker for white-label SEO, TinyIMG for Shopify SEO. Its citation stack is deeper than the other three combined (186 URLs against 172) and more than twice the size of Gemini's or Claude's alone.

Content marketing was one of the categories where that split showed up most clearly, and we broke down the full short list in our look at the content marketing platforms AI models recommend.

Brand monitoring was one of the categories where HubSpot took ChatGPT's top spot, and if you want to see which names the four engines surface there today, we keep a live snapshot of the best brand monitoring tool recommendations.

The takeaway: a brand can be perfectly visible on one engine and effectively invisible on another. The 'who's winning' question only makes sense per engine.

Brand monitoring is one of the categories where that per-engine split shows up clearly, and our snapshot of the best PR monitoring tool across the four engines breaks down who each model names.

The first-position bias

Across the dataset, the brand a model named first looked qualitatively different from the brands it named second through fifth. First-place picks were almost always either the category-defining incumbent or the one the model judged most universally applicable. Lower positions read more like alternatives, edge cases or hedges.

This matters because the first-place recommendation is the one most likely to be remembered by the reader and most likely to be quoted by downstream AI agents stacking these answers together. Being named fifth out of five is not nothing, but the gap between #1 and #2 in influence is bigger than the gap between #2 and #5.

Where buyers should and shouldn't trust this kind of snapshot

A single-question, single-day measurement is a snapshot, not a verdict. AI Overview content changes around 70 percent of the time for the same query, and SparkToro found less than a 1-in-100 chance that two identical queries return the same brand list. The pattern across many runs is more informative than any single result.

What the snapshot does well: it shows you which short list a buyer is likely to see if they ask the question today, and how different that list is from the next engine over. What it doesn't do: it doesn't tell you the long-run probability your brand is named, the sentiment behind the mention, or whether a specific buyer persona phrased their question slightly differently and got a totally different list. That requires repeated, structured measurement across a full prompt set. We covered why one-off checks fall short in why spot-checking AI visibility doesn't work.
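Repeated measurement can be summarised with two simple numbers: how often a brand is named across runs, and how often two runs return the same list. This is a minimal sketch with placeholder data. The function names are illustrative, and comparing ordered lists is an assumption rather than the definition used in the studies cited above.

```python
from itertools import combinations


def mention_rate(runs, brand):
    """Share of runs in which `brand` appears anywhere in the list."""
    if not runs:
        return 0.0
    return sum(brand in run for run in runs) / len(runs)


def identical_pair_rate(runs):
    """Share of run pairs that returned exactly the same ordered list."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


# Placeholder runs of one question on one engine, not real results.
runs = [
    ["Brand A", "Brand B", "Brand C"],
    ["Brand B", "Brand A", "Brand D"],
    ["Brand A", "Brand B", "Brand C"],
    ["Brand C", "Brand E"],
]
```

In the placeholder runs, Brand A is named in three of four runs, yet only one of the six run pairs returns an identical list, which is the kind of gap a single check cannot show.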

What this means for marketing teams

Three things follow from the data.

Run the measurement per engine. If you're optimising for AI visibility and treating the four engines as one, you're probably over-investing in whichever one you can see and under-investing in the other three. The cost of running parallel checks has collapsed.
Prioritise the surfaces AI actually reads. G2, category-leading roundups, structured comparison content. The brands earning citations across multiple engines are the ones present on those surfaces.
Treat #1 differently from #2 through #5. Being on the list at all is the first job. Climbing to the first-named position is a separate, harder job, and worth measuring as its own metric.
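The first and third points can be tracked together by reporting, per engine, how often a brand is named at all and how often it is named first. The sketch below assumes answers are available as (engine, ordered brand list) pairs; the function name `engine_metrics`, the metric names and the data are hypothetical.

```python
def engine_metrics(answers, brand):
    """Per-engine presence rate and first-position rate for one brand.

    `answers` is an iterable of (engine, ordered_brand_list) pairs.
    """
    totals = {}
    for engine, brands in answers:
        stats = totals.setdefault(engine, {"answers": 0, "named": 0, "first": 0})
        stats["answers"] += 1
        if brand in brands:
            stats["named"] += 1
            if brands[0] == brand:
                stats["first"] += 1
    return {
        engine: {
            "presence_rate": s["named"] / s["answers"],
            "first_rate": s["first"] / s["answers"],
        }
        for engine, s in totals.items()
    }


# Placeholder answers, not real results.
answers = [
    ("ChatGPT", ["Brand A", "Brand B"]),
    ("ChatGPT", ["Brand A", "Brand C"]),
    ("Perplexity", ["Brand D", "Brand A"]),
    ("Perplexity", ["Brand D", "Brand E"]),
]
metrics = engine_metrics(answers, "Brand A")
```

Keeping the two rates separate makes it visible when a brand is on the list for an engine but never in the first-named position.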

See your own category

The full dataset, with model-by-model brand lists, consensus tables and citation breakdowns for all 21 questions, is at What AI actually recommends. If your category isn't in the public set or you want the same measurement run for your specific brand, the free AI visibility check runs the same kind of multi-engine measurement against your brand in around 30 seconds.

For a single category broken out the same way, see which AEO tools each engine recommends when asked the buyer question directly.
