    Updated May 19, 2026 · 9 min read

    What AI Actually Recommends: 21 Buyer Questions, Four Engines, One Snapshot

    We asked ChatGPT, Gemini, Claude and Perplexity the same 21 buyer questions across SaaS, agency, marketing and e-commerce categories. Five categories had a clean #1 across all four models. The other 16 didn't. Here's what the data shows.

    Matiss Katanenko

    Co-founder, Honeyb

    Most conversations about AI search treat ChatGPT, Gemini, Claude and Perplexity as a single channel. They aren't. To make that concrete, we asked all four the same 21 buyer questions across SaaS, agency, marketing and e-commerce categories. We recorded the verbatim brand short list each model returned, the citations they exposed, and the language they used to justify their picks. The full dataset is published at What AI actually recommends. This post is what we found.

    How the measurement worked

    Each category was one buyer-style question, phrased the way a real prospect would ask it. Examples: 'What's the best CRM for a SaaS startup with a small sales team?', 'What's the best client portal software for a marketing agency?', 'What's the best Shopify app for SEO?'. Every question was sent to ChatGPT, Gemini, Claude and Perplexity on the same day, with no follow-up prompts and no steering. We captured each model's brand list, position order, citation URLs and explanatory notes. That gave us four parallel recommendation sets per category and 21 categories in total.
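
    To make the capture step concrete, here is a minimal sketch of how one engine's answer to one question could be recorded. The `EngineAnswer` schema and its field names are our illustration for this post, not the layout of the published dataset, and the brand names and URL are placeholders.

```python
from __future__ import annotations

from dataclasses import dataclass, field

ENGINES = ("ChatGPT", "Gemini", "Claude", "Perplexity")


@dataclass
class EngineAnswer:
    """One engine's answer to one buyer question (hypothetical schema)."""

    engine: str
    question: str
    brands: list[str]  # in the order the engine named them
    citations: list[str] = field(default_factory=list)  # source URLs, if exposed
    notes: str = ""  # the engine's stated reasoning

    def position(self, brand: str) -> int | None:
        """1-based rank of a brand in this answer, or None if it is absent."""
        for rank, name in enumerate(self.brands, start=1):
            if name.lower() == brand.lower():
                return rank
        return None


# Placeholder values, not rows from the published dataset.
answer = EngineAnswer(
    engine="Perplexity",
    question="What's the best CRM for a SaaS startup with a small sales team?",
    brands=["Brand A", "Brand B", "Brand C"],
    citations=["https://example.com/best-crm-roundup"],
)
print(answer.position("brand b"))
```

    One record per engine per question gives four records per category, which is the "four parallel recommendation sets" described above.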

    The point wasn't to crown a winner. The point was to see how often the four models agree, where they diverge, and what kind of source each one leans on.

    Headline: clean cross-model consensus is rare

    Out of 21 categories, only five produced a clean #1 brand named by all four models. Two of those were markets with one dominant vendor (Shopify for direct-to-consumer e-commerce, Buzzsprout for podcast hosting). Two were categories with a small set of obvious incumbents (Zendesk and Intercom for customer support, Mixpanel and Amplitude for product analytics). One was Yotpo for product reviews.

    The remaining 16 categories showed real disagreement. Client portals for agencies returned six different brands across the four models, with no single name appearing in all four lists. Rank trackers for small businesses pulled seven distinct brands, with Semrush and Ahrefs the only ones surfacing in three lists. White-label SEO and project management for agencies showed similar fragmentation.
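
    The consensus checks described in this section can be sketched in a few lines. This assumes one reading of "clean #1", namely that every engine names the same brand first; the function names are ours, and the brand lists are placeholders shaped like the fragmented client-portal result, not the real data.

```python
from __future__ import annotations


def clean_number_one(lists: dict[str, list[str]]) -> str | None:
    """Return the brand every engine ranks first, or None if they disagree."""
    firsts = {brands[0] for brands in lists.values() if brands}
    if len(firsts) == 1 and all(lists.values()):
        return firsts.pop()
    return None


def distinct_brands(lists: dict[str, list[str]]) -> set[str]:
    """Every brand named by at least one engine."""
    return set().union(*lists.values())


def named_by_all(lists: dict[str, list[str]]) -> set[str]:
    """Brands that appear in every engine's list."""
    sets = [set(brands) for brands in lists.values()]
    return set.intersection(*sets) if sets else set()


# Placeholder lists: six distinct brands, none of them in all four lists.
fragmented = {
    "ChatGPT": ["Brand A", "Brand B"],
    "Gemini": ["Brand C", "Brand A"],
    "Claude": ["Brand B", "Brand D"],
    "Perplexity": ["Brand E", "Brand F"],
}
# Placeholder lists where all four engines agree on the first pick.
agreed = {
    "ChatGPT": ["Brand A", "Brand B"],
    "Gemini": ["Brand A", "Brand C"],
    "Claude": ["Brand A", "Brand B"],
    "Perplexity": ["Brand A", "Brand D"],
}

print(clean_number_one(agreed), clean_number_one(fragmented))
print(len(distinct_brands(fragmented)), named_by_all(fragmented))
```

    Exact string matching is a simplification. Real answers would need brand names normalised first ("HubSpot CRM" versus "HubSpot", for example).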

    If you've been hoping AI search would converge on stable winners the way Google's first page eventually did, the data says no. Not yet, and not quickly enough to justify planning around a single engine.

    The citation gap between engines is enormous

    Across the 21 measurements, Perplexity returned 186 citation URLs. Gemini returned 84. Claude returned 87. ChatGPT returned one. That isn't a fluke of a single bad run. ChatGPT exposes citations inconsistently in its public API surface, so for most categories we got brand names and reasoning notes but no source URLs at all.
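
    To put the gap in proportion, the four counts above can be turned into shares of all citations captured. The snippet uses only the totals quoted in this section; the variable names are ours.

```python
# Citation URL totals per engine, as reported above.
citations = {"Perplexity": 186, "Gemini": 84, "Claude": 87, "ChatGPT": 1}

total = sum(citations.values())  # 358
others = total - citations["Perplexity"]  # 172: Gemini + Claude + ChatGPT
shares = {engine: count / total for engine, count in citations.items()}

for engine, count in citations.items():
    print(f"{engine:<11}{count:>4}  {shares[engine]:.1%}")
print(f"Perplexity vs. the other three combined: {citations['Perplexity']} vs. {others}")
```

    On these numbers Perplexity accounts for just over half of every citation URL in the dataset, and ChatGPT for well under one percent.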

    Practically, this means Perplexity is the easiest engine to influence through editorial placement, because its citations are the recommendation. If your brand isn't in the roundup articles Perplexity is reading, it isn't on Perplexity. ChatGPT is the hardest engine to influence through any single tactic, because its judgment about your category is woven into training data plus opaque retrieval. The strategy that gets you cited on Perplexity is not the same strategy that gets you named by ChatGPT.

    G2 dominates the third-party signal

    When we counted the most-cited domains across the entire dataset, G2 was first with 21 citations, appearing in nearly every category. Forbes was second with 12. Zapier's blog was third with 8. PCMag and Capterra tied at 7. Buffer's blog appeared 5 times.

    What that ordering tells you: review platforms with structured, queryable category pages are the single most reliable third-party signal in AI recommendations. Editorial 'best of' roundups from established publications come next. Your own blog can earn citations, but only if it's structured like a roundup rather than a brand essay.

    If you're trying to get cited and you don't have a G2 profile that's current, with reviews dated within the last twelve months and category pages populated, you're leaving the highest-yield surface area in the dataset on the table.
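
    A domain tally like the one at the start of this section can be produced from a flat list of citation URLs with the standard library alone. This is a rough sketch rather than the exact script behind the dataset: `domain_of` and `top_domains` are our names, and the URLs are placeholders.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_of(url: str) -> str:
    """Hostname of a URL, lowercased, without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def top_domains(urls: list[str], n: int = 5) -> list[tuple[str, int]]:
    """The n most-cited domains, most frequent first."""
    domains = (domain_of(url) for url in urls)
    return Counter(d for d in domains if d).most_common(n)


# Placeholder URLs, not citations from the dataset.
urls = [
    "https://www.example.com/best-crm",
    "https://example.com/best-helpdesk",
    "https://reviews.example.org/category/crm",
    "http://EXAMPLE.com/roundup?ref=x",
]
print(top_domains(urls))
```

    Note that this keeps subdomains separate, so a publisher's blog subdomain and its main site would be counted as two domains unless you collapse them yourself.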

    Per-engine personality, in plain English

    The models have distinct habits. Once you've read 84 recommendation lists (21 questions across four engines) side by side, the patterns are hard to miss.

    • ChatGPT leans on incumbents. HubSpot showed up as its #1 pick in four different categories (CRM, content marketing, email, brand monitoring). It rarely surfaces niche or emerging tools, and its lists are the tightest, usually four to five brands.
    • Gemini reaches further. It surfaces less-obvious choices like Mavrck and CreatorIQ for influencer marketing, Contently for content, and SE Ranking for SEO. Its citation set skews toward editorial roundups.
    • Claude blends established and modern picks, with the longest brand lists on average (around five brands per query). It cites publisher sources with the same fluency as Perplexity but tends to add more interpretive notes about trade-offs.
    • Perplexity goes widest. It surfaces emerging tools that the other three miss entirely: Pylon for customer support, Localo for rank tracking, Sitechecker for white-label SEO, TinyIMG for Shopify SEO. Its citation stack is the deepest by far: 186 URLs, more than the other three engines combined (172).

    The takeaway: a brand can be perfectly visible on one engine and effectively invisible on another. The 'who's winning' question only makes sense per engine.
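
    One way to put a number on how far two engines diverge for a single question is the Jaccard overlap of their brand lists: brands shared by both, divided by brands named by either. This is an illustrative metric we are adding here, not one used in the dataset, and the lists are placeholders.

```python
from itertools import combinations


def jaccard(a: list[str], b: list[str]) -> float:
    """Overlap of two brand lists: 1.0 means identical sets, 0.0 means disjoint."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


# Placeholder lists for one question, not taken from the dataset.
lists = {
    "ChatGPT": ["Brand A", "Brand B", "Brand C"],
    "Gemini": ["Brand A", "Brand C", "Brand D"],
    "Perplexity": ["Brand E", "Brand F"],
}

for (engine_1, brands_1), (engine_2, brands_2) in combinations(lists.items(), 2):
    print(f"{engine_1} vs. {engine_2}: {jaccard(brands_1, brands_2):.2f}")
```

    Jaccard ignores position, so it measures whether a brand is on the list at all, not where it sits on it.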

    The first-position bias

    Across the dataset, the brand a model named first looked qualitatively different from the brands it named second through fifth. First-place picks were almost always either the category-defining incumbent or the one the model judged most universally applicable. Lower positions read more like alternatives, edge cases or hedges.

    This matters because the first-place recommendation is the one most likely to be remembered by the reader and most likely to be quoted by downstream AI agents stacking these answers together. Being named fifth out of five is not nothing, but the gap between #1 and #2 in influence is bigger than the gap between #2 and #5.

    Where buyers should and shouldn't trust this kind of snapshot

    A single-question, single-day measurement is a snapshot, not a verdict. AI Overview content changes around 70 percent of the time for the same query, and SparkToro found less than a 1-in-100 chance that two identical queries return the same brand list. The pattern across many runs is more informative than any single result.

    What the snapshot does well: it shows you which short list a buyer is likely to see if they ask the question today, and how different that list is from the next engine over. What it doesn't do: it doesn't tell you the long-run probability your brand is named, the sentiment behind the mention, or whether a specific buyer persona phrased their question slightly differently and got a totally different list. That requires repeated, structured measurement across a full prompt set. We covered why one-off checks fall short in why spot-checking AI visibility doesn't work.
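
    As a rough sketch of what repeated measurement adds, the snippet below takes several runs of the same question on one engine and estimates how often a brand is named at all and how often it is named first. The function names and the four placeholder runs are ours; a real prompt set would span many more runs and phrasings.

```python
from __future__ import annotations


def mention_rate(runs: list[list[str]], brand: str) -> float:
    """Share of runs in which the brand appears anywhere in the list."""
    if not runs:
        return 0.0
    return sum(brand in run for run in runs) / len(runs)


def first_named_rate(runs: list[list[str]], brand: str) -> float:
    """Share of runs in which the brand is the first one named."""
    if not runs:
        return 0.0
    return sum(bool(run) and run[0] == brand for run in runs) / len(runs)


# Placeholder runs of one question on one engine, not real measurements.
runs = [
    ["Brand A", "Brand B", "Brand C"],
    ["Brand B", "Brand A"],
    ["Brand C", "Brand D"],
    ["Brand A", "Brand D"],
]

print(mention_rate(runs, "Brand A"), first_named_rate(runs, "Brand A"))
```

    With only a handful of runs these rates are noisy estimates. Their value is that they turn a single yes-or-no snapshot into a frequency you can track over time.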

    What this means for marketing teams

    Three things follow from the data.

    • Run the measurement per engine. If you're optimising for AI visibility and treating the four engines as one, you're probably over-investing in whichever one you can see and under-investing in the other three. The cost of running parallel checks has collapsed.
    • Prioritise the surfaces AI actually reads. G2, category-leading roundups, structured comparison content. The brands earning citations across multiple engines are the ones present on those surfaces.
    • Treat #1 differently from #2 through #5. Being on the list at all is the first job. Climbing to the first-named position is a separate, harder job, and worth measuring as its own metric.

    See your own category

    The full dataset, with model-by-model brand lists, consensus tables and citation breakdowns for all 21 questions, is at What AI actually recommends. If your category isn't in the public set or you want the same measurement run for your specific brand, the free AI visibility check runs the same kind of multi-engine measurement against your brand in around 30 seconds.

    About the author

    Matiss Katanenko

    Co-founder, Honeyb

    My name is Matiss Katanenko and I co-founded Honeyb, the AI visibility platform that tracks how ChatGPT, Gemini, Claude, Perplexity and the other major AI engines talk about brands. I'm based in Riga, Latvia. Before Honeyb I spent years on the agency side running SEO and content programs for fast-growing brands across the US and Europe. That work is where I watched AI search start to compress the entire discovery channel into a four-brand short list, and decided to build the tool I wished agencies had. In my free time I'm in the sauna, on a padel court, or behind a drum kit.
