Introduction
Large Language Models (LLMs) are rapidly becoming integral to how users discover information, acting as sophisticated recommenders for everything from “the best running shoes” to “top B2B CRMs.” For businesses developing AI products or brands keen on maintaining digital visibility, understanding which brands and sites LLMs surface, and how consistently they do so, is paramount. This article introduces a novel, open-source methodology designed to rigorously evaluate LLM brand surfacing behavior.
Why This Research Matters
In an era where LLMs increasingly influence consumer choices, several critical questions emerge for brands and AI product developers:
*   Which brands and sites receive the most prominence in LLM-generated recommendations?
*   How stable are these recommendations across repeated samples, different locales, or even different models?
*   Can a “top-k” list derived from an LLM be truly trusted as a reliable ranking?
Our research aims to provide measurable, reproducible answers to these questions, acknowledging the inherent complexities and limitations of LLM outputs.
Introducing Entity-Conditioned Probing (ECP)
At the heart of our evaluation framework is Entity-Conditioned Probing (ECP). This method crafts prompts tailored to specific categories and locales, such as “best XXX tools in DE.” For each unique combination of category and locale, we collect multiple independent samples from an LLM. Each response is then parsed to extract a list of relevant entities, which may be brands or websites. This multi-sampling approach is crucial for capturing the variability in LLM responses.
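To make the sampling and parsing step concrete, here is a minimal sketch. The prompt wording, the demo response, and the list-parsing rules are illustrative assumptions, not the exact implementation released with the paper.

import re

def build_prompt(category: str, locale: str) -> str:
    # Hypothetical prompt template; the exact wording used in the study may differ
    return f"What are the best {category} tools in {locale}? Answer with a numbered list of names only."

def parse_entities(response: str) -> list[str]:
    # Pull entity names out of a numbered or bulleted list response
    entities = []
    for line in response.splitlines():
        match = re.match(r"\s*(?:\d+[.)]|[-*])\s+(.+)", line)
        if match:
            entities.append(match.group(1).strip().lower())
    return entities

# One simulated response; in practice, many independent samples are collected
# per (category, locale) pair from the LLM under evaluation.
demo_response = "1. Alpha CRM\n2. Beta Suite\n3. Gamma Cloud"
print(build_prompt("CRM", "DE"))
print(parse_entities(demo_response))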
Ensuring Reliability: The Half-Split Consensus Approach
To address the critical issue of reliability and stability, we pair multi-sampling with a half-split consensus check. After collecting a set of entity lists for a given category-locale pair, we divide these lists into two equal halves. For each half, we compute a “consensus top-k” list, identifying the most frequently mentioned entities. We then measure the overlap@k between the two consensus lists.
*   If the overlap@k is high, it indicates that the “top-k” ranking generated by the LLM for that specific query and context is stable and less prone to random fluctuations.
*   Conversely, if the overlap@k is low, it suggests that any single top-k list derived from the LLM should be treated with caution, as its ranking may be noisy and inconsistent.
The full pipeline, from ECP sampling through half-split consensus, is illustrated in a diagram in the paper; the sketch below shows the core computation.
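This is a rough sketch of the half-split overlap@k computation under simplifying assumptions (a single random split, frequency-based consensus, and set overlap at k); the released scripts may aggregate samples and break ties differently.

import random
from collections import Counter

def consensus_top_k(lists, k):
    # Most frequently mentioned entities across a set of parsed entity lists
    counts = Counter(entity for lst in lists for entity in lst)
    return [entity for entity, _ in counts.most_common(k)]

def half_split_overlap_at_k(lists, k, seed=0):
    # Split the samples into two halves, build a consensus top-k for each,
    # and return the fraction of entities the two tops share.
    rng = random.Random(seed)
    shuffled = lists[:]
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    top_a = set(consensus_top_k(shuffled[:mid], k))
    top_b = set(consensus_top_k(shuffled[mid:], k))
    return len(top_a & top_b) / k

# Toy example with six sampled entity lists for one category-locale pair
samples = [
    ["alpha", "beta", "gamma"],
    ["alpha", "gamma", "delta"],
    ["beta", "alpha", "gamma"],
    ["alpha", "beta", "epsilon"],
    ["gamma", "alpha", "beta"],
    ["alpha", "delta", "beta"],
]
print(half_split_overlap_at_k(samples, k=3))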
Large-Scale Evaluation and Key Resources
To thoroughly test our methodology and uncover interesting divergences in LLM behavior, we executed a large-scale study comprising 15,600 samples across 52 distinct categories and locales. The insights gained from this extensive evaluation offer a deeper understanding of how LLMs operate as brand recommenders.
All aspects of our research, including the methodology, code, and data, have been open-sourced to promote transparency and reproducibility:
*   Paper (preprint): https://zenodo.org/records/17489350
*   Code: https://github.com/jim-seovendor/entity-probe
*   Data (Hugging Face): https://huggingface.co/datasets/seovendorco/entity-probe
The repository includes:
*   /pl_top/*.csv files containing per-prompt list outputs and parsed entities.
*   results.*.jsonl files offering structured results and metadata for in-depth analysis (see the loading sketch after this list).
*   Scripts designed to aggregate list outputs, compute consensus tops, evaluate overlap@k reliability, and export tables/figures.
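As an illustration, the structured results can be loaded with pandas. Note that the file-matching pattern and any column names you work with afterwards should be checked against the actual schema in the repository; this is only a loading sketch.

# pip install pandas
import glob
import pandas as pd

# Load every structured results file into one DataFrame.
# The glob pattern assumes the results.*.jsonl files sit in the working directory.
frames = [pd.read_json(path, lines=True) for path in glob.glob("results.*.jsonl")]
results = pd.concat(frames, ignore_index=True)

# Inspect the available fields before analysis
print(results.columns.tolist())
print(results.head())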
Getting Started (Python Example)
For those eager to dive into the data, a simple Python example can get you started:
# pip install pandas numpy
import pandas as pd

# Example: load one per-prompt list output and compute simple entity frequencies.
# Replace "pl_top/example_category_en-US.csv" with the desired data file.
pl = pd.read_csv("pl_top/example_category_en-US.csv")

# Normalize entity names before counting
pl["entity"] = pl["entity"].str.strip().str.lower()

# Show the 20 most frequently mentioned entities
freq = pl["entity"].value_counts().head(20)
print(freq)
This snippet demonstrates how to load an entity list and quickly determine the most frequently mentioned entities.
Conclusion
Our open-sourced method provides a robust framework for quantitatively assessing the brand and site surfacing behavior of LLMs. By focusing on reliability and reproducibility, we offer a valuable tool for anyone looking to understand and optimize brand visibility in the evolving landscape of AI-driven recommendations.