Research Methodology
How we measure AI citations, which models we test, and the statistical frameworks that ensure our findings are reliable and reproducible.
Last updated March 2026 · Version 3.1
Data Collection
How queries are generated and submitted to AI models
Our query generation process begins with seed queries derived from three sources: real search query datasets (anonymized), industry keyword databases, and manually curated queries designed to test specific citation behaviors. Each seed query is then expanded into multiple variants to account for phrasing differences, specificity levels, and intent nuances.
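As a loose illustration of this template-based expansion, the sketch below fans one seed topic out across phrasing and specificity variants. The templates and the `expand_seed_query` helper are hypothetical stand-ins, not our production pipeline.

```python
from itertools import product

# Illustrative variant templates only; the real expansion set is much larger.
SPECIFICITY = ["", " for startups", " for enterprise teams"]
PHRASINGS = [
    "What is the best {topic}{qualifier}?",
    "Which {topic}{qualifier} should I use?",
    "Recommend a {topic}{qualifier}",
]

def expand_seed_query(topic: str) -> list[str]:
    """Generate phrasing/specificity variants for one seed topic."""
    return [
        template.format(topic=topic, qualifier=qualifier)
        for template, qualifier in product(PHRASINGS, SPECIFICITY)
    ]

print(expand_seed_query("CRM"))  # 9 variants of the same underlying intent
```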
Queries are classified into four intent categories following a modified version of the Broder taxonomy:
Informational
Seeking knowledge or explanation (e.g., "What is the best CRM for startups?")
Commercial
Comparing products or evaluating options (e.g., "Notion vs Coda for team wikis")
Navigational
Looking for a specific brand or resource (e.g., "Stripe pricing page")
Transactional
Ready to take action or purchase (e.g., "Sign up for project management tool")
Each query is submitted individually via API with a clean conversation context — no prior messages, no system prompts, and no user profile data. This ensures we measure the model's baseline citation behavior rather than personalized responses. Temperature is set to 0 to minimize response variability, and each query is run three times to verify consistency.
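The sketch below shows what this submission protocol looks like against a single model, using the OpenAI Python client as an example. The `run_query` helper is illustrative; the real harness covers all four models and additionally handles retries, rate limits, and response archiving.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def run_query(query: str, replications: int = 3) -> list[str]:
    """Submit one query in a clean context, temperature 0, three times."""
    responses = []
    for _ in range(replications):
        completion = client.chat.completions.create(
            model="gpt-4o",    # version pinned per the coverage table
            temperature=0,     # minimize randomness across replications
            messages=[{"role": "user", "content": query}],  # no system prompt
        )
        responses.append(completion.choices[0].message.content)
    return responses
```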
Model Coverage
Which AI models are tested and how we account for differences
As of Q1 2026, we test four major AI models that represent the majority of consumer-facing AI search interactions:
| Model | Version Tested | Access Method | Grounding |
|---|---|---|---|
| ChatGPT | GPT-4o (2026-01) | OpenAI API | Web browsing enabled |
| Perplexity | Pro (latest) | Perplexity API | Native web search |
| Google Gemini | Gemini 1.5 Pro | Google AI API | Google Search grounding |
| Claude | Opus 4 | Anthropic API | Web search tool |
Each model has different grounding capabilities and knowledge cutoffs, which directly affect citation behavior. Perplexity and ChatGPT with browsing tend to cite more recent sources, while models without web access rely more heavily on training data. We report both aggregate and model-specific metrics to account for these differences.
Citation Scoring
How we classify, score, and weight different types of AI citations
Not all citations are equal. A direct recommendation (“I recommend using Notion for this”) carries different weight than a passing mention (“tools like Notion, Coda, and others”). Our scoring framework classifies each brand mention into one of four tiers:
Tier 1 — Primary Recommendation
Brand is the main or sole recommendation. Appears as the direct answer to the user's query.
Tier 2 — Named Alternative
Brand is listed among a small set of recommended options (typically 2-4 brands) with specific context for each.
Tier 3 — Comparative Mention
Brand is mentioned in a comparison or as one option in a longer list, without strong endorsement language.
Tier 4 — Indirect Reference
Brand is referenced tangentially, such as in an example, analogy, or background context without recommendation intent.
Citation tier classification is performed using a combination of rule-based NLP patterns and a fine-tuned classifier trained on 8,000+ manually labeled examples. Inter-annotator agreement for the labeling task was 91.3% (Cohen's kappa = 0.87), indicating strong reliability.
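To make the rule-based half of this pipeline concrete, here is a simplified sketch. The patterns and the `classify_mention` helper are illustrative examples, not the production rules or the fine-tuned classifier.

```python
import re

# Simplified pattern templates keyed by tier; `{brand}` is filled in per mention.
TIER_PATTERNS = [
    (1, r"\bI (?:recommend|suggest)(?: using)? {brand}\b"),
    (2, r"\b(?:top|best|great) (?:options?|choices?|tools?)\b[^.]*\b{brand}\b"),
    (3, r"\b{brand}\b[^.]*\b(?:vs\.?|versus|compared to|alternatives)\b"),
]

def classify_mention(response_text: str, brand: str) -> int:
    """Return the first tier whose pattern matches; default to tier 4."""
    for tier, template in TIER_PATTERNS:
        pattern = template.format(brand=re.escape(brand))
        if re.search(pattern, response_text, flags=re.IGNORECASE):
            return tier
    return 4  # indirect reference
```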
Statistical Framework
Confidence intervals, sample sizes, and significance testing
All reported citation rates include 95% confidence intervals calculated using the Wilson score interval method, which provides better coverage properties than the standard Wald interval for proportions near 0 or 1.
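For reference, the Wilson interval can be computed directly as below (or via statsmodels' `proportion_confint` with `method="wilson"`); this is a standalone sketch rather than our reporting code.

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    if n == 0:
        return (0.0, 1.0)
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = (z / denom) * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return (max(0.0, center - half_width), min(1.0, center + half_width))

# e.g. a brand cited in 42 of 500 responses:
print(wilson_interval(42, 500))  # ≈ (0.063, 0.112)
```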
Sample size requirements
Minimum sample sizes are enforced for total queries per quarter, queries per industry, and replications per query (three per query, per the data collection protocol above).
Quarter-over-quarter comparisons use a two-proportion z-test with a Bonferroni correction to account for multiple comparisons across industries; we report a change as statistically significant only if p < 0.005, the corrected threshold. Effect sizes are reported using Cohen's h for proportion comparisons.
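A minimal version of this comparison is sketched below: a pooled two-proportion z-test plus Cohen's h. The helper name and example counts are illustrative, and the assumption of ten industry-level comparisons for the Bonferroni step is ours (it is consistent with the 0.005 threshold above).

```python
from math import asin, erf, sqrt

def two_proportion_test(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float, float]:
    """Return (z statistic, two-sided p-value, Cohen's h) for two citation rates."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided normal tail
    h = 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))             # Cohen's h effect size
    return z, p_value, h

alpha = 0.05 / 10  # illustrative Bonferroni correction across 10 industries -> 0.005
z, p, h = two_proportion_test(58, 500, 42, 500)
print(f"z={z:.2f}, p={p:.4f}, h={h:.2f}, significant={p < alpha}")
```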
For brand-level rankings, we use a Bayesian hierarchical model that accounts for query difficulty and industry-level variance. This prevents brands in low-volume categories from being disproportionately affected by sampling noise.
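A compressed sketch of the partial-pooling idea, written with PyMC. The priors, variable names, and toy data here are illustrative assumptions rather than the exact production specification.

```python
import numpy as np
import pymc as pm

# Toy data: citation counts and query counts per brand, grouped by industry.
citations = np.array([42, 58, 7, 3])
queries   = np.array([500, 500, 60, 60])
industry  = np.array([0, 0, 1, 1])  # index of each brand's industry

with pm.Model() as model:
    # Industry-level baseline citation rate on the log-odds scale.
    industry_mu = pm.Normal("industry_mu", mu=-2.0, sigma=1.0, shape=2)
    brand_sigma = pm.HalfNormal("brand_sigma", sigma=1.0)

    # Brand effects are partially pooled toward their industry baseline,
    # so low-volume brands are shrunk rather than ranked on noisy raw rates.
    brand_logit = pm.Normal("brand_logit", mu=industry_mu[industry],
                            sigma=brand_sigma, shape=len(citations))

    pm.Binomial("obs", n=queries, p=pm.math.invlogit(brand_logit),
                observed=citations)
    trace = pm.sample(1000, tune=1000)
```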
Limitations & Ethics
What our methodology can and cannot measure
We believe transparency about limitations is essential to maintaining research credibility. The following are known limitations of our current methodology:
API vs Consumer Product Differences
We access models via API, which may differ from the consumer-facing product. ChatGPT's web interface, for example, may include features like memory and plugins that affect citation behavior in ways our API testing does not capture.
Temporal Variability
AI model responses can change over time as models are updated. Our quarterly snapshots capture a specific window, and citation rates measured in March may differ from those in April, even for the same model version.
Geographic and Language Bias
Our queries are generated in English and primarily reflect North American and European market contexts. Citation patterns for brands in other regions and languages may differ substantially.
Personalization Effects
By using clean API sessions, we measure baseline citation behavior. Real users with conversation history, preferences, and location data may receive personalized responses with different citation patterns.
Ethical Commitment
We do not attempt to manipulate AI model behavior or game citation rates. Our research is observational and intended to inform, not to provide a playbook for artificial citation inflation. We publish our methodology in full to enable scrutiny and replication.
Questions About Our Methodology?
We welcome scrutiny and collaboration. If you have questions about our methods, want to discuss replication, or are interested in academic partnerships, reach out to our research team.