Research Methodology
How we measure AI citations, which models we test, and the statistical frameworks that ensure our findings are reliable and reproducible.
Last updated March 2026 · Version 3.1
Data Collection
How queries are generated and submitted to AI models
Our query generation process begins with seed queries derived from three sources: real search query datasets (anonymized), industry keyword databases, and manually curated queries designed to test specific citation behaviors. Each seed query is then expanded into multiple variants to account for phrasing differences, specificity levels, and intent nuances.
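As a loose illustration of this template-based expansion, the sketch below fans one seed topic out across phrasing and specificity variants. The templates and the `expand_seed_query` helper are hypothetical stand-ins, not our production pipeline.

```python
from itertools import product

# Illustrative variant templates only; the real expansion set is much larger.
SPECIFICITY = ["", " for startups", " for enterprise teams"]
PHRASINGS = [
    "What is the best {topic}{qualifier}?",
    "Which {topic}{qualifier} should I use?",
    "Recommend a {topic}{qualifier}",
]

def expand_seed_query(topic: str) -> list[str]:
    """Generate phrasing/specificity variants for one seed topic."""
    return [
        template.format(topic=topic, qualifier=qualifier)
        for template, qualifier in product(PHRASINGS, SPECIFICITY)
    ]

print(expand_seed_query("CRM"))  # 9 variants of the same underlying intent
```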
Queries are classified into four intent categories following a modified version of the Broder taxonomy:
Informational
Seeking knowledge or explanation (e.g., "What is the best CRM for startups?")
Commercial
Comparing products or evaluating options (e.g., "Notion vs Coda for team wikis")
Navigational
Looking for a specific brand or resource (e.g., "Stripe pricing page")
Transactional
Ready to take action or purchase (e.g., "Sign up for project management tool")
Each query is submitted individually via API with a clean conversation context — no prior messages, no system prompts, and no user profile data. This ensures we measure the model's baseline citation behavior rather than personalized responses. Temperature is set to 0 to minimize response variability, and each query is run three times to verify consistency.
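The sketch below shows what this submission protocol looks like against a single model, using the OpenAI Python client as an example. The `run_query` helper is illustrative; the real harness covers all four models and additionally handles retries, rate limits, and response archiving.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def run_query(query: str, replications: int = 3) -> list[str]:
    """Submit one query in a clean context, temperature 0, three times."""
    responses = []
    for _ in range(replications):
        completion = client.chat.completions.create(
            model="gpt-4o",    # version pinned per the coverage table
            temperature=0,     # minimize randomness across replications
            messages=[{"role": "user", "content": query}],  # no system prompt
        )
        responses.append(completion.choices[0].message.content)
    return responses
```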
Model Coverage
Which AI models are tested and how we account for differences
As of Q1 2026, we test four major AI models that represent the majority of consumer-facing AI search interactions:
| Model | Version Tested | Access Method | Grounding |
|---|---|---|---|
| ChatGPT | GPT-4o (2026-01) | OpenAI API | Web browsing enabled |
| Perplexity | Pro (latest) | Perplexity API | Native web search |
| Google Gemini | Gemini 1.5 Pro | Google AI API | Google Search grounding |
| Claude | Opus 4 | Anthropic API | Web search tool |
Each model has different grounding capabilities and knowledge cutoffs, which directly affect citation behavior. Perplexity and ChatGPT with browsing tend to cite more recent sources, while models without web access rely more heavily on training data. We report both aggregate and model-specific metrics to account for these differences.
Citation Scoring
How we classify, score, and weight different types of AI citations
Not all citations are equal. A direct recommendation (“I recommend using Notion for this”) carries different weight than a passing mention (“tools like Notion, Coda, and others”). Our scoring framework classifies each brand mention into one of four tiers:
Tier 1 — Primary Recommendation
Brand is the main or sole recommendation. Appears as the direct answer to the user's query.
Tier 2 — Named Alternative
Brand is listed among a small set of recommended options (typically 2-4 brands) with specific context for each.
Tier 3 — Comparative Mention
Brand is mentioned in a comparison or as one option in a longer list, without strong endorsement language.
Tier 4 — Indirect Reference
Brand is referenced tangentially, such as in an example, analogy, or background context without recommendation intent.
Citation tier classification is performed using a combination of rule-based NLP patterns and a fine-tuned classifier trained on 8,000+ manually labeled examples. Inter-annotator agreement for the labeling task was 91.3% (Cohen's kappa = 0.87), indicating strong reliability.
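To make the rule-based half of this pipeline concrete, here is a simplified sketch. The patterns and the `classify_mention` helper are illustrative examples, not the production rules or the fine-tuned classifier.

```python
import re

# Simplified pattern templates keyed by tier; `{brand}` is filled in per mention.
TIER_PATTERNS = [
    (1, r"\bI (?:recommend|suggest)(?: using)? {brand}\b"),
    (2, r"\b(?:top|best|great) (?:options?|choices?|tools?)\b[^.]*\b{brand}\b"),
    (3, r"\b{brand}\b[^.]*\b(?:vs\.?|versus|compared to|alternatives)\b"),
]

def classify_mention(response_text: str, brand: str) -> int:
    """Return the first tier whose pattern matches; default to tier 4."""
    for tier, template in TIER_PATTERNS:
        pattern = template.format(brand=re.escape(brand))
        if re.search(pattern, response_text, flags=re.IGNORECASE):
            return tier
    return 4  # indirect reference
```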
Statistical Framework
Confidence intervals, sample sizes, and significance testing
All reported citation rates include 95% confidence intervals calculated using the Wilson score interval method, which provides better coverage properties than the standard Wald interval for proportions near 0 or 1.
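For reference, the Wilson interval can be computed directly as below (or via statsmodels' `proportion_confint` with `method="wilson"`); this is a standalone sketch rather than our reporting code.

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    if n == 0:
        return (0.0, 1.0)
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = (z / denom) * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return (max(0.0, center - half_width), min(1.0, center + half_width))

# e.g. a brand cited in 42 of 500 responses:
print(wilson_interval(42, 500))  # ≈ (0.063, 0.112)
```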
Sample size requirements
Minimum sample sizes are enforced for total queries per quarter, queries per industry, and replications per query (three per query, per the data collection protocol above).
Quarter-over-quarter comparisons use a two-proportion z-test with a Bonferroni correction to account for multiple comparisons across industries; we report a change as statistically significant only if p < 0.005, the corrected threshold. Effect sizes are reported using Cohen's h for proportion comparisons.
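A minimal version of this comparison is sketched below: a pooled two-proportion z-test plus Cohen's h. The helper name and example counts are illustrative, and the assumption of ten industry-level comparisons for the Bonferroni step is ours (it is consistent with the 0.005 threshold above).

```python
from math import asin, erf, sqrt

def two_proportion_test(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float, float]:
    """Return (z statistic, two-sided p-value, Cohen's h) for two citation rates."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided normal tail
    h = 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))             # Cohen's h effect size
    return z, p_value, h

alpha = 0.05 / 10  # illustrative Bonferroni correction across 10 industries -> 0.005
z, p, h = two_proportion_test(58, 500, 42, 500)
print(f"z={z:.2f}, p={p:.4f}, h={h:.2f}, significant={p < alpha}")
```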
For brand-level rankings, we use a Bayesian hierarchical model that accounts for query difficulty and industry-level variance. This prevents brands in low-volume categories from being disproportionately affected by sampling noise.
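A compressed sketch of the partial-pooling idea, written with PyMC. The priors, variable names, and toy data here are illustrative assumptions rather than the exact production specification.

```python
import numpy as np
import pymc as pm

# Toy data: citation counts and query counts per brand, grouped by industry.
citations = np.array([42, 58, 7, 3])
queries   = np.array([500, 500, 60, 60])
industry  = np.array([0, 0, 1, 1])  # index of each brand's industry

with pm.Model() as model:
    # Industry-level baseline citation rate on the log-odds scale.
    industry_mu = pm.Normal("industry_mu", mu=-2.0, sigma=1.0, shape=2)
    brand_sigma = pm.HalfNormal("brand_sigma", sigma=1.0)

    # Brand effects are partially pooled toward their industry baseline,
    # so low-volume brands are shrunk rather than ranked on noisy raw rates.
    brand_logit = pm.Normal("brand_logit", mu=industry_mu[industry],
                            sigma=brand_sigma, shape=len(citations))

    pm.Binomial("obs", n=queries, p=pm.math.invlogit(brand_logit),
                observed=citations)
    trace = pm.sample(1000, tune=1000)
```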
Limitations & Ethics
What our methodology can and cannot measure
We believe transparency about limitations is essential to maintaining research credibility. The following are known limitations of our current methodology:
API vs Consumer Product Differences
We access models via API, which may differ from the consumer-facing product. ChatGPT's web interface, for example, may include features like memory and plugins that affect citation behavior in ways our API testing does not capture.
Temporal Variability
AI model responses can change over time as models are updated. Our quarterly snapshots capture a specific window, and citation rates measured in March may differ from those in April, even for the same model version.
Geographic and Language Bias
Our queries are generated in English and primarily reflect North American and European market contexts. Citation patterns for brands in other regions and languages may differ substantially.
Personalization Effects
By using clean API sessions, we measure baseline citation behavior. Real users with conversation history, preferences, and location data may receive personalized responses with different citation patterns.
Ethical Commitment
We do not attempt to manipulate AI model behavior or game citation rates. Our research is observational and intended to inform, not to provide a playbook for artificial citation inflation. We publish our methodology in full to enable scrutiny and replication.
Questions About Our Methodology?
We welcome scrutiny and collaboration. If you have questions about our methods, want to discuss replication, or are interested in academic partnerships, reach out to our research team.