Complete Literature Review
When a consumer asks ChatGPT "what's the best CRM for small businesses," the response draws from two distinct knowledge layers, and understanding how each one works explains most of what brands get wrong about AI visibility. The first is parametric memory: knowledge encoded during pretraining on massive text corpora. This is where the data licensing deals matter. Google pays Reddit $60 million per year for access to its content API (CBS News, February 2024); OpenAI pays an estimated $70 million per year for similar access (TechCrunch, May 2024). Reddit's 22+ billion posts and comments, YouTube's transcripts (which correlate at 0.737 with AI visibility per Ahrefs), and the broader web crawl all feed the models' base knowledge during training.
The second layer is retrieval-augmented generation (RAG), where the model queries live web sources at inference time to supplement or update its parametric knowledge. Here, Reddit's dominance is even more direct. Profound's analysis of 680 million citations found Reddit is the #1 cited domain on both Perplexity (6.6% of all citations) and Google AI Overviews (2.2%), and the #2 cited domain on ChatGPT behind Wikipedia. Perplexity cited Reddit in over 20% of responses during early 2026 (Evertune). The University of Toronto's study found 69-82% of citations come from earned media (Britopian, October 2025), and their methodology categorized community discussion platforms like Reddit as earned media rather than social media. Traditional social platforms like Twitter and Instagram contribute minimally, but Reddit and YouTube function as primary content infrastructure for both the pretraining and retrieval layers.
During training, the model learns to predict the next token in a sequence, which means it encodes statistical associations between words, concepts, and entities. If "brand X" frequently co-occurs with "enterprise," "expensive," and "reliable" across millions of documents, those associations become the model's understanding of brand X. Brands exist inside the model as distributed patterns across billions of parameters, shaped entirely by what the training corpus contained, with no explicit brand database or structured representation.
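To make the co-occurrence idea concrete, here is a minimal sketch that counts which words appear near a brand name across a handful of documents. It is only a crude proxy: a real model learns distributed representations through next-token prediction and does not keep explicit counts. The brand "Acme", the sample sentences, and the function name `cooccurrence_counts` are all hypothetical.

```python
import re
from collections import Counter


def cooccurrence_counts(documents, brand, window=8):
    """Count words that appear within `window` tokens of `brand`.

    Illustration only: LLMs do not store counts like this, but the
    counts hint at which associations a corpus would reinforce.
    """
    counts = Counter()
    brand = brand.lower()
    for doc in documents:
        tokens = re.findall(r"[a-z0-9]+", doc.lower())
        for i, token in enumerate(tokens):
            if token != brand:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts


# Hypothetical corpus mentioning a made-up brand.
docs = [
    "Acme is an expensive but reliable enterprise CRM.",
    "Our enterprise team found Acme reliable.",
    "Acme pricing is expensive for small teams.",
]
counts = cooccurrence_counts(docs, "acme")
print(counts["reliable"], counts["enterprise"], counts["expensive"])  # 2 2 2
```

Scaled up to millions of documents, the same kind of regularity is what shapes the model's picture of a brand, which is why the surrounding vocabulary in third-party content matters as much as the brand mention itself.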
Every pretrained model has a knowledge cutoff: the date after which no information was collected for the training corpus. Content published, brand changes made, or products launched after this date do not exist in the model's parametric memory.
Not all web content makes it into training. Labs filter aggressively for quality, deduplicate against older versions, and strip toxic or low-value material. When OpenAI built GPT-3's training set from Common Crawl, they reduced 45TB of compressed text to 570GB, discarding approximately 98.7% of the crawled web (Brown et al., NeurIPS 2020). Modern datasets are even stricter: FineWeb-Edu and DCLM use model-based quality scoring that removes roughly 90% of candidate data, and DeepSeek's preprocessing eliminated nearly 90% of repeated content across 91 Common Crawl dumps (Mozilla Foundation). The content your brand publishes may never reach the training corpus because it falls below a quality threshold, gets deduplicated against a more authoritative version of the same information, or sits on a domain the crawler never reached.
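The GPT-3 figure can be checked with simple arithmetic, and the filtering steps can be sketched in a few lines. The code below is an illustrative toy, not any lab's actual pipeline: it assumes 1 TB = 1,000 GB, uses exact-match deduplication on normalized text (production pipelines typically use fuzzy methods), and relies on a hypothetical `toy_score` function standing in for a model-based quality classifier.

```python
import hashlib


def retained_fraction(raw_gb, kept_gb):
    """Share of the raw crawl that survives filtering."""
    return kept_gb / raw_gb


def filter_corpus(docs, score, threshold=0.5):
    """Drop exact duplicates (after normalization) and low-scoring docs."""
    seen = set()
    kept = []
    for text in docs:
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate of a document already processed
        seen.add(digest)
        if score(text) >= threshold:
            kept.append(text)
    return kept


def toy_score(text):
    """Hypothetical quality score: longer documents score higher, capped at 1.0."""
    return min(len(text.split()) / 10, 1.0)


# 45 TB of compressed text reduced to 570 GB (assuming 1 TB = 1,000 GB).
kept_share = retained_fraction(45_000, 570)
print(f"retained {kept_share:.2%}, discarded {1 - kept_share:.2%}")

sample = [
    "Buy now!",
    "buy  NOW!",
    "A detailed comparison of CRM pricing tiers for small business teams today",
]
survivors = filter_corpus(sample, toy_score)
print(len(survivors))  # 1
```

In this toy run, the two near-identical promotional lines collapse into one and then fall below the quality threshold, leaving only the substantive document.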
The pipeline's most opaque stage is inference: the moment when the model generates a response and decides which brands to name. Seer Interactive ran six behavioral tests across 362,188 LLM responses and concluded that citations are post-hoc (Seer Interactive, February 2026). The model generates its brand recommendation from parametric memory first, built from the pretraining and fine-tuning stages described above, and then searches for citations to support the choice after the fact. If this hypothesis is correct, and Seer is careful to note they cannot observe token generation logs directly, then the common assumption that "earning a citation earns a recommendation" is backwards. A brand can produce the most authoritative content in its category, earn consistent retrieval, and still never be recommended because the model's parametric memory does not associate the brand strongly enough with the query topic.
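The post-hoc hypothesis can be sketched as a two-step flow. This is a simplified model of the behavior Seer describes, not an observed implementation: the association scores, the page records, and the function `recommend_post_hoc` are hypothetical stand-ins for parametric memory and retrieval results.

```python
def recommend_post_hoc(topic, associations, retrieved_pages):
    """Pick a brand from stored associations, then look for supporting citations.

    `associations` stands in for parametric memory; `retrieved_pages`
    stands in for live search results. Both are hypothetical structures.
    """
    scores = associations.get(topic, {})
    if not scores:
        return None, []
    # Step 1: the recommendation comes from memory alone.
    brand = max(scores, key=scores.get)
    # Step 2: citations are attached afterward, if any page mentions the brand.
    citations = [
        page["url"]
        for page in retrieved_pages
        if brand.lower() in page["text"].lower()
    ]
    return brand, citations


associations = {"crm": {"BrandA": 0.8, "BrandB": 0.3}}
pages = [
    {"url": "https://example.com/b-guide",
     "text": "BrandB publishes the most authoritative CRM guide"},
    {"url": "https://example.com/review",
     "text": "A review of BrandA"},
]
brand, citations = recommend_post_hoc("crm", associations, pages)
print(brand, citations)  # BrandA ['https://example.com/review']
```

In this sketch BrandB's page is retrieved but never affects the recommendation, which is the pattern the hypothesis predicts: retrieval supplies the footnotes, while memory supplies the answer.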
Even if you could observe the model's inference process directly, the output varies by user. ChatGPT's memory feature, upgraded in April 2025, rewrites search queries using stored user preferences: a vegan in San Francisco asking for restaurant recommendations gets different search queries and different results than a carnivore in Dallas, before the model even begins selecting brands (TechCrunch, April 2025). Google launched Personal Intelligence in AI Mode in January 2026, connecting Gmail, Photos, YouTube history, and Search history to personalize responses. Perplexity stores structured preferences (favorite brands, dietary needs, keywords) that persist across conversations and work across all models.
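As a rough illustration of preference-based query rewriting, the sketch below prepends stored preferences and appends a stored location to the user's query. The internal mechanisms of ChatGPT, Google, and Perplexity are not public, so the `UserMemory` structure and `rewrite_query` function are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class UserMemory:
    """Hypothetical store of remembered user details."""
    location: str = ""
    preferences: list = field(default_factory=list)


def rewrite_query(query, memory):
    """Expand a search query with remembered preferences and location."""
    parts = list(memory.preferences) + [query]
    if memory.location:
        parts.append(f"in {memory.location}")
    return " ".join(parts)


vegan_sf = UserMemory(location="San Francisco", preferences=["vegan"])
dallas = UserMemory(location="Dallas", preferences=["steakhouse"])

print(rewrite_query("restaurant recommendations", vegan_sf))
# vegan restaurant recommendations in San Francisco
print(rewrite_query("restaurant recommendations", dallas))
# steakhouse restaurant recommendations in Dallas
```

The same typed question produces two different search queries, so the candidate sources differ before any brand selection happens.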
Read the full blog post here: https://trysill.com/blog/how-ai-models-form-brand-opinions