Transformers

The T in GPT

(From ...)

Understanding the T in GPT (transformers, though not the ones from the movies) is harder, and it also puts parameters and tokens in our path.

Every internal variable in a neural network that can be tuned or adjusted to change the output is a parameter. A simple example is a weight that raises or lowers the probability that one word follows another. A hyperparameter is a setting chosen by the developer rather than learned by the model, such as temperature, which controls the randomness (or creativity) of which response is chosen. When the temperature is set to 0, the model is at its most deterministic: it will return the next word with the highest probability. At higher temperatures, less probable words can be chosen, which leads to more creativity and diversity, but also more potential nonsense and hallucination. The Bing/Copilot conversational style buttons (creative, balanced, or precise) seem to regulate temperature. A model is trained by feeding it examples and tuning its parameters until the output improves.
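
For readers who like to see the mechanics, here is a minimal sketch (in Python) of how temperature changes the choice of the next word. The candidate words and their scores are invented for illustration and are not taken from any real model.

```python
import math
import random

# Hypothetical raw scores (logits) for a handful of candidate next words.
# Real models score tens of thousands of tokens; these numbers are made up.
logits = {"dog": 2.0, "cat": 1.5, "pizza": 0.3, "sonnet": -1.0}

def sample_next_word(logits, temperature):
    """Pick the next word, with temperature controlling randomness."""
    if temperature == 0:
        # Deterministic: always return the highest-scoring word.
        return max(logits, key=logits.get)
    # Divide each score by the temperature, then apply softmax.
    scaled = {word: score / temperature for word, score in logits.items()}
    total = sum(math.exp(s) for s in scaled.values())
    probs = {word: math.exp(s) / total for word, s in scaled.items()}
    # Sample: low temperature concentrates probability on likely words,
    # high temperature spreads it to unlikely ones.
    return random.choices(list(probs), weights=list(probs.values()))[0]

print(sample_next_word(logits, temperature=0))    # always "dog"
print(sample_next_word(logits, temperature=1.5))  # occasionally "pizza"
```

The same scores produce very different behavior depending on that one setting, which is why a temperature slider (or a "creative" button) can change the personality of a chatbot without retraining anything.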

GPT-2 had 1.5 billion parameters, GPT-3 had 175 billion, and GPT-4 was estimated to have 1.76 trillion parameters and significantly more memory (Griffith, 2023; Geyer, 2023). More parameters allow for more choices and nuance, so GPT-4 is smarter, more factual, more multilingual, and multimodal (able to accept visual or audio prompts). GPT-4 can create graphs or use an image of what is in your refrigerator to generate recipe ideas. But a model with more parameters is also slower and more costly to run. For some tasks, GPT-3.5 is good enough or even better, but while GPT-3.5 failed the bar exam, GPT-4 immediately passed, doing better than most humans (Katz et al., 2023). Professor Anna Mills was a beta-tester and quickly noticed that GPT-4 wrote more sophisticated, precise, articulate, and connected prose, with more varied sentence structure, word choice, information, and examples than GPT-3.5 (Mills, 2023b). All of those extra parameters are also what let the model imitate your professor or write like Yoda.

A neural network deals only with numbers, so words (or images or brain waves) need to be turned into tokens, which are a kind of digital stand-in. Tokens are numeric IDs (ultimately stored as 0s and 1s) that represent words, parts of words, or other data. More tokens mean more vocabulary, context, and nuance, but both the process of turning words into numbers (tokenization) and the size of the context window (how many tokens a model can consider at once) matter. Claude (100,000 tokens, or about 75,000 words) and GPT-4 Turbo (128,000 tokens, or about 96,000 words) can both read your novel all at once. GPT-4 can also tokenize and analyze images, which Claude cannot.
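
To see tokenization in action, here is a short sketch using OpenAI's open-source tiktoken library (assuming it is installed with pip install tiktoken). The sample sentence is just an illustration; exact token counts vary by model and encoding.

```python
# cl100k_base is the encoding used by GPT-4-era models.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Transformers turned every field of AI into a language problem."
token_ids = enc.encode(text)      # words become a list of integer IDs, one per token
print(token_ids)
print(len(text.split()), "words ->", len(token_ids), "tokens")
print(enc.decode(token_ids))      # the IDs round-trip back into the original text
```

Common words usually map to a single token, while rarer words get split into several pieces, which is why a 100,000-token window works out to roughly 75,000 English words.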

An AI determines the meaning of a token by observing it in context: what other tokens (words or images) appear near it and how often. Early language models processed the tokens in a sequence one at a time. The 2017 breakthrough of transformers (the T you've been waiting for) gave every token a set of weights that capture how strongly it relates to every other token in the sequence (Uszkoreit, 2017; Vaswani et al., 2017). This self-attention allowed all the words in a sequence to be processed in parallel, which greatly increased speed and created a more natural way to embed context. Transformers allow an AI to look not just at the probability of the next token, but at multiple combinations of tokens and larger patterns at once. Transformers opened the door for human-sounding LLMs, but they also unified the many disparate disciplines of AI into a single field in which everything, from images and code to music and DNA, can be treated like language. This unification (which echoes predictions of a coming convergence or singularity) is one of the reasons why LLMs have become the central foundation models and why change has appeared to occur so quickly. Public access to a functioning GPT on November 30, 2022, also helped shape that perception.
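
For the mathematically curious, here is a bare-bones sketch of the scaled dot-product self-attention step described by Vaswani et al. (2017), written in Python with NumPy. The tiny four-token example and its random numbers are invented; real transformers add learned projection matrices, multiple attention heads, and many stacked layers.

```python
import numpy as np

def self_attention(Q, K, V):
    """Scaled dot-product attention: every token attends to every other token."""
    d_k = K.shape[-1]
    # How strongly each token relates to every other token, all pairs at once.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax turns the scores into attention weights that sum to 1 per token.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each token's new representation is a context-weighted mix of all tokens.
    return weights @ V

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))          # 4 tokens, each an 8-number vector
contextualized = self_attention(tokens, tokens, tokens)
print(contextualized.shape)               # (4, 8): same tokens, now context-aware
```

The key point is the scores matrix: it compares every token with every other token in a single matrix multiplication, which is what lets the whole sequence be processed at once instead of word by word.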