This LLM Cheat Sheet distils the essential terminology of large language model research and engineering into a single, accessible guide. From architectures and core components to training strategies and evaluation benchmarks, it provides precise definitions to help you navigate technical papers, model documentation, and benchmark results.
SwissCognitive Guest Blogger: Bin Liang, Algorithm Lead – “LLM Terminology Cheat Sheet for AI Practitioners in 2025”

As you may have noticed, the field of LLMs evolves at an unprecedented pace. Architectures, training methods, and best practices shift within months.
While some concepts in this cheatsheet may seem dated or superseded by newer approaches, they remain valuable for building systematic understanding of how we arrived at today’s state-of-the-art.
Familiarity with earlier paradigms like BERT’s bidirectional encoding or early attention variants helps practitioners critically evaluate new techniques and understand architectural trade-offs.
Consider this a map of the LLM landscape: not every path is still traveled, but knowing the terrain enables better navigation. We strongly recommend skimming through to build foundational knowledge, then focusing deeper on concepts most relevant to your current needs.
If you work with large language models, you know the jargon can get dense fast. This guide puts the most important terms – from architectures and core components to training strategies and evaluation benchmarks – in one place, with precise definitions you can rely on when reading research papers, model documentation, or benchmark results.
Model Architectures & Types
Modern LLM architectures have evolved rapidly from early-stage bidirectional models to today’s dominant decoder-only designs.
While encoder-based models like BERT (2018) pioneered transfer learning and language representation in NLP, the field has largely shifted, particularly since the launch of GPT-3.5, to decoder-only generative Transformers (GPT, LLaMA, Qwen) for their stronger generative capabilities and efficiency.
When selecting an architecture & type, consider:
- Computational Budget: MoE models like GLaM activate only a few experts per token, so they can grow total parameter count while keeping FLOPs per token low.
- Deployment Constraints: Some open-source model families, such as the Qwen series, come in a wide range of sizes, making them easier to fit to different computing budgets.
- Task Requirements: Different model families specialize in different domains. For example, vision-language models (VLM) for multimodal tasks, or code-focused models for programming and reasoning over code.
- Capability Needs: Instruction-tuned models (often with names ending in -instruct) are better at following direct instructions, whereas reasoning models (often with names ending in -reasoning or -reasoner) excel at complex reasoning tasks.
Understanding this evolution helps practitioners distinguish between legacy concepts and current best practices when evaluating model options.
The following covers both foundational and emerging architectures to help you navigate technical documentation and make informed decisions.
- Transformer: Neural architecture based entirely on attention mechanisms, discarding recurrent (RNN) and convolutional (CNN) networks. Delivers strong performance in tasks such as machine translation, with higher parallelism and shorter training times.
- Encoder–Decoder Architecture: Standard sequence transduction structure. The encoder processes the input sequence and the decoder generates the output; in the highest-performing models, the two are connected via an attention mechanism.
- Decoder-Only Architecture: Transformer design using only the decoder stack, with causal self-attention restricting each token to attend only to itself and earlier positions. Optimised for autoregressive generation (e.g., text completion, code, dialogue). Simpler and more efficient than encoder–decoder for generative tasks; used in most modern LLMs such as GPT, LLaMA, and Qwen.
- BERT (Bidirectional Encoder Representations from Transformers): Bidirectional Transformer encoder pre-trained with Masked Language Modelling (MLM) and Next Sentence Prediction (NSP). Jointly attends to left and right context. Common sizes: BERT-BASE (110M parameters) and BERT-LARGE (340M).
- Masked Language Model (MLM): Pre-training objective masking ~15% of WordPiece tokens: 80% replaced with [MASK], 10% with random tokens, 10% unchanged. The model predicts the originals from both left and right context.
- Next Sentence Prediction (NSP): Pre-training objective predicting whether the second sentence follows the first (50% true, 50% random).
- OpenAI GPT: Left-to-right Transformer using causal self-attention, where tokens attend only to preceding context.
- ELMo (Embeddings from Language Models): Concatenates features from independently trained left-to-right and right-to-left LSTMs.
- DistilBERT: Compressed BERT retaining ~97% of its performance, with reduced size, cost, and latency.
- ERNIE (Enhanced Language Representation with Informative Entities): Incorporates entity information from knowledge graphs into language representations.
- GLaM (Efficient Scaling of Language Models with Mixture-of-Experts): MoE-based architecture outperforming GPT-3 in NLU/NLG, with lower FLOPs per token and reduced training energy.
- Mixture-of-Experts (MoE): Uses a gating module to dynamically select a small subset of experts (e.g., 2 of 64) per token; outputs are combined before the next layer.
- PaLM (Pathways Language Model): Model series with sizes from 8B to 540B parameters, trained on 780B high-quality tokens.
- phi-1 / phi-1-base / phi-1-small: Code generation models differing in API usage accuracy and logical consistency.
- Toolformer: Learns to use external tools such as WikiSearch or machine translation APIs.
- WebGPT: Browser-augmented question answering model using human feedback.
- InstructGPT: Fine-tuned with human feedback to better follow instructions.
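To make the Mixture-of-Experts entry above more concrete, here is a minimal sketch of top-k routing for a single token. It is illustrative only: the router scores are assumed to be precomputed, the four "experts" are hypothetical toy functions, and renormalising the scores over the selected experts is one common choice rather than the method of any particular model.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_layer(token, gate_logits, experts, k=2):
    """Route one token to its top-k experts and mix their outputs.

    token       -- list of floats (the token's hidden vector)
    gate_logits -- one router score per expert (assumed precomputed here)
    experts     -- list of callables, each mapping a vector to a vector
    """
    # Pick the k experts with the highest router scores.
    top = sorted(range(len(experts)), key=lambda i: gate_logits[i], reverse=True)[:k]
    # Renormalise the scores of the selected experts only.
    weights = softmax([gate_logits[i] for i in top])
    # Only the selected experts run; the others cost no compute for this token.
    outputs = [experts[i](token) for i in top]
    dim = len(outputs[0])
    mixed = [sum(w * out[d] for w, out in zip(weights, outputs)) for d in range(dim)]
    return mixed, top


# Four toy experts that scale the input by different factors (hypothetical).
experts = [lambda v, s=s: [s * x for x in v] for s in (1.0, 2.0, 3.0, 4.0)]
mixed, chosen = moe_layer([1.0, -1.0], gate_logits=[0.1, 2.0, 2.0, -1.0], experts=experts)
print(chosen, mixed)
```

With 2 of 64 experts active, the same pattern means each token pays for two expert forward passes even though the layer stores 64 sets of expert weights.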
Model Components & Mechanisms
Understanding core LLM components enables informed decisions on model customization and optimization. The attention mechanism remains fundamental: early variants such as additive attention still appear in the literature, but scaled dot-product attention dominates production systems because it is more efficient.
Business leaders should focus on mechanisms that affect real-world constraints: memory footprint (quantization, LoRA ranks), context window length (positional encoding choices), and inference latency (attention complexity). These technical decisions translate directly to infrastructure costs and application capabilities.
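As a rough illustration of how LoRA rank and quantization bit-width drive memory footprint, the arithmetic below compares one dense projection matrix with its LoRA factors and with 4-bit storage. The hidden size of 4096 and rank of 8 are hypothetical values chosen for the example, and the byte counts cover the weights alone (no quantization constants, activations, or optimizer state).

```python
def lora_trainable_params(d_in, d_out, r):
    """LoRA adds two low-rank factors: A (r x d_in) and B (d_out x r)."""
    return r * d_in + d_out * r


def weight_bytes(num_params, bits_per_param):
    """Storage for the weights alone, in bytes."""
    return num_params * bits_per_param / 8


d = 4096        # hypothetical hidden size
full = d * d    # parameters in one dense d x d projection
lora = lora_trainable_params(d, d, r=8)

print(f"full matrix: {full:,} params, LoRA r=8: {lora:,} params ({lora / full:.2%})")
print(f"16-bit storage: {weight_bytes(full, 16) / 2**20:.0f} MiB")
print(f" 4-bit storage: {weight_bytes(full, 4) / 2**20:.0f} MiB")
```

Under these assumptions the LoRA factors hold well under 1% of the parameters of the matrix they adapt, and 4-bit storage needs a quarter of the memory of 16-bit storage.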
Below are the key technical components and their definitions.
- Attention Mechanism: Maps a query (Q) and key–value pairs (K, V) to an output as the weighted sum of values, with weights determined by a compatibility function.
- Query (Q), Key (K), Value (V): Vector components in the attention function.
- Scaled Dot-Product Attention: Computed as softmax(QKᵀ / √dₖ)V; dividing by √dₖ keeps large dot products from pushing the softmax into regions with very small gradients.
- Additive Attention: Uses a feed-forward network to compute compatibility scores; less efficient than dot-product attention in practice.
- Self-Attention: Each token attends to all others in the sequence, capturing long-range dependencies.
- Attention Heads: Independent attention layers in multi-head attention, each focusing on different dependencies.
- Transformer Blocks: The repeated layer units in Transformer-based models such as BERT.
- Layers (L), Hidden Size (H), Self-Attention Heads (A): Core parameters defining a Transformer architecture.
- WordPiece Embeddings: Subword tokenisation used in BERT, with a 30K-token vocabulary.
- [CLS] Token: Special classification token placed at sequence start; its final hidden state represents the sequence for classification.
- [SEP] Token: Separator token distinguishing sentences or segments.
- Token, Segment, and Position Embeddings: Elements combined to form token input representations.
- Rotary Position Embedding (RoPE): Encodes positional information via rotation, preserving vector norms.
- Rotary Matrix: Predefined matrix in RoPE for applying rotations.
- Linear Attention: Self-attention variant with linear complexity, compatible with RoPE.
- Content-Based Key Vectors: Weight matrices for computing content-based keys.
- Location-Based Key Vectors: Weight matrices for computing location-based keys.
- Bitwise Determinism: Ensures exact reproducibility at bit level from any checkpoint.
- Outliers: Extreme values that reduce quantisation precision; mitigated by quantising in small blocks, each with its own scaling constant.
- Vector-Wise Quantisation: Improves quantisation by applying scaling per vector.
- NF4 (NormalFloat 4-bit): Quantile-based 4-bit datatype designed for normally distributed weights, with an exact representation of zero.
- FP8: 8-bit floating-point datatype for scaling constants.
- Blocksize: Block size in quantisation, affecting precision and memory.
- Trainable Matrices: Low-rank matrices updated in LoRA fine-tuning.
- Frozen Weights: Pre-trained weights kept fixed during tuning.
- dmodel: Transformer layer input/output dimension.
- LoRA (Low-Rank Adaptation): A fine-tuning method that injects trainable low-rank matrices into pre-trained models while keeping original weights frozen.
- Rank (r): Rank of LoRA matrices.
- dffn: Dimension of Transformer MLP, usually 4 × dmodel.
- Subspace Similarity: Measures overlap between ∆W and W subspaces.
- Segment-Level Recurrence Mechanism: Transformer-XL technique reusing hidden states to extend context.
- Relative Positional Encodings: Encodes relative distances into attention scores.
- Bi-Encoder Architecture: Uses separate encoders for queries and documents.
- Maximum Inner Product Search (MIPS): Retrieves top-k documents with highest similarity scores.
- Non-Parametric Memory: Document index in retrieval-augmented models.
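The scaled dot-product attention formula listed above, softmax(QKᵀ / √dₖ)V, can be sketched in plain Python as follows. This is a teaching sketch on lists of row vectors rather than an efficient implementation; the function name and the optional `causal` flag (which mimics decoder-style masking) are illustrative choices.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V, causal=False):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors.

    Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v.
    With causal=True, query i may only attend to keys 0..i.
    """
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        # Compatibility score between this query and every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Masked positions get -inf, which softmax turns into weight 0.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the value vectors.
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return out


Q = K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(scaled_dot_product_attention(Q, K, V, causal=True))
```

With the causal mask on, the first position can only see itself, so its output is exactly the first value vector; later positions blend the values they are allowed to see.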
Training Methods & Strategies
Training strategy determines both model capability and data efficiency.
The modern LLM training pipeline typically follows: large-scale pre-training → supervised fine-tuning (SFT) → preference alignment (RLHF, RLAIF, or DPO).
When choosing strategies, weigh:
- Available Compute Budget
- Data Quality & Quantity
- Desired Model Behavior (creative vs. factual)
- Time-to-deployment
Understanding these trade-offs helps practitioners avoid over-engineering while achieving business objectives efficiently.
This section details training approaches from pre-training objectives to alignment techniques, helping you choose the right methodology for your use case.
- Pre-Training: Initial training on unlabelled data using objectives such as MLM and NSP (BERT) or next-token prediction (GPT-style models).
- Fine-Tuning: Adapting a pre-trained model to a specific task with labelled data.
- Semi-Supervised Approach: Combines unsupervised pre-training and supervised fine-tuning for transferable representations.
- Two-Stage Training Procedure: Unsupervised pre-training followed by supervised adaptation.
- Unsupervised Objective: Pre-training target that does not require labels.
- Denoising Objective: Reconstructs corrupted inputs.
- Span-Corruption Objective: Masks contiguous token spans, predicting the originals.
- MASS-Style Objective: Masks 15% of tokens, replacing them with mask tokens, then reconstructs the original sequence.
- BERT-Style Denoising Objective: Similar to MLM but used in encoder–decoder models to reconstruct full sequences.
- Deshuffling Objective: Predicts the original order from shuffled tokens.
- Multi-Task Training: Trains on multiple tasks concurrently.
- Multi-Task Pre-Training: Pre-trains across multiple tasks.
- Instruction Tuning: Fine-tunes on datasets reformatted as natural language instructions.
- Prompt Engineering: Optimising prompts for desired outputs.
- LoRA (Low-Rank Adaptation): Parameter-efficient fine-tuning that trains low-rank update matrices while the pre-trained weights stay frozen; the update can be merged into the weights after training.
- Parameter-Efficient Approach: Fine-tuning with fewer trainable parameters.
- Random Gaussian Initialisation: LoRA matrix A initialisation method.
- Zero Initialisation: LoRA matrix B initialisation producing zero ∆W.
- Bias-Only / BitFit: Tunes only bias terms.
- Prefix-Embedding Tuning: Inserts special tokens with trainable embeddings.
- Prefix-Layer Tuning: Extends prefix tuning to layer activations.
- Adapter Tuning: Inserts adapter layers between attention/MLP modules and residuals.
- Prefix-Tuning: Optimises a continuous task-specific prefix vector without changing model weights.
- Continuous Task-Specific Vectors: Learnable prefix parameters not tied to real tokens.
- Virtual Tokens: Prefix vectors treated as pseudo-tokens.
- Random Initialisation: Randomly initialising prefixes; less effective than real-token initialisation.
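- QLoRA (Quantised LoRA): LoRA fine-tuning on a frozen base model stored in 4-bit NF4, with double quantisation of the quantisation constants (to FP8) to save further memory.
- Double Dequantisation: Recovers higher-precision weights from 4-bit storage by first dequantising the quantisation constants and then the weights themselves.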
- RLHF (Reinforcement Learning from Human Feedback): Aligns models using human preference data.
- RLAIF (Reinforcement Learning from AI Feedback): Uses AI-generated preference labels for alignment.
- AI-Generated Preference Labels: AI-produced quality judgements for candidate outputs.
- Reward Model (RM): Predicts reward signals for RLHF/RLAIF.
- Self-Consistency: Samples multiple reasoning chains (CoT) and averages preferences.
- PPO (Proximal Policy Optimisation): RL algorithm used in RLHF.
- SFT (Supervised Fine-Tuning): Fine-tunes on curated supervised data.
- Process Supervision: Labels intermediate reasoning steps.
- Outcome Supervision: Labels only final results.
- MathMix: Math-focused token dataset for pre-training.
- Decontamination Checks: Ensures no benchmark leakage in training data.
- Weak-to-Strong Generalisation: Trains a strong model under weak supervision.
- Weak Supervisor: Model producing weak labels.
- Weak Labels: Soft labels from weak supervision.
- Strong Student: Model trained under weak supervision that surpasses the supervisor.
- Imitation: Failure mode where the student copies supervisor errors.
- Human Simulator Failure Mode: Risk of models imitating human phrasing instead of optimal answers.
- Linear Probing: Trains a linear classifier on the representations of a frozen model.
- Covariate Shift Problem: Training/test distribution mismatch.
- Concept Shift: Change in label meaning or distribution.
- Noisy Labels: Incorrect labels in data.
- FLOPs per Token: Inference compute cost measure.
- Greedy Decoder: Decoding strategy selecting highest-probability token each step.
- AdamW Optimiser: Adam variant with decoupled weight decay; used to train models such as LLaMA and in LoRA fine-tuning.
- Cosine Learning Rate Schedule: Cosine-shaped learning rate decay.
- Weight Decay: Regularisation reducing overfitting.
- Gradient Clipping: Caps gradient magnitude.
- Warmup Steps: Gradually increase learning rate at training start.
- Batch Size: Number of samples per training step.
- Epoch: One full pass over the training dataset.
- Learning Rate: Step size for weight updates.
- Reward Function (πrf): Source of training rewards in alignment.
- Gradient Coefficient (GC): Scales penalty/reward magnitude.
- RFT (Rejection Sampling Fine-Tuning): Fine-tunes on outputs sampled from the SFT model and filtered for correct answers.
- DPO (Direct Preference Optimisation): Preference-based alignment without reinforcement learning.
- Online RFT: RFT variant that samples outputs from the current policy model instead of the SFT model.
- GRPO (Group Relative Policy Optimisation): PPO variant that drops the critic model and uses the average reward of a group of sampled outputs as the baseline.
- Continual Training: Continues training for domain adaptation.
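To illustrate how DPO, listed above, aligns a model from preference pairs without a reinforcement learning loop, here is a sketch of the standard per-pair DPO loss. It assumes the summed log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model have already been computed; the function name, argument names, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen margin - rejected margin))."""
    # How much more the policy favours each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) equals log(1 + exp(-x)); written in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# Policy identical to the reference: the loss is log(2).
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
# Policy has shifted probability towards the chosen response: the loss is lower.
print(dpo_loss(-3.0, -8.0, -5.0, -5.0))
```

The loss falls as the policy raises the likelihood of the preferred response relative to the reference and lowers that of the rejected one, which is the signal a separate reward model would otherwise provide.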
Evaluation Metrics & Datasets
Robust evaluation frameworks are critical for validating LLM performance before deployment in production. Modern evaluation spans multiple dimensions:
- Task-specific Metrics (BLEU for translation, Pass@k for code, MMLU for multi-domain knowledge)
- Safety Benchmarks (TruthfulQA for factuality, RealToxicityPrompts for harmful content)
- Domain Capabilities (GSM8K for mathematics, HumanEval for programming)
Business leaders should prioritize metrics aligned with their use cases. Customer service applications require low toxicity and high factual accuracy, while creative tools may benefit from more varied output. Comprehensive evaluation across multiple benchmarks provides realistic performance expectations and identifies model limitations.
The metrics and datasets below provide a reference for benchmarking model performance across diverse tasks and domains.
- BLEU: Translation quality metric, also used in code generation.
- Pass@k: Fraction of problems for which at least one of k generated code samples passes the unit tests.
- GLUE Benchmark: NLU benchmark with tasks such as CoLA, MNLI, MRPC, QNLI, QQP, RTE, SST-2, STS-B, WNLI.
- SQuAD: Reading comprehension dataset for QA.
- SICK Dataset: NLI dataset with entailment, contradiction, and neutral labels.
- Open Entity: Entity classification benchmark.
- FewRel, TACRED: Relation classification datasets.
- HumanEval: Code generation benchmark.
- SuperGLUE: More challenging NLU benchmark.
- WMT Language Pairs: Translation datasets for BLEU scoring.
- MMLU: Multi-domain knowledge understanding benchmark.
- GSM8K: Grade-school maths QA benchmark.
- MATH Dataset: Advanced mathematics problem set.
- PRM800K: Large dataset with step-level labels for math problem solving.
- RealToxicityPrompts: Toxicity evaluation dataset.
- CrowS Pairs: Measures bias in models.
- TruthfulQA: Tests factual accuracy and informativeness.
- CAIL2019-SCM: Chinese long-text semantic matching dataset.
- HotpotQA, FEVER: Multi-hop question answering (HotpotQA) and fact verification (FEVER) datasets.
- MS-MARCO, Jeopardy Question Generation: Used for retrieval-augmented generation.
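- CMATH, AGIEval: CMATH is a Chinese elementary-school mathematics benchmark; AGIEval is a benchmark built from human exams, whose Gaokao maths subsets are used for Chinese mathematics evaluation.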
- Rouge-1 / Rouge-2 / Rouge-L: Summarisation evaluation metrics.
- EM (Exact Match): QA accuracy measure.
- F1 Score: Common classification/NER metric.
- Accuracy: Overall correctness measure.
- Precision, Recall, Micro-F1: Entity and relation extraction metrics.
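Pass@k from the list above is usually estimated from more than k samples per problem. The sketch below uses the unbiased estimator popularised alongside HumanEval, 1 − C(n−c, k) / C(n, k), where n samples were drawn for a problem and c of them passed; the function names and the example numbers are illustrative assumptions.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k for one problem from n samples, c of which pass the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n samples contains no passing one.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical run: 10 samples per problem, with 3, 0 and 10 passing samples.
results = [(10, 3), (10, 0), (10, 10)]
print(mean_pass_at_k(results, k=1))
print(mean_pass_at_k(results, k=5))
```

Reporting the mean over problems at several values of k shows how much a model gains from being allowed more attempts.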
Other Key Concepts
Beyond core architectures and training, several cross-cutting concepts directly impact LLM productionization.
- In-context learning enables task adaptation without fine-tuning, reducing deployment friction.
- Temperature and decoding strategies (top-k sampling, greedy decoding) control output randomness, which is critical for consistency in business applications.
- Prompt engineering remains a practical skill for maximizing model utility without retraining.
- Length generalization and low-data regime performance affect real-world robustness.
For practitioners, understanding toxicity metrics (TPP, TPC), calibration curves (confidence vs. accuracy alignment), and entity/relationship extraction capabilities is essential for building reliable AI products. These operational considerations often determine success or failure in production environments, regardless of benchmark performance.
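How temperature, top-k, and greedy decoding control output randomness can be sketched with a small next-token sampler. This is an illustrative sketch, not any library's API: the function name is hypothetical, the logits are made-up numbers, and treating a temperature of 0 as greedy decoding is a convention assumed here.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """Pick a token index from raw logits.

    temperature == 0 is treated as greedy decoding (argmax).
    top_k keeps only the k highest-scoring tokens before sampling.
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # Rank token indices from most to least likely.
    candidates = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None:
        candidates = candidates[:top_k]
    # Lower temperature sharpens the distribution; higher temperature flattens it.
    scaled = [logits[i] / temperature for i in candidates]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(candidates, weights=weights, k=1)[0]


logits = [0.1, 0.5, 3.0, -1.0]      # made-up scores for a 4-token vocabulary
rng = random.Random(0)              # fixed seed so runs are repeatable
print(sample_next_token(logits, temperature=0))
print([sample_next_token(logits, temperature=1.5, top_k=2, rng=rng) for _ in range(10)])
```

For applications that need consistent answers, a low temperature or greedy decoding is the usual starting point; creative use cases tend to raise the temperature and widen top-k.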
Here are essential operational concepts spanning inference control, evaluation diagnostics, and knowledge extraction techniques.
- Universal Representation: Features transferable to multiple tasks.
- Low-Data Regime: Training with very limited data.
- Length Generalisation: Performance on sequences longer than training examples.
- Co-Occurrence Prompts: Analyses token co-occurrence patterns in generated text.
- Temperature: Sampling parameter controlling randomness.
- Top-k Sampling: Chooses from top-k tokens at each step.
- POS Tagger: Identifies part of speech for tokens.
- Toxicity Probability of the Prompt (TPP): Measures input prompt toxicity.
- Toxicity Probability of the Continuation (TPC): Measures toxicity in model outputs.
- Perspective API: Tool for assigning toxicity probabilities to text.
- Toxicity Degeneration: Unwanted toxic text generation.
- In-Context Learning: Learning from examples in the prompt without weight updates.
- Data Contamination: Evaluation data appearing in training sets.
- Calibration Curve: Plots predicted confidence vs. actual accuracy.
- Substring Match: Detects overlap between evaluation and training data.
- LLM Prompt: Instruction or example text for LLMs.
- Self-Reflection Iterations: Iteration count for reflection-based entity detection.
- Chunk Size: Affects detected entity count.
- Entity Extraction: Identifies named entities and attributes.
- Relationship Extraction: Identifies relationships between entities.
- Leiden Algorithm: Detects communities in graph data.
- Entity Nodes: Graph nodes representing entities.
- Graph Communities: Groups of related entities in a graph.
- Hierarchical Clustering: Reveals internal community structure.
- RLAIF vs RLHF: Compares AI-feedback and human-feedback reinforcement learning.
- Position Bias: Preference for specific positions in pairwise comparisons.
- Pairwise Accuracy: Reward model accuracy on held-out human preferences.
- ULMFiT: Universal fine-tuning method for text classification.
- Model Architectures: Fundamental model design types (e.g., Transformer, MoE).
- Reasoning Capability: Ability to perform logical reasoning and problem solving.
- External Knowledge: Use of information outside training data.
- High-Quality Data: Critical for improving performance and alignment.
- Human/AI Feedback: Mechanism for improving performance and alignment.
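As an illustration of the calibration curve entry above, the sketch below bins predictions by confidence and compares each bin's mean confidence with its observed accuracy. The equal-width binning, the function name, and the sample data are assumptions made for the example.

```python
def calibration_curve(confidences, correct, n_bins=5):
    """Return (mean confidence, accuracy, count) for each non-empty confidence bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    curve = []
    for items in bins:
        if items:
            mean_conf = sum(c for c, _ in items) / len(items)
            accuracy = sum(1 for _, ok in items if ok) / len(items)
            curve.append((mean_conf, accuracy, len(items)))
    return curve


# Made-up predictions: model confidence and whether the answer was right.
confidences = [0.95, 0.9, 0.85, 0.3, 0.35]
correct = [True, True, False, False, True]
for mean_conf, accuracy, count in calibration_curve(confidences, correct, n_bins=2):
    print(f"confidence {mean_conf:.2f} -> accuracy {accuracy:.2f} (n={count})")
```

A well-calibrated model has accuracy close to mean confidence in every bin; large gaps indicate over- or under-confidence.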
About the Author:
Bin Liang holds an MSc in Scientific and Data Intensive Computing from University College London (UCL).