# Benchmarking Basics

We define a ***benchmark*** as _the execution of an algorithm on a specific hardware platform and under a precise configuration of hyperparameters_ (and fed with specific input data), _while measuring some interesting metrics_. Under this definition, running a benchmark means selecting a particular AI asset and executing it on a specific HW device with a specific hyperparameter configuration; multiple benchmarking runs are needed to explore multiple HW resources and/or configurations. Identifying the hyperparameters to explore and the target metrics to measure is crucial for the effective execution of a benchmarking suite.

(bench_pre)=
## Preliminaries

The definition of the execution of an AI asset varies by domain. For instance, in the case of a DNN it might refer to the training phase. In this case, the set of hyperparameters ranges from the DNN topology (e.g., the overall structure of the neural network, the number of layers, the number of nodes per layer) to the parameters governing the DNN’s training (from the learning rate to the batch size).

In any case, the first step is to clearly define the benchmarking goals, explicitly specifying the task (e.g., in the case of LLMs: summarization, classification, QA, etc.). Care must also be taken to ensure consistency in task formulation (e.g., input-output format).

To ensure reproducible benchmarks through a systematic approach, a series of preliminary questions must be addressed to guide the benchmarking process.

- Which [HW resources](bench_hw) are to be used?
  - What is the pool of devices and HW architectures available? High-performance and cloud computing, consumer-grade machines, embedded systems, etc.
  - This can be answered by taking into account the cost of the HW and the budget available to run the benchmarks.
- If the AI asset’s behaviour is governed by [hyperparameters](bench_hp), which are the ones to be explored? How should the values of such hyperparameters be chosen?
- A vast range of potentially interesting [metrics](bench_metrics) could be measured; which ones are useful depends on the scope of the benchmark.
  - What is the cost associated with measuring the targets?
- Does the input [data](bench_data) fed to the AI asset impact the metrics we are interested in? Some metrics might be impacted while others might be agnostic to the input.

(bench_workflow)=
## Workflow

The benchmark workflow is composed of the following steps:

(bench_asset)=
### AI Asset Selection

As a starting point, the AI asset to be benchmarked has to be selected.

- Selecting an AI asset involves choosing a specific AI algorithm, such as an optimization model (e.g., constraint programming), an agent-based simulation, or an ML model. For ML models, distinguish between training and inference, as they are different AI assets. While this might seem trivial, it's crucial, especially for non-experts, to carefully consider and clarify the algorithm of interest to avoid confusion.

After having chosen the target AI asset, it's important to provide input and output data specifications:

- Input Data Format - clearly specify the size and structure of the input data the AI software will receive.
- Output Data Format - define the expected format and structure of the output data the AI software will produce.

(bench_hw)=
### Hardware Resources

The pool of HW resources to be benchmarked has to be chosen according to the available budget and/or the HW already at one's disposal.

- Choose the most suitable hardware platform for the task, such as HPC or cloud resources, GPU servers, compute servers, embedded devices, or IoT devices.
- The monetary budget determines the number of runs on different hardware platforms, considering the costs and pricing schemes (e.g., purchasing devices for embedded systems, or calculating HPC costs based on CPU usage and duration).
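
As a rough illustration of how the budget bounds the number of runs, the sketch below assumes a simple pay-per-hour pricing scheme. The platform names, hourly prices, and run durations are hypothetical placeholders, not real quotes, and `affordable_runs` is an illustrative helper rather than part of any existing tool.

```python
import math


def affordable_runs(budget: float, price_per_hour: float, hours_per_run: float) -> int:
    """Number of complete runs that fit in the budget under hourly pricing."""
    if price_per_hour <= 0 or hours_per_run <= 0:
        raise ValueError("price and duration must be positive")
    cost_per_run = price_per_hour * hours_per_run
    return math.floor(budget / cost_per_run)


# Hypothetical platforms: (price per hour, expected hours per run).
platforms = {
    "gpu_server": (2.50, 4.0),
    "cpu_cluster": (0.80, 10.0),
}

# Assumption: the whole budget is spent on a single platform at a time.
budget = 100.0
runs = {
    name: affordable_runs(budget, price, hours)
    for name, (price, hours) in platforms.items()
}
print(runs)
```

Other pricing schemes (e.g., a one-off purchase of an embedded device) would need a different cost model; the point is only that the run count should be derived from the budget before the benchmark starts.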
(bench_hp)=
### Hyperparameters

**Parameters** define the model’s capacity, while **hyperparameters** govern how that capacity is utilized. Effective benchmarking requires careful control of hyperparameters and transparency about parameters. Use task-specific insights to guide model selection, balancing performance and resource efficiency.

Depending on the AI asset, a selection of the hyperparameters of interest must be made.

- This can be done only by someone with sufficient knowledge about the AI asset to be benchmarked.
  - Some examples: the parameters governing the agents’ behaviours in Agent-Based models; the number of training epochs, learning rate, number of layers, etc., in the case of DNN training; the number of scenarios to be considered in stochastic optimization models; the number of threads in parallel applications, and so on.
- Additionally, a search strategy over the hyperparameter space must be defined, especially if multiple hyperparameters with many possible values have to be explored.
  - The hyperparameter space tends to grow very large very quickly, so exploring it becomes computationally challenging; hence the need for an efficient search strategy able to balance exploration and exploitation.
  - In general, common approaches are random search, grid search, Latin Hypercube Sampling ([LHS](https://www.jstor.org/stable/1268522?origin=crossref)), Bayesian Optimization ([BO](https://www.sciencedirect.com/science/article/pii/S1674862X19300047)), genetic algorithms ([GA](https://www.sciencedirect.com/science/article/pii/S1674862X19300047)), etc.

#### Parameters vs. Hyperparameters in Large Language Models (LLMs)

In the context of Large Language Models (LLMs), the distinction between parameters and hyperparameters becomes even more critical due to the complexity and scale of these models.
Understanding these concepts is essential when benchmarking LLMs for specific tasks to ensure reproducibility, fairness, and insightful comparisons.

1. Parameters in LLMs

Parameters are the internal variables learned by the model during training. They encode the knowledge extracted from the training data and directly affect the model's ability to make predictions.

- Weights of Attention Layers:
  - Attention heads use learned weights to compute key, query, and value vectors
- Weights in Feed-Forward Layers:
  - Dense layers that transform intermediate representations
- Embeddings:
  - Learned representations of input tokens and positional encodings

Key Characteristics:

- Scale: LLMs often have billions (or even trillions) of parameters, e.g., GPT-3 has 175 billion parameters
- Fixed After Training: Parameters are determined during pretraining and remain constant during inference
- Task Adaptation: Fine-tuning or instruction-tuning may adjust parameters for specific tasks
- Impact on Performance: Larger models (with more parameters) generally have greater capacity but also higher computational costs

Benchmarking Implications:

- Larger parameter counts do not always guarantee better performance on a specific task
- When benchmarking, evaluate whether the additional capacity is necessary for the task at hand (e.g., small models may suffice for simpler tasks)

2. Hyperparameters in LLMs

Hyperparameters are external settings that define how the model is trained or fine-tuned and influence performance, efficiency, and resource consumption.

Types of Hyperparameters in LLM Context:

- Training Hyperparameters:
  - Learning Rate: Determines how quickly model weights are updated during training
  - Batch Size: Number of training examples processed in a single forward/backward pass
  - Optimizer Settings: Choices like AdamW vs. SGD, weight decay, and momentum
- Architecture Hyperparameters:
  - Number of Layers: Depth of the model
  - Hidden Size: Dimensionality of intermediate representations
  - Number of Attention Heads: Affects how attention is distributed in multi-head attention
- Inference Hyperparameters:
  - Temperature: Controls randomness in output text (higher = more diverse)
  - Top-k / Top-p Sampling: Restricts token sampling to the k most probable tokens (top-k) or to the smallest set of tokens whose cumulative probability reaches p (top-p)
  - Max Tokens: Sets the maximum length of the generated output

Key Characteristics:

- Hyperparameters are not learned during training and must be specified manually or through optimization methods
- Different hyperparameters are relevant for training, fine-tuning, and inference

Benchmarking Implications:

- Selecting optimal hyperparameters for fine-tuning or inference is critical for fair comparisons
- Default hyperparameters may not be suitable for all tasks; evaluate their impact empirically

(bench_data)=
### Datasets

If the AI asset requires data to be executed, it must be provided (for instance, training samples for training Machine Learning models or input instances to solve optimization problems). Select or create appropriate datasets that reflect the real-world scenarios and challenges relevant to your AI system or task. It's crucial to ensure the datasets are diverse, representative, and cover a wide range of possible inputs to obtain reliable benchmarking results.

The data must be described in detail using metadata. To ensure full reproducibility, it is better to follow a precise [checklist](./reproducibility.md). It is equally important to check whether the available data is annotated (and whether the annotation process is well-documented, e.g., to better identify possible sources of bias).

To obtain trustworthy benchmarking results it is crucial to: 1) consider the amount of data required and 2) use the same dataset and pre-processing steps across models.
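
One possible way to make "the same dataset and pre-processing steps" checkable is to store a small metadata record with a content fingerprint next to each dataset. The sketch below is only an illustration: the field names, the `describe_dataset` helper, and the toy records are assumptions, not a prescribed metadata schema.

```python
import hashlib
import json


def dataset_fingerprint(records: list[dict]) -> str:
    """SHA-256 over a canonical JSON serialization of the records.

    Key order inside each record is normalized; record order is not,
    so reordering the samples changes the fingerprint.
    """
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def describe_dataset(name: str, records: list[dict], preprocessing: list[str]) -> dict:
    """Build a minimal metadata record (hypothetical schema)."""
    return {
        "name": name,
        "num_samples": len(records),
        "fields": sorted({key for record in records for key in record}),
        "preprocessing": list(preprocessing),
        "sha256": dataset_fingerprint(records),
    }


# Toy in-memory dataset standing in for real benchmark data.
samples = [{"text": "hello", "label": 0}, {"text": "world", "label": 1}]
meta = describe_dataset("toy_dataset", samples, ["lowercase", "strip_whitespace"])
print(json.dumps(meta, indent=2))
```

Two benchmark runs can then be compared only if their `sha256` and `preprocessing` entries match.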
(bench_metrics)=
### Metrics

As a final choice, the metrics to be measured must be decided, again according to what is relevant to the AI asset in question. The choice of the metrics also depends on the HW platform selected, since a metric that is easy to measure on one platform could be much harder to measure on another (e.g., an embedded device).

The metrics can be grouped in four (very broad) classes:

1. Accuracy -- Task-specific performance metrics
2. Efficiency -- Latency, memory usage, and computational cost
3. Robustness -- Performance on adversarial or noisy inputs
4. Generalization -- Performance across different datasets for the same task

See [Metrics](./metrics.md) for more details.

(bench_out)=
### Benchmark Output

Define the expected output format for the benchmarking process to ensure clarity and ease of integration into downstream tasks (e.g., automated HW device selection for AI algorithms). The output could be a simple CSV file with accompanying metadata.

The output of the benchmark workflow can take different forms. The simplest is a data set (or data frame) organised in a structured format. The first choice to be made is how to organise the results:

- One data set per AI asset
  - PROS: Simplifies the management and storage of the results, facilitating the retrieval for reuse
  - CONS: It is relatively more complicated to add new benchmarking results in case the same AI asset is run on a different HW platform at a later date
- Multiple data sets per asset (e.g., a different data set for each different HW resource where the asset has been benchmarked)
  - PROS: Adding a new resource simply means adding a new data set -- without interfering with already existing ones
  - CONS: Slightly more complicated management; e.g., different data sets must share the same structure (format) to guarantee compatibility

See [Output](./output.md) for more details.
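
The "multiple data sets per asset" option can be sketched with one CSV file per (asset, HW resource) pair and a shared column list that guards compatibility. The column names, the file naming scheme, and the helpers `append_result` and `load_all` are assumptions made for this illustration only.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical shared schema: every per-hardware file uses these columns.
FIELDS = ["asset", "hardware", "batch_size", "latency_s", "memory_mb"]


def append_result(out_dir: Path, row: dict) -> Path:
    """Append one run to the CSV of its (asset, hardware) pair."""
    if set(row) != set(FIELDS):
        raise ValueError(f"row must have exactly these fields: {FIELDS}")
    path = out_dir / f"{row['asset']}__{row['hardware']}.csv"
    is_new = not path.exists()
    with path.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()  # header only once per file
        writer.writerow(row)
    return path


def load_all(out_dir: Path, asset: str) -> list[dict]:
    """Merge every per-hardware file of an asset (values are read back as strings)."""
    rows = []
    for path in sorted(out_dir.glob(f"{asset}__*.csv")):
        with path.open(newline="") as handle:
            rows.extend(csv.DictReader(handle))
    return rows


out_dir = Path(tempfile.mkdtemp())
# Made-up measurements, used only to exercise the helpers.
append_result(out_dir, {"asset": "toy_model", "hardware": "gpu_a", "batch_size": 32, "latency_s": 0.12, "memory_mb": 900})
append_result(out_dir, {"asset": "toy_model", "hardware": "gpu_a", "batch_size": 64, "latency_s": 0.20, "memory_mb": 1500})
append_result(out_dir, {"asset": "toy_model", "hardware": "cpu_b", "batch_size": 32, "latency_s": 1.10, "memory_mb": 700})
merged = load_all(out_dir, "toy_model")
print(len(merged), "rows from", len(list(out_dir.glob("*.csv"))), "files")
```

Adding a new HW resource only creates a new file, while the schema check keeps all files mergeable.
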
In any case, report findings transparently, clearly documenting: 1) model parameters and architecture; 2) training and inference hyperparameters; 3) dataset details (size, splits, preprocessing); 4) computational resources used (e.g., GPU/TPU type, runtime).
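
The workflow as a whole can be sketched as a loop over hyperparameter configurations that records a metric for each run. This is a minimal sketch under stated assumptions: `toy_asset` is a stand-in for a real AI asset, the search strategy is a plain grid search, and wall-clock latency is the only metric measured.

```python
import itertools
import platform
import time


def toy_asset(n_items: int, n_passes: int) -> int:
    """Stand-in for a real AI asset: sums squares n_passes times."""
    total = 0
    for _ in range(n_passes):
        total += sum(i * i for i in range(n_items))
    return total


def run_benchmark(asset, grid: dict, repeats: int = 3) -> list[dict]:
    """Grid search: run the asset on every configuration and time it."""
    names = list(grid)
    results = []
    for values in itertools.product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            asset(**config)
            timings.append(time.perf_counter() - start)
        results.append({
            **config,
            "hardware": platform.machine(),  # coarse description of the host CPU
            "latency_s_min": min(timings),
            "latency_s_mean": sum(timings) / len(timings),
        })
    return results


# Hypothetical hyperparameter grid for the toy asset.
grid = {"n_items": [1_000, 10_000], "n_passes": [1, 5]}
results = run_benchmark(toy_asset, grid)
for row in results:
    print(row)
```

Each row carries the configuration, a hardware descriptor, and the measured metric, which is the minimum needed to make a run traceable; a real benchmark would replace the grid with a more efficient search strategy once the hyperparameter space grows.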