OV - - Global

exec - 2025-10-16 14:59:11 - MAQAO 2025.1.2

Filter Information

143 threads, each covering less than 1% of the profiled time (= Max (Thread Active Time)), were discarded; together they account for 8.08 seconds of CPU time. The thread-filter-threshold option sets the threshold below which a thread is discarded.
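
The filtering rule stated above can be sketched as follows. This is a minimal illustration, not MAQAO's implementation: the function name `filter_threads` and the per-thread times in the sample are hypothetical, and only the 1% threshold and the 7.75 s maximum come from this report.

```python
def filter_threads(active_times, threshold_pct=1.0):
    """Split per-thread active times (seconds) into kept and discarded lists.

    A thread is discarded when its active time is strictly below
    threshold_pct percent of the longest thread active time.
    """
    if not active_times:
        return [], []
    cutoff = max(active_times) * threshold_pct / 100.0
    kept = [t for t in active_times if t >= cutoff]
    discarded = [t for t in active_times if t < cutoff]
    return kept, discarded


# Hypothetical sample: one thread at the reported maximum (7.75 s),
# two busy threads and two nearly idle ones.
sample = [7.75, 6.10, 5.90, 0.05, 0.02]
kept, discarded = filter_threads(sample)
print(len(kept), len(discarded), round(sum(discarded), 2))
```

With the reported maximum of 7.75 s, a 1% threshold corresponds to a cutoff of 0.0775 s. The 143 discarded threads average 8.08 / 143 ≈ 0.0565 s each, which is consistent with that cutoff.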

Global Metrics

Total Time (s)		150.87
Max (Thread Active Time) (s)		7.75
Average Active Time (s)		6.00
Activity Ratio (%)		4.51
Average number of active threads		9.548
Affinity Stability (%)		79.9
GFLOPS		254.191
Time in analyzed loops (%)		4.48
Time in analyzed innermost loops (%)		4.38
Time in user code (%)		39.7
Compilation Options Score (%)		100
Array Access Efficiency (%)		89.0

Potential Speedups
Perfect Flow Complexity		1.00
Perfect OpenMP/MPI/Pthread/TBB		1.66
Perfect OpenMP/MPI/Pthread/TBB + Perfect Load Distribution		3.16
No Scalar Integer	Potential Speedup	1.00
No Scalar Integer	Nb Loops to get 80%	5
FP Vectorised	Potential Speedup	1.00
FP Vectorised	Nb Loops to get 80%	4
Fully Vectorised	Potential Speedup	1.02
Fully Vectorised	Nb Loops to get 80%	9
FP Arithmetic Only	Potential Speedup	1.02
FP Arithmetic Only	Nb Loops to get 80%	7

CQA Potential Speedups Summary

Average Active Threads Count

FLOPS Breakdown

Loop Based Profile

Innermost Loop Based Profile

Application Categorization

Compilation Options

Source Object	Issue
libllama.so
  llama-vocab.cpp
  hashtable.h
  hashtable_policy.h
libggml-cpu.so
  binary-ops.cpp
  traits.cpp
  vec.cpp
  common.h
  sgemm.cpp
  amx.cpp
  repack.cpp
  ggml-cpu.cpp
  ops.cpp
  quants.c
  ggml-cpu.c
  mmq.cpp
libggml-blas.so
  ggml-blas.cpp
libggml-base.so
  ggml.c
exec
  basic_string.h

Loop Path Count Profile

Cumulated Speedup If No Scalar Integer

Cumulated Speedup If FP Vectorized

Cumulated Speedup If Fully Vectorized

Cumulated Speedup If FP Arithmetic Only

Experiment Summary

Experiment Name
Application	/beegfs/hackathon/users/eoseret/qaas_runs_test/176-060-7658/intel/llama.cpp/run/binaries/gcc_3/exec
Timestamp	2025-10-16 14:59:11	Universal Timestamp	1760619551
Number of processes observed	1	Number of threads observed	240
Experiment Type	MPI; OpenMP;
Machine	isix06.benchmarkcenter.megware.com
Model Name	Intel(R) Xeon(R) 6972P
Architecture	x86_64	Micro Architecture	GRANITE_RAPIDS
Cache Size	491520 KB	Number of Cores	96
OS Version	Linux 5.14.0-570.39.1.el9_6.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Sep 4 05:08:52 EDT 2025
Architecture used during static analysis	x86_64	Micro Architecture used during static analysis	GRANITE_RAPIDS
Frequency Driver	intel_pstate	Frequency Governor	performance
Huge Pages	always	Hyperthreading	on
Number of sockets	2	Number of cores per socket	96
Compilation Options
	exec: GNU C++17 14.2.0 -march=graniterapids -mprefer-vector-width=256 -g -O3 -O3 -O3 -funroll-loops -ffast-math -fno-omit-frame-pointer -fcf-protection=none -fno-finite-math-only
	libggml-base.so: GNU C11 14.2.0 -march=graniterapids -mprefer-vector-width=256 -g -O3 -O3 -O3 -std=gnu11 -funroll-loops -ffast-math -fno-omit-frame-pointer -fcf-protection=none -fno-finite-math-only -fPIC
	libggml-blas.so: GNU C++17 14.2.0 -march=graniterapids -mprefer-vector-width=256 -g -O3 -O3 -O3 -std=gnu++17 -funroll-loops -ffast-math -fno-omit-frame-pointer -fcf-protection=none -fno-finite-math-only -fPIC
	libggml-cpu.so: GNU C++17 14.2.0 -march=graniterapids -mprefer-vector-width=256 -g -O3 -O3 -O3 -std=gnu++17 -funroll-loops -ffast-math -fno-omit-frame-pointer -fcf-protection=none -fno-finite-math-only -fPIC -fopenmp
	libllama.so: GNU C++17 14.2.0 -march=graniterapids -mprefer-vector-width=256 -g -O3 -O3 -O3 -funroll-loops -ffast-math -fno-omit-frame-pointer -fcf-protection=none -fno-finite-math-only -fPIC
Comments
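
The two timestamp fields above can be cross-checked with a short sketch. It assumes that "Universal Timestamp" counts seconds since 1970-01-01 UTC and that "Timestamp" is the machine's local wall-clock time; the report states neither explicitly, and the constant names are illustrative.

```python
from datetime import datetime, timezone

UNIVERSAL_TIMESTAMP = 1760619551          # "Universal Timestamp" from the report
LOCAL_TIMESTAMP = "2025-10-16 14:59:11"   # "Timestamp" from the report

# Assumption: the universal timestamp is Unix epoch seconds (UTC).
utc_time = datetime.fromtimestamp(UNIVERSAL_TIMESTAMP, tz=timezone.utc)

# Parse the local wall-clock time as a naive datetime and compare it with
# the UTC wall-clock time to infer the machine's UTC offset.
local_time = datetime.strptime(LOCAL_TIMESTAMP, "%Y-%m-%d %H:%M:%S")
utc_offset = local_time - utc_time.replace(tzinfo=None)

print(utc_time.strftime("%Y-%m-%d %H:%M:%S"))  # 2025-10-16 12:59:11
print(utc_offset)                               # 2:00:00
```

Under these assumptions the two fields agree if the machine clock was two hours ahead of UTC when the run was recorded.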

Configuration Summary

Dataset
Run Command	<executable> -m meta-llama-3.1-8b-instruct-Q8_0.gguf -t 192 -n 0 -p 512 -r 3
MPI Command	mpirun -n <number_processes>
Number Processes	1
Number Nodes	1
Number Processes per Node	1
Filter	Not Used
Profile Start	Not Used
Profile Stop	Not Used
Maximal Path Number	4
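
As a rough illustration of how the Run Command and MPI Command templates above might combine into one launch line, the sketch below fills in the `<...>` placeholders. How MAQAO actually assembles the command is an assumption here; the helper `build_launch_command` and the `./exec` path are hypothetical.

```python
import shlex


def build_launch_command(mpi_command, run_command, executable, n_processes):
    """Fill the <...> placeholders and prepend the MPI launcher."""
    mpi = mpi_command.replace("<number_processes>", str(n_processes))
    run = run_command.replace("<executable>", executable)
    return shlex.split(mpi) + shlex.split(run)


argv = build_launch_command(
    mpi_command="mpirun -n <number_processes>",
    run_command="<executable> -m meta-llama-3.1-8b-instruct-Q8_0.gguf "
                "-t 192 -n 0 -p 512 -r 3",
    executable="./exec",  # hypothetical path; the report lists the real one under Application
    n_processes=1,        # "Number Processes" from the table above
)
print(shlex.join(argv))
```

Returning the command as an argument list rather than a single string keeps it usable with `subprocess.run` without involving a shell.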
