OV - - Global

exec - 2025-10-16 12:06:24 - MAQAO 2025.1.2


Filter Information

145 threads, each covering less than 1% of the profiled time (Max (Thread Active Time)), were discarded; together they account for 8.39 seconds of CPU time. The threshold below which a thread is discarded can be adjusted with the thread-filter-threshold option.
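
The filter figures can be cross-checked against values reported elsewhere in this report (Max (Thread Active Time) of 7.50 s and 238 observed threads). The sketch below is only illustrative: the variable names are hypothetical, and it assumes the threshold is applied as a plain fraction of Max (Thread Active Time), as the message above suggests.

```python
# Figures copied from this report; names are illustrative, not MAQAO identifiers.
max_thread_active_time_s = 7.50   # Max (Thread Active Time) (s)
threshold_fraction = 0.01         # "less than 1% of profiled time"
threads_observed = 238            # Number of threads observed
threads_discarded = 145
discarded_cpu_time_s = 8.39

# Assumed cutoff: a thread is dropped when its time falls below this value.
cutoff_s = threshold_fraction * max_thread_active_time_s
threads_kept = threads_observed - threads_discarded
mean_discarded_s = discarded_cpu_time_s / threads_discarded

print(f"cutoff per thread: {cutoff_s:.3f} s")
print(f"threads kept: {threads_kept}")
print(f"mean time of a discarded thread: {mean_discarded_s:.3f} s")
```

Under these assumptions 93 threads remain, and the mean time of a discarded thread (about 0.058 s) sits below the 0.075 s cutoff, which is consistent with the filter message.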

Global Metrics

Total Time (s)		149.42
Max (Thread Active Time) (s)		7.50
Average Active Time (s)		5.82
Activity Ratio (%)		4.42
Average number of active threads		9.272
Affinity Stability (%)		80.5
GFLOPS		262.682
Time in analyzed loops (%)		5.30
Time in analyzed innermost loops (%)		5.18
Time in user code (%)		39.5
Compilation Options Score (%)		75.0
Array Access Efficiency (%)		89.1

Potential Speedups
Perfect Flow Complexity		1.00
Perfect OpenMP/MPI/Pthread/TBB		1.64
Perfect OpenMP/MPI/Pthread/TBB + Perfect Load Distribution		3.17
Scenario	Potential Speedup	Nb Loops to get 80%
No Scalar Integer	1.00	3
FP Vectorised	1.00	4
Fully Vectorised	1.03	7
FP Arithmetic Only	1.02	5
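
The small loop-level speedups above can be related to the 5.30% of time spent in analyzed loops through Amdahl's law. The sketch below is a rough illustration, not MAQAO's own computation. It assumes the percentage is relative to total application time and that the loop-level transformations only affect analyzed loops. The function name is hypothetical.

```python
def amdahl_speedup(fraction: float, local_speedup: float) -> float:
    """Overall speedup when `fraction` of run time is accelerated by `local_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

loop_fraction = 5.30 / 100.0  # Time in analyzed loops (%)

# Upper bound: the analyzed loops take no time at all.
upper_bound = 1.0 / (1.0 - loop_fraction)
print(f"upper bound from loops alone: {upper_bound:.3f}")

# Overall effect of a few hypothetical loop-local speedups.
for s in (2.0, 4.0, 8.0):
    print(f"loops {s:.0f}x faster -> overall {amdahl_speedup(loop_fraction, s):.3f}")
```

Under these assumptions the bound is about 1.056, so loop-level speedups between 1.00 and 1.03 are in the expected range. The larger potential gains in this report come from the parallel-runtime and load-distribution rows.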

[Charts not reproduced: CQA Potential Speedups Summary, Average Active Threads Count, FLOPS Breakdown, Loop Based Profile, Innermost Loop Based Profile, Application Categorization]

Compilation Options

Source Object	Issue
libllama.so
	hashtable.h
		-march=x86-64 is used but it should be replaced by a more architecture specific option or -march=native.
		-funroll-loops is missing.
	unicode.cpp
		-march=x86-64 is used but it should be replaced by a more architecture specific option or -march=native.
		-funroll-loops is missing.
	llama-vocab.cpp
		-march=x86-64 is used but it should be replaced by a more architecture specific option or -march=native.
		-funroll-loops is missing.
	hashtable_policy.h
		-march=x86-64 is used but it should be replaced by a more architecture specific option or -march=native.
		-funroll-loops is missing.
libggml-cpu.so
	binary-ops.cpp	-funroll-loops is missing.
	traits.cpp	-funroll-loops is missing.
	common.h	-funroll-loops is missing.
	sgemm.cpp	-funroll-loops is missing.
	vec.cpp	-funroll-loops is missing.
	amx.cpp	-funroll-loops is missing.
	mmq.cpp	-funroll-loops is missing.
	repack.cpp	-funroll-loops is missing.
	quants.c	-funroll-loops is missing.
	ggml-cpu.c	-funroll-loops is missing.
	ops.cpp	-funroll-loops is missing.
libggml-blas.so
	ggml-blas.cpp
		-march=x86-64 is used but it should be replaced by a more architecture specific option or -march=native.
		-funroll-loops is missing.
libggml-base.so
	ggml-quants.c
		-march=x86-64 is used but it should be replaced by a more architecture specific option or -march=native.
		-funroll-loops is missing.
	ggml.c
		-march=x86-64 is used but it should be replaced by a more architecture specific option or -march=native.
		-funroll-loops is missing.
exec
	(source file not identified)
		-g is missing for some functions (possibly ones added by the compiler), it is needed to have more accurate reports. Other recommended flags are: -O2/-O3, -march=(target)
		-O2, -O3 or -Ofast is missing.
		-march=(target) is missing.
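
The issues listed above follow from the per-object option strings in the Experiment Summary. The sketch below shows how such a check could be expressed. It is a simple heuristic written for illustration, not MAQAO's actual logic, and the function name `check_flags` is hypothetical.

```python
import shlex

def check_flags(options: str) -> list[str]:
    """Return issue strings for one compiler command line (heuristic only)."""
    flags = shlex.split(options)
    issues = []
    if "-march=x86-64" in flags:
        issues.append("-march=x86-64 is generic; consider a specific -march or -march=native")
    if not any(f.startswith("-march=") for f in flags):
        issues.append("-march=(target) is missing")
    if "-funroll-loops" not in flags:
        issues.append("-funroll-loops is missing")
    if not any(f in ("-O2", "-O3", "-Ofast") for f in flags):
        issues.append("-O2, -O3 or -Ofast is missing")
    return issues

# Option string as reported for libllama.so in the Experiment Summary.
libllama = ("-mtune=generic -march=x86-64 -g -O3 -O3 "
            "-fno-omit-frame-pointer -fcf-protection=none -fPIC")
for issue in check_flags(libllama):
    print(issue)
```

For the libllama.so string this reports the same two issues as the table: the generic -march=x86-64 and the missing -funroll-loops. Whether these flags actually improve run time for this workload would need to be measured.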

[Charts not reproduced: Loop Path Count Profile, Cumulated Speedup If No Scalar Integer, Cumulated Speedup If FP Vectorized, Cumulated Speedup If Fully Vectorized, Cumulated Speedup If FP Arithmetic Only]

Experiment Summary

Experiment Name
Application	/beegfs/hackathon/users/eoseret/qaas_runs_test/176-060-7658/intel/llama.cpp/run/base_runs/defaults/gcc/exec
Timestamp	2025-10-16 12:06:24
Universal Timestamp	1760609184
Number of processes observed	1
Number of threads observed	238
Experiment Type	MPI; OpenMP;
Machine	isix06.benchmarkcenter.megware.com
Model Name	Intel(R) Xeon(R) 6972P
Architecture	x86_64
Micro Architecture	GRANITE_RAPIDS
Cache Size	491520 KB
Number of Cores	96
OS Version	Linux 5.14.0-570.39.1.el9_6.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Sep 4 05:08:52 EDT 2025
Architecture used during static analysis	x86_64
Micro Architecture used during static analysis	GRANITE_RAPIDS
Frequency Driver	intel_pstate
Frequency Governor	performance
Huge Pages	always
Hyperthreading	on
Number of sockets	2
Number of cores per socket	96
Compilation Options
	exec: N/A
	libggml-base.so: GNU C11 14.2.0 -mtune=generic -march=x86-64 -g -O3 -O3 -std=gnu11 -fno-omit-frame-pointer -fcf-protection=none -fPIC
	libggml-blas.so: GNU C++17 14.2.0 -mtune=generic -march=x86-64 -g -O3 -O3 -std=gnu++17 -fno-omit-frame-pointer -fcf-protection=none -fPIC
	libggml-cpu.so: GNU C++17 14.2.0 -march=graniterapids -mmmx -mpopcnt -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -mavx2 -mno-sse4a -mno-fma4 -mno-xop -mfma -mavx512f -mbmi -mbmi2 -maes -mpclmul -mavx512vl -mavx512bw -mavx512dq -mavx512cd -mavx512vbmi -mavx512ifma -mavx512vpopcntdq -mavx512vbmi2 -mgfni -mvpclmulqdq -mavx512vnni -mavx512bitalg -mavx512bf16 -mno-avx512vp2intersect -mno-3dnow -madx -mabm -mcldemote -mclflushopt -mclwb -mno-clzero -mcx16 -menqcmd -mf16c -mfsgsbase -mfxsr -mno-hle -msahf -mno-lwp -mlzcnt -mmovbe -mmovdir64b -mmovdiri -mno-mwaitx -mpconfig -mpku -mprfchw -mptwrite -mrdpid -mrdrnd -mrdseed -mno-rtm -mserialize -msgx -msha -mshstk -mno-tbm -mtsxldtrk -mvaes -mwaitpkg -mwbnoinvd -mxsave -mxsavec -mxsaveopt -mxsaves -mamx-tile -mamx-int8 -mamx-bf16 -muintr -mhreset -mno-kl -mno-widekl -mavxvnni -mavx512fp16 -mno-avxifma -mno-avxvnniint8 -mno-avxneconvert -mno-cmpccxadd -mamx-fp16 -mprefetchi -mno-raoint -mno-amx-complex -mno-avxvnniint16 -mno-sm3 -mno-sha512 -mno-sm4 -mno-apxf -mno-usermsr -mavx10.1-256 -mavx10.1-512 --param=l1-cache-size=48 --param=l1-cache-line-size=64 --param=l2-cache-size=491520 -mtune=graniterapids -g -O3 -O3 -std=gnu++17 -fno-omit-frame-pointer -fcf-protection=none -fPIC -fopenmp
	libllama.so: GNU C++17 14.2.0 -mtune=generic -march=x86-64 -g -O3 -O3 -fno-omit-frame-pointer -fcf-protection=none -fPIC
Comments

Configuration Summary

Dataset
Run Command	<executable> -m meta-llama-3.1-8b-instruct-Q8_0.gguf -t 192 -n 0 -p 512 -r 3
MPI Command	mpirun -n <number_processes>
Number Processes	1
Number Nodes	1
Number Processes per Node	1
Filter	Not Used
Profile Start	Not Used
Profile Stop	Not Used
Maximal Path Number	4
