OV - exec - Summary

exec - 2025-09-12 15:43:22 - MAQAO 2025.1.2


run_0
run_1
run_2
run_3
run_4
run_5
run_6
run_7
run_8
run_9
run_10
run_11
run_12
run_13
run_14

▼Stylizer

[ 4 / 4 ] Application profile is long enough (11.79 s)

To have good quality measurements, it is advised that the application profiling time is greater than 10 seconds.

[ 3 / 3 ] Most of time spent in analyzed modules comes from functions compiled with -g and -fno-omit-frame-pointer

The -g option gives access to debugging information, such as source locations. -fno-omit-frame-pointer improves the accuracy of callchains found during application profiling.

[ 3 / 3 ] Optimization level option is correctly used

[ 2 / 3 ] Security settings from the host restrict profiling. Some metrics will be missing or incomplete.

Current value for kernel.perf_event_paranoid is 2. If possible, set it to 1 or check with your system administrator which flag can be used to achieve this.
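
The setting named above can be checked before launching a profiling run. The following is a minimal Python sketch, assuming a Linux host that exposes the value under /proc/sys/kernel/perf_event_paranoid. The helper names are hypothetical and are not part of MAQAO; the threshold of 1 is simply the value recommended above.

```python
from pathlib import Path

# Standard Linux procfs location of the setting (assumes a Linux host).
PARANOID_PATH = Path("/proc/sys/kernel/perf_event_paranoid")


def read_perf_event_paranoid(path=PARANOID_PATH):
    """Return the current perf_event_paranoid level as an int, or None if unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def profiling_is_restricted(level, max_level=1):
    """True when the level is above the value recommended in the report (1)."""
    return level is not None and level > max_level


if __name__ == "__main__":
    level = read_perf_event_paranoid()
    if level is None:
        print("perf_event_paranoid is not readable on this host")
    elif profiling_is_restricted(level):
        print(f"perf_event_paranoid={level}: profiling is restricted")
    else:
        print(f"perf_event_paranoid={level}: no restriction expected")
```

Changing the value usually requires administrator rights (for example through `sysctl -w kernel.perf_event_paranoid=1`), so on shared machines this check is mostly useful for deciding whether to contact the system administrator.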

[ 3 / 3 ] Architecture specific option -mcpu is used

[ 2 / 2 ] Application is correctly profiled ("Others" category represents 0.00 % of the execution time)

To have a representative profiling, it is advised that the category "Others" represents less than 20% of the execution time in order to analyze as much as possible of the user code

[ 1 / 1 ] Lstopo present. The Topology lstopo report will be generated.

▼Strategizer

[ 4 / 4 ] Enough of the experiment time is spent in analyzed loops (71.43%)

If the time spent in analyzed loops is less than 30%, standard loop optimizations will have a limited impact on application performance.

[ 4 / 4 ] Threads activity is good

On average, more than 91.97% of observed threads are actually active

[ 4 / 4 ] CPU activity is good

CPU cores are active 96.71% of time

[ 4 / 4 ] Loop profile is not flat

At least one loop has a coverage greater than 4% (69.30%), representing a hotspot for the application

[ 4 / 4 ] Enough of the experiment time is spent in analyzed innermost loops (70.67%)

If the time spent in analyzed innermost loops is less than 15%, standard innermost loop optimizations such as vectorization will have a limited impact on application performance.

[ 4 / 4 ] Affinity is good (99.45%)

Threads are not migrating between CPU cores: they are probably pinned successfully

[ 3 / 3 ] Less than 10% (0.00%) is spent in BLAS1 operations

It could be more efficient to inline BLAS1 operations by hand

[ 0 / 3 ] Too many functions do not use all threads

Functions running on a reduced number of threads (typically sequential code) cover at least 10% of application walltime (18.22%). Check both "Max Inclusive Time Over Threads" and "Nb Threads" in Functions or Loops tabs and consider parallelizing sequential regions or improving parallelization of regions running on a reduced number of threads
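
To get a rough sense of what is at stake in this check, the 18.22% figure can be plugged into the general form of Amdahl's law: if a fraction of the walltime is accelerated by some factor, the overall gain is bounded by the untouched remainder. The Python sketch below is a back-of-the-envelope estimate only. It assumes the reduced-thread regions could be sped up independently of the rest of the run, which the report does not state, and the function names and speedup factors are illustrative.

```python
def overall_speedup(fraction, local_speedup):
    """Overall speedup when `fraction` of the walltime runs `local_speedup` times faster."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    if local_speedup <= 0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


def speedup_upper_bound(fraction):
    """Limit of overall_speedup as the accelerated part takes no time at all."""
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    return 1.0 / (1.0 - fraction)


if __name__ == "__main__":
    reduced_thread_share = 0.1822  # walltime share reported above
    for factor in (2, 4, 8):  # hypothetical speedups of the reduced-thread regions
        gain = overall_speedup(reduced_thread_share, factor)
        print(f"regions {factor}x faster -> whole run {gain:.3f}x faster")
    print(f"upper bound: {speedup_upper_bound(reduced_thread_share):.3f}x")
```

The bound depends only on the share of walltime involved, which is why the report asks to inspect "Max Inclusive Time Over Threads" and "Nb Threads" to find which functions make up that share.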

[ 3 / 3 ] Cumulative Outermost/In between loops coverage (0.76%) lower than cumulative innermost loop coverage (70.67%)

Having cumulative Outermost/In between loops coverage greater than cumulative innermost loop coverage will make loop optimization more complex

[ 2 / 2 ] Less than 10% (0.00%) is spent in BLAS2 operations

BLAS2 calls often make poor use of the cache and could benefit from inlining.

[ 2 / 2 ] Less than 10% (0.00%) is spent in Libm/SVML (special functions)

▼Optimizer

Loop ID	Analysis	Penalty Score
○Loop 379 - libggml-cpu.so	Execution Time: 69 % - Vectorization Ratio: 38.46 % - Vector Length Use: 53.85 %
►Loop 72 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 50.00 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1001
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○Loop 908 - libggml-cpu.so	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Loop 73 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 41.07 %
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 56 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 50.29 %
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 1896 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 8.42 % - Vector Length Use: 38.16 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
○Loop 1908 - libggml-cpu.so	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Loop 53 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 52.17 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Loop 419 - libggml-base.so+	Execution Time: 0 % - Vectorization Ratio: 23.17 % - Vector Length Use: 42.00 %
►Loop Computation Issues+		10
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
○Control Flow Issues		0
►Vectorization Roadblocks+		1000
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 1596 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 27.27 % - Vector Length Use: 36.36 %
►Loop Computation Issues+		8
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Data Access Issues+		2
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 1 issues ( = data accesses) costing 2 point each.	2
►Vectorization Roadblocks+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 1 issues ( = data accesses) costing 2 point each.	2
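
The category scores in the table above appear to be the sum of the per-issue costs listed beneath them, for example 1 + 1000 + 2 = 1003 for the Vectorization Roadblocks of Loop 73. The short Python sketch below reproduces that arithmetic for two of the rows. The data layout and the function name are illustrative and are not a MAQAO API; the counts and costs are copied from the table.

```python
def category_score(issues):
    """Sum per-issue penalties; each entry is (number_of_issues, points_each)."""
    return sum(count * points for count, points in issues)


# Loop 73 - Vectorization Roadblocks (reported score: 1003)
loop_73_vectorization_roadblocks = [
    (1, 1),     # presence of calls: 1 call, 1 point each
    (1000, 1),  # too many paths: at least 1000 paths, 1 point each
    (1, 2),     # non innermost loop (Outermost): flat cost of 2 points
]

# Loop 419 - Loop Computation Issues (reported score: 10)
loop_419_computation_issues = [
    (1, 4),  # expensive FP instructions: 1 instruction, 4 points each
    (1, 4),  # less than 10% of FP operations use FMA: flat cost of 4 points
    (1, 2),  # large number of scalar integer instructions: flat cost of 2 points
]

if __name__ == "__main__":
    print(category_score(loop_73_vectorization_roadblocks))
    print(category_score(loop_419_computation_issues))
```

Read this way, the "Too many paths" entry accounts for nearly all of a 1000+ score, so the other entries in the same category matter more than their share of the total suggests.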

▼Stylizer

[ 4 / 4 ] Application profile is long enough (11.68 s)

To have good quality measurements, it is advised that the application profiling time is greater than 10 seconds.

[ 3 / 3 ] Most of time spent in analyzed modules comes from functions compiled with -g and -fno-omit-frame-pointer

The -g option gives access to debugging information, such as source locations. -fno-omit-frame-pointer improves the accuracy of callchains found during application profiling.

[ 3 / 3 ] Optimization level option is correctly used

[ 2 / 3 ] Security settings from the host restrict profiling. Some metrics will be missing or incomplete.

Current value for kernel.perf_event_paranoid is 2. If possible, set it to 1 or check with your system administrator which flag can be used to achieve this.

[ 3 / 3 ] Architecture specific option -mcpu is used

[ 2 / 2 ] Application is correctly profiled ("Others" category represents 0.00 % of the execution time)

To have a representative profiling, it is advised that the category "Others" represents less than 20% of the execution time in order to analyze as much as possible of the user code

[ 1 / 1 ] Lstopo present. The Topology lstopo report will be generated.

▼Strategizer

[ 4 / 4 ] Enough of the experiment time is spent in analyzed loops (72.01%)

If the time spent in analyzed loops is less than 30%, standard loop optimizations will have a limited impact on application performance.

[ 4 / 4 ] Threads activity is good

On average, more than 92.07% of observed threads are actually active

[ 4 / 4 ] CPU activity is good

CPU cores are active 96.85% of time

[ 4 / 4 ] Loop profile is not flat

At least one loop has a coverage greater than 4% (69.83%), representing a hotspot for the application

[ 4 / 4 ] Enough of the experiment time is spent in analyzed innermost loops (71.23%)

If the time spent in analyzed innermost loops is less than 15%, standard innermost loop optimizations such as vectorization will have a limited impact on application performance.

[ 4 / 4 ] Affinity is good (99.46%)

Threads are not migrating between CPU cores: they are probably pinned successfully

[ 3 / 3 ] Less than 10% (0.00%) is spent in BLAS1 operations

It could be more efficient to inline BLAS1 operations by hand

[ 0 / 3 ] Too many functions do not use all threads

Functions running on a reduced number of threads (typically sequential code) cover at least 10% of application walltime (18.80%). Check both "Max Inclusive Time Over Threads" and "Nb Threads" in Functions or Loops tabs and consider parallelizing sequential regions or improving parallelization of regions running on a reduced number of threads

[ 3 / 3 ] Cumulative Outermost/In between loops coverage (0.79%) lower than cumulative innermost loop coverage (71.23%)

Having cumulative Outermost/In between loops coverage greater than cumulative innermost loop coverage will make loop optimization more complex

[ 2 / 2 ] Less than 10% (0.00%) is spent in BLAS2 operations

BLAS2 calls often make poor use of the cache and could benefit from inlining.

[ 2 / 2 ] Less than 10% (0.00%) is spent in Libm/SVML (special functions)

▼Optimizer

Loop ID	Analysis	Penalty Score
○Loop 379 - libggml-cpu.so	Execution Time: 69 % - Vectorization Ratio: 38.46 % - Vector Length Use: 53.85 %
►Loop 72 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 50.00 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1001
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 73 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 41.07 %
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
○Loop 908 - libggml-cpu.so	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Loop 56 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 50.29 %
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 1896 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 8.42 % - Vector Length Use: 38.16 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
○Loop 1908 - libggml-cpu.so	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Loop 53 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 52.17 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Loop 419 - libggml-base.so+	Execution Time: 0 % - Vectorization Ratio: 23.17 % - Vector Length Use: 42.00 %
►Loop Computation Issues+		10
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
○Control Flow Issues		0
►Vectorization Roadblocks+		1000
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 1596 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 27.27 % - Vector Length Use: 36.36 %
►Loop Computation Issues+		8
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Data Access Issues+		2
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 1 issues ( = data accesses) costing 2 point each.	2
►Vectorization Roadblocks+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 1 issues ( = data accesses) costing 2 point each.	2

▼Stylizer

[ 4 / 4 ] Application profile is long enough (11.60 s)

To have good quality measurements, it is advised that the application profiling time is greater than 10 seconds.

[ 3 / 3 ] Most of time spent in analyzed modules comes from functions compiled with -g and -fno-omit-frame-pointer

The -g option gives access to debugging information, such as source locations. -fno-omit-frame-pointer improves the accuracy of callchains found during application profiling.

[ 3 / 3 ] Optimization level option is correctly used

[ 2 / 3 ] Security settings from the host restrict profiling. Some metrics will be missing or incomplete.

Current value for kernel.perf_event_paranoid is 2. If possible, set it to 1 or check with your system administrator which flag can be used to achieve this.

[ 3 / 3 ] Architecture specific option -mcpu is used

[ 2 / 2 ] Application is correctly profiled ("Others" category represents 0.00 % of the execution time)

To have a representative profiling, it is advised that the category "Others" represents less than 20% of the execution time in order to analyze as much as possible of the user code

[ 1 / 1 ] Lstopo present. The Topology lstopo report will be generated.

▼Strategizer

[ 4 / 4 ] Enough of the experiment time is spent in analyzed loops (71.72%)

If the time spent in analyzed loops is less than 30%, standard loop optimizations will have a limited impact on application performance.

[ 4 / 4 ] Threads activity is good

On average, more than 92.03% of observed threads are actually active

[ 4 / 4 ] CPU activity is good

CPU cores are active 96.83% of time

[ 4 / 4 ] Loop profile is not flat

At least one loop has a coverage greater than 4% (69.57%), representing a hotspot for the application

[ 4 / 4 ] Enough of the experiment time is spent in analyzed innermost loops (70.94%)

If the time spent in analyzed innermost loops is less than 15%, standard innermost loop optimizations such as vectorization will have a limited impact on application performance.

[ 4 / 4 ] Affinity is good (99.45%)

Threads are not migrating between CPU cores: they are probably pinned successfully

[ 3 / 3 ] Less than 10% (0.00%) is spent in BLAS1 operations

It could be more efficient to inline BLAS1 operations by hand

[ 0 / 3 ] Too many functions do not use all threads

Functions running on a reduced number of threads (typically sequential code) cover at least 10% of application walltime (19.12%). Check both "Max Inclusive Time Over Threads" and "Nb Threads" in Functions or Loops tabs and consider parallelizing sequential regions or improving parallelization of regions running on a reduced number of threads

[ 3 / 3 ] Cumulative Outermost/In between loops coverage (0.78%) lower than cumulative innermost loop coverage (70.94%)

Having cumulative Outermost/In between loops coverage greater than cumulative innermost loop coverage will make loop optimization more complex

[ 2 / 2 ] Less than 10% (0.00%) is spent in BLAS2 operations

BLAS2 calls often make poor use of the cache and could benefit from inlining.

[ 2 / 2 ] Less than 10% (0.00%) is spent in Libm/SVML (special functions)

▼Optimizer

Loop ID	Analysis	Penalty Score
○Loop 379 - libggml-cpu.so	Execution Time: 69 % - Vectorization Ratio: 38.46 % - Vector Length Use: 53.85 %
►Loop 72 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 50.00 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1001
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 73 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 41.07 %
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
○Loop 908 - libggml-cpu.so	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Loop 56 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 50.29 %
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 1896 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 8.42 % - Vector Length Use: 38.16 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
○Loop 1908 - libggml-cpu.so	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Loop 53 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 52.17 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Loop 419 - libggml-base.so+	Execution Time: 0 % - Vectorization Ratio: 23.17 % - Vector Length Use: 42.00 %
►Loop Computation Issues+		10
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
○Control Flow Issues		0
►Vectorization Roadblocks+		1000
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 1596 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 27.27 % - Vector Length Use: 36.36 %
►Loop Computation Issues+		8
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Data Access Issues+		2
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 1 issues ( = data accesses) costing 2 point each.	2
►Vectorization Roadblocks+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 1 issues ( = data accesses) costing 2 point each.	2

▼Stylizer

[ 4 / 4 ] Application profile is long enough (11.60 s)

To have good quality measurements, it is advised that the application profiling time is greater than 10 seconds.

[ 3 / 3 ] Most of time spent in analyzed modules comes from functions compiled with -g and -fno-omit-frame-pointer

The -g option gives access to debugging information, such as source locations. -fno-omit-frame-pointer improves the accuracy of callchains found during application profiling.

[ 3 / 3 ] Optimization level option is correctly used

[ 2 / 3 ] Security settings from the host restrict profiling. Some metrics will be missing or incomplete.

Current value for kernel.perf_event_paranoid is 2. If possible, set it to 1 or check with your system administrator which flag can be used to achieve this.

[ 3 / 3 ] Architecture specific option -mcpu is used

[ 2 / 2 ] Application is correctly profiled ("Others" category represents 0.00 % of the execution time)

To have a representative profiling, it is advised that the category "Others" represents less than 20% of the execution time in order to analyze as much as possible of the user code

[ 1 / 1 ] Lstopo present. The Topology lstopo report will be generated.

▼Strategizer

[ 4 / 4 ] Enough of the experiment time is spent in analyzed loops (72.28%)

If the time spent in analyzed loops is less than 30%, standard loop optimizations will have a limited impact on application performance.

[ 4 / 4 ] Threads activity is good

On average, more than 92.09% of observed threads are actually active

[ 4 / 4 ] CPU activity is good

CPU cores are active 96.87% of time

[ 4 / 4 ] Loop profile is not flat

At least one loop has a coverage greater than 4% (70.19%), representing a hotspot for the application

[ 4 / 4 ] Enough of the experiment time is spent in analyzed innermost loops (71.52%)

If the time spent in analyzed innermost loops is less than 15%, standard innermost loop optimizations such as vectorization will have a limited impact on application performance.

[ 4 / 4 ] Affinity is good (99.46%)

Threads are not migrating between CPU cores: they are probably pinned successfully

[ 3 / 3 ] Less than 10% (0.00%) is spent in BLAS1 operations

It could be more efficient to inline BLAS1 operations by hand

[ 0 / 3 ] Too many functions do not use all threads

Functions running on a reduced number of threads (typically sequential code) cover at least 10% of application walltime (18.43%). Check both "Max Inclusive Time Over Threads" and "Nb Threads" in Functions or Loops tabs and consider parallelizing sequential regions or improving parallelization of regions running on a reduced number of threads

[ 3 / 3 ] Cumulative Outermost/In between loops coverage (0.76%) lower than cumulative innermost loop coverage (71.52%)

Having cumulative Outermost/In between loops coverage greater than cumulative innermost loop coverage will make loop optimization more complex

[ 2 / 2 ] Less than 10% (0.00%) is spent in BLAS2 operations

BLAS2 calls often make poor use of the cache and could benefit from inlining.

[ 2 / 2 ] Less than 10% (0.00%) is spent in Libm/SVML (special functions)

▼Optimizer

Loop ID	Analysis	Penalty Score
○Loop 379 - libggml-cpu.so	Execution Time: 70 % - Vectorization Ratio: 38.46 % - Vector Length Use: 53.85 %
►Loop 72 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 50.00 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1001
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○Loop 908 - libggml-cpu.so	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Loop 56 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 50.29 %
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 73 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 41.07 %
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 1896 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 8.42 % - Vector Length Use: 38.16 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
○Loop 1908 - libggml-cpu.so	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Loop 53 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 52.17 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Loop 419 - libggml-base.so+	Execution Time: 0 % - Vectorization Ratio: 23.17 % - Vector Length Use: 42.00 %
►Loop Computation Issues+		10
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
○Control Flow Issues		0
►Vectorization Roadblocks+		1000
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 1596 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 27.27 % - Vector Length Use: 36.36 %
►Loop Computation Issues+		8
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Data Access Issues+		2
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 1 issues ( = data accesses) costing 2 point each.	2
►Vectorization Roadblocks+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 1 issues ( = data accesses) costing 2 point each.	2
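
The category totals in the table above appear to be the sum of the costs of the issues listed under them. The sketch below checks this for the Vectorization Roadblocks of Loop 56 (1 + 1000 + 2 = 1003). The `Issue` type and `category_penalty` helper are hypothetical names introduced for illustration and are not part of MAQAO.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Issue:
    """One row of the Optimizer table (hypothetical representation)."""
    label: str
    count: int        # number of issues reported on the row
    points_each: int  # cost per issue as stated on the row

    @property
    def cost(self) -> int:
        return self.count * self.points_each


def category_penalty(issues: Iterable[Issue]) -> int:
    """Sum the per-issue costs, assuming the category score is a plain sum."""
    return sum(issue.cost for issue in issues)


# Values copied from the Loop 56 "Vectorization Roadblocks" rows above.
LOOP_56_VECTORIZATION_ROADBLOCKS = [
    Issue("Presence of calls", count=1, points_each=1),
    Issue("Too many paths (at least 1000 paths)", count=1000, points_each=1),
    Issue("Non innermost loop (InBetween)", count=1, points_each=2),
]

if __name__ == "__main__":
    print(category_penalty(LOOP_56_VECTORIZATION_ROADBLOCKS))
```

The same sum holds for the other categories shown, for example Loop 419's Loop Computation Issues (4 + 4 + 2 = 10).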

▼Stylizer

[ 4 / 4 ] Application profile is long enough (11.53 s)

To have good quality measurements, it is advised that the application profiling time is greater than 10 seconds.

[ 3 / 3 ] Most of the time spent in analyzed modules comes from functions compiled with -g and -fno-omit-frame-pointer

The -g option gives access to debugging information, such as source locations. -fno-omit-frame-pointer improves the accuracy of the callchains found during application profiling.

[ 3 / 3 ] Optimization level option is correctly used

[ 2 / 3 ] Security settings from the host restrict profiling. Some metrics will be missing or incomplete.

Current value for kernel.perf_event_paranoid is 2. If possible, set it to 1 or check with your system administrator which flag can be used to achieve this.
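
A minimal sketch of how this setting could be checked before a profiling run. It assumes a Linux host where the value is exposed at `/proc/sys/kernel/perf_event_paranoid`; the helper names and the threshold constant are illustrative, and the threshold of 1 is simply the value the report suggests.

```python
from pathlib import Path

# Standard procfs location of the setting on Linux.
PARANOID_PATH = Path("/proc/sys/kernel/perf_event_paranoid")
RECOMMENDED_MAX = 1  # value suggested by the report message above


def read_paranoid_level(path: Path = PARANOID_PATH) -> int:
    """Return the current kernel.perf_event_paranoid value as an integer."""
    return int(path.read_text().strip())


def restricts_profiling(level: int, recommended_max: int = RECOMMENDED_MAX) -> bool:
    """True when the level is stricter than the recommended maximum."""
    return level > recommended_max


if __name__ == "__main__":
    level = read_paranoid_level()
    if restricts_profiling(level):
        print(f"perf_event_paranoid={level}: some metrics may be missing or incomplete")
    else:
        print(f"perf_event_paranoid={level}: no restriction expected from this setting")
```

Changing the value typically requires administrator rights, for example with `sysctl -w kernel.perf_event_paranoid=1`, which is why the report suggests checking with the system administrator.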

[ 3 / 3 ] Architecture specific option -mcpu is used

[ 2 / 2 ] Application is correctly profiled ("Others" category represents 0.00 % of the execution time)

For a representative profile, the "Others" category should represent less than 20% of the execution time, so that as much of the user code as possible is analyzed

[ 1 / 1 ] Lstopo present. The Topology lstopo report will be generated.

▼Strategizer

[ 4 / 4 ] Enough of the experiment time is spent in analyzed loops (71.42%)

If the time spent in analyzed loops is less than 30%, standard loop optimizations will have a limited impact on application performance.

[ 4 / 4 ] Threads activity is good

On average, more than 91.85% of observed threads are actually active

[ 4 / 4 ] CPU activity is good

CPU cores are active 96.66% of the time

[ 4 / 4 ] Loop profile is not flat

At least one loop coverage is greater than 4% (69.36%), representing a hotspot for the application

[ 4 / 4 ] Enough of the experiment time is spent in analyzed innermost loops (70.65%)

If the time spent in analyzed innermost loops is less than 15%, standard innermost loop optimizations such as vectorisation will have a limited impact on application performance.

[ 4 / 4 ] Affinity is good (99.43%)

Threads are not migrating between CPU cores: they are probably successfully pinned

[ 3 / 3 ] Less than 10% (0.00%) is spent in BLAS1 operations

It could be more efficient to inline BLAS1 operations by hand

[ 0 / 3 ] Too many functions do not use all threads

Functions running on a reduced number of threads (typically sequential code) cover at least 10% of application walltime (18.38%). Check both "Max Inclusive Time Over Threads" and "Nb Threads" in Functions or Loops tabs and consider parallelizing sequential regions or improving parallelization of regions running on a reduced number of threads

[ 3 / 3 ] Cumulative Outermost/In between loops coverage (0.77%) lower than cumulative innermost loop coverage (70.65%)

Having cumulative Outermost/In between loops coverage greater than cumulative innermost loop coverage will make loop optimization more complex

[ 2 / 2 ] Less than 10% (0.00%) is spent in BLAS2 operations

BLAS2 calls usually make poor use of the cache and could benefit from inlining.

[ 2 / 2 ] Less than 10% (0.00%) is spent in Libm/SVML (special functions)

▼Optimizer

Loop ID	Analysis	Penalty Score
○Loop 379 - libggml-cpu.so	Execution Time: 69 % - Vectorization Ratio: 38.46 % - Vector Length Use: 53.85 %
►Loop 72 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 50.00 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1001
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 73 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 41.07 %
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
○Loop 908 - libggml-cpu.so	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Loop 56 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 50.29 %
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 1896 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 8.42 % - Vector Length Use: 38.16 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
○Loop 1908 - libggml-cpu.so	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Loop 53 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 52.17 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Loop 419 - libggml-base.so+	Execution Time: 0 % - Vectorization Ratio: 23.17 % - Vector Length Use: 42.00 %
►Loop Computation Issues+		10
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
○Control Flow Issues		0
►Vectorization Roadblocks+		1000
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 1596 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 27.27 % - Vector Length Use: 36.36 %
►Loop Computation Issues+		8
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Data Access Issues+		2
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 1 issues ( = data accesses) costing 2 point each.	2
►Vectorization Roadblocks+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 1 issues ( = data accesses) costing 2 point each.	2

▼Stylizer

[ 4 / 4 ] Application profile is long enough (11.74 s)

To have good quality measurements, it is advised that the application profiling time is greater than 10 seconds.

[ 3 / 3 ] Most of the time spent in analyzed modules comes from functions compiled with -g and -fno-omit-frame-pointer

The -g option gives access to debugging information, such as source locations. -fno-omit-frame-pointer improves the accuracy of the callchains found during application profiling.

[ 3 / 3 ] Optimization level option is correctly used

[ 2 / 3 ] Security settings from the host restrict profiling. Some metrics will be missing or incomplete.

Current value for kernel.perf_event_paranoid is 2. If possible, set it to 1 or check with your system administrator which flag can be used to achieve this.

[ 3 / 3 ] Architecture specific option -mcpu is used

[ 2 / 2 ] Application is correctly profiled ("Others" category represents 0.00 % of the execution time)

For a representative profile, the "Others" category should represent less than 20% of the execution time, so that as much of the user code as possible is analyzed

[ 1 / 1 ] Lstopo present. The Topology lstopo report will be generated.

▼Strategizer

[ 4 / 4 ] Enough of the experiment time is spent in analyzed loops (69.95%)

If the time spent in analyzed loops is less than 30%, standard loop optimizations will have a limited impact on application performance.

[ 4 / 4 ] Threads activity is good

On average, more than 91.77% of observed threads are actually active

[ 4 / 4 ] CPU activity is good

CPU cores are active 96.50% of the time

[ 4 / 4 ] Loop profile is not flat

At least one loop coverage is greater than 4% (67.79%), representing a hotspot for the application

[ 4 / 4 ] Enough of the experiment time is spent in analyzed innermost loops (69.04%)

If the time spent in analyzed innermost loops is less than 15%, standard innermost loop optimizations such as vectorisation will have a limited impact on application performance.

[ 4 / 4 ] Affinity is good (99.53%)

Threads are not migrating between CPU cores: they are probably successfully pinned

[ 3 / 3 ] Less than 10% (0.00%) is spent in BLAS1 operations

It could be more efficient to inline BLAS1 operations by hand

[ 0 / 3 ] Too many functions do not use all threads

Functions running on a reduced number of threads (typically sequential code) cover at least 10% of application walltime (18.89%). Check both "Max Inclusive Time Over Threads" and "Nb Threads" in Functions or Loops tabs and consider parallelizing sequential regions or improving parallelization of regions running on a reduced number of threads

[ 3 / 3 ] Cumulative Outermost/In between loops coverage (0.91%) lower than cumulative innermost loop coverage (69.04%)

Having cumulative Outermost/In between loops coverage greater than cumulative innermost loop coverage will make loop optimization more complex

[ 2 / 2 ] Less than 10% (0.00%) is spent in BLAS2 operations

BLAS2 calls usually make poor use of the cache and could benefit from inlining.

[ 2 / 2 ] Less than 10% (0.00%) is spent in Libm/SVML (special functions)

▼Optimizer

Loop ID	Analysis	Penalty Score
○Loop 379 - libggml-cpu.so	Execution Time: 67 % - Vectorization Ratio: 38.46 % - Vector Length Use: 53.85 %
►Loop 72 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 50.00 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1001
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 73 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 41.07 %
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
○Loop 908 - libggml-cpu.so	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Loop 56 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 50.29 %
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
○Loop 1908 - libggml-cpu.so	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Loop 1896 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 8.42 % - Vector Length Use: 38.16 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 53 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 52.17 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Loop 419 - libggml-base.so+	Execution Time: 0 % - Vectorization Ratio: 23.17 % - Vector Length Use: 42.00 %
►Loop Computation Issues+		10
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
○Control Flow Issues		0
►Vectorization Roadblocks+		1000
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 1596 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 27.27 % - Vector Length Use: 36.36 %
►Loop Computation Issues+		8
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Data Access Issues+		2
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 1 issues ( = data accesses) costing 2 point each.	2
►Vectorization Roadblocks+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 1 issues ( = data accesses) costing 2 point each.	2

▼Stylizer

[ 4 / 4 ] Application profile is long enough (11.54 s)

To have good quality measurements, it is advised that the application profiling time is greater than 10 seconds.

[ 3 / 3 ] Most of the time spent in analyzed modules comes from functions compiled with -g and -fno-omit-frame-pointer

The -g option gives access to debugging information, such as source locations. -fno-omit-frame-pointer improves the accuracy of the callchains found during application profiling.

[ 3 / 3 ] Optimization level option is correctly used

[ 2 / 3 ] Security settings from the host restrict profiling. Some metrics will be missing or incomplete.

Current value for kernel.perf_event_paranoid is 2. If possible, set it to 1 or check with your system administrator which flag can be used to achieve this.

[ 3 / 3 ] Architecture specific option -mcpu is used

[ 2 / 2 ] Application is correctly profiled ("Others" category represents 0.00 % of the execution time)

For a representative profile, the "Others" category should represent less than 20% of the execution time, so that as much of the user code as possible is analyzed

[ 1 / 1 ] Lstopo present. The Topology lstopo report will be generated.

▼Strategizer

[ 4 / 4 ] Enough of the experiment time is spent in analyzed loops (72.12%)

If the time spent in analyzed loops is less than 30%, standard loop optimizations will have a limited impact on application performance.

[ 4 / 4 ] Threads activity is good

On average, more than 92.05% of observed threads are actually active

[ 4 / 4 ] CPU activity is good

CPU cores are active 96.88% of the time

[ 4 / 4 ] Loop profile is not flat

At least one loop coverage is greater than 4% (69.95%), representing a hotspot for the application

[ 4 / 4 ] Enough of the experiment time is spent in analyzed innermost loops (71.32%)

If the time spent in analyzed innermost loops is less than 15%, standard innermost loop optimizations such as vectorisation will have a limited impact on application performance.

[ 4 / 4 ] Affinity is good (99.47%)

Threads are not migrating between CPU cores: they are probably successfully pinned

[ 3 / 3 ] Less than 10% (0.00%) is spent in BLAS1 operations

It could be more efficient to inline BLAS1 operations by hand

[ 0 / 3 ] Too many functions do not use all threads

Functions running on a reduced number of threads (typically sequential code) cover at least 10% of application walltime (18.91%). Check both "Max Inclusive Time Over Threads" and "Nb Threads" in Functions or Loops tabs and consider parallelizing sequential regions or improving parallelization of regions running on a reduced number of threads

[ 3 / 3 ] Cumulative Outermost/In between loops coverage (0.80%) lower than cumulative innermost loop coverage (71.32%)

Having cumulative Outermost/In between loops coverage greater than cumulative innermost loop coverage will make loop optimization more complex

[ 2 / 2 ] Less than 10% (0.00%) is spent in BLAS2 operations

BLAS2 calls usually make poor use of the cache and could benefit from inlining.

[ 2 / 2 ] Less than 10% (0.00%) is spent in Libm/SVML (special functions)

▼Optimizer

Loop ID	Analysis	Penalty Score
○Loop 379 - libggml-cpu.so	Execution Time: 69 % - Vectorization Ratio: 38.46 % - Vector Length Use: 53.85 %
►Loop 72 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 50.00 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1001
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 73 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 41.07 %
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
○Loop 908 - libggml-cpu.so	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Loop 56 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 50.29 %
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 1896 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 8.42 % - Vector Length Use: 38.16 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
○Loop 1908 - libggml-cpu.so	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Loop 53 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 52.17 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Loop 419 - libggml-base.so+	Execution Time: 0 % - Vectorization Ratio: 23.17 % - Vector Length Use: 42.00 %
►Loop Computation Issues+		10
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
○Control Flow Issues		0
►Vectorization Roadblocks+		1000
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 1596 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 27.27 % - Vector Length Use: 36.36 %
►Loop Computation Issues+		8
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Data Access Issues+		2
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower the cost a bit. There are 1 issues ( = data accesses) costing 2 points each.	2
►Vectorization Roadblocks+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower the cost a bit. There are 1 issues ( = data accesses) costing 2 points each.	2

▼Stylizer

[ 4 / 4 ] Application profile is long enough (11.58 s)

To have good quality measurements, it is advised that the application profiling time is greater than 10 seconds.

[ 3 / 3 ] Most of time spent in analyzed modules comes from functions compiled with -g and -fno-omit-frame-pointer

The -g option gives access to debugging information, such as source locations. -fno-omit-frame-pointer improves the accuracy of callchains found during application profiling.

[ 3 / 3 ] Optimization level option is correctly used

[ 2 / 3 ] Security settings from the host restrict profiling. Some metrics will be missing or incomplete.

Current value for kernel.perf_event_paranoid is 2. If possible, set it to 1 or check with your system administrator which flag can be used to achieve this.
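
As a quick check before profiling, the current setting can be read from procfs. The sketch below assumes a Linux host that exposes /proc/sys/kernel/perf_event_paranoid. The helper names are hypothetical, and the threshold of 1 is simply the value this report recommends, not a MAQAO API.

```python
from pathlib import Path
from typing import Optional

# Standard Linux procfs location of the sysctl (assumed to be readable here).
PARANOID_PATH = Path("/proc/sys/kernel/perf_event_paranoid")

# Threshold taken from the report's advice ("set it to 1"); illustrative only.
RECOMMENDED_MAX = 1


def read_paranoid_level(path: Path = PARANOID_PATH) -> Optional[int]:
    """Return the current level, or None if the file cannot be read or parsed."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def restricts_profiling(level: int) -> bool:
    """True when the level is above the value the report recommends."""
    return level > RECOMMENDED_MAX


if __name__ == "__main__":
    level = read_paranoid_level()
    if level is None:
        print("perf_event_paranoid is not available on this system")
    elif restricts_profiling(level):
        print(f"perf_event_paranoid={level}: some metrics may be missing")
    else:
        print(f"perf_event_paranoid={level}: no restriction reported")
```

Lowering the value usually requires administrator rights, for example through `sysctl -w kernel.perf_event_paranoid=1`, which is why the report suggests checking with the system administrator.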

[ 3 / 3 ] Architecture specific option -mcpu is used

[ 2 / 2 ] Application is correctly profiled ("Others" category represents 0.00 % of the execution time)

To have a representative profiling, it is advised that the category "Others" represents less than 20% of the execution time in order to analyze as much as possible of the user code

[ 1 / 1 ] Lstopo present. The Topology lstopo report will be generated.

▼Strategizer

[ 4 / 4 ] Enough time of the experiment time spent in analyzed loops (73.13%)

If the time spent in analyzed loops is less than 30%, standard loop optimizations will have a limited impact on application performances.

[ 4 / 4 ] Threads activity is good

On average, more than 92.29% of observed threads are actually active

[ 4 / 4 ] CPU activity is good

CPU cores are active 97.11% of time

[ 4 / 4 ] Loop profile is not flat

At least one loop coverage is greater than 4% (70.90%), representing a hotspot for the application

[ 4 / 4 ] Enough time of the experiment time spent in analyzed innermost loops (72.33%)

If the time spent in analyzed innermost loops is less than 15%, standard innermost loop optimizations such as vectorisation will have a limited impact on application performances.

[ 4 / 4 ] Affinity is good (99.49%)

Threads are not migrating to CPU cores: probably successfully pinned

[ 3 / 3 ] Less than 10% (0.00%) is spent in BLAS1 operations

It could be more efficient to inline by hand BLAS1 operations

[ 0 / 3 ] Too many functions do not use all threads

Functions running on a reduced number of threads (typically sequential code) cover at least 10% of application walltime (17.21%). Check both "Max Inclusive Time Over Threads" and "Nb Threads" in Functions or Loops tabs and consider parallelizing sequential regions or improving parallelization of regions running on a reduced number of threads
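
To get a feel for why this check matters, Amdahl's law gives an upper bound on speedup when a fraction of the walltime cannot use more threads. The sketch below treats the reported 17.21% as fully serial, which is an assumption: the report only says these functions run on a reduced number of threads. The function name is illustrative.

```python
def amdahl_speedup(serial_fraction: float, n_threads: int) -> float:
    """Upper bound on speedup with n_threads when serial_fraction stays serial."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if n_threads < 1:
        raise ValueError("n_threads must be >= 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_threads)


# 17.21% is the coverage reported for this run; treating it as purely
# sequential is an assumption made for illustration only.
serial = 0.1721
for n in (4, 16, 64):
    print(f"{n:3d} threads -> at most {amdahl_speedup(serial, n):.2f}x")
```

Under that assumption the bound can never exceed 1 / 0.1721 however many threads are added, which is why the report suggests parallelizing these regions.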

[ 3 / 3 ] Cumulative Outermost/In between loops coverage (0.81%) lower than cumulative innermost loop coverage (72.33%)

Having cumulative Outermost/In between loops coverage greater than cumulative innermost loop coverage will make loop optimization more complex

[ 2 / 2 ] Less than 10% (0.00%) is spent in BLAS2 operations

BLAS2 calls usually make poor use of the cache and could benefit from inlining.

[ 2 / 2 ] Less than 10% (0.00%) is spent in Libm/SVML (special functions)

▼Optimizer

Loop ID	Analysis	Penalty Score
○Loop 379 - libggml-cpu.so	Execution Time: 70 % - Vectorization Ratio: 38.46 % - Vector Length Use: 53.85 %
►Loop 72 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 50.00 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1001
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 73 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 41.07 %
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
○Loop 908 - libggml-cpu.so	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Loop 56 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 50.29 %
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 1896 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 8.42 % - Vector Length Use: 38.16 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
○Loop 1908 - libggml-cpu.so	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Loop 53 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 52.17 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Loop 419 - libggml-base.so+	Execution Time: 0 % - Vectorization Ratio: 23.17 % - Vector Length Use: 42.00 %
►Loop Computation Issues+		10
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
○Control Flow Issues		0
►Vectorization Roadblocks+		1000
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 1596 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 27.27 % - Vector Length Use: 36.36 %
►Loop Computation Issues+		8
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Data Access Issues+		2
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower the cost a bit. There are 1 issues ( = data accesses) costing 2 points each.	2
►Vectorization Roadblocks+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower the cost a bit. There are 1 issues ( = data accesses) costing 2 points each.	2
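
The category scores in the table above appear to be plain sums of their issue scores, for example 1 + 1000 + 2 = 1003 for the Vectorization Roadblocks of Loop 56. A minimal sketch of that arithmetic follows, assuming tab-separated lines whose last field is the score. The parser is hypothetical and not part of MAQAO, and the issue texts are shortened.

```python
def issue_score(line: str) -> int:
    """Return the trailing penalty score of a tab-separated issue line."""
    return int(line.rstrip().rsplit("\t", 1)[-1])


def category_total(issue_lines) -> int:
    """Sum the penalty scores of all issue lines in one category."""
    return sum(issue_score(line) for line in issue_lines)


# Issues listed under "Vectorization Roadblocks" for Loop 56 (texts shortened).
loop_56_roadblocks = [
    "○\t[SA] Presence of calls\t1",
    "○\t[SA] Too many paths (at least 1000 paths)\t1000",
    "○\t[SA] Non innermost loop (InBetween)\t2",
]

# Issues listed under "Loop Computation Issues" for Loop 419 (texts shortened).
loop_419_computation = [
    "○\t[SA] Presence of expensive FP instructions\t4",
    "○\t[SA] Less than 10% of FP ADD/SUB/MUL use FMA\t4",
    "○\t[SA] Large number of scalar integer instructions\t2",
]

print(category_total(loop_56_roadblocks))
print(category_total(loop_419_computation))
```

Read this way, the "Too many paths" issue dominates every score it appears in, so a total of 1000 or more mostly signals complex control flow rather than many separate problems.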

▼Stylizer

[ 4 / 4 ] Application profile is long enough (11.56 s)

To have good quality measurements, it is advised that the application profiling time is greater than 10 seconds.

[ 3 / 3 ] Most of time spent in analyzed modules comes from functions compiled with -g and -fno-omit-frame-pointer

The -g option gives access to debugging information, such as source locations. -fno-omit-frame-pointer improves the accuracy of callchains found during application profiling.

[ 3 / 3 ] Optimization level option is correctly used

[ 2 / 3 ] Security settings from the host restrict profiling. Some metrics will be missing or incomplete.

Current value for kernel.perf_event_paranoid is 2. If possible, set it to 1 or check with your system administrator which flag can be used to achieve this.

[ 3 / 3 ] Architecture specific option -mcpu is used

[ 2 / 2 ] Application is correctly profiled ("Others" category represents 0.00 % of the execution time)

To have a representative profiling, it is advised that the category "Others" represents less than 20% of the execution time in order to analyze as much as possible of the user code

[ 1 / 1 ] Lstopo present. The Topology lstopo report will be generated.

▼Strategizer

[ 4 / 4 ] Enough time of the experiment time spent in analyzed loops (72.46%)

If the time spent in analyzed loops is less than 30%, standard loop optimizations will have a limited impact on application performances.

[ 4 / 4 ] Threads activity is good

On average, more than 92.00% of observed threads are actually active

[ 4 / 4 ] CPU activity is good

CPU cores are active 96.86% of time

[ 4 / 4 ] Loop profile is not flat

At least one loop coverage is greater than 4% (70.34%), representing a hotspot for the application

[ 4 / 4 ] Enough time of the experiment time spent in analyzed innermost loops (71.68%)

If the time spent in analyzed innermost loops is less than 15%, standard innermost loop optimizations such as vectorisation will have a limited impact on application performances.

[ 4 / 4 ] Affinity is good (99.41%)

Threads are not migrating to CPU cores: probably successfully pinned

[ 3 / 3 ] Less than 10% (0.00%) is spent in BLAS1 operations

It could be more efficient to inline by hand BLAS1 operations

[ 0 / 3 ] Too many functions do not use all threads

Functions running on a reduced number of threads (typically sequential code) cover at least 10% of application walltime (18.44%). Check both "Max Inclusive Time Over Threads" and "Nb Threads" in Functions or Loops tabs and consider parallelizing sequential regions or improving parallelization of regions running on a reduced number of threads

[ 3 / 3 ] Cumulative Outermost/In between loops coverage (0.78%) lower than cumulative innermost loop coverage (71.68%)

Having cumulative Outermost/In between loops coverage greater than cumulative innermost loop coverage will make loop optimization more complex

[ 2 / 2 ] Less than 10% (0.00%) is spent in BLAS2 operations

BLAS2 calls usually make poor use of the cache and could benefit from inlining.

[ 2 / 2 ] Less than 10% (0.00%) is spent in Libm/SVML (special functions)

▼Optimizer

Loop ID	Analysis	Penalty Score
○Loop 379 - libggml-cpu.so	Execution Time: 70 % - Vectorization Ratio: 38.46 % - Vector Length Use: 53.85 %
►Loop 72 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 50.00 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1001
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 73 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 41.07 %
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
○Loop 908 - libggml-cpu.so	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Loop 56 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 50.29 %
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 1896 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 8.42 % - Vector Length Use: 38.16 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
○Loop 1908 - libggml-cpu.so	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Loop 419 - libggml-base.so+	Execution Time: 0 % - Vectorization Ratio: 23.17 % - Vector Length Use: 42.00 %
►Loop Computation Issues+		10
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
○Control Flow Issues		0
►Vectorization Roadblocks+		1000
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 53 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 52.17 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Loop 1596 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 27.27 % - Vector Length Use: 36.36 %
►Loop Computation Issues+		8
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Data Access Issues+		2
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower the cost a bit. There are 1 issues ( = data accesses) costing 2 points each.	2
►Vectorization Roadblocks+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower the cost a bit. There are 1 issues ( = data accesses) costing 2 points each.	2

▼Stylizer

[ 4 / 4 ] Application profile is long enough (11.61 s)

To have good quality measurements, it is advised that the application profiling time is greater than 10 seconds.

[ 3 / 3 ] Most of time spent in analyzed modules comes from functions compiled with -g and -fno-omit-frame-pointer

The -g option gives access to debugging information, such as source locations. -fno-omit-frame-pointer improves the accuracy of callchains found during application profiling.

[ 3 / 3 ] Optimization level option is correctly used

[ 2 / 3 ] Security settings from the host restrict profiling. Some metrics will be missing or incomplete.

Current value for kernel.perf_event_paranoid is 2. If possible, set it to 1 or check with your system administrator which flag can be used to achieve this.

[ 3 / 3 ] Architecture specific option -mcpu is used

[ 2 / 2 ] Application is correctly profiled ("Others" category represents 0.00 % of the execution time)

To have a representative profiling, it is advised that the category "Others" represents less than 20% of the execution time in order to analyze as much as possible of the user code

[ 1 / 1 ] Lstopo present. The Topology lstopo report will be generated.

▼Strategizer

[ 4 / 4 ] Enough time of the experiment time spent in analyzed loops (71.88%)

If the time spent in analyzed loops is less than 30%, standard loop optimizations will have a limited impact on application performances.

[ 4 / 4 ] Threads activity is good

On average, more than 91.89% of observed threads are actually active

[ 4 / 4 ] CPU activity is good

CPU cores are active 96.67% of time

[ 4 / 4 ] Loop profile is not flat

At least one loop coverage is greater than 4% (69.60%), representing a hotspot for the application

[ 4 / 4 ] Enough time of the experiment time spent in analyzed innermost loops (71.07%)

If the time spent in analyzed innermost loops is less than 15%, standard innermost loop optimizations such as vectorisation will have a limited impact on application performances.

[ 4 / 4 ] Affinity is good (99.42%)

Threads are not migrating to CPU cores: probably successfully pinned

[ 3 / 3 ] Less than 10% (0.00%) is spent in BLAS1 operations

It could be more efficient to inline by hand BLAS1 operations

[ 0 / 3 ] Too many functions do not use all threads

Functions running on a reduced number of threads (typically sequential code) cover at least 10% of application walltime (21.61%). Check both "Max Inclusive Time Over Threads" and "Nb Threads" in Functions or Loops tabs and consider parallelizing sequential regions or improving parallelization of regions running on a reduced number of threads

[ 3 / 3 ] Cumulative Outermost/In between loops coverage (0.81%) lower than cumulative innermost loop coverage (71.07%)

Having cumulative Outermost/In between loops coverage greater than cumulative innermost loop coverage will make loop optimization more complex

[ 2 / 2 ] Less than 10% (0.00%) is spent in BLAS2 operations

BLAS2 calls usually make poor use of the cache and could benefit from inlining.

[ 2 / 2 ] Less than 10% (0.00%) is spent in Libm/SVML (special functions)

▼Optimizer

Loop ID	Analysis	Penalty Score
○Loop 379 - libggml-cpu.so	Execution Time: 69 % - Vectorization Ratio: 38.46 % - Vector Length Use: 53.85 %
►Loop 72 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 50.00 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1001
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 73 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 41.07 %
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
○Loop 908 - libggml-cpu.so	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Loop 56 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 50.29 %
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 1896 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 8.42 % - Vector Length Use: 38.16 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
○Loop 1908 - libggml-cpu.so	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Loop 53 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 52.17 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Loop 419 - libggml-base.so+	Execution Time: 0 % - Vectorization Ratio: 23.17 % - Vector Length Use: 42.00 %
►Loop Computation Issues+		10
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
○Control Flow Issues		0
►Vectorization Roadblocks+		1000
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 1596 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 27.27 % - Vector Length Use: 36.36 %
►Loop Computation Issues+		8
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Data Access Issues+		2
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower the cost a bit. There are 1 issues ( = data accesses) costing 2 points each.	2
►Vectorization Roadblocks+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower the cost a bit. There are 1 issues ( = data accesses) costing 2 points each.	2

▼Stylizer

[ 4 / 4 ] Application profile is long enough (11.70 s)

To have good quality measurements, it is advised that the application profiling time is greater than 10 seconds.

[ 3 / 3 ] Most of time spent in analyzed modules comes from functions compiled with -g and -fno-omit-frame-pointer

The -g option gives access to debugging information, such as source locations. -fno-omit-frame-pointer improves the accuracy of callchains found during application profiling.

[ 3 / 3 ] Optimization level option is correctly used

[ 2 / 3 ] Security settings from the host restrict profiling. Some metrics will be missing or incomplete.

Current value for kernel.perf_event_paranoid is 2. If possible, set it to 1 or check with your system administrator which flag can be used to achieve this.

[ 3 / 3 ] Architecture specific option -mcpu is used

[ 2 / 2 ] Application is correctly profiled ("Others" category represents 0.00 % of the execution time)

For a representative profile, the "Others" category should account for less than 20% of the execution time, so that as much of the user code as possible is analyzed.

[ 1 / 1 ] Lstopo present. The Topology lstopo report will be generated.

▼Strategizer

[ 4 / 4 ] Enough time of the experiment time spent in analyzed loops (70.32%)

If the time spent in analyzed loops is less than 30%, standard loop optimizations will have a limited impact on application performance.

[ 4 / 4 ] Threads activity is good

On average, more than 91.69% of observed threads are actually active

[ 4 / 4 ] CPU activity is good

CPU cores are active 96.46% of time

[ 4 / 4 ] Loop profile is not flat

At least one loop coverage is greater than 4% (68.06%), representing a hotspot for the application

[ 4 / 4 ] Enough time of the experiment time spent in analyzed innermost loops (69.42%)

If the time spent in analyzed innermost loops is less than 15%, standard innermost loop optimizations such as vectorization will have a limited impact on application performance.

[ 4 / 4 ] Affinity is good (99.47%)

Threads are not migrating between CPU cores: they are probably pinned successfully

[ 3 / 3 ] Less than 10% (0.00%) is spent in BLAS1 operations

It could be more efficient to inline BLAS1 operations by hand

[ 0 / 3 ] Too many functions do not use all threads

Functions running on a reduced number of threads (typically sequential code) cover at least 10% of application walltime (18.19%). Check both "Max Inclusive Time Over Threads" and "Nb Threads" in Functions or Loops tabs and consider parallelizing sequential regions or improving parallelization of regions running on a reduced number of threads

[ 3 / 3 ] Cumulative Outermost/In between loops coverage (0.90%) lower than cumulative innermost loop coverage (69.42%)

Having cumulative Outermost/In between loops coverage greater than cumulative innermost loop coverage will make loop optimization more complex

[ 2 / 2 ] Less than 10% (0.00%) is spent in BLAS2 operations

BLAS2 calls often make poor use of the cache and could benefit from inlining.

[ 2 / 2 ] Less than 10% (0.00%) is spent in Libm/SVML (special functions)
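
To gauge what the 0 / 3 score for "Too many functions do not use all threads" may cost, here is a rough Amdahl-style sketch in C. It assumes that the 18.19% of walltime reported above stays unchanged while only the remaining part is accelerated. The report does not state this, and the function name is hypothetical.

```c
/* Overall speedup when a fraction `fixed_fraction` of the current walltime
 * is left untouched and the rest is made `gain` times faster.
 * new_time / old_time = fixed_fraction + (1 - fixed_fraction) / gain */
double speedup_after_gain(double fixed_fraction, double gain)
{
    return 1.0 / (fixed_fraction + (1.0 - fixed_fraction) / gain);
}
```

Under that assumption, `speedup_after_gain(0.1819, gain)` can never exceed 1 / 0.1819, roughly 5.5, however large `gain` becomes, which is why the report suggests parallelizing the regions that run on a reduced number of threads.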

▼Optimizer

Loop ID	Analysis	Penalty Score
○Loop 379 - libggml-cpu.so	Execution Time: 68 % - Vectorization Ratio: 38.46 % - Vector Length Use: 53.85 %
►Loop 72 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 50.00 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1001
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 73 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 41.07 %
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
○Loop 908 - libggml-cpu.so	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Loop 56 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 50.29 %
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 1896 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 8.42 % - Vector Length Use: 38.16 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
○Loop 1908 - libggml-cpu.so	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Loop 53 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 52.17 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Loop 419 - libggml-base.so+	Execution Time: 0 % - Vectorization Ratio: 23.17 % - Vector Length Use: 42.00 %
►Loop Computation Issues+		10
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
○Control Flow Issues		0
►Vectorization Roadblocks+		1000
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 1596 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 27.27 % - Vector Length Use: 36.36 %
►Loop Computation Issues+		8
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Data Access Issues+		2
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower the cost a bit. There are 1 issues (= data accesses) costing 2 points each.	2
►Vectorization Roadblocks+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower the cost a bit. There are 1 issues (= data accesses) costing 2 points each.	2
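
The FMA hint in the table above (less than 10% of the FP ADD/SUB/MUL operations use FMA) can be sketched in C. The functions below are hypothetical and are not taken from libggml. They only show how an expression can be rearranged into the `a * b + c` shape that a compiler may contract into one fused multiply-add. Whether contraction actually happens depends on the target and on the compiler's floating-point contraction settings, which this report does not show.

```c
#include <stddef.h>

/* Original shape: an add feeding a multiply; there is no a * b + c pattern. */
void scale_plus_one(float *out, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] * (b[i] + 1.0f);
}

/* Reorganized: a * b + a is a multiply followed by a dependent add, which
 * a compiler may contract into a single FMA instruction. */
void scale_plus_one_fma(float *out, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] * b[i] + a[i];
}
```

The two forms are algebraically equal but can round differently, so a rewrite of this kind should be checked against the accuracy requirements of the surrounding code.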

▼Stylizer

[ 4 / 4 ] Application profile is long enough (11.75 s)

To have good quality measurements, it is advised that the application profiling time is greater than 10 seconds.

[ 3 / 3 ] Most of time spent in analyzed modules comes from functions compiled with -g and -fno-omit-frame-pointer

The -g option gives access to debugging information, such as source locations. -fno-omit-frame-pointer improves the accuracy of callchains found during application profiling.

[ 3 / 3 ] Optimization level option is correctly used

[ 2 / 3 ] Security settings from the host restrict profiling. Some metrics will be missing or incomplete.

Current value for kernel.perf_event_paranoid is 2. If possible, set it to 1 or check with your system administrator which flag can be used to achieve this.

[ 3 / 3 ] Architecture specific option -mcpu is used

[ 2 / 2 ] Application is correctly profiled ("Others" category represents 0.00 % of the execution time)

For a representative profile, the "Others" category should account for less than 20% of the execution time, so that as much of the user code as possible is analyzed.

[ 1 / 1 ] Lstopo present. The Topology lstopo report will be generated.

▼Strategizer

[ 4 / 4 ] Enough time of the experiment time spent in analyzed loops (70.50%)

If the time spent in analyzed loops is less than 30%, standard loop optimizations will have a limited impact on application performance.

[ 4 / 4 ] Threads activity is good

On average, more than 91.77% of observed threads are actually active

[ 4 / 4 ] CPU activity is good

CPU cores are active 96.55% of time

[ 4 / 4 ] Loop profile is not flat

At least one loop coverage is greater than 4% (68.23%), representing a hotspot for the application

[ 4 / 4 ] Enough time of the experiment time spent in analyzed innermost loops (69.67%)

If the time spent in analyzed innermost loops is less than 15%, standard innermost loop optimizations such as vectorization will have a limited impact on application performance.

[ 4 / 4 ] Affinity is good (99.49%)

Threads are not migrating between CPU cores: they are probably pinned successfully

[ 3 / 3 ] Less than 10% (0.00%) is spent in BLAS1 operations

It could be more efficient to inline BLAS1 operations by hand

[ 0 / 3 ] Too many functions do not use all threads

Functions running on a reduced number of threads (typically sequential code) cover at least 10% of application walltime (18.97%). Check both "Max Inclusive Time Over Threads" and "Nb Threads" in Functions or Loops tabs and consider parallelizing sequential regions or improving parallelization of regions running on a reduced number of threads

[ 3 / 3 ] Cumulative Outermost/In between loops coverage (0.82%) lower than cumulative innermost loop coverage (69.67%)

Having cumulative Outermost/In between loops coverage greater than cumulative innermost loop coverage will make loop optimization more complex

[ 2 / 2 ] Less than 10% (0.00%) is spent in BLAS2 operations

BLAS2 calls often make poor use of the cache and could benefit from inlining.

[ 2 / 2 ] Less than 10% (0.00%) is spent in Libm/SVML (special functions)

▼Optimizer

Loop ID	Analysis	Penalty Score
○Loop 379 - libggml-cpu.so	Execution Time: 68 % - Vectorization Ratio: 38.46 % - Vector Length Use: 53.85 %
►Loop 72 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 50.00 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1001
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 73 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 41.07 %
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
○Loop 908 - libggml-cpu.so	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Loop 56 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 50.29 %
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 1896 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 8.42 % - Vector Length Use: 38.16 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
○Loop 1908 - libggml-cpu.so	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Loop 53 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 52.17 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Loop 419 - libggml-base.so+	Execution Time: 0 % - Vectorization Ratio: 23.17 % - Vector Length Use: 42.00 %
►Loop Computation Issues+		10
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
○Control Flow Issues		0
►Vectorization Roadblocks+		1000
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 1596 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 27.27 % - Vector Length Use: 36.36 %
►Loop Computation Issues+		8
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Data Access Issues+		2
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower the cost a bit. There are 1 issues (= data accesses) costing 2 points each.	2
►Vectorization Roadblocks+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower the cost a bit. There are 1 issues (= data accesses) costing 2 points each.	2
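
The recurring "Presence of calls" entries in the table above suggest inlining. As a minimal C sketch (the helper and the loop are hypothetical, not code from libggml-cpu.so), defining a small helper as `static inline` in the same translation unit as the loop makes its body visible to the compiler, which can then remove the call and consider vectorizing the loop. `inline` is only a hint, so the compiler may still decide not to inline.

```c
#include <stddef.h>

/* Small helper kept in the same translation unit so that its body is
 * visible at the call site. */
static inline float clamp01(float v)
{
    if (v < 0.0f) return 0.0f;
    if (v > 1.0f) return 1.0f;
    return v;
}

/* Loop whose only call is to the inlinable helper above. */
void clamp_all(float *x, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        x[i] = clamp01(x[i]);
}
```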

▼Stylizer

[ 4 / 4 ] Application profile is long enough (11.68 s)

To have good quality measurements, it is advised that the application profiling time is greater than 10 seconds.

[ 3 / 3 ] Most of time spent in analyzed modules comes from functions compiled with -g and -fno-omit-frame-pointer

The -g option gives access to debugging information, such as source locations. -fno-omit-frame-pointer improves the accuracy of callchains found during application profiling.

[ 3 / 3 ] Optimization level option is correctly used

[ 2 / 3 ] Security settings from the host restrict profiling. Some metrics will be missing or incomplete.

Current value for kernel.perf_event_paranoid is 2. If possible, set it to 1 or check with your system administrator which flag can be used to achieve this.

[ 3 / 3 ] Architecture specific option -mcpu is used

[ 2 / 2 ] Application is correctly profiled ("Others" category represents 0.00 % of the execution time)

For a representative profile, the "Others" category should account for less than 20% of the execution time, so that as much of the user code as possible is analyzed.

[ 1 / 1 ] Lstopo present. The Topology lstopo report will be generated.

▼Strategizer

[ 4 / 4 ] Enough time of the experiment time spent in analyzed loops (70.76%)

If the time spent in analyzed loops is less than 30%, standard loop optimizations will have a limited impact on application performance.

[ 4 / 4 ] Threads activity is good

On average, more than 91.81% of observed threads are actually active

[ 4 / 4 ] CPU activity is good

CPU cores are active 96.61% of time

[ 4 / 4 ] Loop profile is not flat

At least one loop coverage is greater than 4% (68.56%), representing a hotspot for the application

[ 4 / 4 ] Enough time of the experiment time spent in analyzed innermost loops (69.94%)

If the time spent in analyzed innermost loops is less than 15%, standard innermost loop optimizations such as vectorization will have a limited impact on application performance.

[ 4 / 4 ] Affinity is good (99.48%)

Threads are not migrating between CPU cores: they are probably pinned successfully

[ 3 / 3 ] Less than 10% (0.00%) is spent in BLAS1 operations

It could be more efficient to inline BLAS1 operations by hand

[ 0 / 3 ] Too many functions do not use all threads

Functions running on a reduced number of threads (typically sequential code) cover at least 10% of application walltime (21.23%). Check both "Max Inclusive Time Over Threads" and "Nb Threads" in Functions or Loops tabs and consider parallelizing sequential regions or improving parallelization of regions running on a reduced number of threads

[ 3 / 3 ] Cumulative Outermost/In between loops coverage (0.81%) lower than cumulative innermost loop coverage (69.94%)

Having cumulative Outermost/In between loops coverage greater than cumulative innermost loop coverage will make loop optimization more complex

[ 2 / 2 ] Less than 10% (0.00%) is spent in BLAS2 operations

BLAS2 calls often make poor use of the cache and could benefit from inlining.

[ 2 / 2 ] Less than 10% (0.00%) is spent in Libm/SVML (special functions)

▼Optimizer

Loop ID	Analysis	Penalty Score
○Loop 379 - libggml-cpu.so	Execution Time: 68 % - Vectorization Ratio: 38.46 % - Vector Length Use: 53.85 %
►Loop 72 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 50.00 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1001
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 73 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 41.07 %
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
○Loop 908 - libggml-cpu.so	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Loop 56 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 50.29 %
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 1896 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 8.42 % - Vector Length Use: 38.16 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
○Loop 1908 - libggml-cpu.so	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Loop 53 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 52.17 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Loop 419 - libggml-base.so+	Execution Time: 0 % - Vectorization Ratio: 23.17 % - Vector Length Use: 42.00 %
►Loop Computation Issues+		10
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
○Control Flow Issues		0
►Vectorization Roadblocks+		1000
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 1596 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 27.27 % - Vector Length Use: 36.36 %
►Loop Computation Issues+		8
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Data Access Issues+		2
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower the cost a bit. There are 1 issues (= data accesses) costing 2 points each.	2
►Vectorization Roadblocks+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower the cost a bit. There are 1 issues (= data accesses) costing 2 points each.	2
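
Several loops in the table above are flagged with "Too many paths" and the advice to simplify the control structure. One common way to do this is to move a loop-invariant condition out of the loop body. The C sketch below is a deliberately tiny, hypothetical illustration of that idea; the loops in the report have at least 1000 paths and their source is not shown here.

```c
#include <stddef.h>

/* Branch inside the loop body: every iteration re-tests `negate`, which
 * gives the body two control-flow paths. */
void apply_branchy(float *x, size_t n, int negate, float scale)
{
    for (size_t i = 0; i < n; ++i) {
        if (negate)
            x[i] = -x[i] * scale;
        else
            x[i] = x[i] * scale;
    }
}

/* The condition does not depend on i, so it is folded into a factor once,
 * before the loop, leaving a single straight-line body. */
void apply_hoisted(float *x, size_t n, int negate, float scale)
{
    const float factor = negate ? -scale : scale;
    for (size_t i = 0; i < n; ++i)
        x[i] = x[i] * factor;
}
```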

▼Stylizer

[ 4 / 4 ] Application profile is long enough (11.66 s)

To have good quality measurements, it is advised that the application profiling time is greater than 10 seconds.

[ 3 / 3 ] Most of time spent in analyzed modules comes from functions compiled with -g and -fno-omit-frame-pointer

The -g option gives access to debugging information, such as source locations. -fno-omit-frame-pointer improves the accuracy of callchains found during application profiling.

[ 3 / 3 ] Optimization level option is correctly used

[ 2 / 3 ] Security settings from the host restrict profiling. Some metrics will be missing or incomplete.

Current value for kernel.perf_event_paranoid is 2. If possible, set it to 1 or check with your system administrator which flag can be used to achieve this.

[ 3 / 3 ] Architecture specific option -mcpu is used

[ 2 / 2 ] Application is correctly profiled ("Others" category represents 0.00 % of the execution time)

For a representative profile, the "Others" category should account for less than 20% of the execution time, so that as much of the user code as possible is analyzed.

[ 1 / 1 ] Lstopo present. The Topology lstopo report will be generated.

▼Strategizer

[ 4 / 4 ] Enough time of the experiment time spent in analyzed loops (72.06%)

If the time spent in analyzed loops is less than 30%, standard loop optimizations will have a limited impact on application performance.

[ 4 / 4 ] Threads activity is good

On average, more than 92.02% of observed threads are actually active

[ 4 / 4 ] CPU activity is good

CPU cores are active 96.85% of time

[ 4 / 4 ] Loop profile is not flat

At least one loop coverage is greater than 4% (69.87%), representing a hotspot for the application

[ 4 / 4 ] Enough time of the experiment time spent in analyzed innermost loops (71.23%)

If the time spent in analyzed innermost loops is less than 15%, standard innermost loop optimizations such as vectorization will have a limited impact on application performance.

[ 4 / 4 ] Affinity is good (99.47%)

Threads are not migrating between CPU cores: they are probably pinned successfully

[ 3 / 3 ] Less than 10% (0.00%) is spent in BLAS1 operations

It could be more efficient to inline BLAS1 operations by hand

[ 0 / 3 ] Too many functions do not use all threads

Functions running on a reduced number of threads (typically sequential code) cover at least 10% of application walltime (16.08%). Check both "Max Inclusive Time Over Threads" and "Nb Threads" in Functions or Loops tabs and consider parallelizing sequential regions or improving parallelization of regions running on a reduced number of threads

[ 3 / 3 ] Cumulative Outermost/In between loops coverage (0.83%) lower than cumulative innermost loop coverage (71.23%)

Having cumulative Outermost/In between loops coverage greater than cumulative innermost loop coverage will make loop optimization more complex

[ 2 / 2 ] Less than 10% (0.00%) is spent in BLAS2 operations

BLAS2 calls often make poor use of the cache and could benefit from inlining.

[ 2 / 2 ] Less than 10% (0.00%) is spent in Libm/SVML (special functions)

▼Optimizer

Loop ID	Analysis	Penalty Score
○Loop 379 - libggml-cpu.so	Execution Time: 69 % - Vectorization Ratio: 38.46 % - Vector Length Use: 53.85 %
►Loop 72 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 50.00 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1001
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 73 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 41.07 %
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
○Loop 908 - libggml-cpu.so	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Loop 56 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 50.29 %
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 1896 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 8.42 % - Vector Length Use: 38.16 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
○Loop 1908 - libggml-cpu.so	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Loop 53 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 52.17 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Loop 419 - libggml-base.so+	Execution Time: 0 % - Vectorization Ratio: 23.17 % - Vector Length Use: 42.00 %
►Loop Computation Issues+		10
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
○Control Flow Issues		0
►Vectorization Roadblocks+		1000
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 1596 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 27.27 % - Vector Length Use: 36.36 %
►Loop Computation Issues+		8
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Data Access Issues+		2
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower the cost a bit. There are 1 issues ( = data accesses) costing 2 points each.	2
►Vectorization Roadblocks+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower the cost a bit. There are 1 issues ( = data accesses) costing 2 points each.	2
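
The category scores in the table above appear to be the sum of the issue rows listed under them, for instance 1 + 1000 + 2 = 1003 for the Vectorization Roadblocks of Loop 73. The sketch below checks that reading against values copied from the table. The summation rule is inferred from the numbers, not taken from MAQAO documentation, and the dictionary layout and function name are illustrative only.

```python
# (loop, category) -> (reported category score, issue scores listed under it).
# Values are copied from the Optimizer table above.
REPORTED = {
    ("Loop 72", "Control Flow Issues"): (1, [1]),
    ("Loop 72", "Vectorization Roadblocks"): (1001, [1, 1000]),
    ("Loop 73", "Control Flow Issues"): (3, [1, 2]),
    ("Loop 73", "Vectorization Roadblocks"): (1003, [1, 1000, 2]),
    ("Loop 419", "Loop Computation Issues"): (10, [4, 4, 2]),
    ("Loop 1596", "Loop Computation Issues"): (8, [4, 4]),
    ("Loop 1596", "Vectorization Roadblocks"): (3, [1, 2]),
}


def mismatches(table):
    """Return the keys whose reported score differs from the sum of its issues."""
    return [key for key, (score, issues) in table.items() if score != sum(issues)]


if __name__ == "__main__":
    bad = mismatches(REPORTED)
    print("all category scores match" if not bad else f"mismatches: {bad}")
```

Under this reading, the "Too many paths" issue (1000 points) dominates every score above 1000, so two loops with scores 1001 and 1003 differ only in their minor issues.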

▼Stylizer

[ 4 / 4 ] Application profile is long enough (11.66 s)

To have good quality measurements, it is advised that the application profiling time is greater than 10 seconds.

[ 3 / 3 ] Most of time spent in analyzed modules comes from functions compiled with -g and -fno-omit-frame-pointer

The -g option gives access to debugging information, such as source locations. -fno-omit-frame-pointer improves the accuracy of call chains found during application profiling.

[ 3 / 3 ] Optimization level option is correctly used

[ 2 / 3 ] Security settings from the host restrict profiling. Some metrics will be missing or incomplete.

Current value for kernel.perf_event_paranoid is 2. If possible, set it to 1 or check with your system administrator which flag can be used to achieve this.
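
The value this check refers to can be read before profiling. A minimal sketch, assuming a Linux host where the sysctl is exposed at /proc/sys/kernel/perf_event_paranoid; the helper names are hypothetical and not part of MAQAO, and the target level of 1 is simply the one suggested above.

```python
from pathlib import Path
from typing import Optional

# Standard procfs location of the kernel.perf_event_paranoid sysctl on Linux.
PARANOID_PATH = Path("/proc/sys/kernel/perf_event_paranoid")


def read_paranoid_level(path: Path = PARANOID_PATH) -> Optional[int]:
    """Return the current level, or None if the file is missing or unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def is_restrictive(level: Optional[int], target: int = 1) -> bool:
    """True when the level is known and above the target suggested by the report."""
    return level is not None and level > target


if __name__ == "__main__":
    level = read_paranoid_level()
    if level is None:
        print("kernel.perf_event_paranoid is not readable on this host")
    elif is_restrictive(level):
        print(f"level {level}: profiling is restricted; the report suggests 1")
    else:
        print(f"level {level}: at or below the level the report suggests")
```

Changing the value normally requires administrator rights, for example through `sysctl -w kernel.perf_event_paranoid=1`; whether that is permitted depends on the host's policy.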

[ 3 / 3 ] Architecture specific option -mcpu is used

[ 2 / 2 ] Application is correctly profiled ("Others" category represents 0.00 % of the execution time)

For a representative profile, the "Others" category should represent less than 20% of the execution time, so that as much of the user code as possible is analyzed.

[ 1 / 1 ] Lstopo present. The Topology lstopo report will be generated.

▼Strategizer

[ 4 / 4 ] Enough of the experiment time is spent in analyzed loops (71.32%)

If the time spent in analyzed loops is less than 30%, standard loop optimizations will have a limited impact on application performance.

[ 4 / 4 ] Threads activity is good

On average, more than 91.78% of observed threads are actually active

[ 4 / 4 ] CPU activity is good

CPU cores are active 96.63% of time

[ 4 / 4 ] Loop profile is not flat

At least one loop has a coverage greater than 4% (69.27%), making it a hotspot for the application

[ 4 / 4 ] Enough of the experiment time is spent in analyzed innermost loops (70.56%)

If the time spent in analyzed innermost loops is less than 15%, standard innermost loop optimizations such as vectorization will have a limited impact on application performance.

[ 4 / 4 ] Affinity is good (99.48%)

Threads are not migrating between CPU cores: they are probably successfully pinned

[ 3 / 3 ] Less than 10% (0.00%) is spent in BLAS1 operations

It could be more efficient to inline BLAS1 operations by hand

[ 0 / 3 ] Too many functions do not use all threads

Functions running on a reduced number of threads (typically sequential code) cover at least 10% of application walltime (17.94%). Check both "Max Inclusive Time Over Threads" and "Nb Threads" in Functions or Loops tabs and consider parallelizing sequential regions or improving parallelization of regions running on a reduced number of threads
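
One rough way to gauge the weight of this check is Amdahl's law, which is not part of the MAQAO output. The sketch below treats the 17.94% reported above as a fully serial fraction and assumes the rest of the run scales perfectly. Both are strong simplifications: "reduced number of threads" is not necessarily one thread, so the result is only an illustrative bound. The function name and thread counts are made up for the example.

```python
def amdahl_speedup(serial_fraction: float, n_threads: int) -> float:
    """Upper bound on speedup when serial_fraction of the runtime cannot scale."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_threads)


if __name__ == "__main__":
    serial = 0.1794  # coverage reported by the check above
    for n in (4, 16, 64):  # illustrative thread counts, not taken from the report
        print(f"{n:3d} threads: at most {amdahl_speedup(serial, n):.2f}x")
    # With an unlimited number of threads the bound tends to 1 / serial.
    print(f"limit: {1.0 / serial:.2f}x")
```

Under these assumptions the bound stays below roughly 5.6x however many threads are added, which is why the report flags regions that do not use all threads.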

[ 3 / 3 ] Cumulative Outermost/In between loops coverage (0.77%) lower than cumulative innermost loop coverage (70.56%)

Having cumulative Outermost/In between loops coverage greater than cumulative innermost loop coverage will make loop optimization more complex

[ 2 / 2 ] Less than 10% (0.00%) is spent in BLAS2 operations

BLAS2 calls often make poor use of the cache and could benefit from inlining.

[ 2 / 2 ] Less than 10% (0.00%) is spent in Libm/SVML (special functions)

▼Optimizer

Loop ID	Analysis	Penalty Score
○Loop 379 - libggml-cpu.so	Execution Time: 69 % - Vectorization Ratio: 38.46 % - Vector Length Use: 53.85 %
►Loop 72 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 50.00 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1001
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○Loop 908 - libggml-cpu.so	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Loop 73 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 41.07 %
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 56 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 50.29 %
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 1896 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 8.42 % - Vector Length Use: 38.16 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Control Flow Issues+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1003
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
○Loop 1908 - libggml-cpu.so	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Loop 53 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 52.17 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Loop 419 - libggml-base.so+	Execution Time: 0 % - Vectorization Ratio: 23.17 % - Vector Length Use: 42.00 %
►Loop Computation Issues+		10
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
○Control Flow Issues		0
►Vectorization Roadblocks+		1000
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 1596 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 27.27 % - Vector Length Use: 36.36 %
►Loop Computation Issues+		8
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Data Access Issues+		2
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower the cost a bit. There are 1 issues ( = data accesses) costing 2 points each.	2
►Vectorization Roadblocks+		3
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower the cost a bit. There are 1 issues ( = data accesses) costing 2 points each.	2
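
In the table above, the loops with the highest penalty scores all show an execution time of 0 %, while Loop 379 accounts for 69 % and has no issue rows. When working from a text copy of the report, ranking loops by execution time first can help decide where to look. The sketch below parses loop header rows for that purpose. It assumes the tab-separated layout seen in this document; the regular expression, the `LoopRow` type, and the function name are hypothetical, not a MAQAO export format or API.

```python
import re
from typing import NamedTuple, Optional

# Matches a loop header row; the leading marker and a trailing '+' are ignored.
LOOP_ROW = re.compile(
    r"Loop (?P<id>\d+) - (?P<module>[^\s+]+)\+?\s+"
    r"Execution Time: (?P<time>[\d.]+) % - "
    r"Vectorization Ratio: (?P<vec>[\d.]+) % - "
    r"Vector Length Use: (?P<vlen>[\d.]+) %"
)


class LoopRow(NamedTuple):
    loop_id: int
    module: str
    exec_time: float       # percent of execution time
    vec_ratio: float       # vectorization ratio, percent
    vec_length_use: float  # vector length use, percent


def parse_loop_row(line: str) -> Optional[LoopRow]:
    """Parse one loop header row; return None for issue or category rows."""
    m = LOOP_ROW.search(line)
    if m is None:
        return None
    return LoopRow(int(m["id"]), m["module"], float(m["time"]),
                   float(m["vec"]), float(m["vlen"]))


# Rows copied from the table above (tabs written as \t).
SAMPLE = [
    "○Loop 379 - libggml-cpu.so\tExecution Time: 69 % - Vectorization Ratio: 38.46 % - Vector Length Use: 53.85 %",
    "►Loop 72 - libggml-cpu.so+\tExecution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 50.00 %",
    "►Control Flow Issues+\t\t1",
]

if __name__ == "__main__":
    rows = [r for r in map(parse_loop_row, SAMPLE) if r is not None]
    for r in sorted(rows, key=lambda r: r.exec_time, reverse=True):
        print(f"Loop {r.loop_id} ({r.module}): {r.exec_time:.0f} % of time, "
              f"{r.vec_ratio:.2f} % vectorized")
```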

Report Configuration

exec - 2025-09-12 15:43:22 - MAQAO 2025.1.2
