OV - exec - Summary

exec - 2025-11-26 16:50:35 - MAQAO 2025.1.3


run_0
run_1
run_2
run_3
run_4
run_5
run_6
run_7
run_8
run_9
run_10

▼Stylizer

[ 0 / 9 ] Compilation options are not available

Compilation options are an important optimization lever, but ONE-View is not able to analyze them.

[ 4 / 4 ] Application profile is long enough (557.43 s)

To have good quality measurements, it is advised that the application profiling time is greater than 10 seconds.

[ 2 / 3 ] Security settings from the host restrict profiling. Some metrics will be missing or incomplete.

The current value of kernel.perf_event_paranoid is 2. If possible, set it to 1, or check with your system administrator which setting can be used to achieve this.

[ 2 / 2 ] Application is correctly profiled ("Others" category represents 0.02 % of the execution time)

For a representative profile, the "Others" category should represent less than 20% of the execution time, so that as much of the user code as possible is analyzed.

[ 1 / 1 ] Lstopo present. The Topology lstopo report will be generated.
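
The kernel.perf_event_paranoid check above can be reproduced outside the profiler. The following is a minimal sketch, assuming a Linux host where the setting is exposed at /proc/sys/kernel/perf_event_paranoid; the function names are hypothetical, and the "greater than 1" threshold simply mirrors the value of 1 advised in this report.

```c
#include <stdio.h>
#include <stdlib.h>

/* Parse the text content of the sysctl file into an integer level.
   Returns 0 on success, -1 if the text does not start with an integer. */
int parse_paranoid_level(const char *text, int *level)
{
    char *end = NULL;
    long value = strtol(text, &end, 10);
    if (end == text)
        return -1;
    *level = (int)value;
    return 0;
}

/* Read the current level from procfs (Linux only). Returns 0 on success. */
int read_paranoid_level(int *level)
{
    char buf[32];
    FILE *f = fopen("/proc/sys/kernel/perf_event_paranoid", "r");
    if (f == NULL)
        return -1;
    if (fgets(buf, sizeof buf, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return parse_paranoid_level(buf, level);
}

/* Assumption taken from the report: 1 is the advised value,
   so anything higher is treated as restrictive here. */
int restricts_profiling(int level)
{
    return level > 1;
}
```

Calling read_paranoid_level() before a profiling run and testing the result with restricts_profiling() gives an early warning that some metrics may be missing. Changing the value itself usually requires administrator privileges.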

▶Strategizer

[ 4 / 4 ] Enough of the experiment time is spent in analyzed loops (99.45%)

If the time spent in analyzed loops is less than 30%, standard loop optimizations will have a limited impact on application performance.

[ 4 / 4 ] Threads activity is good

On average, more than 99.80% of observed threads are actually active

[ 4 / 4 ] CPU activity is good

CPU cores are active 99.80% of the time

[ 4 / 4 ] Loop profile is not flat

At least one loop coverage is greater than 4% (97.03%), representing a hotspot for the application

[ 4 / 4 ] Enough of the experiment time is spent in analyzed innermost loops (98.29%)

If the time spent in analyzed innermost loops is less than 15%, standard innermost loop optimizations such as vectorisation will have a limited impact on application performance.

[ 4 / 4 ] Affinity is good (99.99%)

Threads are not migrating between CPU cores: they are probably successfully pinned

[ 3 / 3 ] Less than 10% (0.00%) is spent in BLAS1 operations

It could be more efficient to inline BLAS1 operations by hand

[ 3 / 3 ] Functions mostly use all threads

Functions running on a reduced number of threads (typically sequential code) cover less than 10% of application walltime (0.00%)

[ 3 / 3 ] Cumulative Outermost/In between loops coverage (1.17%) lower than cumulative innermost loop coverage (98.29%)

Having cumulative Outermost/In between loops coverage greater than cumulative innermost loop coverage will make loop optimization more complex

[ 2 / 2 ] Less than 10% (0.00%) is spent in BLAS2 operations

BLAS2 calls usually make poor use of the cache and could benefit from inlining.

[ 2 / 2 ] Less than 10% (0.13%) is spent in Libm/SVML (special functions)
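
The BLAS1 hint above refers to replacing a library call with a plain loop that the compiler can inline and vectorise in place. In this profile BLAS1 accounts for 0.00% of the time, so no change is called for; the sketch below only illustrates what hand-inlining means, using the axpy operation as an example. The function name is hypothetical and is not taken from the profiled code.

```c
#include <stddef.h>

/* y[i] += alpha * x[i]: the BLAS1 "axpy" operation written as a plain loop.
   `restrict` tells the compiler that x and y do not overlap, which helps
   it vectorise the loop without runtime aliasing checks. */
void axpy_inline(size_t n, float alpha,
                 const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}
```

For short vectors a loop like this avoids the call overhead of a library routine; for long vectors a tuned BLAS implementation may still be faster, so the choice is worth measuring.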

▼Optimizer

Loop ID	Analysis	Penalty Score
►Loop 1828 - libggml-cpu.so+	Execution Time: 97 % - Vectorization Ratio: 8.82 % - Vector Length Use: 21.32 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Data Access Issues+		16
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 8 issues ( = data accesses) costing 2 point each.	16
►Vectorization Roadblocks+		16
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 8 issues ( = data accesses) costing 2 point each.	16
►Loop 1403 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 15.48 % - Vector Length Use: 31.37 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Control Flow Issues+		488
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (481 paths) - Simplify control structure. There are 481 issues ( = paths) costing 1 point each with a malus of 4 points.	485
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		488
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (481 paths) - Simplify control structure. There are 481 issues ( = paths) costing 1 point each with a malus of 4 points.	485
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 51 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 24.01 %
►Control Flow Issues+		52
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (45 paths) - Simplify control structure. There are 45 issues ( = paths) costing 1 point each with a malus of 4 points.	49
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		52
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (45 paths) - Simplify control structure. There are 45 issues ( = paths) costing 1 point each with a malus of 4 points.	49
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 1803 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 15.24 % - Vector Length Use: 14.64 %
►Loop Computation Issues+		6
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
○Control Flow Issues		0
►Vectorization Roadblocks+		1000
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 1 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 9.23 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Loop 752 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 88.89 % - Vector Length Use: 97.32 %
►Loop Computation Issues+		4
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
►Data Access Issues+		6
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 3 issues ( = data accesses) costing 2 point each.	6
►Vectorization Roadblocks+		6
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 3 issues ( = data accesses) costing 2 point each.	6
►Loop 1824 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 23.16 % - Vector Length Use: 36.18 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Control Flow Issues+		2
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1002
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 53 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 22.40 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Loop 749 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Data Access Issues+		32
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 16 issues ( = data accesses) costing 2 point each.	32
►Vectorization Roadblocks+		32
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 16 issues ( = data accesses) costing 2 point each.	32
►Loop 1405 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Data Access Issues+		32
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 16 issues ( = data accesses) costing 2 point each.	32
►Vectorization Roadblocks+		32
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 16 issues ( = data accesses) costing 2 point each.	32
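
The "constant non unit stride data access" issue reported above (loops 1828, 752, 749 and 1405) comes with the suggestion to use array restructuring. The report does not show the data layout used inside libggml-cpu.so, so the following is only a generic sketch of that technique with hypothetical types: an array-of-structures layout is converted to a structure-of-arrays layout so that the hot loop reads memory with unit stride.

```c
#include <stddef.h>

/* Array-of-structures layout: reading only `x` walks memory
   with a constant stride of three floats. */
typedef struct { float x, y, z; } point_aos;

float sum_x_aos(const point_aos *p, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += p[i].x;              /* stride = sizeof(point_aos) */
    return s;
}

/* Structure-of-arrays layout: each field is contiguous,
   so the same loop becomes a unit-stride access. */
typedef struct { float *x, *y, *z; } points_soa;

float sum_x_soa(const points_soa *p, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += p->x[i];             /* stride = sizeof(float) */
    return s;
}

/* One-time restructuring pass; the caller owns the destination arrays,
   each of which must hold at least n floats. */
void aos_to_soa(const point_aos *src, points_soa *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        dst->x[i] = src[i].x;
        dst->y[i] = src[i].y;
        dst->z[i] = src[i].z;
    }
}
```

The restructuring pass has its own cost, so it pays off mainly when the data is traversed many times after being converted once.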

▼Stylizer

[ 0 / 9 ] Compilation options are not available

Compilation options are an important optimization lever, but ONE-View is not able to analyze them.

[ 4 / 4 ] Application profile is long enough (277.65 s)

To have good quality measurements, it is advised that the application profiling time is greater than 10 seconds.

[ 2 / 3 ] Security settings from the host restrict profiling. Some metrics will be missing or incomplete.

The current value of kernel.perf_event_paranoid is 2. If possible, set it to 1, or check with your system administrator which setting can be used to achieve this.

[ 2 / 2 ] Application is correctly profiled ("Others" category represents 0.02 % of the execution time)

For a representative profile, the "Others" category should represent less than 20% of the execution time, so that as much of the user code as possible is analyzed.

[ 1 / 1 ] Lstopo present. The Topology lstopo report will be generated.

▶Strategizer

[ 4 / 4 ] Enough of the experiment time is spent in analyzed loops (99.29%)

If the time spent in analyzed loops is less than 30%, standard loop optimizations will have a limited impact on application performance.

[ 4 / 4 ] Threads activity is good

On average, more than 199.07% of observed threads are actually active

[ 4 / 4 ] CPU activity is good

CPU cores are active 99.73% of the time

[ 4 / 4 ] Loop profile is not flat

At least one loop coverage is greater than 4% (96.75%), representing a hotspot for the application

[ 4 / 4 ] Enough of the experiment time is spent in analyzed innermost loops (98.12%)

If the time spent in analyzed innermost loops is less than 15%, standard innermost loop optimizations such as vectorisation will have a limited impact on application performance.

[ 4 / 4 ] Affinity is good (99.93%)

Threads are not migrating between CPU cores: they are probably successfully pinned

[ 3 / 3 ] Less than 10% (0.00%) is spent in BLAS1 operations

It could be more efficient to inline BLAS1 operations by hand

[ 3 / 3 ] Functions mostly use all threads

Functions running on a reduced number of threads (typically sequential code) cover less than 10% of application walltime (0.17%)

[ 3 / 3 ] Cumulative Outermost/In between loops coverage (1.17%) lower than cumulative innermost loop coverage (98.12%)

Having cumulative Outermost/In between loops coverage greater than cumulative innermost loop coverage will make loop optimization more complex

[ 2 / 2 ] Less than 10% (0.00%) is spent in BLAS2 operations

BLAS2 calls usually make poor use of the cache and could benefit from inlining.

[ 2 / 2 ] Less than 10% (0.12%) is spent in Libm/SVML (special functions)

▼Optimizer

Loop ID	Analysis	Penalty Score
►Loop 1828 - libggml-cpu.so+	Execution Time: 96 % - Vectorization Ratio: 8.82 % - Vector Length Use: 21.32 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Data Access Issues+		16
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 8 issues ( = data accesses) costing 2 point each.	16
►Vectorization Roadblocks+		16
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 8 issues ( = data accesses) costing 2 point each.	16
►Loop 1403 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 15.48 % - Vector Length Use: 31.37 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Control Flow Issues+		488
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (481 paths) - Simplify control structure. There are 481 issues ( = paths) costing 1 point each with a malus of 4 points.	485
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		488
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (481 paths) - Simplify control structure. There are 481 issues ( = paths) costing 1 point each with a malus of 4 points.	485
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 51 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 24.01 %
►Control Flow Issues+		52
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (45 paths) - Simplify control structure. There are 45 issues ( = paths) costing 1 point each with a malus of 4 points.	49
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		52
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (45 paths) - Simplify control structure. There are 45 issues ( = paths) costing 1 point each with a malus of 4 points.	49
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 1803 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 15.24 % - Vector Length Use: 14.64 %
►Loop Computation Issues+		6
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
○Control Flow Issues		0
►Vectorization Roadblocks+		1000
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 1 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 9.23 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Loop 752 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 88.89 % - Vector Length Use: 97.32 %
►Loop Computation Issues+		4
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
►Data Access Issues+		6
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 3 issues ( = data accesses) costing 2 point each.	6
►Vectorization Roadblocks+		6
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 3 issues ( = data accesses) costing 2 point each.	6
►Loop 45 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 23.99 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1001
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 1824 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 23.16 % - Vector Length Use: 36.18 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Control Flow Issues+		2
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1002
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 53 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 22.40 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Loop 749 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Data Access Issues+		32
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 16 issues ( = data accesses) costing 2 point each.	32
►Vectorization Roadblocks+		32
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 16 issues ( = data accesses) costing 2 point each.	32
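
Several loops above (1403, 51, 1803, 45 and 1824) are flagged with "Too many paths" and the advice to simplify the control structure. One common way to reduce the number of paths through a loop body is loop unswitching: a test that does not change between iterations is moved outside the loop. The sketch below is a generic illustration with hypothetical function names; it is not derived from the profiled code, whose source the report does not show.

```c
#include <stddef.h>

/* Branchy version: the loop-invariant test on `mode` is repeated on
   every iteration, so each iteration has two possible paths. */
void scale_or_shift_branchy(float *v, size_t n, int mode, float k)
{
    for (size_t i = 0; i < n; ++i) {
        if (mode == 0)
            v[i] *= k;
        else
            v[i] += k;
    }
}

/* Unswitched version: the test is evaluated once, and each of the
   two resulting loops has a single straight-line body. */
void scale_or_shift_unswitched(float *v, size_t n, int mode, float k)
{
    if (mode == 0) {
        for (size_t i = 0; i < n; ++i)
            v[i] *= k;
    } else {
        for (size_t i = 0; i < n; ++i)
            v[i] += k;
    }
}
```

Optimizing compilers can often apply this transformation on their own, but with many nested conditions they may give up, which is when doing it by hand becomes worthwhile.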

▼Stylizer

[ 0 / 9 ] Compilation options are not available

Compilation options are an important optimization lever, but ONE-View is not able to analyze them.

[ 4 / 4 ] Application profile is long enough (139.54 s)

To have good quality measurements, it is advised that the application profiling time is greater than 10 seconds.

[ 2 / 3 ] Security settings from the host restrict profiling. Some metrics will be missing or incomplete.

The current value of kernel.perf_event_paranoid is 2. If possible, set it to 1, or check with your system administrator which setting can be used to achieve this.

[ 2 / 2 ] Application is correctly profiled ("Others" category represents 0.01 % of the execution time)

For a representative profile, the "Others" category should represent less than 20% of the execution time, so that as much of the user code as possible is analyzed.

[ 1 / 1 ] Lstopo present. The Topology lstopo report will be generated.

▶Strategizer

[ 4 / 4 ] Enough of the experiment time is spent in analyzed loops (99.04%)

If the time spent in analyzed loops is less than 30%, standard loop optimizations will have a limited impact on application performance.

[ 4 / 4 ] Threads activity is good

On average, more than 396.09% of observed threads are actually active

[ 4 / 4 ] CPU activity is good

CPU cores are active 99.59% of the time

[ 4 / 4 ] Loop profile is not flat

At least one loop coverage is greater than 4% (96.51%), representing a hotspot for the application

[ 4 / 4 ] Enough of the experiment time is spent in analyzed innermost loops (97.87%)

If the time spent in analyzed innermost loops is less than 15%, standard innermost loop optimizations such as vectorisation will have a limited impact on application performance.

[ 4 / 4 ] Affinity is good (99.80%)

Threads are not migrating between CPU cores: they are probably successfully pinned

[ 3 / 3 ] Less than 10% (0.00%) is spent in BLAS1 operations

It could be more efficient to inline BLAS1 operations by hand

[ 3 / 3 ] Functions mostly use all threads

Functions running on a reduced number of threads (typically sequential code) cover less than 10% of application walltime (0.38%)

[ 3 / 3 ] Cumulative Outermost/In between loops coverage (1.17%) lower than cumulative innermost loop coverage (97.87%)

Having cumulative Outermost/In between loops coverage greater than cumulative innermost loop coverage will make loop optimization more complex

[ 2 / 2 ] Less than 10% (0.00%) is spent in BLAS2 operations

BLAS2 calls usually make poor use of the cache and could benefit from inlining.

[ 2 / 2 ] Less than 10% (0.16%) is spent in Libm/SVML (special functions)

▼Optimizer

Loop ID	Analysis	Penalty Score
►Loop 1828 - libggml-cpu.so+	Execution Time: 96 % - Vectorization Ratio: 8.82 % - Vector Length Use: 21.32 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Data Access Issues+		16
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 8 issues ( = data accesses) costing 2 point each.	16
►Vectorization Roadblocks+		16
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 8 issues ( = data accesses) costing 2 point each.	16
►Loop 1403 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 15.48 % - Vector Length Use: 31.37 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Control Flow Issues+		488
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (481 paths) - Simplify control structure. There are 481 issues ( = paths) costing 1 point each with a malus of 4 points.	485
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		488
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (481 paths) - Simplify control structure. There are 481 issues ( = paths) costing 1 point each with a malus of 4 points.	485
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 51 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 24.01 %
►Control Flow Issues+		52
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (45 paths) - Simplify control structure. There are 45 issues ( = paths) costing 1 point each with a malus of 4 points.	49
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		52
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (45 paths) - Simplify control structure. There are 45 issues ( = paths) costing 1 point each with a malus of 4 points.	49
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 1803 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 15.24 % - Vector Length Use: 14.64 %
►Loop Computation Issues+		6
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
○Control Flow Issues		0
►Vectorization Roadblocks+		1000
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 752 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 88.89 % - Vector Length Use: 97.32 %
►Loop Computation Issues+		4
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
►Data Access Issues+		6
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 3 issues ( = data accesses) costing 2 point each.	6
►Vectorization Roadblocks+		6
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 3 issues ( = data accesses) costing 2 point each.	6
►Loop 1 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 9.23 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Loop 45 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 23.99 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1001
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 1824 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 23.16 % - Vector Length Use: 36.18 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Control Flow Issues+		2
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1002
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 1405 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Data Access Issues+		32
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 16 issues ( = data accesses) costing 2 point each.	32
►Vectorization Roadblocks+		32
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 16 issues ( = data accesses) costing 2 point each.	32
►Loop 53 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 22.40 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
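
Loops 1803 and 752 above are flagged for "expensive FP instructions", with hoisting listed as the first remedy. The report does not say which instruction is involved, so the sketch below assumes a division by a value that stays constant across the loop; the function names are hypothetical. The reciprocal is computed once outside the loop and each iteration then uses a cheaper multiplication.

```c
#include <stddef.h>

/* Before: one floating-point division per element, although
   `scale` never changes inside the loop. */
void rescale_naive(float *v, size_t n, float scale)
{
    for (size_t i = 0; i < n; ++i)
        v[i] = v[i] / scale;
}

/* After: the expensive operation is hoisted out of the loop.
   Note that multiplying by a reciprocal can differ from a true
   division in the last bit, so this changes results slightly
   unless `scale` is a power of two. */
void rescale_hoisted(float *v, size_t n, float scale)
{
    const float inv = 1.0f / scale;   /* computed once */
    for (size_t i = 0; i < n; ++i)
        v[i] *= inv;
}
```

Because of the rounding difference, compilers only perform this rewrite automatically under relaxed floating-point settings, so doing it by hand is a deliberate accuracy trade-off.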

▼Stylizer

[ 0 / 9 ] Compilation options are not available

Compilation options are an important optimization lever, but ONE-View is not able to analyze them.

[ 4 / 4 ] Application profile is long enough (70.65 s)

To have good quality measurements, it is advised that the application profiling time is greater than 10 seconds.

[ 2 / 3 ] Security settings from the host restrict profiling. Some metrics will be missing or incomplete.

The current value of kernel.perf_event_paranoid is 2. If possible, set it to 1, or check with your system administrator which setting can be used to achieve this.

[ 2 / 2 ] Application is correctly profiled ("Others" category represents 0.01 % of the execution time)

For a representative profile, the "Others" category should represent less than 20% of the execution time, so that as much of the user code as possible is analyzed.

[ 1 / 1 ] Lstopo present. The Topology lstopo report will be generated.

▶Strategizer

[ 4 / 4 ] Enough time of the experiment time spent in analyzed loops (98.46%)

If the time spent in analyzed loops is less than 30%, standard loop optimizations will have a limited impact on application performances.

[ 4 / 4 ] Threads activity is good

On average, more than 783.68% of observed threads are actually active

[ 4 / 4 ] CPU activity is good

CPU cores are active 99.24% of time

[ 4 / 4 ] Loop profile is not flat

At least one loop coverage is greater than 4% (95.83%), representing an hotspot for the application

[ 4 / 4 ] Enough time of the experiment time spent in analyzed innermost loops (97.26%)

If the time spent in analyzed innermost loops is less than 15%, standard innermost loop optimizations such as vectorisation will have a limited impact on application performances.

[ 4 / 4 ] Affinity is good (99.56%)

Threads are not migrating between CPU cores; they are probably pinned successfully.

[ 3 / 3 ] Less than 10% (0.00%) is spent in BLAS1 operations

It could be more efficient to inline BLAS1 operations by hand.

[ 3 / 3 ] Functions mostly use all threads

Functions running on a reduced number of threads (typically sequential code) cover less than 10% of application walltime (0.85%)

[ 3 / 3 ] Cumulative Outermost/In between loops coverage (1.21%) lower than cumulative innermost loop coverage (97.26%)

Having cumulative Outermost/In between loops coverage greater than cumulative innermost loop coverage will make loop optimization more complex

[ 2 / 2 ] Less than 10% (0.00%) is spent in BLAS2 operations

BLAS2 calls often make poor use of the cache and could benefit from inlining.

[ 2 / 2 ] Less than 10% (0.16%) is spent in Libm/SVML (special functions)

▼Optimizer

Loop ID	Analysis	Penalty Score
►Loop 1828 - libggml-cpu.so+	Execution Time: 95 % - Vectorization Ratio: 8.82 % - Vector Length Use: 21.32 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Data Access Issues+		16
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 8 issues ( = data accesses) costing 2 point each.	16
►Vectorization Roadblocks+		16
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 8 issues ( = data accesses) costing 2 point each.	16
►Loop 1403 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 15.48 % - Vector Length Use: 31.37 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Control Flow Issues+		488
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (481 paths) - Simplify control structure. There are 481 issues ( = paths) costing 1 point each with a malus of 4 points.	485
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		488
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (481 paths) - Simplify control structure. There are 481 issues ( = paths) costing 1 point each with a malus of 4 points.	485
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 51 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 24.01 %
►Control Flow Issues+		52
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (45 paths) - Simplify control structure. There are 45 issues ( = paths) costing 1 point each with a malus of 4 points.	49
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		52
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (45 paths) - Simplify control structure. There are 45 issues ( = paths) costing 1 point each with a malus of 4 points.	49
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 1803 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 15.24 % - Vector Length Use: 14.64 %
►Loop Computation Issues+		6
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
○Control Flow Issues		0
►Vectorization Roadblocks+		1000
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 1 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 9.23 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Loop 752 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 88.89 % - Vector Length Use: 97.32 %
►Loop Computation Issues+		4
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
►Data Access Issues+		6
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 3 issues ( = data accesses) costing 2 point each.	6
►Vectorization Roadblocks+		6
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 3 issues ( = data accesses) costing 2 point each.	6
►Loop 45 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 23.99 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1001
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 1824 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 23.16 % - Vector Length Use: 36.18 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Control Flow Issues+		2
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1002
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 1405 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Data Access Issues+		32
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 16 issues ( = data accesses) costing 2 point each.	32
►Vectorization Roadblocks+		32
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 16 issues ( = data accesses) costing 2 point each.	32
►Loop 1802 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 64.09 % - Vector Length Use: 42.49 %
►Loop Computation Issues+		4
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
►Control Flow Issues+		4
○	[SA] Several paths (2 paths) - Simplify control structure or force the compiler to use masked instructions. There are 2 issues ( = paths) costing 1 point each.	2
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		4
○	[SA] Several paths (2 paths) - Simplify control structure or force the compiler to use masked instructions. There are 2 issues ( = paths) costing 1 point each.	2
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
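
The "constant non unit stride data access" issue reported for the hotspot loop suggests array restructuring as one remedy. The sketch below is a generic illustration of that idea and is not taken from libggml: the struct names and the field layout are hypothetical. It contrasts an array-of-structures layout, where reading one field steps over the others, with a structure-of-arrays layout, where the same field is contiguous.

```c
#include <stddef.h>

/* Hypothetical array-of-structures layout: reading only `x` touches memory
   with a constant stride of four floats instead of one. */
struct point_aos { float x, y, z, w; };

float sum_x_aos(const struct point_aos *p, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += p[i].x;   /* constant non-unit stride: 4 floats apart */
    return s;
}

/* Restructured structure-of-arrays layout: each field is stored contiguously. */
struct points_soa { const float *x, *y, *z, *w; };

float sum_x_soa(const struct points_soa *p, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += p->x[i];  /* unit stride: consecutive floats */
    return s;
}
```

Both functions return the same sum; only the memory access pattern differs. Whether such a restructuring is practical for the loops listed above depends on the data formats used by the library, which this report does not show.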

▼Stylizer

[ 0 / 9 ] Compilation options are not available

Compilation options are an important optimization lever, but ONE-View could not analyze them.

[ 4 / 4 ] Application profile is long enough (36.56 s)

To have good quality measurements, it is advised that the application profiling time is greater than 10 seconds.

[ 2 / 3 ] Security settings from the host restrict profiling. Some metrics will be missing or incomplete.

Current value for kernel.perf_event_paranoid is 2. If possible, set it to 1 or check with your system administrator which flag can be used to achieve this.

[ 2 / 2 ] Application is correctly profiled ("Others" category represents 0.01 % of the execution time)

For representative profiling, the "Others" category should account for less than 20% of the execution time, so that as much of the user code as possible is analyzed.

[ 1 / 1 ] Lstopo present. The Topology lstopo report will be generated.

▶Strategizer

[ 4 / 4 ] Enough time of the experiment time spent in analyzed loops (97.10%)

If the time spent in analyzed loops is less than 30%, standard loop optimizations will have a limited impact on application performance.

[ 4 / 4 ] Threads activity is good

On average, more than 1530.69% of observed threads are actually active

[ 4 / 4 ] CPU activity is good

CPU cores are active 98.27% of time

[ 4 / 4 ] Loop profile is not flat

At least one loop coverage is greater than 4% (94.35%), representing a hotspot for the application

[ 4 / 4 ] Enough time of the experiment time spent in analyzed innermost loops (95.81%)

If the time spent in analyzed innermost loops is less than 15%, standard innermost loop optimizations such as vectorisation will have a limited impact on application performance.

[ 4 / 4 ] Affinity is good (99.10%)

Threads are not migrating between CPU cores; they are probably pinned successfully.

[ 3 / 3 ] Less than 10% (0.00%) is spent in BLAS1 operations

It could be more efficient to inline BLAS1 operations by hand.

[ 3 / 3 ] Functions mostly use all threads

Functions running on a reduced number of threads (typically sequential code) cover less than 10% of application walltime (1.85%)

[ 3 / 3 ] Cumulative Outermost/In between loops coverage (1.29%) lower than cumulative innermost loop coverage (95.81%)

Having cumulative Outermost/In between loops coverage greater than cumulative innermost loop coverage will make loop optimization more complex

[ 2 / 2 ] Less than 10% (0.00%) is spent in BLAS2 operations

BLAS2 calls often make poor use of the cache and could benefit from inlining.

[ 2 / 2 ] Less than 10% (0.17%) is spent in Libm/SVML (special functions)

▼Optimizer

Loop ID	Analysis	Penalty Score
►Loop 1828 - libggml-cpu.so+	Execution Time: 94 % - Vectorization Ratio: 8.82 % - Vector Length Use: 21.32 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Data Access Issues+		16
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 8 issues ( = data accesses) costing 2 point each.	16
►Vectorization Roadblocks+		16
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 8 issues ( = data accesses) costing 2 point each.	16
►Loop 1403 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 15.48 % - Vector Length Use: 31.37 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Control Flow Issues+		488
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (481 paths) - Simplify control structure. There are 481 issues ( = paths) costing 1 point each with a malus of 4 points.	485
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		488
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (481 paths) - Simplify control structure. There are 481 issues ( = paths) costing 1 point each with a malus of 4 points.	485
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 51 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 24.01 %
►Control Flow Issues+		52
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (45 paths) - Simplify control structure. There are 45 issues ( = paths) costing 1 point each with a malus of 4 points.	49
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		52
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (45 paths) - Simplify control structure. There are 45 issues ( = paths) costing 1 point each with a malus of 4 points.	49
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 1803 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 15.24 % - Vector Length Use: 14.64 %
►Loop Computation Issues+		6
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
○Control Flow Issues		0
►Vectorization Roadblocks+		1000
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 1 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 9.23 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Loop 1802 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 64.09 % - Vector Length Use: 42.49 %
►Loop Computation Issues+		4
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
►Control Flow Issues+		4
○	[SA] Several paths (2 paths) - Simplify control structure or force the compiler to use masked instructions. There are 2 issues ( = paths) costing 1 point each.	2
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		4
○	[SA] Several paths (2 paths) - Simplify control structure or force the compiler to use masked instructions. There are 2 issues ( = paths) costing 1 point each.	2
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 752 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 88.89 % - Vector Length Use: 97.32 %
►Loop Computation Issues+		4
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
►Data Access Issues+		6
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 3 issues ( = data accesses) costing 2 point each.	6
►Vectorization Roadblocks+		6
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 3 issues ( = data accesses) costing 2 point each.	6
►Loop 45 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 23.99 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1001
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 1824 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 23.16 % - Vector Length Use: 36.18 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Control Flow Issues+		2
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1002
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 749 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Data Access Issues+		32
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 16 issues ( = data accesses) costing 2 point each.	32
►Vectorization Roadblocks+		32
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 16 issues ( = data accesses) costing 2 point each.	32
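
For the FMA issue reported above ("Reorganize arithmetic expressions to exhibit potential for FMA"), the following is a small, generic sketch of what such a reorganization can look like. It is not code from libggml; the function names are hypothetical and a quadratic polynomial is used only as an example. Written in Horner form, the expression becomes two multiply-then-add pairs, which a compiler may contract into fused multiply-add instructions when the target and the floating-point contraction settings allow it.

```c
/* Naive form: three multiplications and two additions. */
float poly_naive(float a, float b, float c, float x) {
    return a * x * x + b * x + c;
}

/* Horner form: two multiply-then-add pairs, each a candidate for one FMA. */
float poly_horner(float a, float b, float c, float x) {
    return (a * x + b) * x + c;
}
```

The two forms are algebraically equal, but a fused operation rounds once instead of twice, so results can differ in the last bits for general inputs. Checking the generated assembly is the only reliable way to confirm that FMA instructions were actually emitted.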

▼Stylizer

[ 0 / 9 ] Compilation options are not available

Compilation options are an important optimization lever, but ONE-View could not analyze them.

[ 4 / 4 ] Application profile is long enough (26.40 s)

To have good quality measurements, it is advised that the application profiling time is greater than 10 seconds.

[ 2 / 3 ] Security settings from the host restrict profiling. Some metrics will be missing or incomplete.

Current value for kernel.perf_event_paranoid is 2. If possible, set it to 1 or check with your system administrator which flag can be used to achieve this.

[ 2 / 2 ] Application is correctly profiled ("Others" category represents 0.01 % of the execution time)

For representative profiling, the "Others" category should account for less than 20% of the execution time, so that as much of the user code as possible is analyzed.

[ 1 / 1 ] Lstopo present. The Topology lstopo report will be generated.

▶Strategizer

[ 4 / 4 ] Enough time of the experiment time spent in analyzed loops (94.03%)

If the time spent in analyzed loops is less than 30%, standard loop optimizations will have a limited impact on application performance.

[ 4 / 4 ] Threads activity is good

On average, more than 2247.67% of observed threads are actually active

[ 4 / 4 ] CPU activity is good

CPU cores are active 97.24% of time

[ 4 / 4 ] Loop profile is not flat

At least one loop coverage is greater than 4% (85.83%), representing a hotspot for the application

[ 4 / 4 ] Enough time of the experiment time spent in analyzed innermost loops (92.73%)

If the time spent in analyzed innermost loops is less than 15%, standard innermost loop optimizations such as vectorisation will have a limited impact on application performance.

[ 4 / 4 ] Affinity is good (98.74%)

Threads are not migrating between CPU cores; they are probably pinned successfully.

[ 3 / 3 ] Less than 10% (0.00%) is spent in BLAS1 operations

It could be more efficient to inline BLAS1 operations by hand.

[ 3 / 3 ] Functions mostly use all threads

Functions running on a reduced number of threads (typically sequential code) cover less than 10% of application walltime (3.85%)

[ 3 / 3 ] Cumulative Outermost/In between loops coverage (1.31%) lower than cumulative innermost loop coverage (92.73%)

Having cumulative Outermost/In between loops coverage greater than cumulative innermost loop coverage will make loop optimization more complex

[ 2 / 2 ] Less than 10% (0.00%) is spent in BLAS2 operations

BLAS2 calls often make poor use of the cache and could benefit from inlining.

[ 2 / 2 ] Less than 10% (0.21%) is spent in Libm/SVML (special functions)

▼Optimizer

Loop ID	Analysis	Penalty Score
►Loop 1828 - libggml-cpu.so+	Execution Time: 85 % - Vectorization Ratio: 8.82 % - Vector Length Use: 21.32 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Data Access Issues+		16
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 8 issues ( = data accesses) costing 2 point each.	16
►Vectorization Roadblocks+		16
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 8 issues ( = data accesses) costing 2 point each.	16
►Loop 1829 - libggml-cpu.so+	Execution Time: 5 % - Vectorization Ratio: 46.15 % - Vector Length Use: 60.70 %
►Data Access Issues+		16
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 8 issues ( = data accesses) costing 2 point each.	16
►Vectorization Roadblocks+		16
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 8 issues ( = data accesses) costing 2 point each.	16
►Loop 1403 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 15.48 % - Vector Length Use: 31.37 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Control Flow Issues+		488
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (481 paths) - Simplify control structure. There are 481 issues ( = paths) costing 1 point each with a malus of 4 points.	485
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		488
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (481 paths) - Simplify control structure. There are 481 issues ( = paths) costing 1 point each with a malus of 4 points.	485
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 51 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 24.01 %
►Control Flow Issues+		52
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (45 paths) - Simplify control structure. There are 45 issues ( = paths) costing 1 point each with a malus of 4 points.	49
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		52
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (45 paths) - Simplify control structure. There are 45 issues ( = paths) costing 1 point each with a malus of 4 points.	49
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 1803 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 15.24 % - Vector Length Use: 14.64 %
►Loop Computation Issues+		6
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
○Control Flow Issues		0
►Vectorization Roadblocks+		1000
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 1824 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 23.16 % - Vector Length Use: 36.18 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Control Flow Issues+		2
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		1002
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 1 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 9.23 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Loop 1802 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 64.09 % - Vector Length Use: 42.49 %
►Loop Computation Issues+		4
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
►Control Flow Issues+		4
○	[SA] Several paths (2 paths) - Simplify control structure or force the compiler to use masked instructions. There are 2 issues ( = paths) costing 1 point each.	2
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		4
○	[SA] Several paths (2 paths) - Simplify control structure or force the compiler to use masked instructions. There are 2 issues ( = paths) costing 1 point each.	2
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 752 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 88.89 % - Vector Length Use: 97.32 %
►Loop Computation Issues+		4
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
►Data Access Issues+		6
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 3 issues ( = data accesses) costing 2 point each.	6
►Vectorization Roadblocks+		6
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 3 issues ( = data accesses) costing 2 point each.	6
►Loop 45 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 23.99 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1001
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
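
The penalty scores in the table above can be checked by hand. The sketch below reproduces the arithmetic as it can be inferred from the rows themselves (one point per path plus the stated malus, and a category total equal to the sum of its rows). It is not MAQAO's documented formula, and the function names are hypothetical.

```c
/* Score of a "Too many paths" row, as inferred from the table:
   one point per path plus a fixed malus. */
int too_many_paths_score(int paths, int malus) {
    return paths + malus;
}

/* Score of a category (e.g. "Control Flow Issues"): sum of its row scores. */
int category_score(const int *row_scores, int n) {
    int total = 0;
    for (int i = 0; i < n; i++)
        total += row_scores[i];
    return total;
}
```

For Loop 1403, 481 paths with a malus of 4 give 485, and adding the call row (1) and the non-innermost-loop row (2) gives the category total of 488. For Loop 51, 45 paths give 49 and a category total of 52.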

▼Stylizer

[ 0 / 9 ] Compilation options are not available

Compilation options are an important optimization lever, but ONE-View could not analyze them.

[ 4 / 4 ] Application profile is long enough (19.98 s)

To have good quality measurements, it is advised that the application profiling time is greater than 10 seconds.

[ 2 / 3 ] Security settings from the host restrict profiling. Some metrics will be missing or incomplete.

Current value for kernel.perf_event_paranoid is 2. If possible, set it to 1 or check with your system administrator which flag can be used to achieve this.

[ 2 / 2 ] Application is correctly profiled ("Others" category represents 0.01 % of the execution time)

For representative profiling, the "Others" category should account for less than 20% of the execution time, so that as much of the user code as possible is analyzed.

[ 1 / 1 ] Lstopo present. The Topology lstopo report will be generated.

▶Strategizer

[ 4 / 4 ] Enough time of the experiment time spent in analyzed loops (94.25%)

If the time spent in analyzed loops is less than 30%, standard loop optimizations will have a limited impact on application performance.

[ 4 / 4 ] Threads activity is good

On average, more than 2927.77% of observed threads are actually active

[ 4 / 4 ] CPU activity is good

CPU cores are active 96.15% of time

[ 4 / 4 ] Loop profile is not flat

At least one loop coverage is greater than 4% (91.37%), representing a hotspot for the application

[ 4 / 4 ] Enough time of the experiment time spent in analyzed innermost loops (93.06%)

If the time spent in analyzed innermost loops is less than 15%, standard innermost loop optimizations such as vectorisation will have a limited impact on application performance.

[ 4 / 4 ] Affinity is good (98.29%)

Threads are not migrating between CPU cores; they are probably pinned successfully.

[ 3 / 3 ] Less than 10% (0.00%) is spent in BLAS1 operations

It could be more efficient to inline BLAS1 operations by hand.

[ 3 / 3 ] Functions mostly use all threads

Functions running on a reduced number of threads (typically sequential code) cover less than 10% of application walltime (4.72%)

[ 3 / 3 ] Cumulative Outermost/In between loops coverage (1.19%) lower than cumulative innermost loop coverage (93.06%)

Having cumulative Outermost/In between loops coverage greater than cumulative innermost loop coverage will make loop optimization more complex

[ 2 / 2 ] Less than 10% (0.00%) is spent in BLAS2 operations

BLAS2 calls often make poor use of the cache and could benefit from inlining.

[ 2 / 2 ] Less than 10% (0.20%) is spent in Libm/SVML (special functions)

▼Optimizer

Loop ID	Analysis	Penalty Score
►Loop 1828 - libggml-cpu.so+	Execution Time: 91 % - Vectorization Ratio: 8.82 % - Vector Length Use: 21.32 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Data Access Issues+		16
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 8 issues ( = data accesses) costing 2 point each.	16
►Vectorization Roadblocks+		16
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 8 issues ( = data accesses) costing 2 point each.	16
►Loop 1403 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 15.48 % - Vector Length Use: 31.37 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Control Flow Issues+		488
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (481 paths) - Simplify control structure. There are 481 issues ( = paths) costing 1 point each with a malus of 4 points.	485
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		488
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (481 paths) - Simplify control structure. There are 481 issues ( = paths) costing 1 point each with a malus of 4 points.	485
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 51 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 24.01 %
►Control Flow Issues+		52
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (45 paths) - Simplify control structure. There are 45 issues ( = paths) costing 1 point each with a malus of 4 points.	49
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		52
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (45 paths) - Simplify control structure. There are 45 issues ( = paths) costing 1 point each with a malus of 4 points.	49
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 1803 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 15.24 % - Vector Length Use: 14.64 %
►Loop Computation Issues+		6
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
○Control Flow Issues		0
►Vectorization Roadblocks+		1000
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 1802 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 64.09 % - Vector Length Use: 42.49 %
►Loop Computation Issues+		4
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
►Control Flow Issues+		4
○	[SA] Several paths (2 paths) - Simplify control structure or force the compiler to use masked instructions. There are 2 issues ( = paths) costing 1 point each.	2
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		4
○	[SA] Several paths (2 paths) - Simplify control structure or force the compiler to use masked instructions. There are 2 issues ( = paths) costing 1 point each.	2
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 787 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 14.58 %
►Loop Computation Issues+		12
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 2 issues (= instructions) costing 4 points each.	8
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Data Access Issues+		8
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each.	4
○	[SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 1 issues ( = indirect data accesses) costing 4 point each.	4
►Vectorization Roadblocks+		9
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each.	4
○	[SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 1 issues ( = indirect data accesses) costing 4 point each.	4
►Loop 1 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 9.23 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Loop 1405 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Data Access Issues+		32
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 16 issues ( = data accesses) costing 2 point each.	32
►Vectorization Roadblocks+		32
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 16 issues ( = data accesses) costing 2 point each.	32
►Loop 752 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 88.89 % - Vector Length Use: 97.32 %
►Loop Computation Issues+		4
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
►Data Access Issues+		6
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 3 issues ( = data accesses) costing 2 point each.	6
►Vectorization Roadblocks+		6
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 3 issues ( = data accesses) costing 2 point each.	6
►Loop 45 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 23.99 %
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1001
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
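
The penalty scores in the Optimizer table above appear to follow simple arithmetic: each issue contributes its occurrence count times the points per occurrence, plus a fixed malus where the report mentions one, and a category score is the sum of its issues. The sketch below reproduces the figures reported for Loop 1403 and Loop 1828 under that assumption. The `Issue` type and `category_score` helper are hypothetical names used for illustration only, not part of MAQAO.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    """One Optimizer issue row (hypothetical structure, for illustration only)."""
    count: int        # number of occurrences (paths, calls, data accesses)
    points_each: int  # points charged per occurrence
    malus: int = 0    # fixed extra penalty, when the report mentions one

    def score(self) -> int:
        return self.count * self.points_each + self.malus


def category_score(issues: list[Issue]) -> int:
    """Sum the issue scores of one category."""
    return sum(issue.score() for issue in issues)


# Loop 1403, Control Flow Issues, as listed in the report.
loop_1403_control_flow = [
    Issue(count=1, points_each=1),             # presence of calls
    Issue(count=481, points_each=1, malus=4),  # too many paths
    Issue(count=1, points_each=2),             # non innermost loop
]

# Loop 1828, Data Access Issues: 8 non-unit-stride accesses at 2 points each.
loop_1828_data_access = [Issue(count=8, points_each=2)]

if __name__ == "__main__":
    print(category_score(loop_1403_control_flow))  # 488
    print(category_score(loop_1828_data_access))   # 16
```

Rows reported as "at least 1000 paths" look capped at 1000 points and carry no malus, so they would need separate handling in a fuller model.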

▼Stylizer

[ 0 / 9 ] Compilation options are not available

Compilation options are an important optimization lever, but ONE-View is not able to analyze them.

[ 4 / 4 ] Application profile is long enough (17.27 s)

To have good quality measurements, it is advised that the application profiling time is greater than 10 seconds.

[ 2 / 3 ] Security settings from the host restrict profiling. Some metrics will be missing or incomplete.

Current value for kernel.perf_event_paranoid is 2. If possible, set it to 1 or check with your system administrator which flag can be used to achieve this.

[ 2 / 2 ] Application is correctly profiled ("Others" category represents 0.01 % of the execution time)

For a representative profile, the "Others" category should represent less than 20% of the execution time, so that as much of the user code as possible is analyzed.

[ 1 / 1 ] Lstopo present. The Topology lstopo report will be generated.
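
The security-settings check in the Stylizer block above refers to the Linux `kernel.perf_event_paranoid` sysctl. A minimal sketch for inspecting it before a profiling run is shown below. It assumes a Linux host that exposes the value at `/proc/sys/kernel/perf_event_paranoid`; the helper names are hypothetical, and the threshold of 1 is taken from the advice in the report.

```python
from pathlib import Path
from typing import Optional

# Usual procfs location of the sysctl on Linux (assumed to be present).
PARANOID_PATH = Path("/proc/sys/kernel/perf_event_paranoid")


def read_paranoid_level(path: Path = PARANOID_PATH) -> Optional[int]:
    """Return the current level, or None if the file cannot be read or parsed."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def restricts_profiling(level: int, max_allowed: int = 1) -> bool:
    """True when the level is above the value the report recommends (1)."""
    return level > max_allowed


if __name__ == "__main__":
    level = read_paranoid_level()
    if level is None:
        print("perf_event_paranoid is not readable on this system")
    elif restricts_profiling(level):
        print(f"level {level}: some metrics may be missing or incomplete")
    else:
        print(f"level {level}: no restriction expected from this setting")
```

Changing the value typically requires administrator rights (for example through `sysctl`), which is why the report suggests checking with the system administrator.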

▶Strategizer

[ 4 / 4 ] Enough of the experiment time is spent in analyzed loops (89.59%)

If the time spent in analyzed loops is less than 30%, standard loop optimizations will have a limited impact on application performance.

[ 4 / 4 ] Threads activity is good

On average, more than 3576.47% of observed threads are actually active

[ 4 / 4 ] CPU activity is good

CPU cores are active 94.60% of time

[ 4 / 4 ] Loop profile is not flat

At least one loop coverage is greater than 4% (86.81%), representing a hotspot for the application

[ 4 / 4 ] Enough of the experiment time is spent in analyzed innermost loops (88.40%)

If the time spent in analyzed innermost loops is less than 15%, standard innermost loop optimizations such as vectorisation will have a limited impact on application performance.

[ 4 / 4 ] Affinity is good (98.05%)

Threads are not migrating between CPU cores: probably successfully pinned

[ 3 / 3 ] Less than 10% (0.00%) is spent in BLAS1 operations

It could be more efficient to inline BLAS1 operations by hand

[ 3 / 3 ] Functions mostly use all threads

Functions running on a reduced number of threads (typically sequential code) cover less than 10% of application walltime (5.87%)

[ 3 / 3 ] Cumulative Outermost/In between loops coverage (1.19%) lower than cumulative innermost loop coverage (88.40%)

Having cumulative Outermost/In between loops coverage greater than cumulative innermost loop coverage will make loop optimization more complex

[ 2 / 2 ] Less than 10% (0.00%) is spent in BLAS2 operations

BLAS2 calls often make poor use of the cache and could benefit from inlining.

[ 2 / 2 ] Less than 10% (0.19%) is spent in Libm/SVML (special functions)
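
The Strategizer checks above compare measured percentages against fixed thresholds quoted in the report text: 30% for analyzed loops, 15% for innermost loops, 4% for the hottest loop, and 10% for code running on a reduced number of threads. The sketch below applies those thresholds to the values of this run. The function and key names are hypothetical, and the pass/fail logic is an assumption inferred from the wording, not MAQAO's implementation.

```python
def strategizer_checks(loops_pct, innermost_pct, hottest_loop_pct, reduced_threads_pct):
    """Apply the thresholds quoted in the report; returns {check name: passed}."""
    return {
        "enough_time_in_loops": loops_pct >= 30.0,
        "enough_time_in_innermost_loops": innermost_pct >= 15.0,
        "loop_profile_not_flat": hottest_loop_pct > 4.0,
        "functions_mostly_use_all_threads": reduced_threads_pct < 10.0,
    }


if __name__ == "__main__":
    # Values reported for this run.
    results = strategizer_checks(
        loops_pct=89.59,
        innermost_pct=88.40,
        hottest_loop_pct=86.81,
        reduced_threads_pct=5.87,
    )
    for name, passed in results.items():
        print(f"{name}: {'ok' if passed else 'needs attention'}")
```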

▼Optimizer

Loop ID	Analysis	Penalty Score
►Loop 1828 - libggml-cpu.so+	Execution Time: 86 % - Vectorization Ratio: 8.82 % - Vector Length Use: 21.32 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Data Access Issues+		16
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 8 issues ( = data accesses) costing 2 point each.	16
►Vectorization Roadblocks+		16
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 8 issues ( = data accesses) costing 2 point each.	16
►Loop 1403 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 15.48 % - Vector Length Use: 31.37 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Control Flow Issues+		488
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (481 paths) - Simplify control structure. There are 481 issues ( = paths) costing 1 point each with a malus of 4 points.	485
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		488
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (481 paths) - Simplify control structure. There are 481 issues ( = paths) costing 1 point each with a malus of 4 points.	485
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 51 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 24.01 %
►Control Flow Issues+		52
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (45 paths) - Simplify control structure. There are 45 issues ( = paths) costing 1 point each with a malus of 4 points.	49
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		52
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (45 paths) - Simplify control structure. There are 45 issues ( = paths) costing 1 point each with a malus of 4 points.	49
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 1802 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 64.09 % - Vector Length Use: 42.49 %
►Loop Computation Issues+		4
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
►Control Flow Issues+		4
○	[SA] Several paths (2 paths) - Simplify control structure or force the compiler to use masked instructions. There are 2 issues ( = paths) costing 1 point each.	2
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		4
○	[SA] Several paths (2 paths) - Simplify control structure or force the compiler to use masked instructions. There are 2 issues ( = paths) costing 1 point each.	2
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 1803 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 15.24 % - Vector Length Use: 14.64 %
►Loop Computation Issues+		6
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
○Control Flow Issues		0
►Vectorization Roadblocks+		1000
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 787 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 14.58 %
►Loop Computation Issues+		12
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 2 issues (= instructions) costing 4 points each.	8
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Data Access Issues+		8
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each.	4
○	[SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 1 issues ( = indirect data accesses) costing 4 point each.	4
►Vectorization Roadblocks+		9
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each.	4
○	[SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 1 issues ( = indirect data accesses) costing 4 point each.	4
►Loop 752 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 88.89 % - Vector Length Use: 97.32 %
►Loop Computation Issues+		4
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
►Data Access Issues+		6
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 3 issues ( = data accesses) costing 2 point each.	6
►Vectorization Roadblocks+		6
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 3 issues ( = data accesses) costing 2 point each.	6
►Loop 1 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 9.23 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Loop 1405 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Data Access Issues+		32
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 16 issues ( = data accesses) costing 2 point each.	32
►Vectorization Roadblocks+		32
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 16 issues ( = data accesses) costing 2 point each.	32
►Loop 749 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Data Access Issues+		32
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 16 issues ( = data accesses) costing 2 point each.	32
►Vectorization Roadblocks+		32
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 16 issues ( = data accesses) costing 2 point each.	32

▼Stylizer

[ 0 / 9 ] Compilation options are not available

Compilation options are an important optimization lever, but ONE-View is not able to analyze them.

[ 4 / 4 ] Application profile is long enough (15.24 s)

To have good quality measurements, it is advised that the application profiling time is greater than 10 seconds.

[ 2 / 3 ] Security settings from the host restrict profiling. Some metrics will be missing or incomplete.

Current value for kernel.perf_event_paranoid is 2. If possible, set it to 1 or check with your system administrator which flag can be used to achieve this.

[ 2 / 2 ] Application is correctly profiled ("Others" category represents 0.01 % of the execution time)

For a representative profile, the "Others" category should represent less than 20% of the execution time, so that as much of the user code as possible is analyzed.

[ 1 / 1 ] Lstopo present. The Topology lstopo report will be generated.

▶Strategizer

[ 4 / 4 ] Enough of the experiment time is spent in analyzed loops (87.51%)

If the time spent in analyzed loops is less than 30%, standard loop optimizations will have a limited impact on application performance.

[ 4 / 4 ] Threads activity is good

On average, more than 4199.32% of observed threads are actually active

[ 4 / 4 ] CPU activity is good

CPU cores are active 93.25% of time

[ 4 / 4 ] Loop profile is not flat

At least one loop coverage is greater than 4% (84.73%), representing a hotspot for the application

[ 4 / 4 ] Enough of the experiment time is spent in analyzed innermost loops (86.37%)

If the time spent in analyzed innermost loops is less than 15%, standard innermost loop optimizations such as vectorisation will have a limited impact on application performance.

[ 4 / 4 ] Affinity is good (97.73%)

Threads are not migrating between CPU cores: probably successfully pinned

[ 3 / 3 ] Less than 10% (0.00%) is spent in BLAS1 operations

It could be more efficient to inline BLAS1 operations by hand

[ 3 / 3 ] Functions mostly use all threads

Functions running on a reduced number of threads (typically sequential code) cover less than 10% of application walltime (8.09%)

[ 3 / 3 ] Cumulative Outermost/In between loops coverage (1.14%) lower than cumulative innermost loop coverage (86.37%)

Having cumulative Outermost/In between loops coverage greater than cumulative innermost loop coverage will make loop optimization more complex

[ 2 / 2 ] Less than 10% (0.00%) is spent in BLAS2 operations

BLAS2 calls often make poor use of the cache and could benefit from inlining.

[ 2 / 2 ] Less than 10% (0.23%) is spent in Libm/SVML (special functions)

▼Optimizer

Loop ID	Analysis	Penalty Score
►Loop 1828 - libggml-cpu.so+	Execution Time: 84 % - Vectorization Ratio: 8.82 % - Vector Length Use: 21.32 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Data Access Issues+		16
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 8 issues ( = data accesses) costing 2 point each.	16
►Vectorization Roadblocks+		16
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 8 issues ( = data accesses) costing 2 point each.	16
►Loop 1403 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 15.48 % - Vector Length Use: 31.37 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Control Flow Issues+		488
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (481 paths) - Simplify control structure. There are 481 issues ( = paths) costing 1 point each with a malus of 4 points.	485
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		488
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (481 paths) - Simplify control structure. There are 481 issues ( = paths) costing 1 point each with a malus of 4 points.	485
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 51 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 24.01 %
►Control Flow Issues+		52
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (45 paths) - Simplify control structure. There are 45 issues ( = paths) costing 1 point each with a malus of 4 points.	49
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		52
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (45 paths) - Simplify control structure. There are 45 issues ( = paths) costing 1 point each with a malus of 4 points.	49
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 1802 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 64.09 % - Vector Length Use: 42.49 %
►Loop Computation Issues+		4
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
►Control Flow Issues+		4
○	[SA] Several paths (2 paths) - Simplify control structure or force the compiler to use masked instructions. There are 2 issues ( = paths) costing 1 point each.	2
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		4
○	[SA] Several paths (2 paths) - Simplify control structure or force the compiler to use masked instructions. There are 2 issues ( = paths) costing 1 point each.	2
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 1803 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 15.24 % - Vector Length Use: 14.64 %
►Loop Computation Issues+		6
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
○Control Flow Issues		0
►Vectorization Roadblocks+		1000
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 787 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 14.58 %
►Loop Computation Issues+		12
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 2 issues (= instructions) costing 4 points each.	8
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Data Access Issues+		8
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each.	4
○	[SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 1 issues ( = indirect data accesses) costing 4 point each.	4
►Vectorization Roadblocks+		9
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each.	4
○	[SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 1 issues ( = indirect data accesses) costing 4 point each.	4
►Loop 1405 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Data Access Issues+		32
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 16 issues ( = data accesses) costing 2 point each.	32
►Vectorization Roadblocks+		32
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 16 issues ( = data accesses) costing 2 point each.	32
►Loop 749 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Data Access Issues+		32
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 16 issues ( = data accesses) costing 2 point each.	32
►Vectorization Roadblocks+		32
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 16 issues ( = data accesses) costing 2 point each.	32
►Loop 752 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 88.89 % - Vector Length Use: 97.32 %
►Loop Computation Issues+		4
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
►Data Access Issues+		6
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 3 issues ( = data accesses) costing 2 point each.	6
►Vectorization Roadblocks+		6
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 3 issues ( = data accesses) costing 2 point each.	6
►Loop 1 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 9.23 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
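
Each loop header row in the Optimizer table above packs three metrics into one line. When comparing runs, it can help to extract them programmatically. The sketch below parses such rows with a regular expression, assuming the text layout shown in this dump (a tab between the module name and the metrics). The pattern and the `parse_loop_headers` helper are illustrative and not a MAQAO interface.

```python
import re

# Matches rows such as:
# "►Loop 1828 - libggml-cpu.so+<TAB>Execution Time: 84 % - Vectorization Ratio: 8.82 % - ..."
LOOP_HEADER = re.compile(
    r"Loop (?P<loop_id>\d+) - (?P<module>\S+?)\+?\s+"
    r"Execution Time: (?P<time>\d+(?:\.\d+)?) % - "
    r"Vectorization Ratio: (?P<vec_ratio>\d+(?:\.\d+)?) % - "
    r"Vector Length Use: (?P<vec_len>\d+(?:\.\d+)?) %"
)


def parse_loop_headers(report_text):
    """Return one dict per loop header row, hottest loop first."""
    loops = []
    for match in LOOP_HEADER.finditer(report_text):
        loops.append({
            "loop_id": int(match["loop_id"]),
            "module": match["module"],
            "time_pct": float(match["time"]),
            "vec_ratio_pct": float(match["vec_ratio"]),
            "vec_len_pct": float(match["vec_len"]),
        })
    return sorted(loops, key=lambda loop: loop["time_pct"], reverse=True)


# Two rows copied from this run's Optimizer table.
SAMPLE = (
    "►Loop 1403 - libggml-cpu.so+\tExecution Time: 0 % - "
    "Vectorization Ratio: 15.48 % - Vector Length Use: 31.37 %\n"
    "►Loop 1828 - libggml-cpu.so+\tExecution Time: 84 % - "
    "Vectorization Ratio: 8.82 % - Vector Length Use: 21.32 %\n"
)

if __name__ == "__main__":
    for loop in parse_loop_headers(SAMPLE):
        print(loop["loop_id"], loop["time_pct"], loop["vec_ratio_pct"])
```

Sorting by execution time puts Loop 1828 first, which matches the report's own ordering of the dominant hotspot.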

▼Stylizer

[ 0 / 9 ] Compilation options are not available

Compilation options are an important optimization lever, but ONE-View is not able to analyze them.

[ 4 / 4 ] Application profile is long enough (14.04 s)

To have good quality measurements, it is advised that the application profiling time is greater than 10 seconds.

[ 2 / 3 ] Security settings from the host restrict profiling. Some metrics will be missing or incomplete.

Current value for kernel.perf_event_paranoid is 2. If possible, set it to 1 or check with your system administrator which flag can be used to achieve this.

[ 2 / 2 ] Application is correctly profiled ("Others" category represents 0.01 % of the execution time)

For representative profiling, the "Others" category should account for less than 20% of the execution time, so that as much of the user code as possible is analyzed.

[ 1 / 1 ] Lstopo present. The Topology lstopo report will be generated.
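The security-settings check above refers to the Linux `kernel.perf_event_paranoid` setting. A minimal sketch for inspecting it before profiling, assuming a Linux host where the setting is exposed at `/proc/sys/kernel/perf_event_paranoid`; the threshold of 1 is taken from the report's recommendation, and the helper names are hypothetical:

```python
from pathlib import Path

# Standard location of the setting on Linux.
PARANOID_PATH = Path("/proc/sys/kernel/perf_event_paranoid")


def read_paranoid_level(path=PARANOID_PATH):
    """Return the current setting as an int, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def restricts_profiling(level, recommended_max=1):
    """True when the level is above the value the report recommends (assumed: 1)."""
    return level is not None and level > recommended_max


if __name__ == "__main__":
    level = read_paranoid_level()
    if level is None:
        print("perf_event_paranoid could not be read on this system")
    elif restricts_profiling(level):
        print(f"perf_event_paranoid={level}: some metrics may be missing")
    else:
        print(f"perf_event_paranoid={level}: no restriction expected")
```

Lowering the value usually requires administrator rights, for example through `sysctl`, so on shared machines the system administrator may need to make the change.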

▶Strategizer

[ 4 / 4 ] Enough of the experiment time is spent in analyzed loops (85.11%)

If the time spent in analyzed loops is less than 30%, standard loop optimizations will have a limited impact on application performance.

[ 4 / 4 ] Threads activity is good

On average, more than 4793.21% of observed threads are actually active

[ 4 / 4 ] CPU activity is good

CPU cores are active 91.66% of time

[ 4 / 4 ] Loop profile is not flat

At least one loop has a coverage greater than 4% (77.18%), making it a hotspot for the application

[ 4 / 4 ] Enough of the experiment time is spent in analyzed innermost loops (83.93%)

If the time spent in analyzed innermost loops is less than 15%, standard innermost loop optimizations such as vectorisation will have a limited impact on application performance.

[ 4 / 4 ] Affinity is good (97.53%)

Threads are not migrating between CPU cores: they are probably successfully pinned

[ 3 / 3 ] Less than 10% (0.00%) is spent in BLAS1 operations

It could be more efficient to inline BLAS1 operations by hand

[ 3 / 3 ] Functions mostly use all threads

Functions running on a reduced number of threads (typically sequential code) cover less than 10% of application walltime (9.00%)

[ 3 / 3 ] Cumulative Outermost/In between loops coverage (1.19%) lower than cumulative innermost loop coverage (83.93%)

Having cumulative Outermost/In between loops coverage greater than cumulative innermost loop coverage will make loop optimization more complex

[ 2 / 2 ] Less than 10% (0.00%) is spent in BLAS2 operations

BLAS2 calls often make poor use of the cache and could benefit from inlining.

[ 2 / 2 ] Less than 10% (0.23%) is spent in Libm/SVML (special functions)

▼Optimizer

Loop ID	Analysis	Penalty Score
►Loop 1828 - libggml-cpu.so+	Execution Time: 77 % - Vectorization Ratio: 8.82 % - Vector Length Use: 21.32 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Data Access Issues+		16
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 8 issues ( = data accesses) costing 2 point each.	16
►Vectorization Roadblocks+		16
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 8 issues ( = data accesses) costing 2 point each.	16
►Loop 1829 - libggml-cpu.so+	Execution Time: 4 % - Vectorization Ratio: 46.15 % - Vector Length Use: 60.70 %
►Data Access Issues+		16
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 8 issues ( = data accesses) costing 2 point each.	16
►Vectorization Roadblocks+		16
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 8 issues ( = data accesses) costing 2 point each.	16
►Loop 1403 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 15.48 % - Vector Length Use: 31.37 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Control Flow Issues+		488
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (481 paths) - Simplify control structure. There are 481 issues ( = paths) costing 1 point each with a malus of 4 points.	485
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		488
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (481 paths) - Simplify control structure. There are 481 issues ( = paths) costing 1 point each with a malus of 4 points.	485
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 787 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 14.58 %
►Loop Computation Issues+		12
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 2 issues (= instructions) costing 4 points each.	8
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Data Access Issues+		8
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each.	4
○	[SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 1 issues ( = indirect data accesses) costing 4 point each.	4
►Vectorization Roadblocks+		9
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each.	4
○	[SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 1 issues ( = indirect data accesses) costing 4 point each.	4
►Loop 1802 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 64.09 % - Vector Length Use: 42.49 %
►Loop Computation Issues+		4
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
►Control Flow Issues+		4
○	[SA] Several paths (2 paths) - Simplify control structure or force the compiler to use masked instructions. There are 2 issues ( = paths) costing 1 point each.	2
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		4
○	[SA] Several paths (2 paths) - Simplify control structure or force the compiler to use masked instructions. There are 2 issues ( = paths) costing 1 point each.	2
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 51 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 24.01 %
►Control Flow Issues+		52
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (45 paths) - Simplify control structure. There are 45 issues ( = paths) costing 1 point each with a malus of 4 points.	49
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		52
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (45 paths) - Simplify control structure. There are 45 issues ( = paths) costing 1 point each with a malus of 4 points.	49
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 749 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Data Access Issues+		32
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 16 issues ( = data accesses) costing 2 point each.	32
►Vectorization Roadblocks+		32
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 16 issues ( = data accesses) costing 2 point each.	32
►Loop 1405 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Data Access Issues+		32
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 16 issues ( = data accesses) costing 2 point each.	32
►Vectorization Roadblocks+		32
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 16 issues ( = data accesses) costing 2 point each.	32
►Loop 1803 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 15.24 % - Vector Length Use: 14.64 %
►Loop Computation Issues+		6
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
○Control Flow Issues		0
►Vectorization Roadblocks+		1000
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 765 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 12.76 %
►Loop Computation Issues+		6
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1001
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
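The penalty scores in the table above appear to follow simple arithmetic: issue count times cost per issue, plus a fixed malus where one is stated, summed per category. The sketch below reproduces the Control Flow Issues total for Loop 1403 from the figures printed in its rows. The function and variable names are illustrative and not part of MAQAO, and the formula is inferred from the printed numbers rather than taken from the tool's documentation.

```python
def issue_penalty(count, cost_each, malus=0):
    """Penalty for one issue row: count * cost per issue, plus any stated malus."""
    return count * cost_each + malus


# Rows reported for Loop 1403 under "Control Flow Issues".
calls = issue_penalty(count=1, cost_each=1)              # Presence of calls
paths = issue_penalty(count=481, cost_each=1, malus=4)   # Too many paths (481 paths)
non_innermost = issue_penalty(count=1, cost_each=2)      # Non innermost loop (InBetween)

category_total = calls + paths + non_innermost
print(calls, paths, non_innermost, category_total)  # 1 485 2 488

# Loop 1828, "Data Access Issues": 8 strided accesses at 2 points each.
print(issue_penalty(count=8, cost_each=2))  # 16
```

Both results match the totals printed in the table (488 and 16), which supports reading each category score as the sum of its rows.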

▼Stylizer

[ 0 / 9 ] Compilation options are not available

Compilation options are an important optimization lever, but ONE-View is not able to analyze them.

[ 4 / 4 ] Application profile is long enough (12.55 s)

To have good quality measurements, it is advised that the application profiling time is greater than 10 seconds.

[ 2 / 3 ] Security settings from the host restrict profiling. Some metrics will be missing or incomplete.

Current value for kernel.perf_event_paranoid is 2. If possible, set it to 1 or check with your system administrator which flag can be used to achieve this.

[ 2 / 2 ] Application is correctly profiled ("Others" category represents 0.01 % of the execution time)

For representative profiling, the "Others" category should account for less than 20% of the execution time, so that as much of the user code as possible is analyzed.

[ 1 / 1 ] Lstopo present. The Topology lstopo report will be generated.

▶Strategizer

[ 4 / 4 ] Enough of the experiment time is spent in analyzed loops (87.13%)

If the time spent in analyzed loops is less than 30%, standard loop optimizations will have a limited impact on application performance.

[ 4 / 4 ] Threads activity is good

On average, more than 5366.52% of observed threads are actually active

[ 4 / 4 ] CPU activity is good

CPU cores are active 90.56% of time

[ 4 / 4 ] Loop profile is not flat

At least one loop has a coverage greater than 4% (84.30%), making it a hotspot for the application

[ 4 / 4 ] Enough of the experiment time is spent in analyzed innermost loops (86.04%)

If the time spent in analyzed innermost loops is less than 15%, standard innermost loop optimizations such as vectorisation will have a limited impact on application performance.

[ 4 / 4 ] Affinity is good (97.20%)

Threads are not migrating between CPU cores: they are probably successfully pinned

[ 3 / 3 ] Less than 10% (0.00%) is spent in BLAS1 operations

It could be more efficient to inline BLAS1 operations by hand

[ 3 / 3 ] Functions mostly use all threads

Functions running on a reduced number of threads (typically sequential code) cover less than 10% of application walltime (8.08%)

[ 3 / 3 ] Cumulative Outermost/In between loops coverage (1.09%) lower than cumulative innermost loop coverage (86.04%)

Having cumulative Outermost/In between loops coverage greater than cumulative innermost loop coverage will make loop optimization more complex

[ 2 / 2 ] Less than 10% (0.00%) is spent in BLAS2 operations

BLAS2 calls often make poor use of the cache and could benefit from inlining.

[ 2 / 2 ] Less than 10% (0.24%) is spent in Libm/SVML (special functions)

▼Optimizer

Loop ID	Analysis	Penalty Score
►Loop 1828 - libggml-cpu.so+	Execution Time: 84 % - Vectorization Ratio: 8.82 % - Vector Length Use: 21.32 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Data Access Issues+		16
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 8 issues ( = data accesses) costing 2 point each.	16
►Vectorization Roadblocks+		16
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 8 issues ( = data accesses) costing 2 point each.	16
►Loop 1403 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 15.48 % - Vector Length Use: 31.37 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Control Flow Issues+		488
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (481 paths) - Simplify control structure. There are 481 issues ( = paths) costing 1 point each with a malus of 4 points.	485
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		488
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (481 paths) - Simplify control structure. There are 481 issues ( = paths) costing 1 point each with a malus of 4 points.	485
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 787 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 14.58 %
►Loop Computation Issues+		12
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 2 issues (= instructions) costing 4 points each.	8
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Data Access Issues+		8
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each.	4
○	[SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 1 issues ( = indirect data accesses) costing 4 point each.	4
►Vectorization Roadblocks+		9
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each.	4
○	[SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 1 issues ( = indirect data accesses) costing 4 point each.	4
►Loop 51 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 24.01 %
►Control Flow Issues+		52
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (45 paths) - Simplify control structure. There are 45 issues ( = paths) costing 1 point each with a malus of 4 points.	49
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		52
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (45 paths) - Simplify control structure. There are 45 issues ( = paths) costing 1 point each with a malus of 4 points.	49
○	[SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 1405 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Data Access Issues+		32
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 16 issues ( = data accesses) costing 2 point each.	32
►Vectorization Roadblocks+		32
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 16 issues ( = data accesses) costing 2 point each.	32
►Loop 1802 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 64.09 % - Vector Length Use: 42.49 %
►Loop Computation Issues+		4
○	[SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points.	4
►Control Flow Issues+		4
○	[SA] Several paths (2 paths) - Simplify control structure or force the compiler to use masked instructions. There are 2 issues ( = paths) costing 1 point each.	2
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Vectorization Roadblocks+		4
○	[SA] Several paths (2 paths) - Simplify control structure or force the compiler to use masked instructions. There are 2 issues ( = paths) costing 1 point each.	2
○	[SA] Non innermost loop (Outermost) - Collapse loop with innermost ones. This issue costs 2 points.	2
►Loop 1803 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 15.24 % - Vector Length Use: 14.64 %
►Loop Computation Issues+		6
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
○Control Flow Issues		0
►Vectorization Roadblocks+		1000
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
►Loop 749 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 100.00 %
►Data Access Issues+		32
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 16 issues ( = data accesses) costing 2 point each.	32
►Vectorization Roadblocks+		32
○	[SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 16 issues ( = data accesses) costing 2 point each.	32
►Loop 1 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 9.23 %
►Loop Computation Issues+		2
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Loop 765 - libggml-cpu.so+	Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 12.76 %
►Loop Computation Issues+		6
○	[SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 1 issues (= instructions) costing 4 points each.	4
○	[SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points.	2
►Control Flow Issues+		1
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
►Vectorization Roadblocks+		1001
○	[SA] Presence of calls - Inline either by compiler or by hand and use SVML for libm calls. There are 1 issues (= calls) costing 1 point each.	1
○	[SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point.	1000
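The "Too many paths" findings count distinct execution paths through a loop body. As a rough illustration of why that number grows so quickly (a sketch over a hypothetical control-flow graph, not MAQAO's own analysis), the following counts entry-to-exit paths in an acyclic graph built from a chain of independent if/else blocks. Each added if/else doubles the count, so ten of them in sequence already exceed the 1000-path figure shown in the table.

```python
from functools import lru_cache


def diamond_chain(n_branches):
    """Build a hypothetical acyclic CFG: n if/else blocks executed in sequence.

    Node ("join", i) forks to ("then", i) and ("else", i), which both rejoin
    at ("join", i + 1). The entry is ("join", 0), the exit ("join", n_branches).
    """
    cfg = {}
    for i in range(n_branches):
        cfg[("join", i)] = [("then", i), ("else", i)]
        cfg[("then", i)] = [("join", i + 1)]
        cfg[("else", i)] = [("join", i + 1)]
    cfg[("join", n_branches)] = []
    return cfg


def count_paths(cfg, entry, exit_node):
    """Count distinct entry-to-exit paths in an acyclic graph (memoized DFS)."""

    @lru_cache(maxsize=None)
    def paths_from(node):
        if node == exit_node:
            return 1
        return sum(paths_from(succ) for succ in cfg[node])

    return paths_from(entry)


for n in (1, 2, 10):
    cfg = diamond_chain(n)
    print(n, count_paths(cfg, ("join", 0), ("join", n)))
# prints:
# 1 2
# 2 4
# 10 1024
```

This is why the report's advice is to simplify the control structure: removing or merging even a few independent branches reduces the path count multiplicatively.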

