[ 4 / 4 ] Application profile is long enough (13.71 s)
To have good quality measurements, it is advised that the application profiling time is greater than 10 seconds.
[ 3 / 3 ] Optimization level option is correctly used
[ 3 / 3 ] Most of time spent in analyzed modules comes from functions compiled with -g and -fno-omit-frame-pointer
The -g option gives access to debugging information, such as source locations. -fno-omit-frame-pointer improves the accuracy of the call chains found during application profiling.
[ 3 / 3 ] Host configuration allows retrieval of all necessary metrics.
[ 0 / 3 ] Compilation of some functions is not optimized for the target processor
The application ran on the SKYLAKE micro-architecture, while the code was specialized for skylake-avx512.
[ 2 / 2 ] Application is correctly profiled ("Others" category represents 10.54 % of the execution time)
For a representative profile, the "Others" category should represent less than 20% of the execution time, so that as much of the user code as possible is analyzed.
[ 1 / 1 ] Lstopo present. The Topology lstopo report will be generated.
[ 0 / 0 ] Fastmath not used
Consider adding -ffast-math to the compilation flags (or replacing -O3 with -Ofast) to unlock potential extra speedup by relaxing floating-point computation consistency. Warning: floating-point accuracy may be reduced and compliance with IEEE/ISO rules and specifications for math functions will be relaxed; typically, 'errno' will no longer be set after calling some math functions.
[ 4 / 4 ] Enough of the experiment time is spent in analyzed loops (87.60%)
If the time spent in analyzed loops is less than 30%, standard loop optimizations will have a limited impact on application performance.
[ 4 / 4 ] Threads activity is good
On average, more than 99.65% of observed threads are actually active
[ 4 / 4 ] CPU activity is good
CPU cores are active 99.65% of the time
[ 4 / 4 ] Loop profile is not flat
At least one loop has a coverage greater than 4% (46.73%), representing a hotspot for the application
[ 4 / 4 ] Enough of the experiment time is spent in analyzed innermost loops (78.66%)
If the time spent in analyzed innermost loops is less than 15%, standard innermost loop optimizations such as vectorization will have a limited impact on application performance.
[ 4 / 4 ] Affinity is good (99.64%)
Threads are not migrating between CPU cores: they are probably pinned successfully
[ 3 / 3 ] Less than 10% (0.00%) is spent in BLAS1 operations
Inlining BLAS1 operations by hand could be more efficient
[ 3 / 3 ] Functions mostly use all threads
Functions running on a reduced number of threads (typically sequential code) cover less than 10% of application walltime (0.00%)
[ 3 / 3 ] Cumulative Outermost/In between loops coverage (8.94%) lower than cumulative innermost loop coverage (78.66%)
Having cumulative Outermost/In between loops coverage greater than cumulative innermost loop coverage will make loop optimization more complex
[ 2 / 2 ] Less than 10% (0.00%) is spent in BLAS2 operations
BLAS2 calls often make poor use of the cache and could benefit from inlining.
[ 2 / 2 ] Less than 10% (1.50%) is spent in Libm/SVML (special functions)
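
The fast-math recommendation in the checklist above relaxes floating-point consistency. As a minimal sketch of what that means, assuming IEEE 754 double arithmetic and a build without -ffast-math, the C snippet below shows that the grouping of a sum changes its result; this is the kind of reordering that fast-math allows the compiler to perform. The function names are illustrative and do not come from libqmckl.

```c
/* Sum three doubles grouped as (a + b) + c.
   volatile forces the intermediate to be rounded and stored,
   so the compiler cannot regroup the expression. */
double sum_left(double a, double b, double c) {
    volatile double t = a + b;
    return t + c;
}

/* Same three values grouped as a + (b + c). */
double sum_right(double a, double b, double c) {
    volatile double t = b + c;
    return a + t;
}

/* Returns 1 when the two groupings disagree for the sample inputs.
   (1e20 - 1e20) + 1 gives 1, but 1e20 + (-1e20 + 1) gives 0,
   because the 1 is absorbed when added to -1e20. */
int grouping_matters(void) {
    double l = sum_left(1e20, -1e20, 1.0);
    double r = sum_right(1e20, -1e20, 1.0);
    return l != r;
}
```

If results like these matter for the application, validating numerical output after enabling fast-math is a sensible precaution.
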
| Loop ID | Analysis | Penalty Score |
|---|---|---|
| ►Loop 23 - libqmckl.so.0.0.0 | Execution Time: 46 % - Vectorization Ratio: 0.00 % - Vector Length Use: 12.50 % | |
| ►Loop Computation Issues | 2 | |
| ○ | [SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points. | 2 |
| ►Control Flow Issues | 1000 | |
| ○ | [SA] Too many paths (6561 paths) - Simplify control structure. There are 6561 issues ( = paths) costing 1 point, limited to 1000. | 1000 |
| ►Vectorization Roadblocks | 1000 | |
| ○ | [SA] Too many paths (6561 paths) - Simplify control structure. There are 6561 issues ( = paths) costing 1 point, limited to 1000. | 1000 |
| ►Loop 13 - libqmckl.so.0.0.0 | Execution Time: 15 % - Vectorization Ratio: 0.00 % - Vector Length Use: 12.50 % | |
| ►Data Access Issues | 16 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 3 issues ( = indirect data accesses) costing 4 point each. | 12 |
| ►Vectorization Roadblocks | 16 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 3 issues ( = indirect data accesses) costing 4 point each. | 12 |
| ►Loop 12 - libqmckl.so.0.0.0 | Execution Time: 8 % - Vectorization Ratio: 15.25 % - Vector Length Use: 16.44 % | |
| ►Loop Computation Issues | 2 | |
| ○ | [SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points. | 2 |
| ►Control Flow Issues | 1002 | |
| ○ | [SA] Too many paths (17053 paths) - Simplify control structure. There are 17053 issues ( = paths) costing 1 point, limited to 1000. | 1000 |
| ○ | [SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points. | 2 |
| ►Data Access Issues | 25 | |
| ○ | [SA] Presence of special instructions executing on a single port (INSERT/EXTRACT, BLEND/MERGE, BROADCAST, Other_packing) - Simplify data access and try to get stride 1 access. There are 25 issues (= instructions) costing 1 point each. | 25 |
| ►Vectorization Roadblocks | 1002 | |
| ○ | [SA] Too many paths (17053 paths) - Simplify control structure. There are 17053 issues ( = paths) costing 1 point, limited to 1000. | 1000 |
| ○ | [SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points. | 2 |
| ►Inefficient Vectorization | 25 | |
| ○ | [SA] Presence of special instructions executing on a single port (INSERT/EXTRACT, BLEND/MERGE, BROADCAST, Other_packing) - Simplify data access and try to get stride 1 access. There are 25 issues (= instructions) costing 1 point each. | 25 |
| ►Loop 20 - libqmckl.so.0.0.0 | Execution Time: 4 % - Vectorization Ratio: 100.00 % - Vector Length Use: 50.00 % | |
| ►Data Access Issues | 4 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ►Vectorization Roadblocks | 4 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ►Loop 15 - libqmckl.so.0.0.0 | Execution Time: 3 % - Vectorization Ratio: 0.00 % - Vector Length Use: 12.50 % | |
| ►Data Access Issues | 12 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 2 issues ( = indirect data accesses) costing 4 point each. | 8 |
| ►Vectorization Roadblocks | 12 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 2 issues ( = indirect data accesses) costing 4 point each. | 8 |
| ►Loop 21 - libqmckl.so.0.0.0 | Execution Time: 3 % - Vectorization Ratio: 0.00 % - Vector Length Use: 12.50 % | |
| ►Data Access Issues | 12 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 2 issues ( = indirect data accesses) costing 4 point each. | 8 |
| ►Vectorization Roadblocks | 12 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 2 issues ( = indirect data accesses) costing 4 point each. | 8 |
| ►Loop 16 - libqmckl.so.0.0.0 | Execution Time: 1 % - Vectorization Ratio: 37.50 % - Vector Length Use: 23.44 % | |
| ►Loop Computation Issues | 4 | |
| ○ | [SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points. | 4 |
| ►Data Access Issues | 16 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of special instructions executing on a single port (INSERT/EXTRACT, BLEND/MERGE, Other_packing) - Simplify data access and try to get stride 1 access. There are 12 issues (= instructions) costing 1 point each. | 12 |
| ►Vectorization Roadblocks | 4 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ►Inefficient Vectorization | 12 | |
| ○ | [SA] Presence of special instructions executing on a single port (INSERT/EXTRACT, BLEND/MERGE, Other_packing) - Simplify data access and try to get stride 1 access. There are 12 issues (= instructions) costing 1 point each. | 12 |
| ►Loop 188 - libqmckl.so.0.0.0 | Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 50.00 % | |
| ►Loop Computation Issues | 4 | |
| ○ | [SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points. | 4 |
| ○Loop 619 - libqmckl.so.0.0.0 | Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 50.00 % | |
| ►Loop 22 - libqmckl.so.0.0.0 | Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 11.76 % | |
| ►Loop Computation Issues | 2 | |
| ○ | [SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points. | 2 |
| ►Control Flow Issues | 2 | |
| ○ | [SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points. | 2 |
| ►Vectorization Roadblocks | 1002 | |
| ○ | [SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point. | 1000 |
| ○ | [SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points. | 2 |
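
Several loops in the table above are flagged for constant non-unit stride data access, with loop interchange listed as one remedy. The sketch below illustrates that remedy on a hypothetical row-major matrix; the function names and the data layout are assumptions for illustration and are not taken from libqmckl.

```c
#include <stddef.h>

/* Column-wise traversal of a row-major matrix: consecutive iterations
   of the inner loop are `cols` elements apart (constant non-unit stride). */
double sum_strided(const double *a, size_t rows, size_t cols) {
    double s = 0.0;
    for (size_t j = 0; j < cols; ++j)
        for (size_t i = 0; i < rows; ++i)
            s += a[i * cols + j];
    return s;
}

/* After loop interchange: the inner loop walks contiguous memory
   (stride 1), which is the access pattern vectorizers handle best. */
double sum_unit_stride(const double *a, size_t rows, size_t cols) {
    double s = 0.0;
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
            s += a[i * cols + j];
    return s;
}
```

Interchanging the loops changes the order of the additions, so for general floating-point data the two sums may differ in the last bits.
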
[ 4 / 4 ] Application profile is long enough (13.69 s)
To have good quality measurements, it is advised that the application profiling time is greater than 10 seconds.
[ 3 / 3 ] Optimization level option is correctly used
[ 3 / 3 ] Most of time spent in analyzed modules comes from functions compiled with -g and -fno-omit-frame-pointer
The -g option gives access to debugging information, such as source locations. -fno-omit-frame-pointer improves the accuracy of the call chains found during application profiling.
[ 3 / 3 ] Host configuration allows retrieval of all necessary metrics.
[ 0 / 3 ] Compilation of some functions is not optimized for the target processor
The application ran on the SKYLAKE micro-architecture, while the code was specialized for skylake-avx512.
[ 2 / 2 ] Application is correctly profiled ("Others" category represents 10.52 % of the execution time)
For a representative profile, the "Others" category should represent less than 20% of the execution time, so that as much of the user code as possible is analyzed.
[ 1 / 1 ] Lstopo present. The Topology lstopo report will be generated.
[ 0 / 0 ] Fastmath not used
Consider adding -ffast-math to the compilation flags (or replacing -O3 with -Ofast) to unlock potential extra speedup by relaxing floating-point computation consistency. Warning: floating-point accuracy may be reduced and compliance with IEEE/ISO rules and specifications for math functions will be relaxed; typically, 'errno' will no longer be set after calling some math functions.
[ 4 / 4 ] Enough of the experiment time is spent in analyzed loops (87.33%)
If the time spent in analyzed loops is less than 30%, standard loop optimizations will have a limited impact on application performance.
[ 4 / 4 ] Threads activity is good
On average, more than 99.57% of observed threads are actually active
[ 4 / 4 ] CPU activity is good
CPU cores are active 99.57% of the time
[ 4 / 4 ] Loop profile is not flat
At least one loop has a coverage greater than 4% (47.74%), representing a hotspot for the application
[ 4 / 4 ] Enough of the experiment time is spent in analyzed innermost loops (78.63%)
If the time spent in analyzed innermost loops is less than 15%, standard innermost loop optimizations such as vectorization will have a limited impact on application performance.
[ 4 / 4 ] Affinity is good (99.90%)
Threads are not migrating between CPU cores: they are probably pinned successfully
[ 3 / 3 ] Less than 10% (0.00%) is spent in BLAS1 operations
Inlining BLAS1 operations by hand could be more efficient
[ 3 / 3 ] Functions mostly use all threads
Functions running on a reduced number of threads (typically sequential code) cover less than 10% of application walltime (0.00%)
[ 3 / 3 ] Cumulative Outermost/In between loops coverage (8.69%) lower than cumulative innermost loop coverage (78.63%)
Having cumulative Outermost/In between loops coverage greater than cumulative innermost loop coverage will make loop optimization more complex
[ 2 / 2 ] Less than 10% (0.00%) is spent in BLAS2 operations
BLAS2 calls often make poor use of the cache and could benefit from inlining.
[ 2 / 2 ] Less than 10% (1.68%) is spent in Libm/SVML (special functions)
| Loop ID | Analysis | Penalty Score |
|---|---|---|
| ►Loop 23 - libqmckl.so.0.0.0 | Execution Time: 47 % - Vectorization Ratio: 0.00 % - Vector Length Use: 12.50 % | |
| ►Loop Computation Issues | 2 | |
| ○ | [SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points. | 2 |
| ►Control Flow Issues | 1000 | |
| ○ | [SA] Too many paths (6561 paths) - Simplify control structure. There are 6561 issues ( = paths) costing 1 point, limited to 1000. | 1000 |
| ►Vectorization Roadblocks | 1000 | |
| ○ | [SA] Too many paths (6561 paths) - Simplify control structure. There are 6561 issues ( = paths) costing 1 point, limited to 1000. | 1000 |
| ►Loop 13 - libqmckl.so.0.0.0 | Execution Time: 16 % - Vectorization Ratio: 0.00 % - Vector Length Use: 12.50 % | |
| ►Data Access Issues | 16 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 3 issues ( = indirect data accesses) costing 4 point each. | 12 |
| ►Vectorization Roadblocks | 16 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 3 issues ( = indirect data accesses) costing 4 point each. | 12 |
| ►Loop 12 - libqmckl.so.0.0.0 | Execution Time: 7 % - Vectorization Ratio: 15.25 % - Vector Length Use: 16.44 % | |
| ►Loop Computation Issues | 2 | |
| ○ | [SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points. | 2 |
| ►Control Flow Issues | 1002 | |
| ○ | [SA] Too many paths (17053 paths) - Simplify control structure. There are 17053 issues ( = paths) costing 1 point, limited to 1000. | 1000 |
| ○ | [SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points. | 2 |
| ►Data Access Issues | 25 | |
| ○ | [SA] Presence of special instructions executing on a single port (INSERT/EXTRACT, BLEND/MERGE, BROADCAST, Other_packing) - Simplify data access and try to get stride 1 access. There are 25 issues (= instructions) costing 1 point each. | 25 |
| ►Vectorization Roadblocks | 1002 | |
| ○ | [SA] Too many paths (17053 paths) - Simplify control structure. There are 17053 issues ( = paths) costing 1 point, limited to 1000. | 1000 |
| ○ | [SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points. | 2 |
| ►Inefficient Vectorization | 25 | |
| ○ | [SA] Presence of special instructions executing on a single port (INSERT/EXTRACT, BLEND/MERGE, BROADCAST, Other_packing) - Simplify data access and try to get stride 1 access. There are 25 issues (= instructions) costing 1 point each. | 25 |
| ►Loop 20 - libqmckl.so.0.0.0 | Execution Time: 4 % - Vectorization Ratio: 100.00 % - Vector Length Use: 50.00 % | |
| ►Data Access Issues | 4 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ►Vectorization Roadblocks | 4 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ►Loop 15 - libqmckl.so.0.0.0 | Execution Time: 3 % - Vectorization Ratio: 0.00 % - Vector Length Use: 12.50 % | |
| ►Data Access Issues | 12 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 2 issues ( = indirect data accesses) costing 4 point each. | 8 |
| ►Vectorization Roadblocks | 12 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 2 issues ( = indirect data accesses) costing 4 point each. | 8 |
| ►Loop 21 - libqmckl.so.0.0.0 | Execution Time: 2 % - Vectorization Ratio: 0.00 % - Vector Length Use: 12.50 % | |
| ►Data Access Issues | 12 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 2 issues ( = indirect data accesses) costing 4 point each. | 8 |
| ►Vectorization Roadblocks | 12 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 2 issues ( = indirect data accesses) costing 4 point each. | 8 |
| ►Loop 16 - libqmckl.so.0.0.0 | Execution Time: 1 % - Vectorization Ratio: 37.50 % - Vector Length Use: 23.44 % | |
| ►Loop Computation Issues | 4 | |
| ○ | [SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points. | 4 |
| ►Data Access Issues | 16 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of special instructions executing on a single port (INSERT/EXTRACT, BLEND/MERGE, Other_packing) - Simplify data access and try to get stride 1 access. There are 12 issues (= instructions) costing 1 point each. | 12 |
| ►Vectorization Roadblocks | 4 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ►Inefficient Vectorization | 12 | |
| ○ | [SA] Presence of special instructions executing on a single port (INSERT/EXTRACT, BLEND/MERGE, Other_packing) - Simplify data access and try to get stride 1 access. There are 12 issues (= instructions) costing 1 point each. | 12 |
| ►Loop 22 - libqmckl.so.0.0.0 | Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 11.76 % | |
| ►Loop Computation Issues | 2 | |
| ○ | [SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points. | 2 |
| ►Control Flow Issues | 2 | |
| ○ | [SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points. | 2 |
| ►Vectorization Roadblocks | 1002 | |
| ○ | [SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point. | 1000 |
| ○ | [SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points. | 2 |
| ►Loop 188 - libqmckl.so.0.0.0 | Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 50.00 % | |
| ►Loop Computation Issues | 4 | |
| ○ | [SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points. | 4 |
| ►Loop 615 - libqmckl.so.0.0.0 | Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 50.00 % | |
| ►Loop Computation Issues | 32 | |
| ○ | [SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 8 issues (= instructions) costing 4 points each. | 32 |
| ►Data Access Issues | 2 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 1 issues ( = data accesses) costing 2 point each. | 2 |
| ►Vectorization Roadblocks | 2 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 1 issues ( = data accesses) costing 2 point each. | 2 |
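
Loops 16 and 188 in the table above are flagged because less than 10% of their FP ADD/SUB/MUL operations use FMA, with the advice to reorganize arithmetic expressions. One common reorganization is Horner's scheme, sketched below on a hypothetical quadratic; the functions are illustrative only and are not code from libqmckl. Whether the compiler actually emits FMA instructions depends on the target flags and its floating-point contraction settings, which this sketch assumes allow fusion.

```c
/* Expanded form: three multiplies and two adds. */
double poly_expanded(double a, double b, double c, double x) {
    return a * x * x + b * x + c;
}

/* Horner form: two multiplies, each feeding directly into one add.
   Every step has the shape (p * x + q), which is exactly what a
   fused multiply-add computes. */
double poly_horner(double a, double b, double c, double x) {
    return (a * x + b) * x + c;
}
```

The two forms are algebraically equal, but they round differently in general, so results may differ in the last bits once the expression is rewritten or fused.
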
[ 4 / 4 ] Application profile is long enough (13.80 s)
To have good quality measurements, it is advised that the application profiling time is greater than 10 seconds.
[ 3 / 3 ] Optimization level option is correctly used
[ 3 / 3 ] Most of time spent in analyzed modules comes from functions compiled with -g and -fno-omit-frame-pointer
The -g option gives access to debugging information, such as source locations. -fno-omit-frame-pointer improves the accuracy of the call chains found during application profiling.
[ 3 / 3 ] Host configuration allows retrieval of all necessary metrics.
[ 0 / 3 ] Compilation of some functions is not optimized for the target processor
The application ran on the SKYLAKE micro-architecture, while the code was specialized for skylake-avx512.
[ 2 / 2 ] Application is correctly profiled ("Others" category represents 9.89 % of the execution time)
For a representative profile, the "Others" category should represent less than 20% of the execution time, so that as much of the user code as possible is analyzed.
[ 1 / 1 ] Lstopo present. The Topology lstopo report will be generated.
[ 0 / 0 ] Fastmath not used
Consider adding -ffast-math to the compilation flags (or replacing -O3 with -Ofast) to unlock potential extra speedup by relaxing floating-point computation consistency. Warning: floating-point accuracy may be reduced and compliance with IEEE/ISO rules and specifications for math functions will be relaxed; typically, 'errno' will no longer be set after calling some math functions.
[ 4 / 4 ] Enough of the experiment time is spent in analyzed loops (88.19%)
If the time spent in analyzed loops is less than 30%, standard loop optimizations will have a limited impact on application performance.
[ 4 / 4 ] Threads activity is good
On average, more than 99.47% of observed threads are actually active
[ 4 / 4 ] CPU activity is good
CPU cores are active 99.47% of the time
[ 4 / 4 ] Loop profile is not flat
At least one loop has a coverage greater than 4% (46.38%), representing a hotspot for the application
[ 4 / 4 ] Enough of the experiment time is spent in analyzed innermost loops (77.79%)
If the time spent in analyzed innermost loops is less than 15%, standard innermost loop optimizations such as vectorization will have a limited impact on application performance.
[ 4 / 4 ] Affinity is good (99.68%)
Threads are not migrating between CPU cores: they are probably pinned successfully
[ 3 / 3 ] Less than 10% (0.00%) is spent in BLAS1 operations
Inlining BLAS1 operations by hand could be more efficient
[ 3 / 3 ] Functions mostly use all threads
Functions running on a reduced number of threads (typically sequential code) cover less than 10% of application walltime (0.00%)
[ 3 / 3 ] Cumulative Outermost/In between loops coverage (10.40%) lower than cumulative innermost loop coverage (77.79%)
Having cumulative Outermost/In between loops coverage greater than cumulative innermost loop coverage will make loop optimization more complex
[ 2 / 2 ] Less than 10% (0.00%) is spent in BLAS2 operations
BLAS2 calls often make poor use of the cache and could benefit from inlining.
[ 2 / 2 ] Less than 10% (1.56%) is spent in Libm/SVML (special functions)
| Loop ID | Analysis | Penalty Score |
|---|---|---|
| ►Loop 23 - libqmckl.so.0.0.0 | Execution Time: 46 % - Vectorization Ratio: 0.00 % - Vector Length Use: 12.50 % | |
| ►Loop Computation Issues | 2 | |
| ○ | [SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points. | 2 |
| ►Control Flow Issues | 1000 | |
| ○ | [SA] Too many paths (6561 paths) - Simplify control structure. There are 6561 issues ( = paths) costing 1 point, limited to 1000. | 1000 |
| ►Vectorization Roadblocks | 1000 | |
| ○ | [SA] Too many paths (6561 paths) - Simplify control structure. There are 6561 issues ( = paths) costing 1 point, limited to 1000. | 1000 |
| ►Loop 13 - libqmckl.so.0.0.0 | Execution Time: 15 % - Vectorization Ratio: 0.00 % - Vector Length Use: 12.50 % | |
| ►Data Access Issues | 16 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 3 issues ( = indirect data accesses) costing 4 point each. | 12 |
| ►Vectorization Roadblocks | 16 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 3 issues ( = indirect data accesses) costing 4 point each. | 12 |
| ►Loop 12 - libqmckl.so.0.0.0 | Execution Time: 9 % - Vectorization Ratio: 15.25 % - Vector Length Use: 16.44 % | |
| ►Loop Computation Issues | 2 | |
| ○ | [SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points. | 2 |
| ►Control Flow Issues | 1002 | |
| ○ | [SA] Too many paths (17053 paths) - Simplify control structure. There are 17053 issues ( = paths) costing 1 point, limited to 1000. | 1000 |
| ○ | [SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points. | 2 |
| ►Data Access Issues | 25 | |
| ○ | [SA] Presence of special instructions executing on a single port (INSERT/EXTRACT, BLEND/MERGE, BROADCAST, Other_packing) - Simplify data access and try to get stride 1 access. There are 25 issues (= instructions) costing 1 point each. | 25 |
| ►Vectorization Roadblocks | 1002 | |
| ○ | [SA] Too many paths (17053 paths) - Simplify control structure. There are 17053 issues ( = paths) costing 1 point, limited to 1000. | 1000 |
| ○ | [SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points. | 2 |
| ►Inefficient Vectorization | 25 | |
| ○ | [SA] Presence of special instructions executing on a single port (INSERT/EXTRACT, BLEND/MERGE, BROADCAST, Other_packing) - Simplify data access and try to get stride 1 access. There are 25 issues (= instructions) costing 1 point each. | 25 |
| ►Loop 15 - libqmckl.so.0.0.0 | Execution Time: 3 % - Vectorization Ratio: 0.00 % - Vector Length Use: 12.50 % | |
| ►Data Access Issues | 12 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 2 issues ( = indirect data accesses) costing 4 point each. | 8 |
| ►Vectorization Roadblocks | 12 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 2 issues ( = indirect data accesses) costing 4 point each. | 8 |
| ►Loop 20 - libqmckl.so.0.0.0 | Execution Time: 3 % - Vectorization Ratio: 100.00 % - Vector Length Use: 50.00 % | |
| ►Data Access Issues | 4 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ►Vectorization Roadblocks | 4 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ►Loop 21 - libqmckl.so.0.0.0 | Execution Time: 3 % - Vectorization Ratio: 0.00 % - Vector Length Use: 12.50 % | |
| ►Data Access Issues | 12 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 2 issues ( = indirect data accesses) costing 4 point each. | 8 |
| ►Vectorization Roadblocks | 12 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 2 issues ( = indirect data accesses) costing 4 point each. | 8 |
| ►Loop 16 - libqmckl.so.0.0.0 | Execution Time: 2 % - Vectorization Ratio: 37.50 % - Vector Length Use: 23.44 % | |
| ►Loop Computation Issues | 4 | |
| ○ | [SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points. | 4 |
| ►Data Access Issues | 16 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of special instructions executing on a single port (INSERT/EXTRACT, BLEND/MERGE, Other_packing) - Simplify data access and try to get stride 1 access. There are 12 issues (= instructions) costing 1 point each. | 12 |
| ►Vectorization Roadblocks | 4 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ►Inefficient Vectorization | 12 | |
| ○ | [SA] Presence of special instructions executing on a single port (INSERT/EXTRACT, BLEND/MERGE, Other_packing) - Simplify data access and try to get stride 1 access. There are 12 issues (= instructions) costing 1 point each. | 12 |
| ►Loop 188 - libqmckl.so.0.0.0 | Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 50.00 % | |
| ►Loop Computation Issues | 4 | |
| ○ | [SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points. | 4 |
| ►Loop 22 - libqmckl.so.0.0.0 | Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 11.76 % | |
| ►Loop Computation Issues | 2 | |
| ○ | [SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points. | 2 |
| ►Control Flow Issues | 2 | |
| ○ | [SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points. | 2 |
| ►Vectorization Roadblocks | 1002 | |
| ○ | [SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point. | 1000 |
| ○ | [SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points. | 2 |
| ►Loop 615 - libqmckl.so.0.0.0 | Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 50.00 % | |
| ►Loop Computation Issues | 32 | |
| ○ | [SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 8 issues (= instructions) costing 4 points each. | 32 |
| ►Data Access Issues | 2 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 1 issues ( = data accesses) costing 2 point each. | 2 |
| ►Vectorization Roadblocks | 2 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 1 issues ( = data accesses) costing 2 point each. | 2 |
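
The "constant non unit stride data access" rows above list loop interchange as one possible remedy. The following is a minimal sketch of that remedy only, assuming a row-major matrix and a column-sum kernel; the function names and the kernel itself are hypothetical and are not taken from libqmckl.

```c
#include <stddef.h>

/* Strided version (hypothetical kernel): the inner loop walks down one
   column of a row-major matrix, so consecutive iterations touch
   elements that are `cols` doubles apart. */
void col_sums_strided(const double *a, size_t rows, size_t cols, double *out)
{
    for (size_t j = 0; j < cols; ++j) {
        double s = 0.0;
        for (size_t i = 0; i < rows; ++i)
            s += a[i * cols + j];      /* constant stride of `cols` */
        out[j] = s;
    }
}

/* Interchanged version: the inner loop walks along one row, so both
   `a` and `out` are accessed with unit stride. Each out[j] still
   accumulates its terms in the same order (i ascending). */
void col_sums_unit(const double *a, size_t rows, size_t cols, double *out)
{
    for (size_t j = 0; j < cols; ++j)
        out[j] = 0.0;
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
            out[j] += a[i * cols + j]; /* unit stride */
}
```

Whether such an interchange is legal and profitable for the loops reported here depends on their actual dependences and data layout, which this report does not show.
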
[ 4 / 4 ] Application profile is long enough (13.87 s)
To have good quality measurements, it is advised that the application profiling time is greater than 10 seconds.
[ 3 / 3 ] Optimization level option is correctly used
[ 3 / 3 ] Most of time spent in analyzed modules comes from functions compiled with -g and -fno-omit-frame-pointer
-g option gives access to debugging informations, such are source locations. -fno-omit-frame-pointer improves the accuracy of callchains found during the application profiling.
[ 3 / 3 ] Host configuration allows retrieval of all necessary metrics.
[ 0 / 3 ] Compilation of some functions is not optimized for the target processor
Application run on the SKYLAKE micro-architecture while the code was specialized for skylake-avx512.
[ 2 / 2 ] Application is correctly profiled ("Others" category represents 8.65 % of the execution time)
To have a representative profiling, it is advised that the category "Others" represents less than 20% of the execution time in order to analyze as much as possible of the user code
[ 1 / 1 ] Lstopo present. The Topology lstopo report will be generated.
[ 0 / 0 ] Fastmath not used
Consider to add ffast-math to compilation flags (or replace -O3 with -Ofast) to unlock potential extra speedup by relaxing floating-point computation consistency. Warning: floating-point accuracy may be reduced and the compliance to IEEE/ISO rules/specifications for math functions will be relaxed, typically 'errno' will no longer be set after calling some math functions.
[ 4 / 4 ] Enough time of the experiment time spent in analyzed loops (89.33%)
If the time spent in analyzed loops is less than 30%, standard loop optimizations will have a limited impact on application performances.
[ 4 / 4 ] Threads activity is good
On average, more than 99.55% of observed threads are actually active
[ 4 / 4 ] CPU activity is good
CPU cores are active 99.55% of time
[ 4 / 4 ] Loop profile is not flat
At least one loop coverage is greater than 4% (47.01%), representing a hotspot for the application
[ 4 / 4 ] Enough time of the experiment time spent in analyzed innermost loops (79.24%)
If the time spent in analyzed innermost loops is less than 15%, standard innermost loop optimizations such as vectorisation will have a limited impact on application performances.
[ 4 / 4 ] Affinity is good (99.91%)
Threads are not migrating to CPU cores: probably successfully pinned
[ 3 / 3 ] Less than 10% (0.00%) is spent in BLAS1 operations
It could be more efficient to inline BLAS1 operations by hand
[ 3 / 3 ] Functions mostly use all threads
Functions running on a reduced number of threads (typically sequential code) cover less than 10% of application walltime (0.00%)
[ 3 / 3 ] Cumulative Outermost/In between loops coverage (10.09%) lower than cumulative innermost loop coverage (79.24%)
Having cumulative Outermost/In between loops coverage greater than cumulative innermost loop coverage will make loop optimization more complex
[ 2 / 2 ] Less than 10% (0.00%) is spent in BLAS2 operations
BLAS2 calls usually make poor use of the cache and could benefit from inlining.
[ 2 / 2 ] Less than 10% (1.69%) is spent in Libm/SVML (special functions)
| Loop ID | Analysis | Penalty Score |
|---|---|---|
| ►Loop 23 - libqmckl.so.0.0.0 | Execution Time: 47 % - Vectorization Ratio: 0.00 % - Vector Length Use: 12.50 % | |
| ►Loop Computation Issues | 2 | |
| ○ | [SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points. | 2 |
| ►Control Flow Issues | 1000 | |
| ○ | [SA] Too many paths (6561 paths) - Simplify control structure. There are 6561 issues ( = paths) costing 1 point, limited to 1000. | 1000 |
| ►Vectorization Roadblocks | 1000 | |
| ○ | [SA] Too many paths (6561 paths) - Simplify control structure. There are 6561 issues ( = paths) costing 1 point, limited to 1000. | 1000 |
| ►Loop 13 - libqmckl.so.0.0.0 | Execution Time: 15 % - Vectorization Ratio: 0.00 % - Vector Length Use: 12.50 % | |
| ►Data Access Issues | 16 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 3 issues ( = indirect data accesses) costing 4 point each. | 12 |
| ►Vectorization Roadblocks | 16 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 3 issues ( = indirect data accesses) costing 4 point each. | 12 |
| ►Loop 12 - libqmckl.so.0.0.0 | Execution Time: 9 % - Vectorization Ratio: 15.25 % - Vector Length Use: 16.44 % | |
| ►Loop Computation Issues | 2 | |
| ○ | [SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points. | 2 |
| ►Control Flow Issues | 1002 | |
| ○ | [SA] Too many paths (17053 paths) - Simplify control structure. There are 17053 issues ( = paths) costing 1 point, limited to 1000. | 1000 |
| ○ | [SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points. | 2 |
| ►Data Access Issues | 25 | |
| ○ | [SA] Presence of special instructions executing on a single port (INSERT/EXTRACT, BLEND/MERGE, BROADCAST, Other_packing) - Simplify data access and try to get stride 1 access. There are 25 issues (= instructions) costing 1 point each. | 25 |
| ►Vectorization Roadblocks | 1002 | |
| ○ | [SA] Too many paths (17053 paths) - Simplify control structure. There are 17053 issues ( = paths) costing 1 point, limited to 1000. | 1000 |
| ○ | [SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points. | 2 |
| ►Inefficient Vectorization | 25 | |
| ○ | [SA] Presence of special instructions executing on a single port (INSERT/EXTRACT, BLEND/MERGE, BROADCAST, Other_packing) - Simplify data access and try to get stride 1 access. There are 25 issues (= instructions) costing 1 point each. | 25 |
| ►Loop 15 - libqmckl.so.0.0.0 | Execution Time: 4 % - Vectorization Ratio: 0.00 % - Vector Length Use: 12.50 % | |
| ►Data Access Issues | 12 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 2 issues ( = indirect data accesses) costing 4 point each. | 8 |
| ►Vectorization Roadblocks | 12 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 2 issues ( = indirect data accesses) costing 4 point each. | 8 |
| ►Loop 20 - libqmckl.so.0.0.0 | Execution Time: 3 % - Vectorization Ratio: 100.00 % - Vector Length Use: 50.00 % | |
| ►Data Access Issues | 4 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ►Vectorization Roadblocks | 4 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ►Loop 21 - libqmckl.so.0.0.0 | Execution Time: 3 % - Vectorization Ratio: 0.00 % - Vector Length Use: 12.50 % | |
| ►Data Access Issues | 12 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 2 issues ( = indirect data accesses) costing 4 point each. | 8 |
| ►Vectorization Roadblocks | 12 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 2 issues ( = indirect data accesses) costing 4 point each. | 8 |
| ►Loop 16 - libqmckl.so.0.0.0 | Execution Time: 1 % - Vectorization Ratio: 37.50 % - Vector Length Use: 23.44 % | |
| ►Loop Computation Issues | 4 | |
| ○ | [SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points. | 4 |
| ►Data Access Issues | 16 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of special instructions executing on a single port (INSERT/EXTRACT, BLEND/MERGE, Other_packing) - Simplify data access and try to get stride 1 access. There are 12 issues (= instructions) costing 1 point each. | 12 |
| ►Vectorization Roadblocks | 4 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ►Inefficient Vectorization | 12 | |
| ○ | [SA] Presence of special instructions executing on a single port (INSERT/EXTRACT, BLEND/MERGE, Other_packing) - Simplify data access and try to get stride 1 access. There are 12 issues (= instructions) costing 1 point each. | 12 |
| ►Loop 188 - libqmckl.so.0.0.0 | Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 50.00 % | |
| ►Loop Computation Issues | 4 | |
| ○ | [SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points. | 4 |
| ►Loop 22 - libqmckl.so.0.0.0 | Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 11.76 % | |
| ►Loop Computation Issues | 2 | |
| ○ | [SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points. | 2 |
| ►Control Flow Issues | 2 | |
| ○ | [SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points. | 2 |
| ►Vectorization Roadblocks | 1002 | |
| ○ | [SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point. | 1000 |
| ○ | [SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points. | 2 |
| ►Loop 18 - libqmckl.so.0.0.0 | Execution Time: 0 % - Vectorization Ratio: 44.44 % - Vector Length Use: 26.39 % | |
| ►Data Access Issues | 12 | |
| ○ | [SA] Presence of special instructions executing on a single port (INSERT/EXTRACT, BLEND/MERGE, Other_packing) - Simplify data access and try to get stride 1 access. There are 12 issues (= instructions) costing 1 point each. | 12 |
| ►Inefficient Vectorization | 12 | |
| ○ | [SA] Presence of special instructions executing on a single port (INSERT/EXTRACT, BLEND/MERGE, Other_packing) - Simplify data access and try to get stride 1 access. There are 12 issues (= instructions) costing 1 point each. | 12 |
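
The Penalty Score column in the table above appears to follow simple arithmetic: each issue line scores its count times its per-issue cost, optionally capped ("limited to 1000"), and each category row is the sum of its issue lines. The sketch below reproduces that arithmetic; it is inferred from the rows of this report rather than from MAQAO documentation, and the helper names are hypothetical.

```c
#include <stddef.h>

/* Score of one issue line: count * cost, capped at `cap`.
   A cap of 0 means "no cap" (an assumption of this sketch). */
unsigned long issue_penalty(unsigned long count, unsigned long cost,
                            unsigned long cap)
{
    unsigned long score = count * cost;
    return (cap != 0 && score > cap) ? cap : score;
}

/* Score of one category row: the sum of its issue-line scores. */
unsigned long category_penalty(const unsigned long *scores, size_t n)
{
    unsigned long total = 0;
    for (size_t i = 0; i < n; ++i)
        total += scores[i];
    return total;
}
```

For Loop 12, for example, the 17053 paths at 1 point each are capped at 1000, and adding the 2-point "Non innermost loop" issue gives the 1002 shown for Control Flow Issues.
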
[ 4 / 4 ] Application profile is long enough (13.98 s)
To have good quality measurements, it is advised that the application profiling time is greater than 10 seconds.
[ 3 / 3 ] Optimization level option is correctly used
[ 3 / 3 ] Most of time spent in analyzed modules comes from functions compiled with -g and -fno-omit-frame-pointer
-g option gives access to debugging informations, such are source locations. -fno-omit-frame-pointer improves the accuracy of callchains found during the application profiling.
[ 3 / 3 ] Host configuration allows retrieval of all necessary metrics.
[ 0 / 3 ] Compilation of some functions is not optimized for the target processor
Application run on the SKYLAKE micro-architecture while the code was specialized for skylake-avx512.
[ 2 / 2 ] Application is correctly profiled ("Others" category represents 9.55 % of the execution time)
To have a representative profiling, it is advised that the category "Others" represents less than 20% of the execution time in order to analyze as much as possible of the user code
[ 1 / 1 ] Lstopo present. The Topology lstopo report will be generated.
[ 0 / 0 ] Fastmath not used
Consider to add ffast-math to compilation flags (or replace -O3 with -Ofast) to unlock potential extra speedup by relaxing floating-point computation consistency. Warning: floating-point accuracy may be reduced and the compliance to IEEE/ISO rules/specifications for math functions will be relaxed, typically 'errno' will no longer be set after calling some math functions.
[ 4 / 4 ] Enough time of the experiment time spent in analyzed loops (87.95%)
If the time spent in analyzed loops is less than 30%, standard loop optimizations will have a limited impact on application performances.
[ 4 / 4 ] Threads activity is good
On average, more than 99.59% of observed threads are actually active
[ 4 / 4 ] CPU activity is good
CPU cores are active 99.59% of time
[ 4 / 4 ] Loop profile is not flat
At least one loop coverage is greater than 4% (46.28%), representing a hotspot for the application
[ 4 / 4 ] Enough time of the experiment time spent in analyzed innermost loops (78.79%)
If the time spent in analyzed innermost loops is less than 15%, standard innermost loop optimizations such as vectorisation will have a limited impact on application performances.
[ 4 / 4 ] Affinity is good (99.85%)
Threads are not migrating to CPU cores: probably successfully pinned
[ 3 / 3 ] Less than 10% (0.00%) is spent in BLAS1 operations
It could be more efficient to inline BLAS1 operations by hand
[ 3 / 3 ] Functions mostly use all threads
Functions running on a reduced number of threads (typically sequential code) cover less than 10% of application walltime (0.00%)
[ 3 / 3 ] Cumulative Outermost/In between loops coverage (9.16%) lower than cumulative innermost loop coverage (78.79%)
Having cumulative Outermost/In between loops coverage greater than cumulative innermost loop coverage will make loop optimization more complex
[ 2 / 2 ] Less than 10% (0.00%) is spent in BLAS2 operations
BLAS2 calls usually make poor use of the cache and could benefit from inlining.
[ 2 / 2 ] Less than 10% (2.11%) is spent in Libm/SVML (special functions)
| Loop ID | Analysis | Penalty Score |
|---|---|---|
| ►Loop 23 - libqmckl.so.0.0.0 | Execution Time: 46 % - Vectorization Ratio: 0.00 % - Vector Length Use: 12.50 % | |
| ►Loop Computation Issues | 2 | |
| ○ | [SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points. | 2 |
| ►Control Flow Issues | 1000 | |
| ○ | [SA] Too many paths (6561 paths) - Simplify control structure. There are 6561 issues ( = paths) costing 1 point, limited to 1000. | 1000 |
| ►Vectorization Roadblocks | 1000 | |
| ○ | [SA] Too many paths (6561 paths) - Simplify control structure. There are 6561 issues ( = paths) costing 1 point, limited to 1000. | 1000 |
| ►Loop 13 - libqmckl.so.0.0.0 | Execution Time: 15 % - Vectorization Ratio: 0.00 % - Vector Length Use: 12.50 % | |
| ►Data Access Issues | 16 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 3 issues ( = indirect data accesses) costing 4 point each. | 12 |
| ►Vectorization Roadblocks | 16 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 3 issues ( = indirect data accesses) costing 4 point each. | 12 |
| ►Loop 12 - libqmckl.so.0.0.0 | Execution Time: 8 % - Vectorization Ratio: 15.25 % - Vector Length Use: 16.44 % | |
| ►Loop Computation Issues | 2 | |
| ○ | [SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points. | 2 |
| ►Control Flow Issues | 1002 | |
| ○ | [SA] Too many paths (17053 paths) - Simplify control structure. There are 17053 issues ( = paths) costing 1 point, limited to 1000. | 1000 |
| ○ | [SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points. | 2 |
| ►Data Access Issues | 25 | |
| ○ | [SA] Presence of special instructions executing on a single port (INSERT/EXTRACT, BLEND/MERGE, BROADCAST, Other_packing) - Simplify data access and try to get stride 1 access. There are 25 issues (= instructions) costing 1 point each. | 25 |
| ►Vectorization Roadblocks | 1002 | |
| ○ | [SA] Too many paths (17053 paths) - Simplify control structure. There are 17053 issues ( = paths) costing 1 point, limited to 1000. | 1000 |
| ○ | [SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points. | 2 |
| ►Inefficient Vectorization | 25 | |
| ○ | [SA] Presence of special instructions executing on a single port (INSERT/EXTRACT, BLEND/MERGE, BROADCAST, Other_packing) - Simplify data access and try to get stride 1 access. There are 25 issues (= instructions) costing 1 point each. | 25 |
| ►Loop 20 - libqmckl.so.0.0.0 | Execution Time: 4 % - Vectorization Ratio: 100.00 % - Vector Length Use: 50.00 % | |
| ►Data Access Issues | 4 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ►Vectorization Roadblocks | 4 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ►Loop 15 - libqmckl.so.0.0.0 | Execution Time: 4 % - Vectorization Ratio: 0.00 % - Vector Length Use: 12.50 % | |
| ►Data Access Issues | 12 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 2 issues ( = indirect data accesses) costing 4 point each. | 8 |
| ►Vectorization Roadblocks | 12 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 2 issues ( = indirect data accesses) costing 4 point each. | 8 |
| ►Loop 21 - libqmckl.so.0.0.0 | Execution Time: 3 % - Vectorization Ratio: 0.00 % - Vector Length Use: 12.50 % | |
| ►Data Access Issues | 12 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 2 issues ( = indirect data accesses) costing 4 point each. | 8 |
| ►Vectorization Roadblocks | 12 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 2 issues ( = indirect data accesses) costing 4 point each. | 8 |
| ►Loop 16 - libqmckl.so.0.0.0 | Execution Time: 1 % - Vectorization Ratio: 37.50 % - Vector Length Use: 23.44 % | |
| ►Loop Computation Issues | 4 | |
| ○ | [SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points. | 4 |
| ►Data Access Issues | 16 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of special instructions executing on a single port (INSERT/EXTRACT, BLEND/MERGE, Other_packing) - Simplify data access and try to get stride 1 access. There are 12 issues (= instructions) costing 1 point each. | 12 |
| ►Vectorization Roadblocks | 4 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ►Inefficient Vectorization | 12 | |
| ○ | [SA] Presence of special instructions executing on a single port (INSERT/EXTRACT, BLEND/MERGE, Other_packing) - Simplify data access and try to get stride 1 access. There are 12 issues (= instructions) costing 1 point each. | 12 |
| ►Loop 188 - libqmckl.so.0.0.0 | Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 50.00 % | |
| ►Loop Computation Issues | 4 | |
| ○ | [SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points. | 4 |
| ►Loop 22 - libqmckl.so.0.0.0 | Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 11.76 % | |
| ►Loop Computation Issues | 2 | |
| ○ | [SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points. | 2 |
| ►Control Flow Issues | 2 | |
| ○ | [SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points. | 2 |
| ►Vectorization Roadblocks | 1002 | |
| ○ | [SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point. | 1000 |
| ○ | [SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points. | 2 |
| ○Loop 619 - libqmckl.so.0.0.0 | Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 50.00 % | |
[ 4 / 4 ] Application profile is long enough (14.30 s)
To have good quality measurements, it is advised that the application profiling time is greater than 10 seconds.
[ 3 / 3 ] Optimization level option is correctly used
[ 3 / 3 ] Most of time spent in analyzed modules comes from functions compiled with -g and -fno-omit-frame-pointer
-g option gives access to debugging informations, such are source locations. -fno-omit-frame-pointer improves the accuracy of callchains found during the application profiling.
[ 3 / 3 ] Host configuration allows retrieval of all necessary metrics.
[ 0 / 3 ] Compilation of some functions is not optimized for the target processor
Application run on the SKYLAKE micro-architecture while the code was specialized for skylake-avx512.
[ 2 / 2 ] Application is correctly profiled ("Others" category represents 11.47 % of the execution time)
To have a representative profiling, it is advised that the category "Others" represents less than 20% of the execution time in order to analyze as much as possible of the user code
[ 1 / 1 ] Lstopo present. The Topology lstopo report will be generated.
[ 0 / 0 ] Fastmath not used
Consider to add ffast-math to compilation flags (or replace -O3 with -Ofast) to unlock potential extra speedup by relaxing floating-point computation consistency. Warning: floating-point accuracy may be reduced and the compliance to IEEE/ISO rules/specifications for math functions will be relaxed, typically 'errno' will no longer be set after calling some math functions.
[ 4 / 4 ] Enough time of the experiment time spent in analyzed loops (86.43%)
If the time spent in analyzed loops is less than 30%, standard loop optimizations will have a limited impact on application performances.
[ 4 / 4 ] Threads activity is good
On average, more than 99.55% of observed threads are actually active
[ 4 / 4 ] CPU activity is good
CPU cores are active 99.55% of time
[ 4 / 4 ] Loop profile is not flat
At least one loop coverage is greater than 4% (43.41%), representing a hotspot for the application
[ 4 / 4 ] Enough time of the experiment time spent in analyzed innermost loops (76.18%)
If the time spent in analyzed innermost loops is less than 15%, standard innermost loop optimizations such as vectorisation will have a limited impact on application performances.
[ 4 / 4 ] Affinity is good (99.86%)
Threads are not migrating to CPU cores: probably successfully pinned
[ 3 / 3 ] Less than 10% (0.00%) is spent in BLAS1 operations
It could be more efficient to inline BLAS1 operations by hand
[ 3 / 3 ] Functions mostly use all threads
Functions running on a reduced number of threads (typically sequential code) cover less than 10% of application walltime (0.00%)
[ 3 / 3 ] Cumulative Outermost/In between loops coverage (10.25%) lower than cumulative innermost loop coverage (76.18%)
Having cumulative Outermost/In between loops coverage greater than cumulative innermost loop coverage will make loop optimization more complex
[ 2 / 2 ] Less than 10% (0.00%) is spent in BLAS2 operations
BLAS2 calls usually make poor use of the cache and could benefit from inlining.
[ 2 / 2 ] Less than 10% (1.57%) is spent in Libm/SVML (special functions)
| Loop ID | Analysis | Penalty Score |
|---|---|---|
| ►Loop 23 - libqmckl.so.0.0.0 | Execution Time: 43 % - Vectorization Ratio: 0.00 % - Vector Length Use: 12.50 % | |
| ►Loop Computation Issues | 2 | |
| ○ | [SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points. | 2 |
| ►Control Flow Issues | 1000 | |
| ○ | [SA] Too many paths (6561 paths) - Simplify control structure. There are 6561 issues ( = paths) costing 1 point, limited to 1000. | 1000 |
| ►Vectorization Roadblocks | 1000 | |
| ○ | [SA] Too many paths (6561 paths) - Simplify control structure. There are 6561 issues ( = paths) costing 1 point, limited to 1000. | 1000 |
| ►Loop 13 - libqmckl.so.0.0.0 | Execution Time: 15 % - Vectorization Ratio: 0.00 % - Vector Length Use: 12.50 % | |
| ►Data Access Issues | 16 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 3 issues ( = indirect data accesses) costing 4 point each. | 12 |
| ►Vectorization Roadblocks | 16 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 3 issues ( = indirect data accesses) costing 4 point each. | 12 |
| ►Loop 12 - libqmckl.so.0.0.0 | Execution Time: 9 % - Vectorization Ratio: 15.25 % - Vector Length Use: 16.44 % | |
| ►Loop Computation Issues | 2 | |
| ○ | [SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points. | 2 |
| ►Control Flow Issues | 1002 | |
| ○ | [SA] Too many paths (17053 paths) - Simplify control structure. There are 17053 issues ( = paths) costing 1 point, limited to 1000. | 1000 |
| ○ | [SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points. | 2 |
| ►Data Access Issues | 25 | |
| ○ | [SA] Presence of special instructions executing on a single port (INSERT/EXTRACT, BLEND/MERGE, BROADCAST, Other_packing) - Simplify data access and try to get stride 1 access. There are 25 issues (= instructions) costing 1 point each. | 25 |
| ►Vectorization Roadblocks | 1002 | |
| ○ | [SA] Too many paths (17053 paths) - Simplify control structure. There are 17053 issues ( = paths) costing 1 point, limited to 1000. | 1000 |
| ○ | [SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points. | 2 |
| ►Inefficient Vectorization | 25 | |
| ○ | [SA] Presence of special instructions executing on a single port (INSERT/EXTRACT, BLEND/MERGE, BROADCAST, Other_packing) - Simplify data access and try to get stride 1 access. There are 25 issues (= instructions) costing 1 point each. | 25 |
| ►Loop 20 - libqmckl.so.0.0.0 | Execution Time: 4 % - Vectorization Ratio: 100.00 % - Vector Length Use: 50.00 % | |
| ►Data Access Issues | 4 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ►Vectorization Roadblocks | 4 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ►Loop 21 - libqmckl.so.0.0.0 | Execution Time: 3 % - Vectorization Ratio: 0.00 % - Vector Length Use: 12.50 % | |
| ►Data Access Issues | 12 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 2 issues ( = indirect data accesses) costing 4 point each. | 8 |
| ►Vectorization Roadblocks | 12 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 2 issues ( = indirect data accesses) costing 4 point each. | 8 |
| ►Loop 15 - libqmckl.so.0.0.0 | Execution Time: 3 % - Vectorization Ratio: 0.00 % - Vector Length Use: 12.50 % | |
| ►Data Access Issues | 12 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 2 issues ( = indirect data accesses) costing 4 point each. | 8 |
| ►Vectorization Roadblocks | 12 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 2 issues ( = indirect data accesses) costing 4 point each. | 8 |
| ►Loop 16 - libqmckl.so.0.0.0 | Execution Time: 2 % - Vectorization Ratio: 37.50 % - Vector Length Use: 23.44 % | |
| ►Loop Computation Issues | 4 | |
| ○ | [SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points. | 4 |
| ►Data Access Issues | 16 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of special instructions executing on a single port (INSERT/EXTRACT, BLEND/MERGE, Other_packing) - Simplify data access and try to get stride 1 access. There are 12 issues (= instructions) costing 1 point each. | 12 |
| ►Vectorization Roadblocks | 4 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ►Inefficient Vectorization | 12 | |
| ○ | [SA] Presence of special instructions executing on a single port (INSERT/EXTRACT, BLEND/MERGE, Other_packing) - Simplify data access and try to get stride 1 access. There are 12 issues (= instructions) costing 1 point each. | 12 |
| ►Loop 188 - libqmckl.so.0.0.0 | Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 50.00 % | |
| ►Loop Computation Issues | 4 | |
| ○ | [SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points. | 4 |
| ►Loop 22 - libqmckl.so.0.0.0 | Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 11.76 % | |
| ►Loop Computation Issues | 2 | |
| ○ | [SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points. | 2 |
| ►Control Flow Issues | 2 | |
| ○ | [SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points. | 2 |
| ►Vectorization Roadblocks | 1002 | |
| ○ | [SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point. | 1000 |
| ○ | [SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points. | 2 |
| ►Loop 618 - libqmckl.so.0.0.0 | Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 50.00 % | |
| ►Loop Computation Issues | 32 | |
| ○ | [SA] Presence of expensive FP instructions - Perform hoisting, change algorithm, use SVML or proper numerical library or perform value profiling (count the number of distinct input values). There are 8 issues (= instructions) costing 4 points each. | 32 |
| ►Data Access Issues | 2 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 1 issues ( = data accesses) costing 2 point each. | 2 |
| ►Vectorization Roadblocks | 2 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 1 issues ( = data accesses) costing 2 point each. | 2 |
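Several loops in the table above are flagged for constant non-unit-stride data access, with loop interchange listed as one remedy. The report does not show the source of those loops, so the following C sketch uses a made-up example: summing a row-major matrix stored in a flat array. The function names and the data layout are assumptions for illustration only.

```c
#include <stddef.h>

/* Strided version: the inner loop walks down a column of a row-major
 * matrix, so consecutive iterations touch addresses that are `ncols`
 * doubles apart (constant non-unit stride). */
double sum_strided(const double *a, size_t nrows, size_t ncols)
{
    double total = 0.0;
    for (size_t j = 0; j < ncols; ++j) {
        for (size_t i = 0; i < nrows; ++i) {
            total += a[i * ncols + j];
        }
    }
    return total;
}

/* Interchanged version: the loops are swapped so that the inner loop
 * walks along a row, touching consecutive addresses (stride 1). */
double sum_unit_stride(const double *a, size_t nrows, size_t ncols)
{
    double total = 0.0;
    for (size_t i = 0; i < nrows; ++i) {
        for (size_t j = 0; j < ncols; ++j) {
            total += a[i * ncols + j];
        }
    }
    return total;
}
```

Both functions visit the same elements; only the order changes. For floating-point data the summation order can alter the rounding of the result, so an interchange of this kind should be validated against the original output.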
[ 4 / 4 ] Application profile is long enough (14.35 s)
To have good quality measurements, it is advised that the application profiling time is greater than 10 seconds.
[ 3 / 3 ] Optimization level option is correctly used
[ 3 / 3 ] Most of time spent in analyzed modules comes from functions compiled with -g and -fno-omit-frame-pointer
The -g option gives access to debugging information, such as source locations. -fno-omit-frame-pointer improves the accuracy of callchains found during the application profiling.
[ 3 / 3 ] Host configuration allows retrieval of all necessary metrics.
[ 0 / 3 ] Compilation of some functions is not optimized for the target processor
The application ran on the SKYLAKE micro-architecture while the code was specialized for skylake-avx512.
[ 2 / 2 ] Application is correctly profiled ("Others" category represents 11.29 % of the execution time)
To obtain a representative profile, it is advised that the "Others" category represent less than 20% of the execution time, so that as much of the user code as possible is analyzed.
[ 1 / 1 ] Lstopo present. The Topology lstopo report will be generated.
[ 0 / 0 ] Fastmath not used
Consider adding -ffast-math to the compilation flags (or replacing -O3 with -Ofast) to unlock potential extra speedup by relaxing floating-point computation consistency. Warning: floating-point accuracy may be reduced and compliance with IEEE/ISO rules and specifications for math functions will be relaxed; typically, 'errno' will no longer be set after calling some math functions.
[ 4 / 4 ] Enough of the experiment time is spent in analyzed loops (87.04%)
If the time spent in analyzed loops is less than 30%, standard loop optimizations will have a limited impact on application performance.
[ 4 / 4 ] Threads activity is good
On average, more than 99.54% of observed threads are actually active
[ 4 / 4 ] CPU activity is good
CPU cores are active 99.54% of time
[ 4 / 4 ] Loop profile is not flat
At least one loop coverage is greater than 4% (45.54%), representing a hotspot for the application
[ 4 / 4 ] Enough of the experiment time is spent in analyzed innermost loops (77.77%)
If the time spent in analyzed innermost loops is less than 15%, standard innermost loop optimizations such as vectorization will have a limited impact on application performance.
[ 4 / 4 ] Affinity is good (99.74%)
Threads are not migrating between CPU cores: they are probably successfully pinned
[ 3 / 3 ] Less than 10% (0.00%) is spent in BLAS1 operations
Inlining BLAS1 operations by hand could be more efficient
[ 3 / 3 ] Functions mostly use all threads
Functions running on a reduced number of threads (typically sequential code) cover less than 10% of application walltime (0.00%)
[ 3 / 3 ] Cumulative Outermost/In between loops coverage (9.27%) lower than cumulative innermost loop coverage (77.77%)
Having cumulative Outermost/In between loops coverage greater than cumulative innermost loop coverage will make loop optimization more complex
[ 2 / 2 ] Less than 10% (0.00%) is spent in BLAS2 operations
BLAS2 calls often make poor use of the cache and could benefit from inlining.
[ 2 / 2 ] Less than 10% (1.46%) is spent in Libm/SVML (special functions)
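The "Fastmath not used" item above warns that relaxing floating-point consistency may reduce accuracy. One reason is that fast-math style options allow the compiler to reassociate additions, and floating-point addition is not associative. The C sketch below makes the two groupings explicit; the function names are hypothetical, and it assumes IEEE 754 double arithmetic compiled without fast-math.

```c
/* Left-to-right grouping, as the C source order requires when
 * value-changing optimizations are disabled. */
double sum_left_to_right(double a, double b, double c)
{
    return (a + b) + c;
}

/* A different grouping of the same three terms. Under fast-math
 * style options a compiler is allowed to pick a grouping like this
 * one, for example to vectorize a reduction. */
double sum_reassociated(double a, double b, double c)
{
    return a + (b + c);
}
```

With the inputs `1e16`, `-1e16` and `1.0`, the first grouping returns `1.0` while the second returns `0.0`, because `-1e16 + 1.0` rounds back to `-1e16` in double precision. Whether differences of this kind matter for the profiled application has to be checked against its own accuracy requirements.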
| Loop ID | Analysis | Penalty Score |
|---|---|---|
| ►Loop 23 - libqmckl.so.0.0.0 | Execution Time: 45 % - Vectorization Ratio: 0.00 % - Vector Length Use: 12.50 % | |
| ►Loop Computation Issues | 2 | |
| ○ | [SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points. | 2 |
| ►Control Flow Issues | 1000 | |
| ○ | [SA] Too many paths (6561 paths) - Simplify control structure. There are 6561 issues ( = paths) costing 1 point, limited to 1000. | 1000 |
| ►Vectorization Roadblocks | 1000 | |
| ○ | [SA] Too many paths (6561 paths) - Simplify control structure. There are 6561 issues ( = paths) costing 1 point, limited to 1000. | 1000 |
| ►Loop 13 - libqmckl.so.0.0.0 | Execution Time: 15 % - Vectorization Ratio: 0.00 % - Vector Length Use: 12.50 % | |
| ►Data Access Issues | 16 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 3 issues ( = indirect data accesses) costing 4 point each. | 12 |
| ►Vectorization Roadblocks | 16 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 3 issues ( = indirect data accesses) costing 4 point each. | 12 |
| ►Loop 12 - libqmckl.so.0.0.0 | Execution Time: 8 % - Vectorization Ratio: 15.25 % - Vector Length Use: 16.44 % | |
| ►Loop Computation Issues | 2 | |
| ○ | [SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points. | 2 |
| ►Control Flow Issues | 1002 | |
| ○ | [SA] Too many paths (17053 paths) - Simplify control structure. There are 17053 issues ( = paths) costing 1 point, limited to 1000. | 1000 |
| ○ | [SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points. | 2 |
| ►Data Access Issues | 25 | |
| ○ | [SA] Presence of special instructions executing on a single port (INSERT/EXTRACT, BLEND/MERGE, BROADCAST, Other_packing) - Simplify data access and try to get stride 1 access. There are 25 issues (= instructions) costing 1 point each. | 25 |
| ►Vectorization Roadblocks | 1002 | |
| ○ | [SA] Too many paths (17053 paths) - Simplify control structure. There are 17053 issues ( = paths) costing 1 point, limited to 1000. | 1000 |
| ○ | [SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points. | 2 |
| ►Inefficient Vectorization | 25 | |
| ○ | [SA] Presence of special instructions executing on a single port (INSERT/EXTRACT, BLEND/MERGE, BROADCAST, Other_packing) - Simplify data access and try to get stride 1 access. There are 25 issues (= instructions) costing 1 point each. | 25 |
| ►Loop 15 - libqmckl.so.0.0.0 | Execution Time: 4 % - Vectorization Ratio: 0.00 % - Vector Length Use: 12.50 % | |
| ►Data Access Issues | 12 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 2 issues ( = indirect data accesses) costing 4 point each. | 8 |
| ►Vectorization Roadblocks | 12 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 2 issues ( = indirect data accesses) costing 4 point each. | 8 |
| ►Loop 21 - libqmckl.so.0.0.0 | Execution Time: 3 % - Vectorization Ratio: 0.00 % - Vector Length Use: 12.50 % | |
| ►Data Access Issues | 12 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 2 issues ( = indirect data accesses) costing 4 point each. | 8 |
| ►Vectorization Roadblocks | 12 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of indirect accesses - Use array restructuring or gather instructions to lower the cost. There are 2 issues ( = indirect data accesses) costing 4 point each. | 8 |
| ►Loop 20 - libqmckl.so.0.0.0 | Execution Time: 3 % - Vectorization Ratio: 100.00 % - Vector Length Use: 50.00 % | |
| ►Data Access Issues | 4 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ►Vectorization Roadblocks | 4 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ►Loop 16 - libqmckl.so.0.0.0 | Execution Time: 2 % - Vectorization Ratio: 37.50 % - Vector Length Use: 23.44 % | |
| ►Loop Computation Issues | 4 | |
| ○ | [SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points. | 4 |
| ►Data Access Issues | 16 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ○ | [SA] Presence of special instructions executing on a single port (INSERT/EXTRACT, BLEND/MERGE, Other_packing) - Simplify data access and try to get stride 1 access. There are 12 issues (= instructions) costing 1 point each. | 12 |
| ►Vectorization Roadblocks | 4 | |
| ○ | [SA] Presence of constant non unit stride data access - Use array restructuring, perform loop interchange or use gather instructions to lower a bit the cost. There are 2 issues ( = data accesses) costing 2 point each. | 4 |
| ►Inefficient Vectorization | 12 | |
| ○ | [SA] Presence of special instructions executing on a single port (INSERT/EXTRACT, BLEND/MERGE, Other_packing) - Simplify data access and try to get stride 1 access. There are 12 issues (= instructions) costing 1 point each. | 12 |
| ►Loop 188 - libqmckl.so.0.0.0 | Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 50.00 % | |
| ►Loop Computation Issues | 4 | |
| ○ | [SA] Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA - Reorganize arithmetic expressions to exhibit potential for FMA. This issue costs 4 points. | 4 |
| ►Loop 22 - libqmckl.so.0.0.0 | Execution Time: 0 % - Vectorization Ratio: 0.00 % - Vector Length Use: 11.76 % | |
| ►Loop Computation Issues | 2 | |
| ○ | [SA] Presence of a large number of scalar integer instructions - Simplify loop structure, perform loop splitting or perform unroll and jam. This issue costs 2 points. | 2 |
| ►Control Flow Issues | 2 | |
| ○ | [SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points. | 2 |
| ►Vectorization Roadblocks | 1002 | |
| ○ | [SA] Too many paths (at least 1000 paths) - Simplify control structure. There are at least 1000 issues ( = paths) costing 1 point. | 1000 |
| ○ | [SA] Non innermost loop (InBetween) - Collapse loop with innermost ones. This issue costs 2 points. | 2 |
| ○Loop 619 - libqmckl.so.0.0.0 | Execution Time: 0 % - Vectorization Ratio: 100.00 % - Vector Length Use: 50.00 % | |