| Run mavx512 | Run xGRANITERAPIDS | Run xGRANITERAPIDS mprefer-vector-width=512 | Run xGRANITERAPIDS mprefer-vector-width=512 fp-model fast=2 |
| Loop Source Regions: /home/eoseret/llm-attention/attention.cpp: 30-31 | Loop Source Regions: /home/eoseret/llm-attention/attention.cpp: 30-31 | Loop Source Regions: /home/eoseret/llm-attention/attention.cpp: 30-31 | Loop Source Regions: /home/eoseret/llm-attention/attention.cpp: 30-31 |
| Run mavx512 |
| ASM Loop ID | Max Time Over Threads (s) | Time w.r.t. Wall Time (s) | Cov (%) | Vect. Ratio (%) | Vector Length Use (%) | GFLOP/s |
| 31 | 1.06 | 1.06 | 20.54 | 87.5 | 41.41 | 2.66 |
| 47 | 0.23 | 0.23 | 4.55 | 87.5 | 41.41 | 3.66 |
| 39 | 0.28 | 0.28 | 5.52 | 87.5 | 41.41 | 2.9 |
| 21 | 1.03 | 1.03 | 19.96 | 90.91 | 46.02 | 8.19 |
| 43 | 0.19 | 0.19 | 3.68 | 87.5 | 41.41 | 4.64 |

| Run xGRANITERAPIDS |
| ASM Loop ID | Max Time Over Threads (s) | Time w.r.t. Wall Time (s) | Cov (%) | Vect. Ratio (%) | Vector Length Use (%) | GFLOP/s |
| 51 | 0.38 | 0.38 | 8.03 | 0 | 8.1 | 6.61 |
| 24 | 1.53 | 1.53 | 32.42 | 0 | 8.1 | 6.74 |
| 35 | 1.58 | 1.58 | 33.26 | 0 | 8.1 | 6.3 |
| 46 | 0.39 | 0.38 | 8.13 | 0 | 8.1 | 6.46 |
| 56 | 0.39 | 0.38 | 8.13 | 0 | 8.1 | 6.48 |

| Run xGRANITERAPIDS mprefer-vector-width=512 |
| ASM Loop ID | Max Time Over Threads (s) | Time w.r.t. Wall Time (s) | Cov (%) | Vect. Ratio (%) | Vector Length Use (%) | GFLOP/s |
| 52 | 0.41 | 0.41 | 8.24 | 68.75 | 41.41 | 5.89 |
| 28 | 1.68 | 1.68 | 33.87 | 68.75 | 41.41 | 6.11 |
| 64 | 0.40 | 0.40 | 8.14 | 68.75 | 41.41 | 6.06 |
| 40 | 1.67 | 1.67 | 33.57 | 68.75 | 41.41 | 5.82 |
| 58 | 0.41 | 0.41 | 8.34 | 68.75 | 41.41 | 5.9 |

| Run xGRANITERAPIDS mprefer-vector-width=512 fp-model fast=2 |
| ASM Loop ID | Max Time Over Threads (s) | Time w.r.t. Wall Time (s) | Cov (%) | Vect. Ratio (%) | Vector Length Use (%) | GFLOP/s |
| 56 | 0.33 | 0.33 | 9.21 | 96.43 | 51.56 | 6.86 |
| 30 | 1.08 | 1.08 | 29.99 | 97.83 | 53.4 | 9.39 |
| 70 | 0.31 | 0.31 | 8.65 | 96 | 51.75 | 7.26 |
| 43 | 1.14 | 1.14 | 31.80 | 96.08 | 51.72 | 7.77 |
| 63 | 0.31 | 0.31 | 8.79 | 96 | 51.75 | 7.11 |

| Sum on 5 analyzed binary loops (attention-avx512 - 31, attention-avx512 - 47, attention-avx512 - 39, attention-avx512 - 21, attention-avx512 - 43) | Sum on 5 analyzed binary loops (attention-avx512 - 51, attention-avx512 - 24, attention-avx512 - 35, attention-avx512 - 46, attention-avx512 - 56) | Sum on 5 analyzed binary loops (attention-avx512 - 52, attention-avx512 - 28, attention-avx512 - 64, attention-avx512 - 40, attention-avx512 - 58) | Sum on 5 analyzed binary loops (attention-avx512 - 56, attention-avx512 - 30, attention-avx512 - 70, attention-avx512 - 43, attention-avx512 - 63) |
| Analysis | Count (Run mavx512) | Count (Run xGRANITERAPIDS) | Count (Run xGRANITERAPIDS mprefer-vector-width=512) | Count (Run xGRANITERAPIDS mprefer-vector-width=512 fp-model fast=2) |
| Loop Computation Issues | | | | |
| Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA | | | 1 | 0 |
| Presence of a large number of scalar integer instructions | | | 0 | 1 |
| Data Access Issues | | | | |
| Presence of constant non-unit stride data access | 1 | 1 | 1 | 1 |
| Presence of indirect access | 1 | 0 | 1 | 1 |
| Presence of expensive instructions: scatter/gather | 1 | 0 | 1 | 1 |
| Presence of special instructions executing on a single port | 1 | 0 | 1 | 1 |
| Vectorization Roadblocks | | | | |
| Presence of constant non-unit stride data access | 1 | 1 | 1 | 1 |
| Presence of indirect access | 1 | 0 | 1 | 1 |
| Inefficient Vectorization | | | | |
| Presence of expensive instructions: scatter/gather | 1 | | 1 | 1 |
| Presence of special instructions executing on a single port | 1 | | 1 | 1 |
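
The per-loop timings reported above can be aggregated per run to compare the four builds at a glance. The sketch below is only an illustration: it sums the "Max Time Over Threads (s)" values copied from the table for the five analyzed loops of each run. The names `loop_times` and `total_loop_time` are hypothetical and not part of the report, and the sum is assumed to be a reasonable rough aggregate even though each value is a per-loop maximum over threads.

```python
# "Max Time Over Threads (s)" for the five analyzed loops of each run,
# copied from the report table. Names and layout are illustrative only.
loop_times = {
    "mavx512": [1.06, 0.23, 0.28, 1.03, 0.19],
    "xGRANITERAPIDS": [0.38, 1.53, 1.58, 0.39, 0.39],
    "xGRANITERAPIDS mprefer-vector-width=512": [0.41, 1.68, 0.40, 1.67, 0.41],
    "xGRANITERAPIDS mprefer-vector-width=512 fp-model fast=2": [0.33, 1.08, 0.31, 1.14, 0.31],
}


def total_loop_time(times):
    """Sum per-loop times, rounded to the table's two-decimal precision."""
    return round(sum(times), 2)


# Assumption: summing per-loop maxima gives a rough aggregate, not an exact wall time.
totals = {run: total_loop_time(times) for run, times in loop_times.items()}

for run, total in totals.items():
    print(f"{run}: {total:.2f} s")
```

Under these assumptions the five loops total 2.79 s for the mavx512 run and 3.17 s for the run with fp-model fast=2, while the two intermediate xGRANITERAPIDS runs total 4.27 s and 4.57 s.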