| Loop Id: 22 | Module: attention-gcc-znver5-256 | Source: random.tcc:458-3374 [...] | Coverage: 0.45% |
|---|
| Loop Id: 22 | Module: attention-gcc-znver5-256 | Source: random.tcc:458-3374 [...] | Coverage: 0.45% |
|---|
0x401940 MOV -0x13c0(%RBP,%RAX,8),%RDI [2] |
0x401948 LEA 0x1(%RAX),%RCX |
0x40194c ADD $0x4,%RBX |
0x401950 MOV %RCX,-0x40(%RBP) [2] |
0x401954 MOV %RDI,%RAX |
0x401957 SHR $0xb,%RAX |
0x40195b MOV %EAX,%EAX |
0x40195d XOR %RDI,%RAX |
0x401960 MOV %RAX,%RDI |
0x401963 SAL $0x7,%RDI |
0x401967 AND $-0x62d3a980,%EDI |
0x40196d XOR %RAX,%RDI |
0x401970 MOV %RDI,%RAX |
0x401973 SAL $0xf,%RAX |
0x401977 AND $-0x103a0000,%EAX |
0x40197c XOR %RDI,%RAX |
0x40197f MOV %RAX,%RDI |
0x401982 SHR $0x12,%RDI |
0x401986 XOR %RDI,%RAX |
0x401989 VCVTUSI2SS %RAX,%XMM1,%XMM0 |
0x40198f VMULSS %XMM2,%XMM0,%XMM0 |
0x401993 VCMPSS $0x2,%XMM0,%XMM3,%XMM4 |
0x401998 VBLENDVPS %XMM4,%XMM5,%XMM0,%XMM0 |
0x40199e VMOVSS %XMM0,-0x4(%RBX) [1] |
0x4019a3 CMP %R12,%RBX |
0x4019a6 JE 4019e5 |
0x4019a8 MOV %RCX,%RAX |
0x4019ab CMP $0x26f,%RCX |
0x4019b2 JBE 401940 |
0x4019b4 LEA -0x13c0(%RBP),%RDI |
0x4019bb CALL 403200 <_ZNSt23mersenne_twister_engineImLm32ELm624ELm397ELm31ELm2567483615ELm11ELm4294967295ELm7ELm2636928640ELm15ELm4022730752ELm18ELm1812433253EE11_M_gen_randEv> |
0x4019c0 VMOVSS 0x2840(%RIP),%XMM5 [3] |
0x4019c8 VMOVSS 0x2840(%RIP),%XMM3 [3] |
0x4019d0 VMOVSS 0x2834(%RIP),%XMM2 [3] |
0x4019d8 MOV -0x40(%RBP),%RAX [2] |
0x4019dc VXORPS %XMM1,%XMM1,%XMM1 |
0x4019e0 JMP 401940 |
/cluster/comp/gcc/15.1.0/include/c++/15.1.0/bits/random.tcc: 458 - 3374 |
-------------------------------------------------------------------------------- |
458: if (_M_p >= state_size) |
459: _M_gen_rand(); |
460: |
461: // Calculate o(x(i)). |
462: result_type __z = _M_x[_M_p++]; |
463: __z ^= (__z >> __u) & __d; |
464: __z ^= (__z << __s) & __b; |
465: __z ^= (__z << __t) & __c; |
466: __z ^= (__z >> __l); |
[...] |
3367: __sum += _RealType(__urng() - __urng.min()) * __tmp; |
3368: __tmp *= __r; |
3369: } |
3370: __ret = __sum / __tmp; |
3371: if (__builtin_expect(__ret >= _RealType(1), 0)) |
3372: { |
3373: #if _GLIBCXX_USE_C99_MATH_FUNCS |
3374: __ret = std::nextafter(_RealType(1), _RealType(0)); |
/home/eoseret/llm-attention/attention_v2.cpp: 163 - 163 |
-------------------------------------------------------------------------------- |
163: for (size_t i = 0; i < elemsX; ++i) h_X[i] = dist(rng); |
| Coverage (%) | Name | Source Location | Module |
|---|
| min | med | avg | max |
|---|---|---|---|
| Percentile Index | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 |
|---|---|---|---|---|---|---|---|---|---|---|
| Value |
| min | med | avg | max |
|---|---|---|---|
| Percentile Index | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 |
|---|---|---|---|---|---|---|---|---|---|---|
| Value |
| Path / |
| Metric | Value |
|---|---|
| CQA speedup if no scalar integer | 2.96 |
| CQA speedup if FP arith vectorized | 2.57 |
| CQA speedup if fully vectorized | 12.58 |
| CQA speedup if no inter-iteration dependency | NA |
| CQA speedup if next bottleneck killed | 1.55 |
| Bottlenecks | |
| Function | main |
| Source | random.tcc:458-466,random.tcc:3367-3374,attention_v2.cpp:163-163 |
| Source loop unroll info | unrolled by 3 |
| Source loop unroll confidence level | max |
| Unroll/vectorization loop type | peel/tail |
| Unroll factor | 1 |
| CQA cycles | 4.25 |
| CQA cycles if no scalar integer | 1.44 |
| CQA cycles if FP arith vectorized | 1.65 |
| CQA cycles if fully vectorized | 0.34 |
| Front-end cycles | 4.25 |
| P0 cycles | 2.75 |
| P1 cycles | 2.75 |
| P2 cycles | 2.75 |
| P3 cycles | 2.75 |
| P4 cycles | 2.75 |
| P5 cycles | 2.75 |
| P6 cycles | 1.25 |
| P7 cycles | 1.25 |
| P8 cycles | 1.25 |
| P9 cycles | 1.25 |
| P10 cycles | 1.00 |
| P11 cycles | 1.00 |
| P12 cycles | 1.00 |
| P13 cycles | 1.00 |
| P14 cycles | 0.50 |
| P15 cycles | 0.50 |
| DIV/SQRT cycles | 0.00 |
| Inter-iter dependencies cycles | 1 |
| FE+BE cycles (UFS) | NA |
| Stall cycles (UFS) | NA |
| Nb insns | 33.00 |
| Nb uops | 34.00 |
| Nb loads | 3.00 |
| Nb stores | 2.00 |
| Nb stack references | 1.50 |
| FLOP/cycle | 0.24 |
| Nb FLOP add-sub | 0.00 |
| Nb FLOP mul | 1.00 |
| Nb FLOP fma | 0.00 |
| Nb FLOP div | 0.00 |
| Nb FLOP rcp | 0.00 |
| Nb FLOP sqrt | 0.00 |
| Nb FLOP rsqrt | 0.00 |
| Bytes/cycle | 6.86 |
| Bytes prefetched | 0.00 |
| Bytes loaded | 18.00 |
| Bytes stored | 12.00 |
| Stride 0 | 1.00 |
| Stride 1 | 1.00 |
| Stride n | 0.00 |
| Stride unknown | 0.50 |
| Stride indirect | 0.00 |
| Vectorization ratio all | 9.13 |
| Vectorization ratio load | 0.00 |
| Vectorization ratio store | 0.00 |
| Vectorization ratio mul | 0.00 |
| Vectorization ratio add_sub | NA |
| Vectorization ratio fma | NA |
| Vectorization ratio div_sqrt | NA |
| Vectorization ratio other | 14.09 |
| Vector-efficiency ratio all | 11.53 |
| Vector-efficiency ratio load | 10.16 |
| Vector-efficiency ratio store | 9.38 |
| Vector-efficiency ratio mul | 6.25 |
| Vector-efficiency ratio add_sub | NA |
| Vector-efficiency ratio fma | NA |
| Vector-efficiency ratio div_sqrt | NA |
| Vector-efficiency ratio other | 13.07 |
| Metric | Value |
|---|---|
| CQA speedup if no scalar integer | 2.79 |
| CQA speedup if FP arith vectorized | 2.55 |
| CQA speedup if fully vectorized | 12.87 |
| CQA speedup if no inter-iteration dependency | NA |
| CQA speedup if next bottleneck killed | 1.63 |
| Bottlenecks | micro-operation queue, |
| Function | main |
| Source | random.tcc:458-466,random.tcc:3367-3374,attention_v2.cpp:163-163 |
| Source loop unroll info | unrolled by 3 |
| Source loop unroll confidence level | max |
| Unroll/vectorization loop type | peel/tail |
| Unroll factor | 1 |
| CQA cycles | 4.88 |
| CQA cycles if no scalar integer | 1.75 |
| CQA cycles if FP arith vectorized | 1.91 |
| CQA cycles if fully vectorized | 0.38 |
| Front-end cycles | 4.88 |
| P0 cycles | 3.00 |
| P1 cycles | 3.00 |
| P2 cycles | 3.00 |
| P3 cycles | 3.00 |
| P4 cycles | 3.00 |
| P5 cycles | 3.00 |
| P6 cycles | 1.75 |
| P7 cycles | 1.75 |
| P8 cycles | 1.75 |
| P9 cycles | 1.75 |
| P10 cycles | 1.00 |
| P11 cycles | 1.00 |
| P12 cycles | 1.00 |
| P13 cycles | 1.00 |
| P14 cycles | 0.50 |
| P15 cycles | 0.50 |
| DIV/SQRT cycles | 0.00 |
| Inter-iter dependencies cycles | 1 |
| FE+BE cycles (UFS) | NA |
| Stall cycles (UFS) | NA |
| Nb insns | 37.00 |
| Nb uops | 39.00 |
| Nb loads | 5.00 |
| Nb stores | 2.00 |
| Nb stack references | 2.00 |
| FLOP/cycle | 0.21 |
| Nb FLOP add-sub | 0.00 |
| Nb FLOP mul | 1.00 |
| Nb FLOP fma | 0.00 |
| Nb FLOP div | 0.00 |
| Nb FLOP rcp | 0.00 |
| Nb FLOP sqrt | 0.00 |
| Nb FLOP rsqrt | 0.00 |
| Bytes/cycle | 8.21 |
| Bytes prefetched | 0.00 |
| Bytes loaded | 28.00 |
| Bytes stored | 12.00 |
| Stride 0 | 2.00 |
| Stride 1 | 1.00 |
| Stride n | 0.00 |
| Stride unknown | 0.00 |
| Stride indirect | 0.00 |
| Vectorization ratio all | 11.11 |
| Vectorization ratio load | 0.00 |
| Vectorization ratio store | 0.00 |
| Vectorization ratio mul | 0.00 |
| Vectorization ratio add_sub | NA |
| Vectorization ratio fma | NA |
| Vectorization ratio div_sqrt | NA |
| Vectorization ratio other | 18.18 |
| Vector-efficiency ratio all | 11.46 |
| Vector-efficiency ratio load | 7.81 |
| Vector-efficiency ratio store | 9.38 |
| Vector-efficiency ratio mul | 6.25 |
| Vector-efficiency ratio add_sub | NA |
| Vector-efficiency ratio fma | NA |
| Vector-efficiency ratio div_sqrt | NA |
| Vector-efficiency ratio other | 13.64 |
| Metric | Value |
|---|---|
| CQA speedup if no scalar integer | 3.22 |
| CQA speedup if FP arith vectorized | 2.59 |
| CQA speedup if fully vectorized | 12.21 |
| CQA speedup if no inter-iteration dependency | NA |
| CQA speedup if next bottleneck killed | 1.45 |
| Bottlenecks | micro-operation queue, |
| Function | main |
| Source | random.tcc:458-466,random.tcc:3367-3374,attention_v2.cpp:163-163 |
| Source loop unroll info | unrolled by 3 |
| Source loop unroll confidence level | max |
| Unroll/vectorization loop type | peel/tail |
| Unroll factor | 1 |
| CQA cycles | 3.63 |
| CQA cycles if no scalar integer | 1.13 |
| CQA cycles if FP arith vectorized | 1.40 |
| CQA cycles if fully vectorized | 0.30 |
| Front-end cycles | 3.63 |
| P0 cycles | 2.50 |
| P1 cycles | 2.50 |
| P2 cycles | 2.50 |
| P3 cycles | 2.50 |
| P4 cycles | 2.50 |
| P5 cycles | 2.50 |
| P6 cycles | 0.75 |
| P7 cycles | 0.75 |
| P8 cycles | 0.75 |
| P9 cycles | 0.75 |
| P10 cycles | 1.00 |
| P11 cycles | 1.00 |
| P12 cycles | 1.00 |
| P13 cycles | 1.00 |
| P14 cycles | 0.50 |
| P15 cycles | 0.50 |
| DIV/SQRT cycles | 0.00 |
| Inter-iter dependencies cycles | 1 |
| FE+BE cycles (UFS) | NA |
| Stall cycles (UFS) | NA |
| Nb insns | 29.00 |
| Nb uops | 29.00 |
| Nb loads | 1.00 |
| Nb stores | 2.00 |
| Nb stack references | 1.00 |
| FLOP/cycle | 0.28 |
| Nb FLOP add-sub | 0.00 |
| Nb FLOP mul | 1.00 |
| Nb FLOP fma | 0.00 |
| Nb FLOP div | 0.00 |
| Nb FLOP rcp | 0.00 |
| Nb FLOP sqrt | 0.00 |
| Nb FLOP rsqrt | 0.00 |
| Bytes/cycle | 5.52 |
| Bytes prefetched | 0.00 |
| Bytes loaded | 8.00 |
| Bytes stored | 12.00 |
| Stride 0 | 0.00 |
| Stride 1 | 1.00 |
| Stride n | 0.00 |
| Stride unknown | 1.00 |
| Stride indirect | 0.00 |
| Vectorization ratio all | 7.14 |
| Vectorization ratio load | 0.00 |
| Vectorization ratio store | 0.00 |
| Vectorization ratio mul | 0.00 |
| Vectorization ratio add_sub | NA |
| Vectorization ratio fma | NA |
| Vectorization ratio div_sqrt | NA |
| Vectorization ratio other | 10.00 |
| Vector-efficiency ratio all | 11.61 |
| Vector-efficiency ratio load | 12.50 |
| Vector-efficiency ratio store | 9.38 |
| Vector-efficiency ratio mul | 6.25 |
| Vector-efficiency ratio add_sub | NA |
| Vector-efficiency ratio fma | NA |
| Vector-efficiency ratio div_sqrt | NA |
| Vector-efficiency ratio other | 12.50 |
| Path / |
| Function | main |
| Source file and lines | random.tcc:458-3374 |
| Module | attention-gcc-znver5-256 |
| nb instructions | 33 |
| nb uops | 34 |
| loop length | 140.50 |
| used x86 registers | 6 |
| used mmx registers | 0 |
| used xmm registers | 6 |
| used ymm registers | 0 |
| used zmm registers | 0 |
| nb stack references | 1.50 |
| micro-operation queue | 4.25 cycles |
| front end | 4.25 cycles |
| P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | P11 | P12 | P13 | P14 | P15 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| uops | 2.75 | 2.75 | 2.75 | 2.75 | 2.75 | 2.75 | 1.25 | 1.25 | 1.25 | 1.25 | 1.00 | 1.00 | 1.00 | 1.00 | 0.50 | 0.50 |
| cycles | 2.75 | 2.75 | 2.75 | 2.75 | 2.75 | 2.75 | 1.25 | 1.25 | 1.25 | 1.25 | 1.00 | 1.00 | 1.00 | 1.00 | 0.50 | 0.50 |
| Cycles executing div or sqrt instructions | NA |
| Longest recurrence chain latency (RecMII) | 1.00 |
| Front-end | 4.25 |
| Dispatch | 2.75 |
| Data deps. | 1.00 |
| Overall L1 | 4.25 |
| all | 0% |
| load | 0% |
| store | 0% |
| mul | NA (no mul vectorizable/vectorized instructions) |
| add-sub | NA (no add-sub vectorizable/vectorized instructions) |
| fma | NA (no fma vectorizable/vectorized instructions) |
| other | 0% |
| all | 25% |
| load | 0% |
| store | 0% |
| mul | 0% |
| add-sub | NA (no add-sub vectorizable/vectorized instructions) |
| fma | NA (no fma vectorizable/vectorized instructions) |
| div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
| other | 58% |
| all | 9% |
| load | 0% |
| store | 0% |
| mul | 0% |
| add-sub | NA (no add-sub vectorizable/vectorized instructions) |
| fma | NA (no fma vectorizable/vectorized instructions) |
| div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
| other | 14% |
| all | 11% |
| load | 12% |
| store | 12% |
| mul | NA (no mul vectorizable/vectorized instructions) |
| add-sub | NA (no add-sub vectorizable/vectorized instructions) |
| fma | NA (no fma vectorizable/vectorized instructions) |
| other | 11% |
| all | 10% |
| load | 6% |
| store | 6% |
| mul | 6% |
| add-sub | NA (no add-sub vectorizable/vectorized instructions) |
| fma | NA (no fma vectorizable/vectorized instructions) |
| div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
| other | 17% |
| all | 11% |
| load | 10% |
| store | 9% |
| mul | 6% |
| add-sub | NA (no add-sub vectorizable/vectorized instructions) |
| fma | NA (no fma vectorizable/vectorized instructions) |
| div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
| other | 13% |
| Function | main |
| Source file and lines | random.tcc:458-3374 |
| Module | attention-gcc-znver5-256 |
| nb instructions | 37 |
| nb uops | 39 |
| loop length | 165 |
| used x86 registers | 6 |
| used mmx registers | 0 |
| used xmm registers | 6 |
| used ymm registers | 0 |
| used zmm registers | 0 |
| nb stack references | 2 |
| micro-operation queue | 4.88 cycles |
| front end | 4.88 cycles |
| P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | P11 | P12 | P13 | P14 | P15 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| uops | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 1.75 | 1.75 | 1.75 | 1.75 | 1.00 | 1.00 | 1.00 | 1.00 | 0.50 | 0.50 |
| cycles | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 1.75 | 1.75 | 1.75 | 1.75 | 1.00 | 1.00 | 1.00 | 1.00 | 0.50 | 0.50 |
| Cycles executing div or sqrt instructions | NA |
| Longest recurrence chain latency (RecMII) | 1.00 |
| Front-end | 4.88 |
| Dispatch | 3.00 |
| Data deps. | 1.00 |
| Overall L1 | 4.88 |
| all | 0% |
| load | 0% |
| store | 0% |
| mul | NA (no mul vectorizable/vectorized instructions) |
| add-sub | NA (no add-sub vectorizable/vectorized instructions) |
| fma | NA (no fma vectorizable/vectorized instructions) |
| other | 0% |
| all | 25% |
| load | 0% |
| store | 0% |
| mul | 0% |
| add-sub | NA (no add-sub vectorizable/vectorized instructions) |
| fma | NA (no fma vectorizable/vectorized instructions) |
| div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
| other | 66% |
| all | 11% |
| load | 0% |
| store | 0% |
| mul | 0% |
| add-sub | NA (no add-sub vectorizable/vectorized instructions) |
| fma | NA (no fma vectorizable/vectorized instructions) |
| div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
| other | 18% |
| all | 11% |
| load | 12% |
| store | 12% |
| mul | NA (no mul vectorizable/vectorized instructions) |
| add-sub | NA (no add-sub vectorizable/vectorized instructions) |
| fma | NA (no fma vectorizable/vectorized instructions) |
| other | 11% |
| all | 10% |
| load | 6% |
| store | 6% |
| mul | 6% |
| add-sub | NA (no add-sub vectorizable/vectorized instructions) |
| fma | NA (no fma vectorizable/vectorized instructions) |
| div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
| other | 18% |
| all | 11% |
| load | 7% |
| store | 9% |
| mul | 6% |
| add-sub | NA (no add-sub vectorizable/vectorized instructions) |
| fma | NA (no fma vectorizable/vectorized instructions) |
| div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
| other | 13% |
| Instruction | Nb FU | P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | P11 | P12 | P13 | P14 | P15 | Latency | Recip. throughput | Vectorization |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MOV -0x13c0(%RBP,%RAX,8),%RDI | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.25 | 0.25 | 0.25 | 0.25 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.25 | scal (12.5%) |
| LEA 0x1(%RAX),%RCX | 1 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 | N/A |
| ADD $0x4,%RBX | 1 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 | N/A |
| MOV %RCX,-0x40(%RBP) | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.25 | 0.25 | 0.25 | 0.25 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 | scal (12.5%) |
| MOV %RDI,%RAX | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.13 | N/A |
| SHR $0xb,%RAX | 1 | 0 | 0 | 0 | 0.33 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.33 | N/A |
| MOV %EAX,%EAX | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.13 | N/A |
| XOR %RDI,%RAX | 1 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 | N/A |
| MOV %RAX,%RDI | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.13 | scal (12.5%) |
| SAL $0x7,%RDI | 1 | 0 | 0 | 0 | 0.33 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.33 | scal (12.5%) |
| AND $-0x62d3a980,%EDI | 1 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 | scal (6.3%) |
| XOR %RAX,%RDI | 1 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 | scal (12.5%) |
| MOV %RDI,%RAX | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.13 | N/A |
| SAL $0xf,%RAX | 1 | 0 | 0 | 0 | 0.33 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.33 | N/A |
| AND $-0x103a0000,%EAX | 1 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 | N/A |
| XOR %RDI,%RAX | 1 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 | N/A |
| MOV %RAX,%RDI | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.13 | scal (12.5%) |
| SHR $0x12,%RDI | 1 | 0 | 0 | 0 | 0.33 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.33 | scal (12.5%) |
| XOR %RDI,%RAX | 1 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 | N/A |
| VCVTUSI2SS %RAX,%XMM1,%XMM0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 3-9 | 1 | scal (12.5%) |
| VMULSS %XMM2,%XMM0,%XMM0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 3 | 0.50 | scal (6.3%) |
| VCMPSS $0x2,%XMM0,%XMM3,%XMM4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 2 | 0.50 | scal (6.3%) |
| VBLENDVPS %XMM4,%XMM5,%XMM0,%XMM0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 2 | 0.50 | vect (25.0%) |
| VMOVSS %XMM0,-0x4(%RBX) | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.25 | 0.25 | 0.25 | 0.25 | 0 | 0 | 0 | 0 | 0.50 | 0.50 | 1 | 0.50 | scal (6.3%) |
| CMP %R12,%RBX | 1 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 | scal (12.5%) |
| JE 4019e5 <main+0x6e5> | 1 | 0 | 0 | 0 | 0.33 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33-0.50 | N/A |
| MOV %RCX,%RAX | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.13 | N/A |
| CMP $0x26f,%RCX | 1 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 | N/A |
| JBE 401940 <main+0x640> | 1 | 0 | 0 | 0 | 0.33 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33-0.50 | N/A |
| LEA -0x13c0(%RBP),%RDI | 1 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 | N/A |
| CALL 403200 <_ZNSt23mersenne_twister_engineImLm32ELm624ELm397ELm31ELm2567483615ELm11ELm4294967295ELm7ELm2636928640ELm15ELm4022730752ELm18ELm1812433253EE11_M_gen_randEv> | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | N/A |
| VMOVSS 0x2840(%RIP),%XMM5 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.25 | 0.25 | 0.25 | 0.25 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.50 | scal (6.3%) |
| VMOVSS 0x2840(%RIP),%XMM3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.25 | 0.25 | 0.25 | 0.25 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.50 | scal (6.3%) |
| VMOVSS 0x2834(%RIP),%XMM2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.25 | 0.25 | 0.25 | 0.25 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.50 | scal (6.3%) |
| MOV -0x40(%RBP),%RAX | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.25 | 0.25 | 0.25 | 0.25 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.25 | N/A |
| VXORPS %XMM1,%XMM1,%XMM1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.17 | vect (25.0%) |
| JMP 401940 <main+0x640> | 1 | 0 | 0 | 0 | 0.33 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 | N/A |
| Function | main |
| Source file and lines | random.tcc:458-3374 |
| Module | attention-gcc-znver5-256 |
| nb instructions | 29 |
| nb uops | 29 |
| loop length | 116 |
| used x86 registers | 6 |
| used mmx registers | 0 |
| used xmm registers | 6 |
| used ymm registers | 0 |
| used zmm registers | 0 |
| nb stack references | 1 |
| micro-operation queue | 3.63 cycles |
| front end | 3.63 cycles |
| P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | P11 | P12 | P13 | P14 | P15 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| uops | 2.50 | 2.50 | 2.50 | 2.50 | 2.50 | 2.50 | 0.75 | 0.75 | 0.75 | 0.75 | 1.00 | 1.00 | 1.00 | 1.00 | 0.50 | 0.50 |
| cycles | 2.50 | 2.50 | 2.50 | 2.50 | 2.50 | 2.50 | 0.75 | 0.75 | 0.75 | 0.75 | 1.00 | 1.00 | 1.00 | 1.00 | 0.50 | 0.50 |
| Cycles executing div or sqrt instructions | NA |
| Longest recurrence chain latency (RecMII) | 1.00 |
| Front-end | 3.63 |
| Dispatch | 2.50 |
| Data deps. | 1.00 |
| Overall L1 | 3.63 |
| all | 0% |
| load | 0% |
| store | 0% |
| mul | NA (no mul vectorizable/vectorized instructions) |
| add-sub | NA (no add-sub vectorizable/vectorized instructions) |
| fma | NA (no fma vectorizable/vectorized instructions) |
| other | 0% |
| all | 25% |
| load | NA (no load vectorizable/vectorized instructions) |
| store | 0% |
| mul | 0% |
| add-sub | NA (no add-sub vectorizable/vectorized instructions) |
| fma | NA (no fma vectorizable/vectorized instructions) |
| div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
| other | 50% |
| all | 7% |
| load | 0% |
| store | 0% |
| mul | 0% |
| add-sub | NA (no add-sub vectorizable/vectorized instructions) |
| fma | NA (no fma vectorizable/vectorized instructions) |
| div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
| other | 10% |
| all | 11% |
| load | 12% |
| store | 12% |
| mul | NA (no mul vectorizable/vectorized instructions) |
| add-sub | NA (no add-sub vectorizable/vectorized instructions) |
| fma | NA (no fma vectorizable/vectorized instructions) |
| other | 11% |
| all | 10% |
| load | NA (no load vectorizable/vectorized instructions) |
| store | 6% |
| mul | 6% |
| add-sub | NA (no add-sub vectorizable/vectorized instructions) |
| fma | NA (no fma vectorizable/vectorized instructions) |
| div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
| other | 15% |
| all | 11% |
| load | 12% |
| store | 9% |
| mul | 6% |
| add-sub | NA (no add-sub vectorizable/vectorized instructions) |
| fma | NA (no fma vectorizable/vectorized instructions) |
| div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
| other | 12% |
| Instruction | Nb FU | P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | P11 | P12 | P13 | P14 | P15 | Latency | Recip. throughput | Vectorization |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MOV -0x13c0(%RBP,%RAX,8),%RDI | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.25 | 0.25 | 0.25 | 0.25 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.25 | scal (12.5%) |
| LEA 0x1(%RAX),%RCX | 1 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 | N/A |
| ADD $0x4,%RBX | 1 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 | N/A |
| MOV %RCX,-0x40(%RBP) | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.25 | 0.25 | 0.25 | 0.25 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 | scal (12.5%) |
| MOV %RDI,%RAX | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.13 | N/A |
| SHR $0xb,%RAX | 1 | 0 | 0 | 0 | 0.33 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.33 | N/A |
| MOV %EAX,%EAX | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.13 | N/A |
| XOR %RDI,%RAX | 1 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 | N/A |
| MOV %RAX,%RDI | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.13 | scal (12.5%) |
| SAL $0x7,%RDI | 1 | 0 | 0 | 0 | 0.33 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.33 | scal (12.5%) |
| AND $-0x62d3a980,%EDI | 1 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 | scal (6.3%) |
| XOR %RAX,%RDI | 1 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 | scal (12.5%) |
| MOV %RDI,%RAX | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.13 | N/A |
| SAL $0xf,%RAX | 1 | 0 | 0 | 0 | 0.33 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.33 | N/A |
| AND $-0x103a0000,%EAX | 1 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 | N/A |
| XOR %RDI,%RAX | 1 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 | N/A |
| MOV %RAX,%RDI | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.13 | scal (12.5%) |
| SHR $0x12,%RDI | 1 | 0 | 0 | 0 | 0.33 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.33 | scal (12.5%) |
| XOR %RDI,%RAX | 1 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 | N/A |
| VCVTUSI2SS %RAX,%XMM1,%XMM0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 3-9 | 1 | scal (12.5%) |
| VMULSS %XMM2,%XMM0,%XMM0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 3 | 0.50 | scal (6.3%) |
| VCMPSS $0x2,%XMM0,%XMM3,%XMM4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 2 | 0.50 | scal (6.3%) |
| VBLENDVPS %XMM4,%XMM5,%XMM0,%XMM0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 2 | 0.50 | vect (25.0%) |
| VMOVSS %XMM0,-0x4(%RBX) | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.25 | 0.25 | 0.25 | 0.25 | 0 | 0 | 0 | 0 | 0.50 | 0.50 | 1 | 0.50 | scal (6.3%) |
| CMP %R12,%RBX | 1 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 | scal (12.5%) |
| JE 4019e5 <main+0x6e5> | 1 | 0 | 0 | 0 | 0.33 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33-0.50 | N/A |
| MOV %RCX,%RAX | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.13 | N/A |
| CMP $0x26f,%RCX | 1 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 | N/A |
| JBE 401940 <main+0x640> | 1 | 0 | 0 | 0 | 0.33 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33-0.50 | N/A |
