Loop Id: 486 | Module: libqmcwfs.so | Source: BsplineFunctor.h:305-336 | Coverage: 0.03% |
---|
0x4c180 VMOVSD (%RDI,%RDX,8),%XMM2 [5] |
0x4c185 VMOVAPD %YMM10,%YMM3 |
0x4c189 MOVSXD (%R10,%RDX,4),%RAX [3] |
0x4c18d INC %RDX |
0x4c190 VMULSD %XMM2,%XMM7,%XMM8 |
0x4c194 ADD %R8,%RAX |
0x4c197 VDIVSD %XMM2,%XMM15,%XMM2 |
0x4c19b VROUNDSD $0xb,%XMM8,%XMM8,%XMM1 |
0x4c1a1 VCVTTSD2SI %XMM8,%ESI |
0x4c1a6 VSUBSD %XMM1,%XMM8,%XMM0 |
0x4c1aa VMOVAPD 0x60(%RSP),%YMM1 [6] |
0x4c1b0 VMULSD %XMM0,%XMM0,%XMM6 |
0x4c1b4 VBROADCASTSD %XMM0,%YMM4 |
0x4c1b9 VFMADD132PD %YMM4,%YMM9,%YMM3 |
0x4c1be VFMADD213PD 0x40(%RSP),%YMM4,%YMM1 [6] |
0x4c1c5 MOVSXD %ESI,%R13 |
0x4c1c8 VFMADD132PD %YMM13,%YMM12,%YMM4 |
0x4c1cd VMOVUPD (%R11,%R13,8),%YMM8 [1] |
0x4c1d3 VMULSD %XMM6,%XMM0,%XMM0 |
0x4c1d7 VBROADCASTSD %XMM6,%YMM5 |
0x4c1dc VFMADD231PD %YMM5,%YMM11,%YMM3 |
0x4c1e1 VMULPD %YMM4,%YMM8,%YMM4 |
0x4c1e5 VBROADCASTSD %XMM0,%YMM6 |
0x4c1ea VMULPD 0x80(%RSP),%YMM6,%YMM0 [6] |
0x4c1f3 VMULPD %YMM3,%YMM8,%YMM3 |
0x4c1f7 VFMADD231PD 0x20(%RSP),%YMM5,%YMM0 [6] |
0x4c1fe VEXTRACTF128 $0x1,%YMM4,%XMM5 |
0x4c204 VADDPD %YMM0,%YMM1,%YMM1 |
0x4c208 VMULPD %YMM8,%YMM1,%YMM6 |
0x4c20d VADDPD %XMM4,%XMM5,%XMM8 |
0x4c211 VEXTRACTF128 $0x1,%YMM3,%XMM5 |
0x4c217 VADDPD %XMM3,%XMM5,%XMM3 |
0x4c21b VUNPCKHPD %XMM8,%XMM8,%XMM0 |
0x4c220 VADDPD %XMM8,%XMM0,%XMM1 |
0x4c225 VUNPCKHPD %XMM3,%XMM3,%XMM8 |
0x4c229 VADDPD %XMM3,%XMM8,%XMM0 |
0x4c22d VEXTRACTF128 $0x1,%YMM6,%XMM5 |
0x4c233 VADDPD %XMM6,%XMM5,%XMM6 |
0x4c237 VMULSD %XMM1,%XMM14,%XMM4 |
0x4c23b VMULSD %XMM0,%XMM7,%XMM1 |
0x4c23f VUNPCKHPD %XMM6,%XMM6,%XMM3 |
0x4c243 VADDPD %XMM6,%XMM3,%XMM8 |
0x4c247 VMOVSD %XMM4,(%R14,%RAX,8) [4] |
0x4c24d VMULSD %XMM1,%XMM2,%XMM4 |
0x4c251 VMOVLPD %XMM8,(%RBX,%RAX,8) [2] |
0x4c256 VMOVSD %XMM4,(%R12,%RAX,8) [7] |
0x4c25c CMP %RCX,%RDX |
0x4c25f JNE 4c180 |
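Near the end of the loop body, the listing repeats the sequence VEXTRACTF128, VADDPD, VUNPCKHPD, VADDPD three times. Each repetition collapses a 4-lane product into a single scalar, which is then scaled and stored. As a rough illustration, the portable C++ sketch below emulates that sequence lane by lane. `horizontalSum4` and `dot4` are hypothetical names, not functions from miniQMC, and the mapping to instructions is read off the listing above.

```cpp
#include <array>

// Emulates the reduction applied to one 4-lane YMM register in the listing.
double horizontalSum4(const std::array<double, 4>& ymm)
{
  // VEXTRACTF128 $0x1: take the upper two lanes.
  const std::array<double, 2> upper = {ymm[2], ymm[3]};
  // VADDPD (xmm): add the upper pair to the lower pair.
  const std::array<double, 2> pair = {ymm[0] + upper[0], ymm[1] + upper[1]};
  // VUNPCKHPD: bring lane 1 down to lane 0.
  const double high = pair[1];
  // VADDPD (xmm): lane 0 now holds the total.
  return pair[0] + high;
}

// VMULPD followed by the reduction: four spline coefficients times
// four per-coefficient polynomial values, summed to one scalar.
double dot4(const std::array<double, 4>& coefs, const std::array<double, 4>& poly)
{
  std::array<double, 4> prod{};
  for (int k = 0; k < 4; ++k)
    prod[k] = coefs[k] * poly[k];
  return horizontalSum4(prod);
}
```

In this reading the vector lanes run over the four spline coefficients of one iteration, not over several values of `j`, which would be consistent with the scalar stores in the listing.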
/scratch_na/users/xoserete/qaas_runs/171-417-3180/intel/miniqmc/build/miniqmc/src/QMCWaveFunctions/Jastrow/BsplineFunctor.h: 305 - 336 |
-------------------------------------------------------------------------------- |
305: real_type r = distArrayCompressed[j]; |
306: int iScatter = distIndices[j]; |
307: real_type rinv = cOne / r; |
308: r *= DeltaRInv; |
309: int iGather = (int)r; |
310: real_type t = r - real_type(iGather); |
311: real_type tp0 = t * t * t; |
312: real_type tp1 = t * t; |
313: real_type tp2 = t; |
314: |
315: real_type sCoef0 = SplineCoefs[iGather + 0]; |
316: real_type sCoef1 = SplineCoefs[iGather + 1]; |
317: real_type sCoef2 = SplineCoefs[iGather + 2]; |
318: real_type sCoef3 = SplineCoefs[iGather + 3]; |
319: |
320: // clang-format off |
321: laplArray[iScatter] = dSquareDeltaRinv * |
322: (sCoef0*( d2A[ 2]*tp2 + d2A[ 3])+ |
323: sCoef1*( d2A[ 6]*tp2 + d2A[ 7])+ |
324: sCoef2*( d2A[10]*tp2 + d2A[11])+ |
325: sCoef3*( d2A[14]*tp2 + d2A[15])); |
326: |
327: gradArray[iScatter] = DeltaRInv * rinv * |
328: (sCoef0*( dA[ 1]*tp1 + dA[ 2]*tp2 + dA[ 3])+ |
329: sCoef1*( dA[ 5]*tp1 + dA[ 6]*tp2 + dA[ 7])+ |
330: sCoef2*( dA[ 9]*tp1 + dA[10]*tp2 + dA[11])+ |
331: sCoef3*( dA[13]*tp1 + dA[14]*tp2 + dA[15])); |
332: |
333: valArray[iScatter] = (sCoef0*(A[ 0]*tp0 + A[ 1]*tp1 + A[ 2]*tp2 + A[ 3])+ |
334: sCoef1*(A[ 4]*tp0 + A[ 5]*tp1 + A[ 6]*tp2 + A[ 7])+ |
335: sCoef2*(A[ 8]*tp0 + A[ 9]*tp1 + A[10]*tp2 + A[11])+ |
336: sCoef3*(A[12]*tp0 + A[13]*tp1 + A[14]*tp2 + A[15])); |
Coverage (%) | Name | Source Location | Module |
---|---|---|---|
►98.43+ | miniqmcreference::TwoBodyJastr[...] | TwoBodyJastrowRef.h:315 | libqmcwfs.so |
○ | qmcplusplus::WaveFunction::acc[...] | NewTimer.h:249 | libqmcwfs.so |
○ | main._omp_fn.1 | stl_vector.h:1121 | exec |
○ | gomp_thread_start | team.c:130 | libgomp.so.1.0.0 |
►1.57+ | miniqmcreference::TwoBodyJastr[...] | TwoBodyJastrowRef.h:315 | libqmcwfs.so |
○ | qmcplusplus::WaveFunction::acc[...] | NewTimer.h:249 | libqmcwfs.so |
○ | main._omp_fn.1 | stl_vector.h:1121 | exec |
○ | GOMP_parallel | libgomp.h:985 | libgomp.so.1.0.0 |
Metric | Value |
---|---|
CQA speedup if no scalar integer | 1.00 |
CQA speedup if FP arith vectorized | 1.39 |
CQA speedup if fully vectorized | 3.39 |
CQA speedup if no inter-iteration dependency | NA |
CQA speedup if next bottleneck killed | 1.04 |
Bottlenecks | P0, P1 |
Function | miniqmcreference::TwoBodyJastrowRef |
Source | BsplineFunctor.h:305-336 |
Source loop unroll info | not unrolled or unrolled with no peel/tail loop |
Source loop unroll confidence level | max |
Unroll/vectorization loop type | NA |
Unroll factor | NA |
CQA cycles | 12.50 |
CQA cycles if no scalar integer | 12.50 |
CQA cycles if FP arith vectorized | 9.00 |
CQA cycles if fully vectorized | 3.69 |
Front-end cycles | 8.17 |
DIV/SQRT cycles | 4.00 |
P0 cycles | 12.50 |
P1 cycles | 12.50 |
P2 cycles | 2.33 |
P3 cycles | 2.33 |
P4 cycles | 1.50 |
P5 cycles | 12.00 |
P6 cycles | 1.60 |
P7 cycles | 1.50 |
P8 cycles | 1.50 |
P9 cycles | 1.50 |
P10 cycles | 1.40 |
P11 cycles | 2.33 |
Inter-iter dependencies cycles | 1 |
FE+BE cycles (UFS) | 14.46 |
Stall cycles (UFS) | 5.72 |
Nb insns | 48.00 |
Nb uops | 49.00 |
Nb loads | 7.00 |
Nb stores | 3.00 |
Nb stack references | 4.00 |
FLOP/cycle | 6.40 |
Nb FLOP add-sub | 17.00 |
Nb FLOP mul | 22.00 |
Nb FLOP fma | 20.00 |
Nb FLOP div | 1.00 |
Nb FLOP rcp | 0.00 |
Nb FLOP sqrt | 0.00 |
Nb FLOP rsqrt | 0.00 |
Bytes/cycle | 15.68 |
Bytes prefetched | 0.00 |
Bytes loaded | 172.00 |
Bytes stored | 24.00 |
Stride 0 | 1.00 |
Stride 1 | 2.00 |
Stride n | 0.00 |
Stride unknown | 3.00 |
Stride indirect | 1.00 |
Vectorization ratio all | 52.38 |
Vectorization ratio load | 83.33 |
Vectorization ratio store | 0.00 |
Vectorization ratio mul | 40.00 |
Vectorization ratio add_sub | 87.50 |
Vectorization ratio fma | 100.00 |
Vectorization ratio div_sqrt | 0.00 |
Vectorization ratio other | 33.33 |
Vector-efficiency ratio all | 26.79 |
Vector-efficiency ratio load | 43.75 |
Vector-efficiency ratio store | 12.50 |
Vector-efficiency ratio mul | 27.50 |
Vector-efficiency ratio add_sub | 26.56 |
Vector-efficiency ratio fma | 50.00 |
Vector-efficiency ratio div_sqrt | 12.50 |
Vector-efficiency ratio other | 18.75 |
Function | miniqmcreference::TwoBodyJastrowRef |
Source file and lines | BsplineFunctor.h:305-336 |
Module | libqmcwfs.so |
nb instructions | 48 |
nb uops | 49 |
loop length | 229 |
used x86 registers | 13 |
used mmx registers | 0 |
used xmm registers | 11 |
used ymm registers | 12 |
used zmm registers | 0 |
nb stack references | 4 |
ADD-SUB / MUL ratio | 0.80 |
micro-operation queue | 8.17 cycles |
front end | 8.17 cycles |
Port | P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | P11 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
uops | 12.50 | 12.50 | 2.33 | 2.33 | 1.50 | 12.00 | 1.60 | 1.50 | 1.50 | 1.50 | 1.40 | 2.33 |
cycles | 12.50 | 12.50 | 2.33 | 2.33 | 1.50 | 12.00 | 1.60 | 1.50 | 1.50 | 1.50 | 1.40 | 2.33 |
Cycles executing div or sqrt instructions | 4.00 |
Longest recurrence chain latency (RecMII) | 1.00 |
FE+BE cycles | 14.46 |
Stall cycles | 5.72 |
RS full (events) | 12.58 |
Front-end | 8.17 |
Dispatch | 12.50 |
DIV/SQRT | 4.00 |
Data deps. | 1.00 |
Overall L1 | 12.50 |
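Vectorization ratio all | 52% |
Vectorization ratio load | 83% |
Vectorization ratio store | 0% |
Vectorization ratio mul | 40% |
Vectorization ratio add-sub | 87% |
Vectorization ratio fma | 100% |
Vectorization ratio div/sqrt | 0% |
Vectorization ratio other | 33% |
Vector-efficiency ratio all | 26% |
Vector-efficiency ratio load | 43% |
Vector-efficiency ratio store | 12% |
Vector-efficiency ratio mul | 27% |
Vector-efficiency ratio add-sub | 26% |
Vector-efficiency ratio fma | 50% |
Vector-efficiency ratio div/sqrt | 12% |
Vector-efficiency ratio other | 18% |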
Instruction | Nb FU | P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | P11 | Latency | Recip. throughput |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
VMOVSD (%RDI,%RDX,8),%XMM2 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 |
VMOVAPD %YMM10,%YMM3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0-1 | 0.17 |
MOVSXD (%R10,%RDX,4),%RAX | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 |
INC %RDX | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 |
VMULSD %XMM2,%XMM7,%XMM8 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
ADD %R8,%RAX | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1 | 0.20 |
VDIVSD %XMM2,%XMM15,%XMM2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 13-15 | 4 |
VROUNDSD $0xb,%XMM8,%XMM8,%XMM1 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 1 |
VCVTTSD2SI %XMM8,%ESI | 2 | 1.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 1 |
VSUBSD %XMM1,%XMM8,%XMM0 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50 |
VMOVAPD 0x60(%RSP),%YMM1 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33 |
VMULSD %XMM0,%XMM0,%XMM6 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VBROADCASTSD %XMM0,%YMM4 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
VFMADD132PD %YMM4,%YMM9,%YMM3 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD213PD 0x40(%RSP),%YMM4,%YMM1 | 1 | 0.50 | 0.50 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 4 | 0.50 |
MOVSXD %ESI,%R13 | 1 | 0 | 0.33 | 0 | 0 | 0 | 0.33 | 0 | 0 | 0 | 0 | 0.33 | 0 | 1 | 0.33 |
VFMADD132PD %YMM13,%YMM12,%YMM4 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VMOVUPD (%R11,%R13,8),%YMM8 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33 |
VMULSD %XMM6,%XMM0,%XMM0 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VBROADCASTSD %XMM6,%YMM5 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
VFMADD231PD %YMM5,%YMM11,%YMM3 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VMULPD %YMM4,%YMM8,%YMM4 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VBROADCASTSD %XMM0,%YMM6 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
VMULPD 0x80(%RSP),%YMM6,%YMM0 | 1 | 0.50 | 0.50 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 4 | 0.50 |
VMULPD %YMM3,%YMM8,%YMM3 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD231PD 0x20(%RSP),%YMM5,%YMM0 | 1 | 0.50 | 0.50 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 4 | 0.50 |
VEXTRACTF128 $0x1,%YMM4,%XMM5 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
VADDPD %YMM0,%YMM1,%YMM1 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50 |
VMULPD %YMM8,%YMM1,%YMM6 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VADDPD %XMM4,%XMM5,%XMM8 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50 |
VEXTRACTF128 $0x1,%YMM3,%XMM5 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
VADDPD %XMM3,%XMM5,%XMM3 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50 |
VUNPCKHPD %XMM8,%XMM8,%XMM0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
VADDPD %XMM8,%XMM0,%XMM1 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50 |
VUNPCKHPD %XMM3,%XMM3,%XMM8 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
VADDPD %XMM3,%XMM8,%XMM0 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50 |
VEXTRACTF128 $0x1,%YMM6,%XMM5 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
VADDPD %XMM6,%XMM5,%XMM6 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50 |
VMULSD %XMM1,%XMM14,%XMM4 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VMULSD %XMM0,%XMM7,%XMM1 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VUNPCKHPD %XMM6,%XMM6,%XMM3 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
VADDPD %XMM6,%XMM3,%XMM8 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50 |
VMOVSD %XMM4,(%R14,%RAX,8) | 1 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0.50 | 0.50 | 0.50 | 0 | 0 | 1 | 0.50 |
VMULSD %XMM1,%XMM2,%XMM4 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VMOVLPD %XMM8,(%RBX,%RAX,8) | 1 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0.50 | 0.50 | 0.50 | 0 | 0 | 4-12 | 0.50 |
VMOVSD %XMM4,(%R12,%RAX,8) | 1 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0.50 | 0.50 | 0.50 | 0 | 0 | 1 | 0.50 |
CMP %RCX,%RDX | 1 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0 | 1 | 0.20 |
JNE 4c180 <_ZN16miniqmcreference17TwoBodyJastrowRefIN11qmcplusplus14BsplineFunctorIdEEE9computeU3ERKNS1_11ParticleSetEiPKdPdSA_SA_b+0x560> | 1 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 |