OV - vmc.mov1 - Loop 10620

0x7775d0 MOVUPS	(%RSI,%R12,8),%XMM5    [2]

0x7775d5 MOVUPS	0x10(%RSI,%R12,8),%XMM6    [2]

0x7775db MOVUPS	0x20(%RSI,%R12,8),%XMM7    [2]

0x7775e1 MOVUPS	0x30(%RSI,%R12,8),%XMM8    [2]

0x7775e7 MULPD	(%R9,%R12,8),%XMM5    [1]

0x7775ed MULPD	0x10(%R9,%R12,8),%XMM6    [1]

0x7775f4 MULPD	0x20(%R9,%R12,8),%XMM7    [1]

0x7775fb MULPD	0x30(%R9,%R12,8),%XMM8    [1]

0x777602 ADDPD	%XMM5,%XMM4

0x777606 ADDPD	%XMM6,%XMM3

0x77760a ADDPD	%XMM7,%XMM2

0x77760e ADDPD	%XMM8,%XMM1

0x777613 ADD	$0x8,%R12

0x777617 CMP	%RBX,%R12

0x77761a JB	7775d0

/home/kcamus/trex/champ/champ/src/vmc/multiply_slmi_mderiv.f: 24 - 25

--------------------------------------------------------------------------------

24:           do i=1,nel

25:             work(j)=work(j)+work_mat(i+jsh)*slmi(i+msh)

Path /

Metric	Value
CQA speedup if no scalar integer	1.00
CQA speedup if FP arith vectorized	1.12
CQA speedup if fully vectorized	2.25
CQA speedup if no inter-iteration dependency	NA
CQA speedup if next bottleneck killed	1.12
Bottlenecks
Function	multiply_slmi_mderiv_simple_.A
Source	multiply_slmi_mderiv.f:24-25
Source loop unroll info	unrolled by 8
Source loop unroll confidence level	max
Unroll/vectorization loop type	main
Unroll factor	8
CQA cycles	3.00
CQA cycles if no scalar integer	3.00
CQA cycles if FP arith vectorized	2.67
CQA cycles if fully vectorized	1.33
Front-end cycles	2.33
DIV/SQRT cycles	0.50
P0 cycles	0.50
P1 cycles	0.25
P2 cycles	0.25
P3 cycles	0.50
P4 cycles	2.67
P5 cycles	2.67
P6 cycles	2.67
P7 cycles	2.00
P8 cycles	2.00
P9 cycles	2.00
P10 cycles	2.00
P11 cycles	0.00
P12 cycles	0.00
P13 cycles	0.00
Inter-iter dependencies cycles	3
FE+BE cycles (UFS)	NA
Stall cycles (UFS)	NA
Nb insns	15.00
Nb uops	14.00
Nb loads	8.00
Nb stores	0.00
Nb stack references	0.00
FLOP/cycle	5.33
Nb FLOP add-sub	8.00
Nb FLOP mul	8.00
Nb FLOP fma	0.00
Nb FLOP div	0.00
Nb FLOP rcp	0.00
Nb FLOP sqrt	0.00
Nb FLOP rsqrt	0.00
Bytes/cycle	42.67
Bytes prefetched	0.00
Bytes loaded	128.00
Bytes stored	0.00
Stride 0	0.00
Stride 1	2.00
Stride n	0.00
Stride unknown	0.00
Stride indirect	0.00
Vectorization ratio all	100.00
Vectorization ratio load	100.00
Vectorization ratio store	NA
Vectorization ratio mul	100.00
Vectorization ratio add_sub	100.00
Vectorization ratio fma	NA
Vectorization ratio div_sqrt	NA
Vectorization ratio other	NA
Vector-efficiency ratio all	25.00
Vector-efficiency ratio load	25.00
Vector-efficiency ratio store	NA
Vector-efficiency ratio mul	25.00
Vector-efficiency ratio add_sub	25.00
Vector-efficiency ratio fma	NA
Vector-efficiency ratio div_sqrt	NA
Vector-efficiency ratio other	NA

Function	multiply_slmi_mderiv_simple_.A
Source file and lines	multiply_slmi_mderiv.f:24-25
Module	vmc.mov1

The loop is defined in /home/kcamus/trex/champ/champ/src/vmc/multiply_slmi_mderiv.f:24-25.

It is the main loop of the related source loop, which is unrolled by 8 (including vectorization).
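To make "unrolled by 8 (including vectorization)" concrete, here is a minimal Python sketch of the accumulation pattern visible in the assembly listing: four independent accumulators (XMM1 to XMM4), each two double-precision lanes wide, consume 8 elements per iteration. The function name `dot_unrolled`, its parameters, and the scalar remainder handling are illustrative assumptions; they are not taken from the CHAMP source or from the compiler output.

```python
def dot_unrolled(a, b, lanes=2, accumulators=4):
    """Model of the binary loop: 4 vector accumulators x 2 lanes = 8 elements/iteration."""
    step = lanes * accumulators                  # 8, the reported unroll factor
    n_main = len(a) - len(a) % step              # part covered by the "main" loop
    acc = [[0.0] * lanes for _ in range(accumulators)]

    for base in range(0, n_main, step):          # one pass = one binary-loop iteration
        for k in range(accumulators):            # one MOVUPS + MULPD + ADDPD group
            for lane in range(lanes):            # two doubles per 128-bit register
                idx = base + k * lanes + lane
                acc[k][lane] += a[idx] * b[idx]

    total = sum(sum(v) for v in acc)             # final horizontal reduction
    for idx in range(n_main, len(a)):            # assumed scalar remainder loop
        total += a[idx] * b[idx]
    return total


print(dot_unrolled([float(i) for i in range(1, 20)], [2.0] * 19))
```

Because the partial sums are combined in a different order than in a strictly sequential loop, the floating-point result can differ in the last bits for general inputs.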

Vectorization

Your loop is vectorized, but using only 128 out of 512 bits (SSE/AVX-128 instructions on AVX-512 processors).

By fully vectorizing your loop, you can lower the cost of an iteration from 3.00 to 1.33 cycles (2.25x speedup).
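The quoted speedups follow directly from the cycle counts in the metrics table. The short check below assumes that the rounded values 2.67 and 1.33 stand for 8/3 and 4/3 cycles; that reading is an assumption, chosen because it reproduces the reported figures exactly.

```python
from fractions import Fraction

cqa_cycles = Fraction(3)                    # "CQA cycles": 3.00
cycles_fp_vectorized = Fraction(8, 3)       # reported as 2.67 (assumed to be 8/3)
cycles_fully_vectorized = Fraction(4, 3)    # reported as 1.33 (assumed to be 4/3)

speedup_fp = cqa_cycles / cycles_fp_vectorized        # 9/8
speedup_full = cqa_cycles / cycles_fully_vectorized   # 9/4

print(float(speedup_fp), float(speedup_full))  # prints "1.125 2.25"
```

The report shows 1.125 as 1.12 and 2.25 as 2.25.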

Details

All SSE/AVX instructions in the loop are vector instructions (each processes two or more data elements in a vector register). Because your execution units are vector units, only a fully vectorized loop can use their full power.

Workaround

Pass a micro-architecture specialization option to your compiler:
- use axHost or xHost
Use vector aligned instructions:
1. Align your arrays on 64-byte boundaries: compile with -align array64byte (note: this does not affect arrays in COMMON blocks).
2. Inform your compiler that your arrays are vector aligned: place !DIR$ VECTOR ALIGNED immediately before the loop if all accessed arrays are aligned, or !DIR$ ASSUME_ALIGNED FOO: 64 if only FOO is aligned.
Use the LOOP COUNT directive.

Execution units bottlenecks

Found no such bottlenecks but see expert reports for more complex bottlenecks.

FMA

Presence of both ADD/SUB and MUL operations.

Workaround

Pass a micro-architecture specialization option to your compiler:
- use axHost or xHost
Try changing the order in which elements are evaluated (using parentheses) in arithmetic expressions containing both ADD/SUB and MUL operations, so that your compiler can generate FMA instructions wherever possible. For instance, a + b*c is a valid FMA (MUL then ADD), but (a+b)*c cannot be translated into an FMA (ADD then MUL).

Vector unaligned load/store instructions

Detected 4 suboptimal vector unaligned load/store instructions.

Details

MOVUPS: 4 occurrences

Workaround

Pass a micro-architecture specialization option to your compiler:
- use axHost or xHost
Use vector aligned instructions:
1. Align your arrays on 64-byte boundaries: compile with -align array64byte (note: this does not affect arrays in COMMON blocks).
2. Inform your compiler that your arrays are vector aligned: place !DIR$ VECTOR ALIGNED immediately before the loop if all accessed arrays are aligned, or !DIR$ ASSUME_ALIGNED FOO: 64 if only FOO is aligned.

Type of elements and instruction set

8 SSE or AVX instructions are processing arithmetic or math operations on double precision FP elements in vector mode (two at a time).

Matching between your loop (in the source code) and the binary loop

The binary loop is composed of 16 FP arithmetical operations:

8: addition or subtraction
8: multiply

The binary loop is loading 128 bytes (16 double precision FP elements).

Arithmetic intensity

Arithmetic intensity is 0.12 FP operations per loaded or stored byte.
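The arithmetic intensity can be recomputed from the FLOP and byte counts reported for one iteration of the binary loop. The variable names below are illustrative; the numbers are the ones given in this report.

```python
flop_add_sub = 8       # "Nb FLOP add-sub"
flop_mul = 8           # "Nb FLOP mul"
bytes_loaded = 128     # 8 loads of 16 bytes each
bytes_stored = 0       # the loop contains no stores

arithmetic_intensity = (flop_add_sub + flop_mul) / (bytes_loaded + bytes_stored)
print(arithmetic_intensity)  # prints 0.125
```

The report displays this value as 0.12.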

General properties

nb instructions	15
nb uops	14
loop length	76
used x86 registers	4
used mmx registers	0
used xmm registers	8
used ymm registers	0
used zmm registers	0
nb stack references	0
ADD-SUB / MUL ratio	1.00

Front-end

ASSUMED MACRO FUSION
FIT IN UOP CACHE

micro-operation queue	2.33 cycles
front end	2.33 cycles

Back-end

	ALU0/BRU0	ALU1	ALU2	ALU3	BRU1	AGU0	AGU1	AGU2	FP0	FP1	FP2	FP3	FP4	FP5
uops	0.50	0.50	0.25	0.25	0.50	2.67	2.67	2.67	2.00	2.00	2.00	2.00	0.00	0.00
cycles	0.50	0.50	0.25	0.25	0.50	2.67	2.67	2.67	2.00	2.00	2.00	2.00	0.00	0.00

Execution ports to units layout:

ALU0/BRU0: ALU
ALU1: ALU
ALU2: ALU
ALU3: ALU
BRU1:
AGU0 (256 bits): store address, load
AGU1 (256 bits): store address, load
AGU2 (256 bits): store address, load
FP0 (256 bits): VPU, DIV/SQRT
FP1 (256 bits): VPU, DIV/SQRT
FP2 (256 bits): VPU
FP3 (256 bits): VPU
FP4 (256 bits): FP store data
FP5 (256 bits): FP store data

Cycles executing div or sqrt instructions	NA
Longest recurrence chain latency (RecMII)	3.00

Cycles summary

Front-end	2.33
Dispatch	2.67
Data deps.	3.00
Overall L1	3.00
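One way to read the cycles summary: the overall L1 estimate matches the largest of the three component bounds. The sketch below assumes that CQA combines them with a simple maximum, which is consistent with the numbers shown here but is not stated explicitly in this report.

```python
bounds = {
    "Front-end": 2.33,
    "Dispatch": 2.67,
    "Data deps.": 3.00,   # one ADDPD (latency 3) per accumulator per iteration
}

limiting = max(bounds, key=bounds.get)   # name of the largest bound
overall_l1 = bounds[limiting]

print(limiting, overall_l1)  # prints "Data deps. 3.0"
```

Under this reading, the loop is limited by the accumulator recurrence (RecMII of 3.00) rather than by port pressure.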

Vectorization ratios

all	100%
load	100%
store	NA (no store vectorizable/vectorized instructions)
mul	100%
add-sub	100%
fma	NA (no fma vectorizable/vectorized instructions)
div/sqrt	NA (no div/sqrt vectorizable/vectorized instructions)
other	NA (no other vectorizable/vectorized instructions)

Vector efficiency ratios

all	25%
load	25%
store	NA (no store vectorizable/vectorized instructions)
mul	25%
add-sub	25%
fma	NA (no fma vectorizable/vectorized instructions)
div/sqrt	NA (no div/sqrt vectorizable/vectorized instructions)
other	NA (no other vectorizable/vectorized instructions)
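The 25% figures are consistent with the ratio of the vector width actually used to the widest width the report attributes to the processor ("128 out of 512 bits"). The sketch below assumes that this ratio is how the efficiency is defined.

```python
used_vector_bits = 128     # XMM registers (SSE / AVX-128)
max_vector_bits = 512      # ZMM registers (AVX-512)
element_bits = 64          # double precision

efficiency_percent = 100 * used_vector_bits / max_vector_bits
elements_used = used_vector_bits // element_bits        # doubles per instruction
elements_possible = max_vector_bits // element_bits     # doubles per full-width instruction

print(efficiency_percent, elements_used, elements_possible)  # prints "25.0 2 8"
```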

Cycles and memory resources usage

Assuming all data fit into the L1 cache, each iteration of the binary loop takes 3.00 cycles. At this rate:

44% of peak load performance is reached (42.67 out of 96.00 bytes loaded per cycle (GB/s @ 1GHz))
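The 96.00 bytes per cycle peak agrees with the port layout listed in the back-end section, where three 256-bit AGU ports can each serve a load. The check below assumes that the peak is simply the number of load ports times the port width.

```python
load_ports = 3                      # AGU0, AGU1, AGU2
port_width_bytes = 256 // 8         # 256-bit ports
peak_bytes_per_cycle = load_ports * port_width_bytes

bytes_loaded = 128                  # per iteration of the binary loop
flop = 16                           # 8 add-sub + 8 mul per iteration
cycles = 3.00                       # CQA cycles per iteration

bytes_per_cycle = bytes_loaded / cycles
flop_per_cycle = flop / cycles
percent_of_peak = 100 * bytes_per_cycle / peak_bytes_per_cycle

print(peak_bytes_per_cycle, round(bytes_per_cycle, 2),
      round(flop_per_cycle, 2), int(percent_of_peak))  # prints "96 42.67 5.33 44"
```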

Front-end bottlenecks

Found no such bottlenecks.

ASM code

In the binary file, the address of the loop is: 7775d0

Instruction	Nb FU	ALU0/BRU0	ALU1	ALU2	ALU3	BRU1	AGU0	AGU1	AGU2	FP0	FP1	FP2	FP3	Latency	Recip. throughput
MOVUPS (%RSI,%R12,8),%XMM5	1	0	0	0	0	0	0.33	0.33	0.33	0	0	0	0	3	0.50
MOVUPS 0x10(%RSI,%R12,8),%XMM6	1	0	0	0	0	0	0.33	0.33	0.33	0	0	0	0	3	0.50
MOVUPS 0x20(%RSI,%R12,8),%XMM7	1	0	0	0	0	0	0.33	0.33	0.33	0	0	0	0	3	0.50
MOVUPS 0x30(%RSI,%R12,8),%XMM8	1	0	0	0	0	0	0.33	0.33	0.33	0	0	0	0	3	0.50
MULPD (%R9,%R12,8),%XMM5	1	0	0	0	0	0	0.33	0.33	0.33	0.50	0.50	0	0	3	0.50
MULPD 0x10(%R9,%R12,8),%XMM6	1	0	0	0	0	0	0.33	0.33	0.33	0.50	0.50	0	0	3	0.50
MULPD 0x20(%R9,%R12,8),%XMM7	1	0	0	0	0	0	0.33	0.33	0.33	0.50	0.50	0	0	3	0.50
MULPD 0x30(%R9,%R12,8),%XMM8	1	0	0	0	0	0	0.33	0.33	0.33	0.50	0.50	0	0	3	0.50
ADDPD %XMM5,%XMM4	1	0	0	0	0	0	0	0	0	0	0	0.50	0.50	3	0.50
ADDPD %XMM6,%XMM3	1	0	0	0	0	0	0	0	0	0	0	0.50	0.50	3	0.50
ADDPD %XMM7,%XMM2	1	0	0	0	0	0	0	0	0	0	0	0.50	0.50	3	0.50
ADDPD %XMM8,%XMM1	1	0	0	0	0	0	0	0	0	0	0	0.50	0.50	3	0.50
ADD $0x8,%R12	1	0.25	0.25	0.25	0.25	0	0	0	0	0	0	0	0	1	0.25
CMP %RBX,%R12	1	0.25	0.25	0.25	0.25	0	0	0	0	0	0	0	0	1	0.25
JB 7775d0 <multiply_slmi_mderiv_mp_multiply_slmi_mderiv_simple_.A+0x200>	1	0.50	0	0	0	0.50	0	0	0	0	0	0	0	1	0.50-1
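The per-port totals in the back-end table can be reproduced by summing the columns of the instruction table above. The sketch below assumes that the displayed 0.33 stands for 1/3. It covers only the AGU and FP ports, because the ALU and branch columns are affected by the assumed CMP/JB macro-fusion.

```python
from fractions import Fraction

third, half = Fraction(1, 3), Fraction(1, 2)

loads = 4 + 4      # 4 MOVUPS plus 4 MULPD with a memory operand
mulpd = 4
addpd = 4

agu_each = loads * third       # pressure on each of AGU0, AGU1, AGU2
fp01_each = mulpd * half       # MULPD split across FP0 and FP1
fp23_each = addpd * half       # ADDPD split across FP2 and FP3

dispatch = max(agu_each, fp01_each, fp23_each)
print(round(float(agu_each), 2), float(fp01_each), float(fp23_each),
      round(float(dispatch), 2))  # prints "2.67 2.0 2.0 2.67"
```

The busiest ports are the three AGUs, which matches the 2.67-cycle dispatch bound in the cycles summary.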
