Zen4's AVX512 Teardown
I won't get into how this happened, but AMD graciously sent me two test setups this year. A retail Zen3 setup back in January, and an engineering Zen4 setup in August.
The motivation here is to let me optimize y-cruncher for these chips. In the case of Zen4, it would also be ahead of release - before all the hardware reviewers get it. Thus when the launch-day benchmarks are released, at least y-cruncher will be properly optimized and will show the processor on a level playing field against competition that has otherwise had the benefit of time to receive new optimizations.
While my task was to optimize y-cruncher (as well as beta test the chip), there were no optimization resources for it yet. So no Agner Fog, no uops.info, no block diagrams, etc... All the resources that I usually rely on weren't available yet. So I was on my own.
So here is my personal teardown of Zen4 - or at least the SIMD portion of it since that's the part I was interested in. As this is the first time I've done this, there may be errors or incorrect conclusions. So it's here for discussion. To be honest, this was quite enjoyable - though time-consuming. And I look forward to seeing what I got right and what I got wrong in the upcoming days/weeks when the big names can get their eyes on it.
That said, I wasn't completely alone. I did get some help from Ian Cutress and ChipsandCheese on how to use some of the off-the-shelf tools needed to analyze the processor. So huge thanks to them for their help - which was also awkward because I couldn't actually show them results due to embargo.
But anyway, we live in a very strange world now. AMD has AVX512, but Intel does not on its mainstream parts... If you told me this a few years ago, I'd have looked at you funny. But here we are... Despite the "double-pumping", Zen4's AVX512 is not only competitive with Intel's, it outright beats it in many ways. Intel does remain ahead in a few areas though.
AMD bringing a high quality AVX512 implementation to the consumer market is going to shake things up. This will make it much harder for Intel to kill off AVX512. So it will be interesting in the next few years to see how they respond. Do they bring AVX512 to the E-cores? Do they work with OS's to find a way to automatically migrate AVX512 processes to the P-cores? Do they kill off the E-cores? This situation is hilarious since it is Intel who made these instructions and now AMD is running away with it.
AVX512 Flavors:
This was leaked many months ago, but just to confirm it: Zen4's AVX512 flavors are all of Ice Lake's plus AVX512-BF16.
So Zen4 has:
- AVX512-F
- AVX512-CD
- AVX512-VL
- AVX512-BW
- AVX512-DQ
- AVX512-IFMA
- AVX512-VBMI
- AVX512-VNNI
- AVX512-BF16
- AVX512-VPOPCNTDQ
- AVX512-VBMI2
- VPCLMULQDQ
- AVX512-BITALG
- GFNI
- VAES
The ones it is missing are:
- The Xeon Phi ones: AVX512-PF, AVX512-ER, AVX512-4FMAPS, AVX512-4VNNIW
- AVX512-VP2INTERSECT (from Tiger Lake)
- AVX512-FP16 (from Sapphire Rapids and AVX512-enabled Alder Lake)
AVX512 gets a lot of hate for its "fragmentation". But really this is probably just a result of dark silicon on modern process nodes. So we're seeing more and more application-specific "accelerators" to put some of that extra silicon to use. Pretty much every new AVX512 flavor starting from Cannon Lake is arguably an accelerator for some specific application.
In the context of benchmarks and hardware reviews, this can lead to some wild results. If a benchmark happens to land on one of these "accelerators", it may get a disproportionately large speedup which some might consider "cheating". With more and more accelerators appearing for various applications and more developer attention to using them, this will probably become increasingly common in the future.
It would be interesting to see if AMD attempts to add their own AVX512 extensions in the future. AMD has not had much success in adding new instructions since x64 itself. Will they try again?
Zen4's 512-bit is Non-Native:
This was quite obvious even before AMD formally disclosed that their implementation was "double-pumped". AVX512 isn't used much. Widening the execution units to 512-bit would come at an incredible silicon cost along with power complications. And AVX512 has many more features than just its width. So it was easy to predict that Zen4 was not going to have native 512-bit.
Thus as many of us predicted, 512-bit instructions are split into 2 x 256-bit of the same instruction. And 512-bit instructions always have half the throughput of their 256-bit versions.
When a 512-bit instruction is split, it is issued on two consecutive cycles, likely to the same unit (hence the "double-pumping"). So one half will always be a cycle behind the other half. I have not tested which half has the 1-cycle delay, but I assume it's the upper half. Thus one can assume that all the datapaths on Zen4 remain 256-bit, with 512-bit operands taking two consecutive cycles on each port.
As long as there are no dependencies which cross the halves, this 1 cycle delay is not observable. Therefore, the latencies of most 512-bit instructions remain the same as the 256-bit versions. 512-bit instructions which do have dependencies that cross between lower and upper 256-bit halves have a 1 cycle additional latency - presumably to wait for the slower half to be ready.
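For example (a sketch with my own register choices), an in-lane 512-bit add has no dependencies crossing the 256-bit halves, while a lane-crossing permute does:

Code:
vaddpd zmm0, zmm1, zmm2        ; halves are independent - same latency as the ymm version
vpermq zmm0, zmm3, zmm1        ; indices can pull data across the halves - 1 extra cycle of latency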
Other than that, the splitting is seamless and I did not observe any other hazards. In the absence of any architecture block diagrams, I can only infer that Zen4's vector units remain roughly the same as Zen3's. So the raw computing throughput hasn't changed much. Both integer and floating-point remain at 4 x 256-bit/cycle.
However, this does not mean 512-bit is a wash. In practice, it is very difficult to fully saturate all the units due to front-end bottlenecks. For example, it's almost impossible to sustain 4 floating-point instructions per cycle due to high latencies and register pressure. AVX512 resolves this by halving the front-end overhead: you only need to issue 2 x 512-bit instructions/cycle to saturate the vector units instead of 4 x 256-bit. In other words, Zen3 is a powerful CPU whose potential is difficult to realize. AVX512 lets Zen4 unleash all that under-utilized computing potential.
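To illustrate (a sketch; the register assignments are mine), saturating the 4 pipes with 256-bit requires sustaining 4 instructions every cycle, while 512-bit needs only 2:

Code:
; 4 x 256-bit per cycle - the front-end must sustain 4 instructions/cycle:
vaddpd      ymm0, ymm0, ymm12
vaddpd      ymm1, ymm1, ymm13
vfmadd231pd ymm2, ymm14, ymm15
vfmadd231pd ymm3, ymm14, ymm15

; the same work as 2 x 512-bit per cycle:
vaddpd      zmm0, zmm0, zmm12
vfmadd231pd zmm2, zmm14, zmm15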
In fact, there are power advantages to doing 2 x 512-bit instead of 4 x 256 due to less work by the front-end. Given the way that Zen4 does frequency scaling, lower power means more thermal headroom to increase clock speeds - which means higher performance.
For simplicity with the rest of this article, all throughputs I mention going forward are for the 256-bit version of the instruction. So unless otherwise stated, 512-bit versions are almost always half throughput with the same latency.
Power and Thermals:
Since 512-bit instructions are reusing the same 256-bit hardware, 512-bit does not come with additional thermal issues. There is no artificial throttling like on Intel chips.
However, the chip as a whole is very difficult to cool due to the dies being so small. So the bottleneck is the contact area between the die and the IHS. For all practical purposes, under a suitable all-core load, the 7950X will be running at its 95C Tj.Max all the time. If you throw a bigger cooler on it, it will boost to higher clock speeds to get right back to Tj.Max. Because of this, the performance of the chip is dependent on the cooling. And it is basically impossible to hit the 230W power limit at stock without direct-die or sub-ambient cooling.
If 95C sounds scary, wait until you see the voltages involved. AMD advertises 5.7 GHz. In reality, a slightly higher value of 5.75 GHz seems to be the norm - often across half the cores simultaneously. So it's not just a single-core load. The Fmax is 5.85 GHz, but I have never personally seen it go above 5.75.
If CPU-Z's reading is to be trusted, the 5.75 GHz requires 1.5 V vcore - at stock settings. Time will tell if there are longevity issues with this kind of abuse. (Though worth mentioning that Intel is also headed in this direction with their 6 GHz Raptor Lake.)
For my* 7950X at stock settings with a 360 AIO:
- All-core scalar workloads seem to be 5.2 - 5.5 GHz depending on the workload.
- All-core 128-bit SSE workloads seem to be 5.2 - 5.5 GHz.
- All-core 256-bit AVX2 workloads seem to be 4.8 - 5.4 GHz.
- All-core 512-bit AVX512 workloads seem to be 4.9 - 5.5 GHz.
Compute-intensive workloads with high IPC that saturate the execution units will result in the lower-end of the above speed ranges. Memory-bound workloads will result in the higher-end of the range since the core isn't really doing a whole lot.
Observations:
- The top speeds are all around 5.5 GHz regardless of the instruction width. This implies that the boost speeds are entirely power-dependent and not artificial (so none of Intel's AVX offset stuff).
- 512-bit AVX512 has slightly *higher* clock speeds than 256-bit AVX2 - which is ironic given the history with Intel's chips.
As mentioned earlier, 512-bit has half the front-end overhead (decoding, scheduling, etc...) as 256-bit for the same amount of computational work. So in the absence of front-end bottlenecks, 512-bit consumes less power and gives the chip more headroom to boost to higher speeds - thus a reversal of Intel's implementation.
That said, more investigation is needed before I can suggest reversing the "prefer 128-bit" recommendation for optimizing compilers. The results above are for the steady state. I have not analyzed the transient behavior of "sprinkling" wide vector instructions into otherwise scalar code. While 512-bit is no worse than 256-bit on Zen4, 256-bit is still worse than 128-bit, and I have not ruled out the possibility of momentary clock-stretching or throttling that could negatively affect other things. Likewise, a similar situation could arise between 256-bit and 512-bit on a future AMD processor if AMD decides to widen to native 512-bit.
(*As a disclaimer, my test setup involves an engineering motherboard with a very immature BIOS. And I have reasons to believe that the clock speeds and boosting behavior that I am seeing may not exactly match the retail launch. However, the chip itself appears to be of retail stepping. So it should be the same as what is being sold now.)
Floating-Point:
Zen4 has the same floating-point units as Zen3: 2 FADD and 2 FMA, all of which are 256-bit wide. Latencies appear to be unchanged as well.
While it is possible to sustain 2 x FADD and 2 x FMA every cycle, it requires sustaining 4 instructions/cycle - which as mentioned earlier is difficult to do for floating-point. This has been the case since Zen2 and remains unchanged in Zen4.
AVX512 means you only need 2 x 512-bit instructions each cycle to saturate the 4 pipes. This is easily achievable in real-world code. Not only that, the extra registers help reduce register pressure.
Comparing with Intel's offerings, Zen4 is capable of sustaining 1 x 512-bit FADD + 1 x 512-bit FMA every cycle. No client-side Intel processor could do this prior to Alder Lake. And the only server ones that can do it are those with that 2nd FMA unit. So Zen4 is very competitive with Intel here despite having non-native AVX512.
The only place Intel remains ahead is raw FMA throughput on the server parts which can sustain 2 x 512-bit FMA/cycle thanks to the 2nd FMA unit. But realistically, there are very few non-synthetic applications that spam 100% FMA and aren't also completely memory-bound. (Level-3 BLAS is probably the only major one I can think of.) Most floating-point applications will have a healthy mix of FADD and FMUL/FMA.
(A bit of an editorial)
In my opinion, Intel's mistake with AVX512 was to optimize for the 100% FMA workloads (namely Linpack) instead of the more common mixed FADD/FMA workloads. Adders are cheap. Multipliers are expensive. One of each would do just fine for most workloads. Instead, Intel decided to add a 2nd FMA to Skylake X/SP... It is that 2nd FMA which caused most of the power/throttling issues that have tainted AVX512's reputation and hindered its adoption.
In fact, many workloads would run faster with no 2nd FMA (and no throttling) than *with* the 2nd FMA (and with throttling). This is arguably quite embarrassing, and market segmentation may explain why Intel's lower-end (1-FMA) server chips have large AVX512 offsets despite not really needing them. The lack of that 2nd FMA and its power issues may be why Ice Lake (Client) and Tiger Lake turned out pretty well. (Cannon Lake would be here too, but it failed for unrelated reasons.)
So with 20/20 hindsight, Intel should have implemented AVX512 in one of these ways (among many other possibilities):
- Instead of adding the 2nd FMA, add a 512-bit FADD instead. Intel finally added separate FADD hardware in Golden Cove (Alder Lake + Sapphire Rapids).
- Implement 512-bit as 2 x 256-bit first. This allows the world to get all the non-width related features. Then in the future, when the silicon allows for it and the demand calls for it, widen things up to 512-bit. Basically copy AMD's approach to both 256-bit AVX and 512-bit AVX512.
Shuffles/Permutes:
Intel historically had only one shuffle unit until Ice Lake, where they added a second one. (Or rather, they probably exposed the upper half of the 512-bit shuffle so it can be used as 2 x 256.)
Zen3 has 2 shuffle units, one small and one big. Both are 256-bit wide. The small shuffle can only handle simple shuffles while the big one can handle all of them.
(For reference, "simple" shuffles would be stuff like vunpcklps and vshufps while "complex" shuffles would be all the 3-input, all-to-all, and 128-bit lane-crossing shuffles.)
Zen4 expands that "big" shuffle unit to a native 512-bit shuffle. This big shuffle is capable of running all shuffle instructions at 1/cycle throughput. Even the 512-bit all-to-all 3-input byte permutes (vpermi2b/vpermt2b) are 1/cycle throughput. This is incredible because the silicon cost for permutes (probably) grows quadratically with the granularity, and it appears Zen4 has paid that cost. By comparison, Intel's finer-granularity permutes are quite slow, though steadily improving since Skylake.
Furthermore, this 512-bit shuffle can also operate as a pair of 256-bit shuffles - each capable of running the same complex shuffles. Thus there is a total of 3 shuffle pipes on Zen4.
So the per-cycle throughput is:
- 3 x 256-bit simple shuffles - small + both halves of the big unit
- 2 x 256-bit complex shuffles - both halves of the big unit
- 1.5 x 512-bit simple shuffles - small + both halves of the big unit, but double-pumped
- 1 x 512-bit complex shuffle - natively into the big unit
512-bit shuffles with data dependencies that cross between the lower and upper 256-bit halves go natively into the 512-bit shuffle. While the execution of the shuffle itself is not double-pumped (due to being a native 512-bit unit), the register inputs and outputs still are. Since one of the 256-bit halves is always one cycle behind the other (due to being double-pumped everywhere else in the processor), these 512-bit complex shuffles take an extra cycle of latency to wait for the slower half before execution. Thus all 512-bit complex shuffles have one cycle of extra latency over their 256-bit versions.
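To make the distinction concrete (a sketch; my register choices), these are the sorts of instructions that land in each category:

Code:
vunpcklps zmm0, zmm1, zmm2     ; "simple" shuffle - in-lane, can go to any of the 3 pipes
vpermi2b  zmm3, zmm4, zmm5     ; "complex" 3-input all-to-all byte permute - native to the
                               ; big unit, still 1/cycle, +1 cycle of latency at 512-bit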
Native 512-bit vs. double-pumped 2 x 256-bit?
How do I know that Zen4 has a native 512-bit shuffle as opposed to a pair of 256-bit ones which are double-pumped? I don't know for sure. But there is plenty of evidence pointing at the former:
- 512-bit complex shuffles cannot be double-pumped the same way as everything else because of the cross-half dependencies.
- I found no evidence of additional uops being generated to facilitate these cross-half dependencies if they were indeed double-pumped.
- When mixing 512-bit and 256-bit complex shuffles in close proximity, the observed throughput falls well short of the theoretical limit if they were double-pumped onto separate 256-bit units. This implies that a lone 256-bit shuffle issued into the native 512-bit shuffle will block a 512-bit shuffle for that cycle. The scheduler's ability to reorder 256-bit and 512-bit shuffles to fill these bubbles is limited. No such bubbles were observed when mixing 512-bit and 256-bit arithmetic instructions that do not use the shuffle unit.
Regardless of the physical design of the Zen4's shuffle unit, it far exceeds my expectations and beats Intel in every aspect. Shuffle quality is often the determining factor on whether to use full-width vs half-width when the hardware is "double-pumped." From Bulldozer to Zen1, the shuffles were poor. So 128-bit AVX codepaths were usually the best. Zen4 does it right.
AVX512 Mask Registers:
All Intel chips (Skylake through Alder Lake) have a single write port for mask registers. This limits everything that writes to masks to 1/cycle, which is often a bottleneck in predicated branch-heavy code. Zen4 appears to have 2 mask write ports.
All mask ALU instructions on Zen4 are 1-cycle latency. By comparison, cross-lane mask instructions (kshift, kadd, etc...) are 4 cycles on Intel.
The throughput for all mask ALU instructions is 2/cycle except for 64-bit masks which are 1/cycle. Does this imply that 64-bit masks are also "double-pumped"?
So AMD beats Intel hands down here. In fact, Intel's implementation of the mask arithmetic is so bad that a lot of code (and compilers) prefer to pay the (high) cost of moving the masks to general purpose registers, doing the work there, then moving them back later. Because AMD's implementation is much faster, it will probably change these trade-offs. So if Intel's mask registers remain slow, compilers will be forced to pick whether to favor AMD or Intel.
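For example, combining two masks can be done natively or bounced through a GPR (a sketch; register choices are mine). On Intel the bounce often wins; on Zen4 the native form should:

Code:
; native mask ALU - 1-cycle latency, 2/cycle on Zen4:
kandw k1, k2, k3

; the GPR roundtrip that Intel-tuned code often prefers:
kmovw eax, k2
kmovw edx, k3
and   eax, edx
kmovw k1, eax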
Data Port Limitations:
As with previous Zen chips, there appear to be a limited number of data ports within the SIMD unit. Exceeding it will cause delays. All 4 of the SIMD pipes have 2 native data ports for a total of 8. Then there are 2 more data ports that are shared in some way for a grand total of 10 data ports.
While it's possible to sustain 2 x FADD + 2 x FMA every cycle (10 inputs), it is not possible to sustain 4 x ternlog (12 inputs) despite the latter being computationally much cheaper. Going to 512-bit does not bypass the data port limitation as this is a bandwidth limitation and not a front-end bottleneck.
For masked AVX512 instructions, the mask itself does not use a data port. But in the case of merge-masking, the destination register may become an extra input which does require a port.
So these two sequences run at full throughput (2 instructions/cycle):
Code:
vaddpd zmm, zmm, zmm
vfmadd213pd zmm, zmm, zmm
(2 x 5 inputs)
Code:
vaddpd zmm{k}{z}, zmm, zmm
vfmadd213pd zmm{k}{z}, zmm, zmm
(2 x 5 inputs)
But this runs at half throughput (1 instruction/cycle):
Code:
vaddpd zmm{k}, zmm, zmm
vfmadd213pd zmm{k}, zmm, zmm
(2 x 6 inputs)
Intel chips do not have this limitation. They can run all 3 of the above sequences at full throughput with no hazards. (2 instructions/cycle for chips with both FMAs, 1 instruction/cycle for chips with one FMA.)
Load/Store Unit:
I did not observe any throughput changes to the load/store units between Zen3 and Zen4. Vector load/stores remain unchanged at 2 x 256-bit loads or 1 x 256-bit store per cycle, with misalignment penalties. This translates to 1 x 512-bit load or 0.5 x 512-bit store per cycle and is arguably one of Zen4's bigger architectural weaknesses compared to Intel. All Intel chips capable of AVX512 can do 2 x 512-bit loads or 1 x 512-bit store every cycle.
So developers will want to make good use of AVX512's extra registers to avoid what will be costly spills.
Vector 64x64 Integer Multiply (vpmullq):
This one instruction is probably my most surprising finding and deserves an entire section of its own. This, along with the shuffle unit, may be AMD and TSMC flexing their dark silicon capabilities.
vpmullq performs a 64-bit integer multiply on every 64-bit SIMD lane. Naturally this is an expensive operation as the silicon cost grows quadratically with the width of the multiplier. Intel's implementation runs it through the double-precision 52-bit multiplier 3 times to emulate it (3 uops). Thus vpmullq has 1/3 the throughput of the FMAs and IFMAs.
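For a sense of why a full 64-bit vector multiply is expensive, here is the classic software fallback built from 32-bit partial products (a hedged sketch, not what either vendor does in hardware; register names are mine):

Code:
; 64x64-bit low-half multiply per lane, using vpmuludq (32x32 -> 64):
vpmuludq zmm2, zmm0, zmm1      ; lo(a) * lo(b)
vpsrlq   zmm3, zmm0, 32
vpmuludq zmm3, zmm3, zmm1      ; hi(a) * lo(b)
vpsrlq   zmm4, zmm1, 32
vpmuludq zmm4, zmm0, zmm4      ; lo(a) * hi(b)
vpaddq   zmm3, zmm3, zmm4      ; sum the cross terms
vpsllq   zmm3, zmm3, 32
vpaddq   zmm2, zmm2, zmm3      ; == vpmullq zmm2, zmm0, zmm1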
Zen4's implementation is full-throughput - the same throughput as the FMAs and IFMAs. This implies that Zen4 has a 64-bit multiplier in every SIMD lane. This is quite remarkable because vpmullq is not a commonly used instruction, yet AMD is spending silicon budget to make it fast anyway.
So if anyone wants to build a benchmark that draws as much power as possible for a given clock speed, vpmullq may be a good candidate along with vpermi2b/vpermt2b.
AVX512 Conflict Detection:
The conflict detection instructions (vpconflictd/vpconflictq) are very fast. By comparison, Intel microcodes them on all but their Xeon Phi line. However, these instructions don't follow the usual pattern of the other vector instructions, likely due to their unusual cross-lane dependencies.
- 128-bit wide: 2 cycles latency / 0.50 cycles per instruction
- 256-bit wide: 6 cycles latency / 0.78 cycles per instruction
- 512-bit wide: 6 cycles latency / 1.33 cycles per instruction
These numbers are hard to explain so I'm not even going to try.
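As a refresher on what the instruction actually computes (values made up for illustration), each lane receives a bitmask of which lower-indexed lanes hold the same value:

Code:
; input  dword lanes [0..3]:  7, 3, 7, 7
; output dword lanes [0..3]:  0b0000, 0b0000, 0b0001, 0b0101
vpconflictd xmm0, xmm1         ; e.g. lane 3 matches lanes 0 and 2 -> 0b0101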
As far as silicon costs go, this looks very expensive as it probably needs the full O(N^2) number of comparisons. So likely another case of dark silicon flexing.
AVX512 Compress Store:
vpcompressd (and family) with a memory destination is microcoded. I measured 142 cycles/instruction for "vpcompressd [mem]{k}, zmm".
The in-register version is fast, as are the corresponding expand instructions (with or without a memory operand). On Intel, these compress stores are not microcoded and are fast. So this is specific to the memory-destination version on Zen4.
This is puzzling because the instruction can be emulated with:
Code:
vpcompressd zmm0{k1}{z}, zmm0
kmovd       eax, k1
mov         edx, -1
pext        eax, edx, eax     ; eax = (1 << popcount(k1)) - 1
kmovd       k1, eax
vmovdqu32   [mem]{k1}, zmm0
... which isn't nearly as bad. Why wasn't it microcoded to this sequence? Did I overlook something?
Alternatively, it can also be emulated as:
Code:
vpcompressd zmm0{k1}{z}, zmm0
vpcompressd zmm1{k1}{z}, zmm31    ; zmm31 = constant of all 1s
vpcmpeqd    k1, zmm1, zmm31       ; compare for equality
vmovdqu32   [mem]{k1}, zmm0
Microcoding this instruction isn't too shocking given that the masking behavior doesn't fit any of the usual instruction classes. But what is less clear is why it is microcoded so poorly. In any case, compilers tuning for Zen4 should probably tear apart the intrinsics for these and emit the sequences above instead.
SIMD Resource Sharing:
- The integer multiply shares its multipliers with the FMUL/FMA. (No surprise here.) It is not possible to sustain more than 2 multiply instructions/cycle no matter the type.
- The integer shift shares resources with the FADD (likely the barrel shifters?). It is not possible to sustain 2 x FADD + 2 x shift per cycle.
- Integer shift and FMA do not share resources. It is possible to sustain 2 x SHIFT + 2 x FMA per cycle.
- The "small" shuffle and the "big" shuffle pipe are shared with the two FMA units. The throughput is 0.75 x 512-bit FMA + 0.75 x 512-bit shuffle per cycle regardless of shuffle type.
- The "big" shuffle pipe is shared with one of the IMUL pipes. The throughput is 1 x 512-bit IMUL + 1 x 512-bit (simple) shuffle per cycle, or 0.75 x 512-bit IMUL + 0.75 x 512-bit (complex) shuffle.
After trying various sequences, including unrealistic ones, I was never able to find one that could saturate both the big shuffle and the multipliers simultaneously. It appears that the port assignments are set up in a way that prevents this. One could speculate that this is intentional as a way to enforce dark silicon. Since the shuffle and the multipliers are likely some of the biggest and most power-hungry components in the SIMD unit, limiting their utilization puts a cap on how much silicon can be active at any given time - lest we run into the thermal issues of Intel's early AVX512 chips.
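For reference, this is the flavor of test sequence used (a sketch; my register choices) - an interleave that would need the big shuffle and the multiplier pipes live in the same cycle:

Code:
; ideally 1 x 512-bit FMA + 1 x 512-bit complex shuffle per cycle...
vfmadd231pd zmm0, zmm1, zmm2   ; wants both FMA pipes (double-pumped)
vpermi2b    zmm3, zmm4, zmm5   ; wants the big shuffle unit
; ...but the measured rate is only 0.75 of each per cycle.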
Slow on both Intel and AMD:
There are a number of things that are really slow on Intel processors which are also slow in AMD's Zen4 implementation. I didn't expect these to be fast on Zen4, but it was at least worth checking to see if AMD pulled off any sort of magic.
Gather/Scatter: No surprise here as these are inherently difficult to implement. Zen4's gather/scatters are quite a bit slower than Intel's - probably owing to its weaker load/store unit. So it's still best to avoid these when possible and use shuffles when the indices are compile-time constants.
Fault Suppression: Fault suppression of masked-out lanes incurs a significant penalty. I measured about ~256 cycles/load and ~355 cycles/store if a masked-out lane faults. These are slightly better than Intel's Tiger Lake, but still very bad. So it's still best to avoid liberally reading/writing past the end of buffers, or alternatively, to allocate extra buffer space to ensure that accessing out-of-bounds does not fault.
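The penalty only bites when a masked-off lane actually touches an unmapped page - for example, a masked load that hangs off the end of the last valid page (a sketch; rdi and k1 are my choices):

Code:
; buffer ends mid-vector; k1 masks off the lanes past the end:
vmovdqu32 zmm0{k1}{z}, [rdi]   ; fast if all 64 bytes are mapped,
                               ; ~256 cycles if a masked-off lane lands on an unmapped page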
Physical Register File:
My best estimate of Zen4's physical vector register file is ~192 x 512-bit with an uncertainty of +/- 16 on the count.
- Direct measurements of the reorder window give the same value for 256-bit and 512-bit instructions.
- ZMM registers can be renamed to the same degree as XMM and YMM registers. Thus suggesting that ZMM registers are not split in the register file.
- The 192 is inferred from Zen4 having a measured reorder window 32 entries larger than Zen3's (measured roughly as in the sketch below). But this is just an estimate since it is unknown how many registers the core reserves internally.
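(For the curious, the reorder window is typically measured by spacing out long-latency operations with independent register-writing fillers - a sketch, with my register choices:)

Code:
mov    rax, [rax]              ; long-latency pointer chase #1
vaddps zmm0, zmm30, zmm31      ; independent fillers, each allocating a PRF entry
vaddps zmm1, zmm30, zmm31
; ... N fillers total ...
mov    rax, [rax]              ; chase #2 stalls once N exceeds the available entries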
While the register file is 512-bit in width, the datapaths probably remain 256-bit where ZMM reads/writes to the register file occupy consecutive cycles to deliver each half.
By comparison, Zen1 maps 256-bit registers into two entries in a 128-bit register file - resulting in 256-bit having half the reorder window of 128-bit code. Zen1 also cannot rename YMM registers due to them being split.
On Intel, the vector register files have been native 512-bit since Skylake. Skylake client, which shares the same core layout as Skylake Server yet lacks AVX512, has a block of unused silicon on its die shots which is speculated to be the (unused) upper halves of the 512-bit register file.
So for full comparison:
- Skylake X + Cannon Lake: 168 x 512-bit
- Ice Lake + Tiger Lake: 224 x 512-bit
- Alder Lake + Sapphire Rapids: 332 x 512-bit
- Zen1: 160 x 128-bit
- Zen2+3: 160 x 256-bit
- Zen4: 192? x 512-bit
Tests on long-latency sequences mostly mirror these register file sizes. So Zen4 remains behind the latest Intel chips here. I did notice that 256-bit SIMD seems to have slightly better reorder capability than 512-bit for long-latency code. This does not happen on Intel processors. I have no explanation for this, so more investigation is needed.
Other:
rdrand is much faster than on Zen3. But rdseed got much slower.
All SIMD register widths can be move-renamed. This includes the 512-bit ZMM registers.
There are a lot of special cases (like "vxorpd reg, reg, reg") that have lower latencies and higher throughput since they are handled by the front-end and avoid the execution units. These apply equally to all SIMD registers - including ZMM.
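The classic example (applies to zmm just like xmm/ymm):

Code:
vxorpd zmm0, zmm0, zmm0        ; recognized as a zeroing idiom - breaks the dependency
                               ; on zmm0 and is handled at rename, never executing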
I found no surprises among all the new AVX512 extensions. Performance is similar to Intel's implementations.
Some uncommon instructions have gotten slower. For example, the min/max (vminpd/vmaxpd) instructions increased from 1 to 2 cycles of latency. Throughput remains the same.
Overall, AMD's AVX512 implementation beat my expectations. I was expecting something similar to Zen1's "double-pumping" of AVX with half the register file and cross-lane instructions being super slow. But this is not the case on Zen4. The lack of power or thermal issues combined with stellar shuffle support makes it completely worthwhile to use from a developer standpoint. If your code can vectorize without excessive wasted computation, then go all the way to 512-bit. AMD not only made this worthwhile, but *incentivizes* it with the power savings. And if in the future AMD decides to widen things up, you may get a 2x speedup for free.
Personally, I'm super excited about Zen4 and am very much looking forward to a new build. It has the latest AVX512 and it blows my Skylake 7940X out of the water in code compilation. So once the platform matures a bit more, I plan on upgrading my main coding/gaming machine to Zen4 - either by scavenging parts from this test rig or by building something completely new.
AM5 is a new platform so I'm sure there will be plenty of early adopter kinks to be sorted out. So I'll be waiting a bit for things to stabilize before I open my wallet. At the very least I want to see all the motherboard options since the current ones are somewhat absurdly priced.