13900KF vs 13700KF vs 7950x vs 7700x vs 5800x3d

Admin

Administrator
Moderator
Messages
3,738
#1

Admin

Administrator
Moderator
Messages
3,738
#2
Power consumption
According to various leaks intel will offer at least 2 options for limiting the power, one at 125W and one at 253W. The following test indicates that the PL2 figure of 253W is indeed accurate.

wccftech.com/intel-core-i9-13900k-raptor-lake-cpu-with-5-8-ghz-boost-clock-350w-unlimited-power-setting-up-to-67-faster-amd-ryzen-9-5950x/

Intel does actually seem to have improved the power efficiency with raptor lake, but it is still behind zen 3.



That was however done with the unlimited power setting, intel might actually beat 5950x in the PL1/PL2 mode.

https://appuals.com/i9-13900k-thermals-tested/

AMD claimed 170W TDP for the 7900x and 7950x, which is a move in the wrong direction. These will still be the AMD models to consider the most since they will have twice the cache and more cores / $.


It is worth noting that the TDP figures for AMD CPUs are basically bullshit. The 5950x draws a lot more than 105W for example.

105W.png


It looks like it will be very difficult to cool zen4 CPUs when they have gotten a significant overclock (a custom loop might be required for proper overclocking).

https://videocardz.com/newz/amd-ryzen-9-7950x-zen4-cpu-not-showing-its-full-potential-in-leaked-cinebench-r23-tests
 

Admin

Administrator
Moderator
Messages
3,738
#3
The intel gracemont efficiency cores
A gracemont core takes up around 33% of the space of a performance core while also delivering around 33% of the MT performance. Unlike the performance cores, the gracemont cores do not have hyperthreading, so each gracemont core will only offer one thread.

Unfortunately these so called "efficiency cores" are not particularly energy efficient due to the high voltage applied to them, and 2 efficiency cores are overall weaker than a single performance core (with hyperthreading) even in multi-threaded applications.

anandtech.com/show/17047/the-intel-12th-gen-core-i912900k-review-hybrid-performance-brings-hybrid-complexity/10

Cinebench R20 performance at 2800 MHz
1661969025160.png


From the 12900K and 12600K

one p core with HT = 571
two ecores = 500.5

Cinebench R20 single-thread performance at 2800mhz
1661979576311.png

From this we see that the first p-core thread will give 422 and a thread used for hyperthreading will add 149. This indicates that using e-cores will provide better performance than using hyperthreading for extra threads at the same frequency.

With raptor lake the L2 cache per e-core was increased to 2MiB which should improve the e-core IPC.

If the p-cores are clocked at 6ghz and e-cores at 4.8 ghz we get (assuming linear scaling)

P-core: 904.3
e-core: 429
ht-thread: 319.3
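
The three figures above can be reproduced with a few lines of Python. This is only a sketch of the linear-scaling assumption stated above (real chips will scale worse than this); the variable names are made up for illustration and the per-e-core score is simply the two-e-core result divided by 2.

```python
# Cinebench R20 scores measured at 2800 MHz (figures quoted above).
BASE_MHZ = 2800
P_CORE_FIRST_THREAD = 422.0   # first thread on a p-core
HT_THREAD_EXTRA = 149.0       # what the hyperthreading thread adds on top
TWO_E_CORES = 500.5           # two e-cores together

def scale(score, target_mhz, base_mhz=BASE_MHZ):
    """Scale a score linearly with clock speed (optimistic assumption)."""
    return score * target_mhz / base_mhz

p_core = scale(P_CORE_FIRST_THREAD, 6000)   # p-cores at 6 GHz
ht_thread = scale(HT_THREAD_EXTRA, 6000)    # HT thread runs on the same p-core
e_core = scale(TWO_E_CORES / 2, 4800)       # one e-core at 4.8 GHz

print(f"P-core:    {p_core:.1f}")
print(f"e-core:    {e_core:.1f}")
print(f"ht-thread: {ht_thread:.1f}")
```

Even with the e-cores clocked 1.2 GHz lower, one e-core still adds more than one hyperthreading thread under this model.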

In cinebench R23 we get that an e-core offers 39% of the MT performance you get from a p-core.

6.1ghz 16 e-cores 46101

6.1ghz no e-cores 25755

In wPrime 1024m we get that an e-core offers 34% of the MT performance you get from a p-core.

7.5ghz 16 e-cores 21.476 seconds

7.56ghz no e-cores 36.197 seconds
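
For anyone who wants to check the 39% and 34% figures, here is a rough sketch of the arithmetic. It assumes the chip has 8 p-cores (the runs only state the 16 e-cores), treats the run without e-cores as pure p-core throughput, and ignores the small clock difference between the two wPrime runs. The function name is made up for illustration.

```python
P_CORES = 8    # assumption: 8 p-cores, not stated in the runs themselves
E_CORES = 16

def e_core_share(with_e, without_e, p_cores=P_CORES, e_cores=E_CORES):
    """Throughput of one e-core as a fraction of one p-core (with HT)."""
    per_p = without_e / p_cores
    per_e = (with_e - without_e) / e_cores
    return per_e / per_p

# Cinebench R23: higher score is better, so the score is the throughput.
r23 = e_core_share(46101, 25755)

# wPrime 1024m: lower time is better, so use 1/time as the throughput.
wprime = e_core_share(1 / 21.476, 1 / 36.197)

print(f"Cinebench R23: {r23:.1%}")
print(f"wPrime 1024m:  {wprime:.1%}")
```

Both land in the same ballpark as the roughly 33% figure mentioned at the start of this post.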
 

Admin

Administrator
Moderator
Messages
3,738
#4
AMD: we will lose
While they didn't openly say that they were going to lose you can just do basic maths on their official performance claim to figure out that intel will beat them again in single-threaded scenarios.

1661984691034.png


Raptor lake will offer at least 10% better gaming performance than alder lake so it looks like AMD will tie at best in gaming (with 7950x).

https://youtu.be/WcH_7xsYtUk?t=530

Most likely AMD will deliver worse single-thread performance at stock

1661984961026.png


videocardz.com/newz/amd-ryzen-7-7700x-is-20-faster-than-ryzen-7-5800x-in-leaked-single-core-cinebench-r20-test
 

Admin

Administrator
Moderator
Messages
3,738
#5
DDR5 vs DDR4
Intel has decided to support both ddr4 and ddr5 with raptor lake while AMD has opted for the less consumer friendly route of only giving support for the more expensive ddr5. For people who already bought ddr4 ram going for intel might be the better option since then they can wait with getting 128 GiB ddr5 or whatever they feel like getting.

Of course if you go with ddr4 & intel there will be a performance loss compared to using ddr5, that gap might be bigger in multi-threaded scenarios.

https://hothardware.com/news/raptor-lake-is-much-faster-with-ddr5
 

Admin

Administrator
Moderator
Messages
3,738
#6
AMD only promised to support AM5 to 2025
One argument for getting AMD is that they are committing to support the AM5 socket at least to 2025 (maybe longer) while the intel platform is approaching end of life.

https://www.pcgamer.com/amd-am5-support-thru-2025/

I personally do not think we should value that too highly since often when it's finally time to upgrade your CPU again you might want a new motherboard anyway or you might want to switch to the other team.

Replacing a CPU is also the least fun thing to do due to the risk of accidentally bending pins, potentially destroying the CPU socket in the process. You reduce that risk by buying a new CPU + motherboard and using the old combo for another computer (or just selling it).
 

Admin

Administrator
Moderator
Messages
3,738
#7
AVX-512
It is often claimed that this would not provide a benefit for gaming, which is utter nonsense. It is very valuable for emulation, which makes it something you definitely want for gaming.

https://www.tomshardware.com/news/ps3-emulator-avx-512-30-percent-performance-boost

Initially you could get AVX-512 support on alder lake by disabling the e-cores and having suitable bios & motherboard but later intel began to fuse that off for no good reason.

https://www.tomshardware.com/news/how-to-pick-up-an-avx-512-supporting-alder-lake-an-easy-way

Now with zen4 AMD will finally start supporting AVX-512, making them the better option unless intel is going to re-enable AVX-512 support with raptor lake. Even if intel does that, however, it might still only be with the e-cores disabled, which isn't a great solution: you want all e-cores enabled for max multi-threaded performance.
 

Admin

Administrator
Moderator
Messages
3,738
#8
Don't let AMD fanboys gaslight you
A lot of bad youtube channels are doing damage-control now for AMD.

While zen4 CPUs are significantly faster than zen3 CPUs a lot of that is merely due to higher factory clock-speeds which comes with higher power consumption. The main thing zen4 has going for it over intel will be AVX-512 (assuming intel hasn't somehow managed to add that back at the last minute which is unlikely).

For example we had this video where someone claimed that "smart access memory" would bring value to the 'AMD ecosystem' when in reality there is no AMD ecosystem to be found, what AMD calls "smart access memory" is part of the pci express standard and will work with any combination of supporting GPUs, CPUs, motherboards (such as r5 3600 + RTX 3090). Currently the support for re-sizeable bar is rather limited and you will only gain around 3% on average.

AMD may however have disabled it for their comparison with intel to make their CPU gains in games look more impressive than they really were.

The big issue with AM5 is that they will not offer any cheap entry CPU for people who want a cheap holdover CPU while waiting for zen4 3d or zen5. People also will not be able to use their ddr4 ram while waiting for ddr5 prices to drop.


 

Admin

Administrator
Moderator
Messages
3,738
#9
Seems like AMD included dolphin emulator benchmark



From what I have read dolphin actually supports avx512, which does explain the 32% increase in performance. This is all well and good except for the fact that a zen4 would be massive overkill for typical usage even without avx 512, it's simply not needed to emulate the weak wii system (it's very useful for ps3 emulation though).

The issue with wii emulation is that games run at a fixed framerate (typically 60) such that the framerate cannot be increased without also speeding up the game.
 

Admin

Administrator
Moderator
Messages
3,738
#10
Undervolting
Seems like you can cut the zen4 power consumption in half without losing any performance, this means no need to upgrade your power supply or get a better cooler.



AMD Ryzen 7000 "Zen 4" processors can hit up to 95 °C at stock settings, with cooling most appropriate to the TDP level. This is because the PPT (package power tracking) limits for the 170 W TDP processors is as high as 230 W, and for the 105 W TDP models, it's 130 W. After reaching this temperature threshold, the processor begins to downclock itself to lower temperatures. Harukaze5719 discovered that higher than needed core voltages could be at play, and manually undervolting the processors could free up significant thermal headroom, letting the processors hold on to higher boost multipliers better.

It remains to be seen if you can get a similar reduction with raptor lake.
 

Admin

Administrator
Moderator
Messages
3,738
#11
Intel likely to offer much better multicore performance per $
Even the 7950x has a hard time beating the 13900K in cinebench R23 multithreaded and it's even worse when you go down in the stack.

notebookcheck.net/Alleged-AMD-Ryzen-9-7950X-Cinebench-R23-benchmark-score-confirms-Zen-4-s-single-core-hegemony.645277.0.html

videocardz.com/newz/amd-ryzen-9-7950x-scores-almost-39k-points-in-leaked-cinebench-r23-multi-core-test-thanks-water-cooling

notebookcheck.net/Intel-Raptor-Lake-Core-i9-13900K-makes-swift-meal-out-of-Core-i9-12900K-Ryzen-9-5950X-and-Ryzen-7-5800X-in-leaked-Geekbench-and-Cinebench-R23-benchmarks.639409.0.html

The issue is simply that AMD is making it too expensive for a lot of people to upgrade, 300$ is a lot just for a 6 core CPU and then people will also have to buy a motherboard and DDR5. Unfortunately good ddr5 ram is rather expensive right now and that might not have changed much at launch.
 

Admin

Administrator
Moderator
Messages
3,738
#12
Summary
AMD, which used to offer value, now really doesn't look particularly great compared to what intel will offer just a month later.

Zen4 advantages
AVX512 support
Better motherboards (more future-proof)
Launches one month earlier
Likely lower power consumption when undervolted (making their CPUs much more energy efficient).

Raptor lake advantages
Slightly better performance across the board in workloads that don't utilize AVX512.
Motherboards likely to be significantly cheaper and existing ones will work
DDR4 support (so people can wait for lower prices before making the switch).
Budget models likely to offer better value.

People in general have mixed views with some planning to go with intel while others plan to go with AMD. A lot of people are unhappy with how expensive it will be to get an AM5 system (motherboard, CPU, ram).

https://boards.4channel.org/g/thread/88407417

https://archive.ph/DnFR8
 

Admin

Administrator
Moderator
Messages
3,738
#13
Monolithic dies are better
It is often claimed that the chiplet design would allow for lower cost but there really isn't any great evidence for that. You can get good yield from large dies simply by cutting down defective dies to use for cheaper models.

For example ga102 is used for all the following GPUs
RTX 3090ti
RTX 3090
RTX 3080ti
RTX 3080 12GB
RTX 3080 10GB
NVIDIA A10 PCIe
NVIDIA A10G PCIe
NVIDIA A40 PCIe
NVIDIA CMP 90HX
NVIDIA RTX A4500
NVIDIA RTX A5000
NVIDIA RTX A5500
NVIDIA RTX A6000

That allows nvidia to effectively use all their large dies even though a lot of them have defects.

With monolithic dies you can have unified l3 cache (for CPUs) allowing for significantly better performance. Monolithic dies also reduce the need for energy consuming and latency adding interconnects.

 

Admin

Administrator
Moderator
Messages
3,738
#14
The Taiwan invasion risk
One big issue with waiting for raptor lake is that china might invade Taiwan in october, which would very much screw over people looking to buy a new CPU (due to the intel competition being largely eliminated for an extended period of time).

 

Admin

Administrator
Moderator
Messages
3,738
#15
X670E vs Z790
AMD will offer a total of 28 pci express 5.0 lanes

16x to the GPU
8x to NVME storage (or something less worthwhile).
4x to the chipset (only pci express 4.0 with B650/X670)

This is much better than the intel offering which is just 4 pci express 4.0 lanes from the CPU besides the 16 pci express 5.0 lanes to the GPU.


The bandwidth offered here to the chipset is double that of what AMD will offer (due to AMD going for a dumb daisy-chain setup). More importantly AMD instead offers 4 times the bandwidth from the CPU to use for NVME storage (to properly beat the PS5 without having to use multiple drives or pci express card).
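
To put rough numbers on the "double" and "4 times" claims, here is a small sketch using the standard per-lane transfer rates (16 GT/s for pci express 4.0, 32 GT/s for 5.0, both with 128b/130b encoding). The intel chipset link is assumed to be DMI 4.0 x8, i.e. equivalent to 8 pci express 4.0 lanes; that width is not stated in this post. Figures are one-direction and ignore protocol overhead beyond the line encoding.

```python
# Per-lane transfer rate in GT/s; gen 3 and later use 128b/130b encoding.
GT_PER_S = {3: 8, 4: 16, 5: 32}

def link_gb_per_s(gen, lanes):
    """Approximate one-direction bandwidth of a PCIe link in GB/s."""
    return GT_PER_S[gen] * (128 / 130) / 8 * lanes

amd_nvme = link_gb_per_s(5, 8)        # 8 gen5 lanes from the CPU
intel_nvme = link_gb_per_s(4, 4)      # 4 gen4 lanes from the CPU
amd_chipset = link_gb_per_s(4, 4)     # gen4 x4 to the chipset
intel_chipset = link_gb_per_s(4, 8)   # assumption: DMI 4.0 x8

print(f"NVME from CPU: AMD {amd_nvme:.1f} GB/s vs intel {intel_nvme:.1f} GB/s")
print(f"Chipset link:  AMD {amd_chipset:.1f} GB/s vs intel {intel_chipset:.1f} GB/s")
print(f"AMD CPU lanes in total: {16 + 8 + 4}")
```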

Of course both with AMD and Intel it will be possible to get more bandwidth to nvme drives at the expense of less bandwidth to the GPU, this will cost you around 3% in terms of gaming performance.


As we see here the whole thing is bottlenecked by the pci express x4 connection, it's really bad and AMD will shove it down your throat. There is some hope that the connection might actually be pci express 5 without officially meeting the required specs, but don't get your hopes up. There is also the possibility that the picture itself is incorrect due to the source being incorrect (I couldn't find any official statement from AMD regarding this at all), though I have not seen any source showing otherwise.

https://www.techpowerup.com/295394/amd-zen-4-socket-am5-explained-pcie-lanes-chipsets-connectivity
 

Admin

Administrator
Moderator
Messages
3,738
#16
Upcoming AM5 motherboards that show promise
You want at least 2 gen5 NVME slots and 1 gen4 NVME slot, having more than that isn't actually that useful since then you just lose GPU bandwidth which isn't particularly great. You also want USB4.

It's also unclear if having more than 2 ram slots is much of a benefit, you might end up with a lot more limited performance if you try to use all 4 slots.

The ROG CROSSHAIR X670E GENE might end up being the best board to buy since it seems to have pretty much anything you actually want (unless you want more than 4 sata ports). It even has a legacy PS/2 port which is nice (and increasingly difficult to find). This is a bit surprising, you wouldn't expect a mATX board to have such good connectivity.

https://www.tomshardware.com/news/asus-teases-rog-crosshair-x670e-gene-for-zen-4-cpus

The gigabyte X670E-AORUS-MASTER looks good, the only thing i see that's missing is PS/2 connectivity.

The msi X670-P does look great but unfortunately it's limited to 1 gen5 NVME slot (you want 2), it also lacks USB4 and ps/2 connectivity.

Asus ProArt X670E-CREATOR WIFI seems to have good specs but it looks unappealing and doesn't have ps/2.
 

Claire_Lovely

Well-known member
Messages
108
#17
This is all very interesting and it's definitely a lot of information to study. It sometimes gets very complex and in-depth though and it can feel easier to just buy stuff that will get good FPS and then focus on the game analysis itself. I think most of the latest hardware will do well on most games. Cost is also an important thing to consider, as well as how well the game's software is optimized.
 

Admin

Administrator
Moderator
Messages
3,738
#18
Turns out the zen4 power efficiency is worse than the zen3 power efficiency at stock settings, this is really bad.


There is a 105W mode you can use instead and also a 65W eco mode, you do however lose significant performance with that (7950x). It does beat the 5950x even in the 65W mode and it should be significantly better in the 105W mode.

1664200074872.png


The gaming performance wasn't great at all for games that aren't emulated using AVX-512 (such as ps3 games). In the tomshardware average it was actually worse than old CPUs at 1080p and about the same at 1440p, the difference is that older CPUs (and upcoming raptor lake) will not require expensive DDR5 memory with worse latency than much cheaper DDR4 memory.

1664200831559.png


1664200561535.png
 

Admin

Administrator
Moderator
Messages
3,738
#19
This is disastrously bad
People were claiming that we would see a great uplift in emulation performance thanks to AVX-512 support. We didn't actually see that with ps3 emulation, instead we still see disastrously bad framerates even when overclocked and paired with very expensive DDR5 memory. This is an epic failure and cannot be defended in any way. We did see a small uplift in switch emulation but we can still expect raptor lake to be faster for it.

emulation.png


I have seen people claim that avx-512 support hasn't been added to rpcs3 emulator, i will look more into this.
 

Admin

Administrator
Moderator
Messages
3,738
#20
Undervolting
Turns out that you can indeed get better performance and lower power consumption by adjusting the voltage-frequency curve. Not something super-great but still a nice improvement.

1664206044283.png

1664206108972.png


https://www.sweclockers.com/test/34873-amd-ryzen-9-7950x-och-ryzen-7-7700x-raphael/17

Performance: +3.96%
total power: -9.09%
Performance/W: +14.35%
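
The performance/W figure follows directly from the other two numbers. A quick sketch of the arithmetic (the function name is just for illustration):

```python
def perf_per_watt_change(perf_change, power_change):
    """Relative change in performance per watt from two relative changes."""
    return (1 + perf_change) / (1 + power_change) - 1

gain = perf_per_watt_change(0.0396, -0.0909)
print(f"Performance/W: {gain:+.2%}")
```

Note that the gains multiply rather than add, which is why the result is a bit more than 3.96 + 9.09.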

It's unclear how optimized this overclock was and not every chip will have the same overclocking margin.
 

Admin

Administrator
Moderator
Messages
3,738
#23
Zen4's AVX512 Teardown
I won't get into how this happened, but AMD graciously sent me two test setups this year. A retail Zen3 setup back in January, and an engineering Zen4 setup in August.

The motivation here is to let me optimize y-cruncher for these chips. In the case for Zen4, it would also be ahead of release - before all the hardware reviewers get it. Thus when the launch-day benchmarks are released, at least y-cruncher will be properly optimized and will show the processor on a level playing field against the competition which has otherwise had the benefit of time to get the new optimizations.

While my task was to optimize y-cruncher (as well as beta test the chip), there were no optimizations resources for it yet. So no Agner Fog, no uops.info, no block diagrams, etc... All the resources that I usually rely on weren't available yet. So I was on my own.

So here is my personal teardown of Zen4 - or at least the SIMD portion of it since that's the part I was interested in. As this is the first time I've done this, there may be errors or incorrect conclusions. So it's here for discussion. To be honest, this was quite enjoyable - though time-consuming. And I look forward to seeing what I got right and what I got wrong in the upcoming days/weeks when the big names can get their eyes on it.

That said, I wasn't completely alone. I did get some help from Ian Cutress and ChipsandCheese on how to use some of the off-the-shelf tools needed to analyze the processor. So huge thanks to them for their help - which was also awkward because I couldn't actually show them results due to embargo.

But anyway, we live in a very strange world now. AMD has AVX512, but Intel does not for mainstream... If you told me this a few years ago, I'd have looked at you funny. But here we are... Despite the "double-pumping", Zen4's AVX512 is not only competitive with Intel's, it outright beats it in many ways. Intel does remain ahead in a few areas though.

AMD bringing a high quality AVX512 implementation to the consumer market is going to shake things up. This will make it much harder for Intel to kill off AVX512. So it will be interesting in the next few years to see how they respond. Do they bring AVX512 to the E-cores? Do they work with OS's to find a way to automatically migrate AVX512 processes to the P-cores? Do they kill off the E-cores? This situation is hilarious since it is Intel who made these instructions and now AMD is running away with it.

AVX512 Flavors:
This was leaked many months ago, but just to confirm it: Zen4's AVX512 flavors are all of Ice Lake plus AVX512-BF16.

So Zen4 has:
  • AVX512-F
  • AVX512-CD
  • AVX512-VL
  • AVX512-BW
  • AVX512-DQ
  • AVX512-IFMA
  • AVX512-VBMI
  • AVX512-VNNI
  • AVX512-BF16
  • AVX512-VPOPCNTDQ
  • AVX512-VBMI2
  • AVX512-VPCLMULQDQ
  • AVX512-BITALG
  • AVX512-GFNI
  • AVX512-VAES
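
If you want to check a machine against this list yourself, the sketch below looks for the corresponding flags on Linux. The flag spellings are the ones the Linux kernel prints in the "flags" line of /proc/cpuinfo (for example avx512_vnni with an underscore), and the helper names are made up for illustration. On other operating systems you would need a different source such as a CPUID tool.

```python
# The Zen4 list above, spelled the way Linux shows them in /proc/cpuinfo.
ZEN4_AVX512_FLAGS = [
    "avx512f", "avx512cd", "avx512vl", "avx512bw", "avx512dq",
    "avx512ifma", "avx512vbmi", "avx512_vnni", "avx512_bf16",
    "avx512_vpopcntdq", "avx512_vbmi2", "vpclmulqdq",
    "avx512_bitalg", "gfni", "vaes",
]

def missing_flags(cpuinfo_text, wanted=ZEN4_AVX512_FLAGS):
    """Return the wanted flags that the first 'flags' line does not list."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            present = set(line.split(":", 1)[1].split())
            return [flag for flag in wanted if flag not in present]
    return list(wanted)  # no flags line found at all

def read_cpuinfo(path="/proc/cpuinfo"):
    with open(path) as fh:
        return fh.read()

if __name__ == "__main__":
    try:
        text = read_cpuinfo()
    except OSError:
        print("could not read /proc/cpuinfo (not Linux?)")
    else:
        gaps = missing_flags(text)
        print("all listed flags present" if not gaps else f"missing: {gaps}")
```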

The ones it is missing are:
  • The Xeon Phi ones: AVX512-PF, AVX512-ER, AVX512-4FMAPS, AVX512-4VNNIW
  • AVX512-VP2INTERSECT (from Tiger Lake)
  • AVX512-FP16 (from Sapphire Rapids and AVX512-enabled Alder Lake)
AVX512 gets a lot of hate for its "fragmentation". But really this is probably just a result of dark silicon on modern process nodes. So we're seeing more and more application-specific "accelerators" to put some of that extra silicon to use. Pretty much every new AVX512 flavor starting from Cannon Lake is arguably an accelerator for some specific application.

In the context of benchmarks and hardware reviews, this can lead to some wild results. If a benchmark happens to land on one of these "accelerators", it may get a disproportionately large speedup which some might consider "cheating". With more and more accelerators appearing for various applications and more developer attention to using them, this will probably become increasingly common in the future.

It would be interesting to see if AMD attempts to add their own AVX512 extensions in the future. AMD has not had much success in adding new instructions since x64 itself. Will they try again?

Zen4's 512-bit is Non-Native:
This was quite obvious even before AMD formally disclosed that their implementation was "double-pumped". AVX512 isn't used much. Widening the execution units to 512-bit would come at incredible silicon cost and power complications. AVX512 has many more features than just its width. So it was easy to predict that Zen4 was not going to have native 512-bit.

Thus as many of us predicted, 512-bit instructions are split into 2 x 256-bit of the same instruction. And 512-bit is always half-throughput of their 256-bit versions.

When a 512-bit instruction is split, it is issued on two consecutive cycles likely to the same unit. (hence the "double-pumping") So one half will always be a cycle behind the other half. I have not tested which half has the 1 cycle delay, but I assume it's the upper half. Thus one can assume that all the datapaths on Zen4 remain 256-bit as 512-bit operands would take two consecutive cycles on each port.

As long as there are no dependencies which cross the halves, this 1 cycle delay is not observable. Therefore, the latencies of most 512-bit instructions remain the same as the 256-bit versions. 512-bit instructions which do have dependencies that cross between lower and upper 256-bit halves have a 1 cycle additional latency - presumably to wait for the slower half to be ready.

Other than that, the splitting is seamless and I did not observe any other hazards. In the absence of any architecture block diagrams, I can only infer that Zen4's vector units remain roughly the same as Zen3. So the raw computing throughput hasn't changed much. Both integer and floating-point remain at 4 x 256-bit/cycle.

However, this does not mean 512-bit is a wash. In practice, it is very difficult to fully saturate all units due to front-end bottlenecks. For example, it's almost impossible to sustain 4 floating-point instructions per cycle due to high latencies and register pressure. AVX512 resolves this by halving the amount of front-end overhead since you only need to issue two 512-bit instructions/cycle to saturate the vector units instead of 4 x 256-bit. In other words, Zen3 is a powerful CPU that is difficult to realize. AVX512 allows Zen4 to unleash all that under-utilized computing potential.

In fact, there are power advantages to doing 2 x 512-bit instead of 4 x 256 due to less work by the front-end. Given the way that Zen4 does frequency scaling, lower power means more thermal headroom to increase clock speeds - which means higher performance.

For simplicity with the rest of this article, all throughputs I mention going forward are for the 256-bit version of the instruction. So unless otherwise stated, 512-bit versions are almost always half throughput with the same latency.
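
The front-end argument can be written down as a toy model. This is only a sketch of the reasoning in this section, not a simulator: it assumes 4 vector pipes of 256 bits each, that a 512-bit instruction occupies one pipe for two consecutive cycles, and that the only latency penalty is the 1 cycle for cross-half dependencies. The base latency passed in is a placeholder, not a measured value.

```python
VECTOR_PIPES = 4        # per the text: 4 x 256-bit per cycle
PIPE_WIDTH_BITS = 256

def issue_rate_to_saturate(instr_width_bits):
    """Instructions per cycle the front-end must supply to keep all pipes busy."""
    cycles_per_instr = max(1, instr_width_bits // PIPE_WIDTH_BITS)
    return VECTOR_PIPES / cycles_per_instr

def latency(base_latency, width_bits, crosses_halves):
    """512-bit ops whose dependencies cross the 256-bit halves pay 1 extra cycle."""
    extra = 1 if (width_bits == 512 and crosses_halves) else 0
    return base_latency + extra

print(issue_rate_to_saturate(256))   # 4 instructions/cycle needed
print(issue_rate_to_saturate(512))   # only 2 instructions/cycle needed
```

Same amount of vector work per cycle in both cases, but the 512-bit version needs half as many instructions decoded and scheduled.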

Power and Thermals:
Since 512-bit instructions are reusing the same 256-bit hardware, 512-bit does not come with additional thermal issues. There is no artificial throttling like on Intel chips.

However, the chip as a whole is very difficult to cool due to the dies being so small. So the bottleneck is the contact area between the die and the IHS. For all practical purposes, under a suitable all-core load, the 7950X will be running at 95C Tj.Max all the time. If you throw a bigger cooler on it, it will boost to higher clock speeds to get right back to Tj.Max. Because of this, the performance of the chip is dependent on the cooling. And it is basically impossible to hit the 230W power limit at stock without direct die or sub-ambient.

If 95C sounds scary, wait till you see the voltages involved. AMD advertises 5.7 GHz. In reality, a slightly higher value of 5.75 GHz seems to be the norm - often across half the cores simultaneously. So it's not just a single core load. The Fmax is 5.85 GHz, but I have never personally seen it go above 5.75.

If CPUz's reading is to be trusted, the 5.75 GHz requires 1.5v vcore - at stock settings. Time will tell if there are longevity issues with this kind of abuse. (Though worth mentioning that Intel is also headed in this direction with their 6 GHz Raptor Lake.)

For my* 7950X at stock settings with a 360 AIO:
  • All-core scalar workloads seem to be 5.2 - 5.5 GHz depending on the workload.
  • All-core 128-bit SSE workloads seem to be 5.2 - 5.5 GHz.
  • All-core 256-bit AVX2 workloads seem to be 4.8 - 5.4 GHz.
  • All-core 512-bit AVX512 workloads seem to be 4.9 - 5.5 GHz.
Compute-intensive workloads with high IPC that saturate the execution units will result in the lower-end of the above speed ranges. Memory-bound workloads will result in the higher-end of the range since the core isn't really doing a whole lot.

Observations:
  1. The top speeds are all around 5.5 GHz regardless of the instruction width. This implies that the boost speeds are entirely power-dependent and not artificial (so none of Intel's AVX offset stuff).
  2. 512-bit AVX512 has slightly *higher* clock speeds than 256-bit AVX2 - which is ironic given the history with Intel's chips.
As mentioned earlier, 512-bit has half the front-end overhead (decoding, scheduling, etc...) as 256-bit for the same amount of computational work. So in the absence of front-end bottlenecks, 512-bit consumes less power and gives the chip more headroom to boost to higher speeds - thus a reversal of Intel's implementation.

That said, more investigation is needed before I can suggest reversing the "prefer 128-bit" recommendation for optimizing compilers. The results above are for the steady state. I have not analyzed the transient behavior of "sprinkling" wide vector instructions into otherwise scalar code. While 512-bit is no worse than 256-bit on Zen4, 256-bit is still worse than 128-bit and I have not ruled out the possibility of momentary clock-stretching or throttling that could negatively affect other things. Likewise, a similar situation could arise between 256-bit and 512-bit on a future AMD processor if AMD decides to widen to native 512-bit.

(*As a disclaimer, my test setup involves an engineering motherboard with a very immature BIOS. And I have reasons to believe that the clock speeds and boosting behavior that I am seeing may not exactly match the retail launch. However, the chip itself appears to be of retail stepping. So it should be the same as what is being sold now.)

Floating-Point:
Zen4 has the same floating-point units as Zen3: 2 FADD and 2 FMA all of which are 256-bit wide. Latencies appear to be unchanged as well.

While it is possible to sustain 2 x FADD and 2 x FMA every cycle, it requires sustaining 4 instructions/cycle - which as mentioned earlier is difficult to do for floating-point. This was the case since Zen2 and remains unchanged in Zen4.

AVX512 means you only need 2 x 512-bit each cycle to saturate the 4 pipes. This is easily achievable in real-world code. Not only that, the extra registers will help reduce register pressure.

Comparing with Intel's offerings, Zen4 is capable of sustaining 1 x 512-bit FADD + 1 x 512-bit FMA every cycle. No client-side Intel processor could do this prior to Alder Lake. And the only server ones that can do it are those with that 2nd FMA unit. So Zen4 is very competitive with Intel here despite having non-native AVX512.

The only place Intel remains ahead is raw FMA throughput on the server parts which can sustain 2 x 512-bit FMA/cycle thanks to the 2nd FMA unit. But realistically, there are very few non-synthetic applications that spam 100% FMA and aren't also completely memory-bound. (Level-3 BLAS is probably the only major one I can think of.) Most floating-point applications will have a healthy mix of FADD and FMUL/FMA.
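
A toy cycle count makes the mixed-workload point concrete. This is a sketch under simplifying assumptions, not a measurement: Zen4 is modeled as one 512-bit FADD plus one 512-bit FMA per cycle on separate pipes (as stated above), while the 2-FMA Intel server core is assumed to have two 512-bit pipes that each accept either an FADD or an FMA. Clock speeds, throttling, latencies and memory are all ignored.

```python
def cycles_zen4(n_fadd, n_fma):
    """Zen4 model: 1 x 512-bit FADD + 1 x 512-bit FMA per cycle, separate pipes."""
    return max(n_fadd, n_fma)

def cycles_two_fma(n_fadd, n_fma):
    """Assumed 2-FMA server core: 2 x 512-bit pipes, each taking FADD or FMA."""
    return (n_fadd + n_fma) / 2

workloads = {
    "100% FMA": (0, 1000),
    "50/50 FADD/FMA": (500, 500),
}

for name, (n_fadd, n_fma) in workloads.items():
    zen4 = cycles_zen4(n_fadd, n_fma)
    intel = cycles_two_fma(n_fadd, n_fma)
    print(f"{name}: Zen4 {zen4} cycles, 2-FMA core {intel} cycles")
```

Under these assumptions the 2-FMA core only pulls ahead on the pure FMA stream. With an even mix, both finish in the same number of cycles.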

(A bit of an editorial)

In my opinion, Intel's mistake with AVX512 is to optimize for the 100% FMA workloads (namely Linpack) instead of the more common mixed FADD/FMA workloads. Adders are cheap. Multipliers are expensive. One of each would do just fine for most workloads. Instead, Intel decided to add a 2nd FMA to Skylake X/SP... It is that 2nd FMA which caused most of the power/throttling issues that have tainted AVX512's reputation and hindered its adoption.

In fact, many workloads would run faster with no 2nd FMA (and no throttling) than *with* the 2nd FMA (and with throttling). This is arguably quite embarrassing for market segmentation and may explain why Intel's lower-end (1-FMA) server chips have large AVX512 offsets despite not really needing it. The lack of that 2nd FMA and its power issues may be why Ice Lake (Client) and Tiger Lake turned out pretty well. (Cannon Lake would be here too, but it failed for unrelated reasons.)

So with 20/20 hindsight, Intel should have implemented AVX512 in one of these ways (among many other possibilities):
  1. Instead of adding the 2nd FMA, add a 512-bit FADD instead. Intel finally added separate FADD hardware in Golden Cove (Alder Lake + Sapphire Rapids).
  2. Implement 512-bit as 2 x 256-bit first. This allows the world to get all the non-width related features. Then in the future, when the silicon allows for it and the demand calls for it, widen things up to 512-bit. Basically copy AMD's approach to both 256-bit AVX and 512-bit AVX512.

Shuffles/Permutes:
Intel has historically only had one shuffle unit until Ice Lake where they added a second one. (Or rather, they probably exposed the upper-half of the 512-bit shuffle so it can be used as 2 x 256.)

Zen3 has 2 shuffle units, one small and one big. Both are 256-bit wide. The small shuffle can only handle simple shuffles while the big one can handle all of them.

(For reference, "simple" shuffles would be stuff like vunpcklps and vshufps while "complex" shuffles would be all the 3-input, all-to-all, and 128-bit lane-crossing shuffles.)

Zen4 expands that "big" shuffle unit to a native 512-bit shuffle. This big shuffle is capable of running all shuffle instructions with 1/cycle throughput. Even the 512-bit all-to-all 3-input byte-permutes (vpermi2b/vpermt2b) are 1/cycle throughput. This is incredible because the silicon cost for permutes is (probably) quadratic to the granularity, and it appears Zen4 has paid that cost. By comparison, Intel's lower-granular permutes are quite slow, though steadily improving since Skylake.

Furthermore, this 512-bit shuffle can also operate as a pair of 256-bit shuffles - each capable of running the same complex shuffles. Thus there is a total of 3 shuffle pipes on Zen4.

So the per-cycle throughput is:
  • 3 x 256-bit simple shuffles - small + both halves of the big unit
  • 2 x 256-bit complex shuffles - both halves of the big unit
  • 1.5 x 512-bit simple shuffles - small + both halves of the big unit, but double-pumped
  • 1 x 512-bit complex shuffle - natively into the big unit
512-bit shuffles with data dependencies that cross between the lower and upper 256-bit halves go natively into the 512-bit shuffle. While the execution of the shuffle itself is not double-pumped (due to being a native 512-bit unit), the register inputs and outputs still are. Since one of the 256-bit halves is always one cycle behind the other (due to being double-pumped everywhere else in the processor), these 512-bit complex shuffles take an extra cycle of latency to wait for the slower half before execution. Thus all 512-bit complex shuffles have one cycle of extra latency over their 256-bit versions.
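
As a rough sketch, the per-cycle figures above fall out of simple pipe counting. This is only a simplified model that assumes three 256-bit shuffle slots per cycle as described; the names below are made up for illustration and nothing here is measured.

```python
from fractions import Fraction

# Assumed model from the list above: three 256-bit shuffle slots per cycle
# (the small pipe plus the two halves of the big unit).
SIMPLE_PIPES = 3   # small + both halves of the big unit
COMPLEX_PIPES = 2  # only the two halves of the big unit

def shuffles_per_cycle(width_bits, complex_shuffle):
    """Peak shuffles/cycle under this simplified slot-counting model."""
    pipes = COMPLEX_PIPES if complex_shuffle else SIMPLE_PIPES
    slots_needed = width_bits // 256  # a 512-bit op occupies two 256-bit slots
    return Fraction(pipes, slots_needed)

for width in (256, 512):
    for is_complex in (False, True):
        kind = "complex" if is_complex else "simple"
        print(width, kind, float(shuffles_per_cycle(width, is_complex)))
```

The model deliberately ignores latency and the scheduling bubbles discussed further down.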


Native 512-bit vs. double-pumped 2 x 256-bit?

How do I know that Zen4 has a native 512-bit shuffle as opposed to a pair of 256-bit ones which are double-pumped? I don't know for sure. But there is plenty of evidence pointing at the former:
  • 512-bit complex shuffles cannot be double-pumped the same way as everything else because of the cross-half dependencies.
  • I found no evidence of additional uops being generated to facilitate these cross-half dependencies if they were indeed double-pumped.
  • When mixing 512-bit and 256-bit complex shuffles in close proximity, the observed throughput falls well short of the theoretical limit if they were double-pumped onto separate 256-bit units. This implies that a lone 256-bit shuffle being issued into the native 512-bit shuffle will block a 512-bit shuffle for that cycle. The scheduler's ability to reorder 256-bit and 512-bit shuffles to fill these bubbles is limited. No such bubbles were observed when mixing 512-bit and 256-bit arithmetic instructions that do not use the shuffle unit.
Regardless of the physical design of the Zen4's shuffle unit, it far exceeds my expectations and beats Intel in every aspect. Shuffle quality is often the determining factor on whether to use full-width vs half-width when the hardware is "double-pumped." From Bulldozer to Zen1, the shuffles were poor. So 128-bit AVX codepaths were usually the best. Zen4 does it right.



AVX512 Mask Registers:
All Intel chips (Skylake through Alder Lake) have a single write port for mask registers. This limits everything that writes to masks to 1/cycle which is often a bottleneck in predicated branch-heavy code. Zen4 appears to have 2 mask write ports.

All mask ALU instructions on Zen4 are 1-cycle latency. By comparison, cross-lane mask instructions (kshift, kadd, etc...) are 4 cycles on Intel.
The throughput for all mask ALU instructions is 2/cycle except for 64-bit masks which are 1/cycle. Does this imply that 64-bit masks are also "double-pumped"?

So AMD beats Intel hands down here. In fact, Intel's implementation of the mask arithmetic is so bad that a lot of code (and compilers) prefer to pay the high cost of moving the masks to general purpose registers to do the work there, then move them back later. Because AMD's implementation is much faster, it will probably change these trade-offs. So if Intel's mask registers remain slow, compilers will be forced to pick whether to favor AMD or Intel.

Data Port Limitations:

As with previous Zen chips, there appear to be a limited number of data ports within the SIMD unit. Exceeding it will cause delays. All 4 of the SIMD pipes have 2 native data ports for a total of 8. Then there are 2 more data ports that are shared in some way for a grand total of 10 data ports.

While it's possible to sustain 2 x FADD + 2 x FMA every cycle (10 inputs), it is not possible to sustain 4 x ternlog (12 inputs) despite the latter being computationally much cheaper. Going to 512-bit does not bypass the data port limitation as this is a bandwidth limitation and not a front-end bottleneck.

For masked AVX512 instructions, the mask itself does not use a data port. But in the case of merge-masking, the destination register may become an extra input which does require a port.

So these two sequences run at full throughput (2 instructions/cycle):

Code:
vaddpd      zmm, zmm, zmm
vfmadd213pd zmm, zmm, zmm
(2 x 5 inputs)

Code:
vaddpd      zmm{k}{z}, zmm, zmm
vfmadd213pd zmm{k}{z}, zmm, zmm
(2 x 5 inputs)


But this runs at half throughput (1 instruction/cycle):

Code:
vaddpd      zmm{k}, zmm, zmm
vfmadd213pd zmm{k}, zmm, zmm
(2 x 6 inputs)

Intel chips do not have this limitation. They can run all 3 of the above
sequences at full throughput with no hazards. (2 instructions/cycle
for chips with both FMAs, 1 instruction/cycle for chips with one
FMA.)
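
The input counts in the annotations above can be reproduced with a small counting sketch. It assumes the budget of 10 data ports per cycle stated earlier and treats the "2 x" as two FADD + FMA pairs' worth of work per cycle. The function names are hypothetical, and the model only says whether a sequence fits the budget, not how fast it actually runs.

```python
PORT_BUDGET = 10  # 8 native + 2 shared data ports, per the text above

def register_inputs(sources, merge_masked, dest_is_source):
    """Count vector-register inputs for one instruction.

    The mask register itself is free. Merge-masking makes the destination
    an extra input unless it is already read as a source (as in an FMA).
    """
    extra = 1 if merge_masked and not dest_is_source else 0
    return sources + extra

def fadd_fma_inputs_per_cycle(merge_masked):
    fadd = register_inputs(2, merge_masked, dest_is_source=False)
    fma = register_inputs(3, merge_masked, dest_is_source=True)
    return 2 * (fadd + fma)  # the "2 x" in the annotations above

def fits_budget(inputs):
    return inputs <= PORT_BUDGET

# vpternlog reads all three operands, so 4 of them need 12 inputs.
ternlog_inputs = 4 * register_inputs(3, merge_masked=False, dest_is_source=True)

print(fadd_fma_inputs_per_cycle(False), fits_budget(fadd_fma_inputs_per_cycle(False)))
print(fadd_fma_inputs_per_cycle(True), fits_budget(fadd_fma_inputs_per_cycle(True)))
print(ternlog_inputs, fits_budget(ternlog_inputs))
```

Zero-masking ({k}{z}) counts the same as the unmasked case here because the old destination value is never read.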

Load/Store Unit:
I did not observe any throughput changes to the load/store units between Zen3 and Zen4. Vector load/stores remain unchanged at 2 x 256-bit load or 1 x 256-bit store per cycle with misalignment penalties. This translates to 1 x 512-bit load or 0.5 x 512-bit store per cycle and is arguably one of Zen4's bigger architectural weaknesses compared to Intel. All Intel chips capable of AVX512 can do 2 x 512-bit load or 1 x 512-bit store every cycle.

So developers will want to make good use of AVX512's extra registers to avoid what will be costly spills.

Vector 64x64 Integer Multiply (vpmullq):
This one instruction is probably my most surprising finding and deserves an entire section for it. This, along with the shuffle units, may be AMD and TSMC flexing their dark silicon capabilities.

vpmullq performs a 64-bit integer multiply on every 64-bit SIMD lane.
Naturally this is an expensive operation as the silicon cost grows
quadratically to the width of the multiplier. Intel's implementation
runs this through the double-precision 52-bit multiplier 3 times to
emulate it (3 uops). Thus vpmullq has 1/3 the throughput of the FMAs
and IFMAs.

Zen4's implementation is full-throughput - the same throughput as FMAs
and IFMAs. This implies that Zen4 has a 64-bit multiplier in every
SIMD lane. This is quite remarkable because vpmullq is not a commonly
used instruction, yet AMD is spending silicon budget to make it
fast anyway.

So if anyone wants to build a benchmark that draws as much power as
possible for a given clock speed, vpmullq may be a good candidate along
with vpermi2b/vpermt2b.
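
To illustrate why a narrower multiplier needs three passes, here is a sketch of the textbook decomposition of a low 64-bit product into 32-bit pieces. This is not a claim about how Intel's 52-bit datapath splits the operands internally. It only shows that three partial products are enough, and the function name is made up for illustration.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

def mullo64_from_32bit_parts(a, b):
    """Low 64 bits of a*b for one lane, built from three 32x32 products."""
    a_lo, a_hi = a & MASK32, (a >> 32) & MASK32
    b_lo, b_hi = b & MASK32, (b >> 32) & MASK32
    low = a_lo * b_lo                              # product 1
    cross = (a_lo * b_hi + a_hi * b_lo) & MASK32   # products 2 and 3
    # a_hi * b_hi would land entirely above bit 63, so it is never needed.
    return (low + (cross << 32)) & MASK64

a, b = 0x123456789ABCDEF0, 0x0FEDCBA987654321
print(hex(mullo64_from_32bit_parts(a, b)), hex((a * b) & MASK64))
```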

AVX512 Conflict Detection:
The conflict detection instructions (vpconflictd/vpconflictq) are very
fast. By comparison, Intel microcodes them on all but their Xeon Phi
line. However, these instructions don't follow the usual pattern of
the other vector instructions, likely due to their unusual cross-lane
dependencies.
  • 128-bit Wide: 2 lat / 0.50 TP
  • 256-bit Wide: 6 lat / 0.78 TP
  • 512-bit Wide: 6 lat / 1.33 TP
These numbers are hard to explain so I'm not even going to try.

As far as silicon costs go, this looks very expensive as it probably
needs the full O(N^2) number of comparisons. So likely another case of
dark silicon flexing.
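
For reference, the quadratic cost comes straight from what the instruction computes: every element is compared against all earlier elements. A scalar sketch of that behavior (the function name is only for illustration):

```python
def vpconflict(elements):
    """Scalar model of vpconflictd/vpconflictq.

    Result lane i has bit j set when j < i and elements[j] == elements[i].
    """
    result = []
    for i, value in enumerate(elements):
        bits = 0
        for j in range(i):  # compare against every earlier lane
            if elements[j] == value:
                bits |= 1 << j
        result.append(bits)
    return result

print(vpconflict([5, 7, 5, 5]))
```

A 512-bit vpconflictd has 16 lanes, so this nested loop performs 16 * 15 / 2 = 120 comparisons that hardware has to do more or less at once.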

AVX512 Compress Store:
vpcompressd (and family) with a memory destination is microcoded. I measured 142 cycles/instruction for "vpcompressd [mem]{k}, zmm".
The in-register version is fast, as are the corresponding expand load
instructions (with or without memory operand). On Intel, these
compress stores are not microcoded and are fast. So this is specific
to the memory destination version on Zen4.

This is puzzling because the instruction can be emulated with:

Code:
    vpcompressd zmm0{k1}{z}, zmm0
    kmovd       eax, k1
    pext        eax, -1, eax     (the -1 in a register of course)
    kmovd       k1, eax
    vmovdqu32   [mem]{k1}, zmm0
... which isn't nearly as bad. Why wasn't it microcoded to this sequence? Did I overlook something?

Alternatively, it can also be emulated as:

Code:
    vpcompressd zmm0{k1}{z}, zmm0
    vpcompressd zmm1{k1}{z}, [-1]      (constant of all 1s)
    vpcmpd      k1, zmm1, [-1], 0      (constant of all 1s), compare for equality
    vmovdqu32   [mem]{k1}, zmm0
Microcoding this instruction isn't too shocking given that the masking behavior doesn't fit any of the usual instruction classes. But what is less clear is why it is microcoded so poorly. In any case, compilers tuning for Zen4 should probably tear apart the intrinsics for these and replace them with these sequences instead.
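
A scalar sketch of why the first emulation is equivalent to the real compress store: compress in-register with zeroing, build a mask with popcount(k) low bits set (what the pext on all-ones produces), then do a masked store. Memory and lanes are modeled as plain Python lists and the function names are hypothetical.

```python
def compress_store(memory, mask, lanes):
    """Reference: selected lanes are stored contiguously, rest of memory untouched."""
    out = list(memory)
    selected = [x for i, x in enumerate(lanes) if (mask >> i) & 1]
    out[:len(selected)] = selected
    return out

def emulated_compress_store(memory, mask, lanes):
    n = len(lanes)
    # vpcompressd zmm0{k1}{z}, zmm0 : pack selected lanes, zero the rest
    selected = [x for i, x in enumerate(lanes) if (mask >> i) & 1]
    packed = selected + [0] * (n - len(selected))
    # kmovd / pext / kmovd : mask with popcount(mask) low bits set
    store_mask = (1 << bin(mask).count("1")) - 1
    # vmovdqu32 [mem]{k1}, zmm0 : masked store of the packed lanes
    out = list(memory)
    for i in range(n):
        if (store_mask >> i) & 1:
            out[i] = packed[i]
    return out

lanes = [10, 11, 12, 13, 14, 15, 16, 17]
print(emulated_compress_store([-1] * 8, 0b10100110, lanes))
```

The sketch only models the stored values; it says nothing about fault behavior on the masked-out part of the destination.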

SIMD Resource Sharing:
  • The integer multiply shares its multipliers with the FMUL/FMA. (No surprise here.) It is not possible to sustain more than 2 multiply instructions/cycle no matter the type.
  • The integer shift shares resources with the FADD (likely the barrel shifters?). It is not possible to sustain 2 x FADD + 2 x shift per cycle.
  • Integer shift and FMA do not share resources. It is possible to sustain 2 x SHIFT + 2 x FMA per cycle.
  • The "small" shuffle and the "big" shuffle pipe are shared with the two FMA units. The throughput is 0.75 x 512-bit FMA + 0.75 x 512-bit shuffle per cycle regardless of shuffle type.
  • The "big" shuffle pipe is shared with one of the IMUL pipes. The throughput is 1 x 512-bit IMUL + 1 x 512-bit (simple) shuffle per cycle, or 0.75 x 512-bit IMUL + 0.75 x 512-bit (complex) shuffle.
After trying various sequences including unrealistic ones, I was never able to find one that could saturate both the big shuffle and multipliers simultaneously. It appears that the port assignments are set up in a way to prevent this. One could speculate that this is intentional as a way to enforce dark silicon. Since the shuffle and multipliers are likely some of the biggest and most power-hungry components on the SIMD unit, limiting their utilization puts a limit on how much active silicon there is at any given time - lest we run into the thermal issues of Intel's early AVX512 chips.

Slow on both Intel and AMD:

There are a number of things that are really slow on Intel processors which are also slow on AMD's implementation in Zen4. I didn't expect these to be fast on Zen4, but it was at least worth checking to see if AMD pulled any sort of magic.

Gather/Scatter: No surprise here as these are inherently difficult to implement. Zen4's gather/scatters are quite a bit slower than Intel's - probably owing to its weaker load/store unit. So it's still best to avoid these when possible and use shuffles when the indices are compile-time constants.

Fault Suppression: Fault-suppression of masked out lanes incurs a significant penalty. I measured about 256 cycles/load and 355 cycles/store if a masked out lane faults. These are slightly better than Intel's Tiger Lake, but still very bad. So it's still best to avoid liberally reading/writing past the end of buffers or alternatively, allocate extra buffer space to ensure that accessing out-of-bounds does not fault.

Physical Register File:
My best estimate of Zen4's physical vector register file is ~192 x 512-bit with an uncertainty of +/- 16 on the count.
  1. Direct measurements of the reorder window give the same value for 256-bit and 512-bit instructions.
  2. ZMM registers can be renamed to the same degree as XMM and YMM registers. Thus suggesting that ZMM registers are not split in the register file.
  3. The 192 is inferred based on Zen4 having a measured reorder window that is 32 larger than Zen3. But this is just an estimate since it is unknown how many registers the core reserves internally.
While the register file is 512-bit in width, the datapaths probably remain 256-bit where ZMM reads/writes to the register file occupy consecutive cycles to deliver each half.

By comparison, Zen1 maps 256-bit registers into two entries in a 128-bit register file - resulting in 256-bit having half the reorder window of 128-bit code. Zen1 also cannot rename YMM registers due to them being split.

On Intel, the vector register files have also been native 512-bit since Skylake. Skylake client, which shares the same core layout as Skylake Server yet lacks AVX512, has a block of unused silicon in its die shots which is speculated to be the (unused) upper halves of the 512-bit register file.

So for full comparison:
  • Skylake X + Cannon Lake: 168 x 512-bit
  • Ice Lake + Tiger Lake: 224 x 512-bit
  • Alder Lake + Sapphire Rapids: 332 x 512-bit
  • Zen1: 160 x 128-bit
  • Zen2+3: 160 x 256-bit
  • Zen4: 192? x 512-bit
Tests on long latency sequences mostly mirror these register file sizes. So Zen4 remains behind the latest Intel chips here. I did notice that 256-bit SIMD seems to have slightly better reorder capability than 512-bit for long latency code. This does not happen on Intel processors. I have no explanation for this so more investigation is needed.

Other:
rdrand is much faster than Zen3. But rdseed got much slower.

All SIMD register widths can be move-renamed. This includes the 512-bit ZMM registers.
There are a lot of special cases (like "vxorpd reg, reg, reg") that have lower latencies and higher throughput since they are handled by the front-end and avoid the execution units. These apply equally to all SIMD registers - including ZMM.

I found no surprises among all the new AVX512 extensions. Performance is similar to Intel's implementations.

Some uncommon instructions have gotten slower. For example the min/max (vminpd/vmaxpd) instructions increased from 1 to 2 cycle latency. Throughput remains the same.

Overall, AMD's AVX512 implementation beat my expectations. I was expecting something similar to Zen1's "double-pumping" of AVX with half the register file and cross-lane instructions being super slow. But this is not the case on Zen4. The lack of power or thermal issues combined with stellar shuffle support makes it completely worthwhile to use from a developer standpoint. If your code can vectorize without excessive wasted computation, then go all the way to 512-bit. AMD not only made this worthwhile, but *incentivizes* it with the power savings. And if in the future AMD decides to widen things up, you may get a 2x speedup for free.

Personally, I'm super excited about Zen4 and am very much looking forward to a new build. It has the latest AVX512 and it blows my Skylake 7940X out of the water in code compilation. So once the platform matures a bit more, I plan on upgrading my main coding/gaming machine to Zen4 - either by scavenging parts from this test rig or something new with completely new hardware.

AM5 is a new platform so I'm sure there will be plenty of early adopter kinks to be sorted out. So I'll be waiting a bit for things to stabilize before I open my wallet. At the very least I want to see all the motherboard options since the current ones are somewhat absurdly priced.
 

Admin

Administrator
Moderator
Messages
3,738
#24
ASUS X670 prices will be very high
That's a shame, i was interested in the X670E gene

https://appuals.com/asus-x670-motherboard-450-start/

That's 3 motherboards that will cost more than the 7950x. There are cheaper options but then you lose out on connectivity/features. The price of the ProArt X670E-CREATOR WIFI isn't listed above but it will probably not be much cheaper than the gene and i hate how it looks.

https://respawnfirst.com/asus-x670e-x670-motherboard-prices-listed-for-up-to-1475-euros/

MSI does have a great looking board that is actually affordable, might be worth considering to save hundreds of $ in exchange for losing out on features and connectivity:
msi.png
 

Admin

Administrator
Moderator
Messages
3,738
#25
Will ryzen 7000 have a pluton backdoor?
I have tried finding information regarding this but i didn't find much

The Ryzen 7000 also inherits many of the basic security features of the Ryzen 6000 Mobile platform. AMD is still using their own Arm-based security processor within the IOD. And the new chip is compliant with Microsoft’s Pluton initiative as well – with all the mixed responses that will undoubtedly come from that.
anandtech.com/show/17585/amd-zen-4-ryzen-9-7950x-and-ryzen-5-7600x-review-retaking-the-high-end/6

This was supposedly turned off by default in lenovo laptops using ryzen 6000 series.
Pluton is disabled by default on 2022 Lenovo ThinkPad laptops using AMD Ryzen PRO 6000 Series processors because that’s what Lenovo customers have asked for
pcworld.com/article/621767/why-the-biggest-laptop-vendors-are-ignoring-microsofts-pluton-security-tech.html
 

Admin

Administrator
Moderator
Messages
3,738
#26
Raptor lake prices have leaked
Unsurprisingly it turns out that it will be much cheaper to go the intel route, especially if you don't need a lot of multi-threaded performance.

guru3d.com/news-story/intel-13th-gen-core-raptor-lake-cpu-pricing-leaked-by-newegg.html

13900KF: 629$
13700KF: 429$
13600KF: 309$

Later we got another leak showing even lower prices

13900K: 589$
13700K: 409$
13600K: 319$

The verge also had prices for the cheaper and better KF models

theverge.com/2022/9/27/23374386/intel-13th-gen-processors-release-date-price-raptor-lake

13900KF: 564$
13700KF: 384$
13600KF: 294$
 

Admin

Administrator
Moderator
Messages
3,738
#27
Raptor lake announced
Intel has promised much better performance for the same W, they go so far as claiming that they can match the last-gen performance while only drawing 65W, that's a huge performance uplift.


Intel did promise support for pci express 5 without going into much detail regarding it. Some z790 boards will have a single gen5 nvme slot but i didn't find any with more than one.


Unlike AMD they actually compared against the 5800x3d showing better overall performance.
 

Admin

Administrator
Moderator
Messages
3,738
#28
How much is the 7950x performance ruined by the chiplet approach?
7950x has twice the cache the 7700x has but still the performance isn't better in gaming due to the cache being divided in 2 chiplets separated by the slow infinity fabric.
7700x.png


https://www.youtube.com/watch?v=-P_iii5si40

This is despite the 7950x having much higher TDP and also higher boost-clock. You are actually losing significant gaming performance by going for the 7950x or 7900x, especially at the same frequency.

The 5800x3d has 16% better 1% lows than 5800x despite the 5800x running 4.7% to 5.85% higher frequency.

5800x3d has 3 times the cache. A monolithic 7950x would instead just have double the effective cache giving an ipc gain of around 16% over the 7700x.

ipc uplift from having all cores share 64MiB cache on the same die* (7950x, 1% lows): 19.4 %

With the higher clocks 7950x has the actual gaming performance uplift would be 21% to 25% just by having all cores on the same die (while having a separate IO die), now we end up having to wait months for much more expensive zen4 3d chips instead.

It's unclear how much having a separate io die negatively affects performance, we could get an idea by infinity fabric overclocking 7700x but i have not seen these results yet. We can be fairly certain that the total performance loss due to not being monolithic is over 20%, that's a massive performance loss and far greater than the small cost saving the chiplet approach brings to AMD (defective large dies can usually be salvaged by disabling cores/cache and selling as a cheaper model).

* It is worth noting that a monolithic design alone isn't enough to get a big performance improvement. You also need all the cores to be able to access all the L3 cache with low latency to get the full benefit from it, this is a big reason for why zen3 is better than zen2 (zen3 had properly unified L3 cache for each chiplet).
 

Admin

Administrator
Moderator
Messages
3,738
#29
Raptor lake pricing (swedish store)
This is from inet.se which is not really a low price store.

1664446438343.png


Currently 1$=11.3 sek which gives (before 25% VAT)

13900KF: 6552 sek = 580$
13700KF: 4632 sek = 410$
13600KF: 3512 sek = 311$
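
The conversion above is just a division by the exchange rate. A quick sketch using the 11.3 sek/$ rate from this post (the SEK figures are the pre-VAT prices listed above, and the function name is only for illustration):

```python
SEK_PER_USD = 11.3  # exchange rate quoted in the post

def sek_to_usd(sek):
    """Convert a SEK price to USD, rounded to the nearest dollar."""
    return round(sek / SEK_PER_USD)

prices_sek = {"13900KF": 6552, "13700KF": 4632, "13600KF": 3512}
for model, sek in prices_sek.items():
    print(f"{model}: {sek} sek = {sek_to_usd(sek)}$")
```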
 

Admin

Administrator
Moderator
Messages
3,738
#30
AMD fanboys coping hard
They will have a hard time defending a product that is significantly more expensive than the upcoming competition (13900KF) while not delivering more performance. AMD is even harder to justify for people who have existing DDR4 ram and people looking to buy cheaper models (such as the 13700KF).

AMDfanboys.png


https://archive.ph/NktQd
 

Admin

Administrator
Moderator
Messages
3,738
#31
Z790 motherboards
Unfortunately raptor lake is limited to just 16 gen5 pci express lanes. The issue is that all these 16 are typically reserved for the GPU resulting in 0 left for gen5 NVME storage.
  • PCIe Gen 5.0 support, with as many as 16 lanes off the processor.
https://www.intel.com/content/www/us/en/newsroom/news/13th-gen-core-launch.html#gs.dghyb9

The motherboards that do have gen5 nvme support typically only have a single gen5 slot that will disable 8 GPU pci express lanes when activated (even though these lanes could have been used for 2 gen5 slots).

For the ASUS ROG STRIX Z790-E GAMING WIFI:
1664472016330.png


The listing for the following msi board does not mention any such limitation, but it probably still applies and msi just left it out.

https://www.newegg.com/p/N82E16813144563

Cheap asus motherboard lacking gen5 nvme slot

https://www.newegg.com/asus-prime-z790-p-wifi/p/N82E16813119603

For GIGABYTE Z790 AORUS MASTER

1664476091707.png
 

Admin

Administrator
Moderator
Messages
3,738
#32
no igpu tho. igpu amogs dgpu
You get igpu both with zen4 and raptor lake K (not KF) cpus.

You need to pay around 25$ more to get an integrated GPU with intel.

I don't think you should pay extra for an integrated GPU if you already have more than 1 discrete GPU you can use (such as an old really bad one). It's mostly valuable if you just want a working system and do not have a GPU yet (such as due to waiting for RTX 4090) but most people probably have some old GPU at home they can use while waiting for their NEW GPU (or if there is some issue with the computer).
 
Messages
84
#33
You get igpu both with zen4 and raptor lake K (not KF) cpus.

You need to pay around 25$ more to get an integrated GPU with intel.

I don't think you should pay extra for an integrated GPU if you already have more than 1 discrete GPU you can use (such as an old really bad one).
dgpu requires gpu passthrough which is a big nono for security
 

Admin

Administrator
Moderator
Messages
3,738
#35
Don't let AMD gaslight you into thinking "95° C is OK"
It's absolutely not OK. A 225W CPU shouldn't run that hot. The issue is that the IHS for zen4 CPUs is awful combined with high thermal density for the CPU, this adds around 20° C to the CPU temperature.


Having it around 95° C would however be ok if you increased the power budget to something like 400W, but at a 225W power limit it should be cooler than 95° C when you have a good air-cooler (such as alphacool as500 plus).
 