So from CDNA3 to 4 they doubled fp16 and fp8 performance but cut fp32 and fp64 by half?
Wonder why the regression on non-AI workloads?
Because those who have money to invest nowadays do not put it into the research problems whose solutions are urgently needed for the survival of humanity, e.g. developing technologies for using all substances in closed cycles (as the biosphere did before humans). Instead, they invest it all in the dream of developing AGI, which even if successful will benefit only a small number of humans, not all of mankind.
fp64 and fp32 performance is needed for the physical simulations required by the former goal, while fp16 and fp8 performance is useful only for the latter.
So AMD's choice logically follows the choice of those who control the investment money.
> The fp64 and fp32 performance is needed for physical simulations
In the very unlikely case where
1) You need fp64 Matrix-Matrix products for physical simulations
2) You bought the MI355X accelerator instead of hardware better suited for the task
you can still emulate it with the Ozaki scheme.
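To make the emulation idea concrete, here is a toy numpy sketch of the principle behind Ozaki-style splitting: decompose each fp64 matrix into fp32-representable slices and sum the partial products. This is a simplified two-term split with the partial GEMMs computed in fp64 for clarity; the actual scheme truncates the slices so that every partial product is exact on the low-precision matrix units. All names below are illustrative.

```python
import numpy as np

def split_fp32(a):
    # two-term split: hi carries the leading mantissa bits, lo the residual
    hi = a.astype(np.float32).astype(np.float64)
    lo = (a - hi).astype(np.float32).astype(np.float64)
    return hi, lo

def gemm_emulated(a, b):
    # sum of partial GEMMs over fp32-representable slices; in real
    # Ozaki-scheme use each partial GEMM would run on the low-precision
    # matrix units, here numpy in fp64 stands in for simplicity
    a_hi, a_lo = split_fp32(a)
    b_hi, b_lo = split_fp32(b)
    return a_hi @ b_hi + a_hi @ b_lo + a_lo @ b_hi + a_lo @ b_lo

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64))
b = rng.standard_normal((64, 64))

exact = a @ b
plain_fp32 = (a.astype(np.float32) @ b.astype(np.float32)).astype(np.float64)
emulated = gemm_emulated(a, b)
```

The emulated result recovers far more accuracy than a straight fp32 GEMM, at the cost of several low-precision GEMM calls per fp64 product.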
Expanding (I think) on your point: perhaps it's just a fork into two product lines for different uses?
Will there be future hardware optimized for physical simulations, or should existing/faster hardware be stockpiled now?
Because of area and power.
Area and power are why there was a choice to make. AI data centre demand is why they made this choice specifically.
Non-AI workloads prefer vector units and not matrix units
>> Non-AI workloads prefer vector units and not matrix units
FEA and other "scientific" workloads are all matrix math. This is why supercomputers have been benchmarked using BLAS and LAPACK for the past 40 years. OTOH, are those matrix × vector operations, whereas AI is matrix × matrix?
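The matrix × vector vs matrix × matrix distinction matters mostly through arithmetic intensity, which determines whether dedicated matrix units can even be kept fed. A back-of-the-envelope sketch, with illustrative fp64 numbers and no cache modeling:

```python
# toy arithmetic-intensity estimates in flops per byte (fp64 operands);
# these formulas are a simplification, not from any vendor document

def gemv_intensity(n: int) -> float:
    # y = A x: 2*n^2 flops; A (n^2 elements) plus x and y (2n) streamed once
    flops = 2.0 * n * n
    bytes_moved = 8.0 * (n * n + 2 * n)
    return flops / bytes_moved

def gemm_intensity(n: int) -> float:
    # C = A B, all n x n: 2*n^3 flops over only 3*n^2 matrix elements
    flops = 2.0 * n ** 3
    bytes_moved = 8.0 * 3 * n * n
    return flops / bytes_moved
```

GEMV stays memory-bound at roughly 0.25 flop/byte no matter how large n gets, while GEMM intensity grows like n/12, which is why matrix units pay off for AI-style matrix × matrix work but not for matrix × vector kernels.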
Either way, it's a regression, which seems strange.
Nvidia's B200 did the same. A lot of FEA codes go explicit (matrix-free) because the scaling is better.
Also, look up Ozaki-scheme algorithms.
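A minimal illustration of what "matrix-free" means here: the operator is applied as a stencil without ever assembling a matrix, turning the kernel into vector-unit-friendly streaming work rather than a GEMM. This is a toy 1-D Laplacian sketch, not taken from any particular FEA code:

```python
import numpy as np

def laplacian_apply(u):
    # matrix-free apply of the 1-D Laplacian stencil [1, -2, 1]
    # with zero Dirichlet boundaries: no matrix is ever stored
    out = -2.0 * u
    out[1:] += u[:-1]
    out[:-1] += u[1:]
    return out

# the assembled equivalent, built here only to check the stencil against
n = 16
A = -2.0 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)
u = np.random.default_rng(1).standard_normal(n)
```

The matrix-free apply touches O(n) memory per step instead of O(n²), which is the scaling argument for going explicit.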
If AMD were serious they would show a fully worked-out GEMM, not just "here is our theoretical performance, this is the instruction to use".
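A fully worked-out example need not be elaborate: even a toy harness that reports achieved throughput against the advertised peak would be more convincing than an instruction listing. A CPU-side sketch (the peak figure is a placeholder to fill in from the datasheet, not a real number):

```python
import time
import numpy as np

n = 1024
rng = np.random.default_rng(0)
a = rng.standard_normal((n, n)).astype(np.float32)
b = rng.standard_normal((n, n)).astype(np.float32)

a @ b  # warm-up so the timed run isn't paying one-time costs

t0 = time.perf_counter()
c = a @ b
elapsed = time.perf_counter() - t0

# an n x n GEMM performs 2*n^3 flops
achieved_gflops = 2.0 * n ** 3 / elapsed / 1e9
# peak_gflops = ...  # placeholder: the advertised number for your hardware
print(f"achieved: {achieved_gflops:.1f} GFLOP/s")
```

The interesting number is the ratio of achieved to peak, which is exactly what a theoretical-performance slide leaves out.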