So from CDNA3 to 4 they doubled fp16 and fp8 performance but cut fp32 and fp64 by half?
Wonder why the regression on non-AI workloads?
Because those who have money to invest nowadays do not put it into the research problems whose solutions are urgently needed for the survival of humanity, e.g. developing technologies for using all substances in closed cycles (as the biosphere did before humans). Instead, they invest it all in the dream of developing AGI, which even if successful will benefit only a small number of humans, not all of mankind.
fp64 and fp32 performance is needed for the physical simulations required by the former goal, while fp16 and fp8 performance is useful only for the latter.
So AMD's choice logically follows the choice of those who control the investment money.
> The fp64 and fp32 performance is needed for physical simulations
In the very unlikely case where
1) You need fp64 Matrix-Matrix products for physical simulations
2) You bought the MI355X accelerator instead of hardware better suited for the task
you can still emulate it with the Ozaki scheme.
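To make the emulation idea concrete, here is a toy numpy sketch of the principle behind Ozaki-style splitting: decompose each fp64 matrix into fp32-representable slices and sum the partial products. This is a simplified two-term split with the partial GEMMs computed in fp64 for clarity; the actual scheme truncates the slices so that every partial product is exact on the low-precision matrix units. All names below are illustrative.

```python
import numpy as np

def split_fp32(a):
    # two-term split: hi carries the leading mantissa bits, lo the residual
    hi = a.astype(np.float32).astype(np.float64)
    lo = (a - hi).astype(np.float32).astype(np.float64)
    return hi, lo

def gemm_emulated(a, b):
    # sum of partial GEMMs over fp32-representable slices; in real
    # Ozaki-scheme use each partial GEMM would run on the low-precision
    # matrix units, here numpy in fp64 stands in for simplicity
    a_hi, a_lo = split_fp32(a)
    b_hi, b_lo = split_fp32(b)
    return a_hi @ b_hi + a_hi @ b_lo + a_lo @ b_hi + a_lo @ b_lo

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64))
b = rng.standard_normal((64, 64))

exact = a @ b
plain_fp32 = (a.astype(np.float32) @ b.astype(np.float32)).astype(np.float64)
emulated = gemm_emulated(a, b)
```

The emulated result recovers far more accuracy than a straight fp32 GEMM, at the cost of several low-precision GEMM calls per fp64 product.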
Expanding (I think) on your point: perhaps it's just a fork into two product lines for different uses?
Will there be future hardware optimized for physical simulations, or should existing/faster hardware be stockpiled now?
Because of area and power.
Area and power are why there was a choice to make. AI data centre demand is why they made this choice specifically.
Non-AI workloads prefer vector units and not matrix units
>> Non-AI workloads prefer vector units and not matrix units
FEA and other "scientific" workloads are all matrix math. This is why supercomputers have been benchmarked using BLAS and LAPACK for the past 40 years. OTOH, are those matrix × vector operations, whereas AI is matrix × matrix?
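The matrix × vector vs matrix × matrix distinction matters mostly through arithmetic intensity, which determines whether dedicated matrix units can even be kept fed. A back-of-the-envelope sketch, with illustrative fp64 numbers and no cache modeling:

```python
# toy arithmetic-intensity estimates in flops per byte (fp64 operands);
# these formulas are a simplification, not from any vendor document

def gemv_intensity(n: int) -> float:
    # y = A x: 2*n^2 flops; A (n^2 elements) plus x and y (2n) streamed once
    flops = 2.0 * n * n
    bytes_moved = 8.0 * (n * n + 2 * n)
    return flops / bytes_moved

def gemm_intensity(n: int) -> float:
    # C = A B, all n x n: 2*n^3 flops over only 3*n^2 matrix elements
    flops = 2.0 * n ** 3
    bytes_moved = 8.0 * 3 * n * n
    return flops / bytes_moved
```

GEMV stays memory-bound at roughly 0.25 flop/byte no matter how large n gets, while GEMM intensity grows like n/12, which is why matrix units pay off for AI-style matrix × matrix work but not for matrix × vector kernels.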
Either way, it's a regression, which seems strange.
Nvidia's B200 did the same. A lot of FEA codes go explicit (matrix-free) because the scaling is better.
Also, look up Ozaki-scheme algorithms.
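A minimal illustration of what "matrix-free" means here: the operator is applied as a stencil without ever assembling a matrix, turning the kernel into vector-unit-friendly streaming work rather than a GEMM. This is a toy 1-D Laplacian sketch, not taken from any particular FEA code:

```python
import numpy as np

def laplacian_apply(u):
    # matrix-free apply of the 1-D Laplacian stencil [1, -2, 1]
    # with zero Dirichlet boundaries: no matrix is ever stored
    out = -2.0 * u
    out[1:] += u[:-1]
    out[:-1] += u[1:]
    return out

# the assembled equivalent, built here only to check the stencil against
n = 16
A = -2.0 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)
u = np.random.default_rng(1).standard_normal(n)
```

The matrix-free apply touches O(n) memory per step instead of O(n²), which is the scaling argument for going explicit.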
If AMD were serious they would show a fully worked-out GEMM, not just "here is our theoretical performance, this is the instruction to use".
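A fully worked-out example need not be elaborate: even a toy harness that reports achieved throughput against the advertised peak would be more convincing than an instruction listing. A CPU-side sketch (the peak figure is a placeholder to fill in from the datasheet, not a real number):

```python
import time
import numpy as np

n = 1024
rng = np.random.default_rng(0)
a = rng.standard_normal((n, n)).astype(np.float32)
b = rng.standard_normal((n, n)).astype(np.float32)

a @ b  # warm-up so the timed run isn't paying one-time costs

t0 = time.perf_counter()
c = a @ b
elapsed = time.perf_counter() - t0

# an n x n GEMM performs 2*n^3 flops
achieved_gflops = 2.0 * n ** 3 / elapsed / 1e9
# peak_gflops = ...  # placeholder: the advertised number for your hardware
print(f"achieved: {achieved_gflops:.1f} GFLOP/s")
```

The interesting number is the ratio of achieved to peak, which is exactly what a theoretical-performance slide leaves out.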