Yeah, the cache hierarchy is behaving kinda wonky lately. Many AI workloads (and that’s what’s driving development lately) are constrained by bandwidth, and cache only helps with part of that. Cache helps with repeated access, not so much with streaming access to datasets much larger than the cache (e.g. many current AI models).
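To make the repeated-vs-streaming point concrete, here is a toy sketch. It assumes a plain LRU replacement policy, and the cache size and working-set sizes are made up; real caches are set-associative and have prefetchers, so treat the numbers as illustrative only.

```python
from collections import OrderedDict


def lru_hit_rate(accesses, cache_lines):
    """Replay an access trace against a small LRU cache and return the hit rate."""
    cache = OrderedDict()
    hits = 0
    total = 0
    for addr in accesses:
        total += 1
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)  # mark as most recently used
        else:
            cache[addr] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return hits / total if total else 0.0


CACHE_LINES = 64  # hypothetical cache capacity, in lines

# Repeated access: 100 passes over a 32-line working set that fits in the cache.
repeated = [line for _ in range(100) for line in range(32)]

# Streaming access: 100 passes over 1024 lines, far more than the cache holds.
streaming = [line for _ in range(100) for line in range(1024)]

repeated_rate = lru_hit_rate(repeated, CACHE_LINES)
streaming_rate = lru_hit_rate(streaming, CACHE_LINES)
print(f"repeated:  {repeated_rate:.2f}")
print(f"streaming: {streaming_rate:.2f}")
```

In this model the small working set only misses on its first pass, while the cyclic scan over the big one evicts every line before it comes around again, so the cache contributes nothing.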
Intel already tried selling CPUs with both on-package HBM and slotted DDR RAM. No one wanted it, as the performance gains of the expensive HBM evaporated completely as soon as you touched memory out-of-package. (Assuming workloads bound by memory bandwidth, which currently dominate the compute market.)
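A back-of-the-envelope sketch of why mixing tiers hurts. It assumes the two tiers are read one after the other with no overlap, so time per byte simply adds up; the 1000 GB/s and 250 GB/s figures are hypothetical placeholders, not measurements of any real part.

```python
def effective_bandwidth(fast_gb_s, slow_gb_s, fraction_fast):
    """Bandwidth seen by a stream when `fraction_fast` of its bytes come from
    the fast tier and the rest from the slow tier (serialized, no overlap)."""
    seconds_per_gb = fraction_fast / fast_gb_s + (1.0 - fraction_fast) / slow_gb_s
    return 1.0 / seconds_per_gb


FAST_GB_S = 1000.0  # hypothetical on-package bandwidth
SLOW_GB_S = 250.0   # hypothetical slotted-DIMM bandwidth

for fraction in (1.0, 0.9, 0.5, 0.0):
    bw = effective_bandwidth(FAST_GB_S, SLOW_GB_S, fraction)
    print(f"{fraction:.0%} of bytes from fast tier: {bw:.0f} GB/s")
```

Under these assumptions, sending half of the traffic to the slow tier already drops you from 1000 to 400 GB/s, much closer to the slow tier than to the fast one.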
To get good performance out of that, you may need to explicitly code the memory transfers to enable prefetch (preferably asynchronous) from the slower memory into the faster, à la classic GPU programming. YMMV.
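Roughly what that explicit prefetch looks like in structure, as a double-buffering sketch. `load_chunk` and `process_chunk` are hypothetical stand-ins for "copy from slow tier into a fast buffer" and "do the actual compute"; in Python the thread pool only shows the shape of the overlap, it won't give you real bandwidth gains.

```python
from concurrent.futures import ThreadPoolExecutor


def load_chunk(slow_memory, start, size):
    """Hypothetical stand-in for copying a chunk from the slow tier into a fast buffer."""
    return list(slow_memory[start:start + size])


def process_chunk(chunk):
    """Hypothetical stand-in for the compute step that runs on fast memory."""
    return sum(chunk)


def streamed_sum(slow_memory, chunk_size):
    """Process chunk N while chunk N+1 is already being fetched in the background."""
    total = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        for start in range(0, len(slow_memory), chunk_size):
            # Kick off the next transfer before touching the current data.
            upcoming = pool.submit(load_chunk, slow_memory, start, chunk_size)
            if pending is not None:
                total += process_chunk(pending.result())
            pending = upcoming
        if pending is not None:
            # Drain the last chunk that is still in flight.
            total += process_chunk(pending.result())
    return total


data = list(range(1000))
print(streamed_sum(data, chunk_size=64))
```

The point is the ordering: the fetch for the next chunk is issued before the current one is processed, so transfer and compute can overlap if the hardware lets them.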
I wasn’t really thinking of HPC but of my next gaming rig, TBH. The OS can move frequently accessed pages into faster RAM just as it can move busy threads to faster cores, gaining you some fps a second or two after alt-tabbing back to the game after messing around with Firefox. If it weren’t for memory controllers generally driving all channels at the same speed, that could already be a thing right now. It definitely already was a thing back in the days of swapping out to spinning platters.
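A minimal sketch of the kind of policy I mean, assuming the OS can count accesses per page over some sampling window and that the fast tier holds a fixed number of pages. The function names, page numbers, and capacity are all made up for illustration; a real kernel would use much cheaper approximations than exact counts.

```python
from collections import Counter


def pick_fast_pages(access_trace, fast_capacity):
    """Choose the most frequently accessed pages to live in the fast tier."""
    counts = Counter(access_trace)
    return {page for page, _ in counts.most_common(fast_capacity)}


def fast_tier_hit_fraction(access_trace, fast_pages):
    """Fraction of accesses in the trace that land in the fast tier."""
    if not access_trace:
        return 0.0
    hits = sum(1 for page in access_trace if page in fast_pages)
    return hits / len(access_trace)


# Hypothetical sampling window right after alt-tabbing back to the game:
# game pages 0..3 are hammered, browser pages 100..103 are touched once each.
game_pages = [0, 1, 2, 3]
browser_pages = [100, 101, 102, 103]
window = browser_pages + game_pages * 50

fast_pages = pick_fast_pages(window, fast_capacity=4)
game_only = game_pages * 10
print(sorted(fast_pages))
print(fast_tier_hit_fraction(game_only, fast_pages))
```

After one window the game's pages have displaced the browser's in the fast tier, which is the "some fps a second or two later" effect.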
Not sure about HBM in CPUs in general, but with packaging advancements any in-package stuff is only going to become cheaper: HBM, pedestrian bandwidth, doesn’t matter.