New NPU: Intel NPU 4, Up to 48 Peak TOPS

Perhaps Intel's main focal point, from a marketing standpoint, is the latest generational change to its Neural Processing Unit, or NPU. Intel has made significant strides with its latest NPU, dubbed NPU 4, claiming up to 48 TOPS of peak AI performance, although AMD disclosed a faster NPU during its Computex keynote. Compared with its predecessor, NPU 3, NPU 4 is a major leap in neural processing power and efficiency. The gains come from higher clock frequencies, a better power architecture, and a larger number of compute engines.

NPU 4 builds on these improvements with a revamped vector performance architecture, a higher number of compute tiles, and better-optimized matrix computation. The result is substantially more neural processing bandwidth, which is critical for applications that demand ultra-high-speed data processing and real-time inference. The architecture supports INT8 and FP16 precisions, with a maximum of 2048 MAC (multiply-accumulate) operations per cycle for INT8 and 1024 MAC operations per cycle for FP16, a significant increase in computational throughput.
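As a back-of-envelope check, the 48 TOPS figure follows from the MAC rate. The sketch below reads the 2048 INT8 MACs per cycle as a per-engine figure and assumes six compute engines and a roughly 1.95 GHz peak clock; both the engine count and the clock are illustrative assumptions, not figures stated in this article. Counting each MAC as two operations (a multiply plus an add) is the standard convention:

```python
# Peak-TOPS sketch; engine count and clock speed are assumptions.
ENGINES = 6                  # neural compute engines (assumed)
INT8_MACS_PER_ENGINE = 2048  # MACs per cycle per engine at INT8
OPS_PER_MAC = 2              # one multiply + one accumulate
CLOCK_HZ = 1.95e9            # assumed peak NPU clock

peak_tops = ENGINES * INT8_MACS_PER_ENGINE * OPS_PER_MAC * CLOCK_HZ / 1e12
print(f"peak: {peak_tops:.1f} TOPS")  # lands close to the quoted 48 TOPS
```

With these assumed numbers the arithmetic gives roughly 47.9 TOPS, consistent with the headline claim.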

A more in-depth look at the architecture reveals a deeper pipeline in NPU 4. Each neural compute engine in this fourth-generation design embeds a full inference pipeline comprising MAC arrays and several dedicated DSPs for different types of computation. The pipeline is built for numerous parallel operations, enhancing both performance and efficiency. The new SHAVE DSP delivers four times the vector compute power of the previous generation, enabling more complex neural networks to be processed.

A significant improvement in NPU 4 is an increase in clock speed, along with a new process node that doubles performance at the same power level as NPU 3. The result is a quadrupling of peak performance, making NPU 4 a powerhouse for demanding AI applications. The new MAC array features advanced on-chip data conversion, allowing datatype conversion on the fly, fused operations, and output data layout that keeps the data flow optimal with minimal latency.
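The benefit of fusing datatype conversion into the MAC output path can be sketched in NumPy. This is not Intel's hardware pipeline, just a generic illustration of why dequantizing and requantizing on the fly avoids writing wide intermediate results out for a separate conversion pass; all names here are illustrative:

```python
import numpy as np

def fused_int8_matmul(a_q, b_q, scale_a, scale_b, scale_out):
    """INT8 matmul with a wide INT32 accumulator and fused requantization."""
    # MAC stage: INT8 operands, accumulate in INT32 so sums don't overflow
    acc = a_q.astype(np.int32) @ b_q.astype(np.int32)
    # Fused conversion stage: dequantize and requantize the output in the
    # same pass, instead of storing INT32 results and converting later
    real = acc * (scale_a * scale_b)
    return np.clip(np.round(real / scale_out), -128, 127).astype(np.int8)
```

With identity scales this reduces to a saturating INT8 matrix multiply; in a real quantized network the scales come from calibration.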

The bandwidth improvements in NPU 4 are essential for handling bigger models and datasets, especially in transformer-based language model applications. The architecture supports higher data flow, reducing bottlenecks and keeping the pipeline fed even under sustained load. The DMA (Direct Memory Access) engine of NPU 4 doubles the DMA bandwidth, an essential addition for improving network performance and handling heavy neural network models effectively. Additional functions, including embedding tokenization, are also supported, expanding what NPU 4 can do.
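To see why doubled DMA bandwidth matters for large models, consider the time to stream a model's weights. The model size and bandwidth figures below are illustrative placeholders, not Intel's disclosed numbers; only the 2x ratio comes from the text:

```python
def stream_time_ms(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to stream a model's weights over a DMA link, in milliseconds."""
    return model_bytes / bandwidth_bytes_per_s * 1e3

weights = 3.0e9                            # e.g. a 3B-parameter model at INT8 (~3 GB)
baseline = stream_time_ms(weights, 64e9)   # hypothetical previous-gen DMA rate
doubled = stream_time_ms(weights, 128e9)   # NPU 4: 2x DMA bandwidth
print(f"{baseline:.1f} ms -> {doubled:.1f} ms")
```

Halving the streaming time matters most when weights are re-fetched every token, as in LLM decoding.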

Another significant improvement in NPU 4 is in matrix multiplication and convolution, where the MAC array can process up to 2048 MAC operations per cycle for INT8 and 1024 for FP16. This lets the NPU run far more complex neural network calculations at higher speed and lower power. The vector register file also grows to 512 bits wide, meaning more vector operations can be completed in a single clock cycle, further improving computational efficiency.
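The 512-bit register width translates directly into SIMD lanes per clock. A quick sketch of elements processed per register, by datatype width:

```python
REG_BITS = 512  # vector register width in NPU 4

# Elements packed into one vector register, by datatype width in bits
lanes = {name: REG_BITS // bits
         for name, bits in [("INT8", 8), ("FP16", 16), ("FP32", 32)]}
print(lanes)  # {'INT8': 64, 'FP16': 32, 'FP32': 16}
```

Wider registers are why lower-precision datatypes pay off twice: more lanes per instruction as well as less memory traffic.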

NPU 4 supports a wider variety of activation functions, accommodating virtually any neural network, with a choice of precision for floating-point calculations that makes computations more precise and reliable. The improved activation functions and an optimized inference pipeline allow it to run more complicated and nuanced neural network models with better speed and accuracy.
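The precision trade-off for activation functions can be demonstrated with a sigmoid evaluated in FP16 versus FP32. This is a generic NumPy illustration of why the choice of floating-point precision matters, not NPU 4's actual implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-8, 8, 1001, dtype=np.float32)
y32 = sigmoid(x)                                        # FP32 reference
y16 = sigmoid(x.astype(np.float16)).astype(np.float32)  # evaluated in FP16
max_abs_err = np.abs(y32 - y16).max()
print(f"max |error| in FP16: {max_abs_err:.2e}")  # small, but nonzero
```

For a well-behaved activation like sigmoid the FP16 error stays tiny; the gap widens for functions with steep gradients or large dynamic range, which is where the choice of precision matters.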

The upgraded SHAVE DSP in NPU 4, with four times the vector compute power of NPU 3, contributes to a 12x overall increase in vector performance. This is most useful for transformer and large language model (LLM) workloads, making them more responsive and energy efficient. The larger vector register file enables more vector operations per clock cycle, significantly boosting NPU 4's computational capability.
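The quoted 12x vector gain is consistent with the 4x-wider SHAVE DSP combined with a tripling of the engine count. The engine counts below (two in NPU 3, six in NPU 4) are an assumption used to make the arithmetic work out, not figures stated in this article:

```python
per_dsp_gain = 4        # 4x vector compute per SHAVE DSP (stated)
engine_gain = 6 / 2     # assumed: 6 compute engines in NPU 4 vs 2 in NPU 3
overall = per_dsp_gain * engine_gain
print(f"overall vector speedup: {overall:.0f}x")  # 12x, matching the claim
```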

Overall, NPU 4 represents a big performance jump over NPU 3: 12 times the vector performance, four times the TOPS, and twice the IP bandwidth. These improvements make NPU 4 a high-performing and efficient fit for modern AI and machine learning applications where performance and latency are critical. Together with the advances in data conversion and bandwidth, these architectural improvements position NPU 4 as a top-of-the-line solution for demanding AI workloads.

91 Comments


  • The Hardcard - Wednesday, June 5, 2024 - link

    There will be Lion Cove with hyperthreading. It is designed such that it can be physically left out or included depending on the value to each product.

    It was left out of Lunar Lake as the primary goal here is performance per watt and battery life superiority over Apple and Qualcomm.

    Server Lion Cove will absolutely have hyperthreading. Rumors are Arrow Lake will have it as well.
  • TMDDX - Wednesday, June 5, 2024 - link

    Is on-chip "AI" the new connected standby for NSA spying?
  • ballsystemlord - Wednesday, June 5, 2024 - link

    Shhhhh, you're not supposed to say that. It's classified. ;)
  • sharath.naik - Wednesday, June 5, 2024 - link

    So would this have on-package memory? What is the size of the memory? How many P-cores, how many E-cores? So many questions, no answers. Is this like a paper launch?
  • sharath.naik - Wednesday, June 5, 2024 - link

    Never mind, I was wrong. 4E+4P and up to 32 GB RAM. I wish they had an option for 64 GB, but 32 GB is a good number.
  • stephenbrooks - Wednesday, June 5, 2024 - link

    The wider Lion Cove core looks pretty impressive; I'll be interested to see how it does in desktops.
  • name99 - Wednesday, June 5, 2024 - link

    "In total, this puts 240KB of cache within 9 cycles' latency of the CPU cores"

    Does it? If they do things the usual Intel way the L1 is inclusive of the L0...
    Other options are possible, of course, but were they implemented?
  • mode_13h - Thursday, June 6, 2024 - link

    I wonder if the tag RAM for the L0, L1D, and L2 are all separate? It would be interesting if they grouped it all together in a tree-structured lookup and put that as close as possible to the core's load/store unit. The actual data memory of the caches could be the only part that's physically separate.
  • Bruzzone - Wednesday, June 5, 2024 - link

    It's worth the wait to Lunar and Arrow?
    Or take advantage of the Intel and AMD current generation clearance sales?

    Intel is flooding the channel with Raptor desktop and mobile in the last eight weeks, apparently to sustain a Core supply bridge into Lunar and Arrow. Intel is also sucking the financial capital out of the channel in an effort to block or slow the procurement of anything other than Intel.

    In parallel, fighting it out for surplus control, AMD is also sucking financial capital out of the channel by flooding it specifically with Raphael desktop.

    Meanwhile, Meteor Lake and AMD Phoenix, Hawk Point, and Granite Ridge continue as intermediate 'AI' technologies into Strix mobile and Arrow desktop. Not that I care about AI functionality currently.

    14th desktop channel available + 98% in the prior eight weeks
    13th desktop + 24.6%
    12th desktop + 33.4%

    Intel desktop all up;

    14th desktop available today = 24.9%
    13th desktop = 37% that is 48.4% more than 14th
    12th desktop = 37.9% equivalent with 13th

    Specific Intel mobile;

    Intel Meteor Lake mobile channel available gains + 216%. Within Meteor Lake Core SKUs are 10.3%. Among total, H performance mobile = 43.9% and U low power mobile = 56%. Meteor Lake associated are 11% of all Raptor Lake 13th mobile.

    14th mobile H + 16% in week and 30% of all Meteor and 36% of all 13th Raptor mobile H.
    13th mobile itself gains + 5.1%
    13th H specifically gains + 8.6%
    13th P clears down < 3.2%
    13th U gains + 4.8%

    12th Alder mobile all up + 13.2% in the prior eight weeks
    12th H specifically = flat
    12th P clears down < 3.2%
    12th U clears down < 2.6%

    I will have AMD desktop and mobile supply, trade-in and sales trend up later today at my SA comment line. Here are some immediate observations;

    5900XT and 5800XT pricing, per AMD, is sufficient to push Vermeer channel holdings down in price at the said $359 and $249, now pulled by AMD for the moment. The channel might not have been happy with that regulating price move, given how much R5K there is to clear from the channel. R5K channel availability is up + 68% since March 9, when R5K was 68% of all R7K; today it is 98% of R7K available.

    R7K desktop since March 9 channel supply volume available + 18%. R9K will minimally dribble out allowing R7K and R5K to clear? R9K might have to be priced up on specific SKUs to accomplish the same dribbling out objective allowing AMD back generation to clear?

    Notably 3600 gains in the channel + 94% in the prior month.
    3600X came back to secondary resale + 35%.
    3700X is up + 15.8% that's all trade-in.

    AMD might have to adjust R9K desktop top SKU and R5K desktop regulating SKUs not to interfere with the channel's ability to liquidate especially Vermeer from channel inventory holdings plus R7K SKUs that will follow in a first in first out channel sales system.

    In summary, there is plenty of Intel and AMD product in the channel. The PC market remains in a downward deflationary price spiral until at least q1 2025 aimed to clear existing inventories for channel financial reclaim to buy next generation.

    Subsequently there's this inventory bridge to traverse to Intel and AMD next generation products and through the summer into q4 it's never been a better time to buy a PC. I don't think desktop and mobile prices will be as low as they are heading into year end and for a long time following.

    For Intel at least flooding the channel with product indicates Intel is buying time.

    mb
  • BushLin - Wednesday, June 5, 2024 - link

    Thanks for the uncited nonsense, Mike; we were all on tenterhooks.
