Intel's Larrabee Architecture Disclosure: A Calculated First Move


by Anand Lal Shimpi & Derek Wilson on August 4, 2008 12:00 AM EST

Posted in
GPUs


Building an Optimized Rasterizer for Larrabee

We've touched on the latency focus. We talked about caches and internal memory buses. But what about external memory? To be honest, the answer is that we don't know. But we have an idea of the direction Intel wants to move in. Lower external bandwidth and possibly a smaller framebuffer than traditional hardware seem to be the goal. If Intel can maintain good performance, reducing the amount of memory and the number of traces on the board will reduce the cost to add-in card vendors who may want to sell cards based on Larrabee (and in turn could reduce cost to the end user).

This bit of speculation isn't just based on what we know about the hardware so far. It's also based on the direction Intel decided to take with its rasterizer: Intel is implementing a tile based rasterizer to support DirectX and OpenGL as well as its own software renderer. Speaking of that software renderer, Intel did state that it would be available for use by developers so that they don't have to start from nothing. When we asked whether it would be available only as a set of binaries or as source, the answer was that this was still under discussion. We put in our two cents and suggested that distributing the source is the way to go.

Anyway, we haven't discussed tile based rasterization in quite a while on AnandTech as the Kyro line didn't stick around on the desktop. To briefly run it down, screen space is broken up into tiles. For each tile, the primitives (triangles) that touch it are set aside. Fragments are created for a tile based on all the geometry therein. Since none of these fragments are further processed or shaded until the entire tile is finished, only visible fragments are sent on to be shaded (at least, this is how it used to be: some aspects of DX10+ may require occluded fragments to hang around in some cases). Occluded fragments are thrown out during rasterization. Intel also supports Z culling at the geometry, fragment and pixel levels, which is very useful because the actual rasterization, blending etc. must occur in software as well. Cutting down work at every point possible is the modus operandi of optimizing graphics.
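To make the "set aside primitives per tile" step concrete, here is a minimal sketch of binning in Python. It is our own illustration, not Intel's renderer: the function name `bin_triangles`, the 64-pixel tile edge, and the use of a conservative bounding-box overlap test are all assumptions.

```python
import math
from collections import defaultdict


def bin_triangles(triangles, width, height, tile=64):
    """Return {(tile_x, tile_y): [triangle indices]} using bounding-box overlap.

    The tile edge of 64 pixels is a hypothetical value chosen for illustration.
    The bounding-box test is conservative: a triangle may be listed for a tile
    it does not actually cover.
    """
    bins = defaultdict(list)
    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Clamp the triangle's bounding box to the screen.
        x0 = max(math.floor(min(xs)), 0)
        x1 = min(math.floor(max(xs)), width - 1)
        y0 = max(math.floor(min(ys)), 0)
        y1 = min(math.floor(max(ys)), height - 1)
        if x0 > x1 or y0 > y1:
            continue  # entirely off screen, nothing to bin
        # Record the triangle in every tile its bounding box touches.
        for ty in range(y0 // tile, y1 // tile + 1):
            for tx in range(x0 // tile, x1 // tile + 1):
                bins[(tx, ty)].append(index)
    return dict(bins)


triangles = [
    ((10, 10), (20, 10), (10, 20)),  # sits inside a single tile
    ((60, 60), (70, 60), (60, 70)),  # straddles a tile corner
]
bins = bin_triangles(triangles, width=128, height=128)
```

In this toy scene the first triangle lands in one bin while the second is listed in four, which is the per-triangle overhead a tiled design pays whenever geometry spans tile boundaries.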

This is in stark contrast to immediate mode renderers, which are what ATI and NVIDIA have been building for the past decade. Immediate mode rendering processes every fragment in the scene as it arrives, sometimes even those that aren't visible and can't easily be thrown out by pre-shading depth test techniques. Immediate mode renderers have some tricks that let them know which fragments will be visible in the scene to help cut down on work, but there are still cases where the GPU shades a fragment that never appears in the final image. The result is that immediate mode renderers require more memory bandwidth than tile based renderers, though some algorithms and features have been easier to implement with immediate mode.
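A toy model helps show where the wasted shading comes from. The sketch below is a deliberate simplification under our own assumptions: one pixel, opaque fragments only, a simple depth test on arrival for the immediate mode case, and none of the hierarchical Z tricks real GPUs use. The function names are hypothetical.

```python
def shaded_immediate(depths):
    """Count fragments shaded for one pixel when each fragment is
    depth-tested as it arrives (smaller depth = closer to the viewer)."""
    nearest = float("inf")
    shaded = 0
    for z in depths:
        if z < nearest:  # passes the depth test at arrival time, so it gets shaded
            nearest = z
            shaded += 1
    return shaded


def shaded_deferred(depths):
    """Count fragments shaded when visibility for the whole tile is
    resolved first: only the closest opaque fragment is shaded."""
    return 1 if depths else 0


back_to_front = [0.9, 0.7, 0.5, 0.2]  # worst-case submission order
front_to_back = [0.2, 0.5, 0.7, 0.9]  # best-case submission order

worst = shaded_immediate(back_to_front)
best = shaded_immediate(front_to_back)
deferred = shaded_deferred(back_to_front)
```

In this model the immediate mode count depends on submission order, while the deferred count does not. That order independence is the property the tile based approach is after.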

STMicro had a short run of popular tile (or deferred) renderers in the early 2000s with the Kyro series. This style of rendering still lives on in cell phone/smart phone and other ultra low power devices that need graphics. While performance on this hardware is very low, memory efficiency is important in this space and thus tile based renderers are preferred.

The technique dropped out of the desktop space not because it was inherently unable to perform, but simply because the players that won out in the era didn't choose to make use of it. With smaller process technology, larger on die cache sizes, larger tile sizes, and smaller geometry (meaning fewer triangles span multiple tiles), some advantages of tile based rendering have gotten ... well, more advantageous with advancements in technology.

Getting into the details of tile based rendering is a bit beyond where we want to go right now. But the point is that this technique results in fewer occluded fragments being shaded. Additionally, the grouping of fragments into tiles helps with breaking up the workload and could help to optimize prefetching and caching so that fragments are only ever fetched once from external memory (tiles on Larrabee will fit into less than half the L2 space per core). These and other features help to reduce bandwidth needs compared to immediate mode renderers.
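As a rough feel for how a tile relates to cache size, the footprint of a tile is simply its pixel count times the bytes stored per pixel. The numbers in the sketch below are our assumptions for illustration and are not stated in this article: a 256 KB slice of L2 per core, and 4 bytes of color plus 4 bytes of depth per pixel.

```python
def tile_bytes(tile_width, tile_height, bytes_per_pixel):
    """Memory needed to hold one tile's color and depth data."""
    return tile_width * tile_height * bytes_per_pixel


L2_BYTES = 256 * 1024    # ASSUMPTION: per-core L2 size, not given in the article
BYTES_PER_PIXEL = 4 + 4  # ASSUMPTION: 32-bit color plus 32-bit depth

# Print the footprint of a few hypothetical square tile sizes.
for edge in (32, 64, 128):
    size = tile_bytes(edge, edge, BYTES_PER_PIXEL)
    print(f"{edge}x{edge} tile: {size // 1024} KB, "
          f"{size / L2_BYTES:.3f} of the assumed L2")
```

Under these assumptions a 64x64 tile needs 32 KB, leaving most of the cache free for the geometry and textures the tile's shaders touch. Different pixel formats or multisampling would change the arithmetic.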

Looking a little deeper, it is both the burden and advantage of Larrabee that it implements all steps of the traditional graphics pipeline in software. While current GPUs have hardware for geometry setup, rasterization, texturing, filtering, compressing, decompressing, blending and much more, Larrabee maintains a minimum of fixed function features (related to texturing). Often, for a specific purpose, fixed function hardware can be more efficient and faster than general purpose hardware. But at the same time, the needs of individual games shift, and allocating greater or fewer resources to a specific component of the rendering pipeline does have advantages over fixed function hardware. Current GPUs can't shift resources to offer faster rasterization if needed. They can't devote more flops to speeding up stenciling or blending.

The flexibility of Larrabee allows it to best fit any game running on it. But keep in mind that just because software has a greater potential to better utilize the hardware, we won't necessarily see better performance than what is currently out there. The burden is still on Intel to build a part that offers real-world performance that matches or exceeds what is currently out there. Efficiency and adaptability are irrelevant if real performance isn't there to back it up.


101 Comments


christophergorge - Tuesday, August 5, 2008
is it just me or does it look like another transmeta crusoe in the making?
Byte - Tuesday, August 5, 2008
Looks like Puma will have a hard prey to hunt. This should be pretty successful, even if it will be underpowered in DX games, but that shouldn't matter as even now Intel is selling lots of graphics just because they almost force it onto OEMs. Intel could similarly force these onto OEMs, but at least this time it won't be a huge pile of crap.
ilkhan - Tuesday, August 5, 2008
So is the on package GPU we expect to see in Havendale & Auburndale chips going to be Larrabee chips?

If anything I'd expect to see 8 or 16 core versions to be the onboard GPU for those. Probably 8 core, to keep costs down for onboard chips.
DeepThought86 - Monday, August 4, 2008
Nice HPC platform, terrible idea for a graphics chip. Just look at the die allocation, it's optimized for instruction-heavy and data-poor tasks. Killer for BOINC and folding type stuff, but there's no way this general purpose use of transistor budget makes sense for graphics.

Power consumption for the high-speed ringbus will be killer as well. In idle today's GPUs are quite efficient, Larrabee will burn watts doing nothing.

This architecture will occasionally handle a particular game excellently, but completely fall down in others. In a way it's the opposite of Nvidia or AMD today.

Ah well, they've had a good run since 2006, looks like they're headed for their next down cycle, just as AMD has started rising again...
ltcommanderdata - Monday, August 4, 2008
From Intel's Siggraph paper, Larrabee's claimed performance is pretty decent.

Intel's internal results are that Larrabee will only require about 10 cores running at 1GHz to maintain HL2 Episode 2 above 60fps at a 1600x1200 resolution. They estimate that a 25 core 1GHz Larrabee will be sufficient to maintain FEAR and Gears of War above 60fps at 1600x1200. FEAR is older than Gears of course, but FEAR had an occasional frame spike, probably on a more complicated frame, so 25 cores should guarantee a 60fps minimum. Of course, these are Intel's own benchmarks and they only tested a very small section of the game that they picked, but things do look promising. At the very least performance is better than trying to play the game on current Intel IGPs.
iocedmyself - Tuesday, August 5, 2008
1ghz core x 10 to maintain HL2 above 60 fps in 1600x1200...wow...that's on par with a x1800xt? at absolute most.

1ghz core x 25 for FEAR and gears of war @ 60 fps..that is the equivalent of a $180 ATi 4850 running in 1920x1200; @ 1600x1200 it does 90 fps, 50% better

...or

the same frame rate as the $290 ATI 4870...in 2560x1600, in 1600x1200 it does 114 fps, nearly twice the performance.

yes, they could scale it up to 50 cores, running 3ghz and it would still only equal about 2/3 the processing power as a single core 4870. Intel's 80 core terascale chip does 1 teraflop/sec at 3.2ghz.

This is a horribly flawed design...they are doing the opposite of the logical step...in what twisted reality can someone say,

"well if GPU's are capable of delivering x20-x40 the performance of a desktop cpu package running at 1/5th the clock speed (or more accurately x80-x110 the performance on a core by core basis) the logical solution is to put 48 cpu cores in a single package!"

Intel couldn't manage to produce an IGP that ran the GUI of an operating system smoothly at all times, they took years longer than AMD to develop 2 core die dual core, years longer to be able to make a photocopy of their IMC, and continue to fail in 64bit computations comparatively...

but they think because they've developed a 32bit arch, built off a 10 year old design and gained market control for less than two years after producing complete and utter crap for the previous 7 straight...that they can take the video card market from 2 companies each having 13+ years experience in the market.

AMD is already testing 40nm die 64bit dual/quad cpu with IMC supporting DDR2 AND DDR3, 1 or more gGpu's and a total of 6-10MB on die cache.

Native dual core Gpu's, cpu's gpu's and a combination of both with built in memory...ya know, designs that actually have some promise...but they are going to nail an x86 in which developers will have to change the way they think, program and deploy ideas. We barely have software that will utilize 4 cores, let alone 40. Meanwhile all amd has to do is integrate the 780G IGP into a cpu package and intel is screwed.

But hey....i bet AMD could make a kick ass Gfx card if they took the r540 (x850xt PE core) gave it a die shrink down to 55nm and added SM4 support, then stuffed 50 or 60 into a single package it would do great.

HELL why stop there, just give the r770 a die shrink down to 40nm, put 10 cores to a gpu die,
make a dual gpu board,
2x5 gigs of GDDR5 memory clocked at 1250mhz (5ghz effective)

they would have a single card capable of doing more than 20 teraflops/sec.

BUT WAIT! THEN THEY HAVE CROSSFIREX THEY COULD HAVE 80 R770 CORES WITH 40 GIGS OF 5GHZ GDDR5 IN A SINGLE SYSTEM!!!!! 3DMARK06 WOULD BREAK 1,000,000 POINTS IN 4096x3200 WITH 1920 FSAA 1280 AF!!!

EVERYONE WOULD BE ABLE TO RUN CRYSIS ON ULTRA HIGH SETTINGS USING A MOVIE THEATER SCREEN FOR A MONITOR WITH NO JAGGED EDGES!!!!!!!!!!!

Then it would become aware, and improve the game code, crysis would spill over into Halo, halo would break into COD4, Fallout 3 would spill over into World of Warcraft where the characters would become self aware and program viruses to only infect intel based platforms...which would destroy Mac's completely,

IT WOULD BE THE FIRST DIGITAL STD!!!!! ZOMG

It would be sold with a 6000w PSU, and it would be Green because it would run on the power of internet porn, and have the power to heat your entire house....it would save the environment....ZOMFG!!

But eventually....intel would come back from the wreckage...

bringing with them the next revolutionary product...

the octo-pumped Itanium 4...with Netburst 3.4 arch, featuring 127 Pentium MX cores, Each core could handle 3 threads, and it would scale to 50,000mhz, with 2 terabyte SATA 4 hard drives used for the L1 cache of each core...and testing has shown that each core will only have to run 4.7ghz to achieve 60fps in the human genome project...

Sigh...sorry, i was pretending i worked at intel. It sure is fun to imagine what could be...isn't it?
ZootyGray - Tuesday, August 5, 2008
Hey OC - I just had a flashback to the "jump to light speed" scene in StarWars. Dude, total nirvana, o yeh, thx for the ride. :)

BTW - my GF says, she heard a rumour that the whole thing runs on 'corn'.

I think it must be nextgen corn, cos that's a lotta teraflops. Does any of this convert to metric tonnes of refined bs? Anyway, I think I will wait for your next release.

And you accomplished that in less than one page? nano shrink, huh!

peace.
ltcommanderdata - Tuesday, August 5, 2008
So your own estimates are:

"..or

the same frame rate as the $290 ATI 4870...in 2560x1600, in 1600x1200 it does 114 fps, nearly twice the performance. "

So you are admitting that a 1GHz x 25 core Larrabee could be about 50% the performance of a HD 4870. But, Larrabee could be available in configurations up to 48 cores, so then a 48 core Larrabee at 1GHz could match a HD 4870. Of course, launch clocks will be better than 1GHz, since Intel only clocked the Larrabee cores at 1GHz in their benchmarks because it's a convenient reference base. You say that Terascale clocked at 3.2GHz, but being more conservative, if Larrabee clocked in at 2GHz at launch with 48 cores, then it would be twice as fast as a HD 4870.

This is of course based on preproduction drivers. Final performance may be higher. Admittedly, this is mainly hypothetical and based on early Intel provided data, but using your own figures for comparison, Larrabee may not be able to overtake the fastest GPUs available in 2009/2010, but it'll likely be competitive in the mid-range $200-$300 segment. Which is really all Intel needs, since the point is to get a more general purpose x86 based accelerator card into as many computers as possible. Gaming is just the vehicle to do it, and the mid-range is far higher volume than the top-end.

And in terms of flops, I believe it was in the SIGGRAPH paper somewhere that a baseline prototype Larrabee with 1 core at 1GHz gets about 32GFLOPS. Now no doubt scaling isn't perfectly linear, but just assuming it is through clock speed and core count, a 48 core Larrabee at 2GHz could peak at 3072GFLOPS or 3 times that of a HD 4870. ATI and nVidia will obviously keep moving forward in the next year or two, just as Larrabee is still evolving, but for now, Larrabee isn't really in as bad a position as you make it out to be.
JarredWalton - Tuesday, August 5, 2008
What's worse is that there are all these assumptions made with no knowledge of the settings. 1600x1200 in HL2 at absolute maximum detail settings is nothing to scoff at, and certainly 60FPS would surpass an X1950 XTX. Are we running 4xAA or not? No idea from Intel, so we've got no reference point other than to say that it should be able to generate playable performance.

FEAR is even better: 25 cores at 1GHz to hit 60FPS. Okay, that doesn't sound like a lot, but is that with or without 4xAA, and is it with or without soft shadows? Both of those factors can make a HUGE difference in performance. If they are enabled, 60 FPS at 1600x1200 is very impressive for early hardware. Now go with the assumption that Intel will hit clocks of at least 2GHz at launch and will likely have 32 or 48 cores. That should compare quite favorably with NVIDIA and ATI hardware next year.

Besides all of the above commentary on not knowing settings, we don't even know the scenes that were tested. Pretty much we have nothing to go on without a frame of reference. If Intel had said, "we achieve 60 FPS with 10 cores at 1GHz, and that compares to an 8800 GT running at 60 FPS with the same settings" we could start from a meaningful baseline. Which is probably why we didn't get that information.

Finally - and this is really the key - I believe all of the stuff right now is merely theoretical. They have modeled the performance of Larrabee in the various tests, but they do not have hardware and thus have not actually run any true tests. Okay, the modeling of the hardware is probably sufficient in all honesty, but some of you are talking as though these chips are actually up and running, and they're not (yet). We'll know a lot more in another year; until then, it all sounds very interesting but the proof as always is in the pudding.
The Preacher - Tuesday, August 5, 2008
Man, you must have OC'ed yourself way too high! :D
