16 cores may well be the limit?

NormCameron

Vista Guru
"Analysis: more than 16 cores may well be pointless

One of the ongoing themes of my microprocessor coverage over the past few years has been the relationship between on-chip execution bandwidth and the "memory wall." So I was intrigued to learn of new research from Sandia National Labs that indicates that the severity of the memory wall problem may be much greater than the industry generally anticipates.
In a nutshell, the "memory wall" problem is pretty straightforward, and it's by no means new to the multicore era. The problem arises when the execution bandwidth (i.e., aggregate instructions per second, either per-thread or across multiple threads and programs) available in a single socket is constrained by the amount of memory bandwidth available to that socket. As execution bandwidth increases, either because clockspeeds get faster or because the die contains more cores, memory bandwidth has to increase in order to keep up.
To put this in simple multicore terms, cramming a ton of processor cores onto a single die does you no good if you can't keep those cores fed with code and data.
But memory bandwidth isn't keeping up. Memory bus bandwidth (latency and/or throughput) hasn't increased quickly enough in proportion to Moore's Law, a fact that leaves processors starving for bytes. In this respect, the "memory wall" is a classic producer/consumer problem, and it's the reason that on-die cache sizes have ballooned in recent years. As the memory wall gets higher and higher, it takes more and more cache to get you over it. At this point, it would be fair to say that most modern server processors are really high-speed memories with some processor core stuck on the die, instead of vice versa.
The memory wall is therefore an added barrier to the success of the many-core paradigm. I say "added," because the most famous barrier is the programming model. Massively multithreaded programming isn't just a "hard problem"—rather, it's a generation's worth of Ph.D. dissertations that have yet to be written.
The work from the Sandia team, at least as it's summarized in an IEEE Spectrum article that infuriatingly omits a link to the original research, seems to indicate that 8 cores is the point where the memory wall causes a fall-off in performance on certain types of science and engineering workloads (informatics, to be specific). At the 16-core mark, the performance is the same as it is for dual-core, and it drops off rapidly after that as you approach 64 cores.
The chart included in the report is striking, and I wish I had the appropriate background to interpret it. (Again, the lack of any link, DOI, report title, deck title, or other reference information is unbelievable.) Nonetheless, despite the lack of color from the source, I'm sure the many-core skeptics in the audience—and there are quite a few—will seize on it as further validation that the maximum worthwhile core count is well below 16.
It looks like Sandia is proposing that stacking memory chips on top of the processor is the solution to this bandwidth problem. If that is indeed their proposal, then they're in good company. Both Intel and IBM have touted advances in chip-stacking techniques, and Sun has published research in the area of high-bandwidth memory interconnects that involve placing dice edge-to-edge. But, to my knowledge, these die-stacking schemes are further down the road than the production of a mass-market processor with more than 16 cores."

Analysis: more than 16 cores may well be pointless
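
To make the quoted argument a bit more concrete, here is a minimal roofline-style sketch in Python. It is not the Sandia model (the article doesn't link one). All the numbers (10 GFLOP/s per core, 25.6 GB/s of socket bandwidth, 2 FLOPs per byte moved) are made-up assumptions, and `sustained_gflops` is a hypothetical helper name. It only shows the basic producer/consumer limit: once the cores can consume data faster than the memory bus delivers it, adding cores stops helping.

```python
def sustained_gflops(cores, gflops_per_core, mem_bw_gbs, flops_per_byte):
    """Rough estimate: throughput is capped by compute or by memory traffic."""
    compute_limit = cores * gflops_per_core      # what the cores could execute
    memory_limit = mem_bw_gbs * flops_per_byte   # what the memory bus can feed
    return min(compute_limit, memory_limit)


# Hypothetical socket: 10 GFLOP/s per core, 25.6 GB/s memory bandwidth,
# and a workload that performs 2 floating-point operations per byte fetched.
for cores in (1, 2, 4, 8, 16, 32, 64):
    print(cores, sustained_gflops(cores, 10.0, 25.6, 2.0))
```

With these assumed numbers the curve flattens somewhere between 4 and 8 cores. This simple model only plateaus; it does not reproduce the actual drop-off the article describes, which would presumably need contention effects modeled as well.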
 

My Computer

System One

  • Manufacturer/Model
    Scratch Built
    CPU
    Intel Quad Core 6600
    Motherboard
    Asus P5B
    Memory
    4096 MB Xtreme-Dark 800mhz
    Graphics Card(s)
    Zotac Amp Edition 8800GT - 512MB DDR3, O/C 700mhz
    Monitor(s) Displays
    Samsung 206BW
    Screen Resolution
    1680 X 1024
    Hard Drives
    4 X Samsung 500GB 7200rpm Serial ATA-II HDD w. 16MB Cache .
    PSU
    550 w
    Case
    Thermaltake
    Cooling
    3 x Noctua NF-S12-1200 - 120mm 1200RPM Sound Optimised Fans
    Keyboard
    Microsoft
    Mouse
    Targus
    Internet Speed
    1500kbs
    Other Info
    Self built.
good read
 

My Computer

System One

  • Manufacturer/Model
    gateway/m6881
    CPU
    centrino core 2 duo 2.2ghz T7500
    Memory
    3GB
    Hard Drives
    500GB WD
    Mouse
    logitech
    Internet Speed
    fios 35MB not!!!!
Intel said back in 2007 that Nehalem would be a hyperthreaded octo-core design, for a total of 16 threads (16 logical cores to the OS).

Now, in 2009, Nehalem has 4 hyperthreaded cores and 8 threads (8 logical processors to the OS), with the exception of the 32nm Gulftown, which has 6 hyperthreaded cores and 12 threads.

It is said that by the time we have microarchitectures built on 20-10nm fabs, either what you posted, NormCameron, will turn out to be correct, and Moore will be as well.

Or the group of scientists who claimed Moore's law is false will change the perspective on how much is too much in a processor.
 

My Computer

System One

  • Manufacturer/Model
    JUST ME/Custom Built
    CPU
    Intel(R) Core(TM) 2 Quad Q9550 E0 2.83 GHz @ 4037.5 MHz
    Motherboard
    EVGA NFORCE 790I SLI FTW Digital PWM
    Memory
    8 Gig Corsair XMS3 DHX 1600 Mhz DDR3 CAS 9
    Graphics Card(s)
    2x GTX 480's w/ Koolance VID-NX480's
    Sound Card
    Creative SoundBlaster X-Fi Titanium (non-Fatal1ty) Edition
    Monitor(s) Displays
    Acer P243W 24" Premium Series
    Screen Resolution
    1920 x 1200
    Hard Drives
    Seagate Barracuda 7200.11 1 TB HDD 32mb cache
    (3 Platters, 6 Heads model)
    PSU
    Ultra X4 1050W PSU 76A Rated 12V SLI
    Case
    Antec 1200 Full tower ATX case
    Cooling
    x2 Thermaltake Blue LED 120mm 2800rpm fans, Zalman CNPS 9900
    Keyboard
    Logitech LX310 2.4 Ghz Wireless Keyboard
    Mouse
    Logitech LX310 2.4Ghz Wireless Laser Mouse
    Internet Speed
    32mb Download, 12mb Upload from Comcast
    Other Info
    300$ custom watercooling setup. Antec 1200 big boy Radiator, Dual bay XPSC reservoir with clear bubble window, OCZ hydropulse 800L/H 12v pump, 12 ft tygon 1/2" ID, 3/4" OD tubing, Feser One UV Green 1 L fluid. (and VID-NX480 waterblock)
I was led to believe that some server boards had memory for each chip :huh:.
Not dealing with server boards, I don't know how true this is.
I can see this going the blade way, with cores having their own dedicated memory and a controller chip for each core.
16 cores with 2GB each allows for 32GB of RAM.
Or would that be too simple?
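
For what it's worth, multi-socket server boards where each CPU socket has its own memory controller and local DIMMs (a NUMA layout) do exist. Below is a rough sketch of the arithmetic in the post, extended to bandwidth. The 10 GB/s per controller figure and the `per_core_memory_totals` name are purely hypothetical, and the linear scaling only holds if each core mostly touches its own local memory.

```python
def per_core_memory_totals(cores, gb_per_core, bw_per_controller_gbs):
    """Total capacity and best-case aggregate bandwidth with one controller per core."""
    total_capacity_gb = cores * gb_per_core
    aggregate_bw_gbs = cores * bw_per_controller_gbs  # assumes purely local accesses
    return total_capacity_gb, aggregate_bw_gbs


# 16 cores with 2 GB each, and an assumed 10 GB/s per dedicated controller.
capacity, bandwidth = per_core_memory_totals(16, 2, 10.0)
print(f"{capacity} GB total, up to {bandwidth} GB/s aggregate")
```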
 

My Computer

System One

  • Manufacturer/Model
    Self Built
    CPU
    I5 3570K
    Motherboard
    Gigabyte Z77-DS3H
    Memory
    4 x 4GB corsair ballistix sport DDR3 1600 Mhz
    Graphics Card(s)
    Gigabyte Geforce GTX 660 TI
    Sound Card
    creative x-fi
    Monitor(s) Displays
    Primary CiBox 22" Widescreen LCD ,Secondary Dell 22" Widescreen
    Screen Resolution
    Both 1680 x 1050
    Hard Drives
    2 x 500G HD (SATA) 1 x 2TB USB
    PSU
    Corsair HX 620W ATX2.2 Modular SLI Complient PSU
    Case
    Antec 900 Ultimate Gaming Case
    Cooling
    3 x 80mm tri led front, 120mm side 120mm back, 200mm top
    Keyboard
    Logik
    Mouse
    Technika TKOPTM2
    Internet Speed
    288 / 4000
    Other Info
    Creative Inspire 7.1 T7900 Speakers
    Trust Graphics Tablet
My interpretation is that the current response to the memory wall problem is the introduction of the LGA1366 socket and the triple-channel DDR3 memory architecture supported by X58 and i7. My guess is that by the time we get to 16-core CPUs we will naturally see the introduction of an LGA1760 and quad-channel DDR5 on an X78 chipset, i.e. more pins, more parallel multi-channel memory streaming and of course bigger caches, possibly even a 4th-level on-die cache. Of course, that would be the i9 or whatever ;-)
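
A back-of-the-envelope way to see why more channels only go so far: theoretical peak bandwidth is channels x transfers per second x bytes per transfer, and that total gets shared by every core on the socket. The sketch below assumes standard 64-bit DDR channels. DDR3-1600 is just an example speed, not a claim about what X58 officially supports, and `peak_bandwidth_gbs` is a made-up helper name.

```python
def peak_bandwidth_gbs(channels, mega_transfers_per_s, bus_width_bits=64):
    """Theoretical peak memory bandwidth in GB/s (sustained rates are lower)."""
    bytes_per_transfer = bus_width_bits / 8
    return channels * mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


# Example: triple-channel memory at 1600 MT/s, shared by 4, 8 or 16 cores.
triple = peak_bandwidth_gbs(3, 1600)
for cores in (4, 8, 16):
    print(f"{cores} cores: {triple / cores:.1f} GB/s per core (triple channel)")
```

Going from three to four channels at the same speed raises the total by a third, while going from 4 to 16 cores cuts the per-core share to a quarter.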
 


From using a dual-CPU quad-core Xeon every day of the week, I can tell you that having 8 logical cores is useless too. It doesn't run any better than a dual core. The world keeps making faster and better processors that reach their limit based on memory and bandwidth.
 

My Computer

System One

  • Manufacturer/Model
    HP Pavilion DV7-1129wm Entertainment PC
    CPU
    AMD Turion X2 RM-72 Dual Core @ 2.1GHz
    Memory
    2 x 2 GB Hyundai DDR2 400 MHZ
    Graphics Card(s)
    ATI Mobility Radeon HD 3200 @ 256 MB
    Sound Card
    IDT High Definition Audio/SRS Premium Sound/Altec Lansing
    Monitor(s) Displays
    17" Laptop Screen
    Screen Resolution
    1440 x 900 laptop, external 17 in LCD 1024 x 768
    Hard Drives
    WD Scorpio Blue 320 GB SATA 5400 RPM
    Toshiba 68 GB SATA 5400 RPM Second Drive (backup)
    PSU
    8 Cell Lithium Ion Battery
    Case
    Laptop with "light up" HP Logo on outside
    Cooling
    Insane air coming out of Targus dual fan cooler
    Keyboard
    Full Keyboard with numpad
    Mouse
    Microsoft Wireless Mobile Mouse 3000 / Touchpad
    Internet Speed
    Comcast Cable 20 MBps
    Other Info
    Used Primarily for CAD design using SolidWorks 2010.
    Also I love to watch HD movies using the HDMI output(Netflix).
    Linked to my Xbox 360 for Windows Media Center
    3 USB ports + USB/eSata
    HP Remote for Windows Media Center and Quickplay
    Internal Dual Layer DVD+/-RW
    External HP Lightscribe Dual Layer DVD+/-RW
    HP Webcam and Microphone
From using a dual-CPU quad-core Xeon every day of the week, I can tell you that having 8 logical cores is useless too. It doesn't run any better than a dual core. The world keeps making faster and better processors that reach their limit based on memory and bandwidth.


Right now, it's not an issue with memory bandwidth, but with the QPI/FSB bandwidth.

QPI was a great leap forward for Intel because it essentially balanced the bandwidth difference between the CPU and the memory (DDR3, generally speaking).
 

From using a dual-CPU quad-core Xeon every day of the week, I can tell you that having 8 logical cores is useless too. It doesn't run any better than a dual core. The world keeps making faster and better processors that reach their limit based on memory and bandwidth.

I run dual quad-core Xeons all day at work as well. While it may not vastly improve performance over a fast dual core for compiles, searching a code tree, etc., it certainly does make a very large difference in allowing me to run multiple VM sessions alongside my boot OS without bringing the system to its knees. If your point is that an 8-way system is not going to make everything 6-8x faster, you are exactly right. For those who are trying to multitask better using virtualization and/or running large databases at the same time, it does provide a very significant performance improvement over dual or even quad core systems.

But that leads directly into another topic that most software developers have recognized for nearly two decades. Developing application software, services and device drivers that leverage the maximum amount of parallelism is still a very slow, error-prone and painstaking task. Even given some of the newer tools (Intel's Parallel Tools, etc.), developing, testing and debugging parallel software is still not all that far from where it was 10-15 years ago. I'm hoping that the next 5-10 years of progress in software development tools produces tools that allow the average skilled developer to produce software that extracts a much greater level of performance from multicore architectures, while keeping the programming model simple enough that programmers can become highly productive at creating new, faster applications without getting bogged down in the details of parallel, multi-threaded programming. Maybe just a pipe dream ;-)
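
One standard way to put numbers on the "8 cores won't make everything 8x faster" point is Amdahl's law, which nobody in the thread named but which fits the argument: if only a fraction p of a program can run in parallel, the best-case speedup on n cores is 1 / ((1 - p) + p / n). A minimal sketch, where the parallel fractions are made-up examples and memory bandwidth limits are ignored entirely:

```python
def amdahl_speedup(parallel_fraction, cores):
    """Best-case speedup when only part of the work can be spread across cores."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Hypothetical workloads on an 8-way box: 50%, 90% and 99% parallel.
for p in (0.5, 0.9, 0.99):
    print(f"{p:.0%} parallel on 8 cores: {amdahl_speedup(p, 8):.2f}x")
```

Running several independent VMs or databases side by side is closer to the fully parallel end of this scale, which lines up with the experience described above.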
 


Perhaps I'm not tech savvy enough to know the ins and outs, but I've always wondered if RAM could be made fast enough that it could be inserted between processors. Let's say you have one 3.0 GHz processor dividing the original logic task between two or four 3.0 GHz processors that do the actual math in sections, then send it to a 5.0 GHz processor for execution. Would RAM even be needed in between, or would current RAM configurations work? Would that speed up logic processing or slow it down? Would your BIOS be a mini OS at that stage? And what is the limit to processor downsizing? Will we eventually calculate at the atomic level (0.1-0.5nm)? I know they're trying it and hope to do so eventually. Hell, compared to where we were ten years ago, 45nm is DAMN close to atomic-level processing.
 

My Computer

System One

  • Manufacturer/Model
    Home built
    CPU
    phenom IIx4 810
    Motherboard
    m4a79t deluxe
    Memory
    8Gb DDR3 OCZ Platinum AMD Edition
    Graphics Card(s)
    evga 295 co-op and 9800 gtx for PhysX
    Monitor(s) Displays
    Mk241H
    Screen Resolution
    1920x1200
    Hard Drives
    Seagate Barracuda 7200 750Gb
    PSU
    ABS Tagan BZ 800W
    Case
    Xclio A380PLUS-BK Fully Black
    Cooling
    Asus Royal Knight
    Keyboard
    Razor Lycosa Mirror edition
    Mouse
    Logitech 5 button tilt wheel laser (don't remember model#)
    Internet Speed
    Wireless Road Runner
Perhaps I'm not tech savvy enough to know the ins and outs, but I've always wondered if RAM could be made fast enough that it could be inserted between processors. Let's say you have one 3.0 GHz processor dividing the original logic task between two or four 3.0 GHz processors that do the actual math in sections, then send it to a 5.0 GHz processor for execution. Would RAM even be needed in between, or would current RAM configurations work? Would that speed up logic processing or slow it down? Would your BIOS be a mini OS at that stage? And what is the limit to processor downsizing? Will we eventually calculate at the atomic level (0.1-0.5nm)? I know they're trying it and hope to do so eventually. Hell, compared to where we were ten years ago, 45nm is DAMN close to atomic-level processing.

ALERT ALERT!

You have bumped a thread that is 2 weeks old! You should be ashamed of yourself! :p

(No, I'm not kidding... You shouldn't bump threads that are 2 weeks old.)
 

My Computer

System One

  • Manufacturer/Model
    Dell Inspiron 640m Notebook MXCO61
    CPU
    Intel Core Duo T2080 @ 1.73GHz
    Memory
    Dell Memory 2GB
    Graphics Card(s)
    Mobile Intel 945GM Graphics Accelerator
    Sound Card
    Dell High Def. Sound
    Monitor(s) Displays
    Dell 14.1" High Res. UltraSharp Notebook Display
    Screen Resolution
    1440x900
    Hard Drives
    One Hard Drive
    120GB
    Keyboard
    Dell Inspiron 640m Stock
    Mouse
    Logitech V4500 Wireless Notebook Laser Mouse
    Internet Speed
    54MB/s
My bad, I just thought I had some legit questions and hypotheses, and this thread had the correct context. Besides, it was still on the top page of overclock and cooling, so I figured it was fresh enough. Should I start a new thread to cover the same topic? Seems frivolous, especially on such a slow-moving forum as OC&C.
 

I'll bump...
 
