posted by Bryon Moyer
While multicore took a while to take root in the embedded world, it’s now relatively commonplace. But the most accessible style of multicore is homogeneous: all cores the same. Combine that with a shared memory configuration, and this becomes easy because you can use a symmetric multiprocessing (SMP) solution out of the box from your favorite OS. It treats the multiple processors as a group as if they were a single processor. It can do so because the processors are all the same and they all have the same view of memory.
But that’s not how all systems work. Increasingly, folks are wrestling with heterogeneous systems, which have different processors, each of which might have its own OS instance running (or perhaps even no OS at all, so-called “bare metal”). While there might be some shared memory, it would typically be for message passing between processors. For the most part, each processor would have its own memory (or portion of memory) privately allocated.
(Just to clear up some potentially confusing terminology here, homo-/heterogeneous refers to the hardware architecture; SMP and its counterpart asymmetric multiprocessing (AMP) refer more to the software architecture, including the OS. SMP has a single OS instance managing the entire bunch of processors, and, as such, requires a homogeneous configuration. But a homogeneous configuration can be run as AMP if some of the processors have their own independent OS instances running. AMP systems can also include SMP components. At least, that’s how I see it…)
AMP/heterogeneous systems are harder to manage because each processor is aware only of itself. For the most part, no OS instance has system-wide scope (unlike SMP). So things that are easy with SMP become hard with AMP.
An easy example is bring-up: who’s in charge to make sure that all processors are synchronized in their operation? Typically, each processor comes up and holds at a so-called “barrier,” but which processor looks around, deems the system stable and ready for launch, and releases the barrier?
Much of that is system design work to assign that “master” processor, but then you need a programmatic way of implementing the decision. And the catch is that, because each processor is out of scope for any other processor, there’s no obvious logical way for one processor to control the others. There’s no “super-process” that has everything in scope.
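To make the barrier idea concrete, here is a minimal sketch in Python. It is only an illustration: threads stand in for processors, `threading.Event` objects stand in for flags that would really live in shared memory, and names like `NUM_CORES`, `MASTER_ID`, and `boot_system` are made up for this example.

```python
import threading

NUM_CORES = 4   # hypothetical core count
MASTER_ID = 0   # hypothetical choice of "master" core


def boot_system():
    """Simulate bring-up: every core holds at a barrier until the master releases it."""
    ready = [threading.Event() for _ in range(NUM_CORES)]  # one "I'm up" flag per core
    release = threading.Event()                             # the barrier itself
    launched = []                                           # cores that got past the barrier
    lock = threading.Lock()

    def core_main(core_id):
        # Local bring-up would happen here (clocks, memory, OS or bare metal).
        ready[core_id].set()
        if core_id == MASTER_ID:
            # The master looks around and waits for every core to report in...
            for flag in ready:
                flag.wait()
            # ...then deems the system ready and releases the barrier.
            release.set()
        # Every core, master included, holds here until released.
        release.wait()
        with lock:
            launched.append(core_id)

    threads = [threading.Thread(target=core_main, args=(i,)) for i in range(NUM_CORES)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(launched)


cores_launched = boot_system()
```

The point of the sketch is the asymmetry: all cores run the same hold-at-barrier code, but only one of them is given the job of deciding when to let go.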
This is what Mentor Graphics is trying to address with their recent multicore announcement. Now, to be clear, Mentor makes a lot of tools for a lot of things, including both silicon implementation and embedded systems. This announcement was not about how to create a multicore SoC; it was about how to implement software on an existing heterogeneous architecture.
There were three fundamental components to their announcement, addressing several different AMP challenges:
- remoteProc: this is a set of functions that allows one processor to control another. In particular, it allows for a well-coordinated system-wide boot sequence, with one processor controlling the life cycles of others.
- rpmsg: this allows intercommunication between processors (so-called inter-process(or) communication, or IPC). The figure below shows this interacting through the hypervisor, although no hypervisor is required. They’ve also optimized an MCAPI implementation, which can be layered over this.
- Improvements to their CodeSourcery tools to allow visualization of the different processes/processors in a clear and synchronized fashion (a challenge if you just run something like gdb on each core and then try to make sense out of the independent results).
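The first two items can be pictured with a toy model. To be clear, this is not the remoteProc or rpmsg API; the classes `RemoteCore` and `Master`, their methods, and the firmware name are all hypothetical, and a pair of `queue.Queue` objects stands in for a shared-memory message channel. It only shows the division of labor: one processor owns the life cycle of another, and messages flow over a channel once the remote is running.

```python
import queue
from enum import Enum


class State(Enum):
    OFFLINE = "offline"
    RUNNING = "running"


class RemoteCore:
    """Toy stand-in for a remote processor (hypothetical, not a real API)."""

    def __init__(self, name):
        self.name = name
        self.state = State.OFFLINE
        self.firmware = None
        self.inbox = queue.Queue()    # models a shared-memory channel, master -> remote
        self.outbox = queue.Queue()   # models the return channel, remote -> master

    def service_one_message(self):
        # What the remote firmware would do: read one message and reply.
        msg = self.inbox.get_nowait()
        self.outbox.put(f"{self.name} handled {msg!r}")


class Master:
    """The one processor that controls the life cycle of the others."""

    def boot(self, core, firmware):
        if core.state is not State.OFFLINE:
            raise RuntimeError(f"{core.name} is already running")
        core.firmware = firmware     # load the image...
        core.state = State.RUNNING   # ...and let the core run

    def shutdown(self, core):
        core.state = State.OFFLINE
        core.firmware = None

    def send(self, core, msg):
        if core.state is not State.RUNNING:
            raise RuntimeError(f"{core.name} is not running")
        core.inbox.put(msg)


master = Master()
dsp = RemoteCore("dsp0")
master.boot(dsp, firmware="filter.elf")   # life-cycle control
master.send(dsp, "start-filter")          # inter-processor message
dsp.service_one_message()
reply = dsp.outbox.get_nowait()
master.shutdown(dsp)
```

Note that the master refuses to send to a core it hasn't booted; keeping life-cycle state and messaging tied together like this is one design choice among several.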
Image courtesy Mentor Graphics
They did a cleanroom implementation of these functions to ensure that they could be used with proprietary software without exposing that proprietary software to GPL license restrictions (which would otherwise require making source code for that proprietary stuff public).
They have it integrated into their tools for ARM processors and available as libraries for bare-metal setups. They don’t have an integrated version for non-ARM processors; the libraries could be used, although they haven’t been validated on anything but ARM. It’s not that they think it won’t work on non-ARM; they just seem to feel that ARM is all that matters, so it’s where they’ve spent their energy. (I’m assuming that if something else blew up huge, they’d invest more energy there…)
You can find out more in their announcement.
posted by Bryon Moyer
Cadence recently announced new extraction tools, claiming both greater speed (5x) and best-in-class accuracy for full-chip extraction. And what is it that lets them speed up without sacrificing results?
The answer is the same thing that has benefited so many EDA tools over the last few years: parallelism. Both within a box (multi-threading) and using multiple boxes (distributed computing). The tools can scale up to hundreds of CPUs, although they’re remaining mum on the details of how they did this…
They have two new tools: a new random-walk field solver (Quantus FS) and the full-chip extraction tool (Quantus QRC). They say that the field solver is actually running around 20 times faster than their old one.
The field solver is much more detailed and accurate than the full-chip extraction tool. It’s intended for small circuits and high precision; its results are abstracted for use on a larger scale by the full-chip tool. That said, they claim good correlation between QRC and FS, so not much is lost in the abstraction.
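For anyone wondering what "random walk" has to do with field solving: the general Monte Carlo principle is that the potential at a point can be estimated by launching many random walks from that point and averaging the boundary voltages where they land. The sketch below is a toy version of that principle on a square grid, not Cadence's algorithm; the grid size, the boundary voltages (top edge at 1 V, the other three at 0 V), and the function name `potential_at` are all assumptions made for illustration.

```python
import random


def potential_at(start, size, walks, seed=0):
    """Estimate the potential at `start` inside a size x size grid.

    Assumed boundary condition for this toy: the top edge (y == size)
    is held at 1 V and the other three edges at 0 V.
    """
    rng = random.Random(seed)
    steps = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    total = 0.0
    for _ in range(walks):
        x, y = start
        # Walk until any edge of the box is reached.
        while 0 < x < size and 0 < y < size:
            dx, dy = rng.choice(steps)
            x, y = x + dx, y + dy
        # Score the boundary voltage where this walk ended.
        total += 1.0 if y == size else 0.0
    return total / walks


estimate = potential_at(start=(5, 5), size=10, walks=20000)
```

From the center of the box, symmetry says each edge is equally likely to be hit, so the estimate should land close to 0.25 V. Every walk is independent of every other, which is one reason this style of solver lends itself to the kind of parallelism described above.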
They’ve also simplified the FinFET model, cutting the size of the circuit in half and increasing analysis speed by 2.5x.
While QRC is intended for the entire chip, it can also be used incrementally – in which case it can be three times again as fast. Both the Encounter digital implementation tool and their Tempus timing analysis tool can take advantage of this incremental capability to do real-time extraction as the tools make decisions. It’s also integrated into the Virtuoso analog/custom tool.
As to accuracy, they say they meet all of TSMC’s golden FinFET data, that they achieve consistent results with single- and multi-corner analysis, and that they’ve been certified by TSMC for the 16-nm node.
Their fundamental capabilities are summarized in the following figure, although this coverage is consistent with the prior tools.
Image courtesy Cadence
You can read more in their announcement.
posted by Bryon Moyer
3D has been tossed about quite a bit over the last few years. We can ignore the 3D TV craze that came and went like an evanescent avatar. But the two IC manifestations have been 3D transistors (i.e., FinFETs) and 3D package integration – stacking chips.
The latter is a more-than-Moore technology that allows multiple chips, each built on the process best suited to it, to be combined in one package, with the ability to leverage high-volume off-the-shelf dice like memories instead of designing them from scratch.
But what if you want to scale like circuits vertically? That’s to say, things that aren’t available off the shelf and that all require the same process? Either you have to build them laterally on a single chip or build multiple chips and stack them.
Well, Leti is working on another option: monolithic 3D integration. What this amounts to is building a standard chip and then growing a new layer of silicon (or something) above it and building more circuits. Sounds pretty straightforward in concept, but it’s easier to visualize than it is to accomplish. They presented their status at the recent Semicon West gathering.
Image courtesy Leti
The biggest concern that always arises with these sorts of ideas is thermal. For the bottom layer, you build your transistors, implant your dopants, and then “activate” them using heat to get them moving to where they’re supposed to be. After that, you want them to stay there. They’ll keep moving if you keep the heat on, so once they’re set, you don’t want any more heat.
There are also apparently worries about the contact salicide stability in the presence of extra heat.
And where might the extra heat come from?
Well, when you build the next layer of transistors, you need to dope them and activate again. If your bottom transistors are already where you want them, the extra activation will screw them up. Do you try to under-activate the bottom ones, hoping that the second activation will bring them in line?
That’s not the approach Leti is taking. They’re experimenting with a “crème brûlée” technique: use a broiler for the second layer activation. That is, heat from the top so that only the top layer gets activated in a short enough time that the heat doesn’t diffuse down and mess up the lower transistors.
Compatibility with existing processes is another consideration. You have to be able to connect the upper and lower transistors, and, in theory, there is no such interconnect at present. Rather than define new interconnect, they’re leveraging the local interconnect (LI) for that piece.
Finally, a big question: how to build and arrange the transistors and CMOS pairs – and other elements like NEMS devices that might want to ride along on the same chip? They’re playing with three different configurations.
The first is “CMOS over CMOS.” In other words, you build both N and P types on the same layer (top and bottom). They list FinFET over FinFET, Trigate/nanowire over Trigate/nanowire (all SOI), or FDSOI over FDSOI. But they also have a drawing showing an FDSOI transistor over a FinFET. Their claim is that two layers of 14-nm technology provide the scaling of a single layer of 10-nm technology.
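A quick back-of-the-envelope check shows why that 14-nm/10-nm equivalence is plausible. The sketch below assumes the idealized rule that area per transistor scales with the square of the node name, which real processes only loosely follow; the function name `relative_density` is just for this example.

```python
def relative_density(old_node_nm, new_node_nm):
    """Idealized density gain from a shrink: area scales with the node squared."""
    return (old_node_nm / new_node_nm) ** 2


shrink_gain = relative_density(14, 10)  # one layer, shrunk from 14 nm to 10 nm
stack_gain = 2.0                        # two 14-nm layers on the same footprint
```

Under that assumption the shrink gives (14/10)² ≈ 1.96 times the density, while stacking two layers gives 2 times, so the two routes come out nearly even.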
The second option is to optimize the transistors by having N and P types on different layers. So, whereas the first option has CMOS pairs built laterally, they’re built vertically in this second option. This allows them to use different materials on the two layers. They’ve already tried germanium (Ge) for P over silicon for N. And they’ve leveraged different crystal orientations, with silicon  for P over silicon  for N. Next up they’ll try InGaAs for N over Ge for P.
The third option involves integrating NEMS over CMOS. We looked at their M&NEMS program last year (and that work continues).
They did some FPGA work already just to see what kinds of improvements they can get. They used two stacked FDSOI layers and two levels of tungsten LI. They improved area by 55% (not surprising), but they also improved performance by 23% and power by 12%. Win win win. Apparently going local matters.
We’ll update as we see new results.