An open chiplet ecosystem is critical to the future
Driven by Moore’s law, chip development has always pursued higher performance, lower cost, and higher integration. As more transistors are integrated on a single chip and process nodes shrink, quantum tunneling becomes more pronounced and leakage current grows, which creates a bottleneck for further frequency scaling. To keep improving system performance, chips therefore moved from single-core to multi-core designs.
In the post-Moore era, the R&D cost of advanced process technology is very high and market demand changes quickly, which fragments applications severely. It is difficult for one large, all-in-one chip to cover every need. High R&D cost, together with the drop in yield caused by a large die area, also drives chip cost up sharply. To extend Moore’s law, heterogeneous integration of multiple dies is adopted in place of a single large chip, so that integration and performance can keep improving at an acceptable cost. As a result, chips have gradually evolved into multi-core heterogeneous systems.
What Is Chip Interconnect Technology?
With the arrival of the multi-core era, the major manufacturers have independently converged on the same technical route: scaling out across multiple dies.
There are three main approaches. The first is multi-chip module (MCM) substrate packaging, in which dies are interconnected through traces in the package substrate; it suits low-power, ultra-short-reach links. The second is the silicon interposer, an extra layer of silicon placed beneath the dies that acts as an intermediary connecting them; Apple uses this approach. The third is the Embedded Multi-die Interconnect Bridge (EMIB), in which small bridges with multiple wiring layers are embedded in the substrate during fabrication and the dies are interconnected through those bridges; Intel uses this approach.
Chris Bergey, senior vice president and general manager of infrastructure at Arm, said, “The future of CPU design is accelerating and moving toward multi-chip, which makes it imperative that the entire ecosystem support chiplet-based SoCs.”
Apple M1 Ultra and UltraFusion
Built from 114 billion transistors, the M1 Ultra supports up to 128GB of high-bandwidth, low-latency unified memory and combines 20 CPU cores, 64 GPU cores, and a 32-core Neural Engine that can run up to 22 trillion operations per second. According to Apple, it provides eight times the GPU performance of the M1 chip and 90 percent higher multi-threaded CPU performance than the fastest 16-core PC desktop chip in the same power envelope.
The key to such a chip is joining two M1 Max dies into a single SoC twice the size. Because the M1 Ultra combines two M1 Max dies, its hardware specifications are directly doubled.
Existing dual-processor PCs connect the two processors through traces on the motherboard. In that configuration the communication bandwidth between the CPUs is limited and latency is added, so performance does not simply double; power consumption and heat generation also increase.
To address this, the M1 Ultra uses an interconnect technology called “UltraFusion”, which relies on a silicon interposer carrying more than 10,000 signals to connect the two dies directly, without passing through external circuitry. With this design, the interconnect can transfer data at up to 2.5 TB/s.
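As a rough sanity check on these figures, the quoted aggregate bandwidth can be divided by the quoted connection count. The Python sketch below is only back-of-the-envelope arithmetic: it assumes decimal terabytes, exactly 10,000 connections, an even split of bandwidth among them, and no encoding overhead, none of which Apple has specified. The function name is illustrative.

```python
# Back-of-the-envelope: average data rate per UltraFusion connection.
# Assumptions (not stated by Apple): 1 TB = 10**12 bytes, exactly 10,000
# connections, bandwidth shared evenly, no encoding or protocol overhead.

AGGREGATE_TB_PER_S = 2.5
SIGNAL_COUNT = 10_000


def per_signal_gbit_per_s(aggregate_tb_per_s: float, signals: int) -> float:
    """Return the average rate per connection in Gbit/s."""
    total_bits_per_s = aggregate_tb_per_s * 1e12 * 8  # bytes -> bits
    return total_bits_per_s / signals / 1e9


if __name__ == "__main__":
    rate = per_signal_gbit_per_s(AGGREGATE_TB_PER_S, SIGNAL_COUNT)
    print(f"about {rate:.1f} Gbit/s per connection")
```

Under those assumptions each connection carries roughly 2 Gbit/s, which suggests the bandwidth comes from a very large number of short parallel wires rather than from a few very fast serial links.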
Most importantly, the scheduler designed into the M1 Max distributes work across the doubled number of processing cores, so the two dies behave like a single SoC. The memory controllers likewise operate as one unit, so the number of memory channels doubles and memory bandwidth rises to 800GB/s.
For example, one M1 Max has a 10-core CPU, which becomes 20 cores when two dies are connected. A scheduler module decides which core processes each command in a program, and the M1 Max’s scheduler was designed from the start for a 20-core CPU, with the number of command buffers sized accordingly.
Nvidia, Intel, and AMD’s Choice
“Chiplets and heterogeneous computing are critical to dealing with the slowdown of Moore’s law,” said Ian Buck, vice president of hyperscale computing at Nvidia.
Nvidia’s recently released Grace CPU Superchip for data centers takes a similar approach.
The chip consists of two CPU dies interconnected by NVLink-C2C. According to Nvidia, the link is up to 25 times more energy-efficient and 90 times more area-efficient than PCIe Gen 5 on Nvidia’s chips, and delivers bandwidth of 900 gigabytes per second or more.
NVLink-C2C is comparable to the UCIe standard recently launched by Intel, TSMC, Samsung, and many other technology vendors. It is a new high-speed, low-latency, chip-to-chip interconnect that supports connecting custom dies to GPUs, CPUs, DPUs, NICs, and SoCs.
Intel had previously demonstrated its EMIB (Embedded Multi-die Interconnect Bridge) technology at Hot Chips. A single substrate can hold many embedded bridges, providing very high I/O counts and well-controlled electrical interconnect paths between multiple dies as needed.
Because a die does not have to reach the package through a silicon interposer with TSVs, its performance is not degraded. Micro bumps carry the high-density signals, while standard, coarser-pitch flip-chip bumps provide direct power and ground connections from the die to the package.
Why Chip Interconnect Technology?
Among current process technologies, TSMC’s 5nm is the most advanced node that is actually available in volume. To launch a higher-performance chip within that process constraint, there are two options: design a new chip with a larger area, or use two of the existing chips together.
A larger die area, however, is itself one of the current dilemmas of chip development. The larger the die, the lower its yield: for dies above 400 square millimeters, yield falls to 20-30%, because a larger die is more likely to contain a defect. As for using two chips at a time, the mainstream industry approach today is to connect them through the motherboard PCB.
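The relationship between die area and yield can be illustrated with a simple Poisson yield model, Y = exp(-A * D0). The Python sketch below is an assumption-laden illustration rather than a foundry model: it calibrates the defect density D0 so that a 400 mm^2 die yields 25% (the midpoint of the 20-30% range quoted above), and it assumes dies are tested before assembly and that packaging adds no yield loss. All function names are illustrative.

```python
import math

# Simple Poisson yield model: Y = exp(-A * D0).
# Assumptions: the model itself, the 25% calibration point, pre-assembly
# die testing, and zero packaging loss are illustrative simplifications.


def defect_density(yield_fraction: float, area_mm2: float) -> float:
    """Defects per mm^2 implied by a given yield at a given die area."""
    return -math.log(yield_fraction) / area_mm2


def die_yield(area_mm2: float, d0_per_mm2: float) -> float:
    """Fraction of dies of the given area expected to be defect-free."""
    return math.exp(-area_mm2 * d0_per_mm2)


def silicon_per_good_system(area_mm2: float, dies: int, d0_per_mm2: float) -> float:
    """Wafer area (mm^2) consumed per working system built from `dies` equal dies."""
    return dies * area_mm2 / die_yield(area_mm2, d0_per_mm2)


if __name__ == "__main__":
    d0 = defect_density(0.25, 400.0)
    for area in (400.0, 200.0, 100.0):
        print(f"{area:.0f} mm^2 die: yield ~ {die_yield(area, d0):.0%}")
    print(f"monolithic:  {silicon_per_good_system(400.0, 1, d0):.0f} mm^2 per good system")
    print(f"two dies:    {silicon_per_good_system(200.0, 2, d0):.0f} mm^2 per good system")
```

Under these assumptions, halving the die area raises yield from 25% to about 50%, so a system built from two 200 mm^2 dies consumes roughly half the wafer area per working unit that a single 400 mm^2 die does.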
For example, the ASUS WS C621E SAGE is a dual-CPU motherboard, designed to support two CPUs working at the same time.
The disadvantages of this approach are obvious. The two CPU sockets and the wiring needed to connect them take up a large PCB area, so the resulting product is very large. And because the two CPUs are connected through PCB traces, latency becomes significant.
These disadvantages are essentially the result of the connection being too long, which is why Apple, Nvidia, and Intel have turned to interconnects inside the package.
Industry insiders speculate that Apple’s UltraFusion packaging architecture is a customized version of TSMC’s InFO_LSI or CoWoS-L. TSMC has announced two versions of its silicon bridge technology, InFO_LSI and CoWoS-L, with the InFO_LSI bump pad pitch specified at 25 µm. This closely matches the M1 Max’s bump pad pitch, which has been reduced to 25 µm.
The RDL (redistribution layer) line/space dimension for InFO_LSI is 0.4/0.4 µm, which implies a wiring density of 1,250 wires per mm per layer. Given that the die edge on the interconnect side is over 18 mm long, this allows more than 20,000 potential I/Os, far exceeding the 10,000 quoted by Apple’s Johny Srouji.
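The arithmetic behind these density figures can be reproduced in a few lines of Python. The sketch assumes one wire per line-plus-space pitch, a single routing layer, and that every track along the shared die edge is usable for I/O; real designs reserve tracks for shielding, power, and redundancy, so the result is an upper bound. The function names are illustrative.

```python
# Wiring-density arithmetic for a silicon bridge RDL.
# Assumptions: one wire per (line + space) pitch, one routing layer by
# default, and every track on the shared die edge usable for I/O.


def wires_per_mm(line_um: float, space_um: float) -> float:
    """Number of parallel wires that fit in 1 mm of edge on one layer."""
    pitch_um = line_um + space_um
    return 1000.0 / pitch_um


def potential_ios(edge_mm: float, line_um: float, space_um: float, layers: int = 1) -> int:
    """Upper bound on I/O count along an edge of the given length."""
    return round(edge_mm * wires_per_mm(line_um, space_um) * layers)


if __name__ == "__main__":
    density = wires_per_mm(0.4, 0.4)
    total = potential_ios(18.0, 0.4, 0.4)
    print(f"{density:.0f} wires/mm/layer, up to {total} I/Os on an 18 mm edge")
```

With a 0.8 µm pitch this gives 1,250 wires per mm, and an 18 mm edge yields an upper bound of 22,500 I/Os on one layer, consistent with the "over 20,000" figure above.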
In January 2021, TSMC CEO C.C. Wei said at an earnings meeting, “For advanced packaging technologies including SoIC, CoWoS, etc., we observe that chiplets are becoming an industry trend. TSMC is working with several customers on 3D packaging development using chiplet architecture.”
Because dies produced by different manufacturers differ in architecture, interconnect interface, and protocol, designers must take into account many complex factors such as process, packaging technology, system integration, and scaling. At the same time, they must meet the requirements of different fields and scenarios for data transfer speed, power consumption, and so on, which makes chiplet design exceptionally difficult. The biggest obstacle to solving these problems is the lack of a unified standard protocol.
A Hot Interconnect Alliance
Intel, TSMC, and Samsung, together with ASE, AMD, Arm, Qualcomm, Google, Microsoft, and Meta (Facebook), ten industry giants in all, announced the formation of a chiplet alliance and the launch of a new universal chiplet interconnect standard, UCIe, in order to jointly build a common chiplet interconnect standard and promote an open ecosystem.
The appeal of UCIe is that chiplets from different companies can be built to one unified standard, so that dies from different vendors, processes, architectures, and functions can be mixed and matched and interoperate easily, while also achieving high bandwidth, low latency, low power consumption, and low cost.
Nvidia and Apple, two major players in heterogeneous integration, are not members of the UCIe Alliance. But as Nvidia’s NVLink-C2C interconnect and Apple’s UltraFusion show, neither company will be absent from this trend.
On April 2, 2022, VeriSilicon announced that it had officially joined the UCIe industry alliance, becoming one of the first companies in mainland China to do so. However, Chinese manufacturers still carry little weight within the UCIe alliance. If the industry leaders unite to set “new rules of the game”, downstream device companies will have no choice but to follow. As a precaution, China has already begun to build its own set of chiplet standards.
In May 2021, the China Computer Interconnection Technology Alliance (CCITA) registered a chiplet standard project with the Ministry of Industry and Information Technology, titled “Technical Requirements for Chiplet Interface Bus”. The standard is being developed jointly by the Institute of Computing Technology of the Chinese Academy of Sciences, an electronics institute under the Ministry of Industry and Information Technology, and a number of domestic chip manufacturers.
Ten months after the work began, a draft has been released. It will soon enter public consultation, after which it is to be revised, technically verified by the end of this year or early next year, and then officially released.
An open chiplet ecosystem is critical to this future. With the support of the UCIe Alliance, key industry partners can work together toward the common goal of changing how the industry delivers new products and continuing to deliver on the promise of Moore’s law.