From the perspective of an edge AI chip engineer facing the triple challenges of bandwidth, power consumption, and cost, introducing an SRAM-based in-memory computing (IMC) architecture is one of the core solutions to the bottlenecks that currently limit large-model deployment on edge devices.
01
Why can’t DRAM main memory + traditional computing architecture meet the needs of large model deployment on edge devices?
- Bandwidth bottleneck (Memory Wall)
On edge chips, the bus bandwidth of DRAM (e.g., LPDDR5/DDR5) is extremely limited (10–50 GB/s), far below the data throughput required for large-model inference.
For example, a 7B-parameter FP16 model has about 14 GB of weights. If every inference step has to fetch those weights from DRAM, the accesses introduce huge latency and power overhead.
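As a back-of-envelope check on these figures, the following sketch (assuming the 7B FP16 numbers above and representative edge DRAM bandwidths) estimates the tokens-per-second ceiling when each decode step must stream the full weight set from DRAM:

```python
# Back-of-envelope estimate using the figures above (assumed values):
# a memory-bound decode step re-reads every weight once per token,
# so DRAM bandwidth alone caps the achievable tokens/s.
PARAMS = 7e9                                  # 7B parameters
BYTES_PER_PARAM = 2                           # FP16
weights_gb = PARAMS * BYTES_PER_PARAM / 1e9   # ~14 GB of weights

for bw_gbps in (10, 25, 50):                  # representative edge DRAM bandwidths
    seconds_per_token = weights_gb / bw_gbps  # time just to stream weights once
    print(f"{bw_gbps:>3} GB/s -> {seconds_per_token:.2f} s/token "
          f"({1 / seconds_per_token:.2f} tokens/s ceiling)")
```

Even at 50 GB/s, streaming 14 GB of weights per token caps throughput at well under 4 tokens/s, before any compute is counted.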
- Power consumption and energy efficiency limitations
The energy cost of data movement is far higher than that of the computation itself:
- One DRAM access: ~100–200 pJ/bit
- One SRAM access: ~1–10 pJ/bit
- One MAC operation: <1 pJ (single precision)
In large models like Transformer, over 90% of latency and energy consumption come from memory access.
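A rough per-token energy split, using the per-bit figures above (order-of-magnitude assumptions only: 14 GB of FP16 weights moved once and roughly 7 billion MACs per decode step), shows why memory access dominates:

```python
# Order-of-magnitude energy split per decode step, based on the per-bit
# figures above (assumed midpoints of the quoted ranges).
WEIGHT_BITS = 7e9 * 16        # bits of FP16 weights touched per token
E_DRAM_BIT = 150e-12          # ~150 pJ/bit (midpoint of the 100-200 pJ range)
E_SRAM_BIT = 5e-12            # ~5 pJ/bit (midpoint of the 1-10 pJ range)
E_MAC = 1e-12                 # <1 pJ per MAC, taken as a 1 pJ upper bound
MACS = 7e9                    # roughly one MAC per weight per token

dram_j = WEIGHT_BITS * E_DRAM_BIT   # energy if weights come from DRAM
sram_j = WEIGHT_BITS * E_SRAM_BIT   # energy if weights stay in SRAM
mac_j = MACS * E_MAC                # energy of the arithmetic itself
print(f"DRAM fetch: {dram_j:.1f} J, SRAM fetch: {sram_j:.2f} J, compute: {mac_j:.3f} J")
print(f"memory share of total (DRAM case): {dram_j / (dram_j + mac_j):.1%}")
```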
- Low compute utilization
In the traditional Von Neumann architecture, compute units (MAC arrays) spend long periods waiting for memory data, resulting in NPU/AI core utilization far below the ideal (<50%).
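A roofline-style estimate makes the utilization gap concrete. Assuming a hypothetical edge NPU with 8 INT8 TOPS of peak compute, 25 GB/s of DRAM bandwidth, and the ~2 ops/byte arithmetic intensity of a batch-1 GEMV (all three numbers are illustrative assumptions):

```python
# Roofline-style utilization estimate for a memory-bound GEMV on a
# hypothetical edge NPU (all parameters below are assumptions).
PEAK_TOPS = 8.0      # assumed INT8 peak of the MAC array, in TOPS
BW_GBPS = 25.0       # assumed DRAM bandwidth, in GB/s
OPS_PER_BYTE = 2.0   # batch-1 GEMV: ~2 INT8 ops per weight byte fetched

attainable_tops = min(PEAK_TOPS, BW_GBPS * OPS_PER_BYTE / 1e3)  # Gops -> TOPS
print(f"attainable: {attainable_tops:.3f} TOPS, "
      f"utilization: {attainable_tops / PEAK_TOPS:.1%} of peak")
```

For batch-1 decoding, the attainable throughput is a tiny fraction of peak: the MAC array sits idle waiting for weights.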
02
Why choose SRAM + in-memory computing architecture?
- Core objective: reduce data movement, improve energy efficiency
Storing weights in SRAM and performing local computations inside SRAM significantly reduces DRAM traffic and on-chip bus bandwidth usage, easing the bandwidth bottleneck.
SRAM’s high bandwidth and low latency make it well-suited for frequently accessed parameters, such as the weights of the QKV projections in attention mechanisms.
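To make the saving concrete, the sketch below (hypothetical hidden size of 4096, INT8 weights) compares per-token DRAM traffic for one layer's QKV projections when the weights are re-fetched each token versus pinned in SRAM:

```python
# Per-token DRAM traffic for one layer's QKV projections (hypothetical
# hidden size, INT8 weights): re-fetched from DRAM each token vs SRAM-resident.
D_MODEL = 4096
QKV_BYTES = 3 * D_MODEL * D_MODEL     # Wq, Wk, Wv at 1 byte per weight
ACT_BYTES = D_MODEL                   # one token's activation vector, INT8

refetch_bytes = QKV_BYTES + ACT_BYTES   # weights streamed from DRAM every token
resident_bytes = ACT_BYTES              # weights already sit in SRAM bitcells
print(f"QKV weights: {QKV_BYTES / 2**20:.0f} MiB per layer")
print(f"DRAM bytes per token: {refetch_bytes:,} (re-fetch) "
      f"vs {resident_bytes:,} (SRAM-resident)")
```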
- Implementation approach: SRAM arrays + low-bitwidth MAC computation
A portion of the weights is mapped into SRAM bitcells and combined with peripheral MAC logic to perform matrix-vector multiplication (MVM).
Using low-bitwidth formats (e.g., INT8, even Binary) further reduces power consumption.
Typical architectures include Processing-in-SRAM or more radical analog IMC in SRAM (using voltage/current as the compute medium).
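As an illustration of the digital processing-in-SRAM flavor, the NumPy sketch below emulates one common bit-serial scheme (an assumption for illustration, not a specific product's design): weight bit-planes sit in the bitcells, activation bits are streamed in one plane per cycle, the array performs bitwise AND plus column-wise popcount, and a shift-accumulate stage reconstructs the full-precision result. Unsigned low-bitwidth operands are assumed; signed handling and analog non-idealities are omitted.

```python
import numpy as np

def bit_serial_mvm(W, x, w_bits=4, x_bits=4):
    """Emulate a digital bit-serial in-SRAM matrix-vector multiply.

    Weight bit-planes are assumed to sit in the SRAM bitcells; the input
    vector is streamed in one bit-plane per cycle. Unsigned operands only.
    """
    acc = np.zeros(W.shape[0], dtype=np.int64)
    for xb in range(x_bits):                      # stream one activation bit-plane
        x_plane = (x >> xb) & 1
        for wb in range(w_bits):                  # weight bit-planes stored in SRAM
            w_plane = (W >> wb) & 1
            # bitwise AND in the bitcells + column-wise popcount (adder tree)
            popcount = (w_plane * x_plane).sum(axis=1)
            # shift-and-accumulate restores each bit position's binary weight
            acc += popcount.astype(np.int64) << (xb + wb)
    return acc

rng = np.random.default_rng(0)
W = rng.integers(0, 16, size=(8, 64))   # 4-bit unsigned weights
x = rng.integers(0, 16, size=64)        # 4-bit unsigned activations
assert np.array_equal(bit_serial_mvm(W, x), W @ x)  # matches a normal MVM
```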
03
Advantages of SRAM-based IMC (engineering perspective)
Technical Aspect | Description |
---|---|
High bandwidth | On-chip SRAM bandwidth reaches hundreds of GB/s vs DRAM's tens of GB/s, enabling parallel read/write of large-model weights |
Low power | In-situ processing drastically lowers energy consumption, ideal for continuous AI inference on mobile devices |
Higher energy efficiency | Peak TOPS/W far exceeds that of traditional architectures; can reach 50–100 TOPS/W (vs <10 for DRAM-based designs) |
Predictable latency | SRAM access in ns range avoids DRAM’s multi-cycle uncertainty |
Flexible deployment | Supports small models fully resident in SRAM, or cache-based partial loading for large models |
04
Engineering challenges and solutions
Problem | Solution |
---|---|
High SRAM area cost | Low-precision formats (INT4/INT2), weight reuse, model pruning |
Limited compute precision | Mixed-precision design (critical layers in higher precision) |
Limited on-chip SRAM capacity | Layer-by-layer loading + weight reorganization |
Process constraints | Advanced nodes (e.g., TSMC N4/N3) for SRAM bitcell density improvements |
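For the "layer-by-layer loading" row, a minimal planning sketch (hypothetical layer sizes and an assumed 8 MiB SRAM budget) shows the basic idea: pin whatever fits in SRAM and stream the remainder from DRAM each forward pass, ideally double-buffered so loads overlap with compute:

```python
# Minimal sketch of the "layer-by-layer loading" strategy above, with
# hypothetical layer sizes and an assumed 8 MiB on-chip SRAM budget.
SRAM_BYTES = 8 * 2**20
layer_bytes = {                 # hypothetical INT8 weight sizes per layer group
    "embed": 4 * 2**20,
    "attn_qkv": 3 * 2**20,
    "attn_out": 1 * 2**20,
    "mlp_up": 4 * 2**20,
    "mlp_down": 4 * 2**20,
}

resident, streamed, used = [], [], 0
for name, size in sorted(layer_bytes.items(), key=lambda kv: kv[1]):
    if used + size <= SRAM_BYTES:   # greedily pin the smallest layers in SRAM
        resident.append(name)
        used += size
    else:                           # everything else is streamed from DRAM,
        streamed.append(name)       # ideally double-buffered behind compute

dram_per_pass = sum(layer_bytes[n] for n in streamed)
print(f"resident in SRAM: {resident} ({used / 2**20:.0f} MiB used)")
print(f"streamed per pass: {streamed} ({dram_per_pass / 2**20:.0f} MiB from DRAM)")
```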
05
Representative chip cases (supporting evidence)
Chip | Approach | Characteristics |
---|---|---|
Apple M series / ANE | SRAM cache + compute fusion | Weights stored in SRAM blocks, low-latency processing for image and speech |
Google Edge TPU | SRAM as main memory + low-bitwidth compute | INT8 inference, about 4 TOPS at roughly 2 W |
Ambiq Apollo4+ | All-SRAM architecture + uAI | Designed for ultra-low-power AI voice, power consumption only tens of µW |
Horizon Journey (征程) | SRAM-based NPU array | For autonomous-driving edge perception; model structures are optimized to match SRAM access patterns |
06
Conclusion
The SRAM-based in-memory computing architecture is a key direction for deploying large models on edge AI chips. By enabling "in-situ computation," it breaks through the bandwidth wall of traditional architectures, significantly improves energy efficiency and inference throughput, reduces power and thermal stress, and avoids the BOM cost increases that extra DRAM capacity and bandwidth would bring. It is the most practical architectural path to resolving the three core constraints of edge AI computing: bandwidth, power, and cost.