Intelligent Processor Research Center (IPRC), SKLP, ICT, CAS
Intelligent Computing Group, ISRC, IS, CAS

QiMeng: Fully Automated Hardware and Software Design for Processor Chip

An open-source project for automated, high-performance, and easily deployable IC full-stack design.


Overview

QiMeng aims to achieve fully automated design of the chip hardware/software stack by leveraging large language models (LLMs), agents, and Boolean logic generation technologies. QiMeng has successfully automated the design of RISC-V CPUs, the optimization of operating-system configurations, the transcompilation of tensor programs, and the development of high-performance libraries, with performance comparable to that of human experts.

Applications

Automated Chip Design

1.1 Automated CPU Design

Automated CPU design aims to fully automate the front-end design of an entire CPU, starting from its I/Os.

QiMeng-CPU-v2 Thumbnail

QiMeng-CPU-v2: Automated Superscalar Processor Design by Learning Data Dependencies

Shuyao Cheng, Rui Zhang, Wenkai He, Pengwei Jin, Chongxiao Li, Zidong Du, Xing Hu, Yifan Hao, Guanglin Xu, Yuanbo Wen, Ling Li, Qi Guo, Yunji Chen

QiMeng-CPU-v2 is the world's first automatically designed superscalar processor, improving performance by about 380x over state-of-the-art automated design methods and performing comparably to human-designed superscalar processors such as the ARM Cortex A53.

IJCAI'25
QiMeng-CPU-v1 Thumbnail

QiMeng-CPU-v1: Automated CPU Design by Learning from Input-Output Examples

Shuyao Cheng, Pengwei Jin, Qi Guo, Zidong Du, Rui Zhang, Xing Hu, Yongwei Zhao, Yifan Hao, Xiangtao Guan, Husheng Han, Zhengyue Zhao, Ximing Liu, Xishan Zhang, Yuejie Chu, Weilong Mao, Tianshi Chen, Yunji Chen

QiMeng-CPU-v1 is an industrial-scale RISC-V CPU design completed within 5 hours, over 1700× larger than the state-of-the-art automatically designed circuits. The taped-out chip is the world's first CPU designed by AI: it successfully runs the Linux operating system and performs comparably to the human-designed Intel 80486SX CPU.

1.2 HDL Generation

Automated HDL generation aims to train LLMs to generate HDL modules or complete partial HDL code.

QiMeng-CRUX Thumbnail

QiMeng-CRUX: Narrowing the Gap between Natural Language and Verilog via Core Refined Understanding eXpression

Lei Huang, Rui Zhang, Jiaming Guo, Yang Zhang, Di Huang, Shuyao Cheng, Pengwei Jin, Chongxiao Li, Zidong Du, Xing Hu, Qi Guo, Yunji Chen

QiMeng-CRUX treats Verilog generation as a constrained transformation from free-form natural language to the strict HDL space. It proposes Core Refined Understanding eXpression (CRUX), a structured intermediate representation that captures the essential semantics of user intent while enabling precise Verilog code synthesis. CRUX not only improves Verilog accuracy through two-stage training, but also serves as transferable, cross-model guidance that systematically enhances the stability and intent alignment of other hardware code models.

SALV Thumbnail

QiMeng-SALV: Signal-Aware Learning for Verilog Code Generation

Yang Zhang, Rui Zhang, Jiaming Guo, Lei Huang, Di Huang, Yunpu Zhao, Shuyao Cheng, Pengwei Jin, Chongxiao Li, Zidong Du, Xing Hu, Qi Guo, Yunji Chen

QiMeng-SALV introduces a novel framework for Verilog code generation that shifts reinforcement learning optimization from module-level to signal-level rewards. By leveraging AST analysis and signal-aware verification, it extracts functionally correct code segments from partially incorrect modules, enabling more effective RL training.
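The signal-level reward idea can be pictured with a toy example: score each output signal of a candidate module separately, so that correct signals inside a partially wrong module still earn reward. The stand-in "modules" below are plain Python functions invented for illustration; the real system operates on simulated Verilog and its AST.

```python
# Illustrative sketch of signal-level (vs. module-level) rewards. The
# "modules" here are hypothetical Python stand-ins for simulated Verilog.

def reference(a, b):
    # Golden half-adder with two output signals: sum and carry.
    return {"sum": a ^ b, "carry": a & b}

def candidate(a, b):
    # Partially incorrect candidate: 'sum' is right, 'carry' is wrong.
    return {"sum": a ^ b, "carry": a | b}

def signal_rewards(cand, ref, vectors):
    """Reward each output signal by the fraction of test vectors it
    matches, instead of one pass/fail reward for the whole module."""
    hits = {}
    for a, b in vectors:
        got, want = cand(a, b), ref(a, b)
        for sig in want:
            hits.setdefault(sig, []).append(got[sig] == want[sig])
    return {sig: sum(oks) / len(oks) for sig, oks in hits.items()}

vectors = [(0, 0), (0, 1), (1, 0), (1, 1)]
rewards = signal_rewards(candidate, reference, vectors)
print(rewards)  # 'sum' earns full reward even though 'carry' is wrong
```

A module-level reward would score this candidate zero; the per-signal view preserves the training signal from the correct `sum` logic.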

CodeV Series Thumbnail

QiMeng-CodeV-R1: Reasoning-Enhanced Verilog Generation

Yaoyu Zhu, Di Huang, Hanqi Lyu, Xiaoyun Zhang, Chongxiao Li, Wenxuan Shi, Yutong Wu, Jianan Mu, Jinghua Wang, Yang Zhao, Pengwei Jin, Shuyao Cheng, Shengwen Liang, Xishan Zhang, Rui Zhang, Zidong Du, Qi Guo, Xing Hu, Yunji Chen

CodeV-R1 takes a further step towards accuracy by incorporating an explicit reasoning process. Before generating the final HDL code, CodeV-R1 outlines its thought process or plan, improving logical correctness and adherence to complex requirements in the generated Verilog. CodeV-R1 exhibits test-time scaling (TTS) capability, and its 7B version is comparable to or surpasses commercial LLMs such as the 671B DeepSeek-R1.

CodeV Series Thumbnail

CodeV: Empowering LLMs with HDL Generation through Multi-Level Summarization

Yang Zhao, Di Huang, Chongxiao Li, Pengwei Jin, Muxin Song, Yinan Xu, Ziyuan Nan, Mingju Gao, Tianyun Ma, Lei Qi, Yansong Pan, Zhenxing Zhang, Rui Zhang, Xishan Zhang, Zidong Du, Qi Guo, Xing Hu

CodeV is a family of LLMs fine-tuned with a multi-level summarization pipeline, focused primarily on the Verilog language. The models generate accurate Verilog code from natural language descriptions, achieving strong performance on Verilog-specific benchmarks. CodeV-All extends the family to be multi-lingual (supporting both Verilog and Chisel) and multi-scenario.

Automated Software Design

2.1 Operating System Optimization

Operating system optimization aims to automatically generate optimized operating system configurations.

2.2 High-performance Library Generation

High-performance library generation aims to automatically optimize the performance of common operators (e.g., GEMM) and provides a complete automated optimization toolchain.

QiMeng-Kernel Thumbnail

QiMeng-Kernel: Macro-Thinking Micro-Coding Paradigm for LLM-Based High-Performance GPU Kernel Generation

Xinguo Zhu, Shaohui Peng, Jiaming Guo, Yunji Chen, Qi Guo, Yuanbo Wen, Hang Qin, Ruizhi Chen, Qirui Zhou, Ke Gao, Yanjun Wu, Chen Zhao, Ling Li

QiMeng-Kernel addresses the excessive coupling between optimization strategies and implementation details in LLM-based automatic generation of high-performance GPU kernels by proposing the Macro-Thinking Micro-Coding (MTMC) paradigm. Existing LLM-based methods struggle to balance correctness and efficiency: the vast optimization space of GPU kernels and their strong hardware dependence make it difficult for LLMs to search for effective strategies, while the complexity of low-level implementation details leads to frequent compilation failures or performance degradation when complete kernel code is generated directly. The macro-thinking/micro-coding approach decouples high-level strategies from low-level implementations: at the macro level it generates hardware-semantics-aware optimization decisions; at the micro level it implements these decisions through a multi-step, fine-grained process, maximizing correctness while improving performance. Experiments show that the method significantly outperforms existing schemes on KernelBench and TritonBench, improving the correctness rate by over 50% with a maximum speedup of 7.3x.

AAAI'26
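The macro/micro decoupling described above can be sketched in miniature: a macro stage that only chooses optimization decisions, and a micro stage that turns each decision into code in small steps. The decision names and templates below are invented for illustration; in the real system both stages are driven by an LLM.

```python
# Toy sketch of macro-thinking / micro-coding decoupling. All decision
# vocabulary and code templates here are hypothetical.

def macro_plan(problem):
    """Macro stage: pick hardware-aware optimization decisions only,
    without emitting any low-level code."""
    if problem == "matmul":
        return {"tile": 32, "vectorize": True}
    return {"tile": 1, "vectorize": False}

def micro_code(plan):
    """Micro stage: realize each decision as an implementation detail,
    one small, checkable step at a time."""
    lines = [f"for tile in range(0, N, {plan['tile']}):"]
    if plan["vectorize"]:
        lines.append("    out[tile] = vec_op(tile)")
    else:
        lines.append("    out[tile] = scalar_op(tile)")
    return "\n".join(lines)

plan = macro_plan("matmul")
kernel = micro_code(plan)
print(kernel)
```

Because the macro stage never touches syntax and the micro stage never revisits strategy, a failure in either stage can be diagnosed and retried independently, which is the point of the decoupling.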
QiMeng-Attention Thumbnail

QiMeng-Attention: SOTA Attention Operator is generated by SOTA Attention Algorithm

Qirui Zhou, Shaohui Peng, Weiqiang Xiong, Haixin Chen, Yuanbo Wen, Haochen Li, Ling Li, Qi Guo, Yongwei Zhao, Ke Gao, Ruizhi Chen, Yanjun Wu, Chen Zhao, Yunji Chen

QiMeng-Attention introduces a self-optimizing framework for high-performance attention code generation. By leveraging an LLM-friendly Thinking Language (LLM-TL) and a two-stage reasoning workflow, it enables LLMs to decouple optimization logic from the GPU implementation. The method outperforms human-optimized libraries, achieves up to a 35.16× speedup, and reduces development time from months to minutes.

QiMeng-TensorOp Thumbnail

QiMeng-TensorOp: One-Line Prompt is Enough for High-Performance Tensor Operator Generation with Hardware Primitives

Xuzhi Zhang, Shaohui Peng, Qirui Zhou, Yuanbo Wen, Qi Guo, Ruizhi Chen, Xinguo Zhu, Weiqiang Xiong, Haixin Chen, Congying Ma, Ke Gao, Chen Zhao, Yanjun Wu, Yunji Chen, Ling Li

QiMeng-TensorOp is an AI framework that automatically generates high-performance tensor operators with a single prompt, adapting to different hardware platforms and achieving 251% of OpenBLAS performance on RISC-V CPUs and 124% of cuBLAS on NVIDIA GPUs.

QiMeng-GEMM Thumbnail

QiMeng-GEMM: Automatically Generating High-Performance Matrix Multiplication Code by Exploiting Large Language Models

Qirui Zhou, Yuanbo Wen, Ruizhi Chen, Ke Gao, Weiqiang Xiong, Ling Li, Qi Guo, Yanjun Wu, Yunji Chen

QiMeng-GEMM presents a novel method for automatically generating optimized implementations of the general matrix-multiplication (GEMM) operator using LLMs. Based on a set of informative, adaptive, and iterative meta-prompts, it enables LLMs to comprehend the architectural characteristics of different hardware platforms and generate high-performance GEMM implementations.
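Loop tiling is one of the architectural optimizations such meta-prompts can elicit. A pure-Python sketch (real generated GEMM code would use C with vector intrinsics) shows that the tiled loop nest computes exactly the same result as the naive triple loop while iterating over cache-friendly blocks:

```python
# Naive vs. loop-tiled GEMM; illustrative only, not QiMeng-GEMM output.

def gemm_naive(A, B, n):
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def gemm_tiled(A, B, n, T=4):
    # Same arithmetic, but visiting T x T blocks improves cache reuse.
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, T):
        for j0 in range(0, n, T):
            for k0 in range(0, n, T):
                for i in range(i0, min(i0 + T, n)):
                    for j in range(j0, min(j0 + T, n)):
                        for k in range(k0, min(k0 + T, n)):
                            C[i][j] += A[i][k] * B[k][j]
    return C

n = 8
A = [[float(i * n + j) for j in range(n)] for i in range(n)]
B = [[float((i + j) % 5) for j in range(n)] for i in range(n)]
assert gemm_tiled(A, B, n) == gemm_naive(A, B, n)  # identical results
```

The tile size `T` is exactly the kind of hardware-dependent parameter (cache and register geometry) that the meta-prompts ask the LLM to reason about per platform.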

2.3 Compiler Generation

Compiler generation aims to leverage LLMs to complete automated compiler generation tasks.

ComBack Thumbnail

QiMeng-NeuComBack: Self-Evolving Translation from IR to Assembly Code

Hainan Fang, Yuanbo Wen, Jun Bi, Yihan Wang, Tonghui He, Yanlin Tang, Di Huang, Jiaming Guo, Rui Zhang, Qi Guo, Yunji Chen

QiMeng-NeuComBack introduces a benchmark and a self-evolving framework for neural compilation. Leveraging the benchmark, it establishes new performance baselines for recent frontier LLMs. The proposed method enables LLMs to iteratively evolve their internal prompt strategies by extracting insights from prior self-debugging traces, thereby enhancing their neural compilation capabilities.

2.4 Tensor Program Transcompiler

Tensor-program transcompilation aims to automatically convert tensor programs across hardware platforms.

QiMeng-MuPa Thumbnail

QiMeng-MuPa: Mutual-Supervised Learning for Sequential-to-Parallel Code Translation

Changxin Ke, Rui Zhang, Shuo Wang, Li Ding, Guangli Li, Yuanbo Wen, Shuoming Zhang, Ruiyuan Xu, Jin Qin, Jiaming Guo, Chenxi Wang, Ling Li, Qi Guo, Yunji Chen

QiMeng-MuPa is a mutual-supervised learning framework for automatic sequential-to-parallel code translation. It features a Translator and a Tester that co-evolve through iterative co-verification, ensuring functional equivalence and high-quality translation. QiMeng-MuPa significantly outperforms prior methods and yields the first domain-specific LLM capable of automatic code parallelization for HPC.

QiMeng-Xpiler Thumbnail

QiMeng-Xpiler: Transcompiling Tensor Programs for Deep Learning Systems with a Neural-Symbolic Approach

Shouyang Dong, Yuanbo Wen, Jun Bi, Di Huang, Jiaming Guo, Jianxing Xu, Ruibai Xu, Xinkai Song, Yifan Hao, Xuehai Zhou, Tianshi Chen, Qi Guo, Yunji Chen

QiMeng-Xpiler is a neural-symbolic transcompiler that automatically translates tensor programs across heterogeneous deep learning systems such as GPUs, ASICs, and MLUs. It integrates LLMs with symbolic program synthesis to ensure both correctness and efficiency. By leveraging LLM-assisted compilation passes and hierarchical auto-tuning, QiMeng-Xpiler achieves up to 95% translation accuracy and 2× performance over vendor-optimized libraries.
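The neural-symbolic loop can be caricatured as propose-then-verify: a proposer (the LLM-assisted pass in the real system, a stub here) suggests a translation, and a checker accepts it only if it matches the source program's semantics on test inputs. Every name below is illustrative, not QiMeng-Xpiler's API.

```python
# Stub propose/verify loop in the spirit of a neural-symbolic transcompiler.

def propose(hint, attempt):
    # Stand-in for an LLM-assisted compilation pass; the first proposal is
    # deliberately wrong so the verify-and-retry loop has work to do.
    candidates = [
        lambda v: [e * e for e in v],   # wrong: squares instead of doubling
        lambda v: [e + e for e in v],   # right: doubles each element
    ]
    return candidates[min(attempt, len(candidates) - 1)]

def verified(f, ref, tests):
    # Symbolic equivalence checking is approximated here by testing.
    return all(f(t) == ref(t) for t in tests)

def reference(v):                       # source-program semantics
    return [2 * e for e in v]

tests = [[1, 2], [3, 4, 5]]
accepted = None
for attempt in range(4):
    cand = propose("double each element", attempt)
    if verified(cand, reference, tests):
        accepted = cand
        break
print(attempt)  # the second proposal passes verification
```

The symbolic side is what turns a plausible LLM guess into a translation with a correctness guarantee; the retry loop is what keeps wrong guesses from ever shipping.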

BabelTower Thumbnail

BabelTower: Learning to Auto-parallelized Program Translation

Yuanbo Wen, Qi Guo, Qiang Fu, Xiaqing Li, Jianxing Xu, Yanlin Tang, Yongwei Zhao, Xing Hu, Zidong Du, Ling Li, Chao Wang, Xuehai Zhou, Yunji Chen

BabelTower is a learning-based framework that automatically translates sequential C code into parallel CUDA code. By leveraging large-scale corpora and back-translation with discriminative reranking, it achieves up to 347× speedup and greatly improves developer productivity.
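The core of such sequential-to-parallel translation can be illustrated with SAXPY: a loop with no loop-carried dependences maps to one logical CUDA thread per element. The Python below only simulates that mapping; BabelTower's actual output is CUDA C.

```python
# Simulated sequential-to-parallel mapping for SAXPY; illustrative only.

def saxpy_sequential(a, x, y):
    out = list(y)
    for i in range(len(x)):          # the sequential C loop
        out[i] = a * x[i] + y[i]
    return out

def saxpy_parallel(a, x, y):
    # CUDA-style view: one logical thread per element; each "thread"
    # reads only its own index, so all of them can run concurrently.
    def thread(i):                   # body of the would-be kernel
        return a * x[i] + y[i]
    return [thread(i) for i in range(len(x))]

x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
assert saxpy_parallel(2.0, x, y) == saxpy_sequential(2.0, x, y)
```

Proving that the loop body at iteration `i` touches no state written by other iterations is precisely what makes this translation safe, and what a learned translator must get right to earn speedups like the reported 347×.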