LÆGIS was accepted to ISCA'26, see you in Raleigh!
Email / LinkedIn / GitHub / Google Scholar / CV
Hi, I'm a PhD student in the Department of Computer Science at the University of Virginia, advised by Prof. Adwait Jog.
Prior to that, I received my B.S. degree from Jilin University, where I was a member of the ETECA Lab led by Prof. Jingweijia Tan. I was also a visiting student at the State Key Laboratory of Processors at ICT, CAS, advised by Prof. Guangli Li.
I work on performance questions at the intersection of GPU architecture, trusted computing, and memory systems.
Current
Understanding and addressing performance bottlenecks in GPU architectures, with a focus on GPU-based confidential computing. This includes multi-GPU traffic management [ISCA'25], characterization of GPU CC overheads [ISPASS'25], and optimization of encryption and UVM under CC [ISCA'26].
Future
I am interested in (i) scaling GPU-based CC, (ii) making cryptographic protocols more GPU-friendly, and (iii) investigating the memory wall in GPU-based CC.
Passed my PhD Qualifying Exam.
NetCrafter was accepted to ISCA'25, see you in Tokyo!
One paper was accepted to ISPASS'25, see you in Ghent!
University of Virginia
Jilin University
University of Virginia, USA
Advisor Adwait Jog
Topics GPU, Trusted Computing (TEE, Cryptography, etc.), Memory
ICT, CAS, China
Advisor Guangli Li
Topics Compiler, Profile-Guided Optimization, LLVM
Jilin University, China
Advisor Jingweijia Tan
Topics GPU Power Modeling, MCM-GPU, Under-Voltage Reliability
Thesis The Design and Implementation of Binary Code Analysis Framework for NVIDIA GPU
GPU-based Confidential Computing (CC) combines a CPU trusted execution environment (TEE) with a CC-capable GPU to protect data in use. However, it faces a memory wall: GPU memory is far more limited than CPU memory. Unified virtual memory (UVM) alleviates this constraint by allowing the CPU and GPU to share a single address space, with pages managed jointly by the GPU’s memory management unit (GMMU) and a CPU-side UVM driver. This design, however, is fault-driven and introduces significant overhead: when the GPU accesses pages residing in CPU memory, it triggers page faults that are aggregated by the GMMU and processed in batches by the driver. Under CC, every page migrated in this process must additionally be encrypted to ensure confidentiality and integrity, further amplifying the cost.
We identify three key inefficiencies in UVM-based GPU CC. First, it requires tight CPU–GPU synchronization to negotiate initialization vectors (IVs) for AES-GCM encryption, placing encryption on the critical path. While this design avoids storing IVs and thus eliminates the overhead of integrity-tree maintenance, the resulting synchronization overhead significantly degrades performance. Second, the driver thread often sits idle while waiting for new fault batches to arrive, wasting CPU cycles that could otherwise be used for encryption. Third, even when the CPU is utilized, CPU software encryption throughput is low (1.3 GB/s) because UVM relies on the Linux kernel crypto API, which is not currently parallelized, further compounding the slowdown.
To address these inefficiencies, we propose LÆGIS, a design that opportunistically pre-encrypts pages on the CPU. LÆGIS leverages secure high-bandwidth memory (HBM) for flexible IV management, introducing an IV Bank stored in 3D-stacked HBM that decouples encryption from CPU–GPU synchronization while eliminating the need for integrity trees. This enables out-of-order encryption and substantially improves UVM performance under CC. Our evaluation shows that LÆGIS significantly reduces CC overhead, achieving up to 3.13× (2.22× on average) and 5.05× (2.74× on average) speedup over the CC baseline under default and aggressive prefetching, respectively.
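The core idea above can be illustrated with a toy Python sketch. This is not the LÆGIS implementation: the hash-based keystream is a stand-in for AES-GCM, page sizes are tiny, and the IV Bank is modeled as a simple counter-backed table (in the real design it resides in secure HBM). The sketch shows how drawing IVs locally lets the CPU pre-encrypt pages while the driver would otherwise idle, so later fault batches can be served off the critical path.

```python
import hashlib
import os

PAGE_SIZE = 16  # tiny illustrative pages, not real 4KB/2MB UVM pages


def keystream(key: bytes, iv: bytes, n: int) -> bytes:
    """Stand-in stream cipher (NOT AES-GCM): hash(key || iv || counter)."""
    out, ctr = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + iv + ctr.to_bytes(4, "big")).digest()
        ctr += 1
    return out[:n]


def encrypt(key: bytes, iv: bytes, page: bytes) -> bytes:
    """XOR with the keystream; applying it twice decrypts."""
    return bytes(a ^ b for a, b in zip(page, keystream(key, iv, len(page))))


class IVBank:
    """Models the IV Bank: IVs come from a monotonic counter and are
    recorded per page, so encryption needs no CPU-GPU IV negotiation."""

    def __init__(self):
        self.next_iv = 0
        self.table = {}  # page id -> IV, looked up at decryption time

    def draw(self, page_id: int) -> bytes:
        iv = self.next_iv.to_bytes(12, "big")
        self.next_iv += 1
        self.table[page_id] = iv
        return iv


def pre_encrypt_idle(key, bank, cold_pages, cipher_cache, budget):
    """Opportunistically pre-encrypt pages (out of order) while the
    driver thread would otherwise wait for the next fault batch."""
    for page_id in list(cold_pages)[:budget]:
        iv = bank.draw(page_id)
        cipher_cache[page_id] = encrypt(key, iv, cold_pages[page_id])


def serve_fault_batch(key, bank, batch, cold_pages, cipher_cache):
    """Serve a batch of GPU page faults; cache hits skip the
    critical-path encryption entirely."""
    served, hits = {}, 0
    for page_id in batch:
        if page_id in cipher_cache:
            served[page_id] = cipher_cache.pop(page_id)
            hits += 1
        else:
            iv = bank.draw(page_id)
            served[page_id] = encrypt(key, iv, cold_pages[page_id])
    return served, hits
```

In this sketch the speedup comes purely from moving encryption off the fault-handling path; the real design additionally avoids integrity trees because the IV table lives in secure HBM.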
Coming soon.
We present NetCrafter, a set of novel techniques to manage network traffic, especially across low-bandwidth links in multi-GPU systems. NetCrafter reduces the volume of flit traffic by (i) stitching compatible, partially filled flits, (ii) trimming unnecessary flits to avoid redundant transfers, and (iii) sequencing flits so that latency-sensitive ones arrive at their destinations faster.
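The three techniques can be sketched in a few lines of Python. This is a behavioral toy, not NetCrafter's hardware mechanism: the slot count, the same-destination compatibility rule, and the empty-payload trimming criterion are all simplifying assumptions made here for illustration.

```python
from dataclasses import dataclass

FLIT_SLOTS = 4  # payload slots per flit (illustrative, not a real link width)


@dataclass
class Flit:
    dest: int
    payload: list            # occupied payload slots
    latency_sensitive: bool = False


def stitch(flits):
    """Stitch compatible (same-destination), partially filled flits
    into one flit when the combined payload still fits."""
    by_dest, out = {}, []
    for f in flits:
        open_f = by_dest.get(f.dest)
        if open_f and len(open_f.payload) + len(f.payload) <= FLIT_SLOTS:
            open_f.payload.extend(f.payload)
            open_f.latency_sensitive |= f.latency_sensitive
        else:
            out.append(f)
            by_dest[f.dest] = f
    return out


def trim(flits):
    """Trim flits that carry no useful payload (redundant transfers)."""
    return [f for f in flits if f.payload]


def sequence(flits):
    """Send latency-sensitive flits first; the stable sort preserves
    the original order within each class."""
    return sorted(flits, key=lambda f: not f.latency_sensitive)
```

Chaining the three stages (`sequence(trim(stitch(flits)))`) mirrors the paper's pipeline: fewer flits cross the low-bandwidth link, and the ones that matter most cross it first.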
Confidential computing (CC) is a critical technology for protecting data in use. By leveraging encryption and virtual machine (VM) level isolation, CC allows existing code to run without modification while offering confidentiality and integrity guarantees. However, the performance impact of CC in GPU-based systems can be significant. In this work, we present a comprehensive performance evaluation of CC guided by a simple performance model. Specifically, we start by evaluating CUDA applications with a focus on data transfer, memory management, encryption, kernel launch, and kernel execution. We also present a detailed event-level analysis of these applications, revealing that the execution times of kernels that do not use unified virtual memory (UVM) are mostly unaffected, while the associated kernel launch overhead and queuing time increase significantly. In contrast, the execution time of kernels using UVM increases drastically under CC, on top of the launch and queuing overheads. We also study CNN training and LLM inference to assess how CC overhead affects them. Finally, we evaluate several optimization techniques, including kernel fusion, overlapping, and quantization, toward addressing the overheads of CC.