Field
Type
Image & Video Compression

DiT-IC: Aligned Diffusion Transformer for Efficient Image Compression

Author:Junqi Shi, Ming Lu, Xingchen Li, Anle Ke, Ruiqi Zhang, Zhan Ma

Year:2026

Publication:IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

ScreenShot_2026-05-30_234533_567 (1).png

Diffusion-based image compression has recently shown outstanding perceptual fidelity, yet its practicality is hindered by prohibitive sampling overhead and high memory usage. Most existing diffusion codecs employ U-Net architectures, where hierarchical downsampling forces diffusion to operate in shallow latent spaces (typically with only 8x spatial downscaling), resulting in excessive computation. In contrast, conventional VAE-based codecs work in much deeper latent domains (16x - 64x downscaled), motivating a key question: Can diffusion operate effectively in such compact latent spaces without compromising reconstruction quality? To address this, we introduce DiT-IC, an Aligned Diffusion Transformer for Image Compression, which replaces the U-Net with a Diffusion Transformer capable of performing diffusion in latent space entirely at 32x downscaled resolution. DiT-IC adapts a pretrained text-to-image multi-step DiT into a single-step reconstruction model through three key alignment mechanisms: (1) a variance-guided reconstruction flow that adapts denoising strength to latent uncertainty for efficient reconstruction; (2) a self-distillation alignment that enforces consistency with encoder-defined latent geometry to enable one-step diffusion; and (3) a latent-conditioned guidance that replaces text prompts with semantically aligned latent conditions, enabling text-free inference. With these designs, DiT-IC achieves state-of-the-art perceptual quality while offering up to 30x faster decoding and drastically lower memory usage than existing diffusion-based codecs. Remarkably, it can reconstruct 2048x2048 images on a 16 GB laptop GPU.

Paper
Code
Image & Video Compression

CoD: A Diffusion Foundation Model for Image Compression

Author:Zhaoyang Jia, Zihan Zheng, Naifu Xue, Jiahao Li, Bin Li, Zongyu Guo, Xiaoyi Zhang, Houqiang Li, Yan Lu

Year:2026

Publication:IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

ScreenShot_2026-05-30_233652_769 (1).png

Existing diffusion codecs typically build on text-to-image diffusion foundation models like Stable Diffusion. However, text conditioning is suboptimal from a compression perspective, hindering the potential of downstream diffusion codecs, particularly at ultra-low bitrates. To address it, we introduce \textbf{CoD}, the first \textbf{Co}mpression-oriented \textbf{D}iffusion foundation model, trained from scratch to enable end-to-end optimization of both compression and generation. CoD is not a fixed codec but a general foundation model designed for various diffusion-based codecs. It offers several advantages: \textbf{High compression efficiency}, replacing Stable Diffusion with CoD in downstream codecs like DiffC achieves SOTA results, especially at ultra-low bitrates (e.g., 0.0039 bpp); \textbf{Low-cost and reproducible training}, 300 faster training than Stable Diffusion ( 20 vs. 6,250 A100 GPU days) on entirely open image-only datasets; \textbf{Providing new insights}, e.g., We find pixel-space diffusion can achieve VTM-level PSNR with high perceptual quality and can outperform GAN-based codecs using fewer parameters. We hope CoD lays the foundation for future diffusion codec research. Codes are released at this https URL.

Paper
Code
Image & Video Compression

Distributed Image Compression with Multimodal Side Information at Extremely Low Bitrates

Author:Guojun Xu, Mingyang Zhang, Jianwen Xiang, Cheng Tan, Yanchao Yang, Junwei Zhou

Year:2026

Publication:IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

ScreenShot_2026-05-30_233053_513 (1).png

Distributed Image Compression (DIC) is crucial for multi-view transmission, especially when operating at extremely low bitrates (< 0.1 bpp). Its core challenge is effectively utilizing side information to achieve high-quality reconstruction under strict bitrate budgets. However, existing DIC approaches struggle to exploit global context and object-level details from side information, leading to local blurring and the loss of fine details in the reconstruction. To address these limitations, we propose a Multimodal DIC framework (MDIC), which, for the first time, leverages side information in a multimodal manner into the DIC paradigm, effectively preserving fine-grained local details and enhancing global perceptual quality in reconstructed images. Specifically, we introduce a text-to-image diffusion-based decoder conditioned on textual side information extracted from correlated images to capture shared global semantics. Moreover, we design a feature-mask generator, supervised by a multimodal fine-grained alignment task, to strengthen the exploitation of visual side information. The generated mask serves two purposes: first, it guides the extraction of fine-grained details from losslessly transmitted side information to preserve the semantic consistency of reconstructed details; second, it regulates the extraction of clustered feature representations from the quantized VQ-VAE embeddings, compensating for category information lost under the extreme compression of the primary image. Extensive experiments on the widely used KITTI Stereo and Cityscapes datasets demonstrate that MDIC achieves state-of-the-art perceptual quality at extremely low bitrates.

Paper
Image & Video Compression

Block-based Learned Image Compression without Blocking Artifacts

Author:Jong Wook Kim, Suyong Bahk, TaeHwa Lee, HyunDong Cho, Donghyun Kim, Sung-Chang Lim, Jin Soo Choi, Hui Yong Kim

Year:2026

Publication:IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

ScreenShot_2026-05-30_231909_958 (1).png

Learned image compression (LIC) outperforms traditional codecs but suffers from excessive peak memory usage when handling high-resolution images. Consequently, block-based LIC has been studied to reduce peak memory and peak computational cost, but it often introduces blocking artifacts that degrade visual quality. To mitigate this, the JPEG-AI standard introduced a patch-based scheme in which overlapped blocks are coded independently using empirically determined overlap sizes. However, experimentally searching for optimal overlaps is time-consuming and does not guarantee blocking-free reconstruction. In this paper, we propose an analytic framework that models overlap propagation through convolution and transposed convolution layers to precisely determine the minimal overlaps required for blocking-free reconstruction. Based on the calculated minimum overlaps, we provide a block-based implementation methodology applicable to most CNN-based LIC models. Applied to four CNN-based LIC models on 4K images partitioned into various block sizes (256 x 256, 512 x 512), our method achieves rate-distortion performance identical to full-image coding while reducing average peak memory usage to 13.94% (encoder) and 13.33% (decoder), and average peak computational cost to 2.6% and 1.24%, respectively. Notably, the proposed block-based framework does not require any retraining of the original model. Furthermore, it can also be applied to most CNN-based image-processing neural networks without performance degradation.

Paper
Image & Video Compression

Differentiable Vector Quantization for Rate-Distortion Optimization of Generative Image Compression

Author:Shiyin Jiang, Wei Long, Minghao Han, Zhenghao Chen, Ce Zhu, Shuhang Gu

Year:2026

Publication:IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

ScreenShot_2026-05-30_231501_328 (1).png

The rapid growth of visual data under stringent storage and bandwidth constraints makes extremely low-bitrate image compression increasingly important. While Vector Quantization (VQ) offers strong structural fidelity, existing methods lack a principled mechanism for joint rate-distortion (RD) optimization due to the disconnect between representation learning and entropy modeling. We propose RDVQ, a unified framework that enables end-to-end RD optimization for VQ-based compression via a differentiable relaxation of the codebook distribution, allowing the entropy loss to directly shape the latent prior. We further develop an autoregressive entropy model that supports accurate entropy modeling and test-time rate control. Extensive experiments demonstrate that RDVQ achieves strong performance at extremely low bitrates with a lightweight architecture, attaining competitive or superior perceptual quality with significantly fewer parameters. Compared with RDEIC, RDVQ reduces bitrate by up to 75.71% on DISTS and 37.63% on LPIPS on DIV2K-val. Beyond empirical gains, RDVQ introduces an entropy-constrained formulation of VQ, highlighting the potential for a more unified view of image tokenization and compression. The code will be available at this https URL.

Paper
Code
Image & Video Compression

What Matters in Practical Learned Image Compression

Author:Kedar Tatwawadi, Parisa Rahimzadeh, Zhanghao Sun, Zhiqi Chen, Ziyun Yang, Sanjay Nair, Divija Hasteer, Oren Rippel

Year:2026

Publication:IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

ScreenShot_2026-05-30_231145_175 (1).png

One of the major differentiators unlocked by learned codecs relative to their hard-coded traditional counterparts is their ability to be optimized directly to appeal to the human visual system. Despite this potential, a perceptual yet practical image codec is yet to be proposed. In this work, we aim to close this gap. We conduct a comprehensive study of the key modeling choices that govern the design of a practical learned image codec, jointly optimized for perceptual quality and runtime -- including within the ablations several novel techniques. We then perform performance-aware neural architecture search over millions of backbone configurations to identify models that achieve the target on-device runtime while maximizing compression performance as captured by perceptual metrics. We combine the various optimizations to construct a new codec that achieves a significantly improved tradeoff between speed and perceptual quality. Based on rigorous subjective user studies, it provides 2.3-3x bitrate savings against AV1, AV2, VVC, ECM and JPEG-AI, and 20-40% bitrate savings against the best learned codec alternatives. At the same time, on an iPhone 17 Pro Max, it encodes 12MP images as fast as 230ms, and decodes them in 150ms -- faster than most top ML-based codecs run on a V100 GPU.

Paper
Image & Video Compression

Ultra-Fast Neural Video Compression

Author:Jiahao Li, Wenxuan Xie, Zhaoyang Jia, Bin Li, Zongyu Guo, Xiaoyi Zhang, Yan Lu

Year:2026

Publication:IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

ScreenShot_2026-05-30_230615_045 (1).png

While neural video codecs (NVCs) have demonstrated superior compression ratio, their prohibitive computational complexity remains a critical barrier to real-world deployment. This paper introduces a chunk-based coding framework designed to significantly improve the rate-distortion-complexity trade-off. Instead of processing frames sequentially, our approach encodes a chunk of multiple frames into a single compact latent representation and decodes them simultaneously. This is enabled by cross-frame interaction modules for joint spatial-temporal modeling and frame-specific decoders for parallel reconstruction. This paradigm not only dramatically enhances coding throughput but also facilitates more effective modeling of long-term temporal correlations. To further boost speed, we propose a streamlined entropy coding mechanism that consolidates bit-stream interactions into a single step, substantially reducing decoding overhead. Building on these innovations, we present DCVC-UF (Ultra-Fast), a new NVC that sets a new SOTA in performance. Our experiments show that DCVC-UF can achieve ultra-fast encoding and decoding speeds, significantly outperforming previous leading codecs. DCVC-UF serves as a notable landmark in the journey of NVC evolution. The code is at https://github.com/microsoft/DCVC.

Paper
Code
Image & Video Compression

Adaptive Learned Image Compression with Graph Neural Networks

Author:Yunuo Chen, Bing He, Zezheng Lyu, Hongwei Hu, Qunshan Gu, Yuan Tian, Guo Lu

Year:2026

Publication:IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

ScreenShot_2026-05-30_224419_940 (1).png

Efficient image compression relies on modeling both local and global redundancy. Most state-of-the-art (SOTA) learned image compression (LIC) methods are based on CNNs or Transformers, which are inherently rigid. Standard CNN kernels and window-based attention mechanisms impose fixed receptive fields and static connectivity patterns, which potentially couple non-redundant pixels simply due to their proximity in Euclidean space. This rigidity limits the model's ability to adaptively capture spatially varying redundancy across the image, particularly at the global level. To overcome these limitations, we propose a content-adaptive image compression framework based on Graph Neural Networks (GNNs). Specifically, our approach constructs dual-scale graphs that enable flexible, data-driven receptive fields. Furthermore, we introduce adaptive connectivity by dynamically adjusting the number of neighbors for each node based on local content complexity. These innovations empower our Graph-based Learned Image Compression (GLIC) model to effectively model diverse redundancy patterns across images, leading to more efficient and adaptive compression. Experiments demonstrate that GLIC achieves state-of-the-art performance, achieving BD-rate reductions of 19.29%, 21.69%, and 18.71% relative to VTM-9.1 on Kodak, Tecnick, and CLIC, respectively. Code will be released at this https URL.

Paper
Code
Image & Video Compression

MambaSIC: Mamba-based Stereo Image Compression with Bi-directional Multi-reference Entropy Model

Author:Shiyu Qin, Xinjie Zhang, Zhening Liu, Jinpeng Wang, Bin Chen, Jiawei Li, Yifan Ren, Shu-Tao Xia, Jun Zhang

Year:2026

Publication:IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

ScreenShot_2026-05-30_224018_678 (1).png

Stereo image compression (SIC) has become increasingly vital with its applications surging in fields such as 3D reconstruction and autonomous navigation. Previous methods leverage cross-attention to model inter-view redundancy and employ autoregressive entropy models to predict probability distributions, achieving impressive rate-distortion performance. However, they suffer from slow coding speed due to the quadratic complexity of cross-attention mechanisms and the spatial autoregressive iterations of the entropy models. To address these limitations, we propose MambaSIC, which introduces two key innovations. First, we propose a Mamba-based stereo visual state space block (stereo VSSB) that leverages its linear complexity and long-range modeling capabilities to more rapidly and efficiently capture redundancy information between the two views. Second, to accelerate the compression process and enhance the accuracy of probability estimation, we introduce a bi-directional multi-reference entropy model that utilizes a checkerboard partitioning strategy and the stereo VSSB to get rich inter-view priors. Experimental results demonstrate that our MambaSIC outperforms the state-of-the-art methods in both compression performance and efficiency.

Paper
1 2 3 ... 225 Jump topage