Image Super-Resolution

Diffusion Transformer meets Multi-level Wavelet Spectrum for Single Image Super-Resolution

Author: Peng Du, Hui Li, Han Xu, Paul Barom Jeon, Dongwook Lee, Daehyun Ji, Ran Yang, Feng Zhu

Year: 2025

Publication: IEEE International Conference on Computer Vision (ICCV)


Discrete Wavelet Transform (DWT) has been widely explored to enhance the performance of image super-resolution (SR). Although some DWT-based methods improve SR by capturing fine-grained frequency signals, most existing approaches neglect the interrelations among multi-scale frequency sub-bands, resulting in inconsistencies and unnatural artifacts in the reconstructed images. To address this challenge, we propose a Diffusion Transformer model based on image Wavelet spectra for SR (DTWSR). DTWSR combines the strengths of diffusion models and transformers to capture the interrelations among multi-scale frequency sub-bands, leading to more consistent and realistic SR images. Specifically, we use a Multi-level Discrete Wavelet Transform (MDWT) to decompose images into wavelet spectra. A pyramid tokenization method is proposed that embeds the spectra into a sequence of tokens for the transformer model, facilitating the capture of features from both the spatial and frequency domains. A dual-decoder is elaborately designed to handle the distinct variances of the low-frequency (LF) and high-frequency (HF) sub-bands without neglecting their alignment during image generation. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of our method, with high performance in both perceptual quality and fidelity.
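As a rough illustration of the decomposition and tokenization steps (not the authors' implementation), a multi-level Haar DWT followed by a naive pyramid tokenization can be sketched in NumPy; the patch size and coarse-to-fine token ordering here are assumptions:

```python
import numpy as np

def haar_dwt2(x):
    """One level of the 2-D Haar DWT -> (LL, LH, HL, HH) sub-bands."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    return ((a + b + c + d) / 2, (a + b - c - d) / 2,
            (a - b + c - d) / 2, (a - b - c + d) / 2)

def to_tokens(band, patch):
    """Flatten a sub-band into non-overlapping patch tokens."""
    h, w = band.shape
    return (band.reshape(h // patch, patch, w // patch, patch)
                .transpose(0, 2, 1, 3).reshape(-1, patch * patch))

def pyramid_tokens(img, levels=2, patch=2):
    """Multi-level DWT followed by pyramid tokenization: the residual
    LL band first, then high-frequency sub-bands coarse-to-fine."""
    hf, ll = [], img
    for _ in range(levels):
        ll, lh, hl, hh = haar_dwt2(ll)
        hf.append((lh, hl, hh))
    seq = [to_tokens(ll, patch)]
    for lh, hl, hh in reversed(hf):          # coarse-to-fine
        seq += [to_tokens(b, patch) for b in (lh, hl, hh)]
    return np.concatenate(seq, axis=0)       # (n_tokens, patch * patch)
```

A transformer operating on this single token sequence sees all sub-bands of all levels at once, which is what lets it model cross-scale, cross-band interrelations.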

Paper
Image Super-Resolution

LightBSR: Towards Lightweight Blind Super-Resolution via Discriminative Implicit Degradation Representation Learning

Author: Jiang Yuan, Ji Ma, Bo Wang, Guanzhou Ke, Weiming Hu

Year: 2025

Publication: IEEE International Conference on Computer Vision (ICCV)


Implicit degradation estimation-based blind super-resolution (IDE-BSR) hinges on extracting the implicit degradation representation (IDR) of the LR image and adapting it to LR image features to guide HR detail restoration. Although IDE-BSR has shown potential in dealing with noise interference and complex degradations, existing methods ignore the importance of IDR discriminability for BSR and instead over-complicate the adaptation process to improve performance, resulting in a significant increase in the model's parameters and computations. In this paper, we focus on optimizing the discriminability of the IDR and propose a new powerful yet lightweight BSR model termed LightBSR. Specifically, we employ a knowledge distillation-based learning framework. We first introduce a well-designed degradation-prior-constrained contrastive learning technique during the teacher stage to make the model focus on distinguishing different degradation types. Then we utilize a feature alignment technique to transfer the degradation-related knowledge acquired by the teacher to the student for practical inference. Extensive experiments demonstrate the effectiveness of IDR discriminability-driven BSR model design. The proposed LightBSR achieves outstanding performance with minimal complexity across a range of blind SR tasks.
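The teacher-stage objective can be pictured as a supervised contrastive loss in which IDRs sharing a degradation type are positives. The sketch below is a simplified stand-in, not the paper's exact loss; the temperature and batch layout are assumptions:

```python
import numpy as np

def degradation_contrastive_loss(idr, deg_labels, tau=0.1):
    """Degradation-prior-constrained contrastive loss sketch:
    embeddings with the same degradation label are pulled together,
    all others pushed apart (InfoNCE over the batch)."""
    z = idr / np.linalg.norm(idr, axis=1, keepdims=True)   # L2-normalise
    sim = z @ z.T / tau                                    # scaled cosine sims
    n = len(z)
    eye = np.eye(n, dtype=bool)
    pos = (deg_labels[:, None] == deg_labels[None, :]) & ~eye
    logits = np.where(eye, -np.inf, sim)                   # mask self-pairs
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # negative mean log-likelihood of same-degradation pairs
    return -np.mean([log_prob[i, pos[i]].mean()
                     for i in range(n) if pos[i].any()])
```

Embeddings clustered by degradation type drive this loss toward zero, which is the discriminability property the paper argues matters for BSR.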

Paper
Code
Image Super-Resolution

Rethinking the Upsampling Process in Light Field Super-Resolution with Spatial-Epipolar Implicit Image Function

Author: Ruixuan Cong, Yu Wang, Mingyuan Zhao, Da Yang, Rongshan Chen, Hao Sheng

Year: 2025

Publication: IEEE International Conference on Computer Vision (ICCV)


Deep learning-based light field image super-resolution methods have achieved remarkable success in recent years. However, most of them focus only on the encoder design and overlook the importance of the upsampling process in the decoder. Inspired by recent progress on implicit neural representations in the single-image domain, we propose a spatial-epipolar implicit image function (SEIIF), which optimizes the upsampling process to significantly improve performance and supports arbitrary-scale light field image super-resolution. Specifically, SEIIF contains two complementary upsampling patterns. One is a spatial implicit image function (SIIF) that exploits intra-view information in sub-aperture images. The other is an epipolar implicit image function (EIIF) that mines inter-view information in epipolar plane images. By unifying the upsampling step of the two branches, SEIIF additionally introduces cross-branch feature interaction to fully fuse intra-view and inter-view information. Moreover, given that the line structure in an epipolar plane image integrates the spatial-angular correlation of the light field, we present an oriented line sampling strategy to exactly aggregate inter-view information. Experimental results demonstrate that SEIIF can be effectively combined with most encoders and achieves outstanding performance on both fixed-scale and arbitrary-scale light field image super-resolution.
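The core idea of an implicit image function is that upsampling becomes querying: any continuous coordinate is decoded from its nearest latent feature plus the relative offset to that feature's centre. The generic LIIF-style sketch below illustrates this querying step only, not SEIIF's epipolar branch; the tiny MLP decoder is an assumption:

```python
import numpy as np

def query_implicit(feat, coords, w1, b1, w2, b2):
    """Query an implicit image function at continuous coordinates in
    [0, 1]^2. Arbitrary-scale SR = querying a dense coordinate grid."""
    c, h, w = feat.shape
    out = []
    for y, x in coords:
        iy = min(int(y * h), h - 1)               # nearest latent cell
        ix = min(int(x * w), w - 1)
        cy, cx = (iy + 0.5) / h, (ix + 0.5) / w   # cell centre
        inp = np.concatenate([feat[:, iy, ix], [y - cy, x - cx]])
        hid = np.maximum(w1 @ inp + b1, 0)        # ReLU MLP decoder
        out.append(w2 @ hid + b2)
    return np.stack(out)                          # (n_queries, 3) RGB
```

Because the decoder takes a continuous offset rather than a fixed pixel grid, the same network serves any output resolution, which is what makes fixed-scale and arbitrary-scale SR interchangeable here.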

Paper
Code
Image Super-Resolution

STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution

Author: Rui Xie, Yinhong Liu, Penghao Zhou, Chen Zhao, Jun Zhou, Kai Zhang, Zhenyu Zhang, Jian Yang, Zhenheng Yang, Ying Tai

Year: 2025

Publication: IEEE International Conference on Computer Vision (ICCV)


Image diffusion models have been adapted for real-world video super-resolution to tackle the over-smoothing issues of GAN-based methods. However, these models struggle to maintain temporal consistency, as they are trained on static images, limiting their ability to capture temporal dynamics effectively. Integrating text-to-video (T2V) models into video super-resolution for improved temporal modeling is straightforward, but two key challenges remain: artifacts introduced by complex degradations in real-world scenarios, and compromised fidelity due to the strong generative capacity of powerful T2V models (e.g., CogVideoX-5B). To enhance the spatio-temporal quality of restored videos, we introduce STAR (Spatial-Temporal Augmentation with T2V models for Real-world video super-resolution), a novel approach that leverages T2V models for real-world video super-resolution, achieving realistic spatial details and robust temporal consistency. Specifically, we introduce a Local Information Enhancement Module (LIEM) before the global attention block to enrich local details and mitigate degradation artifacts. Moreover, we propose a Dynamic Frequency (DF) Loss to reinforce fidelity, guiding the model to focus on different frequency components across diffusion steps. Extensive experiments demonstrate that STAR outperforms state-of-the-art methods on both synthetic and real-world datasets.
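A frequency-weighted fidelity term of the kind the DF Loss describes can be sketched as below. This is an illustrative stand-in, not the paper's formulation: the linear schedule w(t) = t / t_max and the radial low-frequency mask are assumptions.

```python
import numpy as np

def dynamic_frequency_loss(pred, target, t, t_max, radius=0.25):
    """Dynamic-frequency-style loss sketch: early (noisy) diffusion
    steps weight low-frequency error, late steps shift the weight to
    high frequencies, matching coarse-to-fine generation."""
    err = np.fft.fftshift(np.fft.fft2(pred - target))
    h, w = err.shape
    yy, xx = np.meshgrid(np.arange(h) - h / 2, np.arange(w) - w / 2,
                         indexing="ij")
    low = np.sqrt(yy ** 2 + xx ** 2) <= radius * min(h, w)  # low-freq mask
    e = np.abs(err) ** 2                                    # spectral error
    w_low = t / t_max                        # large t -> emphasise low freq
    return w_low * e[low].mean() + (1 - w_low) * e[~low].mean()
```

The intuition is that a diffusion model settles structure (low frequencies) first and texture (high frequencies) last, so the fidelity pressure follows the same order.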

Paper
Code
Image Super-Resolution

Emulating Self-attention with Convolution for Efficient Image Super-Resolution

Author: Dongheon Lee, Seokju Yun, Youngmin Ro

Year: 2025

Publication: IEEE International Conference on Computer Vision (ICCV)


In this paper, we tackle the high computational overhead of Transformers for efficient image super-resolution (SR). Motivated by observations of self-attention's inter-layer repetition, we introduce a convolutionized self-attention module named Convolutional Attention (ConvAttn) that emulates self-attention's long-range modeling capability and instance-dependent weighting with a single shared large kernel and dynamic kernels. By utilizing the ConvAttn module, we significantly reduce the reliance on self-attention and its memory-bound operations while maintaining the representational capability of Transformers. Furthermore, we overcome the challenge of integrating flash attention into the lightweight SR regime, effectively mitigating self-attention's inherent memory bottleneck. Rather than proposing an intricate self-attention module, we scale the window size up to 32×32 with flash attention, improving PSNR by 0.31 dB on Urban100 (×2) while reducing latency and memory usage by 16× and 12.2×, respectively. Building on these approaches, our proposed network, termed Emulating Self-attention with Convolution (ESC), improves PSNR by 0.27 dB on Urban100 (×4) compared to HiT-SRF while reducing latency and memory usage by 3.7× and 6.2×, respectively. Extensive experiments demonstrate that ESC retains the long-range modeling ability, data scalability, and representational power of Transformers even though most self-attention is replaced by the ConvAttn module.
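The two ingredients ConvAttn combines, a shared large kernel for long-range mixing and an input-predicted dynamic kernel for instance-dependent weighting, can be sketched as below. The kernel sizes and the global-pooling kernel predictor are assumptions, not the paper's exact design:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_same(x, k):
    """'Same'-padded 2-D correlation of one channel with one kernel."""
    p = k.shape[0] // 2
    win = sliding_window_view(np.pad(x, p), k.shape)
    return np.einsum("ijkl,kl->ij", win, k)

def conv_attn(x, shared_kernel, dyn_proj):
    """ConvAttn-style mixing sketch: every channel is filtered with one
    shared large kernel (long-range modeling) plus a small 3x3 kernel
    predicted from the input itself (instance-dependent weighting)."""
    c = x.shape[0]
    pooled = x.mean(axis=(1, 2))                 # (c,) global descriptor
    dyn = (dyn_proj @ pooled).reshape(3, 3)      # input-dependent kernel
    out = np.empty_like(x)
    for i in range(c):
        out[i] = conv2d_same(x[i], shared_kernel) + conv2d_same(x[i], dyn)
    return out
```

Unlike self-attention, every operation here is a plain convolution, so the module avoids the memory-bound attention matrices while the dynamic kernel preserves some input dependence.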

Paper
Code
Image Super-Resolution

Not All Degradations Are Equal: A Targeted Feature Denoising Framework for Generalizable Image Super-Resolution

Author: Hongjun Wang, Jiyuan Chen, Zhengwei Yin, Xuan Song, Yinqiang Zheng

Year: 2025

Publication: IEEE International Conference on Computer Vision (ICCV)


Generalizable image super-resolution aims to improve model generalization under unknown degradations. To achieve this goal, models are expected to focus only on image content-related features rather than degradation details (i.e., to avoid overfitting degradations). Recently, numerous approaches such as dropout and feature alignment have been proposed to suppress models' natural tendency to overfit degradations, and they yield promising results. Nevertheless, these works assume that models overfit to all degradation types (e.g., blur, noise), whereas through careful investigation in this paper we discover that models predominantly overfit to noise, largely because the degradation pattern of noise is distinct from that of other degradation types. We therefore propose a targeted feature denoising framework comprising noise detection and denoising modules. Our approach is a general solution that can be seamlessly integrated with existing super-resolution models without architectural modifications. Our framework demonstrates superior performance compared to previous regularization-based methods across five traditional benchmark datasets, encompassing both synthetic and real-world scenarios.
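The detect-then-denoise pattern can be illustrated with hand-crafted stand-ins for the paper's learned modules: a detector that flags channels with high Laplacian (high-frequency) energy, and a smoothing filter applied only to the flagged channels. The threshold and filters here are assumptions for illustration:

```python
import numpy as np

def targeted_denoise(feat, thresh=0.5):
    """Targeted feature denoising sketch: only channels whose
    high-frequency energy exceeds `thresh` are smoothed; clean
    channels pass through untouched (the 'targeted' part)."""
    out = feat.copy()
    for i, ch in enumerate(feat):
        lap = (np.roll(ch, 1, 0) + np.roll(ch, -1, 0) +
               np.roll(ch, 1, 1) + np.roll(ch, -1, 1) - 4 * ch)
        if np.mean(lap ** 2) > thresh:       # detector: noise present
            out[i] = sum(np.roll(np.roll(ch, dy, 0), dx, 1)
                         for dy in (-1, 0, 1)
                         for dx in (-1, 0, 1)) / 9   # 3x3 box filter
    return out
```

The targeting is the point: because only noise-like channels are touched, content features survive intact, in contrast to blanket regularizers such as dropout.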

Paper