In visual reinforcement learning, agents operating under sparse rewards frequently fail to explore effectively because they overfit to high-frequency visual noise and irrelevant textures early in training. By forcing the agent to initially perceive only the coarse, low-frequency structural elements of its environment, we can drastically simplify the effective state space and facilitate the discovery of foundational behaviors. Gradually reintroducing high-frequency details as the agent's competence improves promotes robust credit assignment and stable policy convergence without requiring dense reward engineering.
Approach
We propose an adaptive frequency curriculum that integrates internal feature map smoothing into the visual encoder of an RL agent. Inspired by [Curriculum By Smoothing](/paper/art_ade7d4b8c4684cde8eb8eff20fee22a3), we insert Gaussian low-pass filters after the convolutional layers of the policy network to suppress high-frequency spatial details during early exploration. Unlike static annealing schedules, the variance of the smoothing kernel is dynamically decayed based on the agent's temporal difference (TD) error, ensuring the curriculum progresses only when the agent has mastered the current difficulty level. This forces the agent to learn broad spatial navigation before attempting fine-grained manipulation, directly addressing the exploration bottlenecks highlighted in [PushWorld: A benchmark for manipulation planning with tools and movable obstacles](/paper/art_efa23f423c374ef5beaf1bb524b43f1f).
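To make the mechanism concrete, the following is a minimal NumPy sketch of the two pieces described above: a separable Gaussian low-pass filter applied to a 2-D feature map, and a TD-error-gated schedule that shrinks the kernel variance only once the running TD error drops below a mastery threshold. The class and function names, the decay factor, and the threshold are illustrative assumptions, not part of the proposal's fixed design; a real implementation would apply the filter inside the encoder (e.g., as a fixed depthwise convolution after each conv layer).

```python
import numpy as np

def gaussian_kernel(sigma, radius=3):
    """1-D Gaussian kernel, normalized to sum to 1."""
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def smooth_feature_map(fmap, sigma):
    """Separable Gaussian low-pass filter over a 2-D feature map:
    convolve each row, then each column, with the same 1-D kernel."""
    k = gaussian_kernel(sigma)
    blurred = np.apply_along_axis(
        lambda row: np.convolve(row, k, mode="same"), 1, fmap)
    blurred = np.apply_along_axis(
        lambda col: np.convolve(col, k, mode="same"), 0, blurred)
    return blurred

class AdaptiveSigmaSchedule:
    """Decay the smoothing kernel's scale only when the agent's mean
    TD error indicates mastery of the current difficulty level.
    Hyperparameters here (decay=0.9, threshold=0.1) are placeholders."""
    def __init__(self, sigma=2.0, decay=0.9, threshold=0.1, sigma_min=1e-3):
        self.sigma = sigma
        self.decay = decay
        self.threshold = threshold
        self.sigma_min = sigma_min

    def update(self, mean_td_error):
        # Progress the curriculum (sharpen the agent's view) only
        # when the current level appears mastered.
        if abs(mean_td_error) < self.threshold:
            self.sigma = max(self.sigma * self.decay, self.sigma_min)
        return self.sigma
```

Because the blur averages neighboring activations, it suppresses high-frequency texture while leaving coarse spatial structure intact; as the schedule drives sigma toward zero, the filter approaches the identity and full visual detail is restored.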
Experimental Plan
We will evaluate our method on the pixel-based manipulation tasks in the Meta-World benchmark and the sparse-reward navigation environments in MiniGrid. The primary hypothesis is that adaptive spectral smoothing will achieve higher success rates and greater sample efficiency than standard end-to-end RL and static curriculum baselines by preventing early-stage optimization traps. We will compare our approach against standard Soft Actor-Critic (SAC), SAC with a static [Curriculum By Texture](/paper/art_7782e22ab11a4d09b567ddb175f46f59) applied to the encoder, and an intrinsic motivation baseline such as [Adversarial Intrinsic Motivation for Reinforcement Learning](/paper/art_8034f2f3d49c4fc78e3f74ec72afbf5c). Metrics will include episodic success rate, sample efficiency (environment steps to convergence), and zero-shot robustness to novel visual distractors.
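The two scalar metrics above can be pinned down with a short sketch. Here, steps-to-convergence is defined as the first environment step at which the rolling success rate over a fixed evaluation window stays at or above a target; the window size and target are illustrative assumptions, not values fixed by the plan.

```python
import numpy as np

def episodic_success_rate(successes):
    """Fraction of evaluation episodes that reached the goal
    (successes is a sequence of 0/1 outcomes)."""
    return float(np.mean(successes))

def steps_to_convergence(step_log, success_log, target=0.75, window=10):
    """First logged environment step at which the rolling success rate
    over `window` consecutive evaluations reaches `target`; None if it
    never does. (Illustrative definition; thresholds are assumptions.)"""
    rates = np.convolve(success_log, np.ones(window) / window, mode="valid")
    for i, r in enumerate(rates):
        if r >= target:
            # Report the step of the last evaluation in that window.
            return step_log[i + window - 1]
    return None
```

A windowed definition like this avoids declaring convergence on a single lucky evaluation, which matters when comparing sample efficiency across the SAC, static-curriculum, and intrinsic-motivation baselines.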