Reward Scaling in a Predator-Prey-Grass Multi-Agent Reinforcement Learning Environment

One of the most subtle but crucial parameters in multi-agent reinforcement learning (MARL) is reward magnitude — especially in environments where rewards are sparse and agents must coordinate to survive. In the custom-built MARL environment simulating predators, prey, and grass on a grid, changing the reproduction reward from 1.0 to 10.0 dramatically altered the course of learning. This section explains why that matters, and what it reveals about reward scaling in co-evolving agent populations.

The Setup: A Sparse-Reward Ecosystem

In the default environment setup:

  • Prey receive a reward only when they reproduce. They get nothing for eating grass.
  • Predators receive a reward only when they reproduce. They get nothing for catching prey.
  • There are no dense rewards, and no reward shaping. Just life, death, and reproduction.

This creates a sparse reward problem: many steps in the simulation provide no reward at all, and only successful reproduction yields any. Despite this, the goal is for stable dynamics to emerge: predators learning to catch prey, prey learning to survive and reproduce, and ideally both populations coexisting without collapsing. Success, however, turned out to hinge critically on the size of the reproduction reward.
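
To make the reward structure concrete, here is a minimal sketch of what a reproduction-only reward assignment could look like. The function name, the per-agent dict format, and the REPRODUCTION_REWARD constant are illustrative assumptions, not the environment's actual code.

```python
# Minimal sketch of a reproduction-only reward scheme (hypothetical names).
# Every step yields 0.0 unless the agent reproduced on that step.

REPRODUCTION_REWARD = 10.0  # the single tunable discussed throughout this section


def compute_rewards(agent_ids, reproduced_this_step):
    """Return a per-agent reward dict for one environment step.

    agent_ids:            iterable of agent ids (predators and prey alike)
    reproduced_this_step: set of agent ids that reproduced on this step
    """
    rewards = {}
    for agent_id in agent_ids:
        if agent_id in reproduced_this_step:
            rewards[agent_id] = REPRODUCTION_REWARD
        else:
            # No reward for eating grass, catching prey, or simply surviving.
            rewards[agent_id] = 0.0
    return rewards
```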

When Reward = 1.0: Weak Signal, Weak Survival

In early training runs with the reproduction reward set to 1.0, the outcomes were poor:

  • Agents failed to develop coherent survival strategies.
  • Predator and prey populations often collapsed early.
  • Co-evolutionary dynamics did not emerge.

In hindsight, this made sense. When rewards are both sparse and small, the learning algorithm has almost nothing to latch onto. In PPO, which relies on advantage estimates and clipped policy updates, the signal from a rare +1 reward was too small to stand out from the noise of thousands of zero-reward steps.
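
The effect can be illustrated with a rough back-of-the-envelope simulation. The NumPy sketch below uses assumed numbers (episode length, value-error scale), not measurements from the environment: it computes standard GAE advantages for a rollout containing a single reproduction event and compares the advantage at that step to the noise floor on the zero-reward steps. The ratio grows roughly linearly with the reward magnitude.

```python
import numpy as np

T = 1000                 # steps per rollout (assumed)
GAMMA, LAM = 0.99, 0.95  # typical PPO/GAE settings
VALUE_NOISE = 0.5        # assumed std of value-function approximation error


def advantage_signal_to_noise(repro_reward, n_runs=20):
    ratios = []
    for seed in range(n_runs):
        rng = np.random.default_rng(seed)
        rewards = np.zeros(T)
        rewards[T // 2] = repro_reward                # single reproduction event
        values = rng.normal(0.0, VALUE_NOISE, T + 1)  # noisy baseline estimates

        # Standard generalized advantage estimation (GAE).
        adv = np.zeros(T)
        gae = 0.0
        for t in reversed(range(T)):
            delta = rewards[t] + GAMMA * values[t + 1] - values[t]
            gae = delta + GAMMA * LAM * gae
            adv[t] = gae

        spike = abs(adv[T // 2])            # advantage at the reproduction step
        noise = adv[rewards == 0.0].std()   # typical advantage on zero-reward steps
        ratios.append(spike / noise)
    return float(np.mean(ratios))


for r in (1.0, 10.0):
    print(f"reproduction reward {r:>4}: advantage signal-to-noise ~ "
          f"{advantage_signal_to_noise(r):.1f}")
```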

When Reward = 10.0: Sparse but Learnable

Simply increasing the reproduction reward to 10.0 — while keeping everything else the same — resulted in striking improvements:

  • Agents survived much longer on average.
  • Predators and prey began to develop behaviors that looked like pursuit and evasion.
  • Reproduction occurred frequently enough to sustain populations.
  • The environment began to show dynamic, emergent co-evolution between agent types.

With a reward of 10.0, even a sparse signal became strong enough to learn from. Early successes, such as an agent reproducing, now produced gradient updates large enough to matter. This is essential in an environment with no intermediate reward shaping.

Why Reward Scaling Matters in Sparse Settings

These results align with known insights from reinforcement learning:

  • In sparse-reward environments, rare positive events must produce a signal strong enough to influence policy updates.
  • A reward of 1.0 might work well in dense-reward tasks, but in sparse tasks, it can easily get lost in the variance of the episode.
  • In PPO specifically, policy updates are clipped and advantages are typically normalized per batch, so small sparse rewards translate into advantage estimates that are easily swamped by value-function noise, leading to minimal updates.

In short: reward magnitude is not just a multiplier. It shapes the entire learning dynamics.
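
One practical way to experiment with this, without touching the environment's internals, is a thin wrapper that rescales every reward by a constant factor; since reproduction is the only reward source here, that is equivalent to rescaling the reproduction reward itself. The sketch below assumes a PettingZoo-style parallel API in which step() returns per-agent dicts; the wrapper class is illustrative and not part of the environment described here.

```python
class RewardScaleWrapper:
    """Multiply every agent's reward by a constant factor (illustrative sketch)."""

    def __init__(self, env, scale: float):
        self.env = env
        self.scale = scale

    def reset(self, *args, **kwargs):
        return self.env.reset(*args, **kwargs)

    def step(self, actions):
        # PettingZoo-style parallel envs return per-agent dicts.
        obs, rewards, terminations, truncations, infos = self.env.step(actions)
        scaled = {agent: r * self.scale for agent, r in rewards.items()}
        return obs, scaled, terminations, truncations, infos

    def __getattr__(self, name):
        # Delegate everything else (agents, spaces, rendering, ...) to the wrapped env.
        return getattr(self.env, name)
```

Training one run with scale=1.0 and another with scale=10.0, with every other setting held fixed, then reproduces the comparison described in this section.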

Evaluation Confirms the Effect

Two sets of PPO agents were trained, one with the reproduction reward set to 1.0 and the other with it set to 10.0, and were then evaluated after training under identical conditions.

  • Agents trained with reward = 1.0 often failed to maintain stable populations. Their strategies collapsed, and episodes ended early.
  • Agents trained with reward = 10.0 maintained persistent, diverse populations. Different agent speeds emerged and coexisted. Reproduction patterns were sustained.

These results were visible not only in episode length, but also in population graphs, reward traces, and behavioral observations.
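
A minimal sketch of how such a side-by-side evaluation might be run, again assuming a PettingZoo-style parallel environment and a mapping from agent id to a trained policy callable (all names here are placeholders):

```python
import numpy as np


def evaluate(env, policies, n_episodes=50, max_steps=1000):
    """Roll out trained policies and record episode lengths and population traces.

    env:       PettingZoo-style parallel environment (placeholder)
    policies:  mapping from agent id to a callable obs -> action (placeholder)
    """
    episode_lengths, population_traces = [], []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        populations = []
        steps = 0
        while env.agents and steps < max_steps:
            actions = {agent: policies[agent](obs[agent]) for agent in env.agents}
            obs, rewards, terminations, truncations, infos = env.step(actions)
            populations.append(len(env.agents))  # agents still alive after this step
            steps += 1
        episode_lengths.append(steps)
        population_traces.append(populations)
    return float(np.mean(episode_lengths)), population_traces
```

Running this once per trained policy set, against the same environment settings, yields the episode-length and population comparisons summarized above.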

Conclusion: Reward Scale Drives Ecosystem Viability

In an environment with no static targets and no fixed goal, the size of the reward signal controls whether any adaptive behavior emerges at all. In our experiments:

  • A reward of 1.0 led to extinction.
  • A reward of 10.0 led to open-ended co-evolution and ecological balance.

This finding should be of interest to anyone building MARL environments focused on emergence and adaptation. If rewards are sparse, their magnitudes must be high enough to stand out — not just in theory, but in practice, where stochasticity and delayed returns are the norm.