Self Playing PPO - Teaching Cars to Race
A beginner friendly guide!!! (hopefully)
Inspiration
I’ve always been interested in AI and more specifically RL. I know this is a broad statement, but it’s true. I might not be the most technical, but I’ve always admired how intuitive RL is at a surface level. At its core, it learns the same way that we learn as humans: it performs an action, gets told whether that action is good or bad, and then learns to either keep performing it or stop.
Moreover, I’ve also always enjoyed racing. Now, I’m not a diehard F1 fan, nor do I have an obsession with racing. But there’s just something so satisfying about the complexities underlying the simplistic nature of racing. From the paths racers take to strategies like blocking other racers, all of these had to be developed and learned over time. (And I also love procrastinating my homework by playing racing games online – shoutout Polytrack and Night City Racing).
And that’s what led me to this project. I wondered if models could learn these strategies without being explicitly programmed with heuristics or rules. More specifically, there’s a quote that I think sums up this exploration perfectly:
Necessity is the mother of invention.
– Plato
I wanted to see if the very nature of competition could implicitly teach models to take optimal strategies in a race environment, and then compare this against individual models trained without the pressure of competition and models trained to explicitly go faster.
And just for a sneak peek, this is what you’ll learn to train:
Background Info
Now, just as a brief disclaimer, I won’t go into extreme detail about any algorithm or code. My goal is to provide a generalized overview that anyone can understand (specifically with regards to PPO), while at the same time using proper terminology (bolded) and hopefully helping you learn something. If you wish to learn more, I’d suggest reading the original papers, this course by Hugging Face, or even browsing my notes for a more condensed version. With that out of the way, we can get onto the fun stuff!
As mentioned earlier, in RL, an agent interacts with an environment over a sequence of timesteps (a trajectory). At each step, the agent observes a state, takes an action, and receives a reward. The objective is not to maximize immediate reward, but to maximize cumulative reward over time, often called the return.
The agent’s behaviour is governed by a policy, usually denoted π, which defines how actions are chosen given a state. In modern (deep) RL, policies are often parameterized by neural networks. In other words, they are represented by neural networks that update their weights (parameters) according to the returns. This is the idea of policy gradients: directly adjusting policy parameters in the direction that increases expected reward.
Then, to estimate how good an action was, we can use a value function, which predicts the expected return from a given state using bootstrapping. (PS: what bootstrapping means is that it updates its estimate of a state’s value using the immediate reward plus the current predicted value of the next state. Basically, if the agent predicts something is good and also gets a good reward going into it, that state becomes worth more. This means it doesn’t have to wait for an entire episode to finish to learn, and it relates to something called TD learning.) Going a bit deeper, TD learning is ALSO used to calculate the advantage, which measures how much better or worse taking a specific action in a state was compared to the average expected outcome from that state. One final thing to note is that advantage replaces raw returns/rewards because those can have high variance (e.g. they could all be positive with varying magnitudes).
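To make bootstrapping and advantage concrete, here’s a tiny numeric sketch (every number here is made up purely for illustration):

```python
import numpy as np

gamma = 0.99
rewards = np.array([1.0, 0.5, -0.2])
# V(s_t) for t = 0..3; the last state is terminal, so its value is 0
values = np.array([2.0, 1.8, 1.5, 0.0])

# bootstrapped one-step targets: immediate reward + discounted next-state value
td_targets = rewards + gamma * values[1:]
# advantage: how much better the outcome was than the critic expected
advantages = td_targets - values[:-1]
```

Here the first action was better than expected (positive advantage) and the last was much worse, without ever waiting for the full episode’s return.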
Combining these ideas gives rise to actor-critic methods (of which PPO is one), where the actor updates the policy and the critic evaluates it. (Think of it like a player doing something and a coach telling them how to improve).
A key challenge in policy gradient methods is training stability. Large updates to the policy can drastically change behaviour, leading to collapsed performance or highly noisy learning. AND… this is where PPO and the motivation behind it become important.
Now, let’s take a step back. This was a lot of jargon, especially if you are just starting out. So, if there’s nothing else you remember, just remember the following analogy:
Imagine a player learning a new game with a coach. The player (policy) tries different moves, and the coach (value function) predicts how good each situation is. After each move, the coach gives feedback: “That move was better or worse than I expected” (advantage) based on the immediate result and their expectation of what would happen next (bootstrapping). Over time, the player updates their strategy a little at a time, guided by this feedback, so they don’t completely 180 their strategy (PPO’s key idea).
PPO
Now that you know the general details, we can go into more specifics. For the next two parts, I’ll be taking screenshots of the code as I go along with my explanation.
First, we need to make an actual agent. As mentioned earlier, it will have two functions (value and policy):
Within those neural networks, it’s important to note what the very first input and last output correspond to. In both cases, the input they take in has the dimensions of the observations while the critic outputs one value (the value of the state) and the actor outputs a mean action for each continuous action dimension which, when combined with the standard deviation, allows us to choose from a continuous range of values for the actual action.
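This isn’t the exact code from my screenshots, but a minimal sketch of what such an agent might look like in PyTorch (the layer sizes and Tanh activations here are my assumptions):

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class Agent(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        # critic: observation -> one scalar (the value of the state)
        self.critic = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, 1),
        )
        # actor: observation -> a mean for each continuous action dimension
        self.actor_mean = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, act_dim),
        )
        # learned log standard deviation, one per action dimension
        self.actor_logstd = nn.Parameter(torch.zeros(act_dim))

    def get_value(self, x):
        return self.critic(x)

    def get_action_and_value(self, x, action=None):
        mean = self.actor_mean(x)
        dist = Normal(mean, self.actor_logstd.exp())  # mean + std -> continuous range
        if action is None:
            action = dist.sample()
        return action, dist.log_prob(action).sum(-1), dist.entropy().sum(-1), self.critic(x)
```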
Then, we can write the functions shown above to actually provide these predictions to the PPO algorithm. Note that the optimization at the bottom simply helps reduce variance for better training.
From here, we can initialize a PPO class:
And finally, we can go through the PPO loop:
1. The agent interacts with the environment and collects trajectories to learn from. This is called a rollout.
Note that the policy does not get updated here.
The key idea here is that we collect a batch/minibatch of data and use it to update PPO later on (not updating at every single step, but also not waiting excessively long).
2. The critic predicts the value of each state in the trajectory.
It does this by just calling the Agent class’ function that we defined earlier.
3. The agent calculates the advantages at each step.
Note that you can use GAE to manage the bias-variance tradeoff (better updates because it takes a weighted average).
4. Update the policy.
5. Update the value function.
Note that the two steps above can be combined into one in PyTorch.
As mentioned in the note, PyTorch can perform the gradient updates on both the policy and the value function at the same time, so we can backpropagate through one combined loss instead of two.
ADDITIONALLY, it is important to note that PyTorch’s optimizers minimize rather than maximize. Hence, although the PPO formula wants to maximize, we apply a negative sign so that minimizing the loss still optimizes our objective.
6. Repeat
Just repeat this as many times as you want to train a progressively more advanced model. BUT be careful because RL models can still overfit (but we deal with this later on with strategies like random environment sampling).
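To tie steps 4–6 together, here’s a hedged sketch of PPO’s clipped surrogate objective combined with the value loss into one loss (the function name and coefficients are illustrative, not my exact code):

```python
import torch

def ppo_loss(new_logprobs, old_logprobs, advantages, values, returns,
             clip_eps=0.2, vf_coef=0.5):
    """Clipped surrogate policy loss + value loss, as a single combined loss."""
    ratio = (new_logprobs - old_logprobs).exp()   # pi_new(a|s) / pi_old(a|s)
    # clip the ratio so one update can't move the policy too far (PPO's key idea),
    # and negate because PyTorch minimizes while PPO maximizes
    pg_loss = -torch.min(
        ratio * advantages,
        torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages,
    ).mean()
    v_loss = ((values - returns) ** 2).mean()     # critic regresses toward the returns
    return pg_loss + vf_coef * v_loss
```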
But either way, see how easy that was??
Environment & Gymnasium
Chances are, if you’ve ever dabbled in RL, you’ve heard of Gymnasium (the maintained successor to OpenAI’s Gym). Or, if you’ve ever used ChatGPT like ever, you’ve probably heard of OpenAI. But what exactly is Gymnasium?
Well, put simply, it’s a VERY popular library for reinforcement learning environments BECAUSE it provides a standardized interface for agents to interact with different tasks. In other words, you are able to swap out environments (whether that’s classic problems like CartPole or a custom environment like the one I am about to show you) without having to change your learning code.
At its core, every environment in Gymnasium has a couple building blocks:
action_space and observation_space → these define the actions an agent can take and the observations it can see (i.e. their shapes, the values they can take on, etc.)
reset() → this initializes the environment and returns the starting state that agents use to make their first action
step(action) → this is the main function that an agent uses to interact with an environment. This is where the results of taking an action are returned, including the next state, the reward for that action, done flags (terminated and truncated), and an optional info dictionary for debugging
render() → this renders a visual representation of the environment (but this is optional)
For our racing environment, they resemble the following:
action_space and observation_space
As you can see, we can define the boundaries of the values in each array, as well as its shape and the data type it takes on.
reset()
Not much here. The only thing really of note is that we can make separate functions to get the observation and information of the environment to keep our code DRY.
step(action)
This is the meat of our environment. As you can see, the function takes in an action, extracts the individual steering and throttle values, and uses them to step the car’s movement forward one timestep. Then, the car gets a reward/penalty based on the new state after its movement.
THE MOST IMPORTANT PART HERE is the reward function, and I’ll touch on this more at the end as well. For now, just know that you will probably spend time tuning this to optimize your own agent’s learning should you try to implement your own environment. This is because you want the agent to learn, but when rewards are too sparse, learning is too hard, while when they are too plentiful, the agent can reward hack (that’s why I added the checkpoints for example lol).
Of course, you will still need to define the track and car classes and all their functionality to be able to pass them into this gym environment. But I won’t show the code right now (you can check my Github if you would like to see it in detail), although I will make a small note about the cars and raycasting in particular near the end.
Self Play
Now this is where things start to get really interesting. Before, it was simply an individual model learning by interacting with a fixed environment. With self play, the environment becomes dynamic (the agent learns by playing against itself). This strategy falls under adversarial MARL (Multi Agent RL), and is key because it ensures that the agent plays against appropriately challenging opponents. This is, for example, how the famous AlphaGo was trained!!!
One common approach, and the one we are going to use, is to take snapshots of agents at certain frequencies and add them to a pool of opponents. Then, for each episode, you choose an opponent from this pool and train the agent against it. But it’s key to note that there are challenges. Agents can become overspecialized and fail against more general opponents in more general environments. Imagine an agent repeatedly trained against opponents that only prioritized blocking and nothing else. Then, even when it wasn’t being blocked, it might still not go as fast as it could, because that would be a new experience. BASICALLY, you want to train the agent on a diverse set of experiences (both environments and opponents).
Concentrating on our racing project in specific, we need to implement the following changes in both the environment and the PPO algorithm for self play to work:
Starting off with the environment, we don’t have to rewrite all of it. The Car and Track classes for self play can inherit from their individual counterparts. But the racing environment is too different, so it’s better to rewrite it (although you can copy a lot of the code haha…).
To give a general summary of the changes, most of the logic stays the same. The only main difference is that each agent has its own action space and observation space, and this carries throughout the code (i.e. we perform reward calculations for each agent instead of just one). Additionally, we modify the observation space so that the cars can also observe each other and learn from this additional information.
Apart from this, the last major change is that we add competition. How you ask? We add a reward that only one car can get.
By doing this, cars are incentivized to drive faster and better to place quicker (hopefully at least lmao).
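One possible shape for such a winner-only reward (purely illustrative — the function name, the "furthest along wins" rule, and the bonus size are all my inventions, not necessarily what my code does):

```python
def placement_bonus(progresses, bonus=10.0):
    """Only one car can earn this: the agent furthest along the track gets the bonus."""
    leader = max(range(len(progresses)), key=lambda i: progresses[i])
    return [bonus if i == leader else 0.0 for i in range(len(progresses))]
```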
Then, to change PPO, as mentioned earlier, we need to add a pool of agents (but we can still inherit most of the functionality from the base PPO implementation).
Essentially, for every update, we’ll choose an agent from the current pool (or perform random actions if there are no agents yet) and add a new snapshot of the current agent to the pool every so often. If the pool gets too large, we simply remove the oldest agent.
Something to note is that you need to DEEP COPY when making snapshots so all the weights get copied over!!!
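A minimal sketch of such a pool (class and method names are mine; the important bit is the `deepcopy`, which freezes the snapshot’s weights instead of sharing references with the live agent):

```python
import copy
import random
from collections import deque

class OpponentPool:
    """Snapshot pool for self-play; oldest snapshots are evicted automatically."""
    def __init__(self, max_size=10):
        self.pool = deque(maxlen=max_size)

    def add_snapshot(self, agent):
        # DEEP COPY so all the weights get copied over, not just references
        self.pool.append(copy.deepcopy(agent))

    def sample(self):
        # None signals "no opponents yet" -> caller falls back to random actions
        return random.choice(self.pool) if self.pool else None
```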
But yeah, apart from this, there’s not much difference between the original PPO implementation and environment and the current one for self play. Isn’t that so cool? You’ve already learned so much (I hope).
Key Notes
After going through this article, you should now know the general gist of how to implement self play PPO and to build a custom environment. But, as mentioned in the beginning, this was a relatively simplified explanation. So, here are some of the choices made key to improving this project:
Manually annealing logarithmic standard deviation.
When using a learned logarithmic standard deviation, it wouldn’t decrease enough, so even by the end, when rewards weren’t improving, the agent’s movements would be jittery and it couldn’t reach higher speeds (see below).
So instead, I tried manual annealing where you start with more exploration and tune it down as the updates progress so that by the end, the model is more deterministic (less range of options) and it moves much smoother.
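A sketch of what such a schedule can look like (the linear shape and the endpoint values here are assumptions, not my exact numbers):

```python
def annealed_log_std(update, total_updates, start_log_std=0.0, end_log_std=-2.5):
    """Linearly anneal log std from exploratory (std = 1.0) toward
    near-deterministic (std ~= 0.08) as training progresses."""
    frac = min(update / total_updates, 1.0)
    return start_log_std + frac * (end_log_std - start_log_std)
```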
Vectorizing the ray casting.
As you might guess, the main bottleneck for this project was not the GPU (the entire neural network for the policy and value function was made up of only 3 FC layers of at most 64 neurons each). Rather, the main computation was the processing required for each step, more specifically, the ray casting.
The general logic for the ray casting was to check each ray (11 rays total per car) against each segment of the boundaries (hundreds) and this would need to be performed each step. Obviously, this was VERY slow. So instead, I vectorized the ray casting.
Instead of using a loop, I used NumPy broadcasting and masking to compute all ray–segment intersections in one pass.
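Here’s a hedged sketch of what that broadcasting can look like (the geometry helper and array shapes are my own reconstruction, not the project’s exact code): each ray–segment pair is solved as a little 2×2 system using 2D cross products, all at once.

```python
import numpy as np

def cast_rays(origin, dirs, segments, max_dist=200.0):
    """Vectorized ray-segment intersection via broadcasting and masking.
    origin: (2,), dirs: (R, 2) unit vectors, segments: (S, 2, 2) endpoint pairs."""
    A, B = segments[:, 0], segments[:, 1]            # (S, 2) segment endpoints
    E = B - A                                        # segment direction vectors
    F = A - origin                                   # origin -> segment start
    cross = lambda a, b: a[..., 0] * b[..., 1] - a[..., 1] * b[..., 0]
    denom = cross(dirs[:, None, :], E[None, :, :])   # (R, S); 0 means parallel
    with np.errstate(divide="ignore", invalid="ignore"):
        t = cross(F[None, :, :], E[None, :, :]) / denom     # distance along each ray
        u = cross(F[None, :, :], dirs[:, None, :]) / denom  # position along each segment
    # mask: not parallel, hit is in front of the ray, and lands within the segment
    valid = (np.abs(denom) > 1e-12) & (t >= 0) & (u >= 0) & (u <= 1)
    t = np.where(valid, t, np.inf)
    return np.minimum(t.min(axis=1), max_dist)       # nearest hit per ray, clipped
```

One pass like this replaces an R × S Python double loop per car per step, which is where all the time was going.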
Fine tuning the reward shaping.
This probably took me the longest out of any parts of the project. Although it may seem simple, reward shaping had a lot of caveats. Here are some of the main ones:
You need checkpoints to ensure that the car doesn’t reward hack and just go backwards and then forwards at the start.
You can benefit by having some rewards at these checkpoints to ensure that the car knows it’s going the correct way (or else the rewards are too sparse).
Speed rewards need to be controlled as well.
For instance, I only rewarded speed when the agent was going in the correct direction. Otherwise, it would have learned to just go backwards at max speed, since it could rack up a higher overall reward by never finishing until the environment’s step limit was reached.
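To make those caveats concrete, here’s a hypothetical reward function in the same spirit (every coefficient, flag, and condition here is invented for illustration — not my actual tuned values):

```python
def shaped_reward(progress_gained, speed, going_forward, passed_checkpoint,
                  crashed, checkpoint_bonus=1.0, speed_coef=0.01, crash_penalty=5.0):
    """Sketch of reward shaping: dense progress, checkpoint confirmation,
    direction-gated speed reward, and a crash penalty."""
    reward = progress_gained                  # dense signal for moving along the track
    if passed_checkpoint:
        reward += checkpoint_bonus            # confirms the correct way; fights sparsity
    if going_forward:
        reward += speed_coef * speed          # never reward speed in the wrong direction
    if crashed:
        reward -= crash_penalty
    return reward
```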
And it gets even more complicated when you get into self play and MARL:
Rewarding agents for surpassing other agents is hard because you don’t want them to hack the reward (i.e. if they get a reward for passing they might pass, slow down, and pass again, and repeat).
It’s very easy for agents to learn degenerate behaviour.
For example, I tried to make the cars learn to play fair by penalizing wall and car crashes. But, during one training, because the agents were too scared of car crashes, one would drive towards the other and the other would drive away, always ending up crashing into the wall.
Varying factors to prevent overfitting and increase learning.
In my code, I made randomized track creation and agent starting positions. This allowed the agent to get exposure to more types of tracks during its training while also learning strategy from different starting positions. Something to note for this was that I always kept the same track per environment (with a different track per environment) because PPO assumes that, per environment, the distribution remains relatively stationary.
To dive a bit deeper into the track creation, my logic was essentially to take a circle and then plot points with variation inside and outside of the circumference, and then connect the points with cubic splines. I feel like this is pretty intuitive, so I’ll leave it at that…
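For the curious, the idea can be sketched in a few lines (assuming SciPy for the splines; the point counts and variation range are made up):

```python
import numpy as np
from scipy.interpolate import CubicSpline

def make_track(n_points=12, radius=100.0, variation=0.3, seed=None):
    """Circle-plus-noise track centerline: jittered points on a circle,
    connected with a smooth periodic cubic spline."""
    rng = np.random.default_rng(seed)
    angles = np.linspace(0.0, 2.0 * np.pi, n_points, endpoint=False)
    # push each point a random amount inside/outside the circumference
    radii = radius * (1.0 + rng.uniform(-variation, variation, n_points))
    pts = np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)
    pts = np.vstack([pts, pts[:1]])            # repeat the first point to close the loop
    t = np.linspace(0.0, 1.0, len(pts))
    spline = CubicSpline(t, pts, bc_type="periodic")
    return spline(np.linspace(0.0, 1.0, 200))  # (200, 2) points along the centerline
```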
And also, what I covered was a base implementation of PPO without any optimizations. But, as you can see in my code, I added various optimizations like KL divergence stopping, GAE, learning rate annealing, etc. (most of which helped stabilize training / helped the agent learn better). If you want to read up on them in a very simplified manner, I suggest looking at my code and my notes.
Results
And now for the part you were probably all waiting for… the results. Which agents actually raced the best, and how did they compare to each other?
Well unfortunately… I lost :( Both SB3 models beat my single and self-play models:
Or did they…
Let’s talk first about the single agents. Sure, the SB3 agents went faster and took fewer steps than my ‘from scratch’ single agent. But look closer. Do you notice anything special about the paths they took? That’s right! My custom agent consistently takes a shorter path on average:
This is interesting to note because it shows how in RL, unless instructions are very precise, two different agents can learn two different solutions which could both be considered ‘optimal’ for the given task. Although the SB3 agents were faster, they took on average less optimal routes. (Just to give context, this was evaluated on 40 different tracks with 5 attempts per track).
And next, concentrating on self-play, we can also observe some interesting stuff. For instance, because the self-play agent learned that coming in first was good, it picked up natural strategies like overtaking by taking a sharp inside line around corners. As such, it was able to eke out a win against the general SB3 model in this regard. But it also learned other strategies like sacrificing speed to block other agents, which, while unintentional, helps showcase the competitive nature of the environment. And one more interesting thing of note is that the self-play agent never failed in any of the evaluations. So, although it went slower, it still demonstrated how self play can produce robust agents.
Finally, we can take a look at the training data:
Note that the data may look skewed: since the reward functions differed between models, they needed to be normalized. So in fact, although the SB3 finetuned model did better than the generalized one, their curves look around the same.
Looking at the training data, we can actually see how well we succeeded in our implementation. Firstly, our ‘from scratch’ implementation of single PPO matched the curve of the SB3 agent which used the same hyperparameters as our PPO. So, this acts as a validation of correctness (the from-scratch model performed and learned similarly to a SOTA baseline model). And secondly, our self-play PPO, although its reward decreased over time and fluctuated a lot between 1 and 3 million steps, actually worked. This behaviour can be explained by curriculum learning: the agent was matched against harder and harder opponents, meaning it couldn’t collect as many rewards. So, the fact that it started to stabilize signifies success and that the agent approached an optimum in self-play training.
So overall, I think that this was a massive success!!!
Conclusion
This was a little walkthrough of how to build a self playing PPO agent from the ground up, starting from ideas core to reinforcement learning and ending with competitive behaviour emerging implicitly from a racing environment.
What I hoped to explore through this project, and highlight through this blog post, was the power of learning through interaction. By making progressively more intricate changes to the environment, like adding another car, the objective itself changes, leading the agent to learn new strategies.
Of course, this project only scratched the surface. There are so many more things that could be implemented, from curriculum learning (progressively increasing the difficulty of tasks) to population based training (training multiple agents in parallel and improving the weaker agents using the weights and hyper-parameters of stronger agents) to even something as simple as improved opponent sampling.
But, overall, I hope that this project inspired you to investigate RL further and showed you that it really isn’t that complicated to get into. I encourage you to experiment and make your own environments and implementations of algorithms, because that’s the best way to learn!