# Problem Description

Drivers drive well most of the time and drive badly only occasionally. The problem statement relies on this assumption for all drivers. The main goal of this experiment is to study whether an agent can learn a driver's behavior while they are driving well and warn them when they drive badly by their own standards.

Can this problem be modeled as a Markov decision process (MDP)? Yes, provided the state of the agent satisfies the Markov property. The Markov property is memoryless: the future state depends only on the current state and is independent of all previous states. Velocity is Markovian here: it carries both the direction and the magnitude of the motion, and the future velocity is independent of the past stream of velocities given the present velocity. The same holds for the relative position of the vehicle: the future position is independent of the past stream of velocities given the present velocity and the present position of the vehicle.

# Environment

The figure below shows the environment. The green circle is the car with the warning agent, and the blue dots are the other agents (cars) in the environment.

# Driver behavior

As shown in the figure above, the driver is the green circle and interacts with the other agents in the environment. A simple driver model is adopted: the driver brakes the vehicle when another agent is nearer than a threshold distance predefined for that driver. Several driver behaviors are tested. A very cautious driver has a higher threshold distance, and a reckless driver has a lower one. The agent is able to learn in all of these scenarios, as shown in the results section.
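The braking rule described above can be sketched as follows. This is only a minimal illustration: the function name `driver_brakes` is hypothetical, and Euclidean distance is an assumption, since the text does not say which distance measure the driver model uses.

```python
import math


def driver_brakes(ego_pos, other_positions, threshold):
    """Return True if any other agent is nearer than the driver's threshold.

    Assumption: "nearer" is measured with Euclidean distance.
    """
    ex, ey = ego_pos
    return any(
        math.hypot(ox - ex, oy - ey) < threshold
        for ox, oy in other_positions
    )
```

Under this sketch, a cautious driver would be given a larger `threshold` and a reckless driver a smaller one.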

# Actions

The agent has two actions:

1. Do not warn the driver
2. Warn the driver

# Rewards

There are 4 possible reward cases for the agent:

1. Agent does not warn, and the driver does not brake: reward is 0
2. Agent warns, and the driver does not brake: reward is -1
3. Agent does not warn, and the driver brakes: reward is -1
4. Agent warns, and the driver brakes: reward is +1
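The four cases above map directly onto a small function. The constant names `NOT_WARN` and `WARN` and the function name `reward` are illustrative choices, not taken from the original code.

```python
NOT_WARN, WARN = 0, 1  # hypothetical integer encoding of the two actions


def reward(action, braked):
    """Reward for the warning agent, following the four cases listed above."""
    if action == WARN:
        # A warning is rewarded only when the driver actually brakes.
        return 1 if braked else -1
    # Staying silent is neutral unless the driver had to brake.
    return -1 if braked else 0
```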

# Dimensions

A 2-dimensional world is assumed for this model. The environment spans 200 units in the X direction and 200 units in the Y direction, with both coordinates limited to the range (-100, 100). The car with the warning agent starts at (X, Y) = (-100, 0) with constant velocity (Vx, Vy) = (1, 0). All other agents start at random positions, with velocities chosen at random between -5 and 5 units per time step.
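One way to set up such a world is sketched below. The sketch assumes uniform sampling and that each velocity component is drawn independently from the range -5 to 5; the text only says the values are random. The function name and the dictionary layout are hypothetical.

```python
import random


def init_environment(num_agents, rng=None):
    """Create the ego car and `num_agents` other agents.

    Assumptions: positions and velocity components are sampled uniformly.
    """
    if rng is None:
        rng = random.Random()
    # Ego car: fixed start and constant velocity, as stated in the text.
    ego = {"pos": (-100.0, 0.0), "vel": (1.0, 0.0)}
    others = [
        {
            "pos": (rng.uniform(-100, 100), rng.uniform(-100, 100)),
            "vel": (rng.uniform(-5, 5), rng.uniform(-5, 5)),
        }
        for _ in range(num_agents)
    ]
    return ego, others
```

Passing a seeded `random.Random` makes a run reproducible, which helps when comparing trials.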

# Mathematical model of the environment

A linear kinematic model is assumed for the environment. The assumptions are:

1. All agents have constant velocity. The acceleration of every vehicle is taken to be zero.
2. All agents move in a straight line; real-life behaviors such as turning or braking are not modeled.
3. The motions of the agents in the environment are independent of each other.
4. The car moves in a straight line, and the other agents also move in straight lines but are given velocities in 2 dimensions.

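Under these assumptions, the kinematic equation for the motion model is the constant-velocity update: each agent's position at the next time step is its current position plus its velocity multiplied by the time step.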
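A constant-velocity update consistent with the assumptions listed above might look like this. The function name `step` and the default time step `dt=1.0` are assumptions; velocities in the text are given per time step, so a unit step is the natural choice.

```python
def step(pos, vel, dt=1.0):
    """Advance one agent by one time step under constant velocity.

    With zero acceleration the update is pos_next = pos + vel * dt.
    """
    return (pos[0] + vel[0] * dt, pos[1] + vel[1] * dt)


# The ego car starts at (-100, 0) and moves with velocity (1, 0).
ego_pos = step((-100.0, 0.0), (1.0, 0.0))
```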

# Learning

Function approximation: Linear function approximation is used to estimate the state-action value Q(s, a) for a given state and action. The warning agent in the car receives data from the vehicle sensors, which the environment provides in this model. The features for the warning agent are the relative positions and relative velocities of the other vehicles in the environment in the X and Y dimensions. Because the warning agent can take 2 actions, the total number of features is:

(Number of actions) * (Number of features per agent in the environment) * (Number of agents)
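The feature count above corresponds to giving each action its own block of weights. The sketch below assumes 4 features per surrounding agent (relative X and Y position, relative X and Y velocity) and uses hypothetical function names; it is one common way to build such a linear approximator, not necessarily the layout used in the original code.

```python
NUM_ACTIONS = 2
FEATURES_PER_AGENT = 4  # assumed: relative x, y position and relative x, y velocity


def state_features(ego, others):
    """Relative position and velocity of every other agent, flattened."""
    feats = []
    for other in others:
        feats += [
            other["pos"][0] - ego["pos"][0],
            other["pos"][1] - ego["pos"][1],
            other["vel"][0] - ego["vel"][0],
            other["vel"][1] - ego["vel"][1],
        ]
    return feats


def state_action_features(phi_s, action):
    """Copy the state features into the block that belongs to `action`."""
    n = len(phi_s)
    phi = [0.0] * (NUM_ACTIONS * n)
    phi[action * n:(action + 1) * n] = phi_s
    return phi


def q_value(weights, phi_s, action):
    """Linear estimate: dot product of the weights and the features."""
    phi = state_action_features(phi_s, action)
    return sum(w * x for w, x in zip(weights, phi))
```

With one surrounding agent the weight vector has 2 * 4 * 1 = 8 entries, matching the formula above.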

Episode: An episode runs from the start until the first time the driver brakes the vehicle. Once the driver brakes, a new episode is started. The maximum number of time steps in an episode is 200.
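The episode boundary can be expressed as a simple loop. In this sketch `step_fn` and `braked_fn` are hypothetical callables standing in for the environment update and the driver's braking check.

```python
MAX_STEPS = 200  # maximum number of time steps per episode, from the text


def run_episode(step_fn, braked_fn, state):
    """Run until the driver brakes or the step limit is reached.

    Returns (number of steps taken, whether the driver braked).
    """
    for t in range(1, MAX_STEPS + 1):
        state = step_fn(state)
        if braked_fn(state):
            return t, True
    return MAX_STEPS, False
```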

# Hyper parameters

The model has two sets of hyperparameters, ranging from environment variables to learning parameters.

The first set belongs to the simulation of the learning environment, for example:

1. Number of agents
2. Kinematic model of the environment

The second set relates to the learning model:

1. Learning rate
2. Epsilon
3. Epsilon decay
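As an illustration of how the learning hyperparameters might be used, the sketch below groups them in a dataclass and applies epsilon-greedy action selection. All default values are placeholders rather than the values used in the experiments, and the multiplicative decay schedule is an assumption.

```python
import random
from dataclasses import dataclass


@dataclass
class HyperParams:
    # All defaults are hypothetical placeholders, not values from the experiments.
    num_agents: int = 5
    learning_rate: float = 0.01
    epsilon: float = 1.0
    epsilon_decay: float = 0.995


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decay_epsilon(epsilon, epsilon_decay):
    """Assumed schedule: multiply epsilon by the decay factor each episode."""
    return epsilon * epsilon_decay
```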

# Experiments and Observations

The experiments cover the learning of the warning agent in the environment using SARSA and Q-learning. The number of trials is 10, and the number of episodes in each trial is 2000.
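The two algorithms differ only in the bootstrap target. The sketch below shows the standard targets and a gradient-style weight update for a linear approximator. The discount factor `gamma` is an assumption, since the text does not list it among the hyperparameters, and the function names are hypothetical.

```python
def sarsa_target(r, q_next_taken, gamma, done):
    """On-policy target: uses the value of the action actually taken next."""
    return r if done else r + gamma * q_next_taken


def q_learning_target(r, q_next_all, gamma, done):
    """Off-policy target: uses the best next action value."""
    return r if done else r + gamma * max(q_next_all)


def td_update(weights, phi_sa, target, alpha):
    """Move the linear estimate w . phi toward the target by step size alpha."""
    q = sum(w * x for w, x in zip(weights, phi_sa))
    delta = target - q  # temporal-difference error
    return [w + alpha * delta * x for w, x in zip(weights, phi_sa)]
```

When the episode ends (the driver brakes), `done` is true and both targets reduce to the immediate reward.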

Figure description: Results of applying SARSA to the environment with the parameters listed in the table. Figure 6 shows the result of Q-learning for illustration.

# Observations

1. The model converges for all kinds of drivers (different driver thresholds).

2. The model converges more slowly when there are more surrounding agents. With more surrounding agents, the episodes are shorter, which leads to slower convergence.

3. The Markovian assumption for the relative velocity and relative position features appears reasonable.

# Conclusion

Reinforcement learning is a promising approach to motion planning in self-driving cars. The driver is assumed to use a very simple distance-based rule to drive the car, yet the agent is able to learn the complex patterns from the relative positions and the relative velocities. I would like to explore the application of reinforcement learning in the self-driving car domain using simulated environments (for example, in ROS).
