OPTIMA: Optimized Policy for Intelligent Multi-Agent Systems Enables Coordination-Aware Autonomous Vehicles

Rui Du1, Kai Zhao1, Jinlong Hou1, Qiang Zhang1 and Peter Zhang2

1 Rui Du, Kai Zhao, Jinlong Hou, and Qiang Zhang are with the AI Platform Department, Bilibili Inc., Shanghai 200233, China. [email protected]; [email protected]; [email protected]; [email protected]

2 Peter Zhang is with the Heinz College of Information Systems and Public Policy, Carnegie Mellon University, 4800 Forbes Ave, Pittsburgh, PA 15213, USA. [email protected]

Abstract

Coordination among connected and autonomous vehicles (CAVs) is advancing due to developments in control and communication technologies. However, much of the current work is based on oversimplified and unrealistic task-specific assumptions, which may introduce vulnerabilities. This is critical because CAVs not only interact with their environment but are also integral parts of it. Insufficient exploration can result in policies that carry latent risks, highlighting the need for methods that explore the environment both extensively and efficiently. This work introduces OPTIMA, a novel distributed reinforcement learning framework for cooperative autonomous vehicle tasks. OPTIMA alternates between thorough data sampling from environmental interactions and multi-agent reinforcement learning algorithms to optimize CAV cooperation, emphasizing both safety and efficiency. Our goal is to improve the generality and performance of CAVs in highly complex and crowded scenarios. Furthermore, the industrial-scale distributed training system easily adapts to different algorithms, reward functions, and strategies.

I INTRODUCTION

The long-term goal of autonomous vehicles is to address advanced real-world traffic challenges. According to a report by the National Highway Traffic Safety Administration (NHTSA) [1], 94% of all crashes are caused by human error. However, human-error crashes stem not only from impaired or distracted driving but also from misjudging other vehicles’ intentions. The key to eliminating crashes lies in the connectivity of vehicles and their ability to achieve comprehensive situational awareness, enabling them to respond appropriately to each other’s movements and decisions. Real-world traffic problems are complex and unpredictable. Thus, training a highly intelligent model that understands complex roads and diverse driving styles is necessary.

However, the majority of previous studies have focused on overly simplistic scenarios in simulations. This approach may lead to potential risks affecting the overall safety and effectiveness of autonomous driving technologies. CAVs not only interact with the environment; they are also integral parts of it. Each CAV’s actions influence the surroundings, which in turn affect other CAVs, creating a complex feedback loop that simplistic simulations often fail to capture. If CAVs are not exposed to a wide range of scenarios during training, they may be ill-prepared for real-world driving diversity. When encountering unfamiliar situations, CAVs might exhibit unexpected behaviors, potentially triggering a cascade effect where other vehicles are pushed into their own corner cases. This interconnectedness underscores the need for comprehensive, diverse, and challenging simulations to develop robust and safe CAV systems capable of handling the unpredictable nature of real-world traffic.

This paper seeks to enhance the scalability of autonomous driving systems by utilizing distributed reinforcement learning techniques. We aim to address traditional traffic problems by integrating advanced AI models that can efficiently process and react to complex traffic scenarios with enhanced safety. The main contribution of our work is a novel Optimized Policy for Intelligent Multi-Agent Systems, or OPTIMA. Our key contributions are summarized as follows:

  • We have integrated learning-based methods with established perception and cooperation techniques such as centralized policy, safety distances, and right-of-way rules.

  • We have successfully implemented distributed reinforcement learning for autonomous vehicles, enhancing scalability and performance in complex scenarios.

  • OPTIMA achieved a 100% success rate in navigating extremely challenging multi-agent cooperation scenarios, a feat previously unattained at this level of environmental complexity.

By addressing these complex scenarios without relying on simplifying assumptions, OPTIMA sets a new benchmark for autonomous driving research. It demonstrates the potential of distributed reinforcement learning in handling real-world traffic complexities and provides valuable insights for future research and development in this field.

II Related Work

Figure 1: The process where the model receives observations about a vehicle and its neighboring vehicles, denoted as $O$ for the vehicle’s own observation and $N$ for the neighboring vehicles’ observations, from the environment. Using a deep reinforcement learning model, represented as $NN$, it generates appropriate actions to control the vehicle’s response and maneuvers. The model also involves a reward function $R$, which influences the actions based on predefined criteria. $NN$ outputs not only the control actions but also an estimation of future outcomes, represented as $E$.

To define what constitutes safety and efficiency in traffic, we have reviewed a variety of literature on the subject. Some studies have explored these aspects by focusing on varied hazardous scenarios and interaction dynamics between autonomous and human-driven vehicles. For instance, one study evaluated and compared 33 base metrics and 51 variants of traffic safety indicators published from 1967 to 2022 [2]. Another paper discusses the safety of autonomous vehicles with great care, proposing a white-box, interpretable mathematical model for safety assurance, which the authors call Responsibility-Sensitive Safety (RSS) [3]. Other papers discuss various safety indicators and challenges across different scenarios in detail [4, 5, 6, 7]. Inspired by these works, we have determined that vehicle safety indicators can be defined by safe distances, intent recognition, field of view limitations, and the safety of sensitive areas. After summarizing these issues and challenges, we factorize the exploration problem into (i) learning to understand the intentions of other vehicles and perceive surrounding environmental information, and (ii) making appropriate operational decisions based on this understanding.

Several works have proposed learning-based approaches to enhance the perception of surrounding vehicles in the CAV context. One study discusses two primary decision-making strategies: pipeline planning and end-to-end planning [8]. Pipeline planning, often referred to as rule-based planning, is a traditional approach that places planning within a broader system that includes perception, localization, and control [9, 10]. This method forms a critical part of the overall framework necessary for executing autonomous driving functions. Conversely, end-to-end planning represents a more holistic approach where the entire driving function from perception to action is encapsulated within a single, comprehensive model.

Moreover, research into end-to-end reinforcement learning methods is increasingly being explored, as exemplified by studies such as [11, 12, 13, 14, 15, 16, 17, 18, 19]. These approaches are promising for their ability to directly map sensory inputs to driving actions, potentially simplifying the complex process of autonomous decision-making. However, these methods often remain confined to simplistic scenarios and struggle with scalability.

Existing RL frameworks each have their own limitations. For instance, the CleanRL framework, designed with a single-file structure for research-friendly features, prioritizes ease of learning over scalability [20]. Stable Baselines3, while offering reliable RL algorithms, lacks support for asynchronous multi-actor parallel capabilities [21]. Ray RLlib, despite its powerful features and multi-machine scalability, has a steep learning curve and deeply nested abstractions that can hinder customization for specific tasks [22, 23]. These limitations become particularly apparent in multi-agent reinforcement learning (MARL) cooperation tasks such as intersection navigation, which differ significantly from traditional zero-sum competitive tasks such as Go [24]. Cooperative tasks often require a delicate balance between team and individual rewards, a nuance not easily captured in existing frameworks. Furthermore, the partial observation of agents in traffic scenarios adds another layer of complexity. Adapting existing open-source frameworks to meet these specific needs often requires extensive code modifications, which can limit their flexibility for complex, cooperative traffic tasks. This highlights the need for a more adaptable and scalable framework designed specifically for complex MARL scenarios in autonomous driving. OPTIMA addresses this need by incorporating distributable, asynchronous actors, which facilitate scalability and are well-suited for traffic cooperation scenarios.

TABLE I: Summary of RL Frameworks

| Framework | Distributable | Async Actors | Scalable | Traffic Solutions |
|-----------|---------------|--------------|----------|-------------------|
| CleanRL   | ×             | ×            | ×        | ×                 |
| SB3       | ×             | ×            | ×        | ×                 |
| Ray RLlib | ✓             | ✓            | ×        | ×                 |
| OPTIMA    | ✓             | ✓            | ✓        | ✓                 |

III Preliminary

III-A Markov Decision Process Formulation

We formulate the task as a set of Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs) [25]. To accommodate the evolving vehicles in the complex scenarios, Dec-POMDPs are represented as a tuple $\langle S, \{A_i\}, T, R, \{\Omega_i\}, O, \{N_i\}, \gamma \rangle$. We define the environmental time step of each agent as $t_i \in T$. $S$ is a set of global states. At each time step $t$, the action chosen by an agent is $a_{i,t} \in A_{i,t}$, where $A_{i,t}$ is a two-dimensional continuous action space in the environment. Each agent receives its own observation at each time step $t$, denoted as $s_{t,i}$, which includes a range of sensor inputs such as lidar data, vehicle dynamics, and lane information. Additionally, each agent $i$ has a set of neighbors $N_i$. The observation function $O$, defined by $O(s', a, o) = P(o \mid s', a)$, represents the set of conditional observation probabilities. $\gamma \in [0, 1]$ is the discount factor for the reward function $R$. The reward function $R$ rewards the distance and speed driven on the road and the final arrival at the destination; collisions and driving off the road incur corresponding penalties.

The action space $A$ is designed to accommodate the dynamics of each of the 40 agents within the environment. At any given time step $t$, the action $a_{t,i}$ for each agent $i$ is derived from a specific set of control parameters indexed by $j$. The composite action vector for each agent, denoted $a_{t,i} = (a_{t,i,0}, a_{t,i,1})$, comprises:

  • $a_{t,i,0} \in [-1, 1]$: Steering angle, with -1 and 1 indicating the maximum left and right turns, respectively.

  • $a_{t,i,1} \in [-1, 1]$: Throttle and brake control, where positive values signify forward acceleration, negative values denote braking, and reverse movement is enabled if the vehicle’s speed is less than or equal to zero.
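The two-dimensional action described above can be illustrated with a minimal sketch. The `Control` container, the `decode_action` helper, and the split into separate throttle and brake fields are hypothetical names and choices made for illustration; they are not taken from the paper or from the simulator's API.

```python
from dataclasses import dataclass


@dataclass
class Control:
    steering: float  # normalized, -1 and 1 are the two maximum turns
    throttle: float  # forward (or reverse) acceleration in [0, 1]
    brake: float     # braking strength in [0, 1]
    reverse: bool    # True when the vehicle is allowed to move backwards


def clip(value, low=-1.0, high=1.0):
    """Keep a raw policy output inside the action bounds."""
    return max(low, min(high, value))


def decode_action(action, speed):
    """Map a_{t,i} = (a_{t,i,0}, a_{t,i,1}) to a control command (illustrative)."""
    steering = clip(action[0])
    accel = clip(action[1])
    if accel >= 0.0:
        # Positive values signify forward acceleration.
        return Control(steering, accel, 0.0, False)
    if speed <= 0.0:
        # Negative value at zero or negative speed: reverse movement is enabled.
        return Control(steering, -accel, 0.0, True)
    # Negative value while moving forward: braking.
    return Control(steering, 0.0, -accel, False)
```

For example, `decode_action((0.0, -0.4), speed=2.0)` brakes with strength 0.4, while the same action at `speed=0.0` switches to reverse.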

IV Methods

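TABLE II: Performance of RL Algorithms

| Metric        | DDPG    | SAC    | PPO    |
|---------------|---------|--------|--------|
| success       | 3.31    | 10.49  | 40     |
| out_of_road   | 26061.2 | 69.14  | 14.39  |
| crash_vehicle | 16542.9 | 171.31 | 31.43  |
| velocity_mean | 1.87    | 0.82   | 4.11   |
| episode_steps | 1000    | 1000   | 478.92 |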

In the complex landscape of urban transportation, intersections stand out as critical environments, often serving as the primary source of traffic congestion. Within the framework of intelligent transportation systems, the coordinated management of intersection traffic emerges as a crucial component. This approach leverages vehicle-to-infrastructure and vehicle-to-vehicle communication capabilities, offering promising avenues for enhancing both road safety and traffic efficiency. Our study focuses on a particularly challenging scenario: the coordination of autonomous vehicles at a four-way intersection without traditional traffic signals.

IV-A Experiment Settings

TABLE III: Experimental Results: Impact of Safe Distance and Right-of-Way Rules

| Metric                              | baseline | safe_distance | right-of-way | safe_distance & right-of-way |
|-------------------------------------|----------|---------------|--------------|------------------------------|
| success                             | 40       | 40            | 40           | 40                           |
| out_of_road                         | 14.39    | 21.53         | 13.51        | 13.48                        |
| crash_vehicle                       | 31.43    | 14.62         | 31.77        | 12.61                        |
| velocity_mean                       | 4.11     | 3.97          | 3.9          | 3.97                         |
| velocity_mean_in_conflict_zone      | 2.47     | 3.01          | 2.23         | 3.31                         |
| acceleration                        | 0.6      | 0.55          | 0.64         | 0.59                         |
| acceleration_in_conflict_zone       | 0.43     | 0.53          | 0.43         | 0.6                          |
| arrive_steps                        | 277.96   | 285.46        | 295.67       | 283.02                       |
| episode_steps                       | 478.92   | 498.19        | 504.84       | 491.62                       |
| mean_conflict_zone_num              | 5.73     | 4.34          | 6.4          | 3.91                         |
| max_conflict_zone_num               | 12.12    | 8.72          | 13.02        | 8.2                          |
| conflict_zone_when_crash            | 8.34     | 5.86          | 8.9          | 5.25                         |
| front_end_distance                  | 0.18     | 0.2           | 0.17         | 0.2                          |
| limited_lidar                       | 0.56     | 0.64          | 0.55         | 0.67                         |
| limited_lidar_in_conflict_zone      | 0.43     | 0.53          | 0.41         | 0.56                         |
| front_end_distance_in_conflict_zone | 0.12     | 0.15          | 0.11         | 0.16                         |
| pair_distance                       | 37.76    | 41.82         | 36.4         | 45.24                        |

Enhancing Simulation Complexity and Precision

To ensure both the efficiency of training and the realism of the simulation environment, we selected MetaDrive (available at https://github.com/metadriverse/metadrive), a lightweight 3D traffic simulator optimized for the training and evaluation of MARL methods. MetaDrive is particularly suited for scalable deployment across distributed clusters, accommodating the increasing complexity of training tasks as the number of agents grows. This choice was driven by the need for extensive training data to support complex multi-agent cooperative tasks. Unlike other simulators that may offer richer visual effects or features [26, 27, 28], MetaDrive is designed to be lightweight, ensuring easier deployment on Linux servers. It provides reinforcement-learning-friendly APIs, such as encapsulated reward functions and observation feature extraction, facilitating efficient training. In our study, we configured the simulator with 40 vehicles in an intersection scenario without traffic lights to assess agent cooperation.

Unlike many simulation settings in which vehicles disappear after collisions, our setup keeps crashed vehicles on the road [18]. This design choice significantly increases the complexity of the environment and more accurately reflects real-world traffic conditions. Removing crashed vehicles can inadvertently simplify the traffic flow, potentially leading to less realistic training scenarios. By keeping collided vehicles in place, we ensure that agents must learn to navigate around obstacles and deal with the ongoing consequences of accidents, just as they would in real-world driving situations. Moreover, to achieve more precise control over the vehicles, we use continuous actions in simulation. Many studies in the field assume macro-level actions for vehicle control [19]. However, this assumption leads to several limitations, particularly in complex traffic scenarios. Macro-level actions often result in reduced adaptability and responsiveness, limiting the ability of autonomous vehicles to execute precise maneuvers necessary for safe and efficient driving.

Reinforcement Learning Algorithms

Reinforcement learning has demonstrated remarkable efficiency and effectiveness across a variety of domains [29], particularly in multi-agent settings [30]. The abundance of available algorithms provides a rich toolkit for addressing complex tasks such as autonomous driving. In this study, our goal is to evaluate how well different RL algorithms improve the safety of autonomous driving and to identify which algorithm achieves the best experimental results.

Given the varied performance of different RL algorithms across diverse tasks, it is crucial to undertake a comparative analysis to discern their strengths and weaknesses in specific scenarios. We employed a distributed training setup utilizing 4 GPUs and 256 CPUs to train three prominent reinforcement learning algorithms: Deep Deterministic Policy Gradient (DDPG), Soft Actor-Critic (SAC), and Proximal Policy Optimization (PPO) [31, 32, 33]. Training was conducted over 24 hours, simulating approximately 250 million data instances. Due to the distributed nature of our simulation across various CPUs within the same cluster, data communication was asynchronous.

The results, summarized in Table II, clearly demonstrate that PPO significantly outperformed DDPG and SAC in this task. Notably, PPO achieved a perfect success rate with all 40 vehicles passing the test, which was not matched by the other algorithms. Consequently, PPO will be referred to as the baseline algorithm in subsequent discussions within this paper.

Advanced Distributed Training for Collaborative Efficiency

To enhance the speed and efficiency of our experiments, we implemented a distributed training system as illustrated in Figure 2. This system consists of 256 CPU cores and 4 NVIDIA V100 GPUs.

Figure 2: Architecture of the distributed training system.

In this setup, two GPUs were dedicated to inference tasks while the other two were used for training. This separation allowed us to optimize the parallel processing of data generation and model training, overcoming the limitations of single-machine setups where inference and training often occur sequentially, leading to bottlenecks. As a result, our system was able to generate approximately 15,000 data points per minute and perform 256 gradient descent updates per minute, significantly improving the efficiency of our reinforcement learning experiments, especially in these complex tasks.

Unlike zero-sum environments where rewards are strictly competitive, cooperative tasks in MARL require more nuanced reward distributions to prevent overfitting due to disproportionately large rewards for some agents. To address this, we use reward normalization techniques within our OPTIMA framework. Given that data is centrally collected in the data buffer, we can apply batch normalization or moving averages to rewards or advantages during sampling. This method is superior to local normalization as it allows for significantly larger batch sizes [34], enhancing the stability and effectiveness of our training procedures.
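As an illustration of normalizing rewards with running statistics at a central buffer, the following is a minimal sketch. The `RunningNormalizer` class and its method names are hypothetical, and the use of Welford's online algorithm is an assumption made for this example; the paper does not specify which moving-average scheme is used.

```python
import math


class RunningNormalizer:
    """Tracks a running mean and variance (Welford's algorithm) over all rewards seen."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations from the mean
        self.eps = eps  # avoids division by zero for constant rewards

    def update(self, values):
        """Fold a batch of rewards (or advantages) into the running statistics."""
        for x in values:
            self.count += 1
            delta = x - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        if self.count < 2:
            return 1.0
        return math.sqrt(self.m2 / self.count)

    def normalize(self, values):
        """Rescale values to roughly zero mean and unit variance."""
        return [(x - self.mean) / (self.std + self.eps) for x in values]
```

Because the statistics live next to the central buffer, every sampled batch is rescaled against data from all actors rather than against one actor's local history.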

Asynchronous Optimization Trade-offs

While distributed reinforcement learning offers significant advantages in terms of scalability and efficiency, it also introduces certain challenges, particularly in the context of asynchronous optimization. The decoupling of various modules in the system, while beneficial for parallelization, can lead to version inconsistencies between the model being trained and the one used for inference. For instance, while the model $\Omega_t$ is being updated during training, the actors may be simultaneously conducting inference using a slightly older version of the model, $\Omega_{t-1}$, or an even earlier one, $\Omega_{t-n}$. This asynchronous nature results in a situation where the collected training data is always a few versions behind the policy $\Omega_t$ currently being trained. This stale data can affect the stability and convergence of the learning process and can even cause training to crash, so it requires careful consideration in the design and implementation of the distributed system.
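The version lag described here can be reproduced with a toy actor/learner loop. This is only a sketch using Python threads and an in-process queue; the function `run_training` and its parameters are hypothetical, and the real system distributes actors and learners across CPUs and GPUs rather than threads.

```python
import queue
import threading


def run_training(num_actors=4, steps_per_actor=64, batch_size=32):
    """Toy asynchronous loop; returns (learner_updates, version_gaps)."""
    data_queue = queue.Queue()
    lock = threading.Lock()
    state = {"version": 0}  # stands in for the learner's model version

    def read_version():
        with lock:
            return state["version"]

    def actor():
        # Each sample is tagged with the model version used to produce it.
        for _ in range(steps_per_actor):
            data_queue.put(read_version())

    threads = [threading.Thread(target=actor) for _ in range(num_actors)]
    for thread in threads:
        thread.start()

    total = num_actors * steps_per_actor
    consumed, gaps = 0, []
    while consumed + batch_size <= total:
        batch = [data_queue.get() for _ in range(batch_size)]
        consumed += batch_size
        current = read_version()
        # Gap between the learner's version and the version that generated each sample.
        gaps.extend(current - sample_version for sample_version in batch)
        with lock:
            state["version"] += 1  # one simulated gradient update

    for thread in threads:
        thread.join()
    return state["version"], gaps
```

Every recorded gap is non-negative, and any positive gap marks a sample generated by an older model than the one being updated.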

Moreover, in practice, we cannot simply discard this stale data during the training process. There are two primary reasons for this. First, discarding data reduces data utilization efficiency, potentially leading to insufficient data for training, where the learner constantly waits for new data. Second, and more critically, in the actual training process, data is not sent from the actor to the learner until it reaches a certain horizon length. As a result, staleness often affects the last few crucial steps of an episode, such as when a CAV reaches its destination, which are generally the most valuable learning experiences. Furthermore, due to the complex environment reset and warm-up times, the time span for collecting this crucial data is extended. Therefore, striking a balance between the amount of exploratory data for training and managing stale data becomes particularly crucial in this context.

To address these challenges, we propose a straightforward and universally applicable method that doesn’t require modifying the loss function, making it suitable for any RL algorithm. Our approach involves managing stale data in the sampling data buffer, ensuring a certain level of data freshness. Specifically, in our system settings, we work with data batches of size 4096 and a horizon length of 128, resulting in a data set of $4096 \times 128$ elements. For this set, we set a maximum allowable average gap of 8 versions between the data’s policy version $\Omega_{t-n}$ and the current learner model $\Omega_t$. If this threshold is exceeded, we discard the older data from the queue. We calculated that this approach results in discarding approximately 13% of the data, which we consider acceptable without significantly impacting the training effectiveness. This method strikes a balance between data utilization and training stability, addressing the challenges of asynchronous optimization in a practical and efficient manner.
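One way to read the freshness rule is sketched below. The function `drop_stale` and the buffer layout (a FIFO of `(policy_version, segment)` pairs, oldest first) are assumptions for illustration; the paper states only the threshold of 8 versions on the average gap and that older data is discarded from the queue.

```python
from collections import deque


def drop_stale(buffer, current_version, max_avg_gap=8.0):
    """Discard the oldest segments until the average version gap is within the limit.

    buffer: deque of (policy_version, segment) pairs, oldest first (assumed layout).
    Returns the number of discarded segments.
    """
    total_gap = sum(current_version - version for version, _ in buffer)
    dropped = 0
    while buffer and total_gap / len(buffer) > max_avg_gap:
        version, _ = buffer.popleft()
        total_gap -= current_version - version
        dropped += 1
    return dropped


# Usage: six segments generated by policy versions 0..20, learner at version 20.
buffer = deque((v, f"segment-{v}") for v in [0, 2, 10, 18, 19, 20])
dropped = drop_stale(buffer, current_version=20)
```

In this example the average gap starts at 8.5, so the single oldest segment is discarded and the remaining five have an average gap of 6.2. The loss function is untouched, which is why the rule applies to any RL algorithm. With the stated settings, one full data set holds 4096 × 128 = 524,288 elements.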

IV-B Policy Coordination: Decentralized vs Centralized

Building upon our distributed training framework, we now turn our attention to a crucial aspect of multi-agent systems: policy coordination. To study the trade-off between efficiency and safety for decentralized intelligent vehicles and centralized interconnected intelligent vehicles, we consider two approaches:

Centralized Training with Decentralized Execution (CTDE): In standard CTDE [35], agents are trained using global information but execute actions based only on local observations. This approach does not involve pooling of hidden variables during execution.

Centralized Training with Centralized Execution (CTCE): We implement a fully centralized method that extends beyond traditional CTDE. We use centralized communication and policy coordination through the pooling of hidden variables across different policies in the set $\Omega$, where $\Omega = \{\Omega_1, \Omega_2, \ldots, \Omega_n\}$, each represented by a multilayer perceptron (MLP) network [36]. This pooling operation allows for information sharing even during execution, distinguishing our method from standard CTDE.
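The pooling of hidden variables can be sketched in a few lines of NumPy. Everything concrete here is an assumption made for illustration: the mean as the pooling operator, a single shared encoder for all agents, the layer sizes, and the names `init_params` and `ctce_forward`. The paper specifies only that hidden variables of MLP policies are pooled and shared during execution.

```python
import numpy as np


def init_params(obs_dim, hidden_dim, action_dim=2, seed=0):
    """Random weights for a one-hidden-layer policy (illustrative sizes and scales)."""
    rng = np.random.default_rng(seed)
    return {
        "w_enc": rng.normal(0.0, 0.5, (obs_dim, hidden_dim)),
        "b_enc": np.zeros(hidden_dim),
        "w_out": rng.normal(0.0, 0.1, (2 * hidden_dim, action_dim)),
        "b_out": np.zeros(action_dim),
    }


def ctce_forward(obs, params):
    """obs: (n_agents, obs_dim) -> actions: (n_agents, 2), each in [-1, 1]."""
    # Per-agent hidden variables computed from local observations.
    hidden = np.tanh(obs @ params["w_enc"] + params["b_enc"])
    # Pool hidden variables across agents (mean pooling is an assumption).
    pooled = hidden.mean(axis=0, keepdims=True)
    pooled = np.broadcast_to(pooled, hidden.shape)
    # Each agent acts on its own hidden state plus the shared pooled summary.
    joint = np.concatenate([hidden, pooled], axis=1)
    return np.tanh(joint @ params["w_out"] + params["b_out"])
```

Because the pooled term depends on every agent, changing one vehicle's observation shifts the actions of the others; dropping that term would leave each action dependent on local observations only, as in decentralized execution.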

While autonomous vehicles are equipped with lidar sensors for local environment perception, they lack information about distant vehicles. By comparing these two approaches, we aim to evaluate the trade-offs in efficiency and safety between decentralized and centralized decision-making in autonomous vehicle scenarios.

IV-C Rule-Based Coordination

As we refine the methods of distributed training to achieve optimal performance, it is equally crucial to align the reward functions closely with real-world scenarios. Effective cooperation among vehicles not only depends on individual performance but also on how well the system incentivizes safe and cooperative behavior. This leads us to explore rule-based coordination strategies that address common traffic challenges more realistically.

Safe Distance

The objective of this experiment is to explore the impact of maintaining safety distances in multi-agent environments. Maintaining an appropriate distance between vehicles is essential for safe driving operations. In our experiment, the front 10 lidar points of the vehicle are designated as the sensing area, covering a 50-degree angle in front.

Figure 3: The blue vehicle is penalized for being too close to the red vehicle in front of it. However, surrounding green vehicles, due to either sufficient distance or not being directly ahead of the blue vehicle, do not trigger a penalty for the blue vehicle.

When a vehicle ahead is detected within a distance of less than 5 meters, a penalty is imposed on the vehicle as illustrated in Figure 3. The safe distance penalty is calculated using the formula:

$$R_{\text{sd},i} = -0.5 \times \left( \frac{5\,\text{m} - d_{i,j}(t)}{5\,\text{m}} \right) \tag{1}$$

where $d_{i,j}(t)$ represents the distance between vehicle $i$ and vehicle $j$ at time $t$. This setup aims to simulate the importance of maintaining safe distances in real driving and encourages agents to learn to avoid dangerously close behaviors through negative incentives. Ultimately, the calculated penalty $R_{\text{sd}}$ is added into the existing local reward function $R_{local}$, enhancing the model’s ability to train agents on the significance of maintaining safe distances.
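Equation (1) translates directly into code. The sketch below assumes distances are given in meters and that the nearest of the front lidar readings stands in for $d_{i,j}(t)$; the function names and the list-of-readings input are hypothetical.

```python
def safe_distance_penalty(distance_m, threshold_m=5.0, scale=0.5):
    """Penalty from Eq. (1): zero at or beyond 5 m, down to -0.5 at contact."""
    if distance_m >= threshold_m:
        return 0.0
    distance_m = max(distance_m, 0.0)  # guard against negative sensor readings
    return -scale * (threshold_m - distance_m) / threshold_m


def front_penalty(front_lidar_m):
    """Apply the penalty to the closest obstacle seen by the front lidar points."""
    return safe_distance_penalty(min(front_lidar_m))
```

For example, a vehicle 2.5 m behind the one in front receives a penalty of -0.25, and the penalty grows linearly to -0.5 as the gap closes to zero.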

Right-of-way Responsibility

In our simulated driving environment, a common issue encountered is the dual penalization of vehicles involved in collisions, irrespective of the actual traffic fault. This approach often contradicts real-world traffic regulations, where typically only one party is deemed primarily responsible for the incident. This discrepancy between simulated and real-world scenarios prompted us to reevaluate the fairness and realism of our penalty system.

In reality, the right-of-way rules dictate that not all vehicles involved in an accident should bear equal responsibility [37]. This principle is well established in traffic law but less commonly represented accurately in reinforcement learning simulations. To address this, we implemented a simple rule-based heuristic to determine the primary responsible vehicle in multi-vehicle collisions, defined as $Z(i,j)$.

$$R_{\text{rc},i} = \begin{cases} 2 \cdot R_{\text{collision},i} & \text{if } Z(i,j) \\ 0 & \text{otherwise} \end{cases} \tag{2}$$

This method leverages known traffic rules and right-of-way principles to assign fault more accurately, thereby aligning our simulation more closely with real-world legal frameworks. This adjustment not only enhances the realism of our model but also allows for more nuanced assessments of autonomous driving policies under various traffic conditions.
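The reassignment in Eq. (2) can be sketched as follows. The heuristic $Z(i,j)$ itself is not specified in the text, so it is passed in as a caller-supplied predicate `is_responsible`; that parameter, the function names, and the collision reward value in the usage line are hypothetical.

```python
def right_of_way_penalty(collision_reward, responsible):
    """Eq. (2): the responsible vehicle gets twice the collision reward, the other none."""
    return 2.0 * collision_reward if responsible else 0.0


def assign_collision_penalties(pair, collision_reward, is_responsible):
    """Split the penalty for a colliding pair (i, j).

    is_responsible(i, j) plays the role of Z(i, j): True if vehicle i is
    primarily at fault in its collision with vehicle j.
    """
    i, j = pair
    return {
        i: right_of_way_penalty(collision_reward, is_responsible(i, j)),
        j: right_of_way_penalty(collision_reward, is_responsible(j, i)),
    }


# Usage with a placeholder rule that blames vehicle "b".
rewards = assign_collision_penalties(
    ("a", "b"), collision_reward=-5.0, is_responsible=lambda i, j: i == "b"
)
```

When exactly one vehicle is held responsible, the total penalty across the pair is the same as when both are penalized once; only its distribution changes.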

TABLE IV: Performance Comparison between CTDE and CTCE Approaches

| Metric                              | CTDE   | CTCE   |
|-------------------------------------|--------|--------|
| success                             | 40     | 40     |
| out_of_road                         | 14.39  | 10.32  |
| crash_vehicle                       | 31.43  | 28.37  |
| velocity_mean                       | 4.11   | 3.64   |
| velocity_mean_in_conflict_zone      | 2.47   | 2.11   |
| acceleration                        | 0.6    | 0.7    |
| acceleration_in_conflict_zone       | 0.43   | 0.49   |
| arrive_steps                        | 277.96 | 313.76 |
| episode_steps                       | 478.92 | 534.72 |
| mean_conflict_zone_num              | 5.73   | 6.33   |
| max_conflict_zone_num               | 12.12  | 13.2   |
| front_end_distance                  | 0.18   | 0.18   |
| limited_lidar                       | 0.56   | 0.55   |
| limited_lidar_in_conflict_zone      | 0.43   | 0.41   |
| front_end_distance_in_conflict_zone | 0.12   | 0.11   |
| pair_distance                       | 37.76  | 37.01  |

V Experiments And Results

In this section, we present the evaluations of our experiments. Our training lasted 24 hours, and we chose the best model to present. Because the SVO experiment involves many angles, we trained for only 12 hours there and took the best model. All models were evaluated 100 times to ensure the reliability of the results. Finally, we discuss the impact of different conditions on safety and efficiency.

V-A Performance Indicators

To understand the trade-off between safety and efficiency, we consider four main indicators. For safety, these indicators are the number of collisions and the number of vehicles that leave the road. For efficiency, we focus on the total number of vehicles that successfully arrive at the destination and the average speed of the vehicles. Note that, since vehicles do not disappear after collisions in our simulation setting, multiple collisions can occur between the same vehicles within a short period. We also monitor situations such as obstructions in the vehicle’s field of vision, distance to the front vehicle, and vehicle density. In this paper, due to environmental constraints, we use lidar obstruction to represent visual obstructions. Additionally, we pay close attention to vehicle performance in the central area, or the conflict zone, where most collisions occur. For efficiency, we also track the average number of steps vehicles take to arrive and the number of steps taken by the last vehicle to arrive, the latter corresponding to the episode steps in reinforcement learning.

V-B Centralized Policy

CTDE and CTCE share the same objective function, aiming to balance safety and efficiency in autonomous vehicle coordination. As demonstrated in Table IV, while CTCE exhibits improved safety with fewer out-of-road incidents and slightly fewer crashes, it does so at the expense of reduced velocity and extended episode steps. This suggests that CTCE may adopt a more cautious strategy, prioritizing safety over speed. Conversely, CTDE tends to prioritize efficiency, evidenced by higher velocities and shorter episode completions, though this approach comes with a slight increase in safety risks, as indicated by more frequent out-of-road incidents and crashes. The increased conflict zone numbers in CTCE suggest that it may be more effective at coordinating multiple vehicles through intersections simultaneously, possibly reflecting a more sophisticated approach to traffic management, despite a slightly slower velocity.

V-C Rule-Based Reward Function Policies

Table III presents the results of experiments with rule-based rewards, comparing the baseline model against models that add a penalty for failing to maintain a safe distance, a penalty for violating right-of-way rules, and both penalties combined.
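To make the idea of rule-based penalties concrete, the sketch below shows one way such terms might be subtracted from a base reward. It is an illustration only: the function name, the linear shortfall scaling, and the penalty coefficients are assumptions and do not reproduce the exact reward used in the experiments.

```python
from typing import Optional


def shaped_reward(
    base_reward: float,
    gap_to_front: Optional[float],
    safe_gap: float,
    violated_right_of_way: bool,
    distance_penalty: float = 1.0,      # illustrative coefficient (assumption)
    right_of_way_penalty: float = 1.0,  # illustrative coefficient (assumption)
) -> float:
    """Return the base reward minus hypothetical rule-based penalties."""
    if safe_gap <= 0:
        raise ValueError("safe_gap must be positive")

    reward = base_reward

    # Safe-distance rule: penalize in proportion to how far the gap to the
    # front vehicle falls short of the safe gap (None means no front vehicle).
    if gap_to_front is not None and gap_to_front < safe_gap:
        shortfall = (safe_gap - gap_to_front) / safe_gap
        reward -= distance_penalty * shortfall

    # Right-of-way rule: flat penalty whenever the rule is violated.
    if violated_right_of_way:
        reward -= right_of_way_penalty

    return reward
```

Setting either coefficient to zero disables that rule, so the baseline, single-penalty, and combined configurations can all be expressed with the same function.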

Adding the safe-distance penalty alone reduced crash incidents from 31.43 to 14.62, demonstrating that safety-aware behavior can be embedded effectively through penalty terms in the reward. When combined with the right-of-way rules, performance improved further: crash incidents fell to 12.61 and out-of-road incidents to 13.48. The mean conflict zone number decreased to 3.91, indicating enhanced navigational safety and efficiency. Notably, these safety improvements came with only a slight reduction in vehicle efficiency.
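The crash figures quoted above can also be read as relative reductions against the baseline. The short calculation below uses only those three numbers; the helper name relative_reduction is introduced here for illustration.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a metric decreased relative to its starting value."""
    return (before - after) / before


baseline_crashes = 31.43       # baseline model
safe_distance_crashes = 14.62  # safe-distance penalty only
combined_crashes = 12.61       # safe-distance penalty plus right-of-way rules

print(f"{relative_reduction(baseline_crashes, safe_distance_crashes):.1%}")
print(f"{relative_reduction(baseline_crashes, combined_crashes):.1%}")
```

This corresponds to a reduction of roughly 53.5% with the safe-distance penalty alone and roughly 59.9% with both rules combined, so most of the gain comes from the safe-distance term.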

These results show that rule-based reward functions can significantly influence autonomous vehicle behavior, promoting safer and more efficient driving practices. While not necessarily consistent with all real-world traffic rules, this approach demonstrates potential for improving traffic flow and safety outcomes in simulated environments.

VI CONCLUSIONS

This paper has explored strategies to enhance the safety and efficiency of autonomous vehicle systems through a novel distributed training framework, OPTIMA. The results demonstrate that integrating these strategies can significantly influence vehicle behavior, promoting safer and more efficient traffic management in simulated environments, with notable improvements for CAVs in challenging multi-agent traffic situations such as intersections. While this study has focused on multi-agent systems using homogeneous policies, future work could investigate heterogeneous policies, exploring how different policies interact and whether they can further enhance the overall safety and efficiency of CAVs. Such studies could provide deeper insights into the dynamic interactions within multi-agent systems and lead to more robust autonomous transportation solutions.

References

  • [1] National Highway Traffic Safety Administration, “National motor vehicle crash causation survey: Report to Congress,” National Highway Traffic Safety Administration Technical Report DOT HS 811 059, 2008.
  • [2] H. Singh, B. Weng, S. J. Rao, and D. Elsasser, “A diversity analysis of safety metrics comparing vehicle performance in the lead-vehicle interaction regime,” IEEE Transactions on Intelligent Transportation Systems, 2023.
  • [3] S. Shalev-Shwartz, S. Shammah, and A. Shashua, “On a formal model of safe and scalable self-driving cars,” arXiv preprint arXiv:1708.06374, 2017.
  • [4] A. Abbas-Turki, Y. Mualla, N. Gaud, D. Calvaresi, W. Du, A. Lombard, M. Dridi, and A. Koukam, “Autonomous intersection management: Optimal trajectories and efficient scheduling,” Sensors, vol. 23, no. 3, p. 1509, 2023.
  • [5] A. Shetty, M. Yu, A. Kurzhanskiy, O. Grembek, H. Tavafoghi, and P. Varaiya, “Safety challenges for autonomous vehicles in the absence of connectivity,” Transportation research part C: emerging technologies, vol. 128, p. 103133, 2021.
  • [6] W. Zhang, M. Elmahgiubi, K. Rezaee, B. Khamidehi, H. Mirkhani, F. Arasteh, C. Li, M. A. Kaleem, E. R. Corral-Soto, D. Sharma et al., “Analysis of a modular autonomous driving architecture: The top submission to carla leaderboard 2.0 challenge,” arXiv preprint arXiv:2405.01394, 2024.
  • [7] L. Wang, H. Zhong, W. Ma, M. Abdel-Aty, and J. Park, “How many crashes can connected vehicle and automated vehicle technologies prevent: A meta-analysis,” Accident Analysis & Prevention, vol. 136, p. 105299, 2020.
  • [8] S. Teng, X. Hu, P. Deng, B. Li, Y. Li, Y. Ai, D. Yang, L. Li, Z. Xuanyuan, F. Zhu et al., “Motion planning for autonomous driving: The state of the art and future perspectives,” IEEE Transactions on Intelligent Vehicles, vol. 8, no. 6, pp. 3692–3711, 2023.
  • [9] R. Song, Y. Ai, B. Tian, L. Chen, F. Zhu, and F. Yao, “Msfanet: A light weight object detector based on context aggregation and attention mechanism for autonomous mining truck,” IEEE Transactions on Intelligent Vehicles, vol. 8, no. 3, pp. 2285–2295, 2022.
  • [10] L. Gong, Y. Wu, B. Gao, Y. Sun, X. Le, and C. Liu, “Real-time dynamic planning and tracking control of auto-docking for efficient wireless charging,” IEEE Transactions on Intelligent Vehicles, vol. 8, no. 3, pp. 2123–2134, 2022.
  • [11] R. Valiente, M. Razzaghpour, B. Toghi, G. Shah, and Y. P. Fallah, “Prediction-aware and reinforcement learning-based altruistic cooperative driving,” IEEE Transactions on Intelligent Transportation Systems, 2023.
  • [12] G. Wang, J. Hu, Z. Li, and L. Li, “Harmonious lane changing via deep reinforcement learning,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 5, pp. 4642–4650, 2021.
  • [13] B. R. Kiran, I. Sobh, V. Talpaert, P. Mannion, A. A. Al Sallab, S. Yogamani, and P. Pérez, “Deep reinforcement learning for autonomous driving: A survey,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 6, pp. 4909–4926, 2021.
  • [14] G. Li, Y. Yang, S. Li, X. Qu, N. Lyu, and S. E. Li, “Decision making of autonomous vehicles in lane change scenarios: Deep reinforcement learning approaches with risk awareness,” Transportation research part C: emerging technologies, vol. 134, p. 103452, 2022.
  • [15] J. Zhang, C. Chang, X. Zeng, and L. Li, “Multi-agent drl-based lane change with right-of-way collaboration awareness,” IEEE Transactions on Intelligent Transportation Systems, vol. 24, no. 1, pp. 854–869, 2022.
  • [16] Q. Li, Z. Peng, L. Feng, Q. Zhang, Z. Xue, and B. Zhou, “Metadrive: Composing diverse driving scenarios for generalizable reinforcement learning,” IEEE transactions on pattern analysis and machine intelligence, vol. 45, no. 3, pp. 3461–3475, 2022.
  • [17] X. Lyu, Y. Xiao, B. Daley, and C. Amato, “Contrasting centralized and decentralized critics in multi-agent reinforcement learning,” arXiv preprint arXiv:2102.04402, 2021.
  • [18] Z. Peng, Q. Li, K. M. Hui, C. Liu, and B. Zhou, “Learning to simulate self-driven particles system with coordinated policy optimization,” Advances in Neural Information Processing Systems, vol. 34, pp. 10784–10797, 2021.
  • [19] S. Han, S. Zhou, J. Wang, L. Pepin, C. Ding, J. Fu, and F. Miao, “A multi-agent reinforcement learning approach for safe and efficient behavior planning of connected autonomous vehicles,” arXiv preprint arXiv:2003.04371, 2020.
  • [20] S. Huang, R. F. J. Dossa, C. Ye, J. Braga, D. Chakraborty, K. Mehta, and J. G. Araújo, “Cleanrl: High-quality single-file implementations of deep reinforcement learning algorithms,” Journal of Machine Learning Research, vol. 23, no. 274, pp. 1–18, 2022.
  • [21] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann, “Stable-baselines3: Reliable reinforcement learning implementations,” Journal of Machine Learning Research, vol. 22, no. 268, pp. 1–8, 2021.
  • [22] E. Liang, R. Liaw, R. Nishihara, P. Moritz, R. Fox, J. Gonzalez, K. Goldberg, and I. Stoica, “Ray rllib: A composable and scalable reinforcement learning library,” arXiv preprint arXiv:1712.09381, 2017.
  • [23] E. Liang, R. Liaw, R. Nishihara, P. Moritz, R. Fox, K. Goldberg, J. Gonzalez, M. Jordan, and I. Stoica, “Rllib: Abstractions for distributed reinforcement learning,” in International conference on machine learning.   PMLR, 2018, pp. 3053–3062.
  • [24] Y. Lv and X. Ren, “Approximate nash solutions for multiplayer mixed-zero-sum game with reinforcement learning,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 49, no. 12, pp. 2739–2750, 2018.
  • [25] C. Amato, G. Chowdhary, A. Geramifard, N. K. Üre, and M. J. Kochenderfer, “Decentralized control of partially observable markov decision processes,” in 52nd IEEE Conference on Decision and Control, 2013, pp. 2398–2405.
  • [26] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “Carla: An open urban driving simulator,” in Conference on robot learning.   PMLR, 2017, pp. 1–16.
  • [27] S. Shah, D. Dey, C. Lovett, and A. Kapoor, “Airsim: High-fidelity visual and physical simulation for autonomous vehicles,” in Field and Service Robotics: Results of the 11th International Conference.   Springer, 2018, pp. 621–635.
  • [28] D. Krajzewicz, “Traffic simulation with sumo–simulation of urban mobility,” Fundamentals of traffic simulation, pp. 269–293, 2010.
  • [29] R. S. Sutton and A. G. Barto, “Reinforcement learning: an introduction,” A Bradford Book, 2018.
  • [30] R. Lowe, Y. I. Wu, A. Tamar, J. Harb, O. Pieter Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” Advances in neural information processing systems, vol. 30, 2017.
  • [31] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
  • [32] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel et al., “Soft actor-critic algorithms and applications,” arXiv preprint arXiv:1812.05905, 2018.
  • [33] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
  • [34] T. Schaul, G. Ostrovski, I. Kemaev, and D. Borsa, “Return-based scaling: Yet another normalisation trick for deep rl,” arXiv preprint arXiv:2105.05347, 2021.
  • [35] G. Chen, “A new framework for multi-agent reinforcement learning–centralized training and exploration with decentralized execution via policy distillation,” arXiv preprint arXiv:1910.09152, 2019.
  • [36] H. Taud and J.-F. Mas, “Multilayer perceptron (mlp),” Geomatic approaches for modeling land change scenarios, pp. 451–455, 2018.
  • [37] L. Li, C. Zhao, X. Wang, Z. Li, L. Chen, Y. Lv, N.-N. Zheng, and F.-Y. Wang, “Three principles to determine the right-of-way for avs: Safe interaction with humans,” IEEE transactions on intelligent transportation systems, vol. 23, no. 7, pp. 7759–7774, 2021.