LTLDoG: Satisfying Temporally-Extended Symbolic Constraints for Safe Diffusion-based Planning
Abstract
Operating effectively in complex environments while complying with specified constraints is crucial for the safe and successful deployment of robots that interact with and operate around people. In this work, we focus on generating long-horizon trajectories that adhere to static and temporally-extended constraints/instructions at test time. We propose a data-driven diffusion-based framework, LTLDoG, that modifies the inference steps of the reverse process given an instruction specified using finite linear temporal logic (LTL$_f$). LTLDoG leverages a satisfaction value function on LTL$_f$ and guides the sampling steps using its gradient field. This value function can also be trained to generalize to new instructions not observed during training, enabling flexible test-time adaptability. Experiments in robot navigation and manipulation illustrate that the method is able to generate trajectories that satisfy formulae that specify obstacle avoidance and visitation sequences. Code and supplementary material are available online at https://github.com/clear-nus/ltldog.
I Introduction
Recent methodologies [1, 2, 3] utilizing data-driven diffusion models [4, 5, 6] have shown remarkable performance in generating robot behaviors across a wide range of tasks. Thanks to their ability to model complex distributions, these methods have surpassed several leading offline reinforcement learning techniques and classical model-based trajectory optimization methods, especially in long-horizon decision-making tasks [1, 3]. However, while conventional diffusion models excel at learning from training datasets, they lack the ability to adapt to new objectives or comply with new constraints during deployment. This shortcoming can lead to unsafe behaviors, posing risks to humans, robots, and their surrounding environment.
In view of this limitation, there has been very recent work on diffusing safe trajectories. Xiao et al. [7] integrated a dynamics model into the denoising diffusion process and incorporated a class of Control Barrier Functions (CBF) to meet safety criteria. Botteghi et al. [8] approached the issue by embedding both safety and reward considerations into a constrained optimization framework, employing CBF constraints as labels for classifier guidance. However, these approaches primarily address static environmental constraints. For example, while they can maneuver around obstacles on a local scale, they fail to comply with more complex temporally-extended directives such as “avoid the kitchen until you are clean”.
In this work, we propose an alternative approach to flexible trajectory planning with diffusion models, designed to satisfy both static safety requirements and temporal constraints. The core idea is to plan with diffusion models to satisfy finite linear temporal logic (LTL$_f$) formulae [9]. LTL$_f$ offers the ability to define a broad spectrum of instructions/constraints that might emerge during deployment. For example, LTL$_f$ can describe a visitation order of different objects and locations. The use of propositional logic operators, such as not, facilitates the delineation of safe regions within the state space.
We develop LTLDoG (LTL Diffusion-orienting Guidance, pronounced “Little Dog”), a posterior-sampling based diffusion framework that accommodates finite LTL formulae at test time. We present two variants of LTLDoG. Our main method, LTLDoG-S, can be applied to the generation of finite-length trajectories in robot tasks where the labeling function for propositional events is differentiable. LTLDoG-S employs a differentiable formula checker in conditional sampling: specifically, we modify the reverse process to condition on the criterion that the final (predicted) trajectory satisfies a given formula. For cases where a differentiable labeling function is unavailable, we propose LTLDoG-R, which uses a trained neural-symbolic evaluator for posterior sampling. Notably, neither variant requires collecting expert demonstrations for every potential instruction. Both retain the temporal compositionality and local consistency properties associated with diffusion models [1]: as long as the dataset contains a diverse set of paths, they can potentially “stitch together” snippets of trajectories from the training data to generate plans for unseen formulae.
Experiments on two benchmark environments (long-horizon planning for navigation and policy learning for manipulation) demonstrate that LTLDoG is able to generate trajectories that satisfy feasible safety and temporal constraints. We find that our methods can re-plan alternative paths at a high level based on the given instruction. Moreover, real robot experiments show that the generated trajectories can be successfully transferred to a quadruped robot. In summary, this paper makes three key contributions:
• A conditional trajectory sampling approach designed for LTL$_f$ instructions that leverages pre-trained diffusion models;
• A regressor-guidance neural network for diffusion that generalizes to novel LTL$_f$ formulae within a given template structure;
• Experimental results on benchmark problems and real world demonstrations that validate the effectiveness of planning with safety and temporal constraints.
From a broader perspective, LTLDoG is the first method that fuses symbolic model checking (using LTL$_f$) with expressive diffusion-based generative models. We hope our results lay the groundwork towards performant, yet safer and more trustworthy robots.
II Preliminaries and Notation
In this work, our focus is to extend diffusion-based planning methods towards generating trajectories that comply with specified LTL$_f$ formulae. Here, we provide a concise introduction to diffusion methods in the context of planning and finite linear temporal logic.
II-A Planning with Diffusion
Many tasks in planning, reinforcement learning, and imitation learning require generating trajectories under some specific objective. Let $\mathcal{S}$ and $\mathcal{A}$ denote the state and action space, respectively. We use $\tau = (s_0, a_0, s_1, a_1, \dots, s_T, a_T)$ to refer to a trajectory, where $T$ is the planning horizon. The environment transitions to a new state $s_{t+1}$ when an agent executes action $a_t$ at state $s_t$. Let the abbreviation $\mathcal{J}(\tau)$ denote the objective value function conditioned on a goal state $s_g$ at which the trajectory must terminate; examples include (discounted) cumulative rewards in reinforcement learning, cumulative error of actions in imitation learning, or cost for safety constraints.
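Diffusion-based planning methods directly generate partial or entire trajectories by using diffusion models pre-trained on a dataset of trajectories. Let $q(\tau^0)$ denote the distribution of trajectories in the dataset, where $\tau^0$ represents a noiseless trajectory. Given an $N$-step discrete approximation of the forward diffusion process $q(\tau^i \mid \tau^{i-1})$ that slowly corrupts data by adding prespecified noise, diffusion models learn an iterative denoising procedure by approximating the score function $\nabla_{\tau^i} \log q(\tau^i)$ using a step-dependent neural network $s_\theta(\tau^i, i)$ trained with denoising score matching [10]:
$$\theta^* = \arg\min_\theta \, \mathbb{E}_{i,\, \tau^0,\, \tau^i} \Big[ \big\| s_\theta(\tau^i, i) - \nabla_{\tau^i} \log q(\tau^i \mid \tau^0) \big\|_2^2 \Big], \tag{1}$$
in which $i \in \{1, \dots, N\}$ is the diffusion timestep, and $\tau^i \sim q(\tau^i \mid \tau^0)$ is the trajectory corrupted with noise. Throughout the paper, we adopt Denoising Diffusion Probabilistic Models (DDPM) [5] as the sampling method, where $q(\tau^i \mid \tau^0) = \mathcal{N}\big(\sqrt{\bar{\alpha}_i}\, \tau^0, (1 - \bar{\alpha}_i) I\big)$, $\bar{\alpha}_i = \prod_{j=1}^{i} (1 - \beta_j)$, and $\{\beta_i\}_{i=1}^{N}$ is a sequence of positive noise scales $0 < \beta_1, \dots, \beta_N < 1$.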
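To make the forward corruption and the regression target in (1) concrete, the following is a minimal NumPy sketch. It assumes a linear noise schedule and treats a trajectory as a plain array; the helper names `make_schedule`, `corrupt`, and `dsm_loss` are hypothetical and are not taken from the released code.

```python
import numpy as np

def make_schedule(num_steps, beta_min=1e-4, beta_max=2e-2):
    """Linear beta schedule (an assumption) and its cumulative products alpha_bar."""
    betas = np.linspace(beta_min, beta_max, num_steps)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def corrupt(tau0, i, alpha_bars, rng):
    """Sample tau^i ~ N(sqrt(alpha_bar_i) tau^0, (1 - alpha_bar_i) I); i is 1-indexed."""
    a_bar = alpha_bars[i - 1]
    eps = rng.standard_normal(tau0.shape)
    tau_i = np.sqrt(a_bar) * tau0 + np.sqrt(1.0 - a_bar) * eps
    # Score of the Gaussian kernel q(tau^i | tau^0): the regression target in (1).
    target_score = -(tau_i - np.sqrt(a_bar) * tau0) / (1.0 - a_bar)
    return tau_i, target_score

def dsm_loss(score_fn, tau0, i, alpha_bars, rng):
    """Single-sample estimate of the denoising score matching objective."""
    tau_i, target = corrupt(tau0, i, alpha_bars, rng)
    return float(np.mean((score_fn(tau_i, i) - target) ** 2))
```

In practice `score_fn` would be a neural network and the loss would be averaged over minibatches of trajectories and uniformly sampled timesteps.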
II-B Linear Temporal Logic (LTL)
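Given a finite set of propositional symbols $\mathcal{P}$, the formula set $\Psi$ of LTL$_f$ contains formulas recursively defined in Backus-Naur form as follows [11, 12]:
$$\varphi ::= p \mid \neg \psi \mid \psi_1 \wedge \psi_2 \mid \bigcirc \psi \mid \psi_1 \,\mathsf{U}\, \psi_2,$$
where $p \in \mathcal{P}$ and $\psi, \psi_1, \psi_2 \in \Psi$. Intuitively, the formula $\bigcirc \psi$ (next $\psi$) is satisfied if $\psi$ is satisfied at the next time step. $\psi_1 \,\mathsf{U}\, \psi_2$ ($\psi_1$ until $\psi_2$) is satisfied if $\psi_1$ is satisfied until $\psi_2$ is satisfied, and $\psi_2$ is satisfied by the end of the sequence. From these, other commonly used logical connectives and temporal operators can be defined according to the following equivalences: $\psi_1 \vee \psi_2 := \neg(\neg \psi_1 \wedge \neg \psi_2)$, $\Diamond \psi := \top \,\mathsf{U}\, \psi$ (eventually $\psi$) and $\Box \psi := \neg \Diamond \neg \psi$ (always $\psi$). The symbols $\top$ and $\bot$ can also be in the formula set, defined by $\top := \psi \vee \neg \psi$ and $\bot := \neg \top$.
In contrast to propositional logic, these formulas are evaluated over finite sequences of observations $\sigma = \langle \sigma_0, \sigma_1, \dots, \sigma_T \rangle$ (i.e., truth assignments to the propositional symbols in $\mathcal{P}$), where $\sigma_t \subseteq \mathcal{P}$ and $p \in \sigma_t$ iff proposition $p$ is satisfied at time step $t$. $\top$ ($\bot$) is always satisfied (not satisfied) by any assignment. Formally, $\sigma$ satisfies $\varphi$ at time $t \geq 0$, denoted by $\langle \sigma, t \rangle \models \varphi$, as follows:
• $\langle \sigma, t \rangle \models p$ iff $p \in \sigma_t$, where $p \in \mathcal{P}$
• $\langle \sigma, t \rangle \models \neg \psi$ iff $\langle \sigma, t \rangle \not\models \psi$
• $\langle \sigma, t \rangle \models \psi_1 \wedge \psi_2$ iff $\langle \sigma, t \rangle \models \psi_1$ and $\langle \sigma, t \rangle \models \psi_2$
• $\langle \sigma, t \rangle \models \bigcirc \psi$ iff $\langle \sigma, t+1 \rangle \models \psi$
• $\langle \sigma, t \rangle \models \psi_1 \,\mathsf{U}\, \psi_2$ iff $\exists k \in [t, T]$ s.t. $\langle \sigma, k \rangle \models \psi_2$ and $\langle \sigma, j \rangle \models \psi_1$, $\forall j \in [t, k)$
A sequence $\sigma$ is then said to satisfy $\varphi$, i.e., $\sigma \models \varphi$, iff $\langle \sigma, 0 \rangle \models \varphi$.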
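The recursive semantics above translate directly into a small checker. The sketch below is only illustrative: the nested-tuple encoding of formulas, the use of Python's `True` for the constant that always holds, and the helper names `holds`, `eventually`, and `always` are assumptions made here, not part of the paper.

```python
def holds(formula, trace, t=0):
    """Return True iff the finite `trace` satisfies `formula` at step t.

    Hypothetical encoding: a string is an atomic proposition, True is the
    constant that always holds, and compound formulas are tuples
    ("not", f), ("and", f, g), ("next", f), ("until", f, g).
    `trace` is a list of sets holding the propositions true at each step.
    """
    if formula is True:
        return True
    if isinstance(formula, str):
        return formula in trace[t]
    op = formula[0]
    if op == "not":
        return not holds(formula[1], trace, t)
    if op == "and":
        return holds(formula[1], trace, t) and holds(formula[2], trace, t)
    if op == "next":
        # On a finite trace there is no next step after the last one.
        return t + 1 < len(trace) and holds(formula[1], trace, t + 1)
    if op == "until":
        for k in range(t, len(trace)):
            if holds(formula[2], trace, k):
                return True   # second operand reached; first held at every earlier step
            if not holds(formula[1], trace, k):
                return False  # first operand broken before the second became true
        return False
    raise ValueError(f"unknown operator: {op!r}")

def eventually(f):
    """Derived operator: eventually f is (true until f)."""
    return ("until", True, f)

def always(f):
    """Derived operator: always f is not eventually not f."""
    return ("not", eventually(("not", f)))
```

For instance, the instruction “avoid the kitchen until you are clean” from the introduction could be encoded as `("until", ("not", "kitchen"), "clean")`.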
III Method
In this section, we describe our primary contribution, LTLDoG, a diffusion-based framework for generating trajectories that satisfy LTL$_f$ formulae. We first discuss how to conditionally sample using diffusion models, followed by how LTL$_f$ formulae can be used to guide the diffusion process.
III-A Conditional Sampling in Diffusion Models
Given a trained score function $s_\theta$ from (1) such that $s_\theta(\tau^i, i) \approx \nabla_{\tau^i} \log q(\tau^i)$, a diffusion model denoises samples according to the reverse distribution $p_\theta(\tau^{i-1} \mid \tau^i)$, starting from a Gaussian prior $\tau^N \sim \mathcal{N}(0, I)$. For example, Diffuser [1] samples a trajectory from a diffusion model, which an agent then executes. However, this original sampling process is unable to control detailed properties of the generated content.
Here, we are interested in sampling trajectories that satisfy both the final goal and the specified instructions encoded as an LTL$_f$ formula $\varphi$ that is provided during deployment. In other words, we aim to sample trajectories under an objective function $\mathcal{J}_\varphi(\tau)$. For example, $\mathcal{J}_\varphi(\tau)$ can have a high value if the events induced by $\tau$ satisfy $\varphi$ and a low value otherwise.
Formally, given a set of atomic propositions $\mathcal{P}$, the assignments for $\tau$ are given by a labeling function $L$, where each timestep of $\tau$ induces an assignment to the propositional symbols in $\mathcal{P}$. For example, in a navigation task, $\mathcal{P}$ can represent multiple regions to avoid and $L$ is a function indicating whether the states $s_t$ in $\tau$ are inside these regions or not. We will slightly abuse notation and write $\tau \models \varphi$ to indicate that $\tau$’s assignments satisfy an LTL$_f$ formula $\varphi$.
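We aim to sample from the posterior $p(\tau \mid \tau \models \varphi, s_g)$ with the diffusion model as the prior. In this work, the constraint of goal state conditioning can be either achieved by inpainting (similarly on the start state $s_0$) or implicitly encoded in the dataset, which does not require separate modeling in conditional sampling. Therefore, we mainly target the posterior $p(\tau \mid \tau \models \varphi)$. Let $y$ denote a binary random variable indicating the likelihood of satisfying $\varphi$. Hence, the denoising process requires a score function conditioned on $y = 1$, and by Bayes’ rule: $\nabla_{\tau^i} \log p(\tau^i \mid y = 1) = \nabla_{\tau^i} \log p(\tau^i) + \nabla_{\tau^i} \log p(y = 1 \mid \tau^i)$. The first term on the right-hand side has been learned by a neural network in diffusion. However, the latter term requires an integration over all possible values of $\tau^0$: $p(y = 1 \mid \tau^i) = \int p(y = 1 \mid \tau^0)\, p(\tau^0 \mid \tau^i)\, d\tau^0$. We consider the plug-and-play conditional generation setting and approximate this integration with sample estimation [13], e.g., point estimation with $p(y = 1 \mid \tau^i) \approx p(y = 1 \mid \hat{\tau}^0)$, where the noiseless trajectory is estimated via Tweedie’s formula [14], $\hat{\tau}^0 = \frac{1}{\sqrt{\bar{\alpha}_i}} \big( \tau^i + (1 - \bar{\alpha}_i)\, s_\theta(\tau^i, i) \big)$. We model the likelihood term as $p(y = 1 \mid \tau^0) = Z \exp\big( \mathbb{1}[\tau^0 \models \varphi] \big)$, where $Z$ is a normalizing constant and $\mathbb{1}[\cdot]$ is the indicator function. Putting the above elements together, the conditional score function can be computed as follows,
$$\nabla_{\tau^i} \log p(\tau^i \mid y = 1) \approx s_\theta(\tau^i, i) + \nabla_{\tau^i} \mathbb{1}[\hat{\tau}^0 \models \varphi]. \tag{2}$$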
Unfortunately, both the indicator function and the satisfaction evaluation (performed by techniques like model checking with finite automata) are non-differentiable — this prohibits application in the gradient-based sampling process of diffusion models. We address this problem in the next subsection.
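The point estimate of the noiseless trajectory used above can be written in a few lines. This is a sketch assuming the DDPM parameterisation with cumulative product $\bar{\alpha}_i$ and a score (rather than noise) prediction; the function name `tweedie_estimate` is hypothetical.

```python
import numpy as np

def tweedie_estimate(tau_i, score, alpha_bar_i):
    """Posterior-mean estimate of tau^0 from a noisy tau^i and its score.

    Assumes q(tau^i | tau^0) = N(sqrt(alpha_bar_i) tau^0, (1 - alpha_bar_i) I).
    """
    return (tau_i + (1.0 - alpha_bar_i) * score) / np.sqrt(alpha_bar_i)
```

When the score is exact for a single clean trajectory, the estimate recovers that trajectory; with a learned score it is only an approximation, which is why the resulting conditional score is itself approximate.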
III-B Differentiable Evaluation of LTL$_f$
Our key approach is to “soften” the satisfaction evaluation. To make the evaluation differentiable, we modify our formula evaluator to output positive real values if a trajectory satisfies the formula, and negative real values otherwise. Instead of using a binary labeling function, we assume real-valued assignments of atomic propositions can be obtained through computation on the generated trajectories. As a specific example, consider a navigation task where a robot has to avoid obstacles; one can determine the assignment for proposition $p$ at timestep $t$ using the Euclidean distance $d_t$ between $s_t$ and the center of the region (assuming a circular shape). A positive value of $r - d_t$, where $r$ is the radius of the circle, indicates a true assignment. Consequently, the labeling function $\ell$ for the entire trajectory is differentiable and the assignments $\ell(\tau)$ are real-valued.
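The circular-region example can be sketched as follows, together with its closed-form gradient with respect to the states. The function names and the array layout (one 2-D position per timestep) are assumptions for illustration.

```python
import numpy as np

def region_label(states, center, radius):
    """Real-valued assignment for 'inside the circle': radius minus distance per timestep.

    Positive values mean the state lies inside the region (proposition true).
    """
    distances = np.linalg.norm(states - center, axis=-1)
    return radius - distances

def region_label_grad(states, center):
    """Gradient of the label w.r.t. each state, i.e. -(s - c) / ||s - c||."""
    diff = states - center
    distances = np.linalg.norm(diff, axis=-1, keepdims=True)
    # Guard against division by zero exactly at the center.
    return -diff / np.maximum(distances, 1e-12)
```

The gradient always points toward the region center, so ascending it increases the label, while ascending its negation pushes a state out of the region.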
A binary version of $\ell$ obtained through a sign function satisfies the definition of satisfaction given in Section II-B, but the sign operation breaks differentiability. As such, we employ a formula evaluator $f_\varphi$ to check satisfaction [15], with positive values implying $\tau \models \varphi$, similar to signal temporal logic [16]. $f_\varphi$ is thus differentiable, with the evaluation process defined as follows:
-
•
-
•
-
•
-
•
-
•
-
•
-
•
-
•
-
•
-
•
, where is the smallest integer s.t.
Note that the $\min$ and $\max$ functions are likewise “soft” to maintain differentiability. To reduce clutter, we have omitted the subscript $t$ when $t = 0$. With these operations, $f_\varphi$ maintains quantitative semantics, which preserves the relative values between different trajectories such that trajectories with a larger margin to a satisfying assignment have larger values.
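As an illustration of soft quantitative semantics, the sketch below evaluates a formula over real-valued labels using log-sum-exp smoothing of min and max. It follows the common signal-temporal-logic style and is not claimed to match the paper's exact rules; the tuple encoding, the temperature `gamma`, and the names `soft_max`, `soft_min`, and `evaluate` are assumptions.

```python
import numpy as np

def soft_max(values, gamma=10.0):
    """Smooth upper bound on max via log-sum-exp; sharper as gamma grows."""
    v = np.asarray(values, dtype=float)
    m = v.max()
    return m + np.log(np.sum(np.exp(gamma * (v - m)))) / gamma

def soft_min(values, gamma=10.0):
    """Smooth lower bound on min, defined through soft_max."""
    return -soft_max(-np.asarray(values, dtype=float), gamma)

def evaluate(formula, labels, t=0, gamma=10.0):
    """Soft satisfaction value at step t; positive values indicate satisfaction.

    `labels` maps each proposition name to a 1-D array of real values over time.
    Formulas are strings (atoms) or tuples such as ("not", f), ("and", f, g),
    ("or", f, g), ("always", f), ("eventually", f), ("until", f, g).
    """
    if isinstance(formula, str):
        return float(labels[formula][t])
    op, args = formula[0], formula[1:]
    horizon = len(next(iter(labels.values())))
    ev = lambda f, k: evaluate(f, labels, k, gamma)
    if op == "not":
        return -ev(args[0], t)
    if op == "and":
        return soft_min([ev(args[0], t), ev(args[1], t)], gamma)
    if op == "or":
        return soft_max([ev(args[0], t), ev(args[1], t)], gamma)
    if op == "always":
        return soft_min([ev(args[0], k) for k in range(t, horizon)], gamma)
    if op == "eventually":
        return soft_max([ev(args[0], k) for k in range(t, horizon)], gamma)
    if op == "until":
        candidates = []
        for k in range(t, horizon):
            # Second operand at step k, first operand at every step before k.
            parts = [ev(args[1], k)] + [ev(args[0], j) for j in range(t, k)]
            candidates.append(soft_min(parts, gamma))
        return soft_max(candidates, gamma)
    raise ValueError(f"unknown operator: {op!r}")
```

Because log-sum-exp deviates from the hard min or max by at most $\log(n)/\gamma$ for $n$ inputs, the sign of the output can disagree with Boolean satisfaction when the margin is very close to zero; a larger `gamma` narrows that band at the cost of less smooth gradients.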
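Given $\ell$ and $f_\varphi$, we can obtain a differentiable score function by replacing the likelihood term in (2) with $Z \exp\big( f_\varphi(\ell(\hat{\tau}^0)) \big)$, where $\hat{\tau}^0$ is the noiseless trajectory estimated via Tweedie’s formula. The conditional score is then
$$\nabla_{\tau^i} \log p(\tau^i \mid y = 1) \approx s_\theta(\tau^i, i) + \nabla_{\tau^i} f_\varphi\big(\ell(\hat{\tau}^0)\big). \tag{3}$$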
This approximation of the conditional score function can be directly used with a pre-trained diffusion model to sample trajectories conditioned on an LTL$_f$ formula. We call this method LTLDoG-S, since it performs the above posterior sampling in the reverse process (Algorithm 1). The gradient ascent step is controlled by a stepsize; in practice, the stepsize for each denoising step can be adaptive such that the sample remains valid according to the formula evaluator $f_\varphi$.
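One way such guidance could enter a DDPM reverse step is sketched below. This is an assumption-laden illustration rather than a transcription of Algorithm 1: it folds the scaled value gradient into the score before computing the posterior mean, uses $\beta_i$ as the sampling variance, and takes both the score and the value gradient as user-supplied callables (`score_fn`, `value_grad_fn`, both hypothetical).

```python
import numpy as np

def guided_reverse_step(tau_i, i, score_fn, value_grad_fn, betas, step_size, rng):
    """One guided denoising step from tau^i to tau^(i-1); i is 1-indexed."""
    beta = betas[i - 1]
    # Unconditional score plus a scaled gradient of the satisfaction value.
    guided_score = score_fn(tau_i, i) + step_size * value_grad_fn(tau_i, i)
    mean = (tau_i + beta * guided_score) / np.sqrt(1.0 - beta)
    # No noise is added on the final step (i == 1), as in standard DDPM sampling.
    noise = rng.standard_normal(tau_i.shape) if i > 1 else np.zeros_like(tau_i)
    return mean + np.sqrt(beta) * noise
```

A full sampler would start from Gaussian noise and call this function for $i = N, \dots, 1$, with `value_grad_fn` returning the gradient of the soft evaluator applied to the Tweedie estimate; an adaptive `step_size` could be chosen per step, as described above.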
III-C Classifier Guidance over LTL$_f$
One limitation of LTLDoG-S is that it necessitates a known (and differentiable) expression for the formula evaluator . This requirement can be challenging to meet in scenarios where the truth assignments of propositions are uncertain, for example, when the ground-truth physical dynamics are unknown. Here, we circumvent this problem by employing classifier guidance using a trained formula evaluator.
We propose a variant of LTLDoG with regressor guidance, which we abbreviate as LTLDoG-R. Using a regression loss, we train a neural network to predict the satisfaction values from noisy trajectories conditioned on LTL$_f$ instructions. In other words, our neural network is used to approximate the conditional score function. Note that in contrast to learning a binary classifier for trajectory satisfaction, we apply the labeling function in Section III-B on the dataset and associate with each trajectory its objective value. We conduct an ablation study in Section V-D to show that using real values performs better than using binary labels.
To generalize over different LTL$_f$ formulae, the neural network takes both the noisy trajectory and a formula embedding as input. Multiple methods exist for embedding LTL$_f$ formulae. For instance, one can use Graph Neural Networks (GNNs) [17, 18] to embed the tree representation of an LTL$_f$ formula directly. Alternatively, the deterministic finite-state automaton (DFA) [19, 20] associated with the formulae can be embedded [21]. In this work we employ LTL$_f$ tasks from [22] and embed the directed graph of an LTL$_f$ formula using the Relational Graph Convolutional Network (R-GCN) [23], which can generalize to LTL$_f$ formulae with the same template structure. The model that approximates the score function after training can be plugged into the conditional reverse process using regressor guidance as summarized in Algorithm 2.
IV Related Work
LTLDoG builds upon prior work in diffusion-based planning and symbolic reasoning using LTL for robotics. In the following, we give a brief overview of related work.
Learning and Planning under LTL. As an expressive language for specifying high-level planning requirements [24, 25, 26], LTL has been extensively used in various robotic tasks to express temporally extended goals [27, 28]. These methods usually require information about the environment’s dynamics, e.g., a model or an abstraction, to plan effectively under a given formula. Reinforcement learning agents learn in a model-free way under LTL objectives or constraints [29, 30, 31], with the ability to generalize over different formulae [22]. However, these methods operate the agent online via trial and error, which can lead to expensive or even unsafe interactions.
Planning and Policy Learning with Diffusion. Recent diffusion-based planning methods are flexible in that they rely only on offline datasets, without access to or interaction with environments. They have been successfully applied to long-horizon planning problems by generating states or actions for control [1, 3, 2], but not to tasks with test-time temporal requirements. Recent work has looked into safety-critical tasks, e.g., the aforementioned CBF-based methods [7, 8]. As discussed above, these methods were designed for static safety criteria, and they lack the ability to satisfy temporally extended behaviors. Our work inherits the advantages of diffusion-based methods and can fulfill LTL$_f$ requirements.
Inverse Problems in Diffusion. Our proposed method formulates conditional measurement under LTL$_f$ with a differentiable loss function using the unnormalized likelihood [13, 32] for posterior sampling, and can guide the sampling process in a plug-and-play fashion. The most popular methods to guide diffusion models during inference are classifier guidance [33] and classifier-free guidance [34]. However, these methods cannot be applied in a plug-and-play fashion for new conditioning factors. Our work is also related to inverse problems, which infer a posterior distribution over data given a measurement. Inverse methods [35, 36, 37, 38] do not require training a conditional model in diffusion and can directly leverage pre-trained neural networks.
V Experiments
Method | Satisfaction rate (%)¹, planning | Satisfaction rate (%)¹, rollout | Reward (UnCon)² |
---|---|---|---|
Diffuser | 9.5±3.1 | 11.0±1.7 | 142.2±5.1 |
SafeDiffuser | 99.4±0.9 | 12.3±3.5 | 135.8±5.2 |
LTLDoG-S | 99.0±0.8 | 73.0±3.0 | 97.3±2.8 |
LTLDoG-R | 98.8±0.8 | 92.0±1.4 | 127.1±5.1 |
¹ Mean and standard deviation calculated from groups of tests. Each test contains trials, where a trajectory is labeled as either satisfied or not satisfied in each trial. Best result during rollout is highlighted. Same for other tables of Maze2d.
² Unconstrained rewards do not take unsafe penalties into account.
Method | Satisfaction rate (%) | Overlap score¹ |
---|---|---|
Diffusion Policy | 34.8±18.0 | 0.941±0.0584 |
LTLDoG-S | 85.6±13.1 | 0.890±0.0647 |
LTLDoG-R | 85.6±12.5 | 0.842±0.0985 |
¹ Measures the final overlap of the T block and target area (min: , max: ).
Our experiments focus on testing LTLDoG’s ability to handle static and temporal safety constraints. We first briefly describe the simulated and real environments, and baseline data-driven methods. Then we report empirical results on benchmark environments and demonstrate LTLDoG’s applicability in real world tasks through a case study on a quadruped robot (Fig. 2). We conclude with a brief ablation study and analysis. Due to space restrictions, we focus on conveying our main results. More details of the environments, implementations and analysis can be found in the appendix.
V-A Experimental Setup
Environments. We evaluate methods in two simulation benchmark environments (Maze2d [1] and PushT [3]) and demonstrate in two real indoor rooms. Maze2d (Fig. 3 and 5) presents challenging long-horizon navigation tasks, where state-of-the-art offline algorithms fail to learn a reliable goal-reaching policy. The atomic propositions are determined by the occurrence of events when the agent is inside key regions in the maze (for avoidance and visitation). The PushT task (Fig. 4) requires manipulation of a T block through interaction with a controllable mover. In our experiments, the mover is constrained to visit specific regions and avoid others. Our real-world experiments involve two indoor environments: a lab designed to mimic a studio apartment, and an office environment (Fig. 2).
Compared methods. Our work involves trajectory generation by learning from an offline dataset and as such, we compare against data-driven planning methods. Diffuser and Diffusion Policy are state-of-the-art methods for sampling viable plans but without any guarantees over external constraints. To evaluate how well LTLDoG enforces safety specifications, we compare with SafeDiffuser, a safe planning diffusion model using CBFs. However, note that SafeDiffuser cannot handle temporal constraints or instructions; to our knowledge, our work is the first data-driven method to handle both static and temporal constraints. As such, there is no direct comparison baseline. We analyze the differences between the two variants of our method, LTLDoG-S and LTLDoG-R.
V-B Comparative Analysis of Methods
Can LTLDoG achieve safe planning for static constraints? Our results indicate that yes, LTLDoG is better able to generate trajectories that satisfy given region-avoidance constraints relative to existing methods. In both the Maze2D and PushT benchmarks, LTLDoG achieves high success rates (in both planning and rollout), without severely compromising reward accumulation (Tables I and II). Lower total rewards are expected since safe paths are typically longer and rollout scores are generally lower as the low-level controller may not exactly follow the diffused trajectory. Qualitatively, Fig. 3 shows that LTLDoG is able to generate safe trajectories in Maze2D, whilst SafeDiffuser is limited to “local” deviations and fails to find paths that detour around unsafe regions. Results in PushT are consistent with Maze2D; Fig. 4 shows that LTLDoG performs the orientation task without entering unsafe regions.
Environment | Method | Training LTL$_f$s: satisfaction rate (%), planning | Training LTL$_f$s: satisfaction rate (%), rollout | Training LTL$_f$s: reward (UnCon) | Testing LTL$_f$s: satisfaction rate (%), planning | Testing LTL$_f$s: satisfaction rate (%), rollout | Testing LTL$_f$s: reward (UnCon) |
---|---|---|---|---|---|---|---|
U-Maze (Horizon 256) | Diffuser | 31.1±0.5 | 31.0±0.5 | 33.5±2.7 | 33.9±0.5 | 34.1±0.6 | 35.6±0.3 |
U-Maze (Horizon 256) | LTLDoG-S | 83.8±0.2 | 57.6±1.3 | 31.3±1.2 | 82.7±0.3 | 56.6±0.9 | 32.8±0.5 |
U-Maze (Horizon 256) | LTLDoG-R | 56.3±0.4 | 51.3±0.9 | 31.5±0.2 | 57.7±0.4 | 52.1±0.3 | 32.3±0.7 |
Medium (Horizon 384) | Diffuser | 15.0±0.7 | 13.4±0.6 | 84.8±0.3 | 11.6±1.4 | 10.1±1.2 | 84.8±0.5 |
Medium (Horizon 384) | LTLDoG-S | 77.9±5.7 | 31.8±2.6 | 53.1±5.2 | 68.4±6.7 | 28.7±3.5 | 50.5±4.7 |
Medium (Horizon 384) | LTLDoG-R | 51.8±1.8 | 39.5±1.6 | 57.3±0.2 | 43.3±4.4 | 30.6±1.9 | 57.7±0.1 |
Large (Horizon 512) | Diffuser | 13.5±0.4 | 12.8±0.1 | 76.3±0.1 | 11.6±2.3 | 11.5±1.7 | 77.8±3.9 |
Large (Horizon 512) | LTLDoG-S | 73.8±2.4 | 32.6±1.4 | 42.3±5.0 | 66.6±2.7 | 24.9±1.7 | 40.9±4.4 |
Large (Horizon 512) | LTLDoG-R | 66.9±0.6 | 47.4±0.8 | 54.6±1.3 | 57.5±2.3 | 39.0±2.9 | 54.5±3.9 |
Method | Set | Satisfaction rate (%) | Score |
---|---|---|---|
Diffusion Policy | Training | 22.9±8.0 | 0.354±0.153 |
Diffusion Policy | Test | 30.7±13.9 | 0.371±0.177 |
LTLDoG-S | Training | 28.2±8.33 | 0.290±0.115 |
LTLDoG-S | Test | 43.0±17.0 | 0.299±0.145 |
LTLDoG-R | Training | 69.3±9.90 | 0.292±0.121 |
LTLDoG-R | Test | 66.0±20.8 | 0.340±0.168 |
Can LTLDoG satisfy static and temporal constraints, and generalize to novel LTL$_f$ formulae? Tables III and IV show the performance of the compared methods on both training and test LTLs in Maze2d and PushT environments. To elaborate, we follow the specifications in [22] and adopt the Until sampler to generate random LTL$_f$s ( for Maze2d and for PushT) that contain different visitation sequences and unsafe regions. The training set has of all LTL$_f$s and the rest are used as the test set. Atomic propositions consist of 6 regions in Maze2d (Fig. 5(a), i.e., ) and 4 regions in PushT (Fig. 4(a)).
Results reveal that both LTLDoG-S and LTLDoG-R achieve significantly higher success rates than Diffuser/Diffusion Policy. The baselines have a non-zero performance as some generated formulae are trivial to satisfy at some start locations. Also note that some specifications may be impossible to satisfy given the physical locations of the agent, walls, and propositional regions in the maze. In these cases, planned trajectories will violate constraints during rollout, causing a drop in the satisfaction rate. Fig. 5 shows examples of trajectories generated in Maze2d.
Environment | Method | Satisfaction rate (%): Goal | Satisfaction rate (%): LTL$_f$ |
---|---|---|---|
Lab | Diffuser | 100.00 | 0.00 |
Lab | LTLDoG-S | 91.67 | 91.67 |
Office | Diffuser | 100.00 | 0.00 |
Office | LTLDoG-S | 95.83 | 95.83 |
V-C Real Robot Case Study
We show that LTLDoG can plan for a robot dog (Unitree Go2 Edu) given LTL$_f$ instructions in two real-world navigation environments: a lab that mimics a studio apartment and an office room. Training of diffusion models was performed in simulation using Gazebo and ROS1 with LIDAR-scanned maps (Fig. 6(a) and 6(d)). Note that collecting the training trajectories does not require running an oracle policy to satisfy many different LTL$_f$ formulae; we simply made the robot navigate to randomly sampled goals using the global planner and TEB Local Planner [39] from the ROS1 Navigation stack (Fig. 6(a) and 6(d)).
To test on potential LTL$_f$ constraints, we queried GPT-4 for LTL$_f$ formulae representing meaningful robotic tasks including obstacle avoidance and sequential navigation. We used 4 kinds of LTL$_f$s from the generated results, i.e., , , and . The first corresponds to an obstacle avoidance task where the robot should never visit a specific region. For example, means the robot should never enter the Seat region (Fig. 6(d)). The remaining three formulae represent: 1) visiting all regions at least once; 2) visiting regions in a specific sequence; 3) avoiding a specific region until another has been visited. Some example regions designed in our real environments and start/goal locations are shown in Fig. 6.
In total, trials were executed on the real robot, involving trajectories ( for baseline Diffuser and for our method LTLDoG-S) with varying regions and different LTL$_f$ formulae in each room. The overall satisfaction rate of all raw generated trajectories in simulation is (cf. baseline ). For each specific LTL$_f$ formula, a sample was selected for real-world execution based on its feasibility. The results in Table V show that LTLDoG has a high satisfaction rate compared to Diffuser; Fig. 6(c) and 6(f) illustrate trajectories for qualitative comparison.
V-D Ablation Study and Analysis
Method | LTL Set | Satisfaction rate (%), planning | Satisfaction rate (%), rollout | Reward (UnCon) |
---|---|---|---|---|
Diffuser | Training | 31.1±0.5 | 31.0±0.5 | 32.5±2.7 |
Diffuser | Test | 33.9±0.5 | 34.1±0.6 | 35.6±0.3 |
Classifier guidance¹ | Training | 41.1±0.6 | 40.6±0.7 | 33.9±0.9 |
Classifier guidance¹ | Test | 40.0±0.8 | 41.8±1.1 | 35.3±0.3 |
LTLDoG-R | Training | 56.3±0.4 | 51.3±0.9 | 31.5±0.2 |
LTLDoG-R | Test | 57.7±0.4 | 52.1±0.3 | 32.3±0.7 |
¹ The classifier guidance method only leverages binary labels for LTL satisfaction checking in Maze2d U-Maze.
Ablation study. Unlike classifier guidance, where each trajectory is labelled as satisfying or not satisfying, we leverage the continuous values from our formula evaluator (as described in Section III-C) and train a regressor guidance network. Although classifier guidance improves over the vanilla Diffuser, it achieves lower performance than LTLDoG-R (Table VI). This comparison supports the notion that soft labels improve the guidance neural network; we posit that using real values provides richer information about how well the trajectory satisfies a given formula.
Analysis on horizon. Fig. 7 demonstrates that planning with a longer horizon leads to improved performance in terms of LTL$_f$ satisfaction. This improvement is attributed to the fact that LTL$_f$ instructions often require a longer sequence of steps in a path compared to mere goal navigation, e.g., visiting a specific region before reaching the goal.
VI Conclusion, Discussion and Future Work
In this work, we presented LTLDoG, an approach towards generating safe trajectories that comply with LTL$_f$ specifications at test time. Within our overall scheme, we presented two methods: LTLDoG-S guides the sampling process under any LTL$_f$ formula while LTLDoG-R uses a trained model that we show generalizes to new formulae with similar structure. To our knowledge, this work is the first that successfully incorporates model checking using a formal language with diffusion models for safe planning. Notably, LTLDoG does not require data collection for each potential instruction; rather, we control the sampling process during diffusion using “soft” model checking to generate new trajectories using existing information provided by the training dataset.
Limitations and Future Work. LTLDoG is a step towards trustworthy trajectory planning using generative models. There are several areas where LTLDoG can be improved. Similar to other diffusion models, LTLDoG is generally unable to generate realistic trajectories when the context (environment/goal) is far from the training distribution. As such, the dataset should preferably contain trajectories with a variety of complex behaviors that can potentially satisfy different test-time LTL$_f$ formulae. It would be interesting to develop methods to ensure the sampling process of LTLDoG-S adheres to the data manifold. Additionally, LTLDoG plans in an open-loop fashion, where the entire trajectory is generated conditioned on the constraints. This is mainly because the evaluation of an LTL$_f$ formula depends on the entire trajectory. We plan to further explore planning using receding horizon control using partial evaluations on an incomplete trajectory. Finally, diffusion models usually require significant amounts of training data and many diffusion steps during inference. Recent work on interpolant diffusion methods [40] leverages source policies to reduce data and computation costs. We aim to explore how integrating this approach with conditional sampling using LTL$_f$ can enhance the generation of safe trajectories.
Acknowledgements
This research is supported by A*STAR under its National Robotics Programme (NRP) (Award M23NBK0053). The authors would also like to acknowledge partial support from a Google South Asia & Southeast Asia Award and from the National Research Foundation, Singapore under its Medium Sized Center for Advanced Robotics Technology Innovation.
References
- [1] M. Janner, Y. Du, J. B. Tenenbaum, and S. Levine, “Planning with diffusion for flexible behavior synthesis,” in Int. Conf. Mach. Learn., vol. 162, 2022, pp. 9902–9915.
- [2] A. Ajay, Y. Du, A. Gupta, J. B. Tenenbaum, T. S. Jaakkola, and P. Agrawal, “Is conditional generative modeling all you need for decision making?” in Int. Conf. Learn. Representations, 2023.
- [3] C. Chi, S. Feng, Y. Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,” in Proc. Robot.: Sci. and Syst. (RSS), 2023.
- [4] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in Proc. 32nd Int. Conf. Mach. Learn., 2015, pp. 2256–2265.
- [5] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in Advances in Neural Inf. Process. Syst., 2020, pp. 6840–6851.
- [6] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” in Int. Conf. Learn. Representations, 2021.
- [7] W. Xiao, T.-H. Wang, C. Gan, and D. Rus, “SafeDiffuser: Safe planning with diffusion probabilistic models,” arXiv preprint arXiv:2306.00148, 2023.
- [8] N. Botteghi, F. Califano, M. Poel, and C. Brune, “Trajectory generation, control, and safety with denoising diffusion probabilistic models,” arXiv preprint arXiv:2306.15512, 2023.
- [9] A. Pnueli, “The temporal logic of programs,” in 18th Annu. Symp. Found. Comput. Sci., 1977, pp. 46–57.
- [10] P. Vincent, “A connection between score matching and denoising autoencoders,” Neural Comput., vol. 23, no. 7, pp. 1661–1674, 2011.
- [11] C. Baier and J. Katoen, Principles of Model Checking. MIT Press, 2008.
- [12] C. Belta, B. Yordanov, and E. A. Gol, Formal Methods for Discrete-Time Dynamical Systems. Springer, 2017, vol. 89.
- [13] H. Chung, J. Kim, M. T. Mccann, M. L. Klasky, and J. C. Ye, “Diffusion posterior sampling for general noisy inverse problems,” in Int. Conf. Learn. Representations, 2023.
- [14] B. Efron, “Tweedie’s formula and selection bias,” J. Amer. Statistical Assoc., vol. 106, no. 496, pp. 1602–1614, 2011.
- [15] Z. Xu, Y. S. Rawat, Y. Wong, M. Kankanhalli, and M. Shah, “Don’t pour cereal into coffee: Differentiable temporal logic for temporal action segmentation,” in Advances in Neural Inf. Process. Syst., 2022.
- [16] K. Leung, N. Aréchiga, and M. Pavone, “Backpropagation through signal temporal logic specifications: Infusing logical structure into gradient-based methods,” Int. J. Robot. Res., vol. 42, no. 6, pp. 356–370, 2023.
- [17] M. Gori, G. Monfardini, and F. Scarselli, “A new model for learning in graph domains,” in Proc. IEEE Int. Joint Conf. Neural Netw., vol. 2, 2005, pp. 729–734.
- [18] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Trans. Neural Networks, vol. 20, no. 1, pp. 61–80, 2009.
- [19] S. Zhu, L. M. Tabajara, J. Li, G. Pu, and M. Y. Vardi, “Symbolic LTLf synthesis,” in Int. Joint Conf. Artif. Intell., 2017, pp. 1362–1369.
- [20] A. Camacho, J. Baier, C. Muise, and S. McIlraith, “Finite LTL synthesis as planning,” in Proc. Int. Conf. Automated Planning and Scheduling, vol. 28, 2018, pp. 29–38.
- [21] Y. Xie, F. Zhou, and H. Soh, “Embedding symbolic temporal knowledge into deep sequential models,” in IEEE Int. Conf. Robot. Automat., 2021, pp. 4267–4273.
- [22] P. Vaezipoor, A. C. Li, R. A. T. Icarte, and S. A. Mcilraith, “LTL2Action: Generalizing LTL instructions for multi-task RL,” in Int. Conf. Mach. Learn., 2021, pp. 10 497–10 508.
- [23] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling, “Modeling relational data with graph convolutional networks,” in The Semantic Web, 2018, pp. 593–607.
- [24] F. Bacchus and F. Kabanza, “Using temporal logics to express search control knowledge for planning,” Artif. Intell., vol. 116, no. 1, pp. 123–191, 2000.
- [25] J. A. Baier and S. A. McIlraith, “Planning with temporally extended goals using heuristic search,” in Proc. Int. Conf. Automated Planning and Scheduling, 2006, pp. 342–345.
- [26] A. Camacho, E. Triantafillou, C. Muise, J. Baier, and S. McIlraith, “Non-deterministic planning with temporally extended goals: LTL over finite and infinite traces,” in Proc. AAAI Conf. Artif. Intell., vol. 31, no. 1, 2017.
- [27] G. Fainekos, H. Kress-Gazit, and G. Pappas, “Temporal logic motion planning for mobile robots,” in IEEE Int. Conf. Robot. Automat., 2005, pp. 2020–2025.
- [28] V. Kurtz and H. Lin, “Temporal logic motion planning with convex optimization via graphs of convex sets,” IEEE Trans. Robot., vol. 39, no. 5, pp. 3791–3804, 2023.
- [29] C. Yang, M. L. Littman, and M. Carbin, “On the (in)tractability of reinforcement learning for LTL objectives,” in Int. Joint Conf. Artif. Intell., 2022, pp. 3650–3658.
- [30] R. Toro Icarte, T. Q. Klassen, R. Valenzano, and S. A. McIlraith, “Teaching multiple tasks to an RL agent using LTL,” in Proc. Int. Conf. Autonomous Agents Multiagent Syst., 2018, pp. 452–461.
- [31] C. Voloshin, H. M. Le, S. Chaudhuri, and Y. Yue, “Policy optimization with linear temporal logic constraints,” in Advances in Neural Inf. Process. Syst., 2022, pp. 17 690–17 702.
- [32] J. Song, Q. Zhang, H. Yin, M. Mardani, M.-Y. Liu, J. Kautz, Y. Chen, and A. Vahdat, “Loss-guided diffusion models for plug-and-play controllable generation,” in Int. Conf. Mach. Learn., vol. 202, 2023, pp. 32 483–32 498.
- [33] P. Dhariwal and A. Q. Nichol, “Diffusion models beat GANs on image synthesis,” in Advances in Neural Inf. Process. Syst., vol. 34, 2021, pp. 8780–8794.
- [34] J. Ho and T. Salimans, “Classifier-free diffusion guidance,” in NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
- [35] A. Jalal, M. Arvinte, G. Daras, E. Price, A. G. Dimakis, and J. Tamir, “Robust compressed sensing MRI with deep generative priors,” in Advances in Neural Inf. Process. Syst., vol. 34, 2021, pp. 14 938–14 954.
- [36] H. Chung and J. C. Ye, “Score-based diffusion models for accelerated MRI,” Med. Image Anal., p. 102479, 2022.
- [37] H. Chung, B. Sim, and J. C. Ye, “Come-closer-diffuse-faster: Accelerating conditional diffusion models for inverse problems through stochastic contraction,” in IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 12 413–12 422.
- [38] H. Chung, B. Sim, D. Ryu, and J. C. Ye, “Improving diffusion models for inverse problems using manifold constraints,” in Advances in Neural Inf. Process. Syst., vol. 35, 2022, pp. 25 683–25 696.
- [39] C. Rösmann, F. Hoffmann, and T. Bertram, “Integrated online trajectory planning and optimization in distinctive topologies,” Robot. Auton. Syst., vol. 88, pp. 142–153, 2017.
- [40] K. Chen, E. Lim, K. Lin, Y. Chen, and H. Soh, “Behavioral refinement via interpolant-based policy diffusion,” in Proc. Robot.: Sci. and Syst. (RSS), 2024.
- [41] J. Fu, A. Kumar, O. Nachum, G. Tucker, and S. Levine, “D4RL: Datasets for deep data-driven reinforcement learning,” arXiv preprint arXiv:2004.07219, 2020.
- [42] A. Camacho, R. Toro Icarte, T. Q. Klassen, R. Valenzano, and S. A. McIlraith, “LTL and beyond: Formal languages for reward function specification in reinforcement learning,” in Int. Joint Conf. Artif. Intell., 2019, pp. 6065–6073.
- [43] C. Menghi, C. Tsigkanos, M. Askarpour, P. Pelliccione, G. Vázquez, R. Calinescu, and S. García, “Mission specification patterns for mobile robots: Providing support for quantitative properties,” IEEE Trans. Software Eng., vol. 49, no. 4, pp. 2741–2760, 2023.
-A Data
-A1 Maze2d
The Maze2d environment is part of D4RL [41], a well-known open-source benchmark for offline reinforcement learning. Datasets of trajectories are publicly available in the D4RL library. Each trajectory in these datasets is a single continuous run of an agent navigating to random goals without interruption. See Fig. 8 for an overview of the environments. We use the original D4RL dataset as is to train the baseline Diffuser, which also serves as the diffusion model of LTLDoG.
-A2 PushT
PushT is a robot manipulation benchmark adopted in Diffusion Policy [3] to showcase the planning capabilities of diffusion models. A dataset of expert demonstrations for pushing a T block to the same goal position from random starting poses is publicly available from [3]. Although Diffusion Policy achieves good performance on the original manipulation task when trained with only around 200 expert demonstrations (i.e., 206 episodes), we find that these trajectories lack the multimodality and variety of behaviors needed to potentially or partially satisfy different constraints. As such, expanding the original dataset is necessary for our method to work.
Data Augmentation Given that obtaining extra expert demonstrations is inefficient, we applied common data augmentation techniques to expand the original dataset. Concretely, we first applied simple geometric augmentation (rotation and flipping only) to each trajectory in the original dataset, assuming basic physical symmetry in this simulation environment. Then, we used a pretrained Diffusion Policy to push the T block to the goal pose, in line with the original dataset, in which the goal of the manipulation task is implicitly encoded in the trajectories. In this way, we augmented the original set from 206 trajectories to 6686 trajectories, and lengthened the average episode from 125 to 289 steps. The augmented dataset is publicly available online (https://github.com/clear-nus/ltldog).
In this task, the baseline (Diffusion Policy) and our methods share the same diffusion model backbone, which is trained over the augmented dataset.
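As an illustration of the geometric augmentation step described above, the following is a minimal sketch for planar position sequences. It is not the released augmentation code: the function names, the choice of right-angle rotations, and the workspace center are assumptions made for this example, and a full PushT state would also require transforming the block orientation.

```python
import numpy as np

def rotate_xy(points, angle_rad, center):
    """Rotate a (T, 2) array of xy points about `center` by `angle_rad`."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    rot = np.array([[c, -s], [s, c]])
    return (points - center) @ rot.T + center

def flip_x(points, center):
    """Mirror a (T, 2) array of xy points across the line x = center[0]."""
    flipped = points.copy()
    flipped[:, 0] = 2.0 * center[0] - flipped[:, 0]
    return flipped

def augment(trajectory, center,
            angles=(0.0, np.pi / 2, np.pi, 3 * np.pi / 2)):
    """Return rotated and rotated-then-flipped copies of one trajectory.

    The set of angles is an assumption for this sketch.
    """
    copies = []
    for angle in angles:
        rotated = rotate_xy(trajectory, angle, center)
        copies.append(rotated)
        copies.append(flip_x(rotated, center))
    return copies

# Hypothetical two-step trajectory and workspace center.
traj = np.array([[300.0, 256.0], [320.0, 270.0]])
center = np.array([256.0, 256.0])
copies = augment(traj, center)
```

Because rotation and flipping move the implicit goal pose along with the trajectory, an additional step such as the pretrained-policy rollout mentioned above is needed to bring the block back to the original goal.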
-A3 Real Robot Scenarios
Data for both real-world scenarios were collected in simulated environments. For the office, we collected four long continuous trajectories in simulation, with lengths of 11,000, 107,000, 60,000 and 130,000 steps. Similarly, for the lab map, we collected four trajectories of 11,000, 19,000, 110,000 and 120,000 steps. The data were collected as follows. We first built maps of the two environments, a studio-like lab and an office room, using SLAM, and used these maps to create two simulation scenarios. To create the maps, we teleoperated a quadruped robot (Unitree Go2) mounted with a LiDAR in each environment, and used slam_toolbox along with Nav2 to generate the maps. We then created a simple 3D world for each map for simulation in Gazebo. After that, a Go2 robot simulated in Gazebo and driven by the ROS Navigation Stack navigated to random goal positions sampled in the maps, producing long trajectories in a similar fashion to Maze2d.
-A4 LTL Satisfaction Values
The soft labels (values indicating how well the LTLf formulae are satisfied) for training the regressor guide (LTLDoG-R) are calculated with an adapted version of the DTL toolbox [15]. They are computed and saved before each training session, since each session may sample and cut different trajectory segments from the datasets depending on its training configuration, and they are reused where possible. Value calculation for benchmarking test results is performed online, simultaneously with the tests.
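To make the notion of a soft label concrete, the sketch below computes a real-valued satisfaction score for a trajectory using the standard min/max quantitative semantics of temporal operators. It does not reproduce the API of the DTL toolbox [15] or the adapted version used here; the function names and the use of disc-shaped regions with signed distances are assumptions for illustration.

```python
import numpy as np

def inside_region(traj, center, radius):
    """Per-step value of 'inside the disc': positive inside, negative outside."""
    return radius - np.linalg.norm(traj - np.asarray(center), axis=1)

def neg(values):
    """Negation flips the sign of the per-step values."""
    return -values

def eventually(values):
    """F: the best value attained at any step."""
    return np.max(values)

def always(values):
    """G: the worst value attained over all steps."""
    return np.min(values)

def until(a, b):
    """a U b at step 0: b holds at some step t and a holds at all steps before t."""
    best = -np.inf
    prefix_min = np.inf  # minimum of a over steps strictly before t
    for t in range(len(b)):
        best = max(best, min(b[t], prefix_min))
        prefix_min = min(prefix_min, a[t])
    return best

# Hypothetical straight-line trajectory, goal region p and obstacle q.
traj = np.linspace([0.0, 0.0], [4.0, 0.0], 5)
p = inside_region(traj, center=(4.0, 0.0), radius=0.5)
q = inside_region(traj, center=(2.0, 1.0), radius=0.5)
reach_goal = eventually(p)           # positive if the goal is reached
avoid_obstacle = always(neg(q))      # positive if the obstacle is never entered
avoid_until_goal = until(neg(q), p)  # positive if q is avoided until p is reached
```

A positive score indicates satisfaction and its magnitude indicates the margin, which is the kind of graded signal a regressor can be trained to predict.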
-B LTLf Generation
To generate practical LTLf formulae that are meaningful for real robotic tasks, we query GPT-4 with a prompt that places some constraints on the formulae. The prompt and the response from ChatGPT are listed in TABLE VII. In the real experiments, we tested the first four returned formulae, as these are the most representative among them and are also commonly used in the robot learning literature [30, 42, 22]. Some of the other generated formulae shown in this table are either not logically correct or not meaningful for robotic tasks. For instance, the th formula has a similar meaning of sequential visiting as the 2nd formula. The imply operator does not enforce any constraints if the condition is not met (e.g., the condition in the th formula can never be satisfied for disjoint regions). The th, th and th formulae do not match their corresponding descriptions. The last formula can be considered as a combination of the 2nd and 3rd formulae. More complex formulae and combinations of them can be derived based on different visitation and avoidance patterns [43]. In future work, we plan to further explore these temporal specifications in environments like Robomimic and to adapt our method to complex combinations of formulae.
Prompt | In a robot navigation task, the robot is required to visit four key regions (referred to as p_0, p_1, p_2 and p_3) in specific temporal orders. The constraints on the robot’s trajectory concerning the visitation (or non-visitation) of these regions can be specified using finite Linear Temporal Logic (LTLf). Please generate several LTLf formulas (e.g., 10) that are realistic in real-world environments and tasks involving robot navigation. The details of the task and additional requirements are as follows: 1. The LTLf operators that can be used are: ¬ (Negate), ∧ (And), ∨ (Or), F (Eventually), G (Always), U (Until). Avoid using other LTL operators. Also, avoid using X (Next) since the number of timesteps needed to move from one region to another is unknown. 2. The atomic propositions are determined by whether the robot is inside p_0, p_1, p_2 and p_3. Therefore, the labelling function returns assignments with a 4-dimensional 0-1 vector at each timestep of the robot’s trajectory. 3. Only one proposition can be true at each timestep because these regions are disjoint. The robot cannot be inside any two regions simultaneously. An assignment at a timestep can be a zero vector (0, 0, 0, 0), indicating that the robot is not currently in any of these regions. This occurs when the robot is initially not in these regions and all atomic propositions are False. 4. The robot cannot move directly from one region to another as the regions might not be connected. Therefore, LTLf formulas such as p_0 U p_1 are not valid since both propositions will be False when the robot is not inside these two regions. 5. The robot’s trajectory is finite, so the robot cannot repeatedly visit different regions. 6. The LTLf formulas are supposed to apply from the first timestep of the assignment. 7. When generating an LTLf formula, please analyze the generated formula first to ensure that it is valid and that there exist assignments that can satisfy it. |
ChatGPT | To create realistic LTLf formulas for a robot navigation task considering the specified constraints and operators, we need to structure these formulas in a way that reflects possible navigation scenarios. Here are ten LTLf formulas that align with the given requirements: 1. Visit all regions at least once: • This formula states that eventually, the robot will visit each of the four regions at least once during its trajectory. 2. Visit regions in a specific sequence: • This formula requires the robot to visit , then , followed by , and finally , in that order. 3. Avoid a specific region until another has been visited: • Here, the robot must avoid until has been visited. 4. Never visit a specific region: • This formula specifies that the robot should never enter region . 5. Visit one region only after visiting two others: • The robot can visit only after it has visited and then . 6. Ensure two regions are visited before a third: • This formula ensures that and are visited (in no particular order) before visiting : 7. Visit at least three regions: • This ensures the robot visits at least three out of the four regions. It is a less strict requirement, allowing for flexibility in navigation. 8. Visit a specific region last: • The robot must visit last, after visiting , , and in any order. 9. Avoid a region until all other regions have been visited: • The robot should avoid until it has visited , , and in any order. 10. Visit a pair of regions, avoiding others until then: • The robot must avoid and until it has visited and then : Each of these formulas considers the finite nature of the robot’s trajectory, the impossibility of being in multiple regions simultaneously, and the requirement that the robot may need to travel through unspecified areas (not within , , , or ) to move from one region to another. The formulas are structured to reflect realistic navigation tasks that could be encountered in real-world robot navigation scenarios, such as search and rescue, surveillance, or delivery tasks within a specified area. |
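The labelling function and finite-trace semantics described in the prompt can be sketched as follows. This is a minimal Boolean checker written for illustration only: the nested-tuple formula encoding, the function names, and the disc-shaped regions are assumptions, and the example formulas are hypothetical rather than the ones returned by ChatGPT.

```python
import numpy as np

def label(traj, centers, radius):
    """Labelling function: (T, 2) trajectory -> (T, K) 0-1 region membership."""
    diff = traj[:, None, :] - np.asarray(centers)[None, :, :]
    return (np.linalg.norm(diff, axis=2) <= radius).astype(int)

def holds(formula, trace, t=0):
    """Check an LTLf formula (nested tuple) on the finite suffix trace[t:]."""
    op = formula[0]
    if op == "ap":  # atomic proposition p_i, given by its index
        return bool(trace[t, formula[1]])
    if op == "not":
        return not holds(formula[1], trace, t)
    if op == "and":
        return holds(formula[1], trace, t) and holds(formula[2], trace, t)
    if op == "or":
        return holds(formula[1], trace, t) or holds(formula[2], trace, t)
    steps = range(t, len(trace))
    if op == "F":  # eventually
        return any(holds(formula[1], trace, k) for k in steps)
    if op == "G":  # always
        return all(holds(formula[1], trace, k) for k in steps)
    if op == "U":  # until: right side at step k, left side on all steps before k
        return any(holds(formula[2], trace, k)
                   and all(holds(formula[1], trace, j) for j in range(t, k))
                   for k in steps)
    raise ValueError(f"unknown operator: {op}")

# Hypothetical trace: outside all regions, then p_1, outside again, then p_3.
trace = np.array([[0, 0, 0, 0],
                  [0, 1, 0, 0],
                  [0, 0, 0, 0],
                  [0, 0, 0, 1]])
avoid_p3_until_p1 = ("U", ("not", ("ap", 3)), ("ap", 1))
p0_until_p1 = ("U", ("ap", 0), ("ap", 1))
ok_avoid = holds(avoid_p3_until_p1, trace)
ok_strict = holds(p0_until_p1, trace)
```

On this trace the avoidance formula holds, whereas `p_0 U p_1` fails because the robot starts outside `p_0`, which is the situation that requirement 4 of the prompt warns about.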
-C Additional Illustrations
More illustrations and comparisons against the baseline for the obstacle avoidance task in the Maze2d environment are shown in Fig. 9.
Task | Method | Success rate (%), Lab | Success rate (%), Office |
---|---|---|---|
Obstacle avoidance | Diffuser | 11.7±3.7 | 0.0±0.0 |
Obstacle avoidance | LTLDoG-S | 100.0±0.0 | 80.0±8.1 |
Temporal behaviors | Diffuser | 0.0±0.0 | 0.0±0.0 |
Temporal behaviors | LTLDoG-S | 96.5±1.5 | 66.7±1.8 |

Method | Success rate (%), Planning | Success rate (%), Rollout C1 | Success rate (%), Rollout C2 | Reward (UnCon), C1 | Reward (UnCon), C2 |
---|---|---|---|---|---|
Diffuser | 9.5±3.1 | 11.0±1.7 | 15.0±2.6 | 142.2±5.1 | 92.1±3.0 |
SafeDiffuser | 99.4±0.9 | 12.3±3.5 | 14.2±2.8 | 135.8±5.2 | 131.9±5.4 |
LTLDoG-S | 99.0±0.8 | 73.0±3.0 | 90.6±0.6 | 97.3±2.8 | 36.2±11.7 |
LTLDoG-R | 98.8±0.8 | 92.0±1.4 | 99.6±0.6 | 127.1±5.1 | 76.4±13.4 |

Note: Planned trajectories are rolled out using two different low-level control strategies, denoted as C1 and C2.
Fig. 11–Fig. 13 present two more examples comparing our method, LTLDoG, with the baseline Diffusion Policy on the obstacle avoidance and temporal behavior tasks. These results demonstrate our method’s capability to guide diffusion models to follow not only purely spatial but also spatio-temporal constraints.
We also show in Fig. 14 more snapshots of a quadruped robot running in the simulated Lab environment with different policies (baseline against ours). The overall satisfaction rate of all raw generated trajectories across tasks and environments is summarized in Table VIII. The overall performance is (cf. baseline ) as reported in the main paper.
-D Failure case study
Fig. 10(a) shows an infeasible trajectory generated by LTLDoG-S. Our hypothesis is that posterior sampling in LTLDoG-S calculates gradients based on Euclidean distance in trajectory space, owing to the LTL evaluation, which may push the trajectory away from the data manifold. In contrast, the neural network in regressor guidance is trained on noised intermediate samples, which are linked to the data manifold. Gradient ascent in the noised space is therefore likely to push the noised sample towards a region where both the satisfaction value and the likelihood are high. Visualizations of the difference between the two controllers, shown in Fig. 10(b) and Fig. 10(c), suggest that selecting next locations sequentially without distance checking is the less safe option.
Low-Level Control Policy The success rate during rollout also depends on the specific controller used to visit each location of the generated trajectory. TABLE IX shows the comparative results between the two low-level control policies. A more conservative controller (denoted as C2) that visits every location exactly, with distance checking, achieves a higher success rate than one that moves greedily (C1, originally used in [1]).
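The qualitative difference between the two rollout strategies can be illustrated with a toy point-mass example. This is only a sketch under simplifying assumptions (a holonomic agent with a bounded step length, hypothetical function names and waypoints); it is not the controller implementation used in the experiments.

```python
import numpy as np

def step_towards(pos, target, max_step):
    """Move from pos towards target by at most max_step."""
    delta = target - pos
    dist = np.linalg.norm(delta)
    if dist <= max_step:
        return target.copy()
    return pos + delta * (max_step / dist)

def rollout_greedy(plan, max_step):
    """C1-like: switch to the next waypoint at every step, reached or not."""
    pos = plan[0].copy()
    path = [pos]
    for target in plan[1:]:
        pos = step_towards(pos, target, max_step)
        path.append(pos)
    return np.array(path)

def rollout_checked(plan, max_step, tol=1e-6, max_iters=10_000):
    """C2-like: keep tracking the current waypoint until it is within tol."""
    pos = plan[0].copy()
    path = [pos]
    for target in plan[1:]:
        for _ in range(max_iters):
            if np.linalg.norm(target - pos) <= tol:
                break  # waypoint reached, advance to the next one
            pos = step_towards(pos, target, max_step)
            path.append(pos)
    return np.array(path)

# Hypothetical L-shaped plan whose corner at (2, 0) skirts an obstacle.
plan = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [2.0, 1.0], [2.0, 2.0]])
greedy_path = rollout_greedy(plan, max_step=0.5)
checked_path = rollout_checked(plan, max_step=0.5)
corner = np.array([2.0, 0.0])
greedy_miss = np.min(np.linalg.norm(greedy_path - corner, axis=1))
checked_miss = np.min(np.linalg.norm(checked_path - corner, axis=1))
```

In this toy setting the greedy rollout lags behind the plan and cuts the corner, so it deviates from the planned path, while the distance-checked rollout passes through every waypoint at the cost of taking more steps.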