
Re-Mix: Optimizing Data Mixtures for Large Scale Imitation Learning
Paper Notes, Originally by Hejna et al.
April 2025
Abstract
This paper presents a methodology for optimizing the data mixture used to pre-train robotic foundation models, specifically using distributionally robust optimization (DRO) to weight different subsets, or domains, of robotics datasets. The authors address challenges such as the variability in action spaces and dynamics across datasets. They experiment on the Open X-Embodiment dataset, demonstrating that data curation can heavily impact the performance of the final model.
Introduction
The action quality and visual diversity of data are extremely important for improving the performance of a robotic foundation model, but many foundation models, such as the recently released Octo and OpenVLA, are trained on a subset of OpenX whose composition was chosen subjectively based on "interestingness" rather than a rigorous framework.
This curation requires extensive domain knowledge and is unlikely to scale to rapidly growing datasets.
Re-weighting Data Mixtures with Minimax Optimization
Consider the imitation learning problem with a dataset $\mathcal{D} = \{\tau_1, \ldots, \tau_N\}$, where each $\tau = (s_0, a_0, s_1, a_1, \ldots)$ is a state-action trajectory.
The goal is to learn a policy $\pi(a \mid s)$ that maps states to actions. This is typically done through imitation learning algorithms such as behavior cloning, which minimizes the negative log-likelihood of the dataset actions under the policy:
$$\min_\pi \; \mathbb{E}_{(s,a) \sim \mathcal{D}} \left[ -\log \pi(a \mid s) \right]$$
In other words, behavior cloning is simply a supervised learning problem: collect state-action pairs and train a neural network to predict the action given the state.
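Below is a minimal behavior-cloning sketch in PyTorch, assuming a small MLP policy over discretized actions; the state dimension, bin count, network size, and optimizer settings are illustrative placeholders rather than values from the paper.

```python
# Minimal behavior-cloning sketch: supervised learning of pi(a|s).
# All dimensions and hyperparameters below are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, NUM_ACTION_BINS = 10, 256  # hypothetical state size and action bins

policy = nn.Sequential(
    nn.Linear(STATE_DIM, 128), nn.ReLU(), nn.Linear(128, NUM_ACTION_BINS)
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def bc_loss(states: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the dataset actions under the policy."""
    logits = policy(states)                  # unnormalized log pi(a|s)
    return F.cross_entropy(logits, actions)  # = E[-log pi(a|s)] over the batch

# One gradient step on a stand-in batch of state-action pairs.
states = torch.randn(32, STATE_DIM)
actions = torch.randint(0, NUM_ACTION_BINS, (32,))
loss = bc_loss(states, actions)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```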
For a given dataset $\mathcal{D}$, it is assumed that it can be split into $k$ domains, $\mathcal{D} = \mathcal{D}_1 \cup \cdots \cup \mathcal{D}_k$, where each domain $\mathcal{D}_i$ may differ in its state space ($\mathcal{S}$), action space ($\mathcal{A}$), transition dynamics (what movement is possible), or data distribution.
This diversity is useful for training a model that generalizes to different environments.
The goal is to learn a weighting vector $w \in \Delta^k$ that specifies a probability distribution over the domains, such that a model trained on a data mixture weighted by $w$ attains maximum performance across all domains without overfitting to any single one.
Distributionally Robust Optimization (DRO) has the minimax objective
$$\min_\pi \; \max_{w \in \Delta^k} \; \sum_{i=1}^{k} w_i \, \mathcal{L}_i(\pi), \quad \text{where } \mathcal{L}_i(\pi) = \mathbb{E}_{(s,a) \sim \mathcal{D}_i}\left[ -\log \pi(a \mid s) \right],$$
where the inner maximization over $w$ upweights the domains with higher loss, while the outer minimization over $\pi$ drives the loss down on exactly those worst-performing domains.
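A tiny numeric sketch of the inner maximization, with stand-in loss values: because the weighted loss is linear in $w$, the worst case over the simplex puts all mass on the highest-loss domain.

```python
# Inner maximization of the group-DRO objective: max_{w in simplex} sum_i w_i * L_i.
# A linear function over the simplex is maximized at a vertex, i.e. all weight on
# the worst domain. Loss values are stand-ins, not numbers from the paper.
import numpy as np

def worst_case_weighted_loss(domain_losses: np.ndarray) -> float:
    w = np.zeros_like(domain_losses)
    w[np.argmax(domain_losses)] = 1.0  # all mass on the highest-loss domain
    return float(w @ domain_losses)

print(worst_case_weighted_loss(np.array([0.4, 1.2, 0.7])))  # -> 1.2
```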
However, in practice we may not be interested in simply fitting the domains with the highest loss values.
We can instead optimize:
$$\min_\pi \; \max_{w \in \Delta^k} \; \sum_{i=1}^{k} w_i \left( \mathcal{L}_i(\pi) - \mathcal{L}_i(\pi_{\text{ref}}) \right),$$
where $\pi_{\text{ref}}$ is a reference policy trained to convergence on an initial guess of the domain weights: uniform random sampling from all of $\mathcal{D}$, which is equivalent to weighting each domain in proportion to its size.
This downweights domains that are difficult to fit relative to the reference policy, so only domains with a high excess loss (i.e., where the policy can still improve to match the reference) are up-weighted.
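A small sketch of the excess-loss computation, assuming per-domain losses for the current policy and the converged reference policy are already available; the numbers are placeholders chosen so that a domain that is hard even for the reference ends up with a negative excess loss.

```python
# Excess loss per domain: L_i(pi) - L_i(pi_ref). Domains that are hard even for the
# reference policy get a small or negative excess loss and are therefore not upweighted.
# All loss values below are illustrative placeholders.
import numpy as np

def excess_losses(policy_losses: np.ndarray, reference_losses: np.ndarray) -> np.ndarray:
    return policy_losses - reference_losses

policy_losses = np.array([0.9, 1.5, 0.6])
reference_losses = np.array([0.8, 1.4, 0.9])  # third domain is hard even for pi_ref
print(excess_losses(policy_losses, reference_losses))  # -> [ 0.1  0.1 -0.3]
```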
Re-Mix
- Action pre-processing: apply Gaussian normalization to the actions of every domain $\mathcal{D}_i$ (which may have different action spaces and dynamics), then discretize the actions via binning into a fixed number of discrete intervals (bins).
- Reference model: train on a uniform mixture of domains, where each domain is weighted in proportion to its size (i.e., uniform random sampling over $\mathcal{D}$). The reference model checkpoint is chosen by the lowest validation loss.
- Group DRO: perform Group Distributionally Robust Optimization, learning a set of domain weights $w$ through the robust excess-loss objective above with a discrete policy $\pi$.
- Again, the inner maximization over $w$ targets the domains with the greatest excess (divergence) loss w.r.t. the reference policy, and the outer minimization drives that excess loss down.
- The weights $w$ are updated via exponentiated gradient steps: each $w_i$ is multiplied by the exponential of its learning-rate-scaled gradient and $w$ is renormalized onto the simplex (see the sketch after this list).
- After training, the average value of $w$ over the course of training, $\bar{w}$, is used to reweight the different domains or subsets for final policy training.
Then, $\bar{w}$ can be used to probabilistically sample domains from the dataset during policy training.
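A rough sketch of the weight update and final sampling, assuming the per-domain excess losses are already computed at each step; the step size, domain count, and loss values are placeholders, and the exponentiated-gradient form follows standard Group DRO practice rather than the paper's exact implementation.

```python
# Exponentiated-gradient updates of the domain weights w, followed by sampling
# domains with the time-averaged weights w_bar. All numbers are placeholders.
import numpy as np

rng = np.random.default_rng(0)
num_domains, eta, steps = 4, 0.1, 100
w = np.full(num_domains, 1.0 / num_domains)  # start from uniform domain weights
w_history = []

for _ in range(steps):
    # Stand-in per-domain excess losses; in practice these come from the policy
    # and reference-policy losses on a minibatch from each domain.
    excess = rng.normal([0.2, 0.0, 0.4, -0.1], 0.05)
    w = w * np.exp(eta * excess)   # multiply by the exponentiated, step-size-scaled gradient
    w = w / w.sum()                # renormalize back onto the simplex
    w_history.append(w)

w_bar = np.mean(w_history, axis=0)  # average weights over the course of training
w_bar = w_bar / w_bar.sum()         # guard against floating-point drift
print("mixture weights:", w_bar)

# Use w_bar to probabilistically pick which domain each training batch comes from.
domain_idx = rng.choice(num_domains, p=w_bar)
```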