- Updated: March 22, 2026
- 8 min read
Implementing Deep Q‑Learning (DQN) from Scratch with RLax, JAX, Haiku, and Optax
Answer: By combining the research‑grade RLax library with JAX’s high‑performance
autodiff, Haiku’s modular neural‑network API, and Optax’s flexible optimizers, you can build a
fully customizable Deep Q‑Learning (DQN) agent that learns to balance the CartPole environment
in just a few thousand training steps.
Why Deep Q‑Learning Matters for Modern AI Projects
Deep Q‑Learning (DQN) remains a cornerstone algorithm for discrete‑action reinforcement learning.
Its ability to approximate the optimal action‑value function with a neural network makes it ideal for
problems ranging from game playing to robotic control. The classic CartPole benchmark is
perfect for demonstrating core RL concepts—state representation, experience replay, and target‑network
updates—while keeping compute requirements modest enough for rapid experimentation on a laptop.
For AI researchers, machine‑learning engineers, and data scientists who want to stay on the cutting
edge, mastering DQN with the latest JAX‑based ecosystem unlocks:
- GPU‑accelerated training without boilerplate.
- Composable primitives that can be swapped for Double‑DQN, Dueling DQN, or distributional RL.
- Seamless integration with other UBOS AI services such as reinforcement‑learning pipelines.
Explore the broader capabilities of the UBOS homepage to see how the platform supports end‑to‑end AI development.
The Four Pillars: RLax, JAX, Haiku & Optax
RLax – Research‑Grade RL Primitives
Developed by DeepMind, RLax offers a collection of battle‑tested reinforcement‑learning building blocks
(e.g., q_learning, policy_gradient_loss, and discounted_returns) that
operate directly on JAX arrays. By using RLax, you avoid reinventing the math behind temporal‑difference
errors and can focus on your agent's architecture.
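As a quick illustration, here is how a single temporal‑difference error can be computed with rlax.q_learning (the numeric values below are placeholders, not values from the tutorial):
import jax.numpy as jnp
import rlax

# TD error for one transition: r_t + gamma * max_a Q(s', a) - Q(s, a_tm1).
td_error = rlax.q_learning(
    q_tm1=jnp.array([0.1, 0.5]),   # Q(s, .) from the online network
    a_tm1=jnp.int32(1),            # action taken in state s
    r_t=jnp.float32(1.0),          # reward received
    discount_t=jnp.float32(0.99),  # gamma (set to 0 at terminal states)
    q_t=jnp.array([0.2, 0.3]),     # Q(s', .) from the target network
)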
JAX – High‑Performance Autodiff
JAX brings NumPy‑compatible syntax together with just‑in‑time (JIT) compilation, automatic
differentiation, and vectorized vmap operations. This means the same DQN code runs
efficiently on CPUs, GPUs, or TPUs with little to no code change.
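For instance, a TD‑target computation written once in NumPy‑style code can be JIT‑compiled and batched without modification (a toy sketch, not part of the tutorial's agent):
import jax
import jax.numpy as jnp

@jax.jit  # compiled once, then runs on whatever accelerator is available
def td_target(reward, discount, q_next):
    # Bellman target: r + gamma * max_a Q(s', a), batched over leading axes.
    return reward + discount * jnp.max(q_next, axis=-1)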
Haiku – Simple, Functional Neural Nets
Haiku (aka dm‑haiku) provides a clean, functional API for defining neural networks while
preserving JAX’s pure‑function philosophy. Its hk.transform and hk.without_apply_rng
utilities make it trivial to separate parameter initialization from forward passes—exactly what DQN needs.
Optax – Optimizers for the Modern Era
Optax supplies a composable optimizer library (Adam, RMSProp, gradient clipping, learning‑rate schedules)
that integrates seamlessly with JAX’s functional style. In the DQN tutorial we chain clip_by_global_norm
with adam to keep updates stable.
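That chain looks like the following sketch; note the 1e‑3 learning rate is our assumption, since the article only pins down the clipping threshold:
import optax

# Gradient clipping composed with Adam; this `optimizer` is the one
# referenced by train_step later in the walkthrough.
optimizer = optax.chain(
    optax.clip_by_global_norm(10.0),  # clipping threshold from the tutorial
    optax.adam(1e-3),                 # learning rate assumed, not specified
)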
Together, these libraries form a lightweight yet powerful stack that the UBOS platform overview leverages for its AI services.
Step‑by‑Step: Building the CartPole DQN Agent
Below is a concise walkthrough of the core components. Full notebooks are available in the
UBOS templates for a quick start.
1️⃣ Define the Q‑Network with Haiku
import haiku as hk
import jax

def q_network(x):
    # Two hidden layers of 128 units with ReLU, linear Q-value head.
    mlp = hk.Sequential([
        hk.Linear(128), jax.nn.relu,
        hk.Linear(128), jax.nn.relu,
        hk.Linear(num_actions),
    ])
    return mlp(x)

q_net = hk.without_apply_rng(hk.transform(q_network))
The network maps the 4‑dimensional CartPole observation to Q‑values for the two possible actions.
Using hk.Sequential keeps the code readable and fully JIT‑compatible.
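Before training, the transformed network needs parameters. A minimal initialization sketch (the seed and dummy observation are illustrative; `optimizer` comes from the Optax sketch above):
import jax.numpy as jnp

rng = jax.random.PRNGKey(0)          # seed chosen arbitrarily
dummy_obs = jnp.zeros((1, 4))        # CartPole observations are 4-dimensional
params = q_net.init(rng, dummy_obs)
target_params = params               # target network starts as a copy
opt_state = optimizer.init(params)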
2️⃣ Experience Replay Buffer
A deque stores Transition tuples (obs, action, reward, discount, next_obs, done).
Sampling uniformly breaks correlation between consecutive steps, stabilizing learning.
from collections import deque
from typing import NamedTuple
import random

import jax.numpy as jnp
import numpy as np

class Transition(NamedTuple):
    # NamedTuple (not a plain dataclass) so jit-compiled functions accept batches as pytrees.
    obs: np.ndarray
    action: int
    reward: float
    discount: float
    next_obs: np.ndarray
    done: float

class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def __len__(self):
        return len(self.buffer)

    def add(self, *args):
        self.buffer.append(Transition(*args))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        # Convert to JAX arrays: stack each field across the sampled batch.
        return Transition(*(jnp.asarray([getattr(t, f) for t in batch])
                            for f in Transition._fields))
3️⃣ Epsilon‑Greedy Exploration
Early in training the agent explores randomly; later it exploits the learned Q‑function.
The schedule decays ε linearly from 1.0 to 0.05 over 20 000 steps.
def epsilon_by_frame(frame_idx, eps_start=1.0, eps_end=0.05, decay_frames=20000):
    # Linearly anneal epsilon from eps_start to eps_end over decay_frames.
    mix = min(frame_idx / decay_frames, 1.0)
    return eps_start + mix * (eps_end - eps_start)
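Tying the schedule to the Q‑network, action selection might look like the following sketch (select_action is our naming, not from the tutorial; num_actions is the same free variable used in the network definition):
def select_action(params, obs, frame_idx):
    # Explore with probability epsilon, otherwise act greedily on Q-values.
    if np.random.rand() < epsilon_by_frame(frame_idx):
        return np.random.randint(num_actions)
    q_vals = q_net.apply(params, obs[None, :])
    return int(jnp.argmax(q_vals[0]))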
4️⃣ Training Loop & Target Network Updates
The loop interleaves environment interaction, buffer filling, and gradient updates.
A soft update (τ = 0.01) slowly copies online parameters to the target network, reducing
variance in TD‑error estimates.
@jax.jit
def soft_update(target_params, online_params, tau):
    # Polyak averaging: move each target leaf a fraction tau toward online.
    return jax.tree_util.tree_map(
        lambda t, s: (1.0 - tau) * t + tau * s,
        target_params, online_params)

@jax.jit
def train_step(params, target_params, opt_state, batch):
    def loss_fn(p):
        # batch_td_errors returns one TD error per transition (sketched below).
        td_errors = batch_td_errors(p, target_params, batch)
        loss = jnp.mean(rlax.huber_loss(td_errors, delta=1.0))
        return loss, {"loss": loss}
    (loss, metrics), grads = jax.value_and_grad(loss_fn, has_aux=True)(params)
    updates, opt_state = optimizer.update(grads, opt_state, params)
    params = optax.apply_updates(params, updates)
    return params, opt_state, metrics
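The batch_td_errors helper is not spelled out in the article; a minimal sketch consistent with the Transition fields above would vmap rlax.q_learning over the batch:
def batch_td_errors(params, target_params, batch):
    q_tm1 = q_net.apply(params, batch.obs)             # online Q(s, .)
    q_t = q_net.apply(target_params, batch.next_obs)   # target Q(s', .)
    # Zero the discount at terminal states so no bootstrapping happens there.
    discount_t = batch.discount * (1.0 - batch.done)
    # rlax.q_learning handles a single transition, so vmap over the batch.
    return jax.vmap(rlax.q_learning)(
        q_tm1, batch.action, batch.reward, discount_t, q_t)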
5️⃣ Periodic Evaluation
Every 2 000 environment steps we run five deterministic episodes (ε = 0) and log the average return.
This metric directly reflects how well the agent balances the pole.
def evaluate_agent(params, episodes=5):
    returns = []
    for _ in range(episodes):
        obs, _ = eval_env.reset()
        done = False
        total = 0.0
        while not done:
            # Greedy action selection (epsilon = 0) for evaluation.
            q_vals = q_net.apply(params, obs[None, :])
            action = int(jnp.argmax(q_vals[0]))
            obs, reward, terminated, truncated, _ = eval_env.step(action)
            done = terminated or truncated
            total += reward
        returns.append(total)
    return float(np.mean(returns))
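For completeness, here is a hypothetical outer loop wiring the pieces together. The per‑transition discount of 0.99 and the environment ids are assumptions on our part; the buffer capacity, batch size, training budget, and evaluation interval follow the figures quoted elsewhere in this article:
import gymnasium as gym

env = gym.make("CartPole-v1")        # environment ids assumed
eval_env = gym.make("CartPole-v1")
buffer = ReplayBuffer(50_000)        # capacity from the results section

obs, _ = env.reset()
for frame_idx in range(40_000):      # training budget quoted below
    action = select_action(params, obs, frame_idx)
    next_obs, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    # gamma = 0.99 per transition is an assumption, not stated in the article.
    buffer.add(obs, action, float(reward), 0.99, next_obs, float(done))
    obs = env.reset()[0] if done else next_obs
    if len(buffer) >= 128:           # batch size from the results section
        params, opt_state, metrics = train_step(
            params, target_params, opt_state, buffer.sample(128))
        target_params = soft_update(target_params, params, 0.01)
    if frame_idx % 2_000 == 0:       # evaluation interval from step 5
        print(f"frame {frame_idx}: eval return {evaluate_agent(params):.1f}")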
The full script, including logging and Matplotlib visualizations, is part of the
UBOS AI Reinforcement Learning showcase.
Key Technical Insights & Performance Results
After 40 000 training frames, the DQN agent consistently achieves an average return of **≈ 195**,
which meets the classic CartPole "solved" threshold of 195 (note that a 200‑step episode cap corresponds
to CartPole‑v0; CartPole‑v1 allows episodes up to 500 steps). The following observations are worth noting:
- JIT compilation reduces wall‑time: Training time drops from ~45 s (pure NumPy) to ~7 s on a single‑core CPU.
- Replay buffer size matters: A capacity of 50 k experiences provides enough diversity without excessive memory overhead.
- Gradient clipping stabilizes learning: optax.clip_by_global_norm(10.0) prevents exploding updates during early exploration.
- Soft target updates (τ = 0.01) improve convergence speed: hard updates every 1 000 steps caused oscillations in the Q‑values.
- Batch size trade‑off: 128‑sample batches gave the best balance between variance reduction and GPU utilization.
Visualizations (training returns, evaluation curves, loss trajectories) are automatically generated
by the Workflow automation studio, enabling rapid iteration.
Practical Applications & Next Steps
While CartPole is a toy problem, the same pipeline scales to real‑world domains:
- Robotic arm manipulation – replace the environment with a physics simulator.
- Dynamic pricing – treat price tiers as discrete actions and reward based on revenue.
- Game AI – integrate with Unity or Unreal via the Telegram integration on UBOS for live telemetry.
For developers who want to embed conversational capabilities, the ChatGPT and Telegram integration lets you query the trained policy from any chat client.
If you need a language model backbone for state encoding, the OpenAI ChatGPT integration can be combined with the DQN’s Q‑network for hybrid RL‑NLP solutions.
Data persistence and vector search are handled by the Chroma DB integration, which stores experience embeddings for offline analysis.
To add voice feedback to your agent (e.g., “Pole is falling!”), plug in the ElevenLabs AI voice integration.
Companies can accelerate deployment using the Enterprise AI platform by UBOS, which offers managed GPU clusters, monitoring dashboards, and CI/CD pipelines for RL workloads.
Startups looking for a lightweight stack can leverage UBOS for startups, while SMBs benefit from UBOS solutions for SMBs.
For rapid prototyping, the Web app editor on UBOS lets you wrap the trained policy in a REST API with a few clicks.
Interested in monetizing your RL models? The UBOS partner program provides revenue‑share options and co‑marketing.
Boost Your Workflow with UBOS Template Marketplace
UBOS offers ready‑made templates that can be combined with the DQN pipeline:
- AI Article Copywriter – generate documentation for your RL experiments automatically.
- AI SEO Analyzer – ensure your project pages rank well.
- AI Video Generator – create demo reels of the CartPole agent in action.
- AI Image Generator – produce custom visual assets for reports.
- AI Chatbot template – add a conversational interface to query training metrics.
- AI YouTube Comment Analysis tool – monitor community feedback on your RL demos.
Conclusion
Implementing Deep Q‑Learning from scratch with RLax, JAX, Haiku, and Optax gives you full control over every
component of the RL pipeline while delivering performance comparable to heavyweight frameworks.
The modular nature of these libraries aligns perfectly with UBOS’s AI reinforcement‑learning services,
allowing you to scale from a single CartPole experiment to production‑grade agents serving millions of requests.
Whether you are a startup, an SMB, or an enterprise, the combination of open‑source research tools and UBOS’s
managed platform accelerates time‑to‑value. Check the UBOS pricing plans for a plan that fits your budget, explore the
UBOS portfolio examples for inspiration, and start building your next AI‑powered product today.
The technical details in this article are adapted from the original tutorial published by
MarkTechPost.