Inside PyTorch 2.7: The JAX Bridge and 5x Faster Pallas Attention

For the past few years, the machine learning community has been split across two dominant ecosystems. PyTorch has reigned supreme in research, eagerly adopted for its intuitive Pythonic design and dynamic computation graphs. Meanwhile, JAX has quietly powered some of the world's largest models inside Google and Anthropic, leveraging its functional purity, effortless sharding across TPU pods, and heavily optimized XLA compilation. Developers often had to choose between the flexibility of PyTorch and the raw, scalable power of JAX.

With the release of PyTorch 2.7, that binary choice is officially obsolete. The latest iteration of the framework introduces a groundbreaking experimental feature known as the JAX Bridge. This allows developers to seamlessly call JAX functions directly from within PyTorch or PyTorch/XLA computational graphs. But the interoperability does not stop there. PyTorch 2.7 also leverages this newfound bridge to ship a Pallas-based ragged paged attention kernel, effectively obliterating previous bottlenecks in variable-length sequence attention and delivering staggering performance gains of up to 5x.

In this deep dive, we will explore exactly how the JAX Bridge works under the hood, why compiling down to StableHLO makes this possible, and how the new Pallas attention kernel drastically changes the economics of Large Language Model serving.

Exploring the PyTorch to JAX Bridge

Before PyTorch 2.7, combining PyTorch and JAX in the same pipeline usually meant cumbersome workarounds. You would export the PyTorch model to ONNX, run it in a separate runtime, pull the resulting NumPy arrays back to host memory, and feed them into a JAX environment. Shuttling data across the host/device boundary in this manner destroyed performance, making the approach unviable for training loops or high-throughput inference.

The new JAX Bridge fundamentally changes this by operating at the compiler level. Instead of passing data between two separate runtime environments, PyTorch 2.7 allows you to embed JAX operations inside PyTorch's XLA graph. When PyTorch compiles the model, it absorbs the JAX function, resulting in a single, unified execution graph that runs natively on the underlying hardware.

If you are heavily invested in the PyTorch ecosystem but want to experiment with advanced JAX-based optimizers like Optax, the JAX Bridge allows you to do exactly that without rewriting your entire model architecture.

The Mechanics of StableHLO

To understand how this magic trick is performed, we need to look at StableHLO. HLO (High Level Operations) is the intermediate representation used by the XLA compiler. Historically, PyTorch (via the torch_xla module) and JAX both compiled their Python-level code down to HLO before handing it to the XLA backend for hardware-specific optimization on GPUs or TPUs.

StableHLO is a versioned dialect of this intermediate representation with backward-compatibility guarantees. Because both PyTorch 2.7 and JAX now target StableHLO, they speak the exact same intermediate language. When you invoke a JAX function inside a PyTorch script, the JAX Bridge intercepts it, lowers it to StableHLO, and stitches the resulting graph directly into the PyTorch XLA graph.

This zero-overhead integration means the XLA compiler sees one continuous graph. It can perform operator fusion, memory pre-allocation, and dead code elimination across the PyTorch-JAX boundary just as it would for a pure PyTorch model.
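You can inspect this shared language yourself from the JAX side: `jax.jit(...).lower(...)` exposes the StableHLO text that the bridge splices into the PyTorch XLA graph. A small sketch (the exact module text varies by JAX version):

```python
import jax
import jax.numpy as jnp

def activation(x):
    return jnp.sin(x) * 2.0

# AOT-lower the jitted function and inspect its StableHLO text
lowered = jax.jit(activation).lower(jnp.ones((4,)))
hlo_text = lowered.as_text()
print(hlo_text.splitlines()[0])  # the StableHLO module header
```

This is the artifact that crosses the bridge: not Python objects, not NumPy buffers, but a compiler-level module the XLA backend can fuse with everything around it.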

Implementing the JAX Bridge in Practice

Let us look at what this looks like in code. In this scenario, we have a PyTorch tensor, but we want to apply a highly optimized custom JAX function to it before continuing with PyTorch operations.

import torch
import torch_xla.core.xla_model as xm
import torch_xla.experimental.jax_bridge as jb
import jax
import jax.numpy as jnp

# Define a pure JAX function
@jax.jit
def complex_jax_activation(x):
    # A custom activation function utilizing JAX primitives
    return jnp.sin(x) * jnp.exp(-0.1 * jnp.square(x))

# Initialize a PyTorch tensor on an XLA device (TPU or GPU)
device = xm.xla_device()
pt_tensor = torch.randn(1024, 1024, device=device, requires_grad=True)

# Convert the PyTorch tensor to a JAX-compatible representation via the bridge
# This does NOT copy data; it creates a zero-copy shared view
jax_array = jb.pt_to_jax(pt_tensor)

# Apply the JAX function
jax_result = complex_jax_activation(jax_array)

# Convert the result back to a PyTorch tensor
pt_result = jb.jax_to_pt(jax_result)

# Continue with standard PyTorch operations
final_output = torch.nn.functional.linear(pt_result, torch.randn(1024, 1024, device=device))
final_output.sum().backward()

# Execute the XLA graph
xm.mark_step()

The beauty of this code is what does not happen. There are no expensive CPU round-trips. There is no host-device synchronization halting your training loop. The `pt_to_jax` and `jax_to_pt` functions merely wrap the underlying buffer pointers, allowing the XLA compiler to merge the graphs seamlessly.

Keep in mind that the JAX Bridge is currently marked as experimental. While zero-copy data passing works remarkably well on TPUs and is improving on NVIDIA GPUs, edge cases in distributed sharding environments may still require manual intervention. Always profile your graph compilation times when mixing frameworks.

The LLM Inference Memory Wall

While the JAX Bridge is an incredible architectural feat, PyTorch 2.7 immediately puts it to practical use by solving one of the most painful problems in modern AI serving. To understand the significance of the new Pallas attention kernel, we must first examine the memory wall inherent in Large Language Model inference.

When serving LLMs in production, requests arrive continuously with vastly different sequence lengths. One user might submit a prompt that is 15 tokens long, while another submits a massive document containing 8000 tokens. Traditional batching mechanisms require all sequences in a batch to be the same length, forcing inference engines to append useless padding tokens to the shorter sequences.

This padding is a catastrophic waste of compute and memory. Because standard self-attention has quadratic time and space complexity with respect to sequence length, processing thousands of padding tokens rapidly exhausts the GPU or TPU memory. Furthermore, the Key-Value cache must allocate contiguous blocks of memory for the maximum possible sequence length. As sequences grow dynamically during autoregressive generation, this contiguous memory requirement leads to severe fragmentation.
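The waste is easy to quantify with plain Python arithmetic, using the sequence lengths that appear elsewhere in this article (15, 800, 3281, and 8000 tokens) batched together:

```python
# Back-of-envelope cost of padding a ragged batch to its longest sequence
seq_lens = [15, 800, 3281, 8000]
max_len = max(seq_lens)

real_tokens = sum(seq_lens)
padded_tokens = len(seq_lens) * max_len
padding_fraction = 1 - real_tokens / padded_tokens

# Attention is quadratic, so the padded batch pays max_len^2 per sequence
real_attn = sum(n * n for n in seq_lens)
padded_attn = len(seq_lens) * max_len * max_len
attn_waste = 1 - real_attn / padded_attn

print(f"{padding_fraction:.0%} of tokens are padding")  # 62% of tokens are padding
print(f"{attn_waste:.0%} of attention work is wasted")  # 71% of attention work is wasted
```

Even before fragmentation enters the picture, a majority of the accelerator's work on this batch is spent on tokens that carry no information.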

Demystifying Ragged Paged Attention

The community has previously tackled these issues with two distinct innovations. Ragged batching eliminates padding by flattening sequences of different lengths into a single continuous sequence, keeping track of the boundaries. Paged attention solves memory fragmentation by dividing the Key-Value cache into fixed-size blocks (pages), much like a computer operating system manages virtual memory. This allows the KV cache to be allocated non-contiguously, virtually eliminating fragmentation.
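To make the paging analogy concrete, here is a minimal pure-Python sketch of a paged KV-cache allocator. It is a hypothetical toy, not torch_xla's implementation: pages come from a free list, and a per-sequence block table maps logical token positions to physical pages.

```python
# Toy paged KV-cache allocator: OS-style paging for sequence caches.
# Illustration only -- a real engine tracks tensors, not integer ids.
PAGE_SIZE = 16  # tokens per page

class PagedKVCache:
    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))
        self.block_tables = {}  # seq_id -> list of physical page ids

    def append_token(self, seq_id, pos):
        """Return (page, slot) for token `pos`, allocating on page boundaries."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos % PAGE_SIZE == 0:          # first token of a new logical page
            table.append(self.free_pages.pop())
        return table[pos // PAGE_SIZE], pos % PAGE_SIZE

    def release(self, seq_id):
        """Return a finished sequence's pages to the free list."""
        self.free_pages.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_pages=1000)
for pos in range(40):                     # a 40-token sequence
    cache.append_token("req-0", pos)
print(cache.block_tables["req-0"])        # three non-contiguous pages
```

Because pages are fixed-size and recycled through the free list, sequences grow one page at a time and fragmentation never accumulates, no matter how uneven the batch.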

Combining these two techniques into Ragged Paged Attention is the holy grail of LLM serving. However, writing a kernel that efficiently handles both ragged boundaries and non-contiguous memory lookups is incredibly complex. Standard PyTorch primitives cannot express this efficiently, and writing custom C++ or CUDA kernels locks you into specific hardware vendors.

Why Pallas Trumps Custom Kernels

This is where Pallas enters the picture. Pallas is an open-source kernel language embedded in JAX. It serves a similar purpose to OpenAI's Triton but targets both Google TPUs and GPUs: on TPU it lowers through XLA's Mosaic backend, while on GPU it lowers to Triton under the hood. Pallas allows developers to write low-level hardware code, manage on-chip SRAM explicitly, and define block-level matrix multiplications using familiar Python syntax.
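To give a flavor of the programming model, here is a toy element-wise Pallas kernel, nowhere near the attention kernel's complexity. `interpret=True` runs it in interpreter mode so the sketch works even without a TPU or GPU attached:

```python
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl

def scaled_add_kernel(x_ref, y_ref, o_ref):
    # Refs point at blocks staged in fast on-chip memory;
    # reads and writes use plain NumPy-style indexing
    o_ref[...] = x_ref[...] + 2.0 * y_ref[...]

def scaled_add(x, y):
    return pl.pallas_call(
        scaled_add_kernel,
        out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),
        interpret=True,  # interpreter mode: lets the example run on CPU
    )(x, y)

x = jnp.arange(4.0)
result = scaled_add(x, x)
print(result)  # [0. 3. 6. 9.]
```

A production kernel adds a grid, BlockSpecs, and explicit memory-space management on top of this skeleton, but the mental model stays the same: Python functions operating on on-chip blocks.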

Because PyTorch 2.7 now features the JAX Bridge, PyTorch developers gain native access to Pallas kernels. The PyTorch team collaborated with Google engineers to write a state-of-the-art Ragged Paged Attention kernel entirely in Pallas. By exposing this via the JAX Bridge, PyTorch 2.7 users can now leverage TPU-optimized, heavily custom attention mechanisms without writing a single line of C++.

This Pallas-based kernel handles the complex pointer arithmetic required to fetch non-contiguous KV cache pages and correctly mask the ragged sequence boundaries in a single, fused operation. By doing all of this within the fast SRAM of the accelerator, the kernel avoids multiple slow reads from High Bandwidth Memory.
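In schematic Python (not the kernel itself), the pointer arithmetic amounts to walking a sequence's block table to translate logical token positions into physical (page, slot) addresses:

```python
# Schematic of the gather step: map a sequence's logical token positions
# to (physical page, slot) pairs via its block table
PAGE_SIZE = 16

def gather_kv_positions(block_table, context_len):
    positions = []
    for pos in range(context_len):
        page = block_table[pos // PAGE_SIZE]  # non-contiguous physical page
        slot = pos % PAGE_SIZE                # offset within that page
        positions.append((page, slot))
    return positions

# A 35-token context spread over three non-adjacent pages
locs = gather_kv_positions([907, 12, 488], context_len=35)
print(locs[0], locs[16], locs[34])  # (907, 0) (12, 0) (488, 2)
```

The kernel performs this lookup with vectorized index arithmetic inside SRAM, fused with the masking and the attention matmuls, instead of materializing any of it in main memory.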

Unpacking the 5x Speedup

The headline feature of this release is the claim of up to 5x speedups. This number is not a marketing gimmick; it is grounded in the elimination of specific computational bottlenecks. The acceleration comes from three distinct optimizations working in tandem.

  • The complete elimination of padding tokens ensures that the accelerator's matrix multiplication units are only crunching mathematically meaningful data.
  • Paged memory management allows batch sizes to be scaled up massively since memory is no longer hoarded by fragmented, pre-allocated contiguous blocks.
  • The fused nature of the Pallas kernel prevents the intermediate attention matrices from ever being written back to main memory, keeping the operation compute-bound rather than memory-bandwidth-bound.
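A quick calculation shows what that last point saves. For the 8000-token document mentioned earlier, with 32 attention heads in float32, the intermediate attention matrix alone runs to several gibibytes per layer:

```python
# Size of the intermediate attention matrix a fused kernel never
# materializes in HBM: one [seq_len, seq_len] score matrix per head
seq_len, num_heads, bytes_per_el = 8000, 32, 4
attn_matrix_bytes = seq_len * seq_len * num_heads * bytes_per_el
print(f"{attn_matrix_bytes / 2**30:.1f} GiB per layer")  # 7.6 GiB per layer
```

Multiply that by the layer count of a modern LLM and the round-trips to High Bandwidth Memory dwarf the cost of the matrix multiplications themselves.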

When you combine these three factors on workloads with highly variable sequence lengths (such as chat applications or agentic workflows), the end-to-end throughput of the inference server can easily quintuple compared to naive, padded attention implementations.

Utilizing the Pallas Attention Kernel

Integrating the new kernel into a custom LLM serving script requires setting up the paged KV cache tensors and passing the sequence metadata. Here is a simplified look at how the API is structured in PyTorch 2.7.

import torch
import torch_xla.core.xla_model as xm
from torch_xla.experimental.custom_kernel import ragged_paged_attention

# Assume we have pre-computed queries, and a pre-allocated paged KV cache
device = xm.xla_device()

# Q: [total_tokens_in_batch, num_heads, head_dim] (Ragged, no padding!)
queries = torch.randn(4096, 32, 128, device=device)

# KV Cache: [num_pages, 2, num_kv_heads, page_size, head_dim]
# Pages are allocated dynamically and non-contiguously
kv_cache = torch.randn(1000, 2, 8, 16, 128, device=device)

# Metadata arrays tracking sequence lengths and block tables mapping sequences to pages
context_lens = torch.tensor([15, 800, 3281], dtype=torch.int32, device=device)
block_tables = torch.randint(0, 1000, (3, 256), dtype=torch.int32, device=device)

# Execute the Pallas-based kernel via the XLA bridge
# This single call replaces dozens of lines of complex masking and slicing
output = ragged_paged_attention(
    query=queries,
    key_value_cache=kv_cache,
    context_lengths=context_lens,
    block_tables=block_tables,
    max_context_len=4096
)

# Output is perfectly sized [total_tokens_in_batch, num_heads, head_dim]
xm.mark_step()

This unified API hides immense complexity. Behind the scenes, `torch_xla` uses the JAX Bridge to invoke the Pallas kernel, lowers it to StableHLO, and executes it at near-hardware limits. For teams building custom inference infrastructure, this drastically reduces the barrier to entry for achieving state-of-the-art serving performance.

While the ragged paged attention kernel is heavily optimized for TPU v4 and v5e hardware, the underlying StableHLO compilation ensures that the exact same code will run on standard GPU clusters, making your inference stack truly hardware agnostic.

Real World Implications for AI Teams

The release of PyTorch 2.7 is not just another incremental update. It is a fundamental shift in how ML engineering teams will structure their technology stacks over the next few years. The implications ripple across the entire lifecycle of model development.

  • Teams no longer have to maintain separate codebases for PyTorch-based research and JAX-based production scaling.
  • Infrastructure costs for LLM inference will plummet as the paged attention kernel allows for significantly higher concurrency on existing hardware.
  • Developers can seamlessly integrate specialized libraries from the JAX ecosystem, such as complex RL environments or advanced sampling algorithms, directly into their PyTorch training loops.
  • Hardware lock-in is severely mitigated since Pallas kernels natively target both TPUs and GPUs, giving organizations the freedom to shop around for the most cost-effective compute clusters.

We are likely to see an explosion of hybrid architectures in the coming months. Researchers will train massive foundation models using PyTorch's user-friendly distributed frameworks while leaning on JAX for custom hardware-level optimizations that previously demanded hand-written C++ or CUDA.

Looking Ahead to the Framework Agnostic Future

PyTorch 2.7 effectively signals the end of the framework wars. By embracing StableHLO and building a robust bridge to JAX, the PyTorch team has acknowledged that the future of machine learning is collaborative, not competitive. The integration of the Pallas-based ragged paged attention kernel is just the first proof of concept for this new paradigm.

As the JAX Bridge matures out of its experimental phase, we can expect the boundary between these two ecosystems to blur entirely. For developers and AI architects, the message is clear. The days of choosing between PyTorch's flexibility and JAX's performance are over. You can now have both, wrapped in a 5x speedup, ready for the next generation of scalable AI applications.