This document outlines the HLO optimization and transformation passes in the XLA compiler.
Introduction
A single HLO pass can comprise one or more compiler optimizations and transformations, and XLA provides several hundred such passes. HLO represents only the shape (e.g. a 3x4 matrix) and the operation semantics of arrays, which makes optimizations and transformations easier to express.
For example:
AlgebraicSimplifier: A pass that performs a number of mostly arithmetic simplifications and optimizations. For example, a division by a constant is rewritten as a multiplication by the reciprocal of that constant (see the sketch after this list).
HloRematerialization: A pass that recomputes selected expressions in the computation to reduce memory pressure caused by long live ranges of array-shaped values.
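For illustration, the following is a minimal sketch of parsing a module that divides by a constant and running AlgebraicSimplifier on it. The include paths and the default-constructed AlgebraicSimplifierOptions shown here are assumptions and may differ between XLA versions.

```cpp
// Minimal sketch: parse a small HLO module and run AlgebraicSimplifier on it.
// Include paths are assumptions; they have moved between XLA releases.
#include <memory>
#include <utility>

#include "absl/status/statusor.h"
#include "absl/strings/string_view.h"
#include "xla/hlo/ir/hlo_module.h"
#include "xla/hlo/parser/hlo_parser.h"
#include "xla/hlo/transforms/simplifiers/algebraic_simplifier.h"

namespace xla {

absl::StatusOr<bool> SimplifyDivideByConstant() {
  constexpr absl::string_view kHlo = R"(
HloModule div_by_constant

ENTRY main {
  p0 = f32[4] parameter(0)
  c = f32[4] constant({2, 2, 2, 2})
  ROOT d = f32[4] divide(p0, c)
})";

  // Parse the textual HLO into an in-memory HloModule.
  auto module_or = ParseAndReturnUnverifiedModule(kHlo);
  if (!module_or.ok()) return module_or.status();
  std::unique_ptr<HloModule> module = std::move(module_or).value();

  // The simplifier is expected to rewrite divide(p0, c) into a multiply by
  // the reciprocal of c. Run() returns true if the module changed.
  AlgebraicSimplifierOptions options;
  AlgebraicSimplifier simplifier(options);
  return simplifier.Run(module.get());
}

}  // namespace xla
```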
Developer details
The base class for HLO passes can be found in xla/hlo/pass/hlo_pass_interface.h. An HLO pass should not extend this class directly; instead, it should extend HloModulePass.
See also XLA HLO Pass Framework.
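For illustration, the sketch below shows what a minimal pass extending HloModulePass might look like. The pass name RemoveNoOpCopies and its rewrite are hypothetical, and the exact Run signature and include paths are assumptions that may vary between XLA versions.

```cpp
// Hypothetical pass sketch: forwards the operand of every kCopy instruction
// to the copy's users. A real pass would also have to respect aliasing and
// layout constraints; this is kept simple for illustration.
#include "absl/container/flat_hash_set.h"
#include "absl/status/status.h"
#include "absl/status/statusor.h"
#include "absl/strings/string_view.h"
#include "xla/hlo/ir/hlo_computation.h"
#include "xla/hlo/ir/hlo_instruction.h"
#include "xla/hlo/ir/hlo_module.h"
#include "xla/hlo/ir/hlo_opcode.h"
#include "xla/hlo/pass/hlo_pass_interface.h"

namespace xla {

class RemoveNoOpCopies : public HloModulePass {
 public:
  absl::string_view name() const override { return "remove-noop-copies"; }

  using HloPassInterface::Run;
  absl::StatusOr<bool> Run(HloModule* module,
                           const absl::flat_hash_set<absl::string_view>&
                               execution_threads) override {
    bool changed = false;
    for (HloComputation* computation :
         module->MakeNonfusionComputations(execution_threads)) {
      for (HloInstruction* instruction :
           computation->MakeInstructionPostOrder()) {
        if (instruction->opcode() == HloOpcode::kCopy) {
          // Forward the copy's operand to all users of the copy.
          absl::Status status =
              instruction->ReplaceAllUsesWith(instruction->mutable_operand(0));
          if (!status.ok()) return status;
          changed = true;
        }
      }
    }
    // Report whether the module was modified so the pipeline knows to re-run
    // dependent passes (e.g. DCE to delete the now-dead copies).
    return changed;
  }
};

}  // namespace xla
```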
Tooling and Testing
XLA comes with multiple command line tools, including the hlo-opt tool. This tool allows an individual pass to be run in isolation, independent of a given platform's compilation stages. For more information, see Tooling.
For information on writing unit tests for HLO passes, see Testing HLO Passes.
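As a rough illustration, a pass unit test typically parses a textual HLO module, runs the pass, and inspects the result. The sketch below reuses the hypothetical RemoveNoOpCopies pass from the earlier example; the HloHardwareIndependentTestBase fixture and its header path are assumptions (older trees use HloTestBase instead).

```cpp
// Sketch of a pass unit test. Fixture name and header path are assumptions.
#include <memory>

#include <gtest/gtest.h>
#include "absl/strings/string_view.h"
#include "xla/hlo/ir/hlo_opcode.h"
#include "xla/hlo/testlib/hlo_hardware_independent_test_base.h"

namespace xla {
namespace {

using RemoveNoOpCopiesTest = HloHardwareIndependentTestBase;

TEST_F(RemoveNoOpCopiesTest, ForwardsCopyOperand) {
  constexpr absl::string_view kHlo = R"(
HloModule m

ENTRY main {
  p0 = f32[8] parameter(0)
  ROOT copy = f32[8] copy(p0)
})";
  // Parse and verify the module, then run the hypothetical pass on it.
  auto module = ParseAndReturnVerifiedModule(kHlo).value();
  RemoveNoOpCopies pass;
  EXPECT_TRUE(RunHloPass(&pass, module.get()).value());
  // The copy's only user (the root) should now consume p0 directly.
  EXPECT_EQ(module->entry_computation()->root_instruction()->opcode(),
            HloOpcode::kParameter);
}

}  // namespace
}  // namespace xla
```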
Hardware-independent HLO Pass Examples
This section describes a few examples of passes shared across XLA backends. Some passes may be specialized for specific backends, but the high-level functionality is similar.
Shared, hardware-independent passes can be found in
xla/hlo/transforms.
Rematerialization
See also
HloRematerialization.
Selectively recomputes expressions within the HLO graph, trading higher compute for lower memory usage. Can reduce memory usage by tens of percent and is required to run many large models.
Algebraic Simplifier
See also
AlgebraicSimplifier.
A grab bag of simplifications, optimizations, and canonicalizations, analogous to LLVM's instcombine pass.
Constant Folding
See also
HloConstantFolding.
Replaces expressions which can be evaluated at compile time with their constant equivalent.
Dead Code Elimination
See also
HloDCE.
Removes operations with unused results (fast implementation).
Call Graph Flattening
See also
FlattenCallGraph.
A legalization pass that converts the HLO call graph into a tree by cloning computations. This is required because memory is statically assigned to HLO operations rather than based on dynamic call context.
Reshape Mover
See also
ReshapeMover.
Reshapes and transposes can be expensive, especially on TPU. This pass moves reshapes and transposes across elementwise operations, enabling those operations to be merged or eliminated.
Zero-sized HLO Elimination
See also
ZeroSizedHloElimination.
HLO supports arrays of zero size (one or more dimensions have a bound of zero). This pass simplifies the graph by replacing zero-sized operations with zero-sized constants.
TPU-specific HLO Pass Examples
Passes specific to the TPU backend.
Model parallelism
The partitioning of an XLA program across multiple cores is performed at the HLO level, and the TPU HLO pipeline includes a number of passes that support multi-core execution.
Spatial partitioning
See also
ShardingPropagation.
Pass to support dividing operations across devices along non-batch dimensions.
Handling of bfloat16
See also
BFloat16ConversionFolding,
BFloat16MixedPrecisionRemoval,
and
BFloat16Propagation.
TPUs support bfloat16 as a lower-precision, more compact floating-point representation than 32-bit floats. Using bfloat16 reduces memory footprint and memory bandwidth usage. The TPU HLO pipeline includes various passes for introducing bfloat16 into the program in place of 32-bit floats and for propagating the reduced precision through the graph.
Legalization passes
See also
GatherExpander
and
BatchNormExpander.
Passes that transform unsupported HLO into a form that the backend can emit, or into a form for which the backend produces a more efficient lowering.
GPU-specific HLO Pass Example
Passes specific to the GPU backend are found in
xla/service/gpu.
These passes can be identified as classes defined in namespace gpu.
cuDNN Rewriter
See also
CudnnFusedConvRewriter
and
CudnnNormRewriter.
Rewrites fused convolution and norm operations into their respective library calls in cuDNN.
CPU-specific HLO Pass Examples
Passes specific to the CPU backend are found in
xla/service/cpu.
These passes can be identified as classes defined in namespace cpu.
Convolution Canonicalization
See also
ConvCanonicalization.
Canonicalizes convolutions so that they can be lowered to a fast implementation in Eigen.
Operation Parallelization
See also
ParallelTaskAssigner.
Partitions HLOs into tasks to run on separate threads.
Analysis passes
Analysis passes are not considered "HLO passes" since they do not transform HLO
and may not extend HloModulePass. Shared analyses are found in
xla/hlo/analysis.
Analysis Pass Examples
Dataflow Analysis
See also
HloDataflowAnalysis.
Identifies all HLO values in the graph and their uses.
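As a rough sketch of how an analysis is typically consumed, the example below runs HloDataflowAnalysis on a module and prints the values it identifies; the include path and the helper function name are assumptions and may differ between XLA versions.

```cpp
// Sketch: run HloDataflowAnalysis over a module and inspect the HLO values
// it identifies. The analysis include path is an assumption.
#include <iostream>
#include <memory>
#include <utility>

#include "absl/status/status.h"
#include "xla/hlo/analysis/hlo_dataflow_analysis.h"
#include "xla/hlo/ir/hlo_module.h"

namespace xla {

absl::Status PrintDataflow(const HloModule& module) {
  // ssa_form=false keeps one HloValue per defining instruction and index.
  auto dataflow_or = HloDataflowAnalysis::Run(module, /*ssa_form=*/false);
  if (!dataflow_or.ok()) return dataflow_or.status();
  std::unique_ptr<HloDataflowAnalysis> dataflow =
      std::move(dataflow_or).value();

  // Each HloValue records its defining instruction, positions, and uses.
  for (const HloValue* value : dataflow->values()) {
    std::cout << value->ToShortString() << "\n";
  }
  return absl::OkStatus();
}

}  // namespace xla
```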
Alias Analysis
See also
HloAliasAnalysis.
Identifies must-alias relationships between values in the program.
Computation Cost Analysis
See also
HloCostAnalysis.
Computes FLOP count and memory usage for all operations in the program.
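A hedged sketch of querying HloCostAnalysis is shown below, assuming the shape-size-function constructor; the constructor signature, include paths, and helper name are assumptions that may differ between XLA versions.

```cpp
// Sketch: accumulate FLOPs and bytes accessed for a module's entry
// computation. Constructor form and include paths are assumptions.
#include <iostream>

#include "absl/status/status.h"
#include "xla/hlo/ir/hlo_module.h"
#include "xla/service/hlo_cost_analysis.h"
#include "xla/shape.h"
#include "xla/shape_util.h"

namespace xla {

absl::Status PrintCosts(const HloModule& module) {
  // The shape-size function tells the analysis how large each buffer is.
  HloCostAnalysis analysis([](const Shape& shape) {
    return ShapeUtil::ByteSizeOf(shape, /*pointer_size=*/8);
  });

  // Visit every instruction in the entry computation, accumulating costs.
  absl::Status status = module.entry_computation()->Accept(&analysis);
  if (!status.ok()) return status;

  std::cout << "flops: " << analysis.flop_count()
            << " bytes accessed: " << analysis.bytes_accessed() << "\n";
  return absl::OkStatus();
}

}  // namespace xla
```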
HLO Verification
See also
HloVerifier.
Verifies various invariants of the HLO graph.