The XLA development workflow is usually centered around HLO IR, which represents an isolated functional computation given to the compiler. XLA ships with multiple command-line tools (described below) that consume HLO and either run it or provide an intermediate compilation stage. Such tools are invaluable for a fast compile->modify->run iteration cycle, as HLO is both visualizable and hackable, and iteratively changing and running it is often the fastest way to understand and fix an XLA performance or behavior issue.
The easiest way to obtain the HLO for a program being compiled with XLA is usually to use the XLA_FLAGS environment variable:
$ XLA_FLAGS=--xla_dump_to=/tmp/myfolder ./myprogram-entry-point
which stores all before-optimization HLO files, along with many other useful artifacts, in the specified folder.
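The dump folder contains one file per HLO module and compilation stage. The listing below is purely illustrative; exact file names vary by program, framework, and XLA version:

module_0000.jit_my_function.before_optimizations.txt
module_0000.jit_my_function.after_optimizations.txt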
[run_hlo_module] Run HLO Modules
bazel run //xla/tools:run_hlo_module -- [flags] <filename>
The run_hlo_module tool operates on pre-optimization HLO and by default bundles compilation, running, and comparison with the reference interpreter implementation. For example, the usual invocation to run an input file computation.hlo on an NVIDIA GPU and check it for correctness is:
run_hlo_module --platform=CUDA --reference_platform=Interpreter computation.hlo
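For illustration, computation.hlo could contain a minimal module like the following (a hedged sketch; any valid HLO text works here):

HloModule add

ENTRY main {
  p0 = f32[4] parameter(0)
  p1 = f32[4] parameter(1)
  ROOT sum = f32[4] add(p0, p1)
}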
Run Multiple HLO Modules
Invocation with multiple HLO modules is supported for run_hlo_module. To run all HLO modules from a directory:
bazel run //xla/tools:run_hlo_module -- [flags] /dump/*before_optimizations*
[multihost_hlo_runner] Run HLO Modules With SPMD Support
# Note: Binary name is `hlo_runner_main`.
bazel run //xla/tools/multihost_hlo_runner:hlo_runner_main -- [flags] <filename>
Multihost HLO runner is a very similar tool, with the caveat that it supports SPMD, including cross-host communication. See Multi-Host HLO Runner for details.
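A hedged example invocation follows; the flag names here are assumptions based on common usage of this tool and may differ between versions, so run the binary with --help to confirm:

bazel run //xla/tools/multihost_hlo_runner:hlo_runner_main -- --device_type=gpu --use_spmd_partitioning=true --num_partitions=2 --num_replicas=1 computation.hlo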
Run Multiple HLO Modules With SPMD Support
Similar to run_hlo_module, multihost_hlo_runner also supports invocation with multiple modules.
bazel run //xla/tools/multihost_hlo_runner:hlo_runner_main -- [flags] /dump/*before_optimizations*
[hlo-opt] Compile HLO Module
bazel run //xla/tools:hlo-opt -- --platform=[gpu|cpu|...] [more flags] <filename>
When debugging or trying to understand the workings of the compiler, it is often useful to get the expansion for particular hardware at a particular point in the pipeline (be it HLO, optimized HLO, TritonIR, or LLVM) for a given HLO or StableHLO input.
hlo-opt supports multiple output stages: PTX, HLO after optimizations, LLVM IR before optimizations, or TritonIR. The exact set of supported stages depends on the platform (e.g. PTX is NVIDIA-specific) and can be seen using the --list-stages command:
hlo-opt --platform=CUDA --list-stages
buffer-assignment
hlo
hlo-backend
html
llvm
llvm-after-optimizations
llvm-before-optimizations
ptx
After selecting a stage, the user can write the result of the conversion for a given platform to a given stream:
hlo-opt --platform=cpu --stage=hlo input.hlo
which prints the dump to stdout (or to a given file if -o was specified).
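For example, to write the PTX for the CUDA platform to a file instead of stdout (the output path is arbitrary):

hlo-opt --platform=CUDA --stage=ptx -o /tmp/output.ptx input.hlo

Note that generating PTX requires either a machine with an NVIDIA GPU or the deviceless flow described next.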
Deviceless Compilation for GPU
Deviceless compilation does not need access to a GPU. The deviceless flow provides a way to specify a GPU spec on the command line (--xla_gpu_target_config_filename) for stages where access to a GPU would otherwise be required, eliminating the need for a GPU device.
Example: LLVM IR output without access to a GPU device:
hlo-opt --platform=CUDA --stage=llvm --xla_gpu_target_config_filename=/xla/tools/hlo_opt/gpu_specs/a100_pcie_80.txtpb input.hlo
Specs for popular GPUs are shipped with the compiler; the provided file is a string serialization of device_description.proto:
gpu_device_info {
  cuda_compute_capability {
    major: 8
    minor: 0
  }
  threads_per_block_limit: 1024
  threads_per_warp: 32
  shared_memory_per_block: 127152
  shared_memory_per_core: 65536
  threads_per_core_limit: 2048
  core_count: 6192
  fpus_per_core: 64
  block_dim_limit_x: 2147483647
  block_dim_limit_y: 65535
  block_dim_limit_z: 65535
  memory_bandwidth: 2039000000000
  l2_cache_size: 4194304
  clock_rate_ghz: 1.1105
  device_memory_size: 79050250240
}
platform_name: "CUDA"
More GPU specs are located at /xla/tools/hlo_opt/gpu_specs.
Autotuning
Sometimes compilation may involve autotuning, depending on the compilation --stage. For deviceless compilation to work, the user needs to either disable autotuning with --xla_gpu_autotune_level=0 or load pre-existing autotuning results with --xla_gpu_load_autotune_results_from=<filename> (obtained with --xla_gpu_dump_autotune_results_to=<filename>).
hlo-opt --platform=CUDA --stage=llvm --xla_gpu_target_config_filename=gpu_specs/a100_pcie_80.txtpb --xla_gpu_load_autotune_results_from=results.textpb input.hlo
The autotune file is a text serialization of autotune_results.proto, with an example looking like:
version: 3
results {
  device: "CUDA: 8.0, Cores: 108, GPU clock: 1.41 GHz, Memory bandwidth: 1555 GB/s, L2 cache: 40 MB"
  hlo: "{\n tmp_0 = f16[1,16,17,3]{3,2,1,0} parameter(0)\n tmp_1 = f16[16,51]{1,0} bitcast(f16[1,16,17,3]{3,2,1,0} tmp_0)\n tmp_2 = s8[16,17,3]{2,1,0} parameter(1)\n tmp_3 = s8[51,16]{0,1} bitcast(s8[16,17,3]{2,1,0} tmp_2)\n tmp_4 = f16[51,16]{0,1} convert(s8[51,16]{0,1} tmp_3)\n tmp_5 = f16[16,16]{1,0} dot(f16[16,51]{1,0} tmp_1, f16[51,16]{0,1} tmp_4), lhs_contracting_dims={1}, rhs_contracting_dims={0}\n ROOT tmp_6 = f16[1,16,16]{2,1,0} bitcast(f16[16,16]{1,0} tmp_5)\n}"
  result {
    run_time {
      nanos: 31744
    }
    triton {
      block_m: 32
      block_n: 32
      block_k: 32
      split_k: 1
      num_stages: 1
      num_warps: 4
    }
  }
}
The autotuning database can be serialized using:
XLA_FLAGS=--xla_gpu_dump_autotune_results_to=<myfile.pbtxt>
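For example, combined with the dump flow from the beginning of this document (the program entry point is the same placeholder used there):

XLA_FLAGS=--xla_gpu_dump_autotune_results_to=/tmp/results.textpb ./myprogram-entry-point

The resulting file can then be fed back via --xla_gpu_load_autotune_results_from for deviceless runs.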
[hlo-opt] HLO Pass Development And Debugging
# If you are working with hardware-independent passes from the
# `xla/hlo/transforms/` directory, prefer the lightweight version
# of the `hlo-opt` tool with fewer dependencies:
bazel run //xla/hlo/tools:hlo-opt -- [flags] <filename>
# Otherwise, for hardware-independent, CPU, and GPU passes, use
# the same binary from the "Compile HLO Module" section above:
bazel run //xla/tools:hlo-opt -- [flags] <filename>
The hlo-opt tool allows execution of individual passes independently of the given platform's compilation stages. This isolation helps to quickly run passes on an input HLO module and pinpoint the root cause of failures.
hlo-opt --passes=schedule-aware-collective-cse input.hlo
The hlo-opt tool also supports DebugOptions XLA_FLAGS:
hlo-opt --passes=schedule-aware-collective-cse --xla_gpu_experimental_collective_cse_distance_threshold=20 input.hlo
Use the --list-passes option to get the pass name strings:
hlo-opt --list-passes
Users can create their own custom pipeline by specifying more than one pass in the --passes option:
hlo-opt --passes=pass1,pass2,pass3 input.hlo
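For instance, a concrete two-pass pipeline using pass names that appear later in this document:

hlo-opt --passes=reduce-window-rewriter,scatter_simplifier input.hlo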
Assist New HLO Pass Development
- First, write your pass.
- Register the new pass with the hlo-opt tool pass registry:
  RegisterPass<FooPass>(FooPassInputOptions)
  Based on the pass type, choose one of the following locations for registration:
  - opt_lib.cc: hardware-independent passes.
  - cpu_opt.cc: CPU-specific passes.
  - gpu_opt.cc: GPU-specific passes.
  - compiled_opt.cc: passes common to CPU, GPU, XPU.
  Don't forget to add the build dependency. Include the pass registration as part of your PR (example) so that the pass will be available to all hlo-opt users. A minimal sketch of this step appears after this list.
- Rebuild the hlo-opt tool, validate successful pass registration using the --list-passes option, and then use the --passes option to run the pass:
  $ hlo-opt --passes=foo-pass input.hlo
- Writing unit tests for the pass? Refer to https://openxla.org/xla/test_hlo_passes for more details.
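As promised above, a minimal sketch of the registration step. FooPass and FooPassInputOptions are the placeholder names from the list, not real XLA symbols; the exact registry call belongs in whichever of opt_lib.cc, cpu_opt.cc, gpu_opt.cc, or compiled_opt.cc you chose:

// In e.g. opt_lib.cc: make `hlo-opt --passes=foo-pass` able to construct FooPass.
// Placeholder names; adapt to the actual pass and its constructor arguments.
RegisterPass<FooPass>(FooPassInputOptions);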
Pass Runtime Measurement
For large models, full compilation runs can take up to a few minutes, making it challenging to detect subtle performance regressions. In contrast, individual pass runs using hlo-opt allow for precise performance measurement and easy detection of even small increases in execution time caused by new code changes.
time hlo-opt --passes=reduce-window-rewriter,scatter_simplifier --xla_reduce_window_rewrite_base_length=128 input.hlo
[hlo-opt] Convert HLO Module Formats
# Use the lightweight version of the `hlo-opt` tool.
bazel run //xla/hlo/tools:hlo-opt -- [flags] <filename>
Convert HLO Text -> HLO Proto:
hlo-opt --emit-proto input.hlo
Convert HLO Proto or HLO Proto Binary -> HLO Text:
hlo-opt input.pbtxt
hlo-opt input.pb
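A hedged round-trip sketch combining the two conversions; whether -o applies to proto output exactly as it does to other stages is an assumption here:

hlo-opt --emit-proto input.hlo -o /tmp/module.pbtxt
hlo-opt /tmp/module.pbtxt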