We use OpenAI Triton for generating some of the GPU kernels. Triton allows generating fast GPU kernels for certain fusions, but we have to tune some parameters for each such fusion.
This can take a long time if there are many fusions, so we provide a way to load those autotuning results while still running the other compilation steps normally. Autotuning caches remain useful even after a few changes: fusions that are present in the cache use the cached results, and the remaining ones are autotuned normally.
The autotuning results can be dumped/loaded using these parameters:

```
--xla_gpu_dump_autotune_results_to=
--xla_gpu_load_autotune_results_from=
```

If we specify a `.txt` or `.textproto` file, then the cache will be dumped in textproto format, otherwise in binary protobuf format.
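For example, when running an XLA-backed program outside of Bazel, these flags can be passed through the `XLA_FLAGS` environment variable. A minimal sketch (the program name and cache path are placeholders):

```
# First run: autotune everything and dump the results
# (textproto format, because of the .textproto extension).
XLA_FLAGS=--xla_gpu_dump_autotune_results_to=/tmp/autotune_cache.textproto ./my_program

# Subsequent runs: load the dumped results; fusions missing
# from the cache are autotuned normally.
XLA_FLAGS=--xla_gpu_load_autotune_results_from=/tmp/autotune_cache.textproto ./my_program
```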
## In tests
Persisted autotuning can also be used in tests. It is recommended for very big tests, especially if the performance of the test environment is limited.
It only works well if the autotune cache contains results generated on the same type of GPU on which the tests are run.
### Making a test use persisted autotuning
For now let's assume that the test in question always uses the same GPU type.
1.  We have to export the autotune results from the test, for example by specifying these parameters in the test command:

    ```
    --test_env=XLA_FLAGS=--xla_gpu_dump_autotune_results_to=TEST_UNDECLARED_OUTPUTS_DIR/autotune_cache.textproto
    --test_sharding_strategy=disabled
    ```

    Sharding must be disabled to correctly get a single autotune cache for all tests.
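    For example, with a hypothetical test target (the target label is a placeholder):

    ```
    bazel test //xla/my/package:my_gpu_test \
      --test_env=XLA_FLAGS=--xla_gpu_dump_autotune_results_to=TEST_UNDECLARED_OUTPUTS_DIR/autotune_cache.textproto \
      --test_sharding_strategy=disabled
    ```

    The dumped cache then appears among the test's undeclared outputs, under `bazel-testlogs/.../test.outputs/`.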
2.  Then we have to upload that cache to our code repository.

3.  Then we have to add the cache to the data dependencies of our test target and load it using an environment variable (a sketch of a full target follows this list):

    ```
    data = ["test_autotune_cache.textproto"],
    env = {"XLA_FLAGS": "--xla_gpu_load_autotune_results_from=" +
           "$(execpath test_autotune_cache.textproto)"},
    ```

    (It is OK to use sharding in tests that load autotune results.)
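For illustration, here is a sketch of how these attributes could fit into a complete test target. The rule, target name, and source file are hypothetical; the real XLA tests use the macros from `xla/service/gpu/tests/BUILD`:

```
cc_test(
    name = "my_gpu_test",
    srcs = ["my_gpu_test.cc"],
    # The checked-in cache must be a data dependency so that it is
    # available at test runtime.
    data = ["test_autotune_cache.textproto"],
    # $(execpath ...) expands to the path of the cache file, so XLA
    # loads the persisted autotune results instead of re-autotuning.
    env = {"XLA_FLAGS": "--xla_gpu_load_autotune_results_from=" +
           "$(execpath test_autotune_cache.textproto)"},
)
```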
Please also see the example tests in `xla/service/gpu/tests/BUILD`:

- `load_autotune_results_using_execpath_test`
- `load_autotune_results_from_test_workspace_test`
- `dump_autotune_results_to_test_outputs_test`
### Cache obsolescence
If many changes are made to a model, it is possible that the cache will no longer contain all fusions, so the test will become slower. In this case we would have to regenerate the autotuning cache.
If we start using a new type of GPU for running the tests, the same applies.
The cache may also become obsolete if the XLA compiler evolves and generates different fusions.