Optimize TensorFlow performance using XProf

This guide demonstrates how to use the tools available with XProf to track the performance of your TensorFlow models on the host (CPU), the device (GPU), or a combination of the host and device(s).

Profiling helps you understand the hardware resource consumption (time and memory) of the various TensorFlow operations (ops) in your model, resolve performance bottlenecks, and, ultimately, make the model execute faster.

This guide walks you through how to use the various tools available and the different modes in which XProf collects performance data.

If you want to profile your model performance on Cloud TPUs, refer to the Cloud TPU guide.

Collect performance data

XProf collects host activities and GPU traces of your TensorFlow model. You can configure XProf to collect performance data through either the programmatic mode or the sampling mode.

Profiling APIs

You can use the following APIs to perform profiling.

  • Programmatic mode using the TensorBoard Keras Callback (tf.keras.callbacks.TensorBoard)

    # Profile from batches 10 to 15
    tb_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir,
                                                 profile_batch='10, 15')
    
    # Train the model and use the TensorBoard Keras callback to collect
    # performance profiling data
    model.fit(train_data,
              steps_per_epoch=20,
              epochs=5,
              callbacks=[tb_callback])
    
  • Programmatic mode using the tf.profiler Function API

    tf.profiler.experimental.start('logdir')
    # Train the model here
    tf.profiler.experimental.stop()
    
  • Programmatic mode using the context manager

    with tf.profiler.experimental.Profile('logdir'):
        # Train the model here
        pass
    
  • Sampling mode: Perform on-demand profiling by using tf.profiler.experimental.server.start to start a gRPC server alongside your TensorFlow model run. After starting the gRPC server and running your model, you can capture a profile through the Capture Profile button in XProf. Use the script in the Install profiler section above to launch a TensorBoard instance if it is not already running.

    As an example,

    # Start a profiler server before your model runs.
    tf.profiler.experimental.server.start(6009)
    # (Model code goes here).
    # Send a request to the profiler server to collect a trace of your model.
    tf.profiler.experimental.client.trace('grpc://localhost:6009',
                                          'gs://your_tb_logdir', 2000)
    

    An example for profiling multiple workers:

    # E.g., your worker IP addresses are 10.0.0.2, 10.0.0.3, 10.0.0.4, and you
    # would like to profile for a duration of 2 seconds.
    tf.profiler.experimental.client.trace(
        'grpc://10.0.0.2:8466,grpc://10.0.0.3:8466,grpc://10.0.0.4:8466',
        'gs://your_tb_logdir',
        2000)
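
The tf.profiler.experimental.start/stop and Profile (context manager) APIs shown above also accept a tf.profiler.experimental.ProfilerOptions object that controls how much host, device, and Python tracing is collected. A minimal sketch (the tracer levels shown are illustrative, not recommendations):

options = tf.profiler.experimental.ProfilerOptions(
    host_tracer_level=2,    # CPU tracing detail: 1 = critical info, 2 = info, 3 = verbose
    python_tracer_level=1,  # 1 enables Python function call tracing, 0 disables it
    device_tracer_level=1)  # 1 enables device (GPU/TPU) tracing, 0 disables it

tf.profiler.experimental.start('logdir', options=options)
# Train the model here
tf.profiler.experimental.stop()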
    

Capture Profile Dialog

Use the Capture Profile dialog to specify:

  • A comma-delimited list of profile service URLs or TPU names.
  • A profiling duration.
  • The level of device, host, and Python function call tracing.
  • How many times you want the Profiler to retry capturing profiles if unsuccessful at first.
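
These dialog fields roughly map to parameters of the tf.profiler.experimental.client.trace API, so the same capture can also be scripted. A minimal sketch (the worker addresses and output bucket are placeholders):

options = tf.profiler.experimental.ProfilerOptions(
    host_tracer_level=2,    # host (CPU) tracing detail
    python_tracer_level=1,  # trace Python function calls
    device_tracer_level=1)  # trace device (GPU/TPU) activity

tf.profiler.experimental.client.trace(
    'grpc://10.0.0.2:8466,grpc://10.0.0.3:8466',  # comma-delimited profile service URLs
    'gs://your_tb_logdir',                        # where the captured profile is written
    duration_ms=2000,                             # profiling duration
    num_tracing_attempts=3,                       # retries if the capture is unsuccessful at first
    options=options)                              # tracing levels described above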

Profiling custom training loops

To profile custom training loops in your TensorFlow code, instrument the training loop with the tf.profiler.experimental.Trace API to mark the step boundaries for XProf.

The name argument is used as a prefix for the step names, the step_num keyword argument is appended to the step names, and the _r keyword argument makes XProf process this trace event as a step event.

As an example,

for step in range(NUM_STEPS):
    with tf.profiler.experimental.Trace('train', step_num=step, _r=1):
        train_data = next(dataset)
        train_step(train_data)

This will enable XProf's step-based performance analysis and cause the step events to show up in the trace viewer.

Make sure that you include the dataset iterator within the tf.profiler.experimental.Trace context for accurate analysis of the input pipeline.

The code snippet below is an anti-pattern because the dataset iterator is advanced by enumerate outside the Trace context, which means the time spent fetching the input data is not attributed to the step:

for step, train_data in enumerate(dataset):
    with tf.profiler.experimental.Trace('train', step_num=step, _r=1):
        train_step(train_data)
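
Note that the Trace API only marks step boundaries; the profiler must be active (through one of the collection modes described above) for any data to be recorded. A minimal end-to-end sketch using the start/stop API, assuming NUM_STEPS, dataset, and train_step are defined as in the snippets above:

tf.profiler.experimental.start('logdir')
for step in range(NUM_STEPS):
    # Keep the iterator call inside the Trace context so that input time is
    # attributed to the step.
    with tf.profiler.experimental.Trace('train', step_num=step, _r=1):
        train_data = next(dataset)
        train_step(train_data)
tf.profiler.experimental.stop()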

Profiling use cases

The profiler covers a number of use cases along four different axes (the columns of the table below). Some of the combinations are currently supported and others will be added in the future. Some of the use cases are:

  • Local vs. remote profiling: These are two common ways of setting up your profiling environment. In local profiling, the profiling API is called on the same machine your model is executing, for example, a local workstation with GPUs. In remote profiling, the profiling API is called on a different machine from where your model is executing, for example, on a Cloud TPU.
  • Profiling multiple workers: You can profile multiple machines when using the distributed training capabilities of TensorFlow.
  • Hardware platform: Profile CPUs, GPUs, and TPUs.

The table below provides a quick overview of the TensorFlow-supported use cases mentioned above:

| Profiling API                              | Local     | Remote        | Multiple workers | Hardware platforms |
|--------------------------------------------|-----------|---------------|------------------|--------------------|
| TensorBoard Keras callback                 | Supported | Not supported | Not supported    | CPU, GPU           |
| tf.profiler.experimental.start/stop API    | Supported | Not supported | Not supported    | CPU, GPU           |
| tf.profiler.experimental.client.trace API  | Supported | Supported     | Supported        | CPU, GPU, TPU      |
| Context manager API                        | Supported | Not supported | Not supported    | CPU, GPU           |

Additional resources