This section covers the mechanisms Parsl provides to add resiliency and robustness to workflows.
Parsl supports capturing, tracking, and handling a variety of errors, and it provides functionality to respond appropriately to failures during workflow execution. A task is considered to have failed if it cannot complete execution within a specified time limit or if it cannot produce the specified set of outputs.
Failures might occur for various reasons:
- Task exceeded specified walltime.
- Error while formatting the command-line string in a Bash app.
- Task failed during execution.
- Task completed execution but failed to produce one or more of its specified outputs.
- Task failed to launch, for example if an input dependency is not met.
Because Parsl tasks are executed asynchronously, it can be difficult to determine where to place exception handling code in the workflow. In Parsl, all exceptions are associated with the task futures. An exception is raised only when result() is called on the future of a failed task. For example:
    from parsl import python_app

    @python_app
    def bad_divide(x):
        return 6 / x

    # Call bad_divide with 0 to cause a divide-by-zero exception.
    doubled_x = bad_divide(0)

    # Catch and handle the exception.
    try:
        doubled_x.result()
    except ZeroDivisionError as e:
        print('Oops! You tried to divide by 0.')
    except Exception as e:
        print('Oops! Something really bad happened.')
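The same pattern extends to a batch of tasks: launch everything first, then inspect each future. The sketch below is only an illustration. It uses the standard library's concurrent.futures as a stand-in for a Parsl executor (an assumption made so it runs without a Parsl configuration), and the helper name collect_results is hypothetical. It relies on the future's exception() method, which returns the stored exception instead of raising it.

```python
from concurrent.futures import ThreadPoolExecutor


def bad_divide(x):
    return 6 / x


def collect_results(futures):
    """Split futures into successful results and captured exceptions."""
    results, errors = [], []
    for fut in futures:
        # exception() waits for the task and returns the stored exception
        # (or None on success) instead of raising it.
        exc = fut.exception()
        if exc is None:
            results.append(fut.result())
        else:
            errors.append(exc)
    return results, errors


# Stand-in executor (assumption): a thread pool instead of a Parsl executor.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(bad_divide, x) for x in (1, 0, 3)]
    results, errors = collect_results(futures)

print(results)
print([type(e).__name__ for e in errors])
```

Handling errors after all tasks have been launched keeps one failing task from hiding the results of the others.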
Errors in distributed and parallel environments are often transient, so
retrying a task is a common method for adding resiliency to a workflow.
By retrying failed apps, transient failures (e.g., machine failure,
network failure) and intermittent failures within applications can be overcome.
When retries are enabled (set to an integer > 0), Parsl will automatically
re-launch applications that have failed, until the retry limit is reached.
By default, retries = 0.
The following example shows how the number of retries can be set to 2:
    from parsl import load
    from parsl.tests.configs.local_threads import config

    config.retries = 2
    load(config)
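To make the retry limit concrete, here is a minimal pure-Python sketch of retry-until-limit behavior. It is not Parsl's implementation: run_with_retries and FlakyTask are hypothetical names, and the sketch assumes that retries counts re-launches after the first attempt, so retries = 2 allows up to three attempts in total.

```python
def run_with_retries(task, retries=0):
    """Run task(); on failure, re-launch it up to `retries` more times."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except Exception:
            # Give up once the retry budget is spent and surface the last error.
            if attempts > retries:
                raise


class FlakyTask:
    """Hypothetical task that fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.remaining_failures = failures

    def __call__(self):
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise RuntimeError("transient failure")
        return "ok"


# Two transient failures are absorbed by retries=2 (three attempts in total).
value, attempts = run_with_retries(FlakyTask(failures=2), retries=2)
print(value, attempts)
```

A task that fails more often than the limit allows still raises its last exception, so the error-handling pattern for futures continues to apply.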
Due to a known bug (issue #282), disabling lazy_errors with lazy_errors=False is not supported in Parsl 0.6.0.
Parsl implements a lazy failure model in which a workflow continues to execute even when some tasks fail. That is, the workflow does not halt as soon as it encounters a failure, but continues to execute every app that is unaffected by it.
Here's a workflow graph, where (X) is runnable, [X] is completed, (X*) is failed, and (!X) is dependency failed:

      (A)           [A]           [A]
      / \           / \           / \
    (B) (C)       [B] (C*)      [B] (C*)
     |   |   =>    |   |   =>    |   |
    (D) (E)       (D) (E)       [D] (!E)
      \ /           \ /           \ /
      (F)           (F)           (!F)

    time ----->
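The propagation shown in the graph above can be sketched in a few lines of plain Python. This is an illustrative model, not Parsl code: the DEPENDS_ON table and the final_states function are hypothetical, and the sketch assumes that a task is marked dependency failed whenever any task it waits on did not complete.

```python
# Dependencies of the example graph: task -> tasks it waits on.
DEPENDS_ON = {
    "A": [],
    "B": ["A"],
    "C": ["A"],
    "D": ["B"],
    "E": ["C"],
    "F": ["D", "E"],
}


def final_states(depends_on, failing):
    """Return the end state of each task when the tasks in `failing` fail."""
    states = {}

    def resolve(task):
        if task not in states:
            dep_states = [resolve(dep) for dep in depends_on[task]]
            if any(s != "completed" for s in dep_states):
                # An upstream task failed, so this task never runs.
                states[task] = "dependency failed"
            elif task in failing:
                states[task] = "failed"
            else:
                states[task] = "completed"
        return states[task]

    for task in depends_on:
        resolve(task)
    return states


states = final_states(DEPENDS_ON, failing={"C"})
for task in sorted(states):
    print(task, states[task])
```

With C failing, B and D still complete because they do not depend on C, while E and F end up as dependency failed.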