Parsl is designed to enable implementation of dataflow patterns in which data passed between apps manages the flow of execution. Dataflow programming models are popular as they can cleanly express, via implicit parallelism, opportunities for concurrent execution.
Parsl aims to abstract not only parallel execution but also execution location, which in turn requires data location abstraction. This is crucial as it allows scripts to execute in different locations without regard for data location. Parsl implements a flexible file abstraction that can be used to reference data irrespective of its location. At present this model supports local files as well as files accessible via FTP, HTTP, HTTPS, and Globus.
File class abstracts the file access layer. Irrespective of where the script or its apps are executed, Parsl uses this abstraction to access that file. When referencing a Parsl file in an app, Parsl maps the object to the appropriate access path according to the selected URL scheme: Local, FTP, HTTP, HTTPS and Globus.
file scheme is used to reference local files. A file using the local file scheme must specify the absolute file path, for example:
The file may then be passed as input or output to an app. The following example executes the
cat command on a local file:
@bash_app def cat(inputs=, stdout='stdout.txt'): return 'cat %s' % (inputs) # create a test file open('/tmp/test.txt', 'w').write('Hello\n') # create the Parsl file parsl_file = File('file:///tmp/test.txt') # call the cat app with the Parsl file cat(inputs=[parsl_file])
FTP, HTTP, HTTPS¶
File objects with FTP, HTTP, and HTTPS schemes represent remote files on FTP, HTTP and HTTPS servers, respectively.The following example defines a file accessible on a remote FTP server.
When such a file object is passed as an input to an app, Parsl will download the file to the executor where the app is scheduled for execution. The following example illustrates how the remote file is implicitly downloaded from an FTP server and then converted. Note: the app does not need to know the local location of the downloaded file as Parsl abstracts this translation.
@python_app def convert(inputs=, outputs=): with open(inputs.filepath, 'r') as inp: content = inp.read() with open(outputs.filepath, 'w') as out: out.write(content.upper()) # create an remote Parsl file inp = File('ftp://www.iana.org/pub/mirror/rirstats/arin/ARIN-STATS-FORMAT-CHANGE.txt') # create a local Parsl file out = File('file:///tmp/ARIN-STATS-FORMAT-CHANGE.txt') # call the convert app with the Parsl file f = convert(inputs=[inp], outputs=[out]) f.result()
Globus scheme is used to reference files that can be accessed using Globus (a guide to using Globus is available here). A file using the Globus scheme must specify the UUID of the Globus
endpoint and a path to the file on the endpoint, for example:
Note: the Globus endpoint UUID can be found in the Globus Manage Endpoints page.
When Globus files are passed as inputs or outputs to/from an app, Parsl stage the files to/from the remote executor using Globus. The staging occurs implicitly. That is, Parsl is responsible for transferring the input file from the Globus endpoint to the executor, or transferring the output file from the executor to the Globus endpoint. Parsl scripts may combine staging of files in and out of apps. For example, the following script stages a file from a remote Globus endpoint, it then sorts the strings in that file, and stages the sorted output file to another remote endpoint.
@python_app def sort_strings(inputs=, outputs=): with open(inputs.filepath, 'r') as u: strs = u.readlines() strs.sort() with open(outputs.filepath, 'w') as s: for e in strs: s.write(e) unsorted_file = File('globus://037f054a-15cf-11e8-b611-0ac6873fc732/unsorted.txt') sorted_file = File ('globus://ddb59aef-6d04-11e5-ba46-22000b92c6ec/sorted.txt') f = sort_strings(inputs=[unsorted_file], outputs=[sorted_file]) f.result()
In order to manage where data is staged users may configure the default
working_dir on a remote executor. This is passed to the
ParslExecutor via the
working_dir parameter in the
Config instance. For example:
from parsl.config import Config from parsl.executors.ipp import IPyParallelExecutor config = Config( executors=[ IPyParallelExecutor( working_dir="/home/user/parsl_script" ) ] )
When using the Globus scheme Parsl requires knowledge of the Globus endpoint that is associated with an executor. This is done by specifying the
endpoint_name (the UUID of the Globus endpoint that is associated with the system) in the configuration.
In some cases, for example when using a Globus shared endpoint or when a Globus endpoint is mounted on a supercomputer, the path seen by Globus is not the same as the local path seen by Parsl. In this case the configuration may optionally specify a mapping between the
endpoint_path (the common root path seen in Globus), and the
local_path (the common root path on the local file system). In most cases
local_path are the same.
from parsl.config import Config from parsl.executors.ipp import IPyParallelExecutor from parsl.data_provider.globus import GlobusStaging from parsl.data_provider.data_manager import default_staging config = Config( executors=[ IPyParallelExecutor( working_dir="/home/user/parsl_script", storage_access=default_staging + [GlobusStaging( endpoint_uuid="7d2dc622-2edb-11e8-b8be-0ac6873fc732", endpoint_path="/", local_path="/home/user" )] ) ] )