sphinxcontrib-data-pipeline

This Sphinx extension lets you document data processing pipelines, which connect tools (“drivers”) with the data products they consume and produce.

Installation

You can install the extension via pip:

pip install sphinxcontrib-data-pipeline

Getting Started

To use this extension, add the following extensions to the extensions list in your conf.py file:

extensions = [
   ...
   "sphinxcontrib_data_pipeline",
   "sphinxcontrib.mermaid",
   "sphinx_tabs.tabs",
   "sphinx_toolbox.collapse"
]

Data Products

A data product can be added with the data_product directive. The directive has the following options:

.. data_product:: Data
   :description: Description of `Data A`.
   :format: CSV
   :file_extension: .csv

Data

Description:

Description of Data A.

Data Format:

CSV

File Extension:

.csv

flowchart LR
    Data([Data])
    classDef driverClass fill:#f96,color:black,stroke:#854c30;

Data producers and consumers

Drivers

A driver can be added with the driver directive. The directive has the following options:

.. driver:: tool
   :description: Description of `tool`.
   :executable: tool
   :repository: https:<path_to_repository>.com
   :documentation: https:<path_to_documentation>.com
   :contact: Max Mustermann
   :inputs:
   :outputs:

   # Command Line Arguments
   tool [options] input_file output_file

For :inputs: and :outputs: you can specify a list of data products that are expected as input or produced as output. Make sure to use the same name as in the data product directive.
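Because inputs and outputs are matched to data products by name, a misspelled name can leave a driver unconnected. The following is a minimal sketch of a consistency check in plain Python. The function name `find_unknown_products` and the dictionary layout are hypothetical and not part of the extension, and the sketch assumes that names must match exactly, including case.

```python
def find_unknown_products(data_products, drivers):
    """Return {driver_name: [referenced names that are not declared data products]}.

    Hypothetical helper: `data_products` is an iterable of declared names and
    `drivers` maps a driver name to {"inputs": [...], "outputs": [...]}.
    """
    known = set(data_products)
    unknown = {}
    for name, spec in drivers.items():
        referenced = list(spec.get("inputs", [])) + list(spec.get("outputs", []))
        missing = [product for product in referenced if product not in known]
        if missing:
            unknown[name] = missing
    return unknown


data_products = ["Data_A", "Data_B"]
drivers = {
    "Tool_A": {"inputs": ["Data_A"], "outputs": ["Data_B"]},
    "Tool_B": {"inputs": ["Data_b"], "outputs": []},  # wrong case on purpose
}
print(find_unknown_products(data_products, drivers))  # {'Tool_B': ['Data_b']}
```

A check like this could be run over a specification before building the documentation to catch names that do not line up.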

tool

Description:

Description of tool.

Executable:

tool

Repository:

https:<path_to_repository>.com

Documentation:

https:<path_to_documentation>.com

Contact:

Max Mustermann

tool command line arguments

# Command Line Arguments
tool [options] input_file output_file

flowchart LR
    tool{{tool}}:::driverClass
    classDef driverClass fill:#f96,color:black,stroke:#854c30;

tool inputs/outputs

Pipeline

A pipeline is specified by a set of drivers and data products that are linked together. The following example specifies 4 data products and 2 drivers.

.. data_product:: Data_A
   :description: Description of `Data A`.
   :format: text
   :file_extension: .txt

.. data_product:: Data_B
   :description: Description of `Data B`.
   :format: text
   :file_extension: .txt

.. data_product:: Data_C
   :description: Description of `Data C`.
   :format: text
   :file_extension: .txt

.. data_product:: Data_D
   :description: Description of `Data D`.
   :format: text
   :file_extension: .txt

.. driver:: Tool_A
   :description: Description of `Tool A`.
   :inputs: Data_A
   :outputs: Data_B

   # Command Line Arguments
   Tool_A [options] input_file output_file

.. driver:: Tool_B
   :description: Description of `Tool B`.
   :inputs: Data_B
   :outputs: [Data_C, Data_D]

   # Command Line Arguments
   Tool_B [options] input_file output_file

Data_A

Description:

Description of Data A.

Data Format:

text

File Extension:

.txt

flowchart LR
    Data_A([Data_A])
    Tool_A{{Tool_A}}:::driverClass
    click Tool_A href "#driver-Tool_A" "go to driver"
    Data_A --> Tool_A
    classDef driverClass fill:#f96,color:black,stroke:#854c30;

Data_A producers and consumers

Data_B

Description:

Description of Data B.

Data Format:

text

File Extension:

.txt

flowchart LR
    Data_B([Data_B])
    Tool_A{{Tool_A}}:::driverClass
    click Tool_A href "#driver-Tool_A" "go to driver"
    Tool_B{{Tool_B}}:::driverClass
    click Tool_B href "#driver-Tool_B" "go to driver"
    Data_B --> Tool_B
    Tool_A --> Data_B
    classDef driverClass fill:#f96,color:black,stroke:#854c30;

Data_B producers and consumers

Data_C

Description:

Description of Data C.

Data Format:

text

File Extension:

.txt

flowchart LR
    Data_C([Data_C])
    Tool_B{{Tool_B}}:::driverClass
    click Tool_B href "#driver-Tool_B" "go to driver"
    Tool_B --> Data_C
    classDef driverClass fill:#f96,color:black,stroke:#854c30;

Data_C producers and consumers

Data_D

Description:

Description of Data D.

Data Format:

text

File Extension:

.txt

flowchart LR
    Data_D([Data_D])
    Tool_B{{Tool_B}}:::driverClass
    click Tool_B href "#driver-Tool_B" "go to driver"
    Tool_B --> Data_D
    classDef driverClass fill:#f96,color:black,stroke:#854c30;

Data_D producers and consumers

Tool_A

Description:

Description of Tool A.

Tool_A command line arguments

# Command Line Arguments
Tool_A [options] input_file output_file

flowchart LR
    Tool_A{{Tool_A}}:::driverClass
    Data_A([Data_A])
    click Data_A href "#data-product-Data_A" "go to data product"
    Data_B([Data_B])
    click Data_B href "#data-product-Data_B" "go to data product"
    Data_A --> Tool_A
    Tool_A --> Data_B
    classDef driverClass fill:#f96,color:black,stroke:#854c30;

Tool_A inputs/outputs

Tool_B

Description:

Description of Tool B.

Tool_B command line arguments

# Command Line Arguments
Tool_B [options] input_file output_file

flowchart LR
    Tool_B{{Tool_B}}:::driverClass
    Data_B([Data_B])
    click Data_B href "#data-product-Data_B" "go to data product"
    Data_C([Data_C])
    click Data_C href "#data-product-Data_C" "go to data product"
    Data_D([Data_D])
    click Data_D href "#data-product-Data_D" "go to data product"
    Data_B --> Tool_B
    Tool_B --> Data_C
    Tool_B --> Data_D
    classDef driverClass fill:#f96,color:black,stroke:#854c30;

Tool_B inputs/outputs

The entire pipeline can be shown in a single diagram by using the pipeline directive:

.. pipeline::

flowchart TD
    Data([Data])
    click Data href "#data-product-Data" "go to data product"
    Data_A([Data_A])
    click Data_A href "#data-product-Data_A" "go to data product"
    Data_B([Data_B])
    click Data_B href "#data-product-Data_B" "go to data product"
    Data_C([Data_C])
    click Data_C href "#data-product-Data_C" "go to data product"
    Data_D([Data_D])
    click Data_D href "#data-product-Data_D" "go to data product"
    INPUT_1([INPUT_1])
    click INPUT_1 href "#data-product-INPUT_1" "go to data product"
    OUTPUT_1([OUTPUT_1])
    click OUTPUT_1 href "#data-product-OUTPUT_1" "go to data product"
    OUTPUT_2([OUTPUT_2])
    click OUTPUT_2 href "#data-product-OUTPUT_2" "go to data product"
    tool{{tool}}:::driverClass
    click tool href "#driver-tool" "go to driver"
    Tool_A{{Tool_A}}:::driverClass
    click Tool_A href "#driver-Tool_A" "go to driver"
    Tool_B{{Tool_B}}:::driverClass
    click Tool_B href "#driver-Tool_B" "go to driver"
    Tool_X{{Tool_X}}:::driverClass
    click Tool_X href "#driver-Tool_X" "go to driver"
    Data_A --> Tool_A
    Tool_A --> Data_B
    Data_B --> Tool_B
    Tool_B --> Data_C
    Tool_B --> Data_D
    INPUT_1 --> Tool_X
    Tool_X --> OUTPUT_1
    Tool_X --> OUTPUT_2
    classDef driverClass fill:#f96,color:black,stroke:#854c30;

exaworkflows pipeline
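To make the linkage concrete, here is a minimal sketch in Python of how flowchart edges could be derived from the :inputs: and :outputs: options of each driver. It is not the extension's implementation: the function name `pipeline_flowchart` and the dictionary layout are hypothetical, and the click links and class definition of the generated diagrams are left out.

```python
def pipeline_flowchart(drivers, direction="TD"):
    """Return Mermaid flowchart text for a mapping of driver specifications.

    Hypothetical layout: `drivers` maps a driver name to
    {"inputs": [...], "outputs": [...]}, mirroring the directive options.
    """
    products = []  # data products, in order of first appearance
    edges = []

    def remember(product):
        if product not in products:
            products.append(product)

    for driver, spec in drivers.items():
        for product in spec.get("inputs", []):
            remember(product)
            edges.append(f"{product} --> {driver}")  # consumed by the driver
        for product in spec.get("outputs", []):
            remember(product)
            edges.append(f"{driver} --> {product}")  # produced by the driver

    lines = [f"flowchart {direction}"]
    lines += [f"    {p}([{p}])" for p in products]  # rounded data product nodes
    lines += ["    " + d + "{{" + d + "}}:::driverClass" for d in drivers]  # hexagon driver nodes
    lines += ["    " + edge for edge in edges]
    return "\n".join(lines)


drivers = {
    "Tool_A": {"inputs": ["Data_A"], "outputs": ["Data_B"]},
    "Tool_B": {"inputs": ["Data_B"], "outputs": ["Data_C", "Data_D"]},
}
chart = pipeline_flowchart(drivers)
print(chart)
```

Each data product appears once as a node even when it is the output of one driver and the input of another, which is what joins separate drivers into a single pipeline.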

Externally Specifying Data Products and Drivers

You can specify Data Products and Drivers in a separate file. You will need to provide a parser for your file type and register it in the Sphinx conf.py file as:

# workflows parsers
driver_parser = "package.module:parse_driver_function"
data_product_parser = "package.module:parse_dataproducts_function"
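Strings of the form "package.module:function" are a common way to point at a callable without importing it in conf.py. How the extension resolves them internally is not documented here; the sketch below shows one conventional way to do it with the standard library, using a hypothetical helper named `resolve_callable`.

```python
import importlib


def resolve_callable(spec):
    """Resolve a "package.module:function" string to the callable it names."""
    module_name, sep, attr = spec.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'package.module:function', got {spec!r}")
    module = importlib.import_module(module_name)
    obj = getattr(module, attr)
    if not callable(obj):
        raise TypeError(f"{spec!r} does not refer to a callable")
    return obj


# Demonstration with a standard-library function instead of a real parser.
loads = resolve_callable("json:loads")
print(loads('{"format": "CSV"}'))  # {'format': 'CSV'}
```

With this form, the module that contains the parser must be importable from the environment in which Sphinx runs.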

See example_yaml_parser.py (https://github.com/michaelbuehlmann/sphinxcontrib-data-pipeline/blob/master/sphinxcontrib_data_pipeline/parsers/example_yaml_parser.py) for an example of how to implement a parser for a YAML file.
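The interface that the extension expects from a parser is defined by the linked example and is not reproduced here. Purely to illustrate the kind of work a parser does, the sketch below assumes a hypothetical function that receives the file content as text and returns one dictionary per data product. It reads JSON rather than YAML so that it runs with the standard library alone, and the top-level key `data_products` and the set of required fields are assumptions based on the directive options shown earlier.

```python
import json

# Fields taken from the data_product directive options; treating all of them
# as required is an assumption made for this illustration.
REQUIRED_KEYS = ("description", "format", "file_extension")


def parse_data_products(text):
    """Parse a JSON specification into a list of data product dictionaries."""
    spec = json.loads(text)
    products = []
    for name, fields in spec.get("data_products", {}).items():
        missing = [key for key in REQUIRED_KEYS if key not in fields]
        if missing:
            raise ValueError(f"data product {name!r} is missing: {', '.join(missing)}")
        product = {"name": name}
        product.update({key: fields[key] for key in REQUIRED_KEYS})
        products.append(product)
    return products


example = """
{
  "data_products": {
    "INPUT_1": {
      "description": "A text file that is used as input for Tool_X",
      "format": "text",
      "file_extension": ".txt"
    }
  }
}
"""
parsed = parse_data_products(example)
print(parsed[0]["name"], parsed[0]["file_extension"])  # INPUT_1 .txt
```

A real parser would return whatever structure the extension requires; consult the linked example for the actual signature.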

You can then use the .. external_data_products:: and .. external_drivers:: directives to include the data products and drivers from the external file. For example:

.. external_data_products:: path/to/specification_file.yaml
   :type: path

.. external_drivers:: path/to/specification_file.yaml
   :type: path

The files can also be hosted in an external repository:

.. external_data_products:: path/in/repository.yaml
   :type: git
   :git-branch: master
   :git-url: https://url_to_git_repo.com/repository.git

Example

The following example shows how to use the external data products and drivers directives. We specify a pipeline in example_specs.yaml and include it here with the following code:

.. external_data_products:: example_specs.yaml
   :type: path

.. external_drivers:: example_specs.yaml
   :type: path

INPUT_1

Description:

A text file that is used as input for Tool_X

Data Format:

text

File Extension:

.txt

flowchart LR
    INPUT_1([INPUT_1])
    Tool_X{{Tool_X}}:::driverClass
    click Tool_X href "#driver-Tool_X" "go to driver"
    INPUT_1 --> Tool_X
    classDef driverClass fill:#f96,color:black,stroke:#854c30;

INPUT_1 producers and consumers

OUTPUT_1

Description:

A text file produced by Tool_X

Data Format:

CSV

File Extension:

.csv

flowchart LR
    OUTPUT_1([OUTPUT_1])
    Tool_X{{Tool_X}}:::driverClass
    click Tool_X href "#driver-Tool_X" "go to driver"
    Tool_X --> OUTPUT_1
    classDef driverClass fill:#f96,color:black,stroke:#854c30;

OUTPUT_1 producers and consumers

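Data Fields

=====  =======  =============  =======================
Field  Type     Units          Description
=====  =======  =============  =======================
id     int64    None           particle id
x      float32  meters         x coordinate
y      float32  meters         y coordinate
vx     float32  meters/second  x component of velocity
vy     float32  meters/second  y component of velocity
=====  =======  =============  =======================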

OUTPUT_2

Description:

A text file produced by Tool_X

Data Format:

text

File Extension:

.txt

flowchart LR
    OUTPUT_2([OUTPUT_2])
    Tool_X{{Tool_X}}:::driverClass
    click Tool_X href "#driver-Tool_X" "go to driver"
    Tool_X --> OUTPUT_2
    classDef driverClass fill:#f96,color:black,stroke:#854c30;

OUTPUT_2 producers and consumers

Tool_X

Tool_X

Executable:

XXX

Repository:

michaelbuehlmann/sphinxcontrib-data-pipeline

Contact:

Michael Buehlmann

Tool_X command line arguments

XXX -test

flowchart LR
    Tool_X{{Tool_X}}:::driverClass
    INPUT_1([INPUT_1])
    click INPUT_1 href "#data-product-INPUT_1" "go to data product"
    OUTPUT_1([OUTPUT_1])
    click OUTPUT_1 href "#data-product-OUTPUT_1" "go to data product"
    OUTPUT_2([OUTPUT_2])
    click OUTPUT_2 href "#data-product-OUTPUT_2" "go to data product"
    INPUT_1 --> Tool_X
    Tool_X --> OUTPUT_1
    Tool_X --> OUTPUT_2
    classDef driverClass fill:#f96,color:black,stroke:#854c30;

Tool_X inputs/outputs