sphinxcontrib-data-pipeline
This extension allows you to document data processing pipelines, which connect tools (“drivers”) and their corresponding data products, in Sphinx.
Installation
You can install the extension via pip:
pip install sphinxcontrib-data-pipeline
Getting Started
To use this extension, add the following extensions to the extensions list in your conf.py file:
extensions = [
    ...
    "sphinxcontrib_data_pipeline",
    "sphinxcontrib.mermaid",
    "sphinx_tabs.tabs",
    "sphinx_toolbox.collapse"
]
Data Products
A data product can be added with the data_product directive. The directive has the following options:
.. data_product:: Data
   :description: Description of `Data A`.
   :format: CSV
   :file_extension: .csv
Data
- Description:
Description of Data A.
- Data Format:
CSV
- File Extension:
.csv
Drivers
A driver can be added with the driver directive. The directive has the following options:
.. driver:: tool
   :description: Description of `tool`.
   :executable: tool
   :repository: https:<path_to_repository>.com
   :documentation: https:<path_to_documentation>.com
   :contact: Max Mustermann
   :inputs:
   :outputs:

   # Command Line Arguments
   tool [options] input_file output_file
For :inputs: and :outputs: you can specify a list of data products that are expected as input or produced as output. Make sure to use the same name as in the data product directive.
tool
- Description:
Description of tool.
- Executable:
tool
- Repository:
https:<path_to_repository>.com
- Documentation:
https:<path_to_documentation>.com
- Contact:
Max Mustermann
# Command Line Arguments
tool [options] input_file output_file
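Because :inputs: and :outputs: must reuse the exact data product names, a quick consistency check can catch typos before building the docs. The following is a minimal sketch, assuming the pipeline has been written out by hand as plain Python data. The dictionaries, the names in them, and the function find_undeclared are hypothetical and are not part of the extension.

```python
# Hypothetical in-memory description of a pipeline. These structures are
# illustrative only; the extension does not provide them.
data_products = {"Data_A", "Data_B"}
drivers = {
    "Tool_A": {"inputs": ["Data_A"], "outputs": ["Data_B"]},
    "Tool_B": {"inputs": ["Data_B"], "outputs": ["Data_C"]},  # Data_C is not declared
}


def find_undeclared(data_products, drivers):
    """Return sorted (driver, name) pairs whose name matches no data product."""
    missing = []
    for driver, spec in drivers.items():
        for name in spec.get("inputs", []) + spec.get("outputs", []):
            if name not in data_products:
                missing.append((driver, name))
    return sorted(missing)


undeclared = find_undeclared(data_products, drivers)
print(undeclared)
```

An empty list means every name used by a driver has a matching data product.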
Pipeline
A pipeline is specified by a set of drivers and data products that are linked together. The following example specifies 4 data products and 2 drivers.
.. data_product:: Data_A
   :description: Description of `Data A`.
   :format: text
   :file_extension: .txt

.. data_product:: Data_B
   :description: Description of `Data B`.
   :format: text
   :file_extension: .txt

.. data_product:: Data_C
   :description: Description of `Data C`.
   :format: text
   :file_extension: .txt

.. data_product:: Data_D
   :description: Description of `Data D`.
   :format: text
   :file_extension: .txt

.. driver:: Tool_A
   :description: Description of `Tool A`.
   :inputs: Data_A
   :outputs: Data_B

   # Command Line Arguments
   Tool_A [options] input_file output_file

.. driver:: Tool_B
   :description: Description of `Tool B`.
   :inputs: Data_B
   :outputs: [Data_C, Data_D]

   # Command Line Arguments
   Tool_B [options] input_file output_file
Data_A
- Description:
Description of Data A.
- Data Format:
text
- File Extension:
.txt
Data_B
- Description:
Description of Data B.
- Data Format:
text
- File Extension:
.txt
Data_C
- Description:
Description of Data C.
- Data Format:
text
- File Extension:
.txt
Data_D
- Description:
Description of Data D.
- Data Format:
text
- File Extension:
.txt
Tool_A
- Description:
Description of Tool A.
# Command Line Arguments
Tool_A [options] input_file output_file
Tool_B
- Description:
Description of Tool B.
# Command Line Arguments
Tool_B [options] input_file output_file
The entire pipeline can be shown in a single diagram by using the pipeline directive:
.. pipeline::
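To make the structure of such a diagram concrete, the sketch below turns the two drivers from the example into Mermaid flowchart text, with one edge per input and per output. It only illustrates the idea of linking drivers and data products. The function to_mermaid and the dictionary layout are assumptions, and the markup the extension actually generates may differ.

```python
# Drivers from the example, written as plain Python data. This mirrors the
# directives by hand; it is not how the extension stores them internally.
drivers = {
    "Tool_A": {"inputs": ["Data_A"], "outputs": ["Data_B"]},
    "Tool_B": {"inputs": ["Data_B"], "outputs": ["Data_C", "Data_D"]},
}


def to_mermaid(drivers):
    """Build Mermaid flowchart text: data product --> driver --> data product."""
    lines = ["flowchart LR"]
    for driver, spec in drivers.items():
        for name in spec["inputs"]:
            lines.append(f"    {name} --> {driver}")
        for name in spec["outputs"]:
            lines.append(f"    {driver} --> {name}")
    return "\n".join(lines)


diagram = to_mermaid(drivers)
print(diagram)
```

Data_B appears both as an output of Tool_A and as an input of Tool_B, which is what chains the two drivers into a single pipeline.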
Externally Specifying Data Products and Drivers
You can specify Data Products and Drivers in a separate file. You will need to provide a parser for your file type and register it in the Sphinx conf.py file as:
# workflows parsers
driver_parser = "package.module:parse_driver_function"
data_product_parser = "package.module:parse_dataproducts_function"
See example_yaml_parser.py (https://github.com/michaelbuehlmann/sphinxcontrib-data-pipeline/blob/master/sphinxcontrib_data_pipeline/parsers/example_yaml_parser.py) for an example of how to implement a parser for a YAML file.
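The parser settings use a "package.module:function" string, a common Python convention for naming an importable callable. How the extension resolves these strings internally is not shown here. The sketch below is one way such a reference can be resolved with the standard library, which is handy for checking that a configured path is importable. The helper name resolve is hypothetical, and os.path:join stands in for a real parser function.

```python
import importlib


def resolve(reference):
    """Import the object named by a "package.module:attribute" string."""
    module_name, _, attribute = reference.partition(":")
    if not module_name or not attribute:
        raise ValueError(f"expected 'package.module:function', got {reference!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


# Standard-library stand-in for "package.module:parse_driver_function".
join = resolve("os.path:join")
print(join("specs", "example_specs.yaml"))
```

If the module cannot be imported or the function name is misspelled, this raises ImportError or AttributeError respectively, which points at the part of the string that is wrong.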
You can then use the .. external_data_products:: and .. external_drivers:: directives to include the data products and drivers from the external file.
For example:
.. external_data_products:: path/to/specification_file.yaml
   :type: path

.. external_drivers:: path/to/specification_file.yaml
   :type: path
The files can also be hosted in an external repository:
.. external_data_products:: path/in/repository.yaml
   :type: git
   :git-branch: master
   :git-url: https://url_to_git_repo.com/repository.git
Example
The following example shows how to use the external data products and drivers directives. We specify a pipeline in example_specs.yaml and include it here with the following code:
.. external_data_products:: example_specs.yaml
   :type: path

.. external_drivers:: example_specs.yaml
   :type: path
INPUT_1
- Description:
A text file that is used as input for Tool_X
- Data Format:
text
- File Extension:
.txt
OUTPUT_1
- Description:
A text file produced by Tool_X
- Data Format:
CSV
- File Extension:
.csv
Data Fields
Field | Type | Units | Description |
---|---|---|---|
id | int64 | None | particle id |
x | float32 | meters | x coordinate |
y | float32 | meters | y coordinate |
vx | float32 | meters/second | x component of velocity |
vy | float32 | meters/second | y component of velocity |
OUTPUT_2
- Description:
A text file produced by Tool_X
- Data Format:
text
- File Extension:
.txt
Tool_X
Tool_X
- Executable:
XXX
- Repository:
- Contact:
Michael Buehlmann
XXX -test