Data pipeline#

Data flow#

This section describes the data flow of the pipeline.

(RAW) → (PREPROCESSED) → (DATASETS) → (APPDATA)

Overview:

| Step | Directory | Description | Rule(s) for target | Config section |
|------|-----------|-------------|--------------------|----------------|
| 0 | `store/raw/` | Raw data as downloaded | TBD | TBD |
| 1 | `store/preprocessed/` | Preprocessed data, 1:1 from (0) | TBD | TBD |
| 2 | `store/datasets/` | Datasets, n:1 from (0) and (1) | TBD | TBD |
| 3 | `store/appdata/` | Data ready to be used in the app from (2) | TBD | TBD |

Each step is described briefly below, using a common example use case.

Example data flow:

Snakefile

As all rules are searched for and included in the main Snakefile, they must have unique names. It's a good idea to use the dataset name as a prefix, e.g. rule osm_forest_<RULE_NAME>.
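The naming convention above can be checked with a few lines of Python. This is only a minimal sketch: the helper names `prefixed_rule_name` and `find_duplicate_rule_names` are hypothetical and not part of the pipeline, and the list of rule names is assumed to be collected by other means.

```python
from collections import Counter


def prefixed_rule_name(dataset: str, rule: str) -> str:
    """Build a rule name that carries the dataset name as prefix."""
    return f"{dataset}_{rule}"


def find_duplicate_rule_names(rule_names):
    """Return the sorted rule names that occur more than once."""
    counts = Counter(rule_names)
    return sorted(name for name, n in counts.items() if n > 1)


# Two datasets that both define a rule called "create" clash,
# while the prefixed variant stays unique.
names = ["create", "create", prefixed_rule_name("osm_forest", "create")]
duplicates = find_duplicate_rule_names(names)
```

Here `duplicates` contains only the unprefixed name, which illustrates why the dataset prefix helps.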

(0) Raw#

Contains immutable raw data as downloaded. In the directory TEMPLATE there are two additional files: dataset.md (see that file for further instructions) and metadata.json.

Note: Assumptions are to be defined in the scenarios, not the raw data. See the scenario readme in SCENARIOS.md.

Example

Dataset A: ERA5 weather dataset for Germany
Dataset B: MaStR dataset on renewable generators
Dataset C: Shapefiles of region: federal districts and municipalities

(1) Preprocessed#

Data from (0) Raw that has undergone some preprocessing such as:

Archive extracted
CRS transformed (see below for CRS conventions)
Fields filtered
But do NOT merge, combine, or clip multiple (raw) datasets here! This should be done in (2).

Rules, config and info

Preprocessing rule(s) for the dataset can be defined in the dataset's Snakefile: preprocessed/TEMPLATE/create.smk.
Subsequently, these rules must be included in the module file preprocessed/module.smk to take effect (see template in the file).
Custom, dataset-specific configuration can be put into the dataset's config file preprocessed/TEMPLATE/config.yml.
The title and description of each dataset are to be gathered in the file preprocessed/TEMPLATE/dataset.md.
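Assuming a new dataset starts out as a copy of the TEMPLATE directory, the file layout described above could be created with a small helper like the one below. The function name `scaffold_dataset` and the dataset name used in the usage comment are hypothetical. The sketch only copies files; including the new rules in module.smk remains a manual step.

```python
import shutil
from pathlib import Path


def scaffold_dataset(stage_dir, name):
    """Copy <stage_dir>/TEMPLATE to <stage_dir>/<name> and return the new path.

    stage_dir is e.g. the preprocessed/ or datasets/ directory.
    Raises FileExistsError if the target directory already exists.
    """
    stage_dir = Path(stage_dir)
    template = stage_dir / "TEMPLATE"
    target = stage_dir / name
    if target.exists():
        raise FileExistsError(f"Dataset directory already exists: {target}")
    # Copies create.smk, config.yml, dataset.md and anything else in TEMPLATE
    shutil.copytree(template, target)
    return target


# Usage (hypothetical dataset name):
# scaffold_dataset("preprocessed", "osm_forest")
```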

Example

Dataset D: Extracted ERA5 weather dataset for Germany (from dataset A)
Dataset E: Wind energy turbines extracted from MaStR dataset, filter for columns power and geometry (from dataset B)
Dataset F: Federal districts converted to Geopackage file, CRS transformed (from dataset C)
Dataset G: Municipalities converted to Geopackage file, CRS transformed (from dataset C)
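The field filtering mentioned for dataset E can be illustrated with plain Python records. This is a simplified sketch, not the pipeline's actual implementation: real data would typically be handled with a geodata library, and the record contents (including the `operator` column) are made up for illustration.

```python
def filter_fields(records, keep):
    """Return copies of the records that contain only the fields in keep."""
    return [{key: rec[key] for key in keep if key in rec} for rec in records]


# Hypothetical raw records with one column that is not needed downstream
raw_turbines = [
    {"power": 3000, "geometry": "POINT (12.1 51.8)", "operator": "A"},
    {"power": 2300, "geometry": "POINT (12.3 51.9)", "operator": "B"},
]

# Keep only the columns named in the example: power and geometry
turbines = filter_fields(raw_turbines, keep=["power", "geometry"])
```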

(2) Datasets#

Datasets, created from arbitrary combinations of datasets from store/preprocessed/ and/or store/datasets/.

Rules, config and info

Creation rule(s) for the dataset can be defined in the dataset's Snakefile: datasets/TEMPLATE/create.smk.
Subsequently, these rules must be included in the module file datasets/module.smk to take effect (see template in the file).
Custom, dataset-specific configuration can be put into the dataset's config file datasets/TEMPLATE/config.yml.
The title and description of each dataset are to be gathered in the file datasets/TEMPLATE/dataset.md.
Custom, dataset-specific scripts are located in scripts.

Example

Using datasets from store/preprocessed/ and store/datasets/:

Dataset H: Wind energy turbines in the region of interest (from datasets E+F)
Dataset I: Normalized wind energy feedin timeseries for the region (from datasets D+G)
Dataset J: Federal districts (from dataset F)
Dataset K: Municipalities (from dataset G)

(3) App data#

Data ready to be used in the app / as expected by the app.

While the data from (0), (1) and (2) contain datasets from intermediate steps, this directory holds app-ready datasets. See appdata/APPDATA for further information.

To which processing step do the different stages of my data belong?#

Often, there are multiple options to assign a dataset. If you're still unsure where the different stages of your dataset should be located after having read the examples above, the following example may help you:

Let's say you want to integrate some subsets of an OpenStreetMap dataset, namely (a) forests and (b) residential structures for a specific region. You want both to be stored in separate files. This involves (e.g.):

1. Convert the pbf file to a more handy type such as Geopackage
2. Extract data using OSM tags
3. Clip with region

First, you would create a new raw dataset in store/raw/. You could then put steps 1 and 2 into the preprocessing, resulting in two datasets in store/preprocessed/, and perform step 3 for each of them in store/datasets/. However, this would execute step 1 twice. While this is fine in terms of the pipeline flow, it may be better to apply only step 1 in the preprocessing and create a single dataset in store/preprocessed/. From this dataset, you would create the two extracts (step 2) in store/datasets/. Finally, the clipped datasets from step 3 would be created in store/datasets/ as well, resulting in a total of four datasets in store/datasets/.
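The recommended layout from this example can be written down as a small plan. The dataset names below are hypothetical and chosen only for illustration; the sketch merely lists the resulting directories and does not run any processing.

```python
from pathlib import PurePosixPath

STORE = PurePosixPath("store")

# Hypothetical dataset names per stage of the pipeline
plan = {
    "raw": ["osm"],            # pbf file as downloaded
    "preprocessed": ["osm"],   # step 1: converted to Geopackage (done once)
    "datasets": [
        "osm_forest",               # step 2: extract by OSM tags
        "osm_residential",          # step 2: extract by OSM tags
        "osm_forest_region",        # step 3: clipped with region
        "osm_residential_region",   # step 3: clipped with region
    ],
}


def dataset_dirs(plan):
    """Return the directory of every planned dataset, stage by stage."""
    return [
        str(STORE / stage / name)
        for stage, names in plan.items()
        for name in names
    ]


dirs = dataset_dirs(plan)
```

With this layout the conversion in step 1 is executed only once, at the cost of four entries in store/datasets/.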

Temporary files#

TODO: REVISE

Temporary files are stored in store/temp/ by default and, depending on your configuration, can get quite large. You can change the directory in config.yml → path → temp.
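How the temp directory might be resolved from the configuration is sketched below. It assumes the parsed config.yml is available as a nested dictionary with the keys `path` and `temp` as named above; the function name `temp_dir` and the example path are hypothetical.

```python
DEFAULT_TEMP_DIR = "store/temp/"


def temp_dir(config: dict) -> str:
    """Return config["path"]["temp"] or the default if it is not set."""
    # "or {}" also covers an empty "path" section that was parsed as None
    path_section = config.get("path") or {}
    return path_section.get("temp") or DEFAULT_TEMP_DIR


default_dir = temp_dir({})
custom_dir = temp_dir({"path": {"temp": "/scratch/pipeline_temp/"}})
```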

Further notes#

No data files in the repository! But keep the `.gitkeep` files#

Make sure not to commit any data files located in store/ to the repository (except for the descriptive readme and metadata files). Rules in the .gitignore file should make git omit those files, but you should not rely on them. Instead, save the data files in the designated directory on the RLI Wolke.

Each data directory in the provided templates contains an empty .gitkeep file. When creating a new dataset, please commit this file too to make sure the (empty) data directory is retained.
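If a data directory was created by hand and lacks the placeholder, it can be added with a few lines of Python. This is a minimal sketch; the function name `ensure_gitkeep` and the directory path in the usage comment are assumptions, not part of the pipeline.

```python
from pathlib import Path


def ensure_gitkeep(data_dir):
    """Create data_dir if needed and make sure it holds a .gitkeep file."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    keep_file = data_dir / ".gitkeep"
    # touch() creates an empty file and leaves an existing one unchanged
    keep_file.touch(exist_ok=True)
    return keep_file


# Usage (hypothetical path):
# ensure_gitkeep("store/raw/my_dataset/data")
```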

Coordinate reference system#

Please use LAEA Europe (EPSG:3035) as default CRS when writing geodata.

TODO: REVISE

The files in store/raw/ can have an arbitrary CRS.
In the preprocessing (step 1), the data is converted to the CRS specified in the global config.yml → preprocessing → crs. It is important to use an equal-area CRS to make sure operations such as buffering work properly. By default, it is set to LAEA Europe (EPSG:3035).
The final output is written in the CRS specified in the global config.yml → output → crs. By default, it is set to WGS84 (EPSG:4326), which is used by the app.
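The CRS lookup with its defaults can be sketched as follows. It assumes the global config.yml has been parsed into a nested dictionary with the sections `preprocessing` and `output` named above; the function name `crs_for` is hypothetical.

```python
# Defaults as stated above: LAEA Europe for preprocessing, WGS84 for output
DEFAULT_CRS = {
    "preprocessing": "EPSG:3035",
    "output": "EPSG:4326",
}


def crs_for(config: dict, section: str) -> str:
    """Return the CRS configured for a section, falling back to the default."""
    if section not in DEFAULT_CRS:
        raise ValueError(f"Unknown config section: {section!r}")
    section_cfg = config.get(section) or {}
    return section_cfg.get("crs") or DEFAULT_CRS[section]


preprocessing_crs = crs_for({}, "preprocessing")
output_crs = crs_for({"output": {"crs": "EPSG:25832"}}, "output")
```

The second call shows a configuration that overrides the output CRS; the EPSG code used there is only an example value.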
