Guide on DAGSTER_HOME and dagster.yaml

Last updated: Aug 12, 2023
Tags: Dagster

Types of data in Dagster

Broadly speaking, there are two types of data in Dagster:

  • run data: describes how the pipeline executes (e.g. execution time, the involved assets, event logs). The Dagster UI is the visual interface to the run data.

  • materialized assets: the output data of our pipeline (e.g. a CSV file).

How run-related data is stored in Dagster

When we launch the Dagster UI, Dagster looks for the environment variable DAGSTER_HOME, which is an absolute path pointing to the directory containing our dagster.yaml file. Using the dagster.yaml file, we can configure the behavior of Dagster, such as specifying where to store run data. As for the storage location of materialized assets: if DAGSTER_HOME is defined, then they are stored under DAGSTER_HOME by default.
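For example, assuming a macOS/Linux shell (the path below is just a placeholder, not a path from this guide), we could set DAGSTER_HOME before launching Dagster like so:

export DAGSTER_HOME=/absolute/path/to/my_dagster_home
dagster dev -f main.py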

NOTE

We typically use Dagster's IO managers to define the storage location of the materialized assets because they're more flexible. For instance, using IO managers, we can store our materialized assets in a local directory or blob storage (e.g. AWS S3) in any format (e.g. csv, pickle, parquet).

On the other hand, the storage location of run data is defined using dagster.yaml.
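As a minimal sketch of the IO-manager approach (assuming a recent Dagster version where FilesystemIOManager is importable from dagster; the my_assets directory is a placeholder), we could attach a filesystem IO manager to our definitions so that materialized assets land in a directory of our choosing rather than under DAGSTER_HOME:

from dagster import Definitions, FilesystemIOManager, asset
import pandas as pd

@asset(name="iris_data")
def get_iris_data():
    return pd.read_csv("https://raw.githubusercontent.com/SkyTowner/sample_data/main/iris_data.csv")

# Write materialized assets under ./my_assets instead of $DAGSTER_HOME/storage
defs = Definitions(
    assets=[get_iris_data],
    resources={"io_manager": FilesystemIOManager(base_dir="my_assets")},
)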

Dagster instance

The dagster.yaml file is used to load up what's known as the Dagster instance. The naming here is slightly misleading because the Dagster instance is not a process or a long-running daemon like the dagster daemon or the dagster-webserver. Instead, we can think of the Dagster instance as the root configuration file that contains all the parameters that the Dagster system (including the dagster daemon and dagster-webserver) will use.

Here's a diagram showing what the Dagster instance is:

Here, we're setting the environment variable DAGSTER_HOME, which is an absolute path pointing to a directory (my_dagster_home) containing the dagster.yaml file. Within dagster.yaml, we can specify basic configurations such as where we wish to store run data (via the storage property).
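As a minimal sketch (the my_runs_info value is just a placeholder, which we will also use later in this guide), a dagster.yaml that only configures where run data is stored could look like this:

storage:
  sqlite:
    base_dir: my_runs_info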

Case when DAGSTER_HOME is not defined

If DAGSTER_HOME is not defined, then Dagster will output all our materialized assets and run data (e.g. run ID, timestamps) in a temporary folder that is deleted when the Dagster server terminates. Let's demonstrate this - suppose we have a main.py file like so:

from dagster import Definitions, asset
import pandas as pd

@asset(name="iris_data")
def get_iris_data():
    return pd.read_csv("https://raw.githubusercontent.com/SkyTowner/sample_data/main/iris_data.csv")

defs = Definitions(assets=[get_iris_data])

Launch the Dagster UI using the following command:

dagster dev -f main.py
2023-07-29 23:47:46 +0800 - dagster - INFO - Using temporary directory /Users/isshininada/Desktop/dagster_demo/tmpckgig9g2 for storage. This will be removed when dagster dev exits.
2023-07-29 23:47:46 +0800 - dagster - INFO - To persist information across sessions, set the environment variable DAGSTER_HOME to a directory to use.
2023-07-29 23:47:48 +0800 - dagster-webserver - INFO - Serving dagster-webserver on http://127.0.0.1:3000 in process 82177
...

The output tells us that Dagster is using a temporary directory (tmpckgig9g2 in this case) to store our run data as well as the materialized assets.

On the Dagster UI, click on the Materialize button:

Once our asset is materialized, our temporary tmpckgig9g2 directory should look something like so:

tmpckgig9g2
├── schedules/
├── history/
│   ├── runs/
│   │   ├── 00848753-18b4-a9vn-abnfe.db
│   │   └── ...
│   └── runs.db
└── storage/
    ├── 00848753-18b4-a9vn-abnfe/
    │   └── compute_logs/
    │       ├── jecaitjb.complete
    │       ├── jecaitjb.err
    │       └── jecaitjb.out
    ├── .../
    │   └── compute_logs/
    │       ├── ....complete
    │       ├── ....err
    │       └── ....out
    └── iris_data

Note the following:

  • the schedules/ directory contains information about any scheduling logic of our pipeline. We can ignore this folder for now since we don't use any scheduling.

  • the history/ directory contains information about our runs.

  • the history/runs directory contains SQLite files, each of which holds the run data for a single run. The names of these SQLite files are their run IDs (e.g. 00848753-18b4-a9vn-abnfe).

  • the history/runs.db is an SQLite file storing basic information about all the runs.

  • the storage/ directory contains the compute logs (stdout and stderr) of each run as well as the materialized assets (iris_data in pickle format). This means that if we call print(...) in our data pipeline, the printed value will be stored here. Note that Dagster's own event logs (e.g. PIPELINE_STARTING) are stored under the history/runs directory.

Again, this temporary folder will be deleted once we stop the Dagster server. This can be useful for quick testing, but it is not recommended in most cases since we usually want to persist all our assets and logs.

Case when DAGSTER_HOME is defined

Case when dagster.yaml is missing

If DAGSTER_HOME is defined, then Dagster will proceed to look for the configuration file at $DAGSTER_HOME/dagster.yaml. If this .yaml file does not exist, then Dagster will persist our run data and materialized assets under DAGSTER_HOME (instead of a temporary folder) like so:

$DAGSTER_HOME
├── schedules/
├── history/
└── storage/

Note that unlike the case when DAGSTER_HOME is not defined, the run data and our assets will be persisted even after terminating the Dagster server.

Let's now demonstrate this. Suppose our project structure is like so:

.env
my_dagster_home/
main.py

Where my_dagster_home is an empty folder and main.py is the same as before:

from dagster import Definitions, asset
import pandas as pd

@asset(name="iris_data")
def get_iris_data():
    return pd.read_csv("https://raw.githubusercontent.com/SkyTowner/sample_data/main/iris_data.csv")

defs = Definitions(assets=[get_iris_data])

Whenever we launch a Dagster process, it will look into our .env file to check if the environment variable DAGSTER_HOME is defined. Let's update our .env file and define DAGSTER_HOME like so:

DAGSTER_HOME=/Users/isshininada/Desktop/dagster_demo/my_dagster_home

Here, the DAGSTER_HOME must be an absolute path rather than a relative path.

Let's now launch our Dagster UI:

dagster dev -f main.py
2023-07-30 12:10:05 +0800 - dagster - INFO - Loaded environment variables from .env file: DAGSTER_HOME
...
2023-07-30 12:10:07 +0800 - dagster-webserver - INFO - Serving dagster-webserver on http://127.0.0.1:3000 in process 84499

The output tells us that Dagster has found DAGSTER_HOME in our .env file!

Now, materialize the iris_data asset in the Dagster UI. We should then see our run data and materialized assets stored in our my_dagster_home folder:

my_dagster_home
├── schedules/
│   └── ...
├── history/
│   └── ...
└── storage/
    ├── ...
    └── iris_data

Case when DAGSTER_HOME and dagster.yaml are present

Finally, let's consider the case when DAGSTER_HOME is defined and the dagster.yaml is present. Suppose our current file structure is:

.env
main.py
my_dagster_home
└── dagster.yaml

By default, if we have an empty dagster.yaml file, our run data (history/ and schedules/), stdout/stderr logs and materialized assets (storage/) will be written inside the $DAGSTER_HOME/ folder - just as in the case when the dagster.yaml is not present:

my_dagster_home
├── schedules/
├── history/
└── storage/

As discussed in the beginning, we can modify the dagster.yaml file to configure the behavior of Dagster. We will now demonstrate this.

Changing the location of run-data storage

Let's update our dagster.yaml file such that we write history/ and schedules/ inside a directory called my_runs_info instead:

storage:
  sqlite:
    base_dir: my_runs_info

Let's now launch our Dagster server:

dagster dev -f main.py
2023-07-30 12:25:41 +0800 - dagster-webserver - INFO - Loaded environment variables from .env file: DAGSTER_HOME
...
2023-07-30 12:25:42 +0800 - dagster-webserver - INFO - Serving dagster-webserver on http://127.0.0.1:3000 in process 84654

In our local repository, we should now see that Dagster has created a new directory called my_runs_info containing history/ and schedules/ like so:

.env
main.py
my_dagster_home
└── dagster.yaml
my_runs_info
├── history
│   ├── runs
│   │   └── index.db
│   └── runs.db
└── schedules
    └── schedules.db

The storage/ directory, which contains our stdout and stderr logs, is still stored within the my_dagster_home directory.

NOTE

Dagster's naming convention is quite misleading here. Even though we specified storage in dagster.yaml, the storage/ folder, which contains the stdout/stderr logs and our materialized assets, will still be written under the $DAGSTER_HOME/ folder. Instead, specifying storage in dagster.yaml changes where the history/ and schedules/ directories are written.

To change the location of stdout and stderr logs, replace storage with compute_logs in our dagster.yaml. To change the location of the materialized assets, use IO managers.
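As a hedged sketch of the compute_logs approach (this assumes Dagster's built-in LocalComputeLogManager; the base_dir value is a placeholder), the dagster.yaml entry could look like this:

compute_logs:
  module: dagster.core.storage.local_compute_log_manager
  class: LocalComputeLogManager
  config:
    base_dir: /absolute/path/to/my_compute_logs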

Since we have not yet materialized any assets, the .db files do not contain any run information. Let's head over to the Dagster UI and materialize our assets. We should then see the following:

...
my_runs_info
├── history
│   ├── runs
│   │   ├── 4c6ca96c-c34d-4e6c-ad74-e32be41178c1.db   # new run!
│   │   └── index.db
│   └── runs.db
└── schedules/

Again, the materialized assets and the stdout/stderr logs are still stored in the my_dagster_home/storage folder.

Using environment variables

Instead of hard-coding the base path (my_runs_info in this case) inside the dagster.yaml file, we can specify this as an environment variable. To demonstrate, let's update the .env file like so:

DAGSTER_HOME=/Users/isshininada/Desktop/dagster_demo/my_dagster_home
SQLITE_STORAGE_BASE_DIR=my_runs_info

In our dagster.yaml file, we can access the environment variable like so:

storage:
  sqlite:
    base_dir:
      env: SQLITE_STORAGE_BASE_DIR

Notice how we have to create a new field under base_dir called env to be able to access the environment variable.

Storing run data in PostgresDB via dagster.yaml

By default, run data is stored as an SQLite file locally. Dagster currently supports storing the run data in remote PostgreSQL and MySQL databases. Let's demonstrate this - go ahead and set up a PostgreSQL database using your favorite cloud service and fetch its credentials.

Suppose our dagster.yaml file is as follows:

storage:
  postgres:
    postgres_db:
      username: my_username
      password: my_password
      hostname: my_hostname
      db_name: my_db_name
      port: 5432

Since we have specified storage, all our run data and event logs (e.g. PIPELINE_STARTING) will be written to the PostgresDB. By specifying storage and compute_logs separately, we have the flexibility to store run data and stdout/stderr logs in different locations - for instance, we can store run data in PostgresDB but stream stdout/stderr logs to an AWS S3 bucket, as sketched below.
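Here is a hedged sketch of such a split setup (it assumes the dagster-aws package is installed and that its S3ComputeLogManager is available; the bucket and prefix values are placeholders):

storage:
  postgres:
    postgres_db:
      username: my_username
      password: my_password
      hostname: my_hostname
      db_name: my_db_name
      port: 5432

compute_logs:
  module: dagster_aws.s3.compute_log_manager
  class: S3ComputeLogManager
  config:
    bucket: my-compute-log-bucket
    prefix: dagster-compute-logs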

NOTE

To be able to store logs in PostgresDB, we must first install the package dagster-postgres.
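For reference, the package can be installed with pip:

pip install dagster-postgres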

Starting our Dagster server for the first time with this new configuration file may take a while because Dagster has to create many tables in our PostgresDB. Once the Dagster server has finished launching, we should see many new tables in our database, such as:

...
jobs
runs
run_tags

Since we haven't materialized our assets yet, almost all of these tables will be empty.

Let's now head over to the Dagster UI and materialize our assets. It should take longer than usual to materialize the assets since all the run data has to be written to the remote PostgresDB.

If we inspect our runs table, we should see a new row like so:

run_id                                | status  | ... | create_timestamp           | update_timestamp
7d6329bb-2149-4318-80f0-da46fcde9807  | SUCCESS | ... | 2023-07-02 09:46:05.057909 | 2023-07-02 09:48:21.954296

Note that we can see our deployment config in the Dagster UI as well:

Published by Isshin Inada