Dagster is an open-source data orchestrator tool that helps you build, test, and deploy data pipelines. With Dagster, you can define your pipelines using Python, and it provides a visual interface to monitor the pipeline's execution. Here's a sneak peek:

Such a graph is useful because:

we can see how data flows and changes through the pipeline.
instead of executing the pipeline from the very beginning every time, we can cache our results at each stage and execute the pipeline at any intermediate step.

Getting started with Dagster

Setting up our environment

Start by creating a virtual environment and activating it:


        
        
            
                
                
                    python -m venv .venv
.venv\Scripts\activate           # For Windows
source .venv/bin/activate        # For Linux and MacOS

We then install the dependencies of our project:


        
        
            
                
                
                    pip install dagster
pip install dagster-webserver
pip install pandas

Note the following:

dagster is the main package that contains all the Dagster's core logic.
dagster-webserver is for the Dagster UI dashboard.
pandas is for doing some simple data processing.

Next, create a folder called my_dagster_code_location and within it, create two empty Python files:


        
        
            
                
                
                    .venv
my_dagster_code_location
├── __init__.py
└── my_assets.py

We will cover the concept of code location in a separate guide - don't worry about it for now!

Defining our assets

In the my_assets.py, copy and paste the following code:


        
        
            
                
                
                    from dagster import asset
import pandas as pd

@asset(name="iris_data")
def get_iris_data():
    df = pd.read_csv("https://raw.githubusercontent.com/SkyTowner/sample_data/main/iris_data.csv")
    return df

@asset(name="setosa")
def get_setosa(iris_data):
    return iris_data.query("species == 0")

Here, note the following:

we added Dagster's @asset decorator to indicate that the function should be treated as a Dagster asset, which is essentially a piece of data that flows through a pipeline. In this case, the get_iris_data returns an asset (in the form of a Pandas DataFrame), which we name iris_data.
we also defined another asset called setosa, which takes in as input the iris_data asset.

Creating an initialization script

Next, in the __init__.py, copy and paste the following content:


        
        
            
                
                
                    from dagster import Definitions, load_assets_from_modules
from . import my_assets

all_assets = load_assets_from_modules([my_assets])
defs = Definitions(assets=all_assets)

Note the following:

we import our my_assets module and use Dagster's load_assets_from_modules(-) method to get a reference to all the assets we've defined in the module.
we then instantiate a Definitions object that contains all the entities such as assets, jobs, resources and schedules used for our Dagster project. Since we only make use of assets for now, we will simply pass in our assets and nothing else.

Don't worry about the other types of entities for now - we will go over them in detail in the latter guides 🙂.

Launching the Dagster UI

To launch the Dagster UI (or dagster-webserver), type the following in the terminal:


        
        
            
                
                
                    dagster-webserver -m my_dagster_code_location
                
            
            ...
2023-07-25 22:50:55 +0800 - dagster-webserver - INFO - Serving dagster-webserver on http://127.0.0.1:3000 in process 35413

Here, the -m flag is to specify the location of the directory that contains the __init__.py file.

We now have a local Dagster server hosted at localhost:3000. We should see the following data pipeline:

We can see the assets (iris_data and setosa) that we've defined in our code earlier! There is also an arrow pointing from iris_data to setosa, which means that setosa is dependent on iris_data - exactly like how we defined it in our code!

However, Dagster is telling us that the assets have never been materialized. In Dagster, to materialize means to create or update an asset. Even though we have defined an asset, we have never actually created it - this is what Dagster is telling us.

Materializing our assets

To materialize our asset, click on the black Materialize button at the top-right corner:

We see that our asset has been materialized:

Note that since our setosa asset depends on iris_data, we say that:

setosa is a downstream asset.
iris_data is an upstream asset.

Checking meta-information about our assets

Clicking on our iris_data asset will open up the following panel:

Note the following:

whenever we materialize an asset, we create what's called a run. We can think of a run as a particular execution of our data pipeline. We see that the ID of our run is b02e0a69.
we also see the path of the asset (../tmp_nr4u770/storage/iris_data). By default, Dagster will create a temporary folder in our local repository and store the asset in there. This is only a temporary folder because it is deleted once we stop our Dagster server. This means that if we restart our Dagster server, we will see the label "Never Materialized" for all of our assets. We will later explore how to persistently store our materialized assets.

NOTE

By default, the materialized assets are stored in pickle format. We can modify the file format (e.g. csv) by using IO managers.

Exploring the run tab

Click on the Runs tab in the header:

Note the following:

we can see meta information such as Run ID, created date, run duration and so on.
the assets that we've materialized are shown under Target.

Visualizing the run timeline

Now, click on the View run button on the right. This will open up a timeline of the events that happen in the run:

Note the following:

we see that creating the iris_data asset takes much longer compared to creating the setosa asset.
this timeline updates in real-time as our run progresses. Since our run completes almost immediately, the timeline that we see is already complete and thus no live updates are going on.
we will be able to see live logs at the bottom! Later, we will go into more detail on how we can log information at different levels (e.g. info, error, critical).

As you can imagine, this timeline is extremely useful for profiling - we can identify slow operations in our pipeline and optimize them!

Published by Isshin Inada

Edited by 0 others

Did you find this page useful?

thumb_up

thumb_down

Comment

Citation

Ask a question or leave a feedback...

thumb_up

thumb_down

chat_bubble_outline

settings

Enjoy our search

Hit / to insta-search docs and recipes!