Guide on Code Locations in Dagster

Last updated: Aug 12, 2023
Tags: Dagster

What is a code location?

A code location is a file or directory (module) that contains an instance of Dagster's Definitions object, which specifies all the Dagster entities (e.g. assets, resources, schedules) required by our data pipeline. A single deployment can contain multiple code locations.

Note the following:

  • only a single Definitions object is allowed per code location - defining multiple Definitions objects in one code location will throw an error (see the sketch after this list).

  • the Definitions object must be available as a top-level variable. We'll explore more about this below.
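To make the first point concrete, here is a minimal sketch of what not to do - the asset names below are hypothetical, and a file like this would fail to load because it exposes two top-level Definitions objects:

from dagster import Definitions, asset

@asset
def first_asset():
    return 1

@asset
def second_asset():
    return 2

# Two top-level Definitions objects in the same code location -
# Dagster will throw an error when loading this file.
defs_one = Definitions(assets=[first_asset])
defs_two = Definitions(assets=[second_asset])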

Creating a code location using a module

Let's start by creating a minimal code location in our local machine. Create a folder called my_dagster_code_location and two Python files within it:

my_dagster_code_location
├── __init__.py
└── my_assets.py

The content of my_assets.py is as follows:

from dagster import asset
import pandas as pd

@asset(name="iris_data")
def get_iris_data():
    return pd.read_csv("https://raw.githubusercontent.com/SkyTowner/sample_data/main/iris_data.csv")

The content of __init__.py is as follows:

from dagster import Definitions, load_assets_from_modules
from . import my_assets

all_assets = load_assets_from_modules([my_assets])
defs = Definitions(assets=all_assets)

Note the following:

  • we have a Definitions object defined as a top-level variable - otherwise, Dagster will not be able to find our definitions and will throw an error.

  • the name assigned to the Definitions object (defs in this case) is irrelevant, that is, we can assign any variable name we wish.

Let's now launch the Dagster UI and load our my_dagster_code_location as a module (-m) like so:

dagster dev -m my_dagster_code_location

Dagster will then look into the root __init__.py file within the my_dagster_code_location directory and load the Definitions object.

In the Dagster UI, click on Deployment in the header to see our code location.

NOTE

In Python, if-else statements do not introduce a new scope, unlike functions and classes. For instance, the following Definitions object is still considered to be at the top-level scope:

if True:
    defs = Definitions(assets=all_assets)

However, we should not place the Definitions object inside functions or classes because it will no longer be at the top-level scope. The dagster dev command will still run without error, but it will not know about our Dagster entities.
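As a minimal sketch of this anti-pattern (the function and asset names below are hypothetical), the Definitions object only exists in the function's local scope, so Dagster will not pick it up:

from dagster import Definitions, asset

@asset
def my_asset():
    return 1

def create_defs():
    # Because defs is assigned inside create_defs rather than at module
    # top level, Dagster will not see this Definitions object.
    defs = Definitions(assets=[my_asset])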

Creating a code location using a file

Typically, a code location will be a module, but we could also use a file instead. To demonstrate, suppose we have a main.py file with the following content:

from dagster import Definitions, asset

@asset(name="empty_data")
def get_empty_data():
return

defs = Definitions(assets=[get_empty_data])

We can then load this file as our code location like so:

dagster dev -f main.py

In the Dagster UI, we will see that main.py is our code location.

Whether we go with the module or file code location approach, what's important is for Dagster to find the Definitions object.

Creating multiple code locations

Each Dagster deployment can have multiple code locations. For instance, consider the following two module code locations:

.venv
my_dagster_code_location
└── __init__.py
my_other_dagster_code_location
└── __init__.py

We can load both code locations by specifying multiple -m flags:

dagster dev -m my_dagster_code_location -m my_other_dagster_code_location

In the Dagster UI, we should now see two code locations.

We could also load multiple file code locations by using multiple -f flags.
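For example, assuming a second file named other_main.py (a hypothetical file that also exposes its own top-level Definitions object), the command would look like this:

dagster dev -f main.py -f other_main.py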

How does Dagster interact with code locations?

By default, Dagster will launch a new process for each code location. This means that if we have 3 code locations, there will be 3 such processes running.

The Dagster system processes communicate with these code location processes via gRPC. This setup has two advantages:

  • each code location process is independent, which allows each code location to have its own environment. For instance, one code location can run on Python 3.8 with packages A and B, while another can run on Python 3.9 with packages A and C.

  • since the Dagster system processes run independently of code location processes, errors in user code will not crash the Dagster system processes.

Workspace files

A workspace file is a YAML file that indicates where the code locations can be found. In our previous minimal setup, we had to specify the code locations using the -m and -f flags when launching Dagster like so:

dagster dev -m my_dagster_code_location
dagster dev -f main.py

We can avoid specifying the flags by creating a workspace.yaml file that holds information about the path of the code locations. Command-line Dagster tools such as dagster dev, dagster-webserver and dagster-daemon run will use this workspace.yaml file during launch.

For instance, suppose we have the following file structure:

workspace.yaml
my_dagster_code_location
└── __init__.py

Where our workspace.yaml file is:

load_from:
- python_module: my_dagster_code_location

We can now launch our Dagster processes without the -m flag:

dagster dev
...
2023-07-16 19:30:24 +0800 - dagit - INFO - Serving dagit on http://127.0.0.1:3000 in process 22682

Here, the workspace.yaml must be in the current directory for this to work. If the workspace.yaml file is located elsewhere, we can use the -w flag to specify its path:

dagster dev -w path/to/workspace.yaml

A file-based code location can be specified in the same way:

load_from:
- python_file: my_file.py

We can also load multiple code locations by adding more entries like so:

load_from:
- python_module: my_dagster_code_location
- python_module: my_other_dagster_code_location
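For reference, each entry also accepts an expanded mapping form. The sketch below assumes two hypothetical virtual environments; the executable_path key lets each code location run under its own Python interpreter, which is how the per-location environments mentioned earlier are configured:

load_from:
  - python_module:
      module_name: my_dagster_code_location
      executable_path: venvs/code_location_one/bin/python
  - python_module:
      module_name: my_other_dagster_code_location
      executable_path: venvs/code_location_two/bin/python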

Hosting our own gRPC server

Code locations do not necessarily have to reside on the same node as the Dagster processes. Let's spin up another server that contains a code location:

dagster api grpc --module-name my_dagster_code_location --host 0.0.0.0 --port 4500
2023-07-16 19:48:25 +0800 - dagster.code_server - INFO - Started Dagster code server for module my_dagster_code_location on port 4500 in process 23246

Now, we have a so-called code location server running at localhost:4500. In our workspace.yaml file, we can indicate the location of this code location server like so:

load_from:
  - grpc_server:
      host: localhost
      port: 4500
      location_name: "my_grpc_server"

Here, location_name is the code location name we wish to assign - we will later see this name in the Dagster UI.

Open up another terminal and let's launch the Dagster server:

dagster dev

This command will read our workspace.yaml file and connect to the gRPC server that we just defined.

Navigate to Deployment and observe how Dagster has identified our my_grpc_server as a code location server.

When we materialize the assets within our my_grpc_server code location, that code location server will execute our user code. We can confirm this by observing that the Dagster event logs appear in the gRPC server's terminal when materializing an asset:

dagster api grpc --module-name my_dagster_code_location --host 0.0.0.0 --port 4500
...
2023-07-19 22:38:30 +0800 - dagster - DEBUG - __ASSET_JOB - 0208971d-de32-4d43-9913-387f329d1080 - 71718 - RUN_START - Started execution of run for "__ASSET_JOB".
2023-07-19 22:38:30 +0800 - dagster - DEBUG - __ASSET_JOB - 0208971d-de32-4d43-9913-387f329d1080 - 71718 - ENGINE_EVENT - Executing steps using multiprocess executor: parent process (pid: 71718)
2023-07-19 22:38:30 +0800 - dagster - DEBUG - __ASSET_JOB - 0208971d-de32-4d43-9913-387f329d1080 - 71718 - iris_data - STEP_WORKER_STARTING - Launching subprocess for "iris_data".

This example is somewhat contrived because the code location server is hosted on the same machine. In practice, we would typically set up the gRPC server on a remote machine (e.g. a virtual machine) equipped with suitable specs to handle code execution.
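As a sketch, pointing Dagster at a remote code location server only requires changing the host in workspace.yaml (the IP address and location name below are hypothetical):

load_from:
  - grpc_server:
      host: 10.0.0.12
      port: 4500
      location_name: "remote_code_location"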

Published by Isshin Inada