1. Introduction to Apache Airflow DAGs
What is a DAG?
A Directed Acyclic Graph (DAG) in Apache Airflow represents a workflow or pipeline where tasks are executed in a defined order based on dependencies. Each DAG consists of multiple tasks that can run in parallel or sequentially.
- Directed: Tasks are executed in a specific order.
- Acyclic: The graph cannot contain cycles (no task can depend on itself).
- Graph: Represents a network of interconnected tasks.
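These three properties can be made concrete with a short standalone Python sketch (plain Python, not Airflow) that orders a hypothetical extract/transform/load task graph using the standard library's graphlib. A cycle is rejected, just as Airflow rejects a cyclic DAG definition:

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical task graph: each task maps to the set of tasks it depends on.
graph = {
    "load": {"transform"},      # load runs after transform
    "transform": {"extract"},   # transform runs after extract
    "extract": set(),           # extract has no dependencies
}

# "Directed" + "acyclic" guarantees a valid execution order exists:
order = list(TopologicalSorter(graph).static_order())
print(order)  # ['extract', 'transform', 'load']

# A cycle (a task depending on itself, directly or indirectly) is invalid:
try:
    list(TopologicalSorter({"a": {"b"}, "b": {"a"}}).static_order())
except CycleError:
    print("cycle detected: not a valid DAG")
```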
2. Apache Airflow DAG Architecture
Components of a DAG
- DAG Definition: Written in Python and defines the workflow.
- Operators: Represent different types of tasks (e.g., BashOperator, PythonOperator, EmptyOperator — formerly DummyOperator).
- Tasks: Individual units of work within a DAG.
- Task Dependencies: Define execution order and relationships between tasks.
Airflow Architecture Overview
Apache Airflow consists of the following components:
- Scheduler: Determines when tasks should run.
- Executor: Executes tasks (LocalExecutor, CeleryExecutor, KubernetesExecutor, etc.).
- Worker Nodes: Execute tasks in a distributed system (for Celery/Kubernetes executors).
- Metadata Database: Stores DAGs, task statuses, logs, and execution metadata.
- Web UI: Provides a graphical interface for monitoring DAGs and tasks.
3. Structure of an Apache Airflow DAG
Basic Components of a DAG File
A DAG file is a Python script defining workflow structure. It includes:
- Imports: Required Airflow modules and libraries.
- DAG Object: Defines the DAG’s properties (e.g., start date, schedule interval).
- Tasks: Defined using operators (e.g., PythonOperator, BashOperator).
- Task Dependencies: Define execution order using >> (sequential) or [t1, t2] >> t3 (fan-in from parallel tasks).
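As an illustrative sketch (not Airflow's actual implementation), the >> syntax can be mimicked with Python's __rshift__/__rrshift__ operator hooks on a hypothetical Task class, to show how it records dependency edges:

```python
class Task:
    """Minimal stand-in for an Airflow task, showing how >> wires dependencies."""
    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []  # tasks that must run after this one

    def __rshift__(self, other):
        # Handles t1 >> t2 and t1 >> [t2, t3]
        targets = other if isinstance(other, list) else [other]
        for t in targets:
            self.downstream.append(t)
        return other  # returning the right side allows chaining: t1 >> t2 >> t3

    def __rrshift__(self, others):
        # Handles [t1, t2] >> t3 (the left side is a plain list)
        for t in others:
            t.downstream.append(self)
        return self

t1, t2, t3 = Task("t1"), Task("t2"), Task("t3")
t1 >> t2         # sequential: t1 runs before t2
[t1, t2] >> t3   # fan-in: t3 waits for both t1 and t2
```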
4. Example Apache Airflow DAG
Creating a Simple DAG in Apache Airflow
Create a new DAG file inside the dags/ directory in your Airflow project:
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

# Define default arguments
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2024, 1, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# Define the DAG
dag = DAG(
    'example_dag',
    default_args=default_args,
    description='A simple example DAG',
    schedule_interval=timedelta(days=1),  # renamed to schedule in Airflow 2.4+
    catchup=False,  # skip backfilling runs for past intervals
)

# Define tasks
task1 = BashOperator(
    task_id='print_hello',
    bash_command='echo "Hello, Airflow!"',
    dag=dag,
)

task2 = BashOperator(
    task_id='print_goodbye',
    bash_command='echo "Goodbye, Airflow!"',
    dag=dag,
)

# Define dependencies
task1 >> task2  # task1 runs before task2

Explanation of Code:
- The DAG starts on January 1, 2024.
- The schedule_interval runs the DAG daily.
- task1 prints "Hello, Airflow!" and task2 prints "Goodbye, Airflow!".
- Dependency: task1 >> task2 ensures task1 runs before task2.
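One scheduling subtlety worth knowing: Airflow triggers each run only after its data interval ends, so a daily DAG with a start date of January 1, 2024 has its first run begin on January 2:

```python
from datetime import datetime, timedelta

start_date = datetime(2024, 1, 1)
interval = timedelta(days=1)

# Airflow launches a run once its data interval has ended, so the run
# with logical date 2024-01-01 actually starts a day later:
first_run_starts = start_date + interval
print(first_run_starts)  # 2024-01-02 00:00:00
```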
5. How to Deploy and Run a DAG
- Save the DAG file: Place it in the dags/ directory inside your Airflow home.
- Start Apache Airflow: airflow standalone
- Check the Web UI: Open http://localhost:8080 and navigate to the DAGs page.
- Trigger the DAG: Click the "Trigger DAG" button.
- Monitor Execution: Check logs in the UI, or run the task locally and print its log with: airflow tasks test example_dag print_hello 2024-01-01
6. Conclusion
Apache Airflow DAGs are powerful for orchestrating workflows efficiently. This guide covered:
- What a DAG is
- DAG structure and architecture
- A step-by-step example DAG
- How to deploy and run a DAG
With this foundation, you can start building complex workflows in Apache Airflow! 🚀