GitHub: Link
Introduction
In today’s data-driven environment, automating data pipelines is essential for efficiency and scalability. Apache Airflow is a powerful workflow orchestration tool that streamlines data transformation and storage. In this blog, we will explore how to use Airflow to clean wine rating data and store it in an SQLite database using Pandas and SQLAlchemy.
Key Technologies Overview
To build our Airflow-based data pipeline, we will use:
- Apache Airflow — Workflow automation and orchestration
- Pandas — Data manipulation and preprocessing
- SQLAlchemy — Database interactions
- PythonVirtualenvOperator — Executes tasks in isolated Python environments
Understanding the Workflow
The goal of this pipeline is to:
- Extract and clean data – Load raw wine rating data, clean it, and remove unnecessary columns.
- Store data in SQLite – Save the cleaned data in an SQLite database for further analysis.
This process is orchestrated by an Apache Airflow DAG, which ensures the data cleaning step completes before the cleaned data is stored in the database.

Explanation:
- Airflow Scheduler: Triggers and manages DAG execution, ensuring tasks run in sequence. It automates workflows and retries failed tasks when needed.
- Airflow DAG (pandas_to_sqlite): Defines the workflow, specifying task dependencies and execution order. It schedules runs at fixed intervals (e.g., daily).
- Task 1: Data Cleaning (Pandas virtual environment): Reads the raw wine-ratings.csv, removes unnecessary data, and structures it. Runs inside a Python virtual environment to prevent dependency conflicts.
- Task 2: Store Data (SQLite database): Takes the cleaned dataset and saves it in an SQLite database, using SQLAlchemy for efficient storage and easy access.
- SQLite Database (wine_database.db): Acts as a structured storage system for the processed data, enabling easy querying and retrieval for reports and analytics.
- Airflow UI & Logs: Provides real-time tracking of DAG execution, logging errors and successes, with debugging options for smooth pipeline runs.
Key Concepts
The DAG named pandas_to_sqlite automates the process of cleaning a dataset using Pandas and storing the processed data into an SQLite database.
Key Components:
- DAG Configuration: Includes parameters like the owner, start date, email notifications, and retry policies.
- Task Flow: Data is cleaned first before being stored in SQLite.
- Virtual Environments: Each task runs in an isolated environment to prevent dependency conflicts.
Hands-On Session:
DAG Configuration
Step 1: Setting Up the DAG
First, we define the DAG (Directed Acyclic Graph) configuration:
from airflow.models import DAG
from datetime import timedelta, datetime
from airflow.operators.python import PythonVirtualenvOperator, is_venv_installed
from airflow.utils.dates import days_ago
from airflow.decorators import task
import logging
import sys

log = logging.getLogger(__name__)

default_args = {
    'owner': 'venkat',
    'start_date': days_ago(0),
    'email': ['[email protected]'],
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

with DAG(
    'pandas_to_sqlite',
    default_args=default_args,
    description='this is pandas analysis',
    schedule_interval=timedelta(days=1),
    catchup=False
) as dag:
This configuration defines:
- DAG name: pandas_to_sqlite
- Start date: runs from the current day via days_ago(0) (see the note below)
- Schedule: once per day (timedelta(days=1))
- Retry policy: retries once, after a five-minute delay, if a failure occurs
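A note on the start date: airflow.utils.dates.days_ago is deprecated in recent Airflow 2.x releases, so on a newer version you may prefer a fixed, timezone-aware start date. A minimal sketch of that alternative (the date shown is just an example, not part of the original DAG):
from datetime import timedelta
import pendulum

default_args = {
    'owner': 'venkat',
    # Fixed, timezone-aware start date instead of days_ago(0)
    'start_date': pendulum.datetime(2024, 1, 1, tz="UTC"),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}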
Step 2: Cleaning the Data
Next, we define the first task. It runs inside an isolated virtual environment that has Pandas installed:
    if not is_venv_installed():
        log.warning("The virtualenv package is required to run this DAG; please install it.")
    else:
        @task.virtualenv(
            task_id="virtualenv_python",
            requirements=["pandas"],
            system_site_packages=False,
        )
        def pandas_read():
            import pandas as pd
            # Load the raw ratings file and use its first column as the index
            df = pd.read_csv("/opt/airflow/data/wine-ratings.csv", index_col=0)
            # Remove carriage returns and replace embedded newlines with spaces
            df = df.replace({"\r": ""}, regex=True)
            df = df.replace({"\n": " "}, regex=True)
            # The grape column is not needed downstream
            df.drop(['grape'], axis=1, inplace=True)
            df.to_csv("/opt/airflow/cleaned_data.csv")
What this function does:
- Reads the raw wine-ratings.csv file
- Removes carriage returns and embedded newline characters
- Drops the grape column
- Saves the cleaned data to a new CSV file (a quick verification sketch follows below)
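Before wiring this task into Airflow, you can sanity-check the cleaning logic locally. This is a hedged sketch, not part of the DAG: it assumes the same file path and column names (notes, grape) used in the task above and needs only Pandas installed.
import pandas as pd

# Load the file produced by the pandas_read task
df = pd.read_csv("/opt/airflow/cleaned_data.csv", index_col=0)

# The grape column should be gone and no embedded newlines should remain
assert "grape" not in df.columns
assert not df["notes"].astype(str).str.contains("\n").any()
print(df.head())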
Step 3: Storing Data in SQLite
Once the data is cleaned, we persist it into an SQLite database using SQLAlchemy:
        @task.virtualenv(
            task_id="sqlite_persist_wine_data",
            requirements=["pandas", "sqlalchemy"],
            system_site_packages=False,
        )
        def pandas_to_sqlite():
            import pandas as pd
            from sqlalchemy import create_engine
            # SQLite database file on the Airflow worker
            engine = create_engine('sqlite:////opt/airflow/wine_database.db', echo=True)
            df = pd.read_csv("/opt/airflow/cleaned_data.csv", index_col=0)
            # df.to_sql('wine_database', engine)  # optionally persist the full table
            # Persist only the notes column into the wine_notes table
            df.notes.to_sql('wine_notes', engine)
What this function does:
- Reads the cleaned CSV file
- Connects to an SQLite database (wine_database.db)
- Saves the notes column into a wine_notes table (a quick query check follows below)
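After the DAG has run, you can confirm the data landed in SQLite by reading the wine_notes table back. A minimal sketch, assuming the same database path used in the task above:
import pandas as pd
from sqlalchemy import create_engine, text

# Connect to the SQLite file the DAG writes to
engine = create_engine("sqlite:////opt/airflow/wine_database.db")

# Pull a few rows from the wine_notes table
with engine.connect() as conn:
    notes = pd.read_sql(text("SELECT * FROM wine_notes LIMIT 5"), conn)
print(notes)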
Step 4: Executing the DAG
The final step is defining the task execution order: the cleaning task must complete before the data is stored in SQLite.
        pandas_read() >> pandas_to_sqlite()
This ensures sequential execution: Airflow first cleans the data and then persists it in SQLite.
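If you want to exercise the whole DAG without the scheduler, Airflow 2.5+ supports running it in-process with dag.test(). A minimal sketch, added at the bottom of the DAG file (outside the with block):
if __name__ == "__main__":
    dag.test()
Because both tasks use @task.virtualenv, the test run still builds the virtual environments, so the first execution can take a while.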
Best Practices for Airflow Pipelines
To optimize performance and reliability, follow these best practices:
- Use virtual environments → Prevent dependency conflicts with PythonVirtualenvOperator or @task.virtualenv
- Optimize data processing → Drop unnecessary columns early to reduce memory usage
- Enable logging → Monitor task execution in the Airflow UI for debugging
- Retry mechanism → Use Airflow’s retry policy to handle intermittent failures
- Database optimization → Use indexing in SQLite for faster queries (see the sketch below)
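For the last point, here is a hedged sketch of adding an index to the wine_notes table created by this pipeline; the index name idx_wine_notes_notes is an arbitrary choice, not something the DAG creates.
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:////opt/airflow/wine_database.db")

# Index the notes column so lookups avoid a full table scan
with engine.begin() as conn:
    conn.execute(text(
        "CREATE INDEX IF NOT EXISTS idx_wine_notes_notes ON wine_notes (notes)"
    ))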
Conclusion
In this blog, we automated a data pipeline using Apache Airflow, Pandas, and SQLite. We successfully:
- Cleaned raw data using Pandas
- Stored the processed data in an SQLite database
- Scheduled the automation using an Airflow DAG
By leveraging Airflow’s automation capabilities, you can simplify complex workflows and ensure data consistency at scale. Should you encounter any issues, our team is readily available to provide guidance and support with a prompt response. Please do not hesitate to reach out to us at any time: [email protected]
AccentFuture provides expert-led Airflow Training for professionals who want to master workflow automation. Our Airflow Online Training offers flexible learning from anywhere, with hands-on practice building and managing data pipelines. The Airflow Course is a complete program covering essential topics such as DAGs, task scheduling, and Python integration. Join our current program to acquire Airflow skills and grow as a data engineer.