GitHub: Link
Introduction
In today’s data-driven environment, automating data pipelines is essential for efficiency and scalability. Apache Airflow is a powerful workflow orchestration tool that streamlines data transformation and storage. In this blog, we will explore how to use Airflow to clean wine rating data and store it in an SQLite database using Pandas and SQLAlchemy.
Key Technologies Overview
To build our Airflow-based data pipeline, we will use:
- Apache Airflow — Workflow automation and orchestration
- Pandas — Data manipulation and preprocessing
- SQLAlchemy — Database interactions
- PythonVirtualenvOperator — Executes tasks in isolated Python environments
Understanding the Workflow
The goal of this pipeline is to:
- Extract and clean data – Load raw wine rating data, clean it, and remove unnecessary columns.
- Store data in SQLite – Save the cleaned data in an SQLite database for further analysis.
This process is orchestrated by an Apache Airflow DAG, which ensures the data cleaning step completes before the cleaned data is stored in the database.

Explanation:
- Airflow Scheduler: Triggers and manages DAG execution, ensuring tasks run in sequence. It automates workflows and retries failed tasks when needed.
- Airflow DAG (pandas_to_sqlite): Defines the workflow, specifying task dependencies and execution order. It schedules runs at fixed intervals (e.g., daily).
- Task 1: Data Cleaning (Pandas virtual environment): Reads the raw wine-ratings.csv, removes unnecessary data, and structures it. Runs inside a Python virtual environment to prevent dependency conflicts.
- Task 2: Store Data (SQLite database): Takes the cleaned dataset and saves it in an SQLite database, using SQLAlchemy for efficient storage and easy access.
- SQLite Database (wine_database.db): Acts as a structured storage system for the processed data, enabling easy querying and retrieval for reports and analytics.
- Airflow UI & Logs: Provides real-time tracking of DAG execution, logging errors and successes, with debugging options for smooth pipeline runs.
Key Concepts
The DAG named pandas_to_sqlite automates the process of cleaning a dataset using Pandas and storing the processed data into an SQLite database.
Key Components:
- DAG Configuration: Includes parameters like the owner, start date, email notifications, and retry policies.
- Task Flow: Data is cleaned first before being stored in SQLite.
- Virtual Environments: Each task runs in an isolated environment to prevent dependency conflicts.
Hands-On Session:
DAG Configuration
Step 1: Setting Up the DAG
First, we define the DAG (Directed Acyclic Graph) configuration:
from airflow.models import DAG
from datetime import timedelta, datetime
from airflow.operators.python import PythonVirtualenvOperator, is_venv_installed
from airflow.utils.dates import days_ago
from airflow.decorators import task
import logging
import sys

log = logging.getLogger(__name__)

default_args = {
    'owner': 'venkat',
    'start_date': days_ago(0),
    'email': ['[email protected]'],
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

with DAG(
    'pandas_to_sqlite',
    default_args=default_args,
    description='this is pandas analysis',
    schedule_interval=timedelta(days=1),
    catchup=False
) as dag:
This configuration defines:
- DAG name: pandas_to_sqlite
- Start date: runs from the current day via days_ago(0) (see the note below)
- Schedule: once per day (timedelta(days=1))
- Retry policy: retries once, after a five-minute delay, if a failure occurs
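A note on the start date: airflow.utils.dates.days_ago is deprecated in recent Airflow 2.x releases, so on a newer version you may prefer a fixed, timezone-aware start date. A minimal sketch of that alternative (the date shown is just an example, not part of the original DAG):
from datetime import timedelta
import pendulum

default_args = {
    'owner': 'venkat',
    # Fixed, timezone-aware start date instead of days_ago(0)
    'start_date': pendulum.datetime(2024, 1, 1, tz="UTC"),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}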
Step 2: Cleaning the Data
Next, we define the first task. It runs inside an isolated virtual environment that has Pandas installed:
    if not is_venv_installed():
        log.warning("The virtualenv package is required to run this DAG; please install it.")
    else:
        @task.virtualenv(
            task_id="virtualenv_python",
            requirements=["pandas"],
            system_site_packages=False,
        )
        def pandas_read():
            import pandas as pd
            # Load the raw ratings file and use its first column as the index
            df = pd.read_csv("/opt/airflow/data/wine-ratings.csv", index_col=0)
            # Remove carriage returns and replace embedded newlines with spaces
            df = df.replace({"\r": ""}, regex=True)
            df = df.replace({"\n": " "}, regex=True)
            # The grape column is not needed downstream
            df.drop(['grape'], axis=1, inplace=True)
            df.to_csv("/opt/airflow/cleaned_data.csv")
What this function does:
- Reads the raw wine-ratings.csv file
- Removes carriage returns and embedded newline characters
- Drops the grape column
- Saves the cleaned data to a new CSV file (a quick verification sketch follows below)
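Before wiring this task into Airflow, you can sanity-check the cleaning logic locally. This is a hedged sketch, not part of the DAG: it assumes the same file path and column names (notes, grape) used in the task above and needs only Pandas installed.
import pandas as pd

# Load the file produced by the pandas_read task
df = pd.read_csv("/opt/airflow/cleaned_data.csv", index_col=0)

# The grape column should be gone and no embedded newlines should remain
assert "grape" not in df.columns
assert not df["notes"].astype(str).str.contains("\n").any()
print(df.head())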
Step 3: Storing Data in SQLite
Once the data is cleaned, we persist it into an SQLite database using SQLAlchemy:
        @task.virtualenv(
            task_id="sqlite_persist_wine_data",
            requirements=["pandas", "sqlalchemy"],
            system_site_packages=False,
        )
        def pandas_to_sqlite():
            import pandas as pd
            from sqlalchemy import create_engine
            # SQLite database file on the Airflow worker
            engine = create_engine('sqlite:////opt/airflow/wine_database.db', echo=True)
            df = pd.read_csv("/opt/airflow/cleaned_data.csv", index_col=0)
            # df.to_sql('wine_database', engine)  # optionally persist the full table
            # Persist only the notes column into the wine_notes table
            df.notes.to_sql('wine_notes', engine)
What this function does:
- Reads the cleaned CSV file
- Connects to an SQLite database (wine_database.db)
- Saves the notes column into a wine_notes table (a quick query check follows below)
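After the DAG has run, you can confirm the data landed in SQLite by reading the wine_notes table back. A minimal sketch, assuming the same database path used in the task above:
import pandas as pd
from sqlalchemy import create_engine, text

# Connect to the SQLite file the DAG writes to
engine = create_engine("sqlite:////opt/airflow/wine_database.db")

# Pull a few rows from the wine_notes table
with engine.connect() as conn:
    notes = pd.read_sql(text("SELECT * FROM wine_notes LIMIT 5"), conn)
print(notes)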
Step 4: Executing the DAG
The final step is defining the task execution order: the cleaning task must complete before the data is stored in SQLite.
        pandas_read() >> pandas_to_sqlite()
This ensures sequential execution: Airflow first cleans the data and then persists it in SQLite.
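If you want to exercise the whole DAG without the scheduler, Airflow 2.5+ supports running it in-process with dag.test(). A minimal sketch, added at the bottom of the DAG file (outside the with block):
if __name__ == "__main__":
    dag.test()
Because both tasks use @task.virtualenv, the test run still builds the virtual environments, so the first execution can take a while.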
Best Practices for Airflow Pipelines
To optimize performance and reliability, follow these best practices:
- Use virtual environments → Prevent dependency conflicts with PythonVirtualenvOperator or @task.virtualenv
- Optimize data processing → Drop unnecessary columns early to reduce memory usage
- Enable logging → Monitor task execution in the Airflow UI for debugging
- Retry mechanism → Use Airflow’s retry policy to handle intermittent failures
- Database optimization → Use indexing in SQLite for faster queries (see the sketch below)
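For the last point, here is a hedged sketch of adding an index to the wine_notes table created by this pipeline; the index name idx_wine_notes_notes is an arbitrary choice, not something the DAG creates.
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:////opt/airflow/wine_database.db")

# Index the notes column so lookups avoid a full table scan
with engine.begin() as conn:
    conn.execute(text(
        "CREATE INDEX IF NOT EXISTS idx_wine_notes_notes ON wine_notes (notes)"
    ))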
Conclusion
In this blog, we automated a data pipeline using Apache Airflow, Pandas, and SQLite. We successfully:
- Cleaned raw data using Pandas
- Stored the processed data in an SQLite database
- Scheduled the automation using an Airflow DAG
By leveraging Airflow’s automation capabilities, you can simplify complex workflows and ensure data consistency at scale. Should you encounter any issues, our team is readily available to provide guidance and support with a prompt response. Please do not hesitate to reach out to us at any time: [email protected]
AccentFuture provides expert-led Airflow Training for professionals who want to master workflow automation. Our Airflow Online Training offers flexible learning from anywhere, with hands-on practice building and managing data pipelines. The Airflow Course is a complete program covering essential topics such as DAGs, task scheduling, and Python integration. Join our current program to acquire Airflow skills and grow as a data engineer.