A Coding Guide to Build a Scalable End-to-End Machine Learning Data Pipeline Using Daft
By Amr Abdeldaym, Founder of Thiqa Flow
In the realm of AI automation and business efficiency, the ability to seamlessly process and transform data is fundamental. In this tutorial, we explore how Daft, a high-performance, Python-native data engine, empowers developers to build scalable end-to-end machine learning data pipelines that handle both structured and image data with ease.
Introduction to Daft for High-Performance Data Processing
Daft offers an intuitive yet powerful framework to streamline analytical workflows, combining:
- Lazy execution for optimized computation
- Vectorized operations and user-defined functions (UDFs) for complex transformations
- Efficient support for structured and nested data
- Scalable joins and aggregations suitable for large datasets
These features are pivotal for building AI automation pipelines that improve business efficiency by reducing processing time and simplifying code maintenance.
Step 1: Loading and Inspecting the MNIST Dataset
The journey begins by loading the MNIST dataset, stored in JSON format, from a remote source. Daft's native read_json method allows us to:
- Quickly ingest compressed JSON data
- Automatically infer schema for structured analysis
- Preview sample data to verify content
| Operation | Description |
|---|---|
| Data Loading | Read JSON data directly from a URL into a Daft DataFrame |
| Schema Inspection | Discover data column types, including labels and image pixel arrays |
| Data Preview | Display sample rows for validation |
Step 2: Transforming Raw Pixels into Structured Features
Raw pixel arrays are reshaped into 28×28 matrices representing image data. We introduce feature engineering by calculating pixel-wise statistical metrics — mean and standard deviation — using Daft’s Python UDFs. This process enriches the dataset with valuable insights that help machine learning models distinguish handwritten digits.
- UDF-Based Transformations: Apply reshape and statistical functions on image data.
- Return Types: Explicitly define output data types for optimized execution.
- Scalable Execution: Daft efficiently batches these operations across rows.
Python Code Snippet for Feature Engineering
```python
import daft
import numpy as np
from daft import col

# Reshape a flat list of 784 pixels into a 28x28 float matrix
def to_28x28(px):
    return np.array(px, dtype=np.float32).reshape(28, 28) if px else None

df2 = (
    df
    .with_column("img_28x28", col("image").apply(to_28x28, return_dtype=daft.DataType.python()))
    .with_column("pixel_mean", col("img_28x28").apply(lambda x: float(np.mean(x)) if x is not None else None, return_dtype=daft.DataType.float32()))
    .with_column("pixel_std", col("img_28x28").apply(lambda x: float(np.std(x)) if x is not None else None, return_dtype=daft.DataType.float32()))
)
```
Step 3: Advanced Feature Extraction with Batch UDFs
Moving beyond basic statistics, we use batch UDFs to generate composite feature vectors by summarizing pixel intensity distributions along image axes and calculating normalized centroids. This creates enriched representations that enhance predictive power.
- Summarizes pixel intensities row-wise and column-wise
- Calculates centroid coordinates for pixel brightness
- Combines multiple metrics into one feature vector
Step 4: Aggregation and Enrichment Through Joins
To add context, we perform group-by aggregations on the label column, computing metrics such as total counts and average pixel statistics per digit. We then join these statistics back to the original DataFrame, allowing each data point to carry both individual and class-level insights.
| Aggregation | Description |
|---|---|
| Count per Label | Number of instances per digit |
| Mean Pixel Mean | Average brightness mean by digit |
| Mean Pixel Std | Average brightness variability by digit |
Step 5: Model Training and Validation
After thorough preprocessing, we extract feature vectors and labels into NumPy arrays and train a baseline Logistic Regression model. This step validates the effectiveness of our feature engineering and sets a benchmark for predictive performance.
- Split data into train and test sets for reliable validation
- Evaluate model using accuracy and classification report metrics
- Interpret results to guide further feature refinement
Sample Performance Metrics
| Metric | Value |
|---|---|
| Accuracy | ~0.85 (example baseline) |
| Precision, Recall, F1-Score | Provided per digit in classification report |
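The training step can be sketched as below. A synthetic feature matrix stands in for the engineered features (in the pipeline, `X` and `y` would be pulled out of the Daft DataFrame, e.g. via `df.to_pandas()` or `df.to_pydict()`); the model, split, and metrics mirror the step described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered feature matrix (illustrative data)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 58))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy, linearly separable target

# Train/test split for reliable validation, then a baseline model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
print(classification_report(y_test, pred))
```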
Step 6: Persisting Processed Data for Production
We demonstrate the pipeline’s completeness by writing the processed DataFrame, with engineered features and metadata, to Parquet format. This step ensures data readiness for future model retraining, deployment, or further analysis.
- Optimized storage with efficient compression
- Compatibility with various data platforms supporting Parquet
- Supports uninterrupted, scalable data workflows
Conclusion: Unlocking AI Automation and Business Efficiency with Daft
This tutorial presented a comprehensive, scalable approach to building an end-to-end machine learning data pipeline using Daft. By integrating structured processing, image feature engineering, aggregation, and model training in a unified Pythonic framework, Daft bridges the gap between data engineering and machine learning seamlessly.
Leveraging Daft in your AI automation workflows can significantly improve business efficiency by:
- Accelerating data transformations through optimized execution
- Reducing development complexity with native Python code and UDFs
- Enabling reproducible, maintainable pipelines from raw ingestion to production
To explore the full code and follow future updates, consider joining the active community through popular ML subreddits, newsletters, and social media.
Looking for custom AI automation for your business? Connect with me at https://amr-abdeldaym.netlify.app/