Blog

Roles in the Context of the Analytics Workflow

2025-03-27T10:47:00Z

An analytics workflow documents the journey from raw data to production-ready data models, encompassing development and testing phases. A critical component is governance, implemented through a Pull Request approval process that facilitates regular code reviews and prevents technical debt accumulation. This structured approach ensures quality and maintainability while supporting collaborative development.

To manage this governance effectively, several roles are typically involved. It’s becoming increasingly rare to see a single analyst handle the entire end-to-end process of creating a report or data model. Instead, the trend is moving toward a growing number of specialized roles, each with distinct responsibilities.

Key Roles

The analytics workflow involves several roles, with data analysts holding a particularly central position. Data analysts combine technical skills with deep business domain knowledge, giving them unique insight into business models and challenges.

In contrast, data engineers and data platform engineers typically focus on technical implementation rather than direct business interaction. While aligning the data platform roadmap and investments with business value remains strategically important for leadership, this alignment happens at a higher level and doesn’t directly impact the day-to-day analytics workflow operations.

Role	Tasks
Data Analyst	- Understand business requirements - Analyze data in intermediate and mart layers - Develop SQL queries and transformations - Create and maintain metadata documentation
Data Platform Engineer	- Monitor and support infrastructure resources - Maintain CI/CD pipelines - Manage network infrastructure - Implement cybersecurity measures
Data Engineer	- Design and develop data pipelines - Maintain and optimize data flows - Schedule and orchestrate data processing - Implement data ingestion processes

Measuring the Impact of Each Role

Because each role tackles different problems, it makes sense to measure performance and impact differently. These measures also inform operational rules, such as notification rules and issue and incident prioritization, that increase the reliability of production.

Role	Measure
Data Analyst	- Understanding of business domain - Business Satisfaction - ROI from data initiatives
Data Platform Engineer	- Speed of new data platform features - Reliability of the data platform - Data analysts support and satisfaction
Data Engineer	- Speed of new data ingestions - Reliability of data pipelines - Data analysts support and satisfaction

Understanding the Need for Roles

Analytics teams often tend to combine roles or leave them loosely defined. This approach is understandable, and sometimes even beneficial, in the early stages of an analytics initiative. After all, when starting out, the priority is delivering business value quickly, and formal roles and approval processes can slow things down.

However, this lack of clearly defined roles and boundaries typically creates challenges as the analytics function matures. Common issues include:

Blurred lines between exploratory analytics work and production pipeline operations, making it difficult to maintain service levels
Insufficient knowledge transfer mechanisms, including limited documentation, unclear onboarding processes, and lack of backup coverage for key roles
Team friction arising from ambiguous responsibilities and overlapping ownership
An overemphasis on technical tools and implementation details, rather than addressing the more fundamental needs of role clarity and process alignment

Working with an external partner

In this workflow, there are already quite a few roles and tasks — and in reality, even more can be involved. For example, handling data privacy in Europe requires a solid understanding of GDPR. Given the range of responsibilities and the breadth of expertise needed to run a data department effectively, many data and analytics teams choose to rely on external partners. However, this isn’t always straightforward. External teams often create strong and rigid boundaries between themselves (the extended team) and the customer’s in-house team (the internal team), which can hinder collaboration.

One practical way to navigate the tension between collaboration and rigid boundaries is to establish a clear RACI matrix from the very beginning. This matrix serves as a shared reference to define roles and responsibilities, helping both internal and extended teams understand who is Responsible, Accountable, Consulted, and Informed for each task. It provides structure without creating silos, enabling smoother handovers and aligned expectations.

Conclusions

A well-structured analytics workflow is essential for turning raw data into reliable insights. As analytics initiatives mature, the need for governance, role clarity, and collaboration becomes increasingly important. Specialized roles such as data analysts, data engineers, and platform engineers each contribute unique expertise, and their impact should be measured differently to reflect their responsibilities.

Clearly defined roles and processes—such as code reviews, CI/CD practices, and RACI matrices—not only support governance and maintainability but also foster collaboration across internal and external teams. While early-stage flexibility is useful, long-term success in data and analytics depends on thoughtful structure, cross-functional alignment, and a shared understanding of who does what, and why.

Resources

Example of RACI Matrix

How to Set Up Data Platform Infrastructure on Google Cloud Platform with Terraform

2025-03-05T13:31:00Z

Setting up a solid, scalable data platform is crucial for organizations looking to get the most out of their data. Building upon our previous discussion on architectural considerations for deploying Prefect on various cloud platforms, this article will walk you through building your data platform infrastructure on Google Cloud Platform (GCP) using Terraform.

Our focus is on creating a server-based approach utilizing a single Virtual Machine (VM)—a simple yet powerful starting point for organizations that don’t need to dive into complex source systems or full-blown data warehouses just yet. This approach offers an easy entry point with plenty of room to grow as your data needs evolve.

As this article will use a substantial amount of code, you can find all the relevant code samples in our dedicated repository.

Why Choose a Server-Based Approach with a Single VM?

Choosing a server-based approach with a single VM comes with several advantages:

Cost-effectiveness: A single VM setup is often a more budget-friendly option for initial deployments or smaller-scale projects.
Simplified management: Fewer components mean easier maintenance and troubleshooting.
Flexibility: This approach offers the ability to easily expand or modify your infrastructure as requirements change.
Learning curve: For teams new to cloud infrastructure, starting with a single VM can be less overwhelming and serve as a stepping stone to more complex architectures.

This guide will walk you through the process of setting up key components of our data platform infrastructure on GCP. You’ll learn how to configure the VPC and subnets, set up Compute Engine instances, configure firewall rules, secure SSH access with Identity-Aware Proxy (IAP), establish internet connectivity with Cloud Router and NAT, and store the state files in Cloud Storage. We’ll also dive into the specifics of GCP’s Identity-Aware Proxy, exploring its crucial role in enhancing the security of our data platform.

By using Terraform to manage infrastructure as code, we ensure that our setup is reproducible, version-controlled, and easy to manage. This not only streamlines the initial deployment but also makes scaling and future updates much more efficient.

Let’s get started on building a solid, scalable data platform infrastructure on GCP—one that will grow with your organization’s data needs.

Infrastructure overview

Before we dive into the step-by-step process of setting up your data platform on GCP, let’s take a first look at the key components that make up the environment we’ll be building:

Virtual Private Cloud (VPC): A private network that will serve as the foundation of your environment, providing isolation and security.
Subnet: A private subnet where the virtual machine will reside.
Compute Engine Virtual Machine: The instance where both the GitHub Runner and Prefect Worker will be set up.
Firewall: Configured with rules to allow inbound access exclusively through Google Cloud Identity-Aware Proxy (IAP), blocking all other traffic.
IAP SSH Permissions: Enables secure access to the virtual machine.
Cloud Router: Provides internet connectivity for the virtual machine.
Cloud NAT: Configures a NAT gateway on the Cloud Router so that the virtual machine has outbound internet access. The public IP stays fixed as long as the Cloud NAT object is not destroyed and remains configured for the same zone.
Cloud Storage: Sets up a Google Cloud Storage bucket to store ingested data as Parquet files before transforming and loading it into the database as tables.

Understanding Google Cloud Identity-Aware Proxy (IAP)

To ensure a secure environment, all public access should be completely blocked. With this configuration, resources within the environment can be accessed using two main options:

VPN Connection: In this setup, at least one resource within the VPC must be exposed to the internet to host a VPN endpoint. Alternatively, a separate VPC can be configured solely for VPN purposes, with VPC Network Peering into the main environment. This way, only the VPN-hosting VPC is exposed to the internet, while the main environment remains accessible only internally. Although effective, this configuration is more complex and falls outside the scope of this documentation.
Google Cloud Identity-Aware Proxy (IAP): This option offers a similar secure access model to a VPN but with simplified management through Google Cloud. As outlined in the official documentation:

“When an application or resource is protected by IAP, it can only be accessed through the proxy by principals, also known as users, who have the correct Identity and Access Management (IAM) role. When you grant a user access to an application or resource by IAP, they’re subject to the fine-grained access controls implemented by the product in use without requiring a VPN. When a user tries to access an IAP-secured resource, IAP performs authentication and authorization checks.”

This diagram from Google further illustrates the components required to implement this configuration:

With a solid understanding of Identity-Aware Proxy and its role in managing access, let’s now dive into the practical steps for setting it up and implementing it effectively.

Phase 1: Securing the Essentials

Before starting the Terraform configuration, make sure you have the following tools and setups in place:

Terraform: You’ll need Terraform installed on your local machine. This will be your main tool for provisioning infrastructure.
gcloud CLI: The gcloud CLI tool should be installed and configured to interact with your Google Cloud account.
GCP Service Account: A Google Cloud Platform service account needs to be created.
Enabled APIs: Make sure the Compute Engine API and Cloud Resource Manager API are enabled on your GCP account.

Let’s take a detailed look at these prerequisites:

Step 1: Installing Terraform

Terraform can be installed in various ways, as outlined in HashiCorp’s official installation instructions. For Ubuntu, installation can be done with the following commands:

wget -O - https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt update && sudo apt install terraform

Step 2: gcloud CLI installation

Similarly to Terraform, the gcloud CLI can be installed as per the official instructions. For Ubuntu, run:

 
sudo apt-get update
sudo apt-get install apt-transport-https ca-certificates gnupg curl
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo gpg --dearmor -o /usr/share/keyrings/cloud.google.gpg
echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
sudo apt-get update && sudo apt-get install google-cloud-cli
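Before moving on, it can help to confirm that both tools are actually on the PATH. The following is a minimal sketch; check_tools is a hypothetical helper name introduced here for illustration, not part of Terraform or the gcloud CLI.

```shell
#!/usr/bin/env bash
# Sketch: report whether each required CLI tool is available on PATH.
# check_tools is a hypothetical helper, not part of Terraform or gcloud.
check_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  # Non-zero exit status when at least one tool is missing.
  return "$missing"
}

check_tools terraform gcloud || echo "Install the missing tools before continuing." >&2
```

The function returns a non-zero status when anything is missing, so it can also guard a longer setup script.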

After installation, initialize the gcloud CLI by running gcloud init, then sign in by opening the provided URL in your browser.

gcloud init


# gcloud init
Welcome! This command will take you through the configuration of gcloud.

Your current configuration has been set to: [default]

You can skip diagnostics next time by using the following flag:
  gcloud init --skip-diagnostics

Network diagnostic detects and fixes local network connection issues.
Checking network connection...done.
Reachability Check passed.
Network diagnostic passed (1/1 checks passed).

You must sign in to continue. Would you like to sign in (Y/n)?  Y

Go to the following link in your browser, and complete the sign-in prompts:

    https://accounts.google.com/o/oauth2/auth<URL_TO_OPEN_IN_BROWSER>

Once finished, enter the verification code provided in your browser:
<PROVIDE_VERIFICATION_CODE>

Subsequently, configure the desired Cloud project, default Compute Region, and Zone for your environment.
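If you prefer to set these defaults non-interactively, the same values can be written with gcloud config set. This is a sketch only; set_gcloud_defaults is a hypothetical wrapper, and the example values are the ones used in variable.tf later in this guide.

```shell
#!/usr/bin/env bash
# Sketch: set the default project, region, and zone without the interactive prompts.
# set_gcloud_defaults is a hypothetical helper name.
set_gcloud_defaults() {
  local project="$1" region="$2" zone="$3"
  gcloud config set project "$project"
  gcloud config set compute/region "$region"
  gcloud config set compute/zone "$zone"
}

# Example call with the values used later in this guide:
# set_gcloud_defaults test-project us-central1 us-central1-c
```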

Step 3: Setting up GCP Service Account

To create a GCP Service Account, navigate to the Google Cloud Console, select the correct project, and go to Navigation Menu (3 lines) > IAM & Admin > Service Accounts.

Click Create Service Account and provide the required information.

When prompted to Grant this service account access to a project, select the appropriate role. In this guide, we use the Owner role for simplicity, but it’s advisable to limit permissions to only what’s necessary.

Finally, in the Grant users access to this service account step, assign access to the users who will need to interact with the Kubernetes cluster (not part of this article) or VM. Once done, verify that the service account is correctly set up. Its email should follow this pattern:

{service_account_name}@{project}.iam.gserviceaccount.com

Step 4: Downloading the JSON Key for the Service Account

In the service account interface, go to Actions (3 dots) > Manage keys.

Select Add Key > Create new key > JSON to download the JSON key file. Keep this file secure, as it will be required for Terraform configuration.
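Steps 3 and 4 can also be done from the command line. The sketch below assumes the gcloud CLI is already authenticated with a user that may manage IAM in the project; create_terraform_service_account and the display name are hypothetical choices made for illustration. It grants roles/owner only to mirror the Console walkthrough above, and a narrower role is preferable in practice.

```shell
#!/usr/bin/env bash
# Sketch: create a service account, grant it a project role, and download a JSON key.
# The function name and display name are assumptions, not taken from the guide.
create_terraform_service_account() {
  local project="$1" name="$2" key_file="$3"
  local email="${name}@${project}.iam.gserviceaccount.com"

  gcloud iam service-accounts create "$name" \
    --project "$project" \
    --display-name "Terraform provisioning" || return 1

  # roles/owner mirrors the Console steps; prefer least privilege where possible.
  gcloud projects add-iam-policy-binding "$project" \
    --member "serviceAccount:${email}" \
    --role "roles/owner" || return 1

  # Writes the private key to key_file; keep this file out of version control.
  gcloud iam service-accounts keys create "$key_file" \
    --iam-account "$email"
}

# Example: create_terraform_service_account test-project test-service-account key.json
```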

Step 5: Activating service account

After downloading the JSON key, activate the service account locally with the following command:

gcloud auth activate-service-account \
  {service_account_name}@{project}.iam.gserviceaccount.com \
  --key-file={json_file}.json

This step enables the service account for use in the local environment, ensuring access to necessary GCP resources with IAP tunnel functionality.

Step 6: Generating an HMAC Key for Buckets

To enable file uploads to Cloud Storage, some libraries require an HMAC token in addition to a JSON key. To generate an HMAC token:

Go to Cloud Storage > Settings > Interoperability.
Under Service account HMAC, click Create a key for service account.
Once created, the token will be marked as ‘Active’.

For basic configuration, this step is not required. However, for ingestion, the key must be added to Google Secret Manager to ensure it’s accessible for flow runs.

Step 7: Enabling Compute Engine API and Cloud Resource Manager API

Before Terraform can interact with GCP, make sure the Compute Engine API and the Cloud Resource Manager API are enabled for your project:

If they’re not enabled, head to the Google Cloud Console and enable them.
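As an alternative to the Console, both APIs can be enabled with the gcloud CLI. This is a sketch under the assumption that the active account is allowed to enable services in the project; enable_required_apis is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: enable the two APIs this guide depends on for a given project.
# enable_required_apis is a hypothetical helper name.
enable_required_apis() {
  local project="$1"
  gcloud services enable \
    compute.googleapis.com \
    cloudresourcemanager.googleapis.com \
    --project "$project"
}

# Example: enable_required_apis test-project
# To confirm afterwards: gcloud services list --enabled --project test-project
```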

Step 8: Setting Up a Remote State for Terraform

By default, Terraform stores its state locally in .tfstate files. While this works for development, for any persistent environment, even if you’re the only one working on the project, it’s crucial to store this state in a centralized and reliable location. A common best practice is to use a Google Cloud Storage (GCS) bucket to keep the state safe and accessible, avoiding potential issues with local file loss or conflicts.

However, Terraform itself cannot create the bucket required for storing its state, leading to what’s called a “chicken-and-egg” problem. The bucket must be created manually before running any Terraform code. Tools like Terragrunt can solve this by simplifying environment management and reducing code duplication (example setup). However, for the sake of simplicity, we are not introducing such tools in this context.

To create a new bucket using the gcloud CLI, export the necessary credentials and then proceed with the bucket creation process.

Step 9: Exporting Credentials and Setting up New Bucket

Once the service account JSON key is ready, export its path so that it is picked up as Application Default Credentials (ADC):

export GOOGLE_APPLICATION_CREDENTIALS=test-project-32206692d146.json

Then use the gcloud CLI to create a new bucket and apply the access policy:

gcloud storage buckets create gs://test-project-tfstate --location=us-central1 \
  --uniform-bucket-level-access

gcloud storage buckets add-iam-policy-binding gs://test-project-tfstate \
--member="serviceAccount:test-service-account@test-project.iam.gserviceaccount.com" \
  --role="roles/storage.objectAdmin"
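Optionally, object versioning can be turned on for the state bucket so that earlier state files remain recoverable; the Terraform documentation for the gcs backend recommends this. The guide itself does not include this step, so treat the sketch below as an optional addition. enable_state_versioning is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: enable object versioning on the bucket that holds the Terraform state.
# This step is an optional addition, not part of the original walkthrough.
enable_state_versioning() {
  local bucket="$1"
  gcloud storage buckets update "gs://${bucket}" --versioning
}

# Example: enable_state_versioning test-project-tfstate
```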

Once completed, it should be available in the GCP Console. To check it, go to Navigation Menu > Cloud Storage > Buckets:

Phase 2: Installing & Deploying Infrastructure with Terraform

To set up the environment, Terraform will handle provisioning all the required GCP resources. By the end of this process, your directory structure will look like this:

$ tree

|-- backend.tf
|-- main.tf
|-- provider.tf
|-- test-project-32206692d146.json
|-- variable.tf

For managing both DEV and PROD environments, you can duplicate the files as shown:

$ tree
├── dev
│   ├── backend.tf
│   ├── main.tf
│   ├── test-project-32206692d146.json
│   ├── provider.tf
│   └── variable.tf
└── prod
    ├── backend.tf
    ├── main.tf
    ├── test-project-32206692d146.json
    ├── provider.tf
    └── variable.tf

Terraform Files

The main difference between environments lies in the backend.tf and variable.tf files. In larger projects, using Terraform modules or tools like Terragrunt is recommended for reusable configurations. However, for simplicity, this example uses code duplication, which is also a valid approach.

The content of provider.tf should look like this:

provider "google" {
  region      = var.region
  project     = var.project_name
  credentials = file(var.credentials_file)
  zone        = var.zone
}

backend.tf should point to the bucket created in Step 9 of Phase 1, which holds the shared tfstate file. The backend must be configured with literal values because it is the first block loaded when running terraform init, and variables from variable.tf cannot be referenced here:

terraform {
  backend "gcs" {
    bucket  = "test-project-tfstate"
    prefix  = "terraform/state/prod"
  }
}
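With backend.tf in place, the working directory has to be initialized before any other Terraform command will work. The sketch below wraps terraform init for the dev and prod directory layout shown earlier; init_environment is a hypothetical helper name, and passing . as the directory covers the single-directory layout.

```shell
#!/usr/bin/env bash
# Sketch: initialize the GCS backend and providers for one environment directory.
# init_environment is a hypothetical helper name.
init_environment() {
  local env_dir="$1"
  # -chdir runs Terraform as if it had been started inside env_dir.
  # -input=false makes init fail instead of prompting for missing values.
  terraform -chdir="$env_dir" init -input=false
}

# Example: init_environment prod
```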

All variables used in provider.tf and main.tf are defined in variable.tf, as shown below:

variable "credentials_file" {
  default = "test-project-32206692d146.json"
}

variable "environment" {
  default = "prod"
}

variable "filesystem" {
  default = "ext4"
}

variable "image" {
  default = "projects/ubuntu-os-cloud/global/images/ubuntu-2404-noble-amd64-v20241115"
}

variable "ip_cidr_range" {
  default = "10.202.0.0/24"
}

variable "machine_type" {
  default = "c3d-standard-8-lssd"
}

variable "project_name" {
  default = "test-project"
}

variable "region" {
  default = "us-central1"
}

variable "service_account" {
  default = "serviceAccount:test-service-account@test-project.iam.gserviceaccount.com"
}

variable "zone" {
  default = "us-central1-c"
}

main.tf defines and initializes all infrastructure components outlined in the Infrastructure overview section.

resource "google_compute_network" "vpc_edp" {
 name                    = "vpc-${var.project_name}-${var.environment}"
 auto_create_subnetworks = false

}

resource "google_compute_subnetwork" "subnet_edp" {
 name          = "subnet-${var.project_name}-${var.environment}"
 ip_cidr_range = var.ip_cidr_range
 network       = google_compute_network.vpc_edp.name
 region        = var.region
 depends_on    = [google_compute_network.vpc_edp]
}

resource "google_compute_instance" "vm_edp" {
 project      = var.project_name
 zone         = var.zone
 name         = "${var.project_name}-${var.environment}-01"
 machine_type = var.machine_type
 boot_disk {
   auto_delete = true
   initialize_params {
     image = var.image
     size  = 50
     type  = "pd-ssd"
   }
   mode = "READ_WRITE"
 }
 scratch_disk {
   interface = "NVME"
 }
 network_interface {
   network    = "vpc-${var.project_name}-${var.environment}"
   subnetwork = google_compute_subnetwork.subnet_edp.name
 }
 metadata_startup_script = <<-EOT
   #!/bin/bash
   set -e
   sudo mkfs.ext4 -F /dev/disk/by-id/google-local-nvme-ssd-0
   sudo mkdir -p /mnt/disks/local-nvme-ssd
   sudo mount /dev/disk/by-id/google-local-nvme-ssd-0 /mnt/disks/local-nvme-ssd
   sudo chmod a+w /mnt/disks/local-nvme-ssd

   echo UUID=`sudo blkid -s UUID -o value /dev/disk/by-id/google-local-nvme-ssd-0` /mnt/disks/local-nvme-ssd ext4 discard,defaults,nofail 0 2 | sudo tee -a /etc/fstab
 EOT
 depends_on              = [google_compute_network.vpc_edp]
}

resource "google_compute_firewall" "rules" {
 project = var.project_name
 name    = "allow-ssh-${var.environment}"
 network = "vpc-${var.project_name}-${var.environment}"

 allow {
   protocol = "tcp"
   ports    = ["22", "6443"]
 }
 source_ranges = ["35.235.240.0/20"]
 depends_on    = [google_compute_network.vpc_edp]
}

resource "google_project_iam_member" "project" {
 project = var.project_name
 role    = "roles/iap.tunnelResourceAccessor"
 member  = var.service_account
}

resource "google_compute_router" "router" {
 project    = var.project_name
 name       = "nat-router-${var.environment}"
 network    = "vpc-${var.project_name}-${var.environment}"
 region     = var.region
 depends_on = [google_compute_network.vpc_edp]
}

resource "google_compute_router_nat" "nat" {
 name                               = "router-nat-${var.project_name}-${var.environment}"
 router                             = google_compute_router.router.name
 region                             = var.region
 nat_ip_allocate_option             = "AUTO_ONLY"
 source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"

 log_config {
   enable = true
   filter = "ERRORS_ONLY"
 }
}

resource "google_storage_bucket" "private_bucket" {
 name          = "${var.project_name}-${var.environment}"
 location      = var.region
 storage_class = "STANDARD"

 uniform_bucket_level_access = true
}

resource "google_storage_bucket_iam_binding" "bucket_writer" {
 bucket = google_storage_bucket.private_bucket.name

 role = "roles/storage.objectCreator"
 members = [var.service_account]
}

resource "google_storage_bucket_iam_binding" "bucket_admin" {
 bucket = google_storage_bucket.private_bucket.name

 role = "roles/storage.admin"
 members = [var.service_account]
}

With the environment outlined, let’s provision the infrastructure by validating, formatting, and applying the configuration.

Step 1: Infrastructure provisioning with Terraform

Once the necessary files are prepared, validate and format the configuration:

$ terraform validate
Success! The configuration is valid.
$ terraform fmt
main.tf
provider.tf

Before applying changes, inspect them with the plan command:

terraform plan
# Check if it's all good
terraform apply
# Enter a value: yes
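A slightly stricter variant saves the plan to a file and applies exactly that file, so what gets applied is what was reviewed. This is a sketch, not the workflow used in the guide; plan_then_apply and the default file name tfplan are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: save a plan, print it for review, and apply only that saved plan.
# plan_then_apply and the default file name are assumptions for illustration.
plan_then_apply() {
  local plan_file="${1:-tfplan}" answer
  terraform plan -out="$plan_file" || return 1
  # Print the saved plan so it can be reviewed before anything changes.
  terraform show "$plan_file" || return 1
  printf 'Apply %s? Type yes to continue: ' "$plan_file" >&2
  read -r answer
  if [ "$answer" = "yes" ]; then
    # Applying a saved plan file does not ask for confirmation again.
    terraform apply "$plan_file"
  else
    echo "Skipped apply."
  fi
}

# Example: plan_then_apply prod.tfplan
```

Because a saved plan is applied without a further prompt, the function asks for confirmation itself before calling terraform apply.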

After a few minutes, the environment will be ready. To list the created resources, run:

$ terraform state list
google_compute_firewall.rules
google_compute_instance.vm_edp
google_compute_network.vpc_edp
google_compute_router.router
google_compute_router_nat.nat
google_compute_subnetwork.subnet_edp
google_project_iam_member.project

Step 2: Setup Verification

Verify the resources by logging into the GCP Console. Confirm the creation of VPC and Subnet, Virtual Machine, Firewall Rule, IAP SSH Permission, Cloud Router, and NAT Gateway. Navigate to the following sections:

VPC and Subnet

Go to Navigation Menu > VPC Network > VPC Networks:

Click on the VPC and check the Subnets tab:

Virtual Machine

Navigate to Navigation Menu > Compute Engine > VM instances:

Firewall Rule

Go to Navigation Menu > VPC Network > Firewall:

IAM Role

Navigate to Navigation Menu > IAM & Admin > IAM, and View by roles:

Cloud Router

Go to Navigation Menu > Network Connectivity > Cloud Routers > Open Cloud Router:

Cloud NAT

Navigate to Navigation Menu > Network Services > Cloud NAT:

SSH Access to Virtual Machine using GCP Console

To test SSH connectivity to the Virtual Machine, go to Navigation Menu > Compute Engine > VM instances and click on the SSH option for the created VM. Approve the connection when prompted, and you should be logged in.

This verification ensures all components are correctly configured and accessible. The next steps in setting up the data platform should be setting up a self-hosted GitHub runner and then a Prefect worker.
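The same checks can be approximated from the command line instead of the Console. The sketch below simply lists the relevant resource types for a project; list_platform_resources is a hypothetical helper name, and the output still has to be compared by eye with the names defined in main.tf.

```shell
#!/usr/bin/env bash
# Sketch: list the networking and compute resources created by this guide.
# list_platform_resources is a hypothetical helper name.
list_platform_resources() {
  local project="$1"
  gcloud compute networks list --project "$project"
  gcloud compute networks subnets list --project "$project"
  gcloud compute instances list --project "$project"
  gcloud compute firewall-rules list --project "$project"
  gcloud compute routers list --project "$project"
}

# Example: list_platform_resources test-project
```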

SSH Access to Virtual Machine Using the gcloud CLI

You can access a Virtual Machine securely using only a service account key and the gcloud CLI. Follow these steps to set up and establish SSH access:

Ensure you have the service account JSON key stored locally.
Log in to your Google Cloud account using the following command:

gcloud auth activate-service-account \
  test-service-account@test-project.iam.gserviceaccount.com \
  --key-file ~/.config/gcloud.json
gcloud config set project test-project

Once authenticated, execute the following command to initiate the SSH connection:

gcloud compute ssh ${project_name}-${environment}-01

On the first execution, the gcloud CLI will prompt you to generate a private and public SSH key pair. Follow the instructions to create the key pair.

Once the keys are created, access to the virtual machine will be automatically established. Subsequent logins will reuse the existing key pair, simplifying future access.
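Because the VM has no public IP, the connection goes through IAP. If you want to make that explicit rather than rely on the CLI choosing it, the --tunnel-through-iap flag can be passed directly. The wrapper below is a sketch; ssh_via_iap is a hypothetical helper name, and the instance name in the example assumes the naming pattern from main.tf with the default variable values.

```shell
#!/usr/bin/env bash
# Sketch: open an SSH session (or run one command) on the VM through an IAP tunnel.
# ssh_via_iap is a hypothetical helper name.
ssh_via_iap() {
  local instance="$1" zone="$2"
  shift 2
  # Any remaining arguments are passed through to gcloud compute ssh.
  gcloud compute ssh "$instance" --zone "$zone" --tunnel-through-iap "$@"
}

# Interactive session:
# ssh_via_iap test-project-prod-01 us-central1-c
# Run a single command instead of opening a shell:
# ssh_via_iap test-project-prod-01 us-central1-c --command "lsblk"
```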

Conclusion

Setting up a data platform infrastructure on Google Cloud Platform using Terraform provides a solid foundation for organizations looking to leverage the power of their data. This approach offers several key benefits:

Scalability and Flexibility: The server-based approach with a single VM provides an excellent starting point that can easily be expanded as your data needs grow.
Security: By leveraging Google Cloud’s Identity-Aware Proxy (IAP), we’ve ensured that access to our resources is tightly controlled and secure.
Infrastructure as Code: Using Terraform allows for version-controlled, reproducible infrastructure deployments, making it easier to manage and update your environment over time.
Cost-Effectiveness: Starting with a single VM setup is often more budget-friendly for initial deployments or smaller-scale projects.
Simplified Management: With fewer components to manage initially, maintenance and troubleshooting become more straightforward.

By following the steps outlined in this guide, you’ve created a robust infrastructure that includes a VPC, subnet, Compute Engine instance, firewall rules, IAP SSH permissions, Cloud Router, Cloud NAT, and Cloud Storage. This setup provides a solid base for running a data platform, including components like a GitHub Runner and Prefect Worker. The process of setting up these additional components will be covered in the next article of this series, building upon the foundation we’ve established here.

Our dedicated repository contains all code examples and implementations discussed in this article, which can be accessed for reference and further exploration. We encourage you to review the repository for a comprehensive understanding of the concepts presented.

Remember, while this guide provides a strong starting point, it’s crucial to continually assess and adjust your infrastructure to meet your organization’s changing needs and to stay aligned with best practices in cloud computing and data management.

Organizing Networking for Data Platforms: Key Connectivity Options

2025-03-05T09:50:00Z

A poorly designed network can cripple even the most advanced data platform. Slow queries, failed data transfers, and security vulnerabilities often stem from overlooked networking decisions. Yet, networking remains one of the least understood aspects of data architecture.

The Extract, Load, and Transform (ELT) process has become the standard for data integration. It enables organizations to move raw data from source systems to destinations like data warehouses, where it can be analyzed using Business Intelligence (BI) tools. While many aspects of this process deserve attention, networking is a critical yet often underestimated component.

Building a data platform that supports ELT processes requires a clear understanding of how all components communicate. Whether implementing an on-premise solution with open-source tools, leveraging cloud providers, or utilizing SaaS or PaaS solutions, the common thread is the need for seamless connectivity between all elements.

In this article, we’ll explore the options for organizing networking in data platforms, covering key connectivity options, security considerations, and best practices. To lay the groundwork for our discussion, let’s first examine the optimal organization of a data platform.

Data Platform Architecture and Networking

The diagram below presents the reference architecture for the ELT process as a whole, outlining the key components and workflows involved. Each stage has its own phases, with Ingest being part of data extraction, Land being the process of loading data, and Prepare with Model being transformation.

To better understand how networking ties into a data platform, let’s examine a second diagram, which shifts focus to the networking aspects of the data platform architecture.

The diagram illustrates various components of a data platform, each requiring network configuration to ensure smooth and secure data movement. At the core of our setup is workflow orchestration, which manages the data integration process. Tools like Prefect, Airflow, or Azure Data Factory can handle this, running data flows across various stages.

Extract

The initial phase of the ELT (Extract, Load, Transform) process is data extraction. Every data platform needs to gather data from external systems, represented as “data sources” in the diagram. To access resources from a private environment, we need to use a gateway that allows us to reach external resources. This could be:

Internet Gateway - For accessing public resources.
NAT Gateway - Allows resources in private subnets to connect to services outside the private network.
VPN Gateway - Establishes a secure tunnel with private resources within a different network of our organization or a partner.
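Whichever gateway is used, a quick TCP reachability check from inside the private environment helps confirm that a data source can actually be reached. The sketch below relies on bash's /dev/tcp redirection and the coreutils timeout command; can_connect is a hypothetical helper name, and the host and port in the example are placeholders.

```shell
#!/usr/bin/env bash
# Sketch: test whether a TCP connection to host:port can be opened within a time limit.
# can_connect is a hypothetical helper; it checks reachability only, not authentication.
can_connect() {
  local host="$1" port="$2" seconds="${3:-3}"
  # bash opens a TCP socket when redirecting to /dev/tcp/HOST/PORT.
  # timeout ends the attempt if the connection hangs, e.g. behind a silent firewall.
  timeout "$seconds" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Example (placeholder host and port):
# if can_connect example.com 443; then echo "reachable"; else echo "unreachable"; fi
```

A success here only shows that Layer 3 and Layer 4 connectivity is in place; credentials and application-level access still need to be verified separately.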

Load

Once extracted, data needs to be loaded into a central repository—typically a Data Warehouse. This can be hosted within the same network as the workflow orchestration tool or exist as an external resource.

Same Network: Configuration is simpler as the same team is likely responsible for setting up both components along with networking.
External Resource: Requires additional networking considerations, but the same principles apply—ensuring secure, reliable connectivity.

Transform

The Transform phase follows a similar working pattern, as the workflow orchestration tool needs access to the Data Warehouse. The same resources need to communicate with each other, regardless of whether it’s the Load or Transform phase.

Data Consumption

The final stage is data consumption, where users and tools query the Data Warehouse. Given that sensitive information such as Client Identifying Data (CID) may be stored, secure connections are essential. BI tools, accessible by data platform consumers, need controlled access to the Data Warehouse. Such tools are often managed by a different team from those responsible for data gathering, loading, and transformation.

Application Layer Security

When discussing the networking aspects of data platforms, it’s essential to understand the context within the OSI (Open Systems Interconnection) model. This article primarily focuses on the Network Layer (Layer 3) and Transport Layer (Layer 4)—the backbone of data connectivity. These layers handle IP addressing, routing, and basic connection establishment, forming the foundation for gateways and other networking components.

However, security doesn’t stop at the network level. The Application Layer (Layer 7) plays a critical role in securing data and applications. While this article centers on network infrastructure, robust Layer 7 security is just as important. Common security measures include:

OAuth for secure authorization
Mutual TLS (mTLS) for encrypted, authenticated communication
Basic authentication for simple access control
API gateways for managing and securing API access
Web Application Firewalls (WAF) for protecting against application-level attacks
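To make one item from this list more concrete, here is a minimal sketch of the server side of mutual TLS using only Python's standard ssl module. The function name and the file-path parameters are placeholders invented for this example; a real deployment would pass its own certificate, key, and CA bundle.

```python
import ssl
from typing import Optional


def build_mtls_server_context(
    cert_file: Optional[str] = None,
    key_file: Optional[str] = None,
    client_ca_file: Optional[str] = None,
) -> ssl.SSLContext:
    """Return a server-side TLS context that requires client certificates."""
    # Purpose.CLIENT_AUTH creates a context intended for the server side.
    context = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    # CERT_REQUIRED makes the handshake fail unless the client presents a
    # certificate signed by a trusted CA. This is the "mutual" part of mTLS.
    context.verify_mode = ssl.CERT_REQUIRED
    if cert_file and key_file:
        # The server's own identity (placeholder paths supplied by the caller).
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if client_ca_file:
        # The CA bundle used to validate client certificates.
        context.load_verify_locations(cafile=client_ca_file)
    return context
```

The context can then be used to wrap a listening socket; without the certificate files it is only useful for inspecting the settings.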

Regardless of network configuration, Application Layer security should always be implemented. A key principle to remember is that the weaker the Layer 7 security measures, the stronger the network-level controls must be to compensate. This inverse relationship between application-level and network-level security is key to maintaining overall system integrity.

That’s why, while our focus remains on network infrastructure, a holistic approach to data platform security should consider all relevant OSI layers, especially when dealing with sensitive data and critical business intelligence tools.

With this foundation in place, let’s dive into the specific networking options available for securing your data platform.

Connectivity Options

While there are many possible approaches to networking configurations, we’ll focus on the most common scenarios applicable to the majority of data platform use cases:

Public access
Public access with Access Control List (ACL)
VPC peering (within a single project and across multiple projects)
Site-to-site VPN

We’ll explore each in detail, followed by a brief discussion of additional networking possibilities.

Public Access

Public Access is the least secure networking option, as it does not restrict access at the network level. Resources residing in a private network are configured to access the internet, while the target resource has no network security applied. This doesn’t necessarily mean the resource is available to anyone, as Application Layer security may still be in place. However, from a networking perspective, access is unrestricted.

This configuration exposes resources to potential attacks, as malicious actors can easily reach them and attempt to bypass application security. Whenever possible, such unrestricted access should be avoided.

That said, Public Access remains the best option for specific, low-risk data sources, such as:

Exchange rates for currencies
Stock market prices
Other publicly accessible data needed in ELT pipelines

To mitigate risks associated with Public Access, organizations can implement additional security measures within their private networks. For instance, they can configure a firewall to block access to all public resources except those explicitly whitelisted. This approach adheres to the principle of least privilege, ensuring only necessary connections are allowed.

By implementing such measures, organizations can balance the need for access to public data sources with maintaining a secure network environment.

Public Access with Access Control List (ACL)

When a system is publicly available, in addition to securing access through Application Layer security controls, we can implement networking mechanisms to expose the system only to a limited group of servers or users. An Access Control List, often referred to as a whitelist, is a security mechanism implemented on the publicly available target system. While the simplest scenario involves allowing access for specific IP addresses, ACLs offer more sophisticated options, including:

Source and destination IP addresses
Port numbers
Network protocols (e.g., TCP, UDP, ICMP)
Time ranges for when the ACL is active
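The rule attributes listed above can be sketched as a small default-deny check. This is an illustration only: the AclRule class, the is_allowed function, and the sample addresses (taken from documentation IP ranges) are assumptions made for this example, and real ACLs are configured on firewalls, routers, or the target service rather than in application code.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Iterable


@dataclass(frozen=True)
class AclRule:
    source_cidr: str  # allowed source range, e.g. a single host as /32
    port: int         # allowed destination port
    protocol: str     # "tcp", "udp", ...


def is_allowed(rules: Iterable[AclRule], source_ip: str, port: int, protocol: str) -> bool:
    """Default deny: traffic passes only if at least one rule matches."""
    address = ip_address(source_ip)
    for rule in rules:
        if (
            address in ip_network(rule.source_cidr)
            and port == rule.port
            and protocol.lower() == rule.protocol.lower()
        ):
            return True
    return False


# Hypothetical ACL for a publicly reachable warehouse endpoint.
WAREHOUSE_ACL = [
    AclRule(source_cidr="203.0.113.10/32", port=1433, protocol="tcp"),
    AclRule(source_cidr="198.51.100.0/24", port=443, protocol="tcp"),
]
```

A request from 203.0.113.10 on TCP port 1433 matches the first rule, while the neighbouring address 203.0.113.11 matches nothing and is rejected.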

For IP-based ACLs to be effective, the client system being allowed in must have a fixed IP address (or a stable, known range). If this cannot be guaranteed, more secure alternatives like Site-to-Site VPN should be considered.

ACLs can be implemented at multiple levels, including routers, firewalls, or other network devices, providing a layered approach to security. Additionally, ACLs can be used for both inbound and outbound traffic, allowing for fine-grained control over data flow in both directions.

However, while they enhance security, they should not be relied upon as the sole protection mechanism—they work best alongside other security measures, such as authentication, encryption, and regular security audits.

VPC Peering

For cloud environments, VPC (Virtual Private Cloud) Peering allows direct, private network connections between different cloud resources without exposing traffic to the public internet.

Since cloud providers use different naming conventions (AWS: accounts, Azure: subscriptions, GCP: projects), we’ll use the term “project” to refer to these organizational units.

VPC peering should be considered the default network configuration for resources within the same cloud provider, regardless of the specific implementation option.

The implementation process varies depending on whether the VPCs are located within the same project or separate ones. Therefore, we will discuss these scenarios separately to highlight their unique characteristics and requirements.

VPC Peering within a single project

Connecting two networks within the same project is a streamlined process. It requires no additional permissions and can be configured entirely from a single account. This peering effectively extends the network, making all resources in the peered network accessible from the first network.

As with other networking options, additional firewall rules or ACLs can be implemented to restrict access directionality or limit connectivity to specific services.
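One practical prerequisite worth checking before peering is that the address ranges of the two networks do not overlap, since the major cloud providers refuse to peer networks with overlapping ranges. A minimal pre-flight check, with a hypothetical function name and example ranges, could look like this:

```python
from ipaddress import ip_network
from itertools import combinations
from typing import Iterable, List, Tuple


def find_overlaps(cidrs: Iterable[str]) -> List[Tuple[str, str]]:
    """Return every pair of CIDR ranges that overlap each other."""
    networks = [ip_network(cidr) for cidr in cidrs]
    return [
        (str(first), str(second))
        for first, second in combinations(networks, 2)
        if first.overlaps(second)
    ]
```

An empty result means the ranges can coexist in one routed address space; any returned pair needs to be re-addressed before peering is attempted.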

VPC Peering across separate projects

When peering VPCs between different projects, additional security and administrative steps are required:

Cross-project peering permissions must be explicitly granted.
Approval is needed from administrators in both projects.
Firewall rules must be configured in each project to enable cross-project traffic.

Despite these additional requirements, VPC Peering remains the most secure and efficient method for connecting resources within the same cloud provider, offering greater control and reduced exposure compared to internet-based connections.

Site-to-site (S2S) VPN

Site-to-Site VPN is a secure networking solution that connects two or more separate networks, typically in different physical locations, enabling them to communicate as if they were directly connected. This technology creates an encrypted tunnel over the public internet, allowing organizations to securely link their geographically dispersed offices, data centers, or cloud resources.

Key aspects of Site-to-Site VPN:

VPN Gateways: Specialized devices or software applications are deployed at each network endpoint to act as tunnel terminators.
Encryption: Data is encrypted before entering the VPN tunnel and decrypted upon reaching its destination, ensuring confidentiality during transit.
Tunneling Protocols: Protocols like IPsec establish the secure tunnel and manage encryption/decryption processes.
Routing Configuration: Network administrators configure routing to direct traffic through the VPN tunnel instead of the public internet.
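The routing aspect can be illustrated with a toy route table. The sketch below assumes a hypothetical partner network of 10.20.0.0/16 reachable through the tunnel and picks the most specific matching route, which is how IP routing generally resolves competing entries. The route targets are plain labels, not real gateway identifiers.

```python
from ipaddress import ip_address, ip_network
from typing import List, Tuple

# Hypothetical route table: (destination range, next hop label).
ROUTES: List[Tuple[str, str]] = [
    ("0.0.0.0/0", "internet-gateway"),  # default route
    ("10.20.0.0/16", "vpn-tunnel"),     # partner network behind the VPN
]


def next_hop(destination: str, routes: List[Tuple[str, str]] = ROUTES) -> str:
    """Pick the next hop using longest-prefix match."""
    address = ip_address(destination)
    matching = [
        (ip_network(cidr), target)
        for cidr, target in routes
        if address in ip_network(cidr)
    ]
    if not matching:
        raise LookupError(f"no route to {destination}")
    # The route with the longest prefix is the most specific one.
    _, target = max(matching, key=lambda match: match[0].prefixlen)
    return target
```

With this table, traffic to 10.20.5.8 goes through the tunnel, while everything else falls back to the default route.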

A Site-to-Site VPN is one of the most secure ways to connect resources across different locations. While the tunnel relies on public internet infrastructure, all traffic is encrypted, ensuring that data cannot be decrypted without the secret key used to establish the connection. Because of this, securely sharing the secret key is crucial and should never be transmitted through unencrypted channels. With strong encryption and secure key management, Site-to-Site VPN provides an excellent solution for organizations requiring high levels of data protection and privacy across geographically dispersed networks.

Other Networking Possibilities

There are more advanced options available for securing network traffic and isolating it from the public internet. Two notable solutions worth mentioning are:

MPLS (Multiprotocol Label Switching)

MPLS is a packet forwarding technology that operates between Layer 2 and Layer 3 of the OSI model. It typically utilizes a dedicated network infrastructure, ensuring no public connection is involved. Implementing MPLS requires finding a vendor capable of leasing physical cables for exclusive use. While more expensive and complex to implement than previously mentioned options, MPLS offers enhanced security and guaranteed connection speeds.

Dedicated Link

Cloud providers offer solutions like Google Cloud’s Dedicated Interconnect or AWS Direct Connect, which are faster to implement than MPLS, as the cloud provider handles much of the infrastructure. These options are ideal for establishing physical, private connections between on-premises networks and cloud provider networks. However, they may be excessive for connecting to a single data source on a data platform.

While these options provide additional layers of security and performance, they should be carefully considered based on specific organizational needs and resources.

Conclusion

Selecting the right networking strategy for your data platform is critical: whether you choose public access, VPC peering, or a site-to-site VPN, the choice significantly impacts the platform’s security, performance, and flexibility. Each option comes with trade-offs that need to be considered:

Security should be a primary concern. Public access is the least secure option, while site-to-site VPN offers robust protection.
VPC peering provides an excellent balance of performance and security for resources within the same cloud provider.
Access Control Lists (ACLs) offer an additional layer of security for public access scenarios, allowing for fine-grained control.
Application layer security remains crucial regardless of the chosen networking option, complementing network-level protections.

When designing your data platform’s networking architecture, consider your specific use case, security needs, and scalability requirements. Remember that a comprehensive approach, combining appropriate networking strategies with robust application-level security measures, will provide the most effective protection for your valuable data assets.

As technology evolves, staying updated on best practices and emerging solutions will help ensure your platform remains secure and efficient in the long run.

What Is a Modular Data Platform?

2025-02-24T10:58:00Z

Modern data analytics can get complicated. With an abundance of tools, conflicting methodologies, and ever-evolving technologies, mistakes can be costly. However, at its core, data analytics remains grounded in a few fundamental principles. Understanding these fundamentals while leveraging modular and well-designed data platforms can significantly improve operational efficiency and decision-making.

History of Data Platforms: From OLAP Cubes to Hadoop to Lakehouses

The evolution of data platforms has been driven by two primary goals:

Doing Analytics Better: improving analytics work with more efficient storage and retrieval of business and machine data; moving insights generation closer to the domain experts by improving self-service tools and processes
Doing Better Analytics: increasing the value of analytics by having more and deeper insights; leveraging statistical modeling, machine learning, and AI to improve the quality of business decisions

And so, while SQL, which dates back to the early 1970s, remains at the core of analytics, there have been significant advances in technology.

OLAP cubes emerged in 1993, introducing multi-dimensional analysis. In the mid-2000s, Hadoop revolutionized big data processing, allowing distributed storage and computing. More recently, the Lakehouse paradigm has sought to unify the best aspects of data warehouses and data lakes, improving performance, governance, and flexibility.

Reference Architecture of a Modern and Modular Data Platform

A modern data platform consists of several key components, each playing a crucial role in the data lifecycle. These components enable efficient data movement, transformation, and consumption while ensuring modularity and scalability.

Data Sources

Data sources are the origin of information within an organization. These range from structured databases, APIs, and SaaS applications to unstructured sources such as logs, IoT streams, and social media feeds. Some sources offer modern APIs for easy integration, while others, particularly legacy systems, require extensive workarounds.

Ingestion

Ingestion refers to the process of transferring data from sources into the platform reliably. This is typically done via scheduled batch jobs, though some architectures incorporate real-time ingestion using event brokers like Kafka.

The ingestion landscape is fragmented, with numerous tools available, such as Azure Data Factory and Fivetran. However, no tool provides connectors for every possible source. Consequently, organizations often need to develop custom connectors, leading to maintenance challenges and dependencies on vendors.

Landing

Landing zones serve as the initial storage layer where raw, unprocessed data is deposited after ingestion. This stage ensures that data is captured in its original form, preserving fidelity and enabling downstream transformation.

Storage for this type of data (including lakehouse data) has been standardized around the AWS S3 object storage API. Consequently, most cloud providers now offer their own variations of object storage with APIs closely mirroring AWS S3.

Preparation

Since raw data can be messy and inconsistent, preparation is necessary to clean, standardize, and format it for further processing. This stage includes:

Data masking
Data anonymization
Structuring into standardized table formats such as Delta Lake or Apache Iceberg (both typically backed by Parquet files)

Note that both data masking and anonymization can also be applied during landing, on data in transit, to avoid storing sensitive information on the platform.
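As a rough illustration of the first two items, the sketch below masks an email address for display and replaces an identifier with a keyed hash. The function names are made up for this example, and it is worth stressing that a keyed hash is pseudonymization rather than full anonymization: anyone holding the key can recompute the mapping.

```python
import hashlib
import hmac


def mask_email(email: str) -> str:
    """Keep the first character and the domain, hide the rest."""
    local, _, domain = email.partition("@")
    if not domain:
        # Not a well-formed address: hide everything.
        return "***"
    return f"{local[:1]}***@{domain}"


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Replace a value with a keyed SHA-256 digest.

    The same input and key always give the same output, so joins on the
    pseudonymized column keep working.
    """
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
```

Because the output is deterministic for a given key, the key itself has to be treated as a secret and rotated with care.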

Data engineers typically handle this step using workflow tools like Alteryx, Azure Data Factory, or programming languages such as Python.

Modeling

Modeling transforms prepared data into well-structured datasets optimized for analytical use. Historically, this was the “T” in ETL (Extract, Transform, Load). Today, tools like dbt have popularized the concept of modular and scalable transformation workflows.

Consumption

Once modeled, data is consumed in various ways, including:

Traditional dashboards and Excel reports
Embedded analytics within applications
AI-powered data exploration (e.g., generative AI and natural language querying)

Data Cataloging

A data catalog is a comprehensive inventory of an organization’s data assets, documenting their structure, relationships, and usage. It extends beyond datasets to include analytical assets such as dashboards, reports, and Jupyter notebooks, ensuring a unified and well-organized view of available information.

Despite its critical role in data management, data cataloging is often overlooked or deprioritized in analytics projects. However, a well-maintained data catalog is fundamental to effective data governance and security. By systematically identifying all data assets, their ownership, and their respective domains, organizations can enhance discoverability, streamline compliance efforts, and facilitate data democratization.

Data Orchestration

Data orchestration refers to the automated coordination of ETL processes, from data ingestion and preparation to final modeling for consumption. It ensures that data flows seamlessly across different stages, reducing manual intervention and improving efficiency.

This industry is highly fragmented, with traditional IT approaches relying on UI-based tools such as Talend and Azure Data Factory. More modern methodologies, however, focus on code-driven orchestration using tools like Apache Airflow and Prefect. These newer solutions provide greater flexibility, scalability, and integration capabilities, making them preferred choices for organizations aiming to build robust and automated data pipelines.
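At its core, code-driven orchestration means declaring dependencies between steps and letting the tool work out a valid execution order. The sketch below shows that idea with Python's standard graphlib module. The step names are hypothetical, and this is not how Airflow or Prefect are implemented, only the underlying principle.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Each step maps to the set of steps that must finish before it can start.
PIPELINE: Dict[str, Set[str]] = {
    "ingest_crm": set(),
    "ingest_erp": set(),
    "prepare": {"ingest_crm", "ingest_erp"},
    "model": {"prepare"},
}


def execution_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return one valid order in which the steps can run."""
    return list(TopologicalSorter(dependencies).static_order())
```

Real orchestrators add scheduling, retries, and state tracking on top, but the dependency graph remains the central abstraction.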

Understanding Modularity in the Context of Data Platforms

Unlike ERP systems, which are often monolithic, data platforms are inherently modular. The diversity of data workflows and use cases makes it impractical to consolidate everything into a single tool.

Some offerings, such as Databricks and Microsoft Fabric, attempt to provide an all-in-one solution. However, even these platforms require integration with external components to cover all aspects of data management.

Components and Interfaces Before Tools

The success of a data platform hinges on well-defined interfaces between its components. A common pitfall is over-reliance on a single vendor, leading to inflexible architectures that struggle to adapt to evolving business needs. Organizations should prioritize:

Well-defined interfaces between tools
A single point for managing access (e.g., using Active Directory groups)
Loose coupling between components to enable flexibility
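In code, the first and last points often come down to depending on an interface rather than on a concrete tool. The following sketch uses a typing.Protocol for that purpose. The Destination interface, the in-memory implementation, and the load function are all invented for this example.

```python
from typing import Dict, Iterable, List, Protocol


class Destination(Protocol):
    """The only thing the loading code needs to know about a target system."""

    def write(self, table: str, rows: Iterable[dict]) -> int:
        ...


class InMemoryDestination:
    """Stand-in implementation; a warehouse client would expose the same method."""

    def __init__(self) -> None:
        self.tables: Dict[str, List[dict]] = {}

    def write(self, table: str, rows: Iterable[dict]) -> int:
        batch = list(rows)
        self.tables.setdefault(table, []).extend(batch)
        return len(batch)


def load(destination: Destination, table: str, rows: Iterable[dict]) -> int:
    # The loader depends only on the interface, so the destination can be
    # swapped without touching this function.
    return destination.write(table, rows)
```

Swapping the warehouse then means writing one new class with a write method, while the rest of the pipeline stays unchanged.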

Workflows and Developer Experience

A streamlined developer experience is crucial for maintaining data platform efficiency. Poorly designed workflows can introduce bottlenecks, reduce productivity, and increase technical debt. Best practices include:

Automating repetitive tasks (e.g., CI/CD for data pipelines)
Enforcing coding standards and documentation
Providing self-service capabilities for data consumers

Data Governance and Security

With numerous tools and evolving datasets, data governance and security must be proactive rather than reactive. Traditional IT governance models, which assume static datasets, are insufficient for modern data platforms. Without a structured approach to creating and managing new data assets, governance becomes impossible.

Effective data governance requires:

Clear company policies on data access, privacy, and security.
A solid understanding of analytics workflows to incorporate governance steps and audit reviews seamlessly.
Automated data cataloging to maintain visibility into data assets and their ownership.
A comprehensive inventory of all analytics tools, ensuring each one is correctly configured and continuously monitored for compliance.

While data governance is straightforward in principle, it requires a structured, realistic approach with well-defined steps to ensure its successful implementation and long-term effectiveness.

Conclusion

Modern data platforms are modular ecosystems that require careful design and governance to be effective. By understanding the historical evolution of data architectures, organizations can make informed decisions about structuring their platforms. Prioritizing interoperability, developer experience, and security ensures a scalable and efficient data operations strategy.

Organizations that embrace modularity and best practices in data management will not only improve operational efficiency but also gain a competitive advantage in an increasingly data-driven world.

Breaking Down Prefect Deployments To Improve The Data Ops Efficiency

2025-01-28T09:45:00Z

When building data platforms, it’s tempting to focus entirely on the technology stack—choosing shiny tools, debating between bulk loads or streaming, and designing storage and infrastructure to meet current needs. Yet, the rush to get data flowing often overshadows a crucial question: How will we monitor and operate all of this effectively?

In the early stages, data projects typically start small: an MVP, one or two data sources, and a couple of flow runs per day. At this scale, operations often feel secondary: issues can be solved on the spot, and data engineering teams are under pressure to deliver data to the end users. But as the platform scales, this oversight catches up. Within months, many teams find themselves struggling to manage DataOps, with operational gaps threatening their progress.

Observability and day-to-day functionality are the bedrock of robust, scalable, and maintainable data pipelines. Modern orchestration tools like Prefect excel at breaking down pipelines into smaller, more manageable pieces, making it easier to monitor, troubleshoot, and deploy smoothly. By designing pipelines with intention and visibility in mind, teams can ensure their data platform remains reliable—even as it evolves.

Why Observability Matters in ETL Processes

Observability is a cornerstone of modern data engineering and operations. As ETL pipelines become critical for decision-making, data teams need deep visibility into pipeline performance and meaningful, actionable logs. The stakes are high—when something goes wrong, time is lost (and as we all know, time is money, or at least that is what they say), and teams are left scrambling to identify issues. At best, this means tedious log analysis and guesswork; at worst—handling complaints from frustrated end-users.

To avoid these pitfalls, observability is a must. It not only ensures transparency with stakeholders but also equips teams to diagnose and address problems efficiently. Effective observability hinges on four dimensions:

Transparency: Understand what each step in the pipeline does, including inputs and outputs.
Traceability: Track data as it flows through the pipeline, making it possible to pinpoint where issues arise.
Granularity: Drill down to isolate performance bottlenecks, failed tasks, or long-running tasks.
Scalability: Expand monitoring and alerting systems to keep pace as the ETL process grows in complexity.

The Pitfalls of a Single Monolithic Flow

When starting an ELT project, it’s common to build one or two monolithic flows. These flows often contain dozens of tasks, and the number inevitably grows as the solution scales.

The code then usually looks more or less like this:

1. Task to fetch a list of tables from MS SQL

from typing import List

import pandas as pd
import pyodbc
from prefect import flow, task


@task
def get_table_names(conn_str: str) -> List[str]:
    """
    Connect to an MS SQL database and return a list of tables.
    """
    query = """
    SELECT TABLE_NAME
    FROM INFORMATION_SCHEMA.TABLES
    WHERE TABLE_TYPE = 'BASE TABLE'
      AND TABLE_CATALOG = DB_NAME()
    """
    with pyodbc.connect(conn_str) as conn:
        cursor = conn.cursor()
        cursor.execute(query)
        results = cursor.fetchall()

    table_names = [row[0] for row in results]
    return table_names

2. Task to extract data from a specific table into a DataFrame

@task
def extract_table_to_df(conn_str: str, table_name: str) -> pd.DataFrame:
    """
    Run SELECT * on the given table and return a Pandas DataFrame.
    """
    query = f"SELECT * FROM {table_name}"
    with pyodbc.connect(conn_str) as conn:
        df = pd.read_sql(query, conn)
    return df

3. Task to write a DataFrame to S3 as a Parquet file

@task
def write_parquet_to_s3(df: pd.DataFrame, bucket: str, table_name: str) -> str:
    """
    Write the given DataFrame as a Parquet file to the specified S3 bucket.
    """
    s3_path = f"s3://{bucket}/{table_name}.parquet"

    # get_secret_from_gcsm is assumed to be a helper defined elsewhere (not shown here).
    # Writing to an s3:// path relies on the s3fs package being installed.
    df.to_parquet(
        path=s3_path,
        engine="pyarrow",
        index=False,
        storage_options={
            "key": get_secret_from_gcsm("AWS_ACCESS_KEY_ID"),
            "secret": get_secret_from_gcsm("AWS_SECRET_ACCESS_KEY"),
        },
    )

    return s3_path

4. Main Flow orchestrating the above tasks

@flow
def ms_sql_to_s3_flow(
    conn_str: str,
    bucket: str,
):
    """
    A Prefect flow that loads all tables from MS SQL into S3 as Parquet files.
    """
    # Fetch all table names
    tables = get_table_names(conn_str)

    # For each table, extract and load
    for table_name in tables:
        df = extract_table_to_df(conn_str, table_name)
        write_parquet_to_s3(df, bucket, table_name)

At first, this approach might seem efficient. A single flow can ingest all objects from a database in one run—straightforward and convenient, right?

Initially, with just 10 objects in the database, it works well enough. But as the source database grows to 100 or more items, the cracks begin to show. Usually, this approach introduces several significant challenges:

Difficult Monitoring: A single failure marks the entire flow as failed, forcing data engineers to dig through logs to identify the problematic element.
Limited Reusability: It’s hard to run the deployment for a single table, or only for failed objects, without re-running the entire flow.
Reduced Scheduling Flexibility: A monolithic flow might require running all tasks together, even when only a subset of tasks needs frequent execution.
SLA Reporting: Measuring success rates becomes much harder. Reporting on flow run states is unreliable since a failure on one table out of 1,000 causes the whole flow to be marked as failed. Again, this requires digging into logs to measure performance accurately.
Execution Time: A monolithic flow written as a sequential loop, like the one above, processes tables one after another, so run time grows with every table added.

In essence, a monolithic approach limits observability, reduces performance, and complicates operations.
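The SLA reporting problem is easy to see with a bit of arithmetic. The sketch below takes the scenario mentioned above, one failing table out of 1,000, and compares the success rate reported at flow level with the one reported at table level. The variable and function names are made up for this illustration.

```python
from typing import Iterable


def success_rate(outcomes: Iterable[bool]) -> float:
    """Share of successful runs, where True means the run succeeded."""
    results = list(outcomes)
    return sum(results) / len(results)


# One nightly load: 999 tables succeed, 1 fails.
table_outcomes = [True] * 999 + [False]

# Monolithic flow: a single run that is only green if every table succeeded.
monolith_runs = [all(table_outcomes)]

# Granular flows: one run per table.
per_table_runs = table_outcomes

monolith_rate = success_rate(monolith_runs)    # 0.0
per_table_rate = success_rate(per_table_runs)  # 0.999
```

The same night looks like a total failure in one report and a 99.9% success in the other, which is why run-level states only become a trustworthy SLA metric once each run covers a single data object.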

The Case for Granulated, Focused Flows

When it comes to sizing your ELT flows, trust me—you’d rather fight 100 duck-sized horses than one horse-sized duck. In other words, breaking down monolithic flows into smaller, focused units is the key to scaling effectively.

The first step is modularizing the monolithic flow. Ideally, each deployment flow should represent a single data object. For example, if you’re ingesting data from an SQL database, think about organizing your process to allow for per-table scalability—it might require more time investment but will divide the complexity.

With the right tools, this approach is not as complex as it sounds. Prefect allows defining deployments with YAML, leveraging project-level default configurations stored under the definitions: key in the prefect.yaml file. There are two main ways of using them:

using the entire value as-is,
using part of the pre-defined values (e.g., overriding only a single parameter).

This way, you can stick with the pre-defined daily schedule as it is, which makes the deployment creation way easier than it initially seemed.
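In prefect.yaml this reuse is typically expressed with YAML anchors and aliases. To show the merge semantics without tying the example to the exact file schema, here is a plain-Python sketch of the same idea. The dictionary keys, the bucket name, and the cron expressions are assumptions made for illustration and do not mirror the prefect.yaml format.

```python
from typing import Dict, Optional

# Hypothetical project-level defaults, playing the role of the definitions section.
DAILY_SCHEDULE: Dict[str, str] = {"cron": "0 6 * * *", "timezone": "UTC"}
DEFAULT_PARAMETERS: Dict[str, Optional[str]] = {"bucket": "raw-landing", "table_name": None}


def build_deployment(table_name: str, schedule_overrides: Optional[Dict[str, str]] = None) -> dict:
    """Create one per-table deployment spec from the shared defaults."""
    # Overrides replace only the keys they mention; the rest is inherited.
    schedule = {**DAILY_SCHEDULE, **(schedule_overrides or {})}
    parameters = {**DEFAULT_PARAMETERS, "table_name": table_name}
    return {
        "name": f"ingest-{table_name}",
        "schedule": schedule,
        "parameters": parameters,
    }


deployments = [
    # Option 1: use the pre-defined schedule as-is.
    build_deployment("customers"),
    # Option 2: override a single value and inherit everything else.
    build_deployment("orders", {"cron": "0 */8 * * *"}),
]
```

Adding a new table then becomes a one-line change, which keeps the per-table approach manageable even with hundreds of deployments.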

Here’s why granular flow deployments are worth the effort:

Parallelism: Each table flow can run independently, in parallel with others. If one table experiences performance degradation, it does not immediately affect the rest. And yes, parallelism can be built into the monolithic flow too, but why spend time reinventing the wheel? The orchestrator can take care of that.
Monitoring and Error Handling: If a single table fails, its flow run alone fails. This allows one to quickly identify the failed table, debug it, and restart only that deployment. Also, it helps with monitoring the execution time of a particular table or with tracking data quality issues.
Improved Data Quality Testing: It’s much easier to enable data quality tests per data object instead of having universal rules. Which is better: customized tests per column in the dataset, or only a check that the set is not null?
Incremental Maintenance and Scalability: Modular flows create clear boundaries. Adding or updating flows for new or modified tables doesn’t necessarily affect existing deployments. Each table’s logic is easier to maintain and evolve in isolation.
Version Control: Each deployment can be versioned independently. This makes testing changes for one table more straightforward and also makes the CI/CD implementation easier.
Team Collaboration: Different engineers can own specific deployments, making it easier to distribute responsibility and keep changes localized. It’s good to use tags to identify project-related deployments—e.g., it’s possible to have a sales tag in Prefect for sales data-related processes.
Granular Scheduling: Some tables need to be refreshed three times a day, while others only need a monthly reload. The granular approach gives more room to tune the schedule per table.
SLA Reporting: It’s simpler, as the real situation is shown on the run level, and failure means real failure.
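The parallelism and error-handling points can be illustrated without any orchestrator at all. The sketch below uses a thread pool from the standard library as a stand-in for what the orchestrator does across deployments. The ingest_table function, the table names, and the simulated failure are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable


def ingest_table(table_name: str) -> str:
    """Placeholder for a per-table flow; one table is made to fail on purpose."""
    if table_name == "broken_table":
        raise RuntimeError(f"could not read {table_name}")
    return f"{table_name}: ok"


def run_all(table_names: Iterable[str], max_workers: int = 4) -> Dict[str, str]:
    """Run every table independently and record a state per table."""
    states: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(ingest_table, name) for name in table_names}
        for name, future in futures.items():
            # exception() waits for the run and returns None on success,
            # so one failure never hides the outcome of the other tables.
            states[name] = "failed" if future.exception() else "completed"
    return states


states = run_all(["customers", "broken_table", "orders"])
to_rerun = [name for name, state in states.items() if state == "failed"]
```

Only broken_table ends up in the rerun list, while the other two tables are reported as completed. With per-table deployments the orchestrator gives you this view out of the box.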

Conclusion

In conclusion, a granular approach to orchestrated deployments is more than just a technical choice; it’s a strategic advantage. By breaking large, monolithic pipelines into focused, modular flows, data teams gain clearer observability, easier troubleshooting, and the flexibility to handle diverse scheduling needs.

Focusing on key concerns—performance, reliability, and maintainability—can help you build a better data solution using a granular approach. Over time, this approach will lead to more predictable, scalable, and maintainable ETL processes.

dlt and Prefect, a Great Combo for Streamlined Data Ingestion Pipelines

2025-01-27T12:54:00Z

Doing data ingestion right is hard…

Despite advances in data engineering, data ingestion—the Extract and Load (EL) steps of the ELT process—remains a persistent challenge for many data teams.

This complexity is often due to the real-world limitations of open-source tools, leading teams to opt for UI-based solutions. While these tools are great for getting started quickly, they often lack the flexibility and scalability required for production-grade data platforms.
In the era of AI, UI-based tools face one more limitation: they miss out on most of the benefits of the advanced code generation capacity of modern LLMs (Large Language Models)[1].

Even if teams do decide to use open-source solutions, they often end up creating volumes of low-quality glue code. This in-house software, typically written in a rush by non-professional engineers, often fails to meet essential requirements for modern data platforms, such as EaC (Everything as Code), security, monitoring & alerting, reliability, or extensibility. Moreover, since it’s written by non-professional engineers, such code is far more brittle and much harder to maintain and modify. Consequently, all modifications to the code (such as adding new features or fixing bugs) take much more time and are far riskier than they should be.

…but there is light at the end of the tunnel

Luckily, in recent years, with the growing adoption of software engineering practices, we’ve seen a professionalization of the data engineering field. This has resulted in the creation of a number of high-quality, open-source tools that simplify and improve the quality of data engineering work, such as dlt and Prefect.

In this article, we explore how dlt and Prefect can be seamlessly integrated to implement a best-practice data ingestion component of a modern data platform. Our insights are grounded in real-world experience designing and implementing scalable, code-based data platforms with these open-source tools.

A short introduction to dlt and Prefect

dlt

dlt is a Python data ingestion framework enabling data engineers to define connectors and pipelines as code. It offers a rich set of features for building best-practice pipelines and supports both built-in and custom connectors built with regular Python code.

dlt ingests data in three stages: extract, normalize, and load. The extract stage downloads source data to disk. The normalize stage applies light transformations to the data, such as column renaming or datetime parsing. The load stage loads the data into the destination system.

Here’s a compact guide to key dlt concepts:

dlt config: dlt can be configured in three ways: through files (config.toml and secrets.toml), environment variables, and Python code.
Using config.toml for default settings is recommended, as it’s easy to store the file together with pipeline code on git. While it can contain some pipeline-level settings as well, its main purpose is to configure global behavior such as logging, parallelization, execution settings, and source or destination configuration common to all pipelines.
Resource and Source:
A resource is a representation of a single item in a dataset. It can be a file, a database table, a REST API endpoint, etc.
A source is a collection of resources, such as a filesystem (e.g., S3), a database, or a REST API.
By applying hints to the resource with resource.apply_hints(), we can configure extraction settings specific to the resource, a pipeline, or a pipeline run, such as primary key, cursor column, column typing, partitioning, etc. We can also apply some light transformations to the data (eg. data masking) before it’s loaded to the destination with the resource.add_map() method.
dlt is flexible when it comes to working with sources and resources, and it’s easy to use either, depending on the need.
Pipeline: In dlt, a pipeline describes the flow of data from a source (or resource) to a destination. Each pipeline handles a single source<->destination pair and takes a source or resource as input.
Pipelines can be reused to ingest different resources each run. For example, we can have one “Postgres to S3” pipeline, but ingest each Postgres table separately due to different scheduling or configuration needs.
A pipeline definition contains pipeline- or pipeline run-specific destination configuration, as well as settings for the load phase of the ingestion. Under the hood, a pipeline run (pipeline.run()) executes each pipeline step: extract (pipeline.extract()), normalize (pipeline.normalize()), and load (pipeline.load()).
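
To make these concepts concrete, here is a minimal sketch that combines a resource, hints, a map function, and a pipeline. The `users` resource, its sample rows, the `drop_internal_fields` helper, and the DuckDB destination are illustrative assumptions rather than code from a real project. dlt is imported inside `run_users_pipeline()` so that the plain-Python helpers can be read and tested without it.

```python
from typing import Iterator


def user_rows() -> Iterator[dict]:
    """Hypothetical source data; a real resource would read an API, table, or file."""
    yield {"id": 1, "email": "ada@example.com", "_debug": "x"}
    yield {"id": 2, "email": "bob@example.com", "_debug": "y"}


def drop_internal_fields(row: dict) -> dict:
    """Light per-row transformation: drop keys that start with an underscore."""
    return {key: value for key, value in row.items() if not key.startswith("_")}


def run_users_pipeline() -> None:
    """Run extract, normalize, and load in one call (requires dlt and DuckDB)."""
    import dlt  # imported here so the helpers above stay usable without dlt

    @dlt.resource(name="users")
    def users():
        yield from user_rows()

    resource = users()
    # Resource-level extraction settings: primary key and write disposition.
    resource.apply_hints(primary_key="id", write_disposition="merge")
    # Light transformation applied before the data reaches the destination.
    resource.add_map(drop_internal_fields)

    # One pipeline handles one source/destination pair.
    pipeline = dlt.pipeline(
        pipeline_name="users_to_duckdb",
        destination="duckdb",
        dataset_name="raw",
    )
    load_info = pipeline.run(resource)  # extract -> normalize -> load
    print(load_info)
```

Calling `run_users_pipeline()` twice should leave one row per `id` in the destination, because the "merge" disposition updates existing keys instead of appending duplicates.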

Prefect

Prefect is a Python data orchestration library that allows data and machine learning engineers to define data workflows (data ingestion, transformation, model training, etc.) as code. It provides a rich set of features to help engineers implement best-practice data orchestration workflows.

Its cloud offering eliminates the historically stressful and labor-intensive maintenance of data orchestration systems.

Let’s unpack the core concepts of Prefect:

Task: A task is a single unit of work in a Prefect flow. It describes a single step to be executed in the workflow.
While it’s possible to implement the logic of the step directly in the task, in most cases, we recommend keeping tasks as thin wrappers around regular Python functions.
Flow: A flow is a collection of tasks that define a data workflow. You can think of it as a graph of tasks, describing their relationship (eg. this task should always run after this one, and this other task should run after that one, but only if it fails).
Similar to a dlt pipeline, the same flow can be reused with different sets of parameters. An instance of a flow with specific parameter values is called a deployment.
In this article, we take advantage of this by using a single extract_and_load() flow capable of executing any dlt pipeline, depending on the parameters passed to it. As a result, each ingestion becomes a new Prefect deployment rather than a new flow. This has a major consequence: deployments can be defined in YAML, so they require no Python code, and users don’t need to set up a local Python development environment just to, e.g., ingest a new table with an existing pipeline. Instead, we can, for example, expose a simple application that allows non-technical users to create new deployments with a few clicks.
Deployment: A deployment is a way to run a flow with a specific set of parameters and environment configuration. While most environment configurations in Prefect would typically be defined at the workspace level, deployments allow for overriding some of these settings, including on a per-run basis, which simplifies testing and debugging.

Creating data connectors and pipelines with dlt

Now that we’ve covered the theoretical underpinnings of dlt and Prefect, it’s time to see these concepts in action. We’ll explore how to implement best-practice dlt pipelines and bring these tools to life.

Data pipeline features

Alright, before we dive into the technical part, let’s start with the basics. A production-grade data pipeline needs to have several key features:

Modularity: The pipeline should be designed to allow the reuse of components across multiple pipelines.
Extensibility: The pipeline must be upgradeable without disrupting ongoing production jobs.
Reliability: The ability to inspect pipeline execution and quickly identify and resolve issues is crucial.
Security: Proper mechanisms must be in place to securely store and access secrets.
Privacy: Data storage should adhere to privacy regulations, ensuring compliance.
Efficiency: Pipelines must be optimized for cost-effective execution.

Data pipelines aren’t one-size-fits-all, but a production-grade pipeline needs all of these key features. So how do we get there?

Modularity

To achieve modularity, it’s best to split the dlt pipeline code into the following structure:

├── pipelines
│   ├── a_to_c.py
│   ├── b_to_c.py
│   └── utils.py

In this structure, a_to_c.py and b_to_c.py represent two example pipelines, each handling data from a source system (a and b) to a destination system (c).

The utils.py file contains common utilities such as data masking implementation, default configuration for source and destination systems, or default pipeline configuration (except configuration specified in dlt’s config.toml; for more information, see the dlt config paragraph in the dlt section).

Extensibility

Implementing extensibility goes beyond modularity. The code should also be testable, and ideally, automated testing should be integrated into the CI/CD process.

Since dlt pipelines are implemented using Python, they can be tested with common tools like pytest. Unit tests should focus on custom utility functions, while integration tests verify the entire pipeline’s behavior.

For integration testing, use a local database or disk drive instead of the target database. DuckDB is a great choice for this purpose, as it’s a lightweight, in-process database that can be used to inspect the loaded data quickly.
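
As an illustration of the unit-test side, the sketch below tests a small configuration helper of the kind that could live in utils.py. The `DEFAULT_PIPELINE_CONFIG` values and the `with_defaults()` helper are hypothetical. The test functions follow pytest's `test_*` naming convention but use only plain `assert` statements.

```python
from copy import deepcopy

# Hypothetical defaults that would live in pipelines/utils.py.
DEFAULT_PIPELINE_CONFIG = {
    "dataset_name": "raw",
    "loader": {"file_format": "parquet", "workers": 4},
}


def with_defaults(overrides: dict, defaults: dict = DEFAULT_PIPELINE_CONFIG) -> dict:
    """Return defaults recursively updated with overrides, mutating neither."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = with_defaults(value, merged[key])
        else:
            merged[key] = deepcopy(value)
    return merged


def test_overrides_win_and_nested_defaults_survive():
    config = with_defaults({"loader": {"workers": 8}})
    assert config["loader"] == {"file_format": "parquet", "workers": 8}
    assert config["dataset_name"] == "raw"


def test_defaults_are_not_mutated():
    with_defaults({"dataset_name": "staging"})
    assert DEFAULT_PIPELINE_CONFIG["dataset_name"] == "raw"
```

An integration test would follow the same pattern, but run the whole pipeline against a local DuckDB destination and assert on the loaded rows.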

Reliability

To maintain trust with data platform users, make sure that when production pipelines fail, you are informed immediately and can recover quickly. While we recommend implementing alerting in the orchestration layer, pipeline recoverability depends on having access to detailed logs.

Luckily, dlt provides rich built-in logging and error-handling mechanisms. It’s a good idea to also enable progress monitoring for additional useful information, such as CPU and memory usage.

Security

dlt supports various ways of storing credentials. For local use, secrets can be stored in a .dlt/secrets.toml file, while production environments may benefit from an external credential store, such as Google Cloud Secret Manager. To accomplish this, you can store the secret retrieval utility function in utils.py and reuse it within your pipelines.

However, since we’re using Prefect for orchestration, it’s also possible to follow a different path and use Prefect Secrets to store the credentials.

Privacy

Data anonymization and/or pseudonymization are crucial to ensure compliance with privacy regulations. Data can be erased/anonymized either:

During the ingestion phase (in which case the original data never enters the destination system)

During the transformation phase (in which case private data is stored in one or more layers in the destination system but hidden from the eyes of end users)

While dlt doesn’t provide built-in anonymization features, it provides the necessary tools to implement the first option effectively.

For more information, see the example in the official documentation.
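
As a rough sketch of the first option, the function below pseudonymizes selected columns with a keyed hash before the data is loaded. The column name, the salt handling, and the `make_pseudonymizer()` helper are assumptions for illustration; in practice the salt would come from a secret store.

```python
import hashlib
import hmac


def make_pseudonymizer(columns: list[str], salt: bytes):
    """Build a per-row function that replaces the given columns with keyed hashes."""

    def pseudonymize(row: dict) -> dict:
        masked = dict(row)  # copy so the input row is left untouched
        for column in columns:
            value = masked.get(column)
            if value is not None:
                digest = hmac.new(salt, str(value).encode("utf-8"), hashlib.sha256)
                masked[column] = digest.hexdigest()
        return masked

    return pseudonymize


# Hypothetical usage: mask the "email" column of every row.
mask_pii = make_pseudonymizer(["email"], salt=b"replace-with-a-secret-salt")
print(mask_pii({"id": 1, "email": "ada@example.com"}))
```

Because the function takes one row and returns one row, it can be attached to a dlt resource with the resource.add_map() method described earlier. The same input always maps to the same hash, so the masked column remains usable as a join key.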

Efficiency

To ensure pipelines are both cost-effective and high-performing, several optimization techniques can be applied:

Incremental extraction
Incremental extraction downloads only new or modified data, which reduces the amount of data that needs to be extracted. Currently, dlt supports incremental extraction for its core sources: REST API, SQL database, and filesystem.
Write dispositions
Write dispositions work in tandem with the extraction method (full or incremental) to reduce the amount of data that needs to be loaded. For example, if you only extracted new and modified data, you don’t want to overwrite existing data, as that would result in data loss. In such a case, insert the new records and update the existing ones instead.
Parallelization
dlt allows parallelizing each stage of the pipeline utilizing multithreading and multiprocessing (depending on the stage).
In cases where further parallelization is needed (i.e., the workload exceeds the capacity of a single machine), utilizing orchestrator-layer parallelization may be required. However, this scenario is now rare, as large virtual machines capable of processing petabytes of data are widely available, and dlt can leverage the machine’s resources more efficiently than older tools or typical in-house Python code.
Various other optimizations
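
The interplay between incremental extraction and write dispositions can be illustrated with a small library-free simulation. This is not how dlt implements them internally; the tables are plain Python lists, and all function names are made up for the example.

```python
def extract_incremental(rows: list[dict], cursor_column: str, last_value) -> list[dict]:
    """Return only rows whose cursor value is newer than the last one seen."""
    return [row for row in rows if row[cursor_column] > last_value]


def load_replace(table: list[dict], new_rows: list[dict]) -> list[dict]:
    """'replace': discard the existing table and keep only the incoming rows."""
    return list(new_rows)


def load_append(table: list[dict], new_rows: list[dict]) -> list[dict]:
    """'append': add every incoming row, even if its key already exists."""
    return table + new_rows


def load_merge(table: list[dict], new_rows: list[dict], primary_key: str) -> list[dict]:
    """'merge': update rows with a matching key and insert the rest."""
    by_key = {row[primary_key]: row for row in table}
    for row in new_rows:
        by_key[row[primary_key]] = row
    return list(by_key.values())


destination = [
    {"id": 1, "status": "paid", "updated_at": 1},
    {"id": 2, "status": "new", "updated_at": 1},
]
source = [
    {"id": 1, "status": "paid", "updated_at": 1},  # unchanged since the last run
    {"id": 2, "status": "paid", "updated_at": 5},  # modified since the last run
    {"id": 3, "status": "new", "updated_at": 6},   # created since the last run
]

changed = extract_incremental(source, "updated_at", last_value=1)
print(load_replace(destination, changed))      # id 1 is lost
print(load_append(destination, changed))       # id 2 appears twice
print(load_merge(destination, changed, "id"))  # id 2 updated, id 3 inserted
```

Only the merge result matches the source table, which is why incremental extraction of modified records is normally paired with the "merge" disposition.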

As the topic of incremental loading can be complex even for seasoned data engineers, we’ve prepared a diagram of all the viable ELT patterns:

NOTE: dlt also provides subtypes of the “merge” disposition, including SCD type 2; however, for clarity, we did not include these in the diagram. For more information on these subtypes, see the relevant documentation.

The choice of a specific implementation depends on what is supported by the source and destination systems as well as on how the source data is generated. Ideally, incremental extract should be used whenever possible. Then, whether you choose the “append” or “merge” write disposition depends on how the data is generated: if you can guarantee that only new records are produced and no existing data is ever modified, you can safely use the “append” disposition. Next, you need to check if the destination system handles the disposition you intend to use (eg. some systems don’t support the “merge” disposition).

The following diagram from dlt’s official documentation also provides a good overview of when to choose which write disposition:

Orchestrating data pipelines with Prefect

Orchestrating data pipelines with Prefect can streamline your workflow and significantly improve efficiency. Let’s dive into the best practices for implementing Prefect flows and how they integrate smoothly with your data pipelines.

Orchestration job features

Ideally, the orchestration layer is a thin wrapper over the underlying data pipeline logic. Whenever a feature can be implemented at the pipeline level, it should be implemented there in order to prevent excessive coupling with the orchestration layer and minimize complexity, which simplifies self-service data ingestion.

Here are a few key features that are best handled at the orchestration layer:

alerting
additional reliability measures
security (specifically, secret management)
distributed processing

Alerting

With Prefect, you can set up alerts, ensuring you’re notified via Slack, Teams, or email whenever jobs or infrastructure components enter an unexpected state.

Reliability

While we can (and, where possible, should) implement retries and timeouts at the pipeline level, Prefect provides these features at the task and flow level. Think of this as a last-resort, catch-all mechanism that allows data engineers to ensure timeouts and retries are enforced regardless of how well the dlt pipeline or helper code is written, again lowering the bar for self-service data ingestion.

Secret management

Security is always a top concern, and Prefect’s secret management integrations make it easier than ever to store and handle secrets. Whether it’s Google Cloud Secret Manager or AWS Secrets Manager, Prefect allows you to securely retrieve credentials and pass them to the dlt pipeline. This approach ensures that no credentials are stored locally, and administrators have fine-grained control over access by utilizing Prefect’s Role-Based Access Control (RBAC).

Distributed processing

While any code-based orchestration tool allows for distributed processing, this feature is rarely required at the pipeline level in recent times. Firstly, data ingestion tools such as dlt are capable of efficiently utilizing machine resources, including parallelization and efficient and safe use of memory. Secondly, virtual machines have grown bigger—we can now easily rent VMs with hundreds of cores and hundreds of gigabytes of RAM. Therefore, typically, distributed processing is only required in case we need to run multiple resource-hungry pipelines in parallel.

Production workflow

Now that we’ve outlined the essential features of a production-grade dlt pipeline and Prefect flow, let’s break down the steps of creating and orchestrating data ingestion pipelines in production.

Overview

The diagram below illustrates the key steps in this production workflow.

Create a dlt pipeline: We start by creating a dlt pipeline (if the one we need doesn’t exist yet). Once the pipeline is finished and tests pass, we can move on to the next step.
Create Prefect deployment: We create a Prefect deployment for the pipeline. Notice we utilize Prefect’s prefect.yaml file together with a single extract_and_load() flow capable of executing any dlt pipeline to drastically simplify this process.
Create a Pull Request: We create a pull request with the new deployment. This triggers the CI/CD process.
DEV environment: The deployment is created in the DEV Prefect workspace, and a DEV Docker image is built. We can now manually run the deployment in Prefect UI, which will execute our pipeline in the DEV environment.
PROD environment: Once we’re happy with the results, we merge the pull request. This triggers a CI/CD job, which creates the deployment in the PROD Prefect workspace and builds a PROD Docker image. The deployment schedule is also only enabled at this stage.

If the pipeline already exists and only a new table is being ingested, the user only needs to add a few lines of YAML to prefect.yaml and create a PR.
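
For illustration, such an addition to prefect.yaml might look roughly like the fragment below. The deployment name, entrypoint path, parameter names, and work pool name are all hypothetical, and the exact set of supported keys depends on the Prefect version in use.

```yaml
deployments:
  - name: postgres-to-s3-orders
    # Hypothetical path to the shared flow: <file>:<flow function>
    entrypoint: flows/extract_and_load.py:extract_and_load
    parameters:
      pipeline_name: postgres_to_s3
      pipeline_args:
        table: orders
    work_pool:
      name: default-pool
```

Ingesting another table with the same pipeline would then only require a second entry with a different name and a different `table` value.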

Configuring dlt

While dlt is highly configurable and allows for a lot of customization and optimization, we recommend starting with three highly useful configurations:

runtime.log_level to enable more logging
normalize.parquet_normalizer.add_dlt_load_id to add a dlt load ID to the loaded data
normalize.parquet_normalizer.add_dlt_id to add a unique id to each row.

The ID settings will make our data easier to work with for downstream users, as well as make our loads (especially incremental ones) easier to debug.
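
Expressed in config.toml, these three settings could look like the fragment below. The dotted names above map to TOML sections and keys; the "INFO" level is only an example value.

```toml
# .dlt/config.toml
[runtime]
log_level = "INFO"

[normalize.parquet_normalizer]
add_dlt_load_id = true
add_dlt_id = true
```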

Creating a dlt pipeline

Pipeline design

We start by creating a dlt pipeline, following the best practices detailed in the Creating data connectors and pipelines with dlt section above.

For testability and modularity, we recommend splitting the pipeline into two parts: a resource (source data) and a pipeline (journey and destination). This way, you can easily test each part separately.

Inspecting the data manually

At any stage of pipeline development, you can manually inspect the loaded data, e.g., by printing it to the console or by checking the database directly.

Testing the pipeline

For integration testing, you can use DuckDB as a destination system. It’s lightweight and allows you to quickly check ingested data, so you can iterate faster.

Creating a Prefect flow and deployment

Flow design

After the dlt pipeline is working, it’s time to wrap it in a Prefect task and flow. Keep the orchestration layer simple—use a single extract_and_load() flow for all data ingestion tasks. With Prefect deployments handling the pipeline name and arguments, you can set everything up with just a few lines of YAML.
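
The dispatch logic behind such a flow can be sketched in plain Python as follows. The registry, the `register()` decorator, and the `postgres_to_s3` function are hypothetical names used for illustration, and the pipeline body is a stand-in for real dlt code.

```python
from typing import Callable, Optional

# Registry of available pipelines; names and functions are hypothetical.
PIPELINES: dict[str, Callable[..., dict]] = {}


def register(name: str):
    """Decorator that makes a pipeline function selectable by name."""

    def decorator(func: Callable[..., dict]) -> Callable[..., dict]:
        PIPELINES[name] = func
        return func

    return decorator


@register("postgres_to_s3")
def postgres_to_s3(table: str) -> dict:
    # A real implementation would build and run the dlt pipeline here.
    return {"pipeline": "postgres_to_s3", "table": table}


def extract_and_load(pipeline_name: str, pipeline_args: Optional[dict] = None) -> dict:
    """Run whichever registered pipeline the deployment parameters ask for."""
    if pipeline_name not in PIPELINES:
        raise ValueError(f"Unknown pipeline: {pipeline_name!r}")
    return PIPELINES[pipeline_name](**(pipeline_args or {}))


print(extract_and_load("postgres_to_s3", {"table": "orders"}))
```

In the actual project, `extract_and_load()` would be decorated as a Prefect flow and the call into the pipeline wrapped in a Prefect task, which is also where catch-all retries and timeouts can be configured. Each deployment then only supplies `pipeline_name` and `pipeline_args` as parameters.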

Handling pipeline secrets

Secrets should be passed through a special dictionary parameter, such as secrets. These secrets should then be extracted from Prefect blocks and forwarded to the dlt pipeline, ensuring they are securely handled.
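
One possible shape for this, assuming the `secrets` parameter maps each credential key expected by the pipeline to the name of a Prefect Secret block, is sketched below. The `resolve_secrets()` helper and the block names are hypothetical. The loader is injectable so the mapping logic can be tested without a Prefect workspace.

```python
from typing import Callable


def load_prefect_secret(block_name: str) -> str:
    """Read a value from a Prefect Secret block (requires Prefect)."""
    from prefect.blocks.system import Secret  # imported lazily for testability

    return Secret.load(block_name).get()


def resolve_secrets(
    secrets: dict[str, str],
    loader: Callable[[str], str] = load_prefect_secret,
) -> dict[str, str]:
    """Map credential keys to secret values.

    `secrets` maps the credential key the pipeline expects to the name of the
    block that stores it, so only block names appear in deployment parameters.
    """
    return {key: loader(block_name) for key, block_name in secrets.items()}


# Example with a stand-in loader, so the mapping can be checked without Prefect.
fake_blocks = {"postgres-password": "s3cr3t"}
print(resolve_secrets({"password": "postgres-password"}, loader=fake_blocks.__getitem__))
```

With this layout, the resolved dictionary can be forwarded to the dlt pipeline as credentials, while the secret values themselves never appear in prefect.yaml or in the flow parameters.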

Deploying to production

A pull request with the new deployment should automatically trigger the CI/CD pipeline in our project repository. We will soon dive deeper into how to implement this process using GitHub Actions in a separate article, so stay tuned!

Summary

Building a modern, scalable data platform starts with mastering data ingestion, which requires tools that are as powerful as they are flexible. By combining dlt for efficient, open-source data pipelines with Prefect for orchestration, you can create workflows that are not only production-ready but also streamlined for both developers and data teams.

This approach ensures flexibility, scalability, and cost-effectiveness, making it ideal for modern data platforms while also strategically positioning your platform to excel in the upcoming AI age.

Next steps

Data transformation

dlt and Prefect (with the help of dbt) are just as good at data transformation as they are at data ingestion. Stay tuned as we explore how to integrate these tools for data transformation in a future article!

Ready to dive deeper?

If you’re ready to build a cutting-edge data platform with dlt and Prefect, get in touch. We offer expert guidance to help you set up every component and provide a fully equipped template Git repository with production-grade code. No fluff—just practical, scalable solutions designed to handle real-world challenges and set your data workflows up for long-term success.

Footnotes

[1] While more and more UI-based tools add copilot capabilities, they face several fundamental limitations:

Copilots, while text-based, are limited by the UI tools they are built upon.

Imagine instructing someone to build a complex LEGO castle with only a basic set of blocks. No matter how clearly you explain, the result will always be limited, forcing you to find workarounds.

These UI tools often use a custom language to define data pipelines, which adds another layer of complexity.

Because the quality of an LLM depends heavily on the size and quality of the dataset it learns from, these assistants cannot reach the same level of fluency as LLMs trained on much more popular languages, such as Python.

Imagine the person you’re instructing to build your LEGO castle has very little experience with LEGO or construction in general. They would struggle to understand basic jargon and construction trade practices, and they would often make mistakes requiring your intervention.

Deploying Prefect on any Cloud Using a Single Virtual Machine

2025-01-15T09:29:00Z

Choosing the right data platform architecture is quite a challenge for any organization. It’s a balancing act: you need something that delivers immediate value while staying flexible enough for future growth—all without sacrificing scalability, simplicity, or efficiency.

This article offers a thoughtful guide to the decision-making process behind choosing Prefect with lightweight Kubernetes (K3S) on a single Virtual Machine (VM) with any cloud provider. You’ll explore:

Why simplicity and flexibility are essential for modern data platforms.
Key considerations for selecting the right data orchestration tool.
Insights into serverless vs server-based execution of Prefect flows.
Approaches to running a server-based Prefect worker.

Rather than a step-by-step tutorial, this guide is designed to help you make solid platform architecture decisions and design a solution tailored to your organization’s unique needs. Let’s dive in.

Challenges With Picking Data Platform Architecture

The options for building a data platform are endless—but many fall short. With the rise of affordable cloud storage, expectations have changed, leaving many once-revolutionary legacy systems struggling to keep up. At the same time, new solutions making big claims often fail, either missing critical features or bogging organizations down with unnecessary complexity. For smaller companies, the challenge is even greater—a data platform should drive business value, not require a dedicated team just to maintain it.

Starting small may seem practical, but early shortcuts can turn into major obstacles as the platform grows. Undoing poor architectural choices later is often costly and disruptive. That’s why choosing a solution that is both simple and scalable from the outset is essential.

For decision-makers, this journey begins by stepping back and evaluating both the current state of their team and the platform they rely on. The Data Platform Maturity Curve is a helpful framework for this:

Depending on the organization’s data technology maturity level, your platform must adapt. This article focuses on those in the middle of the curve—where simple scripts and ad-hoc solutions are no longer enough, but advanced features like autoscaling aren’t yet necessary. At this stage, the platform delivers tangible business value and is steadily becoming integral to operations. Downtime—whether it lasts hours, a day, or even a week—is growing increasingly expensive.

The goal? A platform that’s lightweight, scalable, and future-ready without overcomplicating things.

Data Platform Orchestration: the Key to Seamless Integration

Even the best-designed data platform is useless if it’s not integrated. No matter how carefully you choose your architecture, your platform’s success hinges on how well its core components—ingestion, transformation, and serving—work together. These phases can only operate efficiently when they are tightly aligned.

Adapted from “Fundamentals of Data Engineering: Plan and Build Robust Data Systems” by Joe Reis & Matt Housley

Early-stage platforms often rely on manual orchestration, which works at first but quickly becomes a bottleneck as data grows and workflows become more complex. Managing these workflows, ensuring accuracy, and reducing downtime require a more structured approach.

A few basic improvements can help push the boundaries further. For instance:

Instead of running all scripts locally, they can be executed on a virtual machine.
Setting up a database helps centralize data.
Basic automation of workflows can be managed with cron jobs in Linux.

While these incremental improvements help in the short term, significant challenges remain:

Manual code execution becomes increasingly error-prone as the scale of operations grows.
Cron jobs become difficult to manage as workflows become more complex and interdependent. Debugging failures can quickly turn into a nightmare, especially with cascading issues across multiple flows.

This is where automated data orchestration becomes the key to streamlining workflows across the entire lifecycle. It allows teams to automate, monitor, and scale operations by transforming disconnected processes into a cohesive system, minimizing manual intervention and reducing errors.
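
The core difference from independent cron jobs is that an orchestrator knows how tasks depend on each other. The toy sketch below, which uses only the Python standard library and is not how Prefect is implemented, runs tasks in dependency order and skips downstream tasks when an upstream task fails. The task names are invented for the example.

```python
from graphlib import TopologicalSorter
from typing import Callable


def run_workflow(
    tasks: dict[str, Callable[[], None]],
    depends_on: dict[str, set[str]],
) -> dict[str, str]:
    """Run tasks in dependency order; skip any task whose upstream did not succeed."""
    status: dict[str, str] = {}
    graph = {name: depends_on.get(name, set()) for name in tasks}
    for name in TopologicalSorter(graph).static_order():
        if any(status[upstream] != "success" for upstream in graph[name]):
            status[name] = "skipped"
            continue
        try:
            tasks[name]()
            status[name] = "success"
        except Exception:
            status[name] = "failed"
    return status


def failing_transform() -> None:
    raise RuntimeError("bad input")


result = run_workflow(
    tasks={"ingest": lambda: None, "transform": failing_transform, "serve": lambda: None},
    depends_on={"transform": {"ingest"}, "serve": {"transform"}},
)
print(result)
```

With cron, the "serve" job would still start at its scheduled time and run on stale or broken data. Here it is skipped, and the returned status map shows exactly where the failure occurred.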

Let’s review the most popular options available in the market.

Data Orchestration Tools

The three leading orchestration tools in the market are:

Apache Airflow: An open-source and community-driven tool with robust features but a steep learning curve. Managed versions like Google Cloud Composer and Amazon MWAA simplify deployment but tie users to specific cloud providers.
Prefect: A modern, cloud-agnostic, and easy-to-configure solution emphasizing scalability, portability, and developer-friendly features that allow for flexible orchestration. Prefect’s architecture also supports running workflows in hybrid environments, seamlessly bridging on-premises and cloud solutions.
Dagster: Designed for data-aware orchestration, Dagster prioritizes validation, lineage, and developer productivity, making it ideal for teams handling complex pipelines.

At The Scalable Way, we have worked with both Airflow and Prefect on several projects, and we recommend Prefect for a lightweight setup with fewer deployment concerns.

What is Prefect Cloud?

Prefect Cloud is a fully managed orchestration platform that simplifies running and monitoring Python-based workflows without the overhead of managing infrastructure. It’s well-suited for teams looking to automate data workflows, from ingestion and transformation to serving.

Its strengths include:

Scalability: Handles thousands of workflows with ease.
Monitoring and alerting: Built-in features simplify issue detection and resolution.
Cloud-agnostic architecture: Runs seamlessly across environments, avoiding vendor lock-in.

By automating the orchestration layer, Prefect Cloud allows teams to focus on building robust pipelines without the overhead of managing infrastructure.

Common Struggle for Prefect Users: Deployment

Adopting Prefect as an orchestrator unlocks many possibilities, but like any powerful tool, it comes with a learning curve. Prefect’s flexibility and developer-first approach can initially feel daunting for teams unfamiliar with building solid deployment solutions.

Prefect’s philosophy emphasizes providing tools rather than prescribing solutions, allowing users to adapt its features to their specific needs. While this approach offers flexibility and scalability, it can leave data engineers uncertain about where to start with scalable deployment practices like CI/CD pipelines and autoscaling.

Eternal Dilemma: Server-based or Serverless

Another consideration is choosing the right setup for running Prefect flows. There are two primary approaches, each designed to cater to different needs:

Server-based: This requires setting up infrastructure such as virtual machines, lightweight Kubernetes (e.g., K3S), or managed Kubernetes clusters. While these setups provide maximum control, scalability, and adaptability, they demand a higher level of expertise and upfront effort.
Serverless: Managed solutions like Prefect Cloud’s service or serverless compute options from cloud providers (AWS Fargate, Google Cloud Run, Azure Container Instances) eliminate the need for infrastructure management, making them appealing for simpler workflows.

Serverless solutions, though convenient, are best suited for simpler workflows, as they come with five notable challenges:

Startup Overhead: Prefect Worker images often have heavy dependencies, increasing flow initialization time. This leads to latency, as serverless platforms can introduce delays between task executions due to event-driven triggers. A long-running server with a persistent Prefect Worker is usually much quicker.
Vendor Lock-In: Serverless solutions are often tightly integrated with specific cloud providers, making it difficult to migrate workflows across platforms. Even Prefect Work Pools, though useful, have limited functionality at the Pro tier.
Cost Management: Serverless can be cost-effective for intermittent workloads but can become expensive with unpredictable usage patterns. Managing costs is trickier compared to traditional server-based setups.
Limited Control and Security Concerns: Serverless architectures limit control over the execution environment, as all logic runs on cloud provider-managed machines. This raises security risks, especially for companies dealing with sensitive data or operating in highly regulated industries, due to reduced visibility and potential vulnerabilities in shared infrastructure.
Token Management and Data Access Risks: Serverless setups require Prefect to hold a token for accessing cloud resources, creating security risks if mismanaged. Server-based setups mitigate this by reversing the data flow, allowing the server to pull from Prefect, and reducing the risk of data breaches or unintended data exposure.

Ultimately, the choice between server-based and serverless depends on the team’s needs and stage of data maturity. However, for most organizations aiming to scale, a Prefect Work Pool running on a long-running server is the better-suited and more reliable solution.

Deployment Options for a Server-based Data Platform

Local Prefect Worker Process

Connects directly to Prefect Cloud and serves as an introductory setup to understand Prefect Cloud’s functionality. However, this is not suitable for production scenarios due to limited scalability and resilience.

Systemd Process on Single or Multiple VMs

Runs Prefect flows in Docker containers, providing a lightweight setup that is relatively easy to configure. This approach is well-suited to small projects and teams, as Docker limits unnecessary complexity.

Single VM with Lightweight Kubernetes (K3S)

It’s not as simple as a Systemd setup because of the introduction of Kubernetes and Helm. Thanks to these tools, it’s more scalable and adaptable for future growth. This setup offers flexibility for migration to more robust configurations as project demands increase.

Managed Kubernetes Cluster

The most feature-rich solution: managed Kubernetes supports autoscaling, spot instances, and integrations with tools like Active Directory. It is ideal for comprehensive data platforms. However, this approach adds operational complexity and may be excessive for smaller projects.

Recommended Setup for getting started: Lightweight Kubernetes on a Single Virtual Machine

The lightweight Kubernetes on a single Virtual Machine (VM) setup strikes an ideal balance between cost efficiency and operational flexibility. By leveraging lightweight Kubernetes (K3S), you gain the core benefits of Kubernetes with significantly reduced overhead, making it perfect for smaller environments or projects with constrained resources. Its streamlined architecture ensures smooth operations without the complexity of managing a full Kubernetes cluster. The diagram illustrates a basic architecture that effectively meets most requirements for running Prefect flows in a scalable manner.

Using Helm charts to deploy the Prefect Worker simplifies orchestration, ensuring seamless integration with existing systems while minimizing manual configurations. Helm also makes updates easier, promotes standardization, and reduces deployment errors.

Running everything on a single virtual machine keeps the infrastructure simple yet scalable. If project demands grow, you can easily upgrade the VM or expand to a multi-node cluster without major changes to your architecture. Additionally, this setup simplifies maintenance, provides clear monitoring and debugging paths, and avoids vendor lock-in, preserving flexibility for future enhancements.

Conclusion

Building a modern data platform is no easy task. Success lies in keeping it simple while ensuring flexibility and scalability. With the right tools and setup, like Prefect and lightweight Kubernetes on a single virtual machine, you can create a platform that delivers immediate value and adapts as your needs grow.

By focusing on scalable, modular solutions, you’re not just solving today’s problems—you’re building a platform ready for whatever comes next.