<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en" xml:base="https://thescalableway.com/">
  <title>Blog</title>
  <subtitle></subtitle>
  <link href="https://thescalableway.com/feed.xml" rel="self" />
  <link href="https://thescalableway.com/" />
  <updated>2026-04-22T12:24:36Z</updated>
  <id>https://thescalableway.com/</id>
  <author>
    <name>The Scalable Way</name>
  </author>
	<entry>
      <title>Running dbt Rescue Rebuild in Production: Operational Playbooks, Failure Models, and Recovery Patterns</title>
      <link href="https://thescalableway.com/blog/running-dbt-rescue-rebuild-in-production-operational-playbooks-failure-models-and-recovery-patterns/" />
      <updated>2026-03-27T16:55:00Z</updated>
      <id>https://thescalableway.com/blog/running-dbt-rescue-rebuild-in-production-operational-playbooks-failure-models-and-recovery-patterns/</id>
      <content type="html">
				&lt;nav id=&quot;toc&quot; class=&quot;table-of-contents prose&quot;&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/running-dbt-rescue-rebuild-in-production-operational-playbooks-failure-models-and-recovery-patterns/#outage-recovery&quot;&gt;Outage Recovery&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/running-dbt-rescue-rebuild-in-production-operational-playbooks-failure-models-and-recovery-patterns/#resources-degradation&quot;&gt;Resources Degradation&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/running-dbt-rescue-rebuild-in-production-operational-playbooks-failure-models-and-recovery-patterns/#downstream-and-upstream-reloads&quot;&gt;Downstream and Upstream Reloads&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/running-dbt-rescue-rebuild-in-production-operational-playbooks-failure-models-and-recovery-patterns/#using-exclude-for-partial-progress-recovery&quot;&gt;Using --exclude for partial progress recovery&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/running-dbt-rescue-rebuild-in-production-operational-playbooks-failure-models-and-recovery-patterns/#incremental-models-and-full-refresh-strategy&quot;&gt;Incremental Models and Full-Refresh Strategy&lt;/a&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/running-dbt-rescue-rebuild-in-production-operational-playbooks-failure-models-and-recovery-patterns/#write-disposition-and-dbt-materialization-alignment&quot;&gt;Write disposition and dbt materialization alignment&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a 
href=&quot;https://thescalableway.com/blog/running-dbt-rescue-rebuild-in-production-operational-playbooks-failure-models-and-recovery-patterns/#cross-model-dependencies-the-heavy-lineage-trap&quot;&gt;Cross-Model Dependencies: the Heavy Lineage Trap&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/running-dbt-rescue-rebuild-in-production-operational-playbooks-failure-models-and-recovery-patterns/#scheduling-heavy-reloads&quot;&gt;Scheduling Heavy Reloads&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/running-dbt-rescue-rebuild-in-production-operational-playbooks-failure-models-and-recovery-patterns/#recovery-practices&quot;&gt;Recovery Practices&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/running-dbt-rescue-rebuild-in-production-operational-playbooks-failure-models-and-recovery-patterns/#pre-run-checklist&quot;&gt;Pre-Run Checklist&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/running-dbt-rescue-rebuild-in-production-operational-playbooks-failure-models-and-recovery-patterns/#summary&quot;&gt;Summary&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/nav&gt;&lt;p&gt;&lt;span id=&quot;toc-skipped&quot; class=&quot;visually-hidden&quot;&gt;&lt;/span&gt;&lt;/p&gt;&lt;div class=&quot;flow prose&quot;&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;In a &lt;a href=&quot;https://thescalableway.com/blog/the-rescue-dbt_rerun-deployment-rebuilding-changed-and-broken-models-without-disrupting-production/&quot; rel=&quot;noopener&quot;&gt;previous blog post&lt;/a&gt;, we introduced &lt;strong&gt;&lt;code&gt;dbt_rerun&lt;/code&gt;&lt;/strong&gt;, a dedicated Prefect deployment we use as a rescue mechanism for production dbt pipelines. 
We discussed the problem it solves, why standard recovery options fall short, how the deployment is wired, and where it fits into our day-to-day development workflow.&lt;/p&gt;&lt;p&gt;Now, let’s focus on what happens after you click “run”. We’ll walk through how we actually use &lt;code&gt;dbt_rerun&lt;/code&gt; in production: choosing the right dependency scope, managing queue and warehouse contention, recovering from outages, handling incremental models safely, and validating that the rescue truly fixed the issue. These are the patterns and guardrails we rely on during real incidents, where speed matters but correctness matters more.&lt;/p&gt;&lt;h2 id=&quot;outage-recovery&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/running-dbt-rescue-rebuild-in-production-operational-playbooks-failure-models-and-recovery-patterns/#outage-recovery&quot; class=&quot;heading-anchor&quot;&gt;Outage Recovery&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;When an upstream failure or source outage takes down a large number of models at once, the rescue deployment becomes the primary recovery channel. 
Attempting to recover through the normal scheduled pipelines risks re-triggering failures before the source is stable, and provides less visibility into what has and hasn’t been rebuilt.&lt;/p&gt;&lt;p&gt;The recommended approach is to batch all affected models into a single, well-scoped rescue run:&lt;/p&gt;&lt;p&gt;&lt;code&gt;dbt run --select int_model_1+ int_model_2+ +mart_model_1+ +mart_model_2+&lt;/code&gt;&lt;/p&gt;&lt;p&gt;This pattern has several advantages:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;It is isolated from the main pipeline schedules, so a failure in the rescue run does not interfere with ongoing production work.&lt;/li&gt;&lt;li&gt;It runs the full lineage for each affected model, ensuring that intermediates and their dependent marts are rebuilt in the correct order.&lt;/li&gt;&lt;li&gt;It makes recovery explicit and auditable - we know exactly which models were confirmed rebuilt, rather than assuming the scheduler will handle it.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;For very large outages (e.g., 50 or more staging models failing overnight), we group models by lineage affinity rather than triggering a single enormous run. This keeps individual runs fast, minimizes the blast radius of any secondary failure, and makes it easier to identify which chains have been recovered.&lt;/p&gt;&lt;h2 id=&quot;resources-degradation&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/running-dbt-rescue-rebuild-in-production-operational-playbooks-failure-models-and-recovery-patterns/#resources-degradation&quot; class=&quot;heading-anchor&quot;&gt;Resources Degradation&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Warehouse resources are shared across all running pipelines. 
A rescue run that overlaps with a scheduled run that touches the same large models will introduce concurrency at the database level, leading to competing read and write operations on the same tables, which can cause both runs to slow significantly or time out.&lt;/p&gt;&lt;p&gt;Two situations are particularly risky:&lt;/p&gt;&lt;ol class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Concurrent writes to the same table:&lt;/strong&gt; If a rescue run and a scheduled run attempt to write to the same model at the same time, they will contend for table locks. The likely outcome is that both runs become slow or stall, requiring manual intervention.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Merging an updated model while it is actively running:&lt;/strong&gt; When a Data Analyst merges a model change to main while a scheduled pipeline is mid-run and still executing that model, the downstream runs in that same pipeline may pick up the new code before the model has been rebuilt against it. This is a common source of unexpected failures. Our practice is to either merge after the current scheduled run cycle has completed or merge immediately after a cycle ends and trigger the rescue rebuild before the next cycle begins.&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;Warehouse resource contention is easiest to avoid by checking the Prefect run timeline before triggering a rescue: if a heavy scheduled deployment is running or about to run, we wait for it to clear.&lt;/p&gt;&lt;h2 id=&quot;downstream-and-upstream-reloads&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/running-dbt-rescue-rebuild-in-production-operational-playbooks-failure-models-and-recovery-patterns/#downstream-and-upstream-reloads&quot; class=&quot;heading-anchor&quot;&gt;Downstream and Upstream Reloads&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Choosing the right dependency scope is one of the most consequential decisions in a rescue run. 
Running too narrow leaves downstream models stale; running too wide wastes resources and increases the risk of contention.&lt;/p&gt;&lt;p&gt;Here is a compact guide to the selection patterns we use:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;code&gt;+model&lt;/code&gt;&amp;nbsp; -&amp;nbsp; the model plus all upstream parents (everything it reads from)&lt;/li&gt;&lt;li&gt;&lt;code&gt;model+&lt;/code&gt;&amp;nbsp; -&amp;nbsp; the model plus all downstream children (everything that reads from it)&lt;/li&gt;&lt;li&gt;&lt;code&gt;+model+&lt;/code&gt;&amp;nbsp; -&amp;nbsp; both directions simultaneously&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;strong&gt;Intermediate models&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/7of-t7uRO_-900.webp 900w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/7of-t7uRO_-900.jpeg&quot; alt width=&quot;900&quot; height=&quot;280&quot;&gt;&lt;/picture&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Mart models&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/p4gJy4ZE6x-900.webp 900w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/p4gJy4ZE6x-900.jpeg&quot; alt width=&quot;900&quot; height=&quot;280&quot;&gt;&lt;/picture&gt;&lt;/p&gt;
&lt;h2 id=&quot;using-exclude-for-partial-progress-recovery&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/running-dbt-rescue-rebuild-in-production-operational-playbooks-failure-models-and-recovery-patterns/#using-exclude-for-partial-progress-recovery&quot; class=&quot;heading-anchor&quot;&gt;Using --exclude for partial progress recovery&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;When a large run times out mid-way, some models in the selection will have already been built. Re-running the full selection would rebuild them unnecessarily. The &lt;code&gt;--exclude&lt;/code&gt; flag lets us resume from where we left off:&lt;/p&gt;&lt;pre class=&quot;language-plain&quot;&gt;&lt;code class=&quot;language-plain&quot;&gt;# Initial run: times out after building int_model, dep_1, dep_2 of 10 models
dbt run --select int_model+ --full-refresh

# Resume run: exclude the three already-built models
dbt run --select int_model+ --exclude int_model dep_1 dep_2 --full-refresh&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This pattern is especially important for full-refresh runs on large incremental chains, where rebuilding from scratch would waste significant time and resources.&lt;/p&gt;&lt;h2 id=&quot;incremental-models-and-full-refresh-strategy&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/running-dbt-rescue-rebuild-in-production-operational-playbooks-failure-models-and-recovery-patterns/#incremental-models-and-full-refresh-strategy&quot; class=&quot;heading-anchor&quot;&gt;Incremental Models and Full-Refresh Strategy&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Incremental models are the most complex rescue scenario and the place where most rescue mistakes happen: the wrong refresh strategy can silently lock in bad history or overload the warehouse, so the decision of whether to use &lt;code&gt;--full-refresh&lt;/code&gt; has lasting consequences for both data correctness and warehouse load. Our incremental models typically use a &lt;code&gt;unique_key&lt;/code&gt; for deduplication and an &lt;code&gt;is_incremental()&lt;/code&gt; filter to restrict which data is loaded on each run, often tied to a date cursor or a calendar condition, such as a monthly refresh window. 
Understanding both of those constraints is a prerequisite for choosing the right rescue approach.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;When to use &lt;code&gt;--full-refresh&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;Use &lt;code&gt;--full-refresh&lt;/code&gt; when the change invalidates the historical state of the table:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;The join logic, business rules, or transformations affecting existing rows have changed.&lt;/li&gt;&lt;li&gt;The &lt;code&gt;unique_key&lt;/code&gt; definition has changed.&lt;/li&gt;&lt;li&gt;A bug produced an incorrect history that must be corrected.&lt;/li&gt;&lt;li&gt;A new column is being backfilled from source data that already exists in staging.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;strong&gt;When not to use &lt;code&gt;--full-refresh&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;Only future data will differ (for example, adding a new column with NULL values for historical rows is acceptable).&lt;/li&gt;&lt;li&gt;The model runs on a monthly cadence, and a normal incremental run will naturally pick up the change on its next scheduled execution.&lt;/li&gt;&lt;li&gt;The model is very large, and a targeted historical window backfill would be more efficient than a full rebuild.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;When &lt;code&gt;--full-refresh&lt;/code&gt; is genuinely necessary on a large incremental model, treat it as a scheduled operation: run it during a low-traffic window, ensure queues are free, and monitor warehouse resource usage throughout.&lt;/p&gt;&lt;h3 id=&quot;write-disposition-and-dbt-materialization-alignment&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/running-dbt-rescue-rebuild-in-production-operational-playbooks-failure-models-and-recovery-patterns/#write-disposition-and-dbt-materialization-alignment&quot; class=&quot;heading-anchor&quot;&gt;Write disposition and dbt materialization alignment&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;The choice of write 
disposition is tightly coupled to how the source generates data:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;If source records are append-only and never modified, the incremental pattern with a date cursor is safe and efficient.&lt;/li&gt;&lt;li&gt;If source records can be updated, a &lt;code&gt;unique_key&lt;/code&gt; merge strategy is required to avoid duplicate or stale rows in the target.&lt;/li&gt;&lt;li&gt;If correctness requires reconstructing the full table from scratch on a change, &lt;code&gt;--full-refresh&lt;/code&gt; is the only reliable option, but it should be the last resort, not the default.&lt;/li&gt;&lt;/ul&gt;&lt;h2 id=&quot;cross-model-dependencies-the-heavy-lineage-trap&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/running-dbt-rescue-rebuild-in-production-operational-playbooks-failure-models-and-recovery-patterns/#cross-model-dependencies-the-heavy-lineage-trap&quot; class=&quot;heading-anchor&quot;&gt;Cross-Model Dependencies: the Heavy Lineage Trap&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Most rescue incidents do not involve a single model in isolation. They involve a model that sits in the middle of a deep lineage graph, with multiple upstream sources and multiple downstream consumers. 
When such a model is rescued carelessly, the blast radius extends well beyond its intended scope.&lt;/p&gt;&lt;p&gt;Before triggering a rescue run, we always inspect the lineage of the target model:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;How many upstream models does it depend on, and how heavy are they?&lt;/li&gt;&lt;li&gt;How many downstream models reference it directly or transitively?&lt;/li&gt;&lt;li&gt;Are any of those downstream models incremental or particularly large?&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;If the lineage is deep and the models are heavy, we split the rescue into multiple runs, ordered by layer:&lt;/p&gt;&lt;ol class=&quot;list&quot;&gt;&lt;li&gt;Rebuild the upstream models first to ensure stable inputs.&lt;/li&gt;&lt;li&gt;Rebuild the target model against the refreshed upstream.&lt;/li&gt;&lt;li&gt;Rebuild the downstream models once the target is confirmed correct.&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;This layered approach reduces the risk of a single timeout taking down the entire lineage recovery and makes it easier to identify failures at each stage.&lt;/p&gt;&lt;h2 id=&quot;scheduling-heavy-reloads&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/running-dbt-rescue-rebuild-in-production-operational-playbooks-failure-models-and-recovery-patterns/#scheduling-heavy-reloads&quot; class=&quot;heading-anchor&quot;&gt;Scheduling Heavy Reloads&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;For rescue operations that involve large incremental models, deep lineages, or full-refresh requirements, ad hoc triggering is rarely the right approach. Instead, we treat these as scheduled operations with the same care we apply to any production job.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Identifying a reload as “heavy”&lt;/strong&gt; is the first step. A useful heuristic: if a model typically takes more than a few minutes in its normal incremental run, its full-refresh will be significantly longer - potentially by an order of magnitude. 
Dependency chains multiply this further.&lt;/p&gt;&lt;p&gt;Scheduling considerations:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;Run heavy rescues outside peak hours, when scheduled pipelines have cleared their queues and warehouse load is low.&lt;/li&gt;&lt;li&gt;If multiple heavy models need to be rescued, stagger them across separate windows rather than running them in parallel, unless queues and warehouse headroom are confirmed sufficient.&lt;/li&gt;&lt;li&gt;Monitor the run actively - heavy full-refresh operations are the most common source of rescue-induced timeouts.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;When a rescue run times out partway through, the &lt;code&gt;--exclude&lt;/code&gt; pattern described in the section on partial progress recovery allows the run to resume without duplicating already-completed work.&lt;/p&gt;&lt;h2 id=&quot;recovery-practices&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/running-dbt-rescue-rebuild-in-production-operational-playbooks-failure-models-and-recovery-patterns/#recovery-practices&quot; class=&quot;heading-anchor&quot;&gt;Recovery Practices&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Consistent recovery practices reduce the operational cost of incidents and make rescue deployments easier to reason about.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Post-merge rebuild discipline.&lt;/strong&gt; Every model merge should be followed immediately by a rescue rebuild, completed before the next scheduled run cycle begins. This eliminates the most common source of “new code, old table” mismatches. If there is insufficient time before the next cycle, the merge should wait for a better window.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Batching by lineage for outage recovery.&lt;/strong&gt; When many models fail simultaneously, resist the temptation to address each one individually. Group them by shared lineage: models that share upstream sources or downstream consumers should be recovered together in a single selection string. 
This is faster, more reliable, and produces a cleaner audit trail.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Validating recovery before declaring success.&lt;/strong&gt; A successful dbt run is necessary but not sufficient. After each rescue run, we verify:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;dbt tests triggered by the deployment have passed.&lt;/li&gt;&lt;li&gt;Row counts and max dates in the rebuilt tables match expectations.&lt;/li&gt;&lt;li&gt;Downstream consumers are producing correct outputs in their next run.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;strong&gt;Using the sandbox for validation before prod.&lt;/strong&gt; For complex lineage changes, the sandbox schema - which mirrors the structure of all production schemas inside the production cluster - can be used to validate the rebuild logic before triggering it against production tables.&lt;/p&gt;&lt;h2 id=&quot;pre-run-checklist&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/running-dbt-rescue-rebuild-in-production-operational-playbooks-failure-models-and-recovery-patterns/#pre-run-checklist&quot; class=&quot;heading-anchor&quot;&gt;Pre-Run Checklist&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Before triggering any rescue run, we work through the following:&lt;/p&gt;&lt;ol class=&quot;list&quot;&gt;&lt;li&gt;Are SQL queues currently saturated, or is a heavy scheduled run about to start?&lt;/li&gt;&lt;li&gt;Is the target model heavy (large row volume, complex joins, or deep lineage)?&lt;/li&gt;&lt;li&gt;Have I mapped the upstream and downstream scope? Am I using the right selection (&lt;code&gt;+model&lt;/code&gt;, &lt;code&gt;model+&lt;/code&gt;, or &lt;code&gt;+model+&lt;/code&gt;)?&lt;/li&gt;&lt;li&gt;Is the model incremental? 
If so, will a normal incremental run load any data today, or will the monthly filter make it a no-op?&lt;/li&gt;&lt;li&gt;Does the change require &lt;code&gt;--full-refresh&lt;/code&gt;, or can a targeted backfill or normal incremental run achieve correctness?&lt;/li&gt;&lt;li&gt;Am I about to overlap with a scheduled run that touches the same lineage?&lt;/li&gt;&lt;li&gt;If the run times out, do I have a plan for resuming with &lt;code&gt;--exclude&lt;/code&gt;?&lt;/li&gt;&lt;li&gt;After the run: did tests pass, and have I sanity-checked the output tables?&lt;/li&gt;&lt;/ol&gt;&lt;h2 id=&quot;summary&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/running-dbt-rescue-rebuild-in-production-operational-playbooks-failure-models-and-recovery-patterns/#summary&quot; class=&quot;heading-anchor&quot;&gt;Summary&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;The &lt;code&gt;dbt_rerun&lt;/code&gt; rescue deployment is one of the highest-leverage tools in our daily data service operations. It provides a controlled, production-safe channel for rebuilding changed or broken models without affecting the main pipeline schedules, which is essential for maintaining data trust in a production environment where models change continuously.&lt;/p&gt;&lt;p&gt;Used well, it functions as a scalpel: we rebuild the smallest correct scope, at the right time, with the right materialization strategy. Used carelessly ‒ triggering heavy full-refreshes at peak hours, overlapping with scheduled runs, or rebuilding too wide a lineage all at once ‒ it becomes a source of incidents rather than a remedy for them.&lt;/p&gt;&lt;p&gt;The practices described in this article ‒ queue awareness, lineage inspection, incremental strategy, and post-run validation ‒ reflect lessons learned from daily operational use. The principles, however, generalize: any on-demand dbt execution in a production environment benefits from the same discipline.&lt;/p&gt;&lt;/div&gt;
 			</content>
    </entry><entry>
      <title>The Rescue dbt_rerun Deployment: Rebuilding Changed and Broken Models Without Disrupting Production</title>
      <link href="https://thescalableway.com/blog/the-rescue-dbt_rerun-deployment-rebuilding-changed-and-broken-models-without-disrupting-production/" />
      <updated>2026-03-23T13:45:00Z</updated>
      <id>https://thescalableway.com/blog/the-rescue-dbt_rerun-deployment-rebuilding-changed-and-broken-models-without-disrupting-production/</id>
      <content type="html">
				&lt;nav id=&quot;toc&quot; class=&quot;table-of-contents prose&quot;&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/the-rescue-dbt_rerun-deployment-rebuilding-changed-and-broken-models-without-disrupting-production/#what-we-are-solving-for&quot;&gt;What We Are Solving For&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/the-rescue-dbt_rerun-deployment-rebuilding-changed-and-broken-models-without-disrupting-production/#the-solution-a-dedicated-rescue-deployment&quot;&gt;The Solution: A Dedicated Rescue Deployment&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/the-rescue-dbt_rerun-deployment-rebuilding-changed-and-broken-models-without-disrupting-production/#how-the-deployment-is-wired&quot;&gt;How the Deployment Is Wired&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/the-rescue-dbt_rerun-deployment-rebuilding-changed-and-broken-models-without-disrupting-production/#where-rescue-fits-in-our-daily-workflow&quot;&gt;Where Rescue Fits in Our Daily Workflow&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/the-rescue-dbt_rerun-deployment-rebuilding-changed-and-broken-models-without-disrupting-production/#lateness-and-queue-contention&quot;&gt;Lateness and Queue Contention&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/the-rescue-dbt_rerun-deployment-rebuilding-changed-and-broken-models-without-disrupting-production/#whats-next&quot;&gt;What’s next&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/nav&gt;&lt;p&gt;&lt;span id=&quot;toc-skipped&quot; class=&quot;visually-hidden&quot;&gt;&lt;/span&gt;&lt;/p&gt;&lt;div class=&quot;flow prose&quot;&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;Keeping production data correct after a model change is harder than it looks.&lt;/p&gt;&lt;p&gt;A 
Data Analyst merges a new dbt model (or a hotfix) to the main branch. The code is now live. But the table in the production database still reflects the old logic. Scheduled downstream runs will start referencing the new code within minutes, yet the upstream table hasn’t been rebuilt. The result is either a silent data mismatch or a cascade of failures across the pipeline.&lt;/p&gt;&lt;p&gt;That scenario plays out more often than it should, because the standard options for recovery are blunt instruments:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;Waiting for the next scheduled run assumes the failure will resolve itself, which it often won’t.&lt;/li&gt;&lt;li&gt;Rerunning the entire pipeline is expensive, risky during peak hours, and almost always more than is needed.&lt;/li&gt;&lt;li&gt;Rerunning the wrong subset is fast but leaves downstream models in an inconsistent state.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;A targeted, production-safe rescue mechanism solves all three problems at once.&lt;/p&gt;&lt;p&gt;In this article, we describe how our team uses a dedicated Prefect deployment ‒ &lt;strong&gt;&lt;code&gt;dbt_rerun&lt;/code&gt;&lt;/strong&gt; ‒ to handle exactly this: rebuilding changed, broken, or historically stale models with the right scope and at the right time, without disrupting the main pipeline schedules.&lt;/p&gt;&lt;p&gt;Our insights are grounded in daily operational experience running this deployment as a core part of our data service.&lt;/p&gt;&lt;h2 id=&quot;what-we-are-solving-for&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/the-rescue-dbt_rerun-deployment-rebuilding-changed-and-broken-models-without-disrupting-production/#what-we-are-solving-for&quot; class=&quot;heading-anchor&quot;&gt;What We Are Solving For&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Production dbt pipelines are rarely static. Models get updated, source schemas drift, upstream data arrives late or corrupt, and historical backfills are periodically requested. 
Each of these situations requires rebuilding some portion of the model graph ‒ but rebuilding the wrong scope, at the wrong time, or with the wrong materialization strategy can cause as much damage as the original failure.&lt;/p&gt;&lt;p&gt;The core challenge is surgical precision: rebuild exactly what needs to be rebuilt, in the right dependency order, without blocking scheduled runs or saturating warehouse resources.&lt;/p&gt;&lt;p&gt;Three recurring failure modes motivated us to formalize the rescue approach:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Merge without immediate rebuild.&lt;/strong&gt; A Data Analyst merges an updated model into the main branch. Scheduled downstream runs start pulling from it before it’s been rebuilt, producing incorrect or failed outputs.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Outage recovery at scale.&lt;/strong&gt; A weekend failure takes out dozens of intermediate models and their marts. Recovering each one individually is impractical; recovering everything blindly through the main pipeline is too risky.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Heavy full-refresh at peak hours.&lt;/strong&gt; An incremental model is rebuilt with &lt;code&gt;--full-refresh&lt;/code&gt; at rush hour, saturating SQL queues and degrading warehouse performance across unrelated pipelines.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The &lt;strong&gt;&lt;code&gt;dbt_rerun&lt;/code&gt;&lt;/strong&gt; deployment addresses all three by giving us a dedicated, controlled channel for on-demand model rebuilds.&lt;/p&gt;&lt;h2 id=&quot;the-solution-a-dedicated-rescue-deployment&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/the-rescue-dbt_rerun-deployment-rebuilding-changed-and-broken-models-without-disrupting-production/#the-solution-a-dedicated-rescue-deployment&quot; class=&quot;heading-anchor&quot;&gt;The Solution: A Dedicated Rescue Deployment&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Let’s start from the beginning: What do we mean by a “Rescue 
Deployment”?&lt;/p&gt;&lt;p&gt;By “rescue deployment”, we mean an on-demand dbt execution path that is deliberately separated from scheduled production pipelines. The &lt;strong&gt;&lt;code&gt;dbt_rerun&lt;/code&gt;&lt;/strong&gt; deployment is not tied to a specific model, environment, or schedule. It exists solely to rebuild production tables when something has changed or broken, using an explicitly chosen dbt selection and timing. Unlike scheduled runs, it is triggered manually, scoped deliberately, and monitored closely from start to finish.&lt;/p&gt;&lt;p&gt;This separation is intentional: it allows us to fix production state without modifying schedules, redeploying code, or taking unnecessary risks during peak hours.&lt;/p&gt;&lt;p&gt;We run &lt;strong&gt;&lt;code&gt;dbt_rerun&lt;/code&gt;&lt;/strong&gt; as a Prefect deployment built on a shared template:&lt;/p&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; templates &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; transform_and_catalog



transform_and_catalog&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
    dbt_selects&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;run&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;token comment&quot;&gt;# Filled at runtime via Prefect custom run&lt;/span&gt;
    work_queue&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;dbt&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;         &lt;span class=&quot;token comment&quot;&gt;# Routes to the shared dbt work queue&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The deployment is intentionally minimal. The &lt;code&gt;dbt_selects[&quot;run&quot;]&lt;/code&gt; parameter is left empty at definition time and filled in when a custom run is triggered in Prefect, allowing any valid dbt selection string to be passed without modifying code or creating new deployments. In most cases, this is simply the name of the model that needs to be rebuilt (e.g., &lt;code&gt;int_orders&lt;/code&gt; or &lt;code&gt;mart_revenue&lt;/code&gt;), though more complex selection strings can be used when the rescue scope requires it.&lt;/p&gt;&lt;p&gt;This means a single deployment covers the full range of rescue scenarios: post-merge rebuilds, hotfix recoveries, historical backfills, and outage remediation.&lt;/p&gt;&lt;h2 id=&quot;how-the-deployment-is-wired&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/the-rescue-dbt_rerun-deployment-rebuilding-changed-and-broken-models-without-disrupting-production/#how-the-deployment-is-wired&quot; class=&quot;heading-anchor&quot;&gt;How the Deployment Is Wired&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;The &lt;code&gt;transform_and_catalog&lt;/code&gt; template handles the orchestration plumbing. Here is an annotated version of the relevant configuration:&lt;/p&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;transform_and_catalog&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
    flow_branch&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;str&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;None&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;        &lt;span class=&quot;token comment&quot;&gt;# Branch where the Prefect flow code lives&lt;/span&gt;
    dbt_repo_branch&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;str&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;main&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;       &lt;span class=&quot;token comment&quot;&gt;# dbt repo branch to run against&lt;/span&gt;
    schedules&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;             &lt;span class=&quot;token comment&quot;&gt;# Optional cron schedules (empty for rescue)&lt;/span&gt;
    work_queue&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;str&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;None&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;         &lt;span class=&quot;token comment&quot;&gt;# Which Prefect queue executes this deployment&lt;/span&gt;
    version&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    tags&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token operator&quot;&gt;**&lt;/span&gt;kwargs&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    params &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;token string&quot;&gt;&quot;dbt_repo_url&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;token string&quot;&gt;&quot;dbt_repo_token_secret&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;token string&quot;&gt;&quot;dbt_repo_branch&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; dbt_repo_branch&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;token string&quot;&gt;&quot;dbt_project_path&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;dbt/lakehouse&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;token string&quot;&gt;&quot;run_results_storage_path&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;s3:///dbt/runs&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;token string&quot;&gt;&quot;run_results_storage_credentials_secret&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
    params&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;update&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;**&lt;/span&gt;kwargs&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;


    deploy&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
        name&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        flow_name&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;transform_and_catalog&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        flow_branch&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;flow_branch&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        params&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;params&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        schedules&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;schedules&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        work_queue&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;work_queue&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        version&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;version&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        tags&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;tags&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Two operational details are worth highlighting:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;code&gt;work_queue=&quot;dbt&quot;&lt;/code&gt; routes the rescue run through the same queue pool as regular dbt deployments, making queue saturation visible and manageable.&lt;/li&gt;&lt;li&gt;&lt;code&gt;dbt_selects&lt;/code&gt; passed at runtime means the deployment never needs to be redeployed to handle a new scenario. At trigger time, the engineer fills in the selection string in the Prefect UI custom run dialog.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The deployment also runs dbt’s built-in tests after each execution, so model correctness is validated automatically without a separate step.&lt;/p&gt;&lt;h2 id=&quot;where-rescue-fits-in-our-daily-workflow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/the-rescue-dbt_rerun-deployment-rebuilding-changed-and-broken-models-without-disrupting-production/#where-rescue-fits-in-our-daily-workflow&quot; class=&quot;heading-anchor&quot;&gt;Where Rescue Fits in Our Daily Workflow&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Before diving into individual scenarios, here is how &lt;strong&gt;&lt;code&gt;dbt_rerun&lt;/code&gt;&lt;/strong&gt; fits into our standard development lifecycle:&lt;/p&gt;&lt;ol class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Local development:&lt;/strong&gt; A Data Analyst builds or updates a dbt model locally against the sandbox schema ‒ a schema inside the production cluster that mirrors the structure of all production schemas.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pull request and merge:&lt;/strong&gt; The Data Analyst opens a PR to main. 
Once approved and merged, the new or updated model code is live in the repository.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Rescue rebuild:&lt;/strong&gt; We trigger &lt;strong&gt;&lt;code&gt;dbt_rerun&lt;/code&gt;&lt;/strong&gt; to rebuild the changed model along with the appropriate upstream and downstream dependencies. This is the step that makes the production table reflect the new code.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Scheduled runs proceed normally:&lt;/strong&gt; Once the rebuild is confirmed, the regular pipeline schedules pick up cleanly without encountering a code/table mismatch.&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;This sequence applies equally to post-merge refreshes, hotfix recoveries, and outage remediations ‒ only the selection string and the urgency window change.&lt;/p&gt;&lt;h2 id=&quot;lateness-and-queue-contention&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/the-rescue-dbt_rerun-deployment-rebuilding-changed-and-broken-models-without-disrupting-production/#lateness-and-queue-contention&quot; class=&quot;heading-anchor&quot;&gt;Lateness and Queue Contention&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Triggering a rescue run at the wrong moment can cause more lateness than the original problem it is meant to solve.&lt;/p&gt;&lt;p&gt;Our production environment typically runs five SQL queues in parallel. 
When four of those are occupied by scheduled deployments, taking the fifth with a heavy rescue run means that any new scheduled run that needs a queue will be blocked until the rescue completes.&lt;/p&gt;&lt;p&gt;If the rescue model is large or triggers a long chain of dependencies, this blockage can cause data latency across multiple downstream consumers.&lt;/p&gt;&lt;p&gt;Before triggering any rescue run, we assess the following:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;How many SQL queues are currently in use?&lt;/li&gt;&lt;li&gt;Is the target model light or heavy (in terms of compute and row volume)?&lt;/li&gt;&lt;li&gt;Does the lineage include other heavy models?&lt;/li&gt;&lt;li&gt;Are any scheduled runs likely to start in the next 15–30 minutes that would need a free queue?&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;If queues are near capacity and the model is heavy, we do not trigger immediately. Instead, we either wait for a quiet window or split the reload into smaller, sequential runs that each finish quickly and release the queue.&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; A heavy &lt;code&gt;--full-refresh&lt;/code&gt; on a large incremental model is particularly risky in this context. Not only does it occupy a queue for an extended period, but it also recreates the full table from scratch, generating significant warehouse load. 
We treat any full refresh of a large incremental model as a scheduled operation, not an ad hoc one.&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;The mitigation pattern for large reloads is to split by lineage: instead of one run covering ten dependent models, run three batches of three or four models each, in low-traffic windows with enough time between them for queue recovery.&lt;/p&gt;&lt;h2 id=&quot;whats-next&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/the-rescue-dbt_rerun-deployment-rebuilding-changed-and-broken-models-without-disrupting-production/#whats-next&quot; class=&quot;heading-anchor&quot;&gt;What’s next&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;The &lt;code&gt;dbt_rerun&lt;/code&gt; deployment gives us a safe, controlled way to correct production state when models change or break, without touching schedules or resorting to heavy-handed reruns. But the deployment itself is only part of the solution. How and when it is used matters just as much as how it is wired.&lt;/p&gt;&lt;p&gt;Watch out for &lt;a href=&quot;https://thescalableway.com/blog/running-dbt-rescue-rebuild-in-production-operational-playbooks-failure-models-and-recovery-patterns/&quot; rel=&quot;noopener&quot;&gt;part two&lt;/a&gt; of this post, where we’ll go deeper into the operational side: how we scope rescue runs in real incidents, how we recover from outages, how we handle incremental models and full-refresh decisions, and how we avoid turning a rescue into a second production issue. If you’re the one on call when data breaks, stay tuned.&lt;/p&gt;&lt;/div&gt;
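&lt;p&gt;As an illustration of the pre-trigger assessment and the split-by-lineage pattern from the Lateness and Queue Contention section, here is a minimal Python sketch. The names &lt;code&gt;QueueSnapshot&lt;/code&gt;, &lt;code&gt;can_trigger_now&lt;/code&gt;, and &lt;code&gt;split_into_batches&lt;/code&gt; are hypothetical and not part of our template. The spare-queue rule is an assumed simplification of the checklist, and the model list is assumed to be in dependency order already (upstream first).&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class QueueSnapshot:
    """Hypothetical snapshot of warehouse state taken before a rescue run."""

    total_queues: int    # e.g. five SQL queues running in parallel
    queues_in_use: int   # queues currently held by scheduled deployments
    scheduled_soon: int  # scheduled runs expected within the next 15-30 minutes


def can_trigger_now(snapshot: QueueSnapshot, model_is_heavy: bool) -> bool:
    """Assumed rule: a heavy rescue must still leave one free queue per upcoming run."""
    free = snapshot.total_queues - snapshot.queues_in_use
    if free == 0:
        return False  # nothing to run on; wait for a quiet window
    if not model_is_heavy:
        return True  # a light rebuild finishes quickly and releases its queue
    return free - 1 >= snapshot.scheduled_soon


def split_into_batches(models: list[str], max_batch_size: int) -> list[str]:
    """Split dependency-ordered model names into balanced dbt selection strings."""
    if not models:
        return []
    size = max(1, max_batch_size)  # guard against zero or negative sizes
    n_batches = -(-len(models) // size)  # ceiling division
    base, extra = divmod(len(models), n_batches)
    batches = []
    start = 0
    for i in range(n_batches):
        end = start + base + (1 if extra > i else 0)
        batches.append(" ".join(models[start:end]))
        start = end
    return batches


if __name__ == "__main__":
    lineage = [f"model_{i}" for i in range(10)]  # placeholder model names
    if can_trigger_now(QueueSnapshot(5, 3, 1), model_is_heavy=True):
        for selection in split_into_batches(lineage, max_batch_size=4):
            # Each selection would be one custom run's dbt_selects["run"] value.
            print({"dbt_selects": {"run": selection}})
```

&lt;p&gt;Each returned string can be passed as &lt;code&gt;dbt_selects[&quot;run&quot;]&lt;/code&gt; for one custom run, since a space-separated list of model names is a union in dbt’s selection syntax. Balancing the batches keeps every run short, so each one releases its queue quickly.&lt;/p&gt;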
 			</content>
    </entry><entry>
      <title>Why Data Teams Struggle Without Separate Dev and Prod Environments</title>
      <link href="https://thescalableway.com/blog/why-data-teams-struggle-without-separate-dev-and-prod-environments/" />
      <updated>2026-01-22T13:00:00Z</updated>
      <id>https://thescalableway.com/blog/why-data-teams-struggle-without-separate-dev-and-prod-environments/</id>
      <content type="html">
				&lt;nav id=&quot;toc&quot; class=&quot;table-of-contents prose&quot;&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/why-data-teams-struggle-without-separate-dev-and-prod-environments/#when-development-and-production-collide&quot;&gt;When Development and Production Collide&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/why-data-teams-struggle-without-separate-dev-and-prod-environments/#what-a-healthy-setup-looks-like&quot;&gt;What a Healthy Setup Looks Like&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/why-data-teams-struggle-without-separate-dev-and-prod-environments/#rolling-it-out-without-overwhelming-teams&quot;&gt;Rolling It Out Without Overwhelming Teams&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/why-data-teams-struggle-without-separate-dev-and-prod-environments/#the-payoff-stability-speed-and-trust&quot;&gt;The Payoff: Stability, Speed, and Trust&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/why-data-teams-struggle-without-separate-dev-and-prod-environments/#but-were-different&quot;&gt;“But We’re Different…”&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/why-data-teams-struggle-without-separate-dev-and-prod-environments/#taking-the-first-step&quot;&gt;Taking the First Step&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/nav&gt;&lt;p&gt;&lt;span id=&quot;toc-skipped&quot; class=&quot;visually-hidden&quot;&gt;&lt;/span&gt;&lt;/p&gt;&lt;div class=&quot;flow prose&quot;&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;It’s Monday morning. The CEO’s dashboard shows zeros. Sales metrics are gone. The data team is digging through logs, trying to figure out which Friday deployment broke production. 
Familiar scenario?&lt;/p&gt;&lt;p&gt;Unfortunately, this isn’t just an anecdote. Over the past three years, &lt;a href=&quot;https://www.coresite.com/blog/data-center-outage-trends-good-news-flags-in-the-uptime-institute-reports?hs_amp=true&quot; rel=&quot;noopener&quot;&gt;50% of data centers experienced at least one impactful outage&lt;/a&gt;. Of these incidents, nearly 40% were caused by human error, with 85% stemming from staff failing to follow procedures or from process flaws ‒ issues that could often be avoided if analytics and production workloads were properly separated. And the consequences can be severe: more than half of these outages cost organizations over $100,000, and &lt;a href=&quot;https://www.scribd.com/document/890018493/2025-Annual-Outage-Exec-Summary-UI&quot; rel=&quot;noopener&quot;&gt;one in five exceeded $1 million&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;This happens when teams share the same data warehouse, the same jobs, and sometimes even the same credentials. At first, it feels efficient: no duplicate infrastructure, no setup overhead. 
But over time, it creates a fragile system in which even small changes can ripple into business-critical outages.&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal4&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/qHEZdOHFhB-960.webp 960w, https://thescalableway.com/img/qHEZdOHFhB-1525.webp 1525w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/qHEZdOHFhB-960.jpeg&quot; alt=&quot;dev and prod environments separation&quot; width=&quot;1525&quot; height=&quot;500&quot; srcset=&quot;https://thescalableway.com/img/qHEZdOHFhB-960.jpeg 960w, https://thescalableway.com/img/qHEZdOHFhB-1525.jpeg 1525w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;4&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/qHEZdOHFhB-960.webp 960w, https://thescalableway.com/img/qHEZdOHFhB-1525.webp 1525w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/qHEZdOHFhB-960.jpeg&quot; alt=&quot;dev and prod environments separation&quot; width=&quot;1525&quot; height=&quot;500&quot; srcset=&quot;https://thescalableway.com/img/qHEZdOHFhB-960.jpeg 960w, https://thescalableway.com/img/qHEZdOHFhB-1525.jpeg 1525w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;h2 id=&quot;when-development-and-production-collide&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/why-data-teams-struggle-without-separate-dev-and-prod-environments/#when-development-and-production-collide&quot; class=&quot;heading-anchor&quot;&gt;When Development and Production Collide&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;When both environments are the same, every update 
carries high stakes. Test directly in production, and you risk breaking a dashboard the CFO uses daily. Hold back changes, and you slow down every initiative.&lt;/p&gt;&lt;p&gt;I’ve seen this play out in many companies:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;Stakeholders stop trusting reports because they fail randomly ‒ and this is not just my experience: about &lt;a href=&quot;https://www.montecarlodata.com/blog-data-quality-survey&quot; rel=&quot;noopener&quot;&gt;three-quarters of organizations&lt;/a&gt; say business stakeholders are the ones who spot issues first, most of the time.&lt;/li&gt;&lt;li&gt;Business users build their own spreadsheets and one-off tools to “fix” gaps.&lt;/li&gt;&lt;li&gt;Data teams spend their time firefighting instead of improving pipelines. On average, data engineers spend &lt;a href=&quot;https://www.montecarlodata.com/blog-2022-data-quality-survey/&quot; rel=&quot;noopener&quot;&gt;40% of their time&lt;/a&gt; (roughly 2 days per week!) addressing bad data and unplanned issues.&lt;/li&gt;&lt;li&gt;Deployment windows shrink to late nights and weekends because no one trusts changes during business hours. And it’s not just time lost: over &lt;a href=&quot;https://www.pagerduty.com/blog/devops/unplanned-work-devops/&quot; rel=&quot;noopener&quot;&gt;70% of technology staff&lt;/a&gt; report being negatively impacted by unplanned work in three or more ways, including heightened stress and anxiety, reduced work-life balance, and less time to focus on strategic projects.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;For the people working inside these environments, the stress is constant. Engineers avoid trying new ideas because the cost of failure is too high. 
The pace of delivery slows, and eventually, good people leave for organizations where they can focus on building rather than patching.&lt;/p&gt;&lt;h2 id=&quot;what-a-healthy-setup-looks-like&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/why-data-teams-struggle-without-separate-dev-and-prod-environments/#what-a-healthy-setup-looks-like&quot; class=&quot;heading-anchor&quot;&gt;What a Healthy Setup Looks Like&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;The solution is not complicated in principle: give development and production their own space. That means separate cloud accounts, databases, compute resources, and access controls. Tricks like using schema prefixes in the same warehouse don’t solve the problem ‒ they only create false security.&lt;/p&gt;&lt;p&gt;A good development environment doesn’t need to be a full copy of production. It just needs to behave the same way. I live by these three golden practices for achieving this:&lt;/p&gt;&lt;ol class=&quot;list&quot;&gt;&lt;li&gt;Use infrastructure-as-code to keep environments consistent.&lt;/li&gt;&lt;li&gt;Create smaller datasets that are representative of production (with sensitive data masked).&lt;/li&gt;&lt;li&gt;Set up a clear path for code to move: dev → staging → production, with tests and reviews at each step.&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;Version control (Git) underpins all of this. Every change should leave a trail, so you can review, roll back, and understand what’s running where.&lt;/p&gt;&lt;h2 id=&quot;rolling-it-out-without-overwhelming-teams&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/why-data-teams-struggle-without-separate-dev-and-prod-environments/#rolling-it-out-without-overwhelming-teams&quot; class=&quot;heading-anchor&quot;&gt;Rolling It Out Without Overwhelming Teams&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;A shift like this doesn’t happen overnight, and it shouldn’t. 
Most teams succeed by taking it step by step:&lt;/p&gt;&lt;ol class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Set up separate infrastructure&lt;/strong&gt; ‒ provision isolated development resources using infrastructure-as-code so environments stay consistent and secure. Check our &lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/&quot; rel=&quot;noopener&quot;&gt;blog post&lt;/a&gt; on how to do it using Terraform, a great starting point for provisioning dev infrastructure.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Get usable data into dev&lt;/strong&gt; ‒ establish pipelines to copy and mask subsets of production data. You can see some best practices in our piece on &lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/&quot; rel=&quot;noopener&quot;&gt;building ingestion pipelines with dlt and Prefect&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Define workflows and train the team&lt;/strong&gt; ‒ document how changes flow, how reviews are conducted, and the promotion criteria.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Automate deployments&lt;/strong&gt; ‒ CI/CD pipelines handle testing, validation, and approval gates before changes reach production.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Add monitoring&lt;/strong&gt; ‒ make sure each environment has the right level of alerting so issues are caught quickly. 
Head to our &lt;a href=&quot;https://thescalableway.com/blog/deploying-prefect-on-any-cloud-using-a-single-virtual-machine/&quot; rel=&quot;noopener&quot;&gt;Prefect deployment guide&lt;/a&gt; for a practical approach to monitoring and automation, which also covers the previous step on deployment.&lt;/li&gt;&lt;/ol&gt;&lt;h2 id=&quot;the-payoff-stability-speed-and-trust&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/why-data-teams-struggle-without-separate-dev-and-prod-environments/#the-payoff-stability-speed-and-trust&quot; class=&quot;heading-anchor&quot;&gt;The Payoff: Stability, Speed, and Trust&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Companies that make this switch often see significant drops in incident rates within the first six months ‒ in our experience, 50% in three months. Deployments that once happened monthly shift to a weekly or even daily rhythm. Teams finally get space to experiment without the fear of breaking production, and business leaders start trusting dashboards again because they consistently work.&lt;/p&gt;&lt;p&gt;Inside the team, the shift is just as important. Instead of late-night firefighting, data professionals can focus on delivering value, exploring new tools, and building solutions that last.&lt;/p&gt;&lt;h2 id=&quot;but-were-different&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/why-data-teams-struggle-without-separate-dev-and-prod-environments/#but-were-different&quot; class=&quot;heading-anchor&quot;&gt;“But We’re Different…”&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Yes, yes, I know. “It’s too expensive, too complicated, or not suitable for the amount of data we handle.” But here’s the reality:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;“We can’t afford separate environments.”&amp;nbsp;&lt;/strong&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Cloud platforms make it manageable. You don’t need full-scale production replicas; lightweight development instances are enough. 
For example, on &lt;strong&gt;Azure Synapse Analytics&lt;/strong&gt;, small teams with 1TB of data and 5–10 users typically &lt;strong&gt;spend $1,500–4,000 per month&lt;/strong&gt; per environment using serverless SQL pools with pause/resume capabilities. On &lt;strong&gt;Amazon Redshift Serverless&lt;/strong&gt;, similar workloads run &lt;a href=&quot;https://aws.amazon.com/redshift/pricing/&quot; rel=&quot;noopener&quot;&gt;&lt;strong&gt;$1,000–3,000 per month&lt;/strong&gt;&lt;/a&gt; since you only pay for actual query time ‒ no idle clusters burning money overnight. For even lighter workloads, &lt;strong&gt;Amazon Athena&lt;/strong&gt; can drop costs to &lt;strong&gt;$200–800 per month&lt;/strong&gt; when you’re querying well-partitioned data in S3 (&lt;a href=&quot;https://aws.amazon.com/athena/pricing/&quot; rel=&quot;noopener&quot;&gt;at $5 per TB scanned&lt;/a&gt;). Compare that cost to your last outage.&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;“We’re too small for such complexity.”&amp;nbsp;&lt;/strong&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Small teams need it most. If three people spend about 40% of their time fixing broken production jobs, that effectively means more than one full team member’s capacity is devoted to patching problems instead of innovating or building new solutions. That’s a huge hit to both productivity and morale.&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;“Our data is too big to duplicate.”&amp;nbsp;&lt;/strong&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;You don’t need to. 
&lt;a href=&quot;https://www.datprof.com/solutions/data-subsetting-2/&quot; rel=&quot;noopener&quot;&gt;Subsets&lt;/a&gt;, samples, or synthetic data usually give you what you need to test logic and performance.&lt;/p&gt;&lt;h2 id=&quot;taking-the-first-step&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/why-data-teams-struggle-without-separate-dev-and-prod-environments/#taking-the-first-step&quot; class=&quot;heading-anchor&quot;&gt;Taking the First Step&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Separating dev and prod isn’t an optional best practice. It’s the foundation for stable, trustworthy analytics. You don’t need to implement everything in one go, but starting small pays off quickly.&lt;/p&gt;&lt;p&gt;Think of it less as technical overhead and more as business enablement. When your analytics are reliable, your decisions improve. When deployments are faster, your team can respond to new business needs. When incidents are fewer, you unlock time for innovation.&lt;/p&gt;&lt;p&gt;The question isn’t whether your organization needs this ‒ it’s whether you want to start proactively or wait until the next production incident forces the change.&lt;/p&gt;&lt;/div&gt;
 			</content>
    </entry><entry>
      <title>Data Platform Cost Optimization: Practical Strategies for Query Performance, Storage, and Cloud Resource Management</title>
      <link href="https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/" />
      <updated>2025-10-27T13:06:00Z</updated>
      <id>https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/</id>
      <content type="html">
				&lt;nav id=&quot;toc&quot; class=&quot;table-of-contents prose&quot;&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#query-optimization-for-lower-data-warehouse-costs&quot;&gt;Query Optimization for Lower Data Warehouse Costs&lt;/a&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#warehouse-specific-techniques&quot;&gt;Warehouse-Specific Techniques:&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#execution-plan-analysis&quot;&gt;Execution Plan Analysis:&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#precomputation-and-caching&quot;&gt;Precomputation and Caching:&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#optimizing-joins-and-complexity&quot;&gt;Optimizing Joins and Complexity:&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#advanced-data-loading-techniques&quot;&gt;Advanced Data Loading Techniques:&lt;/a&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a 
href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#incremental-loading-patterns&quot;&gt;Incremental Loading Patterns:&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#batch-vs-streaming-trade-offs&quot;&gt;Batch vs. Streaming Trade-offs:&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#optimal-scheduling&quot;&gt;Optimal Scheduling:&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#cloud-resource-management-and-cost-control&quot;&gt;Cloud Resource Management and Cost Control&lt;/a&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#reserved-capacity-strategy&quot;&gt;Reserved Capacity Strategy:&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#auto-scaling-configuration&quot;&gt;Auto-Scaling Configuration:&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#spot-instance-usage&quot;&gt;Spot Instance Usage:&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;li 
class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#data-access-control-strategy&quot;&gt;Data Access Control Strategy&lt;/a&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#self-service-vs-centralized-reporting&quot;&gt;Self-Service vs. Centralized Reporting:&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#aggregation-layers&quot;&gt;Aggregation Layers:&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#access-control-patterns&quot;&gt;Access Control Patterns:&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#storage-optimization-and-data-lifecycle-management&quot;&gt;Storage Optimization and Data Lifecycle Management&lt;/a&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#data-lifecycle-management&quot;&gt;Data Lifecycle Management:&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#compression-and-encoding&quot;&gt;Compression and 
Encoding:&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#time-travel-for-point-in-time-analysis&quot;&gt;Time Travel for Point-in-Time Analysis:&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/nav&gt;&lt;p&gt;&lt;span id=&quot;toc-skipped&quot; class=&quot;visually-hidden&quot;&gt;&lt;/span&gt;&lt;/p&gt;&lt;div class=&quot;flow prose&quot;&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;Data platform costs can quickly spiral out of control, often catching organizations off guard as workloads scale and data volumes increase. Whether you’re a data engineer optimizing transformation pipelines, a data analyst writing complex queries, or a DevOps engineer managing infrastructure across AWS, GCP, and Azure, you have direct influence over your platform’s total cost of ownership.&lt;/p&gt;&lt;p&gt;Cost optimization isn’t the sole responsibility of a single team or role. Every decision made throughout the data platform lifecycle affects the bottom line. A poorly written query can consume hundreds of dollars in compute resources within minutes. Unnecessary data retention policies can bloat storage costs month after month. Inadequate resource planning can lead to over-provisioned infrastructure that runs idle most of the time.&lt;/p&gt;&lt;p&gt;This guide looks at cost optimization from multiple angles, recognizing that effective cost management requires collaboration across the entire data platform team. 
You’ll find practical strategies tailored to different aspects of data platform operations, from warehouse query optimization and incremental loading techniques to cloud resource reservations and data access controls. Each section highlights key principles and links to implementation approaches for deeper technical coverage.&lt;/p&gt;&lt;p&gt;The goal is simple: to help you identify and act on cost-saving opportunities in your area of expertise, building data platforms that are both efficient and economical.&lt;/p&gt;&lt;h2 id=&quot;query-optimization-for-lower-data-warehouse-costs&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#query-optimization-for-lower-data-warehouse-costs&quot; class=&quot;heading-anchor&quot;&gt;Query Optimization for Lower Data Warehouse Costs&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Query optimization is one of the most impactful ways to reduce platform expenses. A well-tuned query can run 10–100 times faster than an unoptimized one, directly lowering compute costs and improving user experience. Before any SQL is written, the way tables are structured, data types are chosen, and storage is organized sets the ceiling on achievable performance. Messy source tables with unnecessary columns, unoptimized types (e.g., STRING instead of INT64 for IDs), and poor partitioning or clustering make queries orders of magnitude slower, even with well-written SQL. Universal architectural principles apply across all modern warehouses: partitioning, clustering/sorting, indexing, compression, and columnar storage. 
Start optimization here before platform-specific tuning.&lt;/p&gt;&lt;h3 id=&quot;warehouse-specific-techniques&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#warehouse-specific-techniques&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Warehouse-Specific Techniques&lt;/strong&gt;:&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Each data warehouse behaves differently. BigQuery charges based on bytes processed or slot usage, making partition pruning and clustering critical for cost control. In Redshift, performance depends heavily on choosing the right distribution and sort keys to reduce data movement during joins. Understanding these platform-specific optimizations allows you to reduce query costs dramatically without changing business logic.&lt;/p&gt;&lt;h3 id=&quot;execution-plan-analysis&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#execution-plan-analysis&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Execution Plan Analysis&lt;/strong&gt;:&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Understanding query execution plans is essential for identifying performance bottlenecks. Analyze the most expensive queries using BigQuery’s &lt;code&gt;INFORMATION_SCHEMA.JOBS&lt;/code&gt; or Redshift’s &lt;code&gt;SYS_QUERY_HISTORY&lt;/code&gt; (for serverless) and &lt;code&gt;SVL_QUERY_REPORT&lt;/code&gt; (for provisioned clusters). 
Focus optimization on queries consuming the most slots, scan time, or compute resources.&lt;/p&gt;&lt;h3 id=&quot;precomputation-and-caching&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#precomputation-and-caching&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Precomputation and Caching:&lt;/strong&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Choose the right materialization strategy for recurring queries. Physical tables offer the most flexibility for complex transformation logic and refresh schedules but require explicit pipeline management. Standard views provide real-time data access without storage overhead, ideal when freshness is critical and query performance is acceptable. Materialized views physically store precomputed results for faster reads but come with platform-specific limitations around refresh behavior, join complexity, and aggregation support. Evaluate whether your use case fits before committing.&lt;/p&gt;&lt;p&gt;Complement materialization with result caching: BigQuery caches identical query results for 24 hours at no charge, while Redshift caches result sets in memory on the leader node. Both eliminate redundant computation when users re-run the same queries.&lt;/p&gt;&lt;h3 id=&quot;optimizing-joins-and-complexity&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#optimizing-joins-and-complexity&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Optimizing Joins and Complexity:&lt;/strong&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Simplify query logic by breaking down complex queries into intermediate tables. Filter data before joining to minimize scanned bytes and reduce overall processing costs. 
In engines that recommend it, such as BigQuery, place the largest table first (on the left side) in the join.&lt;/p&gt;&lt;h2 id=&quot;advanced-data-loading-techniques&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#advanced-data-loading-techniques&quot; class=&quot;heading-anchor&quot;&gt;Advanced Data Loading Techniques:&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Efficient data loading practices can lower compute and storage costs while improving pipeline performance. Instead of reprocessing entire datasets, incremental loading focuses only on what has changed, cutting ingestion costs by up to 90%.&lt;/p&gt;&lt;h3 id=&quot;incremental-loading-patterns&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#incremental-loading-patterns&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Incremental Loading Patterns&lt;/strong&gt;:&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;The high-watermark method is a simple and reliable approach: track the maximum timestamp or sequential ID from the previous load, then query only for records exceeding that watermark. For full change tracking, including deletes, implement Change Data Capture (CDC) using tools like dlt (from dltHub), an open-source Python library that supports CDC and incremental loading out of the box. 
It handles schema evolution, state tracking, and normalization automatically.&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal14&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/dSD7g6PA5c-960.webp 960w, https://thescalableway.com/img/dSD7g6PA5c-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/dSD7g6PA5c-960.jpeg&quot; alt=&quot;incremental loading&quot; width=&quot;1600&quot; height=&quot;865&quot; srcset=&quot;https://thescalableway.com/img/dSD7g6PA5c-960.jpeg 960w, https://thescalableway.com/img/dSD7g6PA5c-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;14&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/dSD7g6PA5c-960.webp 960w, https://thescalableway.com/img/dSD7g6PA5c-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/dSD7g6PA5c-960.jpeg&quot; alt=&quot;incremental loading&quot; width=&quot;1600&quot; height=&quot;865&quot; srcset=&quot;https://thescalableway.com/img/dSD7g6PA5c-960.jpeg 960w, https://thescalableway.com/img/dSD7g6PA5c-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;h3 id=&quot;batch-vs-streaming-trade-offs&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#batch-vs-streaming-trade-offs&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Batch vs. 
Streaming Trade-offs:&lt;/strong&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Batch loading offers significant cost advantages for most analytical workloads, with traditional batch jobs running on scheduled intervals (hourly, daily) at minimal infrastructure cost. Event-based batch processing provides a middle ground, triggering data loads when specific events occur, such as new files arriving in object storage or source system notifications, eliminating unnecessary empty runs while maintaining batch processing efficiency. Micro-batch processing (near real-time) has emerged as a viable option for many use cases, processing small batches of data every 5-10 minutes and satisfying business requirements that don’t truly need sub-minute latency. True streaming ingestion suits scenarios requiring sub-second latency but comes at premium pricing due to continuous resource consumption and the need for separate, specialized data platform components. The price gap can surprise less technical stakeholders, but loading every second means running &lt;em&gt;300 times&lt;/em&gt; as often as loading every 5 minutes, so it should make intuitive sense that a different approach is required. Evaluate whether business requirements genuinely need real-time data or if event-triggered or micro-batch processing satisfies use cases at a fraction of the cost. Many “real-time” dashboards function perfectly well with 5-15 minute refresh intervals while reducing infrastructure expenses significantly.&lt;/p&gt;&lt;h3 id=&quot;optimal-scheduling&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#optimal-scheduling&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Optimal Scheduling&lt;/strong&gt;:&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;When scheduling batch jobs (instead of event-based processing), run them during off-peak hours to reduce slot contention and improve query performance. 
Distribute pipeline start times based on data dependencies and SLAs to avoid unnecessary resource spikes.&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal15&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/iLWdkcRQW1-960.webp 960w, https://thescalableway.com/img/iLWdkcRQW1-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/iLWdkcRQW1-960.jpeg&quot; alt=&quot;optimal scheduling&quot; width=&quot;1600&quot; height=&quot;642&quot; srcset=&quot;https://thescalableway.com/img/iLWdkcRQW1-960.jpeg 960w, https://thescalableway.com/img/iLWdkcRQW1-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;15&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/iLWdkcRQW1-960.webp 960w, https://thescalableway.com/img/iLWdkcRQW1-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/iLWdkcRQW1-960.jpeg&quot; alt=&quot;optimal scheduling&quot; width=&quot;1600&quot; height=&quot;642&quot; srcset=&quot;https://thescalableway.com/img/iLWdkcRQW1-960.jpeg 960w, https://thescalableway.com/img/iLWdkcRQW1-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;h2 id=&quot;cloud-resource-management-and-cost-control&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#cloud-resource-management-and-cost-control&quot; class=&quot;heading-anchor&quot;&gt;Cloud Resource Management and Cost Control&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Infrastructure typically 
represents one of the largest expenses in a data platform, yet many organizations run their warehouses at default configurations. Smart resource management requires understanding workload patterns and matching the right pricing model to each use case.&lt;/p&gt;&lt;h3 id=&quot;reserved-capacity-strategy&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#reserved-capacity-strategy&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Reserved Capacity Strategy&lt;/strong&gt;:&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Where possible, commit to reserved capacity for predictable, steady-state workloads. Committed slot reservations in BigQuery can save 40-60% compared to on-demand pricing for high-usage scenarios (&lt;a href=&quot;https://cloud.google.com/bigquery/docs/reservations-tasks&quot; rel=&quot;noopener&quot;&gt;docs&lt;/a&gt;). Redshift Reserved Instances provide up to 62.5% savings with 1-year or 3-year commitments. Analyze your past 3-6 months of usage to identify the minimum baseline capacity required during low-demand periods; this becomes your reserved capacity floor. Configure auto-scaling on top of this baseline to handle variable workloads and peak demand, so that you pay for additional resources only when they are actually needed rather than over-provisioning for worst-case scenarios. This hybrid approach balances cost predictability through reservations with elasticity through auto-scaling, optimizing both budget and performance. 
Below you can find an example pricing discount for Redshift Reserved Instances:&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal16&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/W8Urxoh2o3-960.webp 960w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/W8Urxoh2o3-960.jpeg&quot; alt=&quot;Reserved Capacity Strategy&quot; width=&quot;960&quot; height=&quot;477&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;16&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/W8Urxoh2o3-960.webp 960w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/W8Urxoh2o3-960.jpeg&quot; alt=&quot;Reserved Capacity Strategy&quot; width=&quot;960&quot; height=&quot;477&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;h3 id=&quot;auto-scaling-configuration&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#auto-scaling-configuration&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Auto-Scaling Configuration&lt;/strong&gt;:&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Configure auto-scaling for workloads outside your core data warehouse. For transformation engines and orchestration runners, implement horizontal scaling based on queue depth or CPU utilization. 
Set aggressive scale-down policies during off-hours when data pipelines experience minimal activity, automatically capturing savings rather than running fixed capacity 24/7.&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal17&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/cjAAOnHIED-960.webp 960w, https://thescalableway.com/img/cjAAOnHIED-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/cjAAOnHIED-960.jpeg&quot; alt=&quot;autoscaling&quot; width=&quot;1600&quot; height=&quot;902&quot; srcset=&quot;https://thescalableway.com/img/cjAAOnHIED-960.jpeg 960w, https://thescalableway.com/img/cjAAOnHIED-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;17&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/cjAAOnHIED-960.webp 960w, https://thescalableway.com/img/cjAAOnHIED-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/cjAAOnHIED-960.jpeg&quot; alt=&quot;autoscaling&quot; width=&quot;1600&quot; height=&quot;902&quot; srcset=&quot;https://thescalableway.com/img/cjAAOnHIED-960.jpeg 960w, https://thescalableway.com/img/cjAAOnHIED-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal18&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/MoMOuxwKRR-960.webp 960w, https://thescalableway.com/img/MoMOuxwKRR-1600.webp 1600w&quot; 
sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/MoMOuxwKRR-960.jpeg&quot; alt=&quot;autoscaling&quot; width=&quot;1600&quot; height=&quot;901&quot; srcset=&quot;https://thescalableway.com/img/MoMOuxwKRR-960.jpeg 960w, https://thescalableway.com/img/MoMOuxwKRR-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;18&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/MoMOuxwKRR-960.webp 960w, https://thescalableway.com/img/MoMOuxwKRR-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/MoMOuxwKRR-960.jpeg&quot; alt=&quot;autoscaling&quot; width=&quot;1600&quot; height=&quot;901&quot; srcset=&quot;https://thescalableway.com/img/MoMOuxwKRR-960.jpeg 960w, https://thescalableway.com/img/MoMOuxwKRR-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;h3 id=&quot;spot-instance-usage&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#spot-instance-usage&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Spot Instance Usage&lt;/strong&gt;:&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Use spot instances (AWS) or preemptible VMs (GCP) for fault-tolerant batch workloads at 60-90% discounts. Historical data backfills, large-scale data quality checks, and experimental analytics workloads run well on spot capacity. 
Implement retry logic and design jobs to checkpoint progress periodically. Avoid spot capacity only for real-time pipelines and time-sensitive reporting.&lt;/p&gt;&lt;h2 id=&quot;data-access-control-strategy&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#data-access-control-strategy&quot; class=&quot;heading-anchor&quot;&gt;Data Access Control Strategy&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;The way users access data significantly impacts platform costs. Centralized architectures push all queries through a single warehouse, creating bottlenecks and driving up compute usage.&lt;/p&gt;&lt;h3 id=&quot;self-service-vs-centralized-reporting&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#self-service-vs-centralized-reporting&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Self-Service vs. Centralized Reporting&lt;/strong&gt;:&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Self-service analytics tools empower users but can lead to inefficient ad-hoc queries that consume excessive resources. Balance this by providing pre-aggregated datasets, semantic layers, and curated data marts for common analysis patterns, backed by proper user training. Centralized reporting through scheduled dashboards and reports consolidates repetitive queries into single executions shared across users.&lt;/p&gt;&lt;h3 id=&quot;aggregation-layers&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#aggregation-layers&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Aggregation Layers&lt;/strong&gt;:&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Create data summarization and aggregation layers that pre-compute common metrics at various granularities. 
Daily, weekly, and monthly aggregation tables reduce the need to scan massive fact tables repeatedly. Users query these optimized datasets instead of raw transaction data, dramatically reducing bytes processed and query execution times.&lt;/p&gt;&lt;h3 id=&quot;access-control-patterns&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#access-control-patterns&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Access Control Patterns&lt;/strong&gt;:&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Implement role-based access control in order to limit expensive query executions. Grant read-only access to pre-aggregated views for most analysts while restricting raw table access to data engineers who understand optimization techniques. Configure query timeout limits and result set size restrictions to prevent runaway queries from consuming excessive resources.&lt;/p&gt;&lt;h2 id=&quot;storage-optimization-and-data-lifecycle-management&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#storage-optimization-and-data-lifecycle-management&quot; class=&quot;heading-anchor&quot;&gt;Storage Optimization and Data Lifecycle Management&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Storage often accounts for 30-40% of total data platform costs, yet receives less attention than compute optimization. 
Smart storage strategies can deliver major savings with minimal operational impact.&lt;/p&gt;&lt;h3 id=&quot;data-lifecycle-management&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#data-lifecycle-management&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Data Lifecycle Management&lt;/strong&gt;:&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Automate data tiering into hot/warm/cold storage based on data age and access patterns. BigQuery offers long-term storage pricing (50% discount) for tables or partitions not modified for 90 consecutive days. Redshift RA3 nodes use managed storage that automatically offloads data to S3 once it outgrows local SSD capacity. Design lifecycle policies around actual business needs instead of keeping everything in expensive hot storage.&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal19&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/MPEWR1We6P-960.webp 960w, https://thescalableway.com/img/MPEWR1We6P-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/MPEWR1We6P-960.jpeg&quot; alt=&quot;data lifecycle management&quot; width=&quot;1600&quot; height=&quot;653&quot; srcset=&quot;https://thescalableway.com/img/MPEWR1We6P-960.jpeg 960w, https://thescalableway.com/img/MPEWR1We6P-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;19&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/MPEWR1We6P-960.webp 960w, https://thescalableway.com/img/MPEWR1We6P-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; 
src=&quot;https://thescalableway.com/img/MPEWR1We6P-960.jpeg&quot; alt=&quot;data lifecycle management&quot; width=&quot;1600&quot; height=&quot;653&quot; srcset=&quot;https://thescalableway.com/img/MPEWR1We6P-960.jpeg 960w, https://thescalableway.com/img/MPEWR1We6P-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;h3 id=&quot;compression-and-encoding&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#compression-and-encoding&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Compression and Encoding&lt;/strong&gt;:&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Apply appropriate compression and encoding techniques for your data types. Redshift offers multiple encoding options (LZO, ZSTD, Byte-dictionary) that can reduce storage by 70-90% while improving query performance through reduced I/O. BigQuery automatically compresses data, but choosing appropriate data types (INT64 vs. STRING for numeric IDs) impacts compressed size significantly.&lt;/p&gt;&lt;h3 id=&quot;time-travel-for-point-in-time-analysis&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#time-travel-for-point-in-time-analysis&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Time Travel for Point-in-Time Analysis:&lt;/strong&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Replace expensive SCD Type 2 dimension tables with native time travel capabilities where available. Traditional SCD Type 2 implementations duplicate dimension records for every change, creating surrogate keys, effective dates, and current flags that bloat storage and slow query performance as tables grow. 
BigQuery’s time travel feature allows querying data as it existed up to 7 days ago without maintaining separate historical versions, while extended time travel (up to 90 days) is available through snapshot configuration. Data lakehouse formats like Delta Lake, Apache Iceberg, and Apache Hudi provide similar capabilities through native versioning, enabling point-in-time queries without the complexity and storage costs of maintaining full SCD pipelines. For dimensions with frequent changes, time travel can reduce storage by 60-80% compared to SCD Type 2 while simplifying ETL logic and improving join performance by eliminating multi-version dimension lookups. Evaluate time travel as the default pattern for historical analysis, reserving SCD Type 2 only for regulatory requirements demanding explicit audit trails beyond platform retention windows.&lt;/p&gt;&lt;h2 id=&quot;conclusion&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/data-platform-cost-optimization-practical-strategies-for-query-performance-storage-and-cloud-resource-management/#conclusion&quot; class=&quot;heading-anchor&quot;&gt;Conclusion&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Cost optimization for data platforms requires a balanced approach across query performance, data loading, infrastructure, access, and storage. No single change solves everything, but applying several targeted strategies can reduce expenses by 40–60% while maintaining or even improving performance.&lt;/p&gt;&lt;p&gt;Start by identifying your biggest cost drivers through monitoring tools, then focus on the highest-impact areas first. Quick wins like caching, partitioning, and incremental loading deliver immediate savings, while longer-term efforts like lifecycle management and data mesh adoption provide sustainable efficiency.&lt;/p&gt;&lt;p&gt;By applying these techniques consistently, data teams can build platforms that scale intelligently, delivering both performance and value.&lt;/p&gt;&lt;/div&gt;
 			</content>
    </entry><entry>
      <title>SAP Data Ingestion with Python: A Technical Breakdown of Using the SAP RFC Protocol</title>
      <link href="https://thescalableway.com/blog/sap-data-ingestion-with-python-a-technical-breakdown-of-using-the-sap-rfc-protocol/" />
      <updated>2025-09-15T08:00:00Z</updated>
      <id>https://thescalableway.com/blog/sap-data-ingestion-with-python-a-technical-breakdown-of-using-the-sap-rfc-protocol/</id>
      <content type="html">
				&lt;nav id=&quot;toc&quot; class=&quot;table-of-contents prose&quot;&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/sap-data-ingestion-with-python-a-technical-breakdown-of-using-the-sap-rfc-protocol/#can-you-give-us-a-general-overview-of-how-sap-rfc-works&quot;&gt;Can you give us a general overview of how SAP RFC works?&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/sap-data-ingestion-with-python-a-technical-breakdown-of-using-the-sap-rfc-protocol/#what-are-some-of-the-challenges-of-using-sap-rfc-to-ingest-data-with-python&quot;&gt;What are some of the challenges of using SAP RFC to ingest data with Python?&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/sap-data-ingestion-with-python-a-technical-breakdown-of-using-the-sap-rfc-protocol/#can-you-explain-how-you-interface-a-c-library-with-python&quot;&gt;Can you explain how you interface a C++ library with Python?&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/sap-data-ingestion-with-python-a-technical-breakdown-of-using-the-sap-rfc-protocol/#how-can-ingestion-speed-be-optimized&quot;&gt;How can ingestion speed be optimized?&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/sap-data-ingestion-with-python-a-technical-breakdown-of-using-the-sap-rfc-protocol/#why-do-you-consider-pyrfc-slow-when-it-comes-to-data-ingestion&quot;&gt;Why do you consider pyRFC “slow” when it comes to data ingestion?&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/sap-data-ingestion-with-python-a-technical-breakdown-of-using-the-sap-rfc-protocol/#can-incremental-ingestions-be-done-using-sap-rfc-if-yes-how&quot;&gt;Can incremental ingestions be done using SAP RFC? 
If yes, how?&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/sap-data-ingestion-with-python-a-technical-breakdown-of-using-the-sap-rfc-protocol/#can-you-give-us-a-couple-of-code-examples-in-python-on-ingesting-the-data&quot;&gt;Can you give us a couple of code examples in Python on ingesting the data?&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/sap-data-ingestion-with-python-a-technical-breakdown-of-using-the-sap-rfc-protocol/#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/nav&gt;&lt;p&gt;&lt;span id=&quot;toc-skipped&quot; class=&quot;visually-hidden&quot;&gt;&lt;/span&gt;&lt;/p&gt;&lt;div class=&quot;flow prose&quot;&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;If you’ve ever tried to get data out of SAP, you know it’s often easier said than done. Manual exports take up a lot of time, standard tools don’t always fit real needs, and moving data into modern analytics or machine learning pipelines can feel like forcing two worlds together.&lt;/p&gt;&lt;p&gt;For years, many teams relied on pyRFC, the official Python library for SAP Remote Function Calls. But with pyRFC being decommissioned, there’s now a real gap for anyone who wants to keep integrating SAP data into Python workflows.&lt;/p&gt;&lt;p&gt;That’s why we’re working on a new Python library to replace pyRFC, focused on making SAP data ingestion easier, more reliable, and better suited for modern data workflows. A connector like this can directly call SAP functions, pull large tables in manageable chunks, and plug that data into Python pipelines. 
For data engineers, analysts, and developers, this isn’t about high-end tech for its own sake; it’s about saving time, reducing errors, and finally being able to use SAP data in the tools we already work with every day.&lt;/p&gt;&lt;p&gt;To dig into how this actually works, I’ve talked with Dominik ‒ a senior software engineer with more than ten years of experience, a computer science lecturer, and the tech lead of this project. In this conversation, he shares his knowledge, best practices, and lessons learned while building an SAP connector in Python using the RFC protocol.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;______&lt;/strong&gt;&lt;/p&gt;&lt;h4 id=&quot;can-you-give-us-a-general-overview-of-how-sap-rfc-works&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/sap-data-ingestion-with-python-a-technical-breakdown-of-using-the-sap-rfc-protocol/#can-you-give-us-a-general-overview-of-how-sap-rfc-works&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Can you give us a general overview of how SAP RFC works?&lt;/strong&gt;&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;SAP Remote Function Call is a communication mechanism that allows one system to execute functions in another system as if they were local calls. It is primarily used to integrate different SAP modules, connect SAP with external applications, and support distributed processing. RFC works by exposing specific function modules in SAP that are marked as “remote-enabled,” meaning they can be invoked across system boundaries. When an RFC is triggered, the caller system packages the request, sends it over the network, and waits for a response (in synchronous mode) or continues processing without waiting (in asynchronous mode). 
This approach provides a standardized and reliable way for SAP systems and external programs to exchange data and trigger business logic, making RFC a cornerstone of SAP interoperability.&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal54&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/CCj0qkhJii-960.webp 960w, https://thescalableway.com/img/CCj0qkhJii-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/CCj0qkhJii-960.jpeg&quot; alt=&quot;sap rfc communication&quot; width=&quot;1600&quot; height=&quot;656&quot; srcset=&quot;https://thescalableway.com/img/CCj0qkhJii-960.jpeg 960w, https://thescalableway.com/img/CCj0qkhJii-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;54&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/CCj0qkhJii-960.webp 960w, https://thescalableway.com/img/CCj0qkhJii-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/CCj0qkhJii-960.jpeg&quot; alt=&quot;sap rfc communication&quot; width=&quot;1600&quot; height=&quot;656&quot; srcset=&quot;https://thescalableway.com/img/CCj0qkhJii-960.jpeg 960w, https://thescalableway.com/img/CCj0qkhJii-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;h4 id=&quot;what-are-some-of-the-challenges-of-using-sap-rfc-to-ingest-data-with-python&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/sap-data-ingestion-with-python-a-technical-breakdown-of-using-the-sap-rfc-protocol/#what-are-some-of-the-challenges-of-using-sap-rfc-to-ingest-data-with-python&quot; 
class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;What are some of the challenges of using SAP RFC to ingest data with Python?&lt;/strong&gt;&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;One of the main challenges of using SAP RFC for ingesting data with Python is handling the very large data volumes that SAP tables can produce. RFC itself also comes with a number of technical limitations and operational concerns:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Row size limitations&lt;/strong&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;The 512-character per-row restriction in RFC_READ_TABLE means wide tables with many columns need to be split across multiple queries.&lt;/li&gt;&lt;li&gt;Reconstructing the full row in Python requires careful mapping of column segments.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Volume and performance&lt;/strong&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;Extracting millions of rows can be slow and may degrade SAP system performance if not throttled.&lt;/li&gt;&lt;li&gt;Network latency and RFC protocol overhead can become bottlenecks for very large datasets.&lt;/li&gt;&lt;li&gt;Lack of streaming support means all results are buffered in memory, risking &lt;strong&gt;high RAM usage&lt;/strong&gt; in Python.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pagination and batching&lt;/strong&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;Since RFC has no built-in pagination, developers need to implement logic to fetch data in smaller chunks.&lt;/li&gt;&lt;li&gt;This requires careful handling of row offsets and consistency to avoid duplicates or missing records.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Data type handling&lt;/strong&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;SAP tables contain many proprietary data types (e.g., RAW, DEC, DATS, TIMS) that need explicit conversion to Python types.&lt;/li&gt;&lt;li&gt;Inconsistent formatting (e.g., leading zeros, fixed-length fields) can require custom parsing 
logic.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Error handling and robustness&lt;/strong&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;Large queries can lead to timeouts or aborted RFC sessions.&lt;/li&gt;&lt;li&gt;Error messages from SAP may be cryptic and require domain knowledge to interpret.&lt;/li&gt;&lt;li&gt;Retry logic and fault tolerance are not built in and must be handled in the Python layer.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Security and access restrictions&lt;/strong&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;Not all tables are directly accessible due to SAP authorization profiles.&lt;/li&gt;&lt;li&gt;RFC users often have limited permissions, which may block some required data.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Alternative interfaces&lt;/strong&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;RFC_READ_TABLE is convenient but not officially intended for large-scale data extraction.&lt;/li&gt;&lt;li&gt;In some cases, more efficient solutions (e.g., SAP OData services, CDS views, or custom ABAP reports) may be required instead of RFC.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Connection/session handling&lt;/strong&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;With Python’s pyrfc, each call often opens a new RFC session, and connections can drop unexpectedly if not managed carefully. This leads to overhead in session initialization and can cause instability in long-running jobs.&lt;/li&gt;&lt;li&gt;A custom C++ connector can maintain a &lt;strong&gt;persistent session&lt;/strong&gt; without frequent disconnects, providing more stability and efficiency for large-scale or continuous data ingestion.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;This means developers must implement batching, efficient memory handling, and sometimes parallelization to make ingestion practical. 
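&lt;/p&gt;&lt;p&gt;As a rough illustration of the batching idea, here is a minimal sketch that pages through a table using the ROWSKIPS and ROWCOUNT parameters of RFC_READ_TABLE. It is not the connector’s actual code: read_table_in_batches and fake_call are hypothetical names introduced for this example, and call is assumed to be any callable that takes a function name plus keyword parameters and returns a dict.&lt;/p&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;from typing import Any, Callable


def read_table_in_batches(
    call: Callable[..., dict[str, Any]],
    table: str,
    fields: list[str],
    batch_size: int = 50_000,
    sep: str = "|",
):
    """Yield rows of `table` as lists of strings, one RFC call per batch."""
    offset = 0
    while True:
        response = call(
            "RFC_READ_TABLE",
            QUERY_TABLE=table,
            DELIMITER=sep,
            FIELDS=[{"FIELDNAME": name} for name in fields],
            ROWSKIPS=offset,      # rows to skip before this batch
            ROWCOUNT=batch_size,  # maximum rows returned by this call
        )
        rows = response.get("DATA", [])
        for row in rows:
            # Each row arrives as one delimited string in the "WA" field.
            yield [value.strip() for value in row["WA"].split(sep)]
        # A short (or empty) batch means the table is exhausted.
        if len(rows) != batch_size:
            break
        offset += batch_size


def fake_call(func: str, **params: Any) -- dict:
    """Hypothetical in-memory stand-in for a real connector (local testing only)."""
    table = [{"WA": f"{i:010d}|FERT"} for i in range(7)]
    start = params["ROWSKIPS"]
    return {"DATA": table[start:start + params["ROWCOUNT"]]}


rows = list(read_table_in_batches(fake_call, "MARA", ["MATNR", "MTART"], batch_size=3))
print(len(rows))&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Because rows are yielded batch by batch, the Python side only needs to hold one batch at a time. Offset-based paging does not protect against rows being inserted or deleted between calls, so a stable filter or key range is still needed for consistency.&lt;/p&gt;&lt;p&gt;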
Without these techniques, performance can degrade quickly, and data may be truncated or lost during transfer, making large-scale SAP-to-Python integration non-trivial.&lt;/p&gt;&lt;h4 id=&quot;can-you-explain-how-you-interface-a-c-library-with-python&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/sap-data-ingestion-with-python-a-technical-breakdown-of-using-the-sap-rfc-protocol/#can-you-explain-how-you-interface-a-c-library-with-python&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Can you explain how you interface a C++ library with Python?&lt;/strong&gt;&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;Interfacing a C++ library with Python involves creating a thin wrapper layer that makes the C++ functions and classes look like native Python objects. To achieve that, we use &lt;strong&gt;pybind11&lt;/strong&gt;, which handles this translation by generating a Python extension module that directly calls into the compiled C++ code. Once built, the module can be imported into Python just like any other package, allowing Python code to invoke high-performance C++ logic seamlessly. This approach avoids costly inter-process communication and provides a clean way to combine Python’s flexibility with the speed and efficiency of C++.&lt;/p&gt;&lt;h4 id=&quot;how-can-ingestion-speed-be-optimized&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/sap-data-ingestion-with-python-a-technical-breakdown-of-using-the-sap-rfc-protocol/#how-can-ingestion-speed-be-optimized&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;How can ingestion speed be optimized?&lt;/strong&gt;&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;Ingestion speed from SAP via RFC can be optimized by designing smart calls that minimize the number of round-trip calls to the SAP system. Instead of pulling entire tables blindly, it’s often more efficient to filter data at the source, select only the required columns, and batch large queries into manageable but sizable chunks. 
This reduces overhead and avoids overwhelming the Python client with too many small calls. Another common strategy is to push down as much logic as possible into SAP—using where clauses, ranges, or custom remote-enabled function modules—so that Python only receives the data that is truly needed. By shrinking the number of SAP calls in this way, the integration pipeline becomes more efficient, faster, and less resource-intensive on both ends.&lt;/p&gt;&lt;h4 id=&quot;why-do-you-consider-pyrfc-slow-when-it-comes-to-data-ingestion&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/sap-data-ingestion-with-python-a-technical-breakdown-of-using-the-sap-rfc-protocol/#why-do-you-consider-pyrfc-slow-when-it-comes-to-data-ingestion&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Why do you consider pyRFC “slow” when it comes to data ingestion?&lt;/strong&gt;&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;pyRFC is not inherently “slow” for small to medium RFC calls, but it becomes inefficient when used for large-scale data ingestion from SAP tables. Several factors contribute to this:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Session management overhead&lt;/strong&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;pyRFC often opens and tears down RFC sessions per call, rather than maintaining a long-lived persistent session. This adds noticeable latency when thousands of calls are required for wide or paginated tables.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Row size and query splitting&lt;/strong&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;Because of the 512-character row size limitation in RFC_READ_TABLE, wide tables must be split into multiple queries. 
In pyRFC, reconstructing results requires extra calls and significant Python-side processing, slowing ingestion.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Serialization and conversion costs&lt;/strong&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;SAP data types (e.g., packed decimals, dates, times, raw fields) must be converted into Python objects. This conversion layer, implemented in Python, adds overhead compared to a native C++ implementation.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Global Interpreter Lock (GIL)&lt;/strong&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;Python’s GIL prevents true multithreaded parallel RFC calls within the same process. This limits scalability for high-throughput extraction workloads unless you resort to multiprocessing (which adds its own overhead). We are trying to sidestep this problem by using a C++ library.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Error recovery and retries&lt;/strong&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;pyRFC connections may drop unexpectedly under load, requiring reconnections and retries. 
This increases latency compared to a C++ connector that maintains stable sessions.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;In contrast, a native C++ RFC client avoids much of this overhead by:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;Keeping a persistent connection/session.&lt;/li&gt;&lt;li&gt;Handling large data more efficiently with lower-level memory management.&lt;/li&gt;&lt;li&gt;Offering faster type conversions without Python’s object allocation overhead.&lt;/li&gt;&lt;/ul&gt;&lt;h4 id=&quot;can-incremental-ingestions-be-done-using-sap-rfc-if-yes-how&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/sap-data-ingestion-with-python-a-technical-breakdown-of-using-the-sap-rfc-protocol/#can-incremental-ingestions-be-done-using-sap-rfc-if-yes-how&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Can incremental ingestions be done using SAP RFC? If yes, how?&lt;/strong&gt;&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;Incremental ingestions with SAP RFC are not straightforward, because RFC itself is just a transport mechanism and doesn’t provide built-in change tracking. In most cases, you cannot simply ask RFC for “only new or updated records” unless the underlying SAP function module or table has fields that can support this, such as timestamps or change indicators. If those exist, you can design your RFC queries in Python to fetch only rows newer than the last ingestion run, effectively simulating incremental loading. Otherwise, true incremental ingestion is better handled through SAP’s dedicated frameworks like ODP (Operational Data Provisioning) or CDS views, which are specifically designed for delta handling. 
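&lt;/p&gt;&lt;p&gt;To make the timestamp-based approach more concrete, here is a minimal sketch of a watermark filter for RFC_READ_TABLE. The helper names build_delta_options and next_watermark are hypothetical, and the AEDAT column is only an example: which change-date field exists, and whether it is reliably maintained, depends on the table.&lt;/p&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;from datetime import date, datetime


def build_delta_options(date_field: str, last_run: date):
    """Return RFC_READ_TABLE OPTIONS rows selecting records changed on or after `last_run`."""
    # SAP DATS values are YYYYMMDD strings; GE is the ABAP "greater than or equal" operator.
    watermark = last_run.strftime("%Y%m%d")
    return [{"TEXT": f"{date_field} GE '{watermark}'"}]


def next_watermark(rows: list[list[str]], date_index: int, previous: date):
    """Return the newest change date seen in `rows`, or `previous` if nothing arrived."""
    dates = [datetime.strptime(row[date_index], "%Y%m%d").date() for row in rows]
    return max(dates, default=previous)


last_run = date(2025, 9, 1)
options = build_delta_options("AEDAT", last_run)
print(options)

# Rows as they might look after splitting the delimited "WA" strings (made-up values).
fetched = [["0000000001", "20250903"], ["0000000002", "20250910"]]
print(next_watermark(fetched, 1, last_run))&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Since the filter uses GE, rows from the boundary day are read again on the next run, so the load step should be idempotent, for example by upserting on the table’s key.&lt;/p&gt;&lt;p&gt;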
So while basic RFC on its own doesn’t guarantee incremental ingestions, with careful design and the right SAP data sources, it can sometimes be approximated.&lt;/p&gt;&lt;h4 id=&quot;can-you-give-us-a-couple-of-code-examples-in-python-on-ingesting-the-data&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/sap-data-ingestion-with-python-a-technical-breakdown-of-using-the-sap-rfc-protocol/#can-you-give-us-a-couple-of-code-examples-in-python-on-ingesting-the-data&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Can you give us a couple of code examples in Python on ingesting the data?&lt;/strong&gt;&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;&lt;strong&gt;Connection:&lt;/strong&gt;&lt;/p&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;con&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;self&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&amp;gt;&lt;/span&gt; sap_rfc_connector&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;SapRfcConnector&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token triple-quoted-string string&quot;&gt;&quot;&quot;&quot;The C++ connection to SAP.&quot;&quot;&quot;&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; self&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;_con &lt;span class=&quot;token keyword&quot;&gt;is&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;not&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; self&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;_con
    con &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; sap_rfc_connector&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;SapRfcConnector&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;**&lt;/span&gt;self&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;credentials&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    self&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;_con &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; con
    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; con&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;strong&gt;Call:&lt;/strong&gt;&lt;/p&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;call&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;self&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; func&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt;args&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;**&lt;/span&gt;kwargs&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;dict&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; Any&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token triple-quoted-string string&quot;&gt;&quot;&quot;&quot;Call a SAP RFC function.&quot;&quot;&quot;&lt;/span&gt;
    func_caller &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; sap_rfc_connector&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;SapFunctionCaller&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;self&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;con&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    result &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; func_caller&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;smart_call&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;func&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt;args&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;**&lt;/span&gt;kwargs&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; result&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;strong&gt;One of the approaches to avoid pandas for the data ingestion (POC):&lt;/strong&gt;&lt;/p&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token comment&quot;&gt;# Check and skip if there is no data returned.&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;try&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; response&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;DATA&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
        logger&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;debug&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;checking data&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; print_regular&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
            print_regular&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;checking data&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        record_key &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;WA&quot;&lt;/span&gt;
        data_raw &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; np&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;array&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;response&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;DATA&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        
        &lt;span class=&quot;token comment&quot;&gt;# Save raw data to CSV file immediately&lt;/span&gt;
        logger&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;debug&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string-interpolation&quot;&gt;&lt;span class=&quot;token string&quot;&gt;f&quot;Saving &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;data_raw&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt; rows to CSV file...&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; print_regular&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
            print_regular&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string-interpolation&quot;&gt;&lt;span class=&quot;token string&quot;&gt;f&quot;Saving &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;data_raw&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt; rows to CSV file...&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;open&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;temp_csv_path&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&#39;w&#39;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; newline&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&#39;&#39;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; encoding&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&#39;utf-8&#39;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;as&lt;/span&gt; csvfile&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
            writer &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; csv&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;writer&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;csvfile&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;token comment&quot;&gt;# Write header&lt;/span&gt;
            writer&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;writerow&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;fields&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;token comment&quot;&gt;# Write data rows&lt;/span&gt;
            &lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; row_data &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; data_raw&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
                &lt;span class=&quot;token keyword&quot;&gt;try&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
                    split_data &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; row_data&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;record_key&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;split&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;sep&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
                    &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;split_data&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;fields&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
                        writer&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;writerow&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;split_data&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
                    &lt;span class=&quot;token keyword&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
                        logger&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;warning&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string-interpolation&quot;&gt;&lt;span class=&quot;token string&quot;&gt;f&quot;Row data length mismatch: expected &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;fields&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;, got &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;split_data&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
                &lt;span class=&quot;token keyword&quot;&gt;except&lt;/span&gt; Exception &lt;span class=&quot;token keyword&quot;&gt;as&lt;/span&gt; e&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
                    logger&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;error&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string-interpolation&quot;&gt;&lt;span class=&quot;token string&quot;&gt;f&quot;Error processing row: &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;e&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
                    &lt;span class=&quot;token keyword&quot;&gt;continue&lt;/span&gt;
        
        logger&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;debug&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string-interpolation&quot;&gt;&lt;span class=&quot;token string&quot;&gt;f&quot;Saved data to &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;temp_csv_path&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; print_regular&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
            print_regular&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string-interpolation&quot;&gt;&lt;span class=&quot;token string&quot;&gt;f&quot;Saved data to &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;temp_csv_path&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;del&lt;/span&gt; response
        &lt;span class=&quot;token keyword&quot;&gt;del&lt;/span&gt; data_raw
&lt;span class=&quot;token keyword&quot;&gt;except&lt;/span&gt; Exception &lt;span class=&quot;token keyword&quot;&gt;as&lt;/span&gt; e&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    logger&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;error&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string-interpolation&quot;&gt;&lt;span class=&quot;token string&quot;&gt;f&quot;Error: &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;e&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; print_regular&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
        print_regular&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string-interpolation&quot;&gt;&lt;span class=&quot;token string&quot;&gt;f&quot;Error: &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;e&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;break&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;strong&gt;_________&lt;/strong&gt;&lt;/p&gt;&lt;h4 id=&quot;conclusion&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/sap-data-ingestion-with-python-a-technical-breakdown-of-using-the-sap-rfc-protocol/#conclusion&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;Dominik’s perspective makes it clear that working with SAP data through RFC is full of opportunities, but far from straightforward. From row size limits and type conversions to performance tuning and incremental loading, every step has its own challenges. His approach ‒ combining Python with a C++ connector, careful batching, and smart query design ‒ shows what it takes to make ingestion both reliable and efficient.&lt;/p&gt;&lt;p&gt;With pyRFC being decommissioned, these insights feel especially timely. They point to what’s needed in the next generation of connectors: tools that handle scale gracefully, integrate naturally into Python workflows, and make SAP data easier to work with daily.&lt;/p&gt;&lt;/div&gt;
 			</content>
    </entry><entry>
      <title>CI/CD for Data Workflows: Automating Prefect Deployments with GitHub Actions</title>
      <link href="https://thescalableway.com/blog/ci/cd-for-data-workflows-automating-prefect-deployments-with-github-actions/" />
      <updated>2025-07-03T11:00:00Z</updated>
      <id>https://thescalableway.com/blog/ci/cd-for-data-workflows-automating-prefect-deployments-with-github-actions/</id>
      <content type="html">
				&lt;nav id=&quot;toc&quot; class=&quot;table-of-contents prose&quot;&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/ci/cd-for-data-workflows-automating-prefect-deployments-with-github-actions/#lets-recap&quot;&gt;Let’s recap…&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/ci/cd-for-data-workflows-automating-prefect-deployments-with-github-actions/#ci/cd-foundations-for-data-platforms&quot;&gt;CI/CD Foundations for Data Platforms&lt;/a&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/ci/cd-for-data-workflows-automating-prefect-deployments-with-github-actions/#why-is-ci/cd-so-important&quot;&gt;Why is CI/CD so important?&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/ci/cd-for-data-workflows-automating-prefect-deployments-with-github-actions/#what-makes-data-pipeline-ci/cd-different-from-traditional-software&quot;&gt;What makes data pipeline CI/CD different from traditional software?&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/ci/cd-for-data-workflows-automating-prefect-deployments-with-github-actions/#github-ci/cd-workflows-for-a-data-platform&quot;&gt;GitHub CI/CD Workflows for a Data Platform&lt;/a&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/ci/cd-for-data-workflows-automating-prefect-deployments-with-github-actions/#repository-structure&quot;&gt;Repository structure&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/ci/cd-for-data-workflows-automating-prefect-deployments-with-github-actions/#ci/cd-workflows-overview&quot;&gt;CI/CD Workflows overview&lt;/a&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a 
href=&quot;https://thescalableway.com/blog/ci/cd-for-data-workflows-automating-prefect-deployments-with-github-actions/#workflow-1-flows-image-builder&quot;&gt;Workflow 1: Flows Image Builder&lt;/a&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/ci/cd-for-data-workflows-automating-prefect-deployments-with-github-actions/#pull-request-steps&quot;&gt;Pull request steps&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/ci/cd-for-data-workflows-automating-prefect-deployments-with-github-actions/#after-pr-marge&quot;&gt;After PR Merge&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/ci/cd-for-data-workflows-automating-prefect-deployments-with-github-actions/#workflow-2-prefect-worker-updates&quot;&gt;Workflow 2: Prefect Worker Updates&lt;/a&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/ci/cd-for-data-workflows-automating-prefect-deployments-with-github-actions/#pull-request-steps-1&quot;&gt;Pull Request Steps&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/ci/cd-for-data-workflows-automating-prefect-deployments-with-github-actions/#after-pr-merge&quot;&gt;After PR Merge&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/ci/cd-for-data-workflows-automating-prefect-deployments-with-github-actions/#workflow-3-prefect-deployment-orchestration&quot;&gt;Workflow 3: Prefect Deployment Orchestration&lt;/a&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/ci/cd-for-data-workflows-automating-prefect-deployments-with-github-actions/#pull-request-steps-2&quot;&gt;Pull Request Steps&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a 
href=&quot;https://thescalableway.com/blog/ci/cd-for-data-workflows-automating-prefect-deployments-with-github-actions/#after-pr-merge-1&quot;&gt;After PR Merge&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/ci/cd-for-data-workflows-automating-prefect-deployments-with-github-actions/#conclusion-and-series-summary&quot;&gt;Conclusion &amp;amp; series summary&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/nav&gt;&lt;p&gt;&lt;span id=&quot;toc-skipped&quot; class=&quot;visually-hidden&quot;&gt;&lt;/span&gt;&lt;/p&gt;&lt;div class=&quot;flow prose&quot;&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;You’ve built a Prefect flow that runs, but you’re wondering what’s next?&lt;/p&gt;&lt;p&gt;If deploying it means copying files, running CLI commands, or manually registering deployments, you’re doing too much. In this guide, we’ll walk through how to automate Prefect deployments using GitHub Actions and Docker, so your flows move from dev to prod with zero manual steps. Cleaner workflows, fewer errors, and no more “did I forget to deploy that?” moments.&lt;/p&gt;&lt;p&gt;Welcome to Part 4 of our data platform series, where we bring automation and resilience to the forefront by introducing CI/CD for your data workflows. If you’ve followed along, you’ve seen how each layer builds on the last: from architecture through infrastructure to operational readiness. 
But just to be sure we’re on the same page…&lt;/p&gt;&lt;h2 id=&quot;lets-recap&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/ci/cd-for-data-workflows-automating-prefect-deployments-with-github-actions/#lets-recap&quot; class=&quot;heading-anchor&quot;&gt;Let’s recap…&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;In &lt;strong&gt;Part 1&lt;/strong&gt; (&lt;a href=&quot;https://thescalableway.com/blog/deploying-prefect-on-any-cloud-using-a-single-virtual-machine/&quot; rel=&quot;noopener&quot;&gt;Deploying Prefect on any Cloud Using a Single Virtual Machine&lt;/a&gt;), we explored the architectural foundations of a modern data platform. We discussed why simplicity, flexibility, and scalability matter, and how a lightweight Kubernetes setup on a single VM can deliver immediate value while laying the groundwork for future growth.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Part 2&lt;/strong&gt; (&lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/&quot; rel=&quot;noopener&quot;&gt;How to Setup Data Platform Infrastructure on Google Cloud Platform with Terraform&lt;/a&gt;) moved us from theory to practice, automating cloud infrastructure using Terraform. It emphasized cloud-agnostic design while preserving the architectural principles from the first article.&lt;/p&gt;&lt;p&gt;We operationalized the platform in &lt;strong&gt;Part 3&lt;/strong&gt; (&lt;a href=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/&quot; rel=&quot;noopener&quot;&gt;Getting to Your First Flow Run: Prefect Worker and Deployment Setup&lt;/a&gt;). You learned how to build a containerized execution environment, configure Prefect workers, and organize deployment code, culminating in your first successful data ingestion flow.&lt;/p&gt;&lt;p&gt;Now, in &lt;strong&gt;Part 4&lt;/strong&gt;, we turn our attention to automation. 
You’ll learn how to implement a robust CI/CD pipeline using GitHub Actions, tailored for data platforms. We’ll break down three essential types of workflows (covering container management, infrastructure updates, and workflow orchestration) that together form a scalable, resilient, and low-maintenance deployment system.&lt;/p&gt;&lt;p&gt;Whether your goal is to reduce manual intervention, boost reliability, or accelerate delivery, this final part will equip you with patterns and real-world guidance to make your data platform production-ready. Let’s dive in!&lt;/p&gt;&lt;h2 id=&quot;ci/cd-foundations-for-data-platforms&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/ci/cd-for-data-workflows-automating-prefect-deployments-with-github-actions/#ci/cd-foundations-for-data-platforms&quot; class=&quot;heading-anchor&quot;&gt;CI/CD Foundations for Data Platforms&lt;/a&gt;&lt;/h2&gt;&lt;h3 id=&quot;why-is-ci/cd-so-important&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/ci/cd-for-data-workflows-automating-prefect-deployments-with-github-actions/#why-is-ci/cd-so-important&quot; class=&quot;heading-anchor&quot;&gt;Why is CI/CD so important?&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;CI/CD (Continuous Integration and Continuous Deployment) is key to building reliable, scalable, and collaborative data platforms. Manual processes, such as registering deployments by hand or managing configurations outside of version control, can easily slow things down and lead to mistakes. Here’s how automation makes a big difference in modern data workflows:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Manual deployment steps are easy to mess up.&lt;/strong&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;When people manually update deployments or configurations, there’s a higher chance of errors, inconsistencies, or missed steps. 
Automation ensures each deployment follows the same steps every time, cutting down on mistakes and keeping things running smoothly.&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Without a shared codebase, it’s hard to stay aligned.&lt;/strong&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;If local development isn’t tied closely to a shared, version-controlled code repository, keeping track of changes, rolling back mistakes, or working well as a team is tough. CI/CD pipelines enforce the use of a central repository, making the entire platform transparent and auditable.&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Testing only locally hides problems.&lt;/strong&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Without automated workflows to provision and test in staging or development environments, teams often default to running tests locally. This limits visibility into how changes will behave in real-world conditions and increases the risk of production issues.&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Manual processes don’t scale well.&lt;/strong&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;As your data platform and team grow, manual processes quickly become bottlenecks. Automated CI/CD pipelines make it easier to bring on new team members, keep deployments consistent, and move faster, without losing quality or stability.&lt;/p&gt;&lt;p&gt;Setting up a strong CI/CD foundation creates a more stable, transparent, and scalable environment. 
It lets your data engineers spend more time delivering value and less time fixing avoidable problems.&lt;/p&gt;&lt;h3 id=&quot;what-makes-data-pipeline-ci/cd-different-from-traditional-software&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/ci/cd-for-data-workflows-automating-prefect-deployments-with-github-actions/#what-makes-data-pipeline-ci/cd-different-from-traditional-software&quot; class=&quot;heading-anchor&quot;&gt;What makes data pipeline CI/CD different from traditional software?&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;CI/CD for data platforms comes with its own set of challenges, especially compared to traditional software development. This is mainly because tools like Prefect act as orchestrators, running workflows inside separate, containerized environments. This setup introduces a few extra layers to manage:&lt;/p&gt;&lt;ol class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Different Workflows for Different Parts&lt;/strong&gt;&lt;/li&gt;&lt;/ol&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Docker Image Pipeline:&lt;/strong&gt; Dedicated workflow for building, testing, and deploying container images, including all the flow dependencies.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Prefect Worker Management:&lt;/strong&gt; The worker is the only long-running process, so it needs its own CI/CD for updates to the worker application itself and to the base job template used in flow runs.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Deployment Configuration:&lt;/strong&gt; Independent workflow for managing Prefect deployment definitions and versioning.&lt;/li&gt;&lt;/ul&gt;&lt;ol start=&quot;2&quot; class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;More Moving Parts to Coordinate&lt;/strong&gt;&lt;/li&gt;&lt;/ol&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;Each part of the system might have its own release schedule, so updates must be carefully timed to avoid breaking things.&lt;/li&gt;&lt;li&gt;The container environments used for development, testing, and production must match, or you risk 
inconsistencies and surprises.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Unlike in traditional software, where a single build moves through environments, data platforms rely on multiple components that each need to be managed separately. Because of this, automation has to be thoughtfully designed to keep everything in sync. Done right, it helps teams move faster without sacrificing reliability.&lt;/p&gt;&lt;h2 id=&quot;github-ci/cd-workflows-for-a-data-platform&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/ci/cd-for-data-workflows-automating-prefect-deployments-with-github-actions/#github-ci/cd-workflows-for-a-data-platform&quot; class=&quot;heading-anchor&quot;&gt;GitHub CI/CD Workflows for a Data Platform&lt;/a&gt;&lt;/h2&gt;&lt;h3 id=&quot;repository-structure&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/ci/cd-for-data-workflows-automating-prefect-deployments-with-github-actions/#repository-structure&quot; class=&quot;heading-anchor&quot;&gt;Repository structure&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Before diving into the workflows, here’s a quick look at the repository structure established so far. While this isn’t the complete repository, the following directories and files are the main ones that trigger CI/CD processes:&lt;/p&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;📦 repository
 ┣ 📂 .github
 ┃ ┣ 📂 workflows
 ┃ ┃ ┣ 📜 workflow-x.yml
 ┃ ┃ ┣ 📜 template-x.yml
 ┣ 📂 etc
 ┃ ┣ 📂 &lt;span class=&quot;token function&quot;&gt;docker&lt;/span&gt;
 ┃ ┃ ┣ 📜 Dockerfile
 ┃ ┣ 📂 helm_values
 ┃ ┃ ┣ 📂 prefect-worker
 ┃ ┃ ┃ ┣ 📜 values-dev.yaml
 ┃ ┃ ┃ ┣ 📜 values-prod.yaml
 ┣ 📂 src
 ┃ ┣ 📂 edp_flows
 ┃ ┃ ┣ 📂 flows
 ┃ ┃ ┃ ┣ 📜 flow-x.yml
 ┣ 📜 pyproject.toml
 ┗ 📜 prefect.yml&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&quot;ci/cd-workflows-overview&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/ci/cd-for-data-workflows-automating-prefect-deployments-with-github-actions/#ci/cd-workflows-overview&quot; class=&quot;heading-anchor&quot;&gt;CI/CD Workflows overview&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;With the structure in place, let’s walk through the &lt;strong&gt;three main workflows&lt;/strong&gt; that drive CI/CD for this data platform. The setup is designed to be portable rather than tied specifically to GitHub Actions. Its goals are to automate flow deployment with Prefect, avoid unnecessary Docker builds, and use Prefect’s GitHub Repository Block to manage flows cleanly. The three key workflows are:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Flows Image Workflow:&lt;/strong&gt; Triggered when updates to flow image dependencies are needed.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Prefect Worker Workflow:&lt;/strong&gt; Runs when changes are made to the base job template or worker configuration.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Prefect Deployments CI/CD:&lt;/strong&gt; Used for developing and managing new deployments and flows; this is the main workflow during development.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Let’s take a closer look at each of them.&lt;/p&gt;&lt;h4 id=&quot;workflow-1-flows-image-builder&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/ci/cd-for-data-workflows-automating-prefect-deployments-with-github-actions/#workflow-1-flows-image-builder&quot; class=&quot;heading-anchor&quot;&gt;Workflow 1: Flows Image Builder&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal10&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/2za6kUqSCH-960.webp 960w, https://thescalableway.com/img/2za6kUqSCH-1600.webp 1600w&quot; 
sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/2za6kUqSCH-960.jpeg&quot; alt=&quot;flows image builder&quot; width=&quot;1600&quot; height=&quot;748&quot; srcset=&quot;https://thescalableway.com/img/2za6kUqSCH-960.jpeg 960w, https://thescalableway.com/img/2za6kUqSCH-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;10&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/2za6kUqSCH-960.webp 960w, https://thescalableway.com/img/2za6kUqSCH-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/2za6kUqSCH-960.jpeg&quot; alt=&quot;flows image builder&quot; width=&quot;1600&quot; height=&quot;748&quot; srcset=&quot;https://thescalableway.com/img/2za6kUqSCH-960.jpeg 960w, https://thescalableway.com/img/2za6kUqSCH-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;This workflow is triggered by changes to any of the following files:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;code&gt;etc/docker/Dockerfile&lt;/code&gt;&lt;/li&gt;&lt;li&gt;&lt;code&gt;pyproject.toml&lt;/code&gt;&lt;/li&gt;&lt;/ul&gt;&lt;h5 id=&quot;pull-request-steps&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/ci/cd-for-data-workflows-automating-prefect-deployments-with-github-actions/#pull-request-steps&quot; class=&quot;heading-anchor&quot;&gt;Pull request steps&lt;/a&gt;&lt;/h5&gt;&lt;ol class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Check version increase:&lt;/strong&gt; Validates that the version number was bumped correctly (e.g., 1.2.3 → 1.2.4, 1.3.0, or 2.0.0). 
To derive the acceptable versions, the &lt;code&gt;christian-draeger/increment-semantic-version@1.2.3&lt;/code&gt; GitHub action can calculate the next patch, minor, and major versions, which are then compared with the version the developer bumped manually.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Build DEV Image:&lt;/strong&gt; Builds a unique, versioned DEV image for edp-flows. It can be tagged using the pattern:&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;&lt;code&gt;${VERSION}-pr-${{ github.event.number }}-run-${{ github.run_number }}&lt;/code&gt;&lt;/p&gt;&lt;p&gt;The image is then pushed to the GitHub Container Registry. In the &lt;a href=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/&quot; rel=&quot;noopener&quot;&gt;previous blog post&lt;/a&gt;, we prepared an example Dockerfile. Here’s what a GitHub workflow to handle it might look like:&lt;/p&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;jobs&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;build-and-push&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; Build and push docker image
    &lt;span class=&quot;token key atrule&quot;&gt;runs-on&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; ubuntu&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;latest
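    &lt;span class=&quot;token comment&quot;&gt;# Assumption (not in the original workflow): pushing to the GitHub Container&lt;/span&gt;
    &lt;span class=&quot;token comment&quot;&gt;# Registry with GITHUB_TOKEN needs the packages write permission. If this job&lt;/span&gt;
    &lt;span class=&quot;token comment&quot;&gt;# runs as a reusable workflow, the calling workflow has to grant it as well.&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;permissions&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;contents&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; read
      &lt;span class=&quot;token key atrule&quot;&gt;packages&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; write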
    &lt;span class=&quot;token key atrule&quot;&gt;steps&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; Checkout Repository
        &lt;span class=&quot;token key atrule&quot;&gt;uses&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; actions/checkout@v4

      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; Set up Docker Buildx
        &lt;span class=&quot;token key atrule&quot;&gt;uses&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; docker/setup&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;buildx&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;action@v3

      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; Create a multi&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;platform builder
        &lt;span class=&quot;token key atrule&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;|&lt;/span&gt;&lt;span class=&quot;token scalar string&quot;&gt;
          docker buildx create --name builder --driver docker-container --use
          docker buildx inspect --bootstrap&lt;/span&gt;

      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; Login to GitHub Container Registry
        &lt;span class=&quot;token key atrule&quot;&gt;uses&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; docker/login&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;action@v3
        &lt;span class=&quot;token key atrule&quot;&gt;with&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;registry&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; $&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; inputs.registry &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;username&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; $&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; github.actor &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;password&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; $&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; secrets.GITHUB_TOKEN &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Build and Push Docker Image&quot;&lt;/span&gt;
        &lt;span class=&quot;token key atrule&quot;&gt;uses&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; docker/build&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;push&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;action@v6
        &lt;span class=&quot;token key atrule&quot;&gt;with&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;context&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; ./
          &lt;span class=&quot;token key atrule&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; $&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; inputs.dockerfile &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;platforms&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; $&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; inputs.platforms &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;push&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; $&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; inputs.push &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;tags&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; $&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; inputs.registry &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;/$&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; inputs.organization&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;/$&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; inputs.image_name &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;$&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; inputs.image_tag &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; Cleanup the builder
        &lt;span class=&quot;token key atrule&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; docker buildx rm builder&lt;/code&gt;&lt;/pre&gt;&lt;ol start=&quot;3&quot; class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Update DEV Prefect work pool:&lt;/strong&gt; Updates the base job template on the DEV environment to reference the newly built Docker image. During this step, it is essential to update the image inside the &lt;code&gt;baseJobTemplate&lt;/code&gt; definition to the newly created one, which can be handled even with a simple replacement:&lt;/li&gt;&lt;/ol&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;jobs&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;prefect-worker-helm&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; Prepare prefect worker
    &lt;span class=&quot;token key atrule&quot;&gt;runs-on&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; $&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; inputs.runs_on &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;steps&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; Checkout Repository
        &lt;span class=&quot;token key atrule&quot;&gt;uses&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; actions/checkout@v4

      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; Add necessary dependencies
        &lt;span class=&quot;token key atrule&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;|&lt;/span&gt;&lt;span class=&quot;token scalar string&quot;&gt;
          helm repo add prefect https://prefecthq.github.io/prefect-helm
          helm repo update prefect&lt;/span&gt;

      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; Create prefect namespace
        &lt;span class=&quot;token key atrule&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;|&lt;/span&gt;&lt;span class=&quot;token scalar string&quot;&gt;
          cat &amp;lt;&amp;lt;EOF | kubectl apply -f -
          apiVersion: v1
          kind: Namespace
          metadata:
            name: ${{ inputs.namespace }}
          EOF&lt;/span&gt;
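
      # Sketch (assumption, not part of the original workflow): other missing
      # resources can be created idempotently in the same way. The secret name
      # `prefect-api-key`, its `key` entry, and the PREFECT_API_KEY secret are
      # hypothetical; they must match your Helm values, and a reusable workflow
      # has to declare the secret under `on.workflow_call.secrets`.
      - name: Create Prefect API key secret (sketch)
        run: |
          kubectl create secret generic prefect-api-key &#92;
            -n ${{ inputs.namespace }} &#92;
            --from-literal=key=${{ secrets.PREFECT_API_KEY }} &#92;
            --dry-run=client -o yaml | kubectl apply -f -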

      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; Replace default flow image
        &lt;span class=&quot;token key atrule&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;|&lt;/span&gt;&lt;span class=&quot;token scalar string&quot;&gt;
          sed -i &quot;s|DEFAULT_FLOW_IMAGE|${{ inputs.default_flow_image }}|g&quot; ${{ inputs.helm_values_path }}&lt;/span&gt;

      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; Run Helm upgrade commands
        &lt;span class=&quot;token key atrule&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;|&lt;/span&gt;&lt;span class=&quot;token scalar string&quot;&gt;
          helm upgrade prefect-worker --install prefect/prefect-worker &#92;
            -n ${{ inputs.namespace }} &#92;
            --version ${{ inputs.prefect_chart_version }} &#92;
            -f ${{ inputs.helm_values_path }}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Other missing resources can be created the same way as the namespace. Usually, this means a secret holding the &lt;code&gt;PREFECT_API_KEY&lt;/code&gt; and a registry credentials secret for pulling private flow images, but additional secrets or configuration may also be needed.&lt;/p&gt;&lt;h5 id=&quot;after-pr-marge&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/ci/cd-for-data-workflows-automating-prefect-deployments-with-github-actions/#after-pr-marge&quot; class=&quot;heading-anchor&quot;&gt;After PR Merge&lt;/a&gt;&lt;/h5&gt;&lt;ol class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Tag and Release:&lt;/strong&gt; A new GitHub tag and release are created for the updated version. Many ready-made actions are available for this. For example, the job can be defined like this:&lt;/li&gt;&lt;/ol&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;jobs&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;prepare&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; $&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; github.event.pull_request.merged &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;runs-on&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; ubuntu&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;latest
    &lt;span class=&quot;token key atrule&quot;&gt;steps&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;uses&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; actions/checkout@v4

      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; Get version from pyproject.toml
        &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; get&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;version
        &lt;span class=&quot;token key atrule&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; echo &quot;version=$(grep &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;Po &#39;(&lt;span class=&quot;token punctuation&quot;&gt;?&lt;/span&gt;&amp;lt;=^version = &quot;)&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;^&quot;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token important&quot;&gt;*&#39;&lt;/span&gt; pyproject.toml)&quot; &lt;span class=&quot;token punctuation&quot;&gt;|&lt;/span&gt; tee &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;a $GITHUB_OUTPUT

    &lt;span class=&quot;token key atrule&quot;&gt;outputs&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;VERSION&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; $&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; steps.get&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;version.outputs.version &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

  &lt;span class=&quot;token key atrule&quot;&gt;tag-and-release&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; $&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; github.event.pull_request.merged &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;needs&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;prepare&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; Tag and release new version (if applicable)
    &lt;span class=&quot;token key atrule&quot;&gt;runs-on&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; ubuntu&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;latest
    &lt;span class=&quot;token key atrule&quot;&gt;permissions&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;contents&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; write
    &lt;span class=&quot;token key atrule&quot;&gt;outputs&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;TAG_CREATED&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; $&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; steps.check&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;tag.outputs.exists &lt;span class=&quot;token tag&quot;&gt;!=&lt;/span&gt; &#39;true&#39; &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;token key atrule&quot;&gt;steps&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;uses&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; actions/checkout@v4
        &lt;span class=&quot;token key atrule&quot;&gt;with&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;fetch-depth&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;

      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; Check if a tag for this version already exists in the repo
        &lt;span class=&quot;token key atrule&quot;&gt;uses&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; mukunku/tag&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;exists&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;action@v1.6.0
        &lt;span class=&quot;token key atrule&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; check&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;tag
        &lt;span class=&quot;token key atrule&quot;&gt;with&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;tag&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; v$&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; needs.prepare.outputs.VERSION &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;uses&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; fregante/setup&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;git&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;user@v2
        &lt;span class=&quot;token key atrule&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; steps.check&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;tag.outputs.exists &lt;span class=&quot;token tag&quot;&gt;!=&lt;/span&gt; &#39;true&#39;

      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; Publish the new tag
        &lt;span class=&quot;token key atrule&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; steps.check&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;tag.outputs.exists &lt;span class=&quot;token tag&quot;&gt;!=&lt;/span&gt; &#39;true&#39;
        &lt;span class=&quot;token key atrule&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;|&lt;/span&gt;&lt;span class=&quot;token scalar string&quot;&gt;
          git tag -a v${{ needs.prepare.outputs.VERSION }} -m &quot;Release v${{ needs.prepare.outputs.VERSION }}&quot;
          git push origin v${{ needs.prepare.outputs.VERSION }}&lt;/span&gt;

      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; Create a release
        &lt;span class=&quot;token key atrule&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; steps.check&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;tag.outputs.exists &lt;span class=&quot;token tag&quot;&gt;!=&lt;/span&gt; &#39;true&#39;
        &lt;span class=&quot;token key atrule&quot;&gt;uses&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; ncipollo/release&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;action@v1
        &lt;span class=&quot;token key atrule&quot;&gt;with&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;generateReleaseNotes&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token boolean important&quot;&gt;true&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;tag&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; v$&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; needs.prepare.outputs.VERSION &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The additional &lt;code&gt;prepare&lt;/code&gt; job pre-defines values used later in the workflow, which simplifies its logic.&lt;/p&gt;&lt;ol start=&quot;2&quot; class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Build PROD Image:&lt;/strong&gt; Builds the PROD image for edp-flows, tags it with &lt;code&gt;${VERSION}&lt;/code&gt;, and pushes it to the GitHub container registry. The template used for building the DEV image can be reused; only the image tag changes.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Update DEV Prefect work pool:&lt;/strong&gt; Updates the base job template in the DEV environment to use the new Docker image.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Update PROD Prefect work pool:&lt;/strong&gt; Updates the base job template in the PROD environment to use the new Docker image.&lt;/li&gt;&lt;/ol&gt;&lt;h4 id=&quot;workflow-2-prefect-worker-updates&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/ci/cd-for-data-workflows-automating-prefect-deployments-with-github-actions/#workflow-2-prefect-worker-updates&quot; class=&quot;heading-anchor&quot;&gt;Workflow 2: Prefect Worker Updates&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal11&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/6XE6y67Dgw-960.webp 960w, https://thescalableway.com/img/6XE6y67Dgw-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img 
loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/6XE6y67Dgw-960.jpeg&quot; alt=&quot;prefect worker updates&quot; width=&quot;1600&quot; height=&quot;741&quot; srcset=&quot;https://thescalableway.com/img/6XE6y67Dgw-960.jpeg 960w, https://thescalableway.com/img/6XE6y67Dgw-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;11&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/6XE6y67Dgw-960.webp 960w, https://thescalableway.com/img/6XE6y67Dgw-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/6XE6y67Dgw-960.jpeg&quot; alt=&quot;prefect worker updates&quot; width=&quot;1600&quot; height=&quot;741&quot; srcset=&quot;https://thescalableway.com/img/6XE6y67Dgw-960.jpeg 960w, https://thescalableway.com/img/6XE6y67Dgw-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;This workflow is triggered by changes to any of the following files:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;code&gt;etc/helm_values/prefect-worker/values-dev.yaml&lt;/code&gt;&lt;/li&gt;&lt;li&gt;&lt;code&gt;etc/helm_values/prefect-worker/values-prod.yaml&lt;/code&gt;&lt;/li&gt;&lt;/ul&gt;&lt;h5 id=&quot;pull-request-steps-1&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/ci/cd-for-data-workflows-automating-prefect-deployments-with-github-actions/#pull-request-steps-1&quot; class=&quot;heading-anchor&quot;&gt;Pull Request Steps&lt;/a&gt;&lt;/h5&gt;&lt;ol class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Binary packages check/install&lt;/strong&gt;: Installs any required binaries (like K3s and Helm) on the DEV virtual machine.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Update DEV Prefect work pool:&lt;/strong&gt; Updates the Prefect worker’s base job template in the DEV environment to apply 
any required configuration changes.&lt;/li&gt;&lt;/ol&gt;&lt;h5 id=&quot;after-pr-merge&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/ci/cd-for-data-workflows-automating-prefect-deployments-with-github-actions/#after-pr-merge&quot; class=&quot;heading-anchor&quot;&gt;After PR Merge&lt;/a&gt;&lt;/h5&gt;&lt;ol class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Binary packages check/install&lt;/strong&gt;: During the first execution, the necessary binaries (K3s and Helm) are installed on the PROD virtual machine.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Update DEV Prefect work pool:&lt;/strong&gt; Updates the Prefect worker’s base job template in the DEV environment to apply any required configuration changes.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Update PROD Prefect work pool:&lt;/strong&gt; Updates the Prefect worker’s base job template in the PROD environment to apply any required configuration changes.&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Note:&lt;/em&gt;&lt;/strong&gt;&lt;em&gt; These first two workflows affect infrastructure only; they don’t touch actual Prefect deployments. 
The next workflow handles that.&lt;/em&gt;&lt;/p&gt;&lt;h4 id=&quot;workflow-3-prefect-deployment-orchestration&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/ci/cd-for-data-workflows-automating-prefect-deployments-with-github-actions/#workflow-3-prefect-deployment-orchestration&quot; class=&quot;heading-anchor&quot;&gt;Workflow 3: Prefect Deployment Orchestration&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal12&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/jOHcunnyPS-960.webp 960w, https://thescalableway.com/img/jOHcunnyPS-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/jOHcunnyPS-960.jpeg&quot; alt=&quot;prefect deployment orchestration&quot; width=&quot;1600&quot; height=&quot;598&quot; srcset=&quot;https://thescalableway.com/img/jOHcunnyPS-960.jpeg 960w, https://thescalableway.com/img/jOHcunnyPS-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;12&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/jOHcunnyPS-960.webp 960w, https://thescalableway.com/img/jOHcunnyPS-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/jOHcunnyPS-960.jpeg&quot; alt=&quot;prefect deployment orchestration&quot; width=&quot;1600&quot; height=&quot;598&quot; srcset=&quot;https://thescalableway.com/img/jOHcunnyPS-960.jpeg 960w, https://thescalableway.com/img/jOHcunnyPS-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;This workflow is triggered by changes to &lt;code&gt;prefect.yaml&lt;/code&gt; or to any source code within the &lt;code&gt;src&lt;/code&gt; 
directory.&lt;/p&gt;&lt;h5 id=&quot;pull-request-steps-2&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/ci/cd-for-data-workflows-automating-prefect-deployments-with-github-actions/#pull-request-steps-2&quot; class=&quot;heading-anchor&quot;&gt;Pull Request Steps&lt;/a&gt;&lt;/h5&gt;&lt;p&gt;&lt;strong&gt;1. Identify Modified Deployments:&lt;/strong&gt; In a simplified version, all existing deployments are registered. A helper step with the following logic handles that:&lt;/p&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;jobs&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;apply-deployments&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;runs-on&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; ubuntu&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;latest
    &lt;span class=&quot;token key atrule&quot;&gt;steps&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;uses&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; actions/checkout@v4
        &lt;span class=&quot;token key atrule&quot;&gt;with&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;fetch-depth&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;
      
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;uses&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; actions/setup&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;python@v5
        &lt;span class=&quot;token key atrule&quot;&gt;with&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;token key atrule&quot;&gt;python-version-file&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;.python-version&quot;&lt;/span&gt;

      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; Install dependencies
        &lt;span class=&quot;token key atrule&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; pip install &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;q PyYAML prefect
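
      # Sketch (assumption, not the article's script) of the target solution
      # described after this block: list only the deployments whose definition
      # in prefect.yaml differs from the base branch. Assumes mikefarah yq v4,
      # a full checkout (fetch-depth: 0), and names without whitespace. The
      # `if` condition is a guess that mirrors the step that follows.
      - name: Get modified deployments from prefect.yaml (sketch)
        if: ${{ inputs.deployments != &#39;all&#39; }}
        run: |
          base_ref=&quot;origin/${GITHUB_BASE_REF:-main}&quot;
          git show &quot;${base_ref}:prefect.yaml&quot; &amp;gt; /tmp/base-prefect.yaml
          changed=&quot;&quot;
          for name in $(yq &#39;.deployments[].name&#39; prefect.yaml); do
            old=$(NAME=&quot;$name&quot; yq &#39;.deployments[] | select(.name == strenv(NAME))&#39; /tmp/base-prefect.yaml)
            new=$(NAME=&quot;$name&quot; yq &#39;.deployments[] | select(.name == strenv(NAME))&#39; prefect.yaml)
            if [ &quot;$old&quot; != &quot;$new&quot; ]; then changed=&quot;${changed:+$changed,}$name&quot;; fi
          done
          echo &quot;DEPLOYMENT_NAMES=$changed&quot; | tee -a $GITHUB_ENV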

      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; Get all deployments from prefect.yaml
        &lt;span class=&quot;token key atrule&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; $&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; inputs.deployments == &#39;all&#39; &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
        &lt;span class=&quot;token key atrule&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; echo &quot;DEPLOYMENT_NAMES=$(cat prefect.yaml &lt;span class=&quot;token punctuation&quot;&gt;|&lt;/span&gt; yq &#39;.deployments&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;.name&#39; &lt;span class=&quot;token punctuation&quot;&gt;|&lt;/span&gt; paste &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;sd &quot;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;&quot;)&quot; &lt;span class=&quot;token punctuation&quot;&gt;|&lt;/span&gt; tee &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;a $GITHUB_ENV&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;As a target solution, we want a script that detects changes to deployments in the &lt;code&gt;prefect.yaml&lt;/code&gt; by comparing the main branch with the pull request branch. This way, &lt;code&gt;DEPLOYMENT_NAMES&lt;/code&gt; will include only the modified deployments. The script can also detect removed deployments, helping with housekeeping in Prefect Cloud.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;2. Apply to DEV:&lt;/strong&gt; Applies the modified deployments to the DEV Prefect workspace, referencing the pull request branch. Assuming that we have only modified deployments provided in &lt;code&gt;DEPLOYMENT_NAMES&lt;/code&gt;, the registration script can look like this:&lt;/p&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; Set branch &#39;$&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;GITHUB_HEAD_REF&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&#39; in prefect.yaml
        &lt;span class=&quot;token key atrule&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;|&lt;/span&gt;&lt;span class=&quot;token scalar string&quot;&gt;
          yq -i &#39;.pull[] |= (select(has(&quot;prefect.deployments.steps.git_clone&quot;))
            | .[&quot;prefect.deployments.steps.git_clone&quot;].branch = strenv(GITHUB_HEAD_REF) | .) // .&#39; prefect.yaml&lt;/span&gt;

      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; Register all deployments
        &lt;span class=&quot;token key atrule&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;|&lt;/span&gt;&lt;span class=&quot;token scalar string&quot;&gt;
          for deployment_name in $(echo ${{ env.DEPLOYMENT_NAMES }} | tr &#39;,&#39; &#39;&#92;n&#39;); do
            prefect --no-prompt deploy &#92;
              -n &quot;$deployment_name&quot; &#92;
              --tag &quot;${GITHUB_HEAD_REF}&quot;
          done&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Deployments are created with scheduling disabled to allow manual testing in the DEV environment. The preparation step sets &lt;code&gt;GITHUB_HEAD_REF&lt;/code&gt; as the branch reference that the &lt;code&gt;git_clone&lt;/code&gt; step uses at runtime.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;3. Testing &amp;amp; Merge:&lt;/strong&gt; Modified deployments and flows are tested. Because no schedule is enabled for deployments in the DEV workspace, they need to be triggered and tested manually. Upon approval, changes are merged into the main branch.&lt;/p&gt;&lt;h5 id=&quot;after-pr-merge-1&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/ci/cd-for-data-workflows-automating-prefect-deployments-with-github-actions/#after-pr-merge-1&quot; class=&quot;heading-anchor&quot;&gt;After PR Merge&lt;/a&gt;&lt;/h5&gt;&lt;p&gt;&lt;strong&gt;1. Identify Modified Deployments:&lt;/strong&gt; Detects all updated deployments in &lt;code&gt;prefect.yaml&lt;/code&gt;, as in step 1 of the pull request steps.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;2. Apply to PROD:&lt;/strong&gt; Applies the deployments to the PROD Prefect workspace, using the main branch. Scheduling is enabled with this command:&lt;/p&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;&lt;span class=&quot;token function-name function&quot;&gt;get_flow_name_for_deployment_from_prefect_yaml&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token comment&quot;&gt;# Fetch flow name for a deployment from prefect.yaml.&lt;/span&gt;

    &lt;span class=&quot;token assign-left variable&quot;&gt;deployment_name&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token variable&quot;&gt;$1&lt;/span&gt;

    &lt;span class=&quot;token builtin class-name&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Retrieving flow name for deployment &#39;&lt;span class=&quot;token variable&quot;&gt;$deployment_name&lt;/span&gt;&#39;...&quot;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;token file-descriptor important&quot;&gt;&amp;amp;2&lt;/span&gt;

    &lt;span class=&quot;token assign-left variable&quot;&gt;deployment_entrypoint&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token variable&quot;&gt;&lt;span class=&quot;token variable&quot;&gt;$(&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;cat&lt;/span&gt; prefect.yaml &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; yq &lt;span class=&quot;token string&quot;&gt;&#39;.deployments[] | select(.name == &quot;&#39;&lt;/span&gt;$deployment_name&lt;span class=&quot;token string&quot;&gt;&#39;&quot;) | .entrypoint&#39;&lt;/span&gt;&lt;span class=&quot;token variable&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
    &lt;span class=&quot;token assign-left variable&quot;&gt;flow_name&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token variable&quot;&gt;&lt;span class=&quot;token variable&quot;&gt;$(&lt;/span&gt;&lt;span class=&quot;token builtin class-name&quot;&gt;echo&lt;/span&gt; $deployment_entrypoint &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;cut&lt;/span&gt; -d&lt;span class=&quot;token string&quot;&gt;&#39;:&#39;&lt;/span&gt; &lt;span class=&quot;token parameter variable&quot;&gt;-f2&lt;/span&gt;&lt;span class=&quot;token variable&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
    &lt;span class=&quot;token assign-left variable&quot;&gt;flow_name_kebab_case&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token variable&quot;&gt;&lt;span class=&quot;token variable&quot;&gt;$(&lt;/span&gt;&lt;span class=&quot;token builtin class-name&quot;&gt;echo&lt;/span&gt; $flow_name &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;sed&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&#39;s/_/-/g&#39;&lt;/span&gt;&lt;span class=&quot;token variable&quot;&gt;)&lt;/span&gt;&lt;/span&gt;

    &lt;span class=&quot;token comment&quot;&gt;# Return the flow name converted to kebab case, as this is what prefect CLI commands expect.&lt;/span&gt;
    &lt;span class=&quot;token builtin class-name&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;token variable&quot;&gt;$flow_name_kebab_case&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token assign-left variable&quot;&gt;flow_name&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token variable&quot;&gt;&lt;span class=&quot;token variable&quot;&gt;$(&lt;/span&gt;get_flow_name_for_deployment_from_prefect_yaml &lt;span class=&quot;token string&quot;&gt;&quot;&lt;span class=&quot;token variable&quot;&gt;$deployment_name&lt;/span&gt;&quot;&lt;/span&gt;&lt;span class=&quot;token variable&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token assign-left variable&quot;&gt;schedule_id&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token variable&quot;&gt;&lt;span class=&quot;token variable&quot;&gt;$(&lt;/span&gt;prefect deployment schedule &lt;span class=&quot;token function&quot;&gt;ls&lt;/span&gt; $flow_name/$deployment_name &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;grep&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&#39;│&#39;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;awk&lt;/span&gt; &lt;span class=&quot;token parameter variable&quot;&gt;-F&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&#39;│&#39;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&#39;{print $2}&#39;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;sed&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&#39;s/^[[:space:]]*//;s/[[:space:]]*$//&#39;&lt;/span&gt;&lt;span class=&quot;token variable&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
prefect deployment schedule resume &lt;span class=&quot;token variable&quot;&gt;$flow_name&lt;/span&gt;/&lt;span class=&quot;token variable&quot;&gt;$deployment_name&lt;/span&gt; &lt;span class=&quot;token variable&quot;&gt;$schedule_id&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This script can be added to the workflow to automate schedule enabling.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;3. Sync DEV:&lt;/strong&gt; After deleting the pull request branch, reapply deployments to the DEV Prefect workspace, now referencing the &lt;code&gt;main&lt;/code&gt; branch. Scheduling remains disabled for DEV.&lt;/p&gt;&lt;p&gt;You can verify the &lt;strong&gt;branch references&lt;/strong&gt; used in deployments by checking the assigned tag or the Configuration tab in Prefect Cloud. For example, in the screenshot below, the deployment runs on branch &lt;code&gt;feature_branch_1&lt;/code&gt;.&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal13&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/U2Nw6rtxZH-605.webp 605w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/U2Nw6rtxZH-605.jpeg&quot; alt=&quot;verify branch references&quot; width=&quot;605&quot; height=&quot;408&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;13&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/U2Nw6rtxZH-605.webp 605w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/U2Nw6rtxZH-605.jpeg&quot; alt=&quot;verify branch references&quot; width=&quot;605&quot; height=&quot;408&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;h2 id=&quot;conclusion-and-series-summary&quot;&gt;&lt;a 
href=&quot;https://thescalableway.com/blog/ci/cd-for-data-workflows-automating-prefect-deployments-with-github-actions/#conclusion-and-series-summary&quot; class=&quot;heading-anchor&quot;&gt;Conclusion &amp;amp; series summary&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;This article wraps up our four-part journey to building a modern, automated data platform—from high-level architecture to fully hands-off, production-ready operations. Along the way, we’ve shown how each layer, including architecture, infrastructure, orchestration, and automation, works together to create a resilient, scalable foundation for data engineering.&lt;/p&gt;&lt;p&gt;Let’s quickly recap:&lt;/p&gt;&lt;p&gt;&lt;a href=&quot;https://thescalableway.com/blog/deploying-prefect-on-any-cloud-using-a-single-virtual-machine/&quot; rel=&quot;noopener&quot;&gt;&lt;strong&gt;Part 1&lt;/strong&gt;&lt;/a&gt; focused on architectural decisions, demonstrating how a lightweight Kubernetes setup on a single VM can enable rapid adoption and growth, even for teams just starting with cloud-native data platforms.&lt;/p&gt;&lt;p&gt;&lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/&quot; rel=&quot;noopener&quot;&gt;&lt;strong&gt;Part 2&lt;/strong&gt;&lt;/a&gt; moved from design to implementation, automating cloud infrastructure provisioning with Terraform to ensure consistency, reproducibility, and cloud-agnostic flexibility.&lt;/p&gt;&lt;p&gt;&lt;a href=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/&quot; rel=&quot;noopener&quot;&gt;&lt;strong&gt;Part 3&lt;/strong&gt;&lt;/a&gt; took us deeper into operations, guiding you through containerized flow execution, Prefect worker configuration, and deployment management—helping you run your first data ingestion flows confidently.&lt;/p&gt;&lt;p&gt;And here in &lt;strong&gt;Part 4&lt;/strong&gt;, we brought everything together by introducing 
CI/CD automation. We showed how three specialized workflows for Docker images, Prefect workers, and deployment orchestration help reduce manual errors, maintain a single source of truth, and scale both your platform and your team. This kind of automation makes development smoother, testing more reliable, and production releases faster and safer.&lt;/p&gt;&lt;p&gt;The main takeaway is that adopting CI/CD for data platforms is not just about tools but about changing how your team works. Automation connects development and production, reduces risk, and frees your engineers to focus on data rather than infrastructure.&lt;/p&gt;&lt;p&gt;Thanks so much for following along. For more tips and updates, check out my &lt;a href=&quot;https://thescalableway.com/author/karol-wolski/&quot; rel=&quot;noopener&quot;&gt;&lt;strong&gt;other articles&lt;/strong&gt;&lt;/a&gt;, subscribe to our &lt;strong&gt;newsletter&lt;/strong&gt;, and connect with me on &lt;a href=&quot;https://www.linkedin.com/in/wolski-karol/&quot; rel=&quot;noopener&quot;&gt;&lt;strong&gt;LinkedIn&lt;/strong&gt;&lt;/a&gt;. Let’s keep the conversation about smarter data platforms going.&lt;/p&gt;&lt;/div&gt;
 			</content>
    </entry><entry>
      <title>Scaling Secure Data Access: A Systematic RBAC Approach Using Entra ID</title>
      <link href="https://thescalableway.com/blog/scaling-secure-data-access-a-systematic-rbac-approach-using-entra-id/" />
      <updated>2025-06-23T09:30:00Z</updated>
      <id>https://thescalableway.com/blog/scaling-secure-data-access-a-systematic-rbac-approach-using-entra-id/</id>
      <content type="html">
				&lt;nav id=&quot;toc&quot; class=&quot;table-of-contents prose&quot;&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/scaling-secure-data-access-a-systematic-rbac-approach-using-entra-id/#what-is-role-based-access-control-rbac&quot;&gt;What is Role-Based Access Control (RBAC)?&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/scaling-secure-data-access-a-systematic-rbac-approach-using-entra-id/#why-rbac-matters-in-modular-data-platforms&quot;&gt;Why RBAC Matters in Modular Data Platforms&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/scaling-secure-data-access-a-systematic-rbac-approach-using-entra-id/#phase-1-defining-user-personas&quot;&gt;Phase 1: Defining User Personas&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/scaling-secure-data-access-a-systematic-rbac-approach-using-entra-id/#phase-2-creating-entraid-groups&quot;&gt;Phase 2: Creating EntraID Groups&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/scaling-secure-data-access-a-systematic-rbac-approach-using-entra-id/#phase-3-mapping-users-to-entraid-groups&quot;&gt;Phase 3: Mapping Users to EntraID Groups&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/scaling-secure-data-access-a-systematic-rbac-approach-using-entra-id/#phase-4-access-configuration-and-permissions-setup&quot;&gt;Phase 4: Access Configuration and Permissions Setup&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/scaling-secure-data-access-a-systematic-rbac-approach-using-entra-id/#phase-5-audit-and-continuous-monitoring&quot;&gt;Phase 5: Audit and Continuous Monitoring&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a 
href=&quot;https://thescalableway.com/blog/scaling-secure-data-access-a-systematic-rbac-approach-using-entra-id/#final-thoughts&quot;&gt;Final Thoughts&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/nav&gt;&lt;p&gt;&lt;span id=&quot;toc-skipped&quot; class=&quot;visually-hidden&quot;&gt;&lt;/span&gt;&lt;/p&gt;&lt;div class=&quot;flow prose&quot;&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;When data platforms grow in scale and complexity, so do the risks. Suddenly, you’re juggling dozens of tools, a growing number of users, and multiple layers of sensitive data, while trying to balance it all with security controls. Managing who gets access to what (and making sure they get only what they need) quickly becomes a full-time job.&lt;/p&gt;&lt;p&gt;This reality requires flexible and manageable access controls. That’s exactly where &lt;strong&gt;Role-Based Access Control (RBAC)&lt;/strong&gt; enters the scene.&lt;/p&gt;&lt;h2 id=&quot;what-is-role-based-access-control-rbac&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/scaling-secure-data-access-a-systematic-rbac-approach-using-entra-id/#what-is-role-based-access-control-rbac&quot; class=&quot;heading-anchor&quot;&gt;What is Role-Based Access Control (RBAC)?&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;RBAC is a foundational security model that has evolved from military protocols to become a standard in IT systems. In this paradigm, permissions are assigned to roles and users are assigned to those roles, so access is not managed for each user individually. 
This approach brings several advantages:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Scalability:&lt;/strong&gt; Manages access for large user bases efficiently.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Reduces the risk of misconfiguration.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Auditability:&lt;/strong&gt; Makes it easier to track who has access to what.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Modularity:&lt;/strong&gt; Aligns well with component-based data architectures.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Savings:&lt;/strong&gt; Cuts down on repetitive access management tasks.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;At its core, RBAC secures both organizational assets and resources, a distinction that’s especially relevant in modular data platforms.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Assets&lt;/strong&gt; are valuable digital objects used across the organization to support data-driven decisions. These include data models, dashboards, reports, and machine learning models.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt; are the underlying technical components or capabilities that enable the creation and delivery of those assets. These include EKS clusters, storage systems (e.g., S3, ADLS), databases, orchestrators, pipelines, and monitoring tools.&lt;/p&gt;&lt;p&gt;Access to both &lt;strong&gt;assets and resources&lt;/strong&gt; should be managed consistently and securely. In most cases, Role-Based Access Control provides a solid and sufficient framework to do exactly that.&lt;/p&gt;&lt;h2 id=&quot;why-rbac-matters-in-modular-data-platforms&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/scaling-secure-data-access-a-systematic-rbac-approach-using-entra-id/#why-rbac-matters-in-modular-data-platforms&quot; class=&quot;heading-anchor&quot;&gt;Why RBAC Matters in Modular Data Platforms&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;RBAC is essential in &lt;strong&gt;modular data platforms&lt;/strong&gt;. 
These platforms serve diverse users (data engineers, analysts, scientists, AI engineers, and business users), each with specific access needs. The challenge lies in designing a system that grants appropriate access without compromising security or compliance.&lt;/p&gt;&lt;p&gt;This article walks through a structured, five-phase RBAC implementation using Microsoft Entra ID, a widely adopted identity and access management tool. The end goal of this process is to establish a streamlined, scalable access workflow, just like the one visualized below:&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal55&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/SbIhpGpl98-960.webp 960w, https://thescalableway.com/img/SbIhpGpl98-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/SbIhpGpl98-960.jpeg&quot; alt=&quot;data access with entra id&quot; width=&quot;1600&quot; height=&quot;916&quot; srcset=&quot;https://thescalableway.com/img/SbIhpGpl98-960.jpeg 960w, https://thescalableway.com/img/SbIhpGpl98-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;55&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/SbIhpGpl98-960.webp 960w, https://thescalableway.com/img/SbIhpGpl98-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/SbIhpGpl98-960.jpeg&quot; alt=&quot;data access with entra id&quot; width=&quot;1600&quot; height=&quot;916&quot; srcset=&quot;https://thescalableway.com/img/SbIhpGpl98-960.jpeg 960w, https://thescalableway.com/img/SbIhpGpl98-1600.jpeg 1600w&quot; 
sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal56&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/ZwcOIWG-rZ-960.webp 960w, https://thescalableway.com/img/ZwcOIWG-rZ-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/ZwcOIWG-rZ-960.jpeg&quot; alt=&quot;data access to an enterprise data platform via entra id&quot; width=&quot;1600&quot; height=&quot;711&quot; srcset=&quot;https://thescalableway.com/img/ZwcOIWG-rZ-960.jpeg 960w, https://thescalableway.com/img/ZwcOIWG-rZ-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;56&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/ZwcOIWG-rZ-960.webp 960w, https://thescalableway.com/img/ZwcOIWG-rZ-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/ZwcOIWG-rZ-960.jpeg&quot; alt=&quot;data access to an enterprise data platform via entra id&quot; width=&quot;1600&quot; height=&quot;711&quot; srcset=&quot;https://thescalableway.com/img/ZwcOIWG-rZ-960.jpeg 960w, https://thescalableway.com/img/ZwcOIWG-rZ-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;h2 id=&quot;phase-1-defining-user-personas&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/scaling-secure-data-access-a-systematic-rbac-approach-using-entra-id/#phase-1-defining-user-personas&quot; class=&quot;heading-anchor&quot;&gt;Phase 1: Defining User Personas&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;RBAC starts with clear role definitions, but many teams get stuck here. 
&lt;em&gt;Where to start? Which roles do we need? What defines a good role? Let’s stick to 777/755/744…&lt;/em&gt;&lt;/p&gt;&lt;p&gt;This is a critical step, as your entire access model depends on getting this right.&lt;/p&gt;&lt;p&gt;The best approach begins by listing &lt;strong&gt;key user personas&lt;/strong&gt; (broad categories of users with shared access needs, security considerations, and operational patterns) and refining them into sub-categories if necessary. Here are typical personas in a modern data platform:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Data Platform Engineer/DevOps&lt;/strong&gt; – Needs broad access across infrastructure components. Responsible for developing, deploying, and maintaining platform services. Typically requires admin-level permissions across environments and tooling.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Data Engineer&lt;/strong&gt; – Focuses on building and managing pipelines, data flows, and transformations. Requires read/write access in development and staging environments, limited access to production, and admin permissions in orchestration tools.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Data Scientist&lt;/strong&gt; – Engages in exploratory data analysis and model development. Needs read access across multiple domains, access to computational resources, and controlled write access to model deployment tools.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Data Analyst&lt;/strong&gt; – Primarily interacts with modeled or normalized data. Requires read access to curated datasets and limited write access for storing results. May need orchestrator access to refresh or trigger reporting workflows.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Business Analysts, BI Developers, and Report Creators&lt;/strong&gt; – Work mostly in presentation layers. They typically require read-only access to business-ready data and tools for dashboard creation. 
Their access often spans multiple domains, which makes clear scoping important.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; If your organization uses data domains (e.g., Finance, HR, Sales), use them as attributes to refine personas. A matrix of personas vs. domains can simplify access mapping without adding unnecessary complexity.&lt;/p&gt;&lt;h2 id=&quot;phase-2-creating-entraid-groups&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/scaling-secure-data-access-a-systematic-rbac-approach-using-entra-id/#phase-2-creating-entraid-groups&quot; class=&quot;heading-anchor&quot;&gt;Phase 2: Creating EntraID Groups&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;With personas defined, the next step is to translate them into &lt;strong&gt;Microsoft Entra ID security groups&lt;/strong&gt;. They serve as the operational layer of the RBAC model, forming the backbone for scalable, repeatable permission management.&lt;/p&gt;&lt;p&gt;Entra ID groups do much more than just bundle users. They support capabilities that are critical for maintaining control as the platform scales, such as audit trails, access reviews, and user lifecycle automation.&lt;/p&gt;&lt;p&gt;Each group should correspond to a specific persona, have a clearly defined owner, and include a domain context where applicable. For example, instead of creating a generic DataAnalyst group, consider a more specific DataAnalyst_Marketing group. Naming conventions should be meaningful, consistent, and designed to support future growth. A poorly named group today can create significant technical debt later if it needs to be split or redefined.&lt;/p&gt;&lt;p&gt;Security groups offer clear advantages over individual user permission assignments. They enable systematic and repeatable permission management that scales with organizational growth. 
When users move between roles or teams, administrators can adjust group memberships rather than managing permissions across individual services, reducing complexity, administrative overhead, and the risk of errors.&lt;/p&gt;&lt;p&gt;Some organizations also choose to create hierarchies of groups, where a parent group holds shared access, and child groups inherit those permissions while adding more specific scopes. This approach can work well if aligned with organizational structure, but it must be handled carefully to avoid unintentionally granting broader access than intended.&lt;/p&gt;&lt;h2 id=&quot;phase-3-mapping-users-to-entraid-groups&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/scaling-secure-data-access-a-systematic-rbac-approach-using-entra-id/#phase-3-mapping-users-to-entraid-groups&quot; class=&quot;heading-anchor&quot;&gt;Phase 3: Mapping Users to EntraID Groups&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;With groups in place, the focus shifts to assigning users to appropriate groups based on their roles and domain access requirements, if applicable.&lt;/p&gt;&lt;p&gt;The &lt;strong&gt;principle of least privilege&lt;/strong&gt; should guide all user assignments, ensuring that individuals receive only the minimum access necessary to perform their current job responsibilities. Start conservatively and expand access only when it’s validated by actual business need and approved by the data platform owner.&lt;/p&gt;&lt;p&gt;Many organizations struggle with group management and hesitate to adopt group-based assignments until they can fully automate the process. However, waiting for perfect automation can delay real security improvements. 
Even manual group assignment that follows the audit rules is more secure and manageable than granting each user access to each resource directly.&lt;/p&gt;&lt;p&gt;Entra ID supports &lt;strong&gt;dynamic membership rules&lt;/strong&gt;, which allow users to be automatically assigned to groups based on attributes like department, job title, or location. For example, analysts in the Finance department can be automatically added to the DataAnalyst_Finance group using:&lt;/p&gt;&lt;p&gt;&lt;code&gt;(user.department -eq &quot;Finance&quot;) and (user.jobTitle -contains &quot;analyst&quot;)&lt;/code&gt;&lt;/p&gt;&lt;p&gt;These rules can also be extended by integrating Entra ID with external systems like Workday, allowing logic based on a richer organizational context to be applied.&lt;/p&gt;&lt;h2 id=&quot;phase-4-access-configuration-and-permissions-setup&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/scaling-secure-data-access-a-systematic-rbac-approach-using-entra-id/#phase-4-access-configuration-and-permissions-setup&quot; class=&quot;heading-anchor&quot;&gt;Phase 4: Access Configuration and Permissions Setup&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;The access configuration phase &lt;strong&gt;translates group memberships into specific permissions&lt;/strong&gt; across the modular data platform components. This phase requires a tools inventory, listing the components in use, and a data assets inventory, where the data catalog proves invaluable.&lt;/p&gt;&lt;p&gt;With the tools inventory (the bill of materials, or BoM) and the data assets inventory in place, the next step is to enable Entra ID as the identity provider. This can be achieved either through native support or via SSO using SAML or OAuth. 
Once configured, RBAC is implemented at the component level by granting permissions to the appropriate groups.&lt;/p&gt;&lt;p&gt;This process should remain transparent to users while ensuring comprehensive logging for security monitoring and compliance purposes.&lt;/p&gt;&lt;h2 id=&quot;phase-5-audit-and-continuous-monitoring&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/scaling-secure-data-access-a-systematic-rbac-approach-using-entra-id/#phase-5-audit-and-continuous-monitoring&quot; class=&quot;heading-anchor&quot;&gt;Phase 5: Audit and Continuous Monitoring&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Regular auditing is essential for effective RBAC maintenance, ensuring that access permissions remain appropriate and aligned with organizational policies. This phase systematically reviews user assignments, group configurations, and access patterns to detect potential security risks or compliance issues. Organizations should define &lt;strong&gt;regular audit schedules&lt;/strong&gt;, typically at monthly or quarterly intervals.&lt;/p&gt;&lt;p&gt;To maintain a robust RBAC setup, follow these steps:&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Step 1: Generate (New) User Overview&lt;/strong&gt;&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;The Data Platform Owner initiates the generation of Entra ID audit reports, which is typically carried out by the Entra ID Team.&lt;/li&gt;&lt;li&gt;The Data Platform Owner verifies the BoM and access levels per role.&lt;/li&gt;&lt;li&gt;The resulting report is shared with the respective data and tool owners.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;strong&gt;Step 2: Review User Overview&lt;/strong&gt;&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;The Data Owner reviews the user access report.&lt;/li&gt;&lt;li&gt;They assess whether each user’s access is still appropriate.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;strong&gt;Step 3: User Access Approval&lt;/strong&gt;&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;If all user 
access is deemed appropriate (approval), the process concludes with a notification sent to relevant stakeholders.&lt;/li&gt;&lt;li&gt;If any user’s access is rejected:&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;The Data/Tool Owner requests that specific users be removed from relevant dashboards or reports.&lt;/li&gt;&lt;li&gt;The affected users are notified of their access removal.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;strong&gt;Step 4: Remove User Access&lt;/strong&gt;&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;The Entra ID Team removes the specified users from the corresponding Entra ID group.&lt;/li&gt;&lt;li&gt;A confirmation notification is sent to finalize the update.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal57&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/wYL-aCJeEe-960.webp 960w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/wYL-aCJeEe-960.jpeg&quot; alt=&quot;rbac workflow&quot; width=&quot;960&quot; height=&quot;540&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;57&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/wYL-aCJeEe-960.webp 960w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/wYL-aCJeEe-960.jpeg&quot; alt=&quot;rbac workflow&quot; width=&quot;960&quot; height=&quot;540&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;h2 id=&quot;final-thoughts&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/scaling-secure-data-access-a-systematic-rbac-approach-using-entra-id/#final-thoughts&quot; class=&quot;heading-anchor&quot;&gt;Final 
Thoughts&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Implementing RBAC in modular data platforms is not a one-time effort but a continuous, &lt;strong&gt;cyclical process&lt;/strong&gt;. The five-phase methodology outlined here offers a scalable, repeatable framework that enhances security, reduces operational overhead, and improves transparency across the data ecosystem.&lt;/p&gt;&lt;p&gt;With Microsoft Entra ID as the backbone, organizations gain enterprise-grade identity management, automation support, and auditable group control. It enables sustainable, compliant access control by supporting the full user and permission lifecycle. When paired with regular audits, this approach ensures that RBAC remains aligned with evolving business needs and security risks, with audit findings feeding back into earlier phases for ongoing refinement.&lt;/p&gt;&lt;/div&gt;
 			</content>
    </entry><entry>
      <title>Getting to Your First Flow Run: Prefect Worker &amp; Deployment Setup</title>
      <link href="https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/" />
      <updated>2025-06-10T11:00:00Z</updated>
      <id>https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/</id>
      <content type="html">
				&lt;nav id=&quot;toc&quot; class=&quot;table-of-contents prose&quot;&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/#data-platform-components-overview&quot;&gt;Data Platform Components Overview&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/#docker-image-for-flows-execution&quot;&gt;Docker Image for Flows Execution&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/#prefect-worker&quot;&gt;Prefect Worker&lt;/a&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/#how-prefect-executes-flow-runs&quot;&gt;How Prefect Executes Flow Runs&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/#prefect-worker-configuration&quot;&gt;Prefect Worker Configuration&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/#example-base-job-template&quot;&gt;Example Base Job Template&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/#prefect-configuration-files&quot;&gt;Prefect Configuration Files&lt;/a&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/#example-prefect-flow&quot;&gt;Example Prefect 
Flow&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/#whats-next&quot;&gt;What’s next?&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/nav&gt;&lt;p&gt;&lt;span id=&quot;toc-skipped&quot; class=&quot;visually-hidden&quot;&gt;&lt;/span&gt;&lt;/p&gt;&lt;div class=&quot;flow prose&quot;&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;You’ve laid the groundwork: the infrastructure is in place. The next logical step is turning that foundation into something functional, running your first data ingestion workflow. That moment when everything connects for the first time can feel like crossing an invisible line: from setup to real-world execution.&lt;br&gt;This article picks up where we left off. It’s the third part in a series designed to guide data engineers through the complete journey of building a modern data platform.&lt;/p&gt;&lt;p&gt;In &lt;a href=&quot;https://thescalableway.com/blog/deploying-prefect-on-any-cloud-using-a-single-virtual-machine/&quot; rel=&quot;noopener&quot;&gt;Part 1&lt;/a&gt;, I explored architectural approaches and proposed a lightweight Kubernetes setup running on a single VM. While it doesn’t offer full high availability, this setup has proven to be a practical starting point, especially for teams with limited cloud-native experience. 
It allows organizations to grow along the data maturity curve without the overhead of more complex solutions.&lt;/p&gt;&lt;p&gt;&lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/&quot; rel=&quot;noopener&quot;&gt;Part 2&lt;/a&gt; focused on provisioning the infrastructure using Terraform on Google Cloud Platform (GCP). We used GCP as an example, but the underlying architectural principles are cloud-agnostic and applicable across providers.&lt;/p&gt;&lt;p&gt;Now that the infrastructure is ready, this article walks through the next milestone: configuring all the components required to execute your first data ingestion workflow.&lt;/p&gt;&lt;h2 id=&quot;data-platform-components-overview&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/#data-platform-components-overview&quot; class=&quot;heading-anchor&quot;&gt;Data Platform Components Overview&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Here’s a high-level overview of a generic data platform architecture (as discussed in previous articles).&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal25&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/NvijIMTwVi-960.webp 960w, https://thescalableway.com/img/NvijIMTwVi-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/NvijIMTwVi-960.jpeg&quot; alt=&quot;data platform architecture&quot; width=&quot;1600&quot; height=&quot;761&quot; srcset=&quot;https://thescalableway.com/img/NvijIMTwVi-960.jpeg 960w, https://thescalableway.com/img/NvijIMTwVi-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button 
data-index=&quot;25&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/NvijIMTwVi-960.webp 960w, https://thescalableway.com/img/NvijIMTwVi-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/NvijIMTwVi-960.jpeg&quot; alt=&quot;data platform architecture&quot; width=&quot;1600&quot; height=&quot;761&quot; srcset=&quot;https://thescalableway.com/img/NvijIMTwVi-960.jpeg 960w, https://thescalableway.com/img/NvijIMTwVi-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;While this article won’t cover the data warehouse or data lake, we will zoom in on the other key components that power workflow orchestration:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;GitHub&lt;/strong&gt; as the version control system&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Prefect Cloud&lt;/strong&gt; as the orchestration layer that manages deployments and their schedules&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Prefect worker&lt;/strong&gt; as the process that picks up scheduled flow runs and executes them inside our infrastructure&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;To set them up, we’ll walk through three elements that serve as the backbone of a modern data platform:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Docker Image for Flows Execution&lt;/strong&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Every Prefect flow needs a controlled and consistent environment to run. A Docker container is well suited to this, as it provides isolation and ensures that all dependencies and runtime configurations are reproducible. 
Containerization is a standard practice in modern data platforms to guarantee reliable flow execution across different environments.&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Prefect Worker&lt;/strong&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Prefect Cloud handles all deployment schedules, but a process within our infrastructure is required to pull and execute these scheduled tasks. This is the responsibility of the Prefect worker. Before diving into more advanced topics, it’s important to understand how to configure and manage the Prefect worker. Later, I will explain further how to set up a base job template that will be used to execute all deployments, ensuring consistent and scalable workflow orchestration.&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Prefect Configuration Files&lt;/strong&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;A well-structured repository is essential for managing deployments and flows efficiently. Ideally, adding a new deployment should require only a few lines of code, making it easy for any team member to contribute. The &lt;code&gt;prefect.yaml&lt;/code&gt; file plays a key role in this process by organizing and codifying deployment configurations, clearly connecting flows, deployments, and infrastructure in a maintainable way.&lt;/p&gt;&lt;p&gt;By focusing on containerized execution, robust workflow orchestration, and clear configuration management, we lay the groundwork for a scalable and maintainable data platform that ensures reliable data ingestion and processing.&lt;/p&gt;&lt;h2 id=&quot;docker-image-for-flows-execution&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/#docker-image-for-flows-execution&quot; class=&quot;heading-anchor&quot;&gt;Docker Image for Flows Execution&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Having each Prefect flow executed in its own isolated environment is essential. 
Without isolation, two flows running side-by-side could conflict—think mismatched library versions, breaking changes, or dependency clashes. Suddenly, what worked yesterday doesn’t work today. Containerization solves this problem elegantly.&lt;/p&gt;&lt;p&gt;Building a dedicated Docker image ensures that flow runs are reproducible and decoupled from the host system. You can test new dependencies in dev, tag the image, and confidently promote it to prod. No more “it worked on my machine” surprises.&lt;/p&gt;&lt;p&gt;For most data teams, starting with the official Prefect image is the way to go. It includes all the Prefect orchestration tools out of the box, so you don’t have to reinvent the wheel. Manage dependencies via &lt;code&gt;uv&lt;/code&gt; and a &lt;code&gt;pyproject.toml&lt;/code&gt;:&lt;/p&gt;&lt;pre class=&quot;language-toml&quot;&gt;&lt;code class=&quot;language-toml&quot;&gt;&lt;span class=&quot;token key property&quot;&gt;dependencies&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;
    &lt;span class=&quot;token string&quot;&gt;&quot;prefect[docker,github,gcp]&amp;gt;=3.3.7&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string&quot;&gt;&quot;dlt[mssql,parquet]&amp;gt;=1.8.1&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string&quot;&gt;&quot;pymssql&amp;gt;=2.3.2&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Once dependencies are set, run &lt;code&gt;uv sync&lt;/code&gt;. This generates a &lt;code&gt;uv.lock&lt;/code&gt; file, locking all versions for reproducibility. This file, along with your &lt;code&gt;pyproject.toml&lt;/code&gt;, is all you need for the build process.&lt;/p&gt;&lt;p&gt;The Dockerfile itself is straightforward. It installs &lt;code&gt;uv&lt;/code&gt;, exports the locked dependencies to a &lt;code&gt;requirements.txt&lt;/code&gt; file, and installs them into the container’s system Python, so no additional virtual environment is needed:&lt;/p&gt;&lt;pre class=&quot;language-docker&quot;&gt;&lt;code class=&quot;language-docker&quot;&gt;&lt;span class=&quot;token instruction&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;FROM&lt;/span&gt; prefecthq/prefect:3.3.7-python3.12&lt;/span&gt;

&lt;span class=&quot;token instruction&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;RUN&lt;/span&gt; pip install uv --no-cache-dir&lt;/span&gt;

&lt;span class=&quot;token instruction&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;COPY&lt;/span&gt; pyproject.toml uv.lock README.md ./&lt;/span&gt;

&lt;span class=&quot;token instruction&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;RUN&lt;/span&gt; uv export --frozen --no-dev --no-editable &amp;gt; requirements.txt&lt;/span&gt;

&lt;span class=&quot;token instruction&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;RUN&lt;/span&gt; uv pip install --no-cache --system --pre -r requirements.txt&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Notice that we copy only the files needed to install dependencies, not the entire repo. The flow code lives outside the image, so the image needs rebuilding only when dependencies change. To build the image and push it to the GitHub Container Registry, you can use:&lt;/p&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;docker&lt;/span&gt; build &lt;span class=&quot;token parameter variable&quot;&gt;-t&lt;/span&gt; ghcr.io/&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;your_github_organisation&lt;span class=&quot;token operator&quot;&gt;&amp;gt;&lt;/span&gt;/edp-flows:&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;tag&lt;span class=&quot;token operator&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;token builtin class-name&quot;&gt;.&lt;/span&gt;

&lt;span class=&quot;token function&quot;&gt;docker&lt;/span&gt; push ghcr.io/&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;your_github_organisation&lt;span class=&quot;token operator&quot;&gt;&amp;gt;&lt;/span&gt;/edp-flows:&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;tag&lt;span class=&quot;token operator&quot;&gt;&amp;gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This is a manual step for now, but we’ll automate it with GitHub Actions soon. In the meantime, you’ve got a solid, reliable foundation for running flows in a clean, isolated environment every single time.&lt;/p&gt;&lt;h2 id=&quot;prefect-worker&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/#prefect-worker&quot; class=&quot;heading-anchor&quot;&gt;Prefect Worker&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Once the Docker image is ready, the next step is orchestrating flow execution. That’s where the Prefect worker comes in. Workers are long-running processes that poll work pools for scheduled flow runs and execute them. 
If you want to dive deeper into the range of possibilities, &lt;a href=&quot;https://docs.prefect.io/v3/deploy/infrastructure-concepts/workers&quot; rel=&quot;noopener&quot;&gt;Prefect’s documentation&lt;/a&gt; is a great resource.&lt;/p&gt;&lt;p&gt;For our setup, we’ll use the Kubernetes worker type and deploy it with the official &lt;a href=&quot;https://github.com/PrefectHQ/prefect-helm/tree/main/charts/prefect-worker&quot; rel=&quot;noopener&quot;&gt;Prefect Helm Chart&lt;/a&gt;.&lt;/p&gt;&lt;h3 id=&quot;how-prefect-executes-flow-runs&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/#how-prefect-executes-flow-runs&quot; class=&quot;heading-anchor&quot;&gt;How Prefect Executes Flow Runs&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Prefect workers are responsible for:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Polling work pools&lt;/strong&gt; for scheduled flow runs&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Spinning up job-specific infrastructure&lt;/strong&gt; using base templates&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Enabling environment-specific configurations&lt;/strong&gt; via Helm charts&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Supporting zero-downtime updates&lt;/strong&gt; when modifying worker settings&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;This dynamic approach means you can flexibly scale, update, and manage your workflow execution environments.&lt;/p&gt;&lt;h3 id=&quot;prefect-worker-configuration&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/#prefect-worker-configuration&quot; class=&quot;heading-anchor&quot;&gt;Prefect Worker Configuration&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Here’s a minimal &lt;code&gt;values.yaml&lt;/code&gt; example, sufficient to get a Prefect worker running via Helm:&lt;/p&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key 
atrule&quot;&gt;worker&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;image&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;repository&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; prefecthq/prefect
    &lt;span class=&quot;token key atrule&quot;&gt;prefectTag&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; 3.3.7&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;python3.12&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;kubernetes
    &lt;span class=&quot;token key atrule&quot;&gt;pullPolicy&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; IfNotPresent

  &lt;span class=&quot;token key atrule&quot;&gt;config&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;workPool&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;edp-work-pool&quot;&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;workQueues&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;default&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;edp-worker&quot;&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;baseJobTemplate&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;configuration&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &amp;lt;&amp;lt; BASE JOB TEMPLATE WILL BE PROVIDED IN THE NEXT SECTION &lt;span class=&quot;token punctuation&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;gt;&lt;/span&gt;

  &lt;span class=&quot;token key atrule&quot;&gt;cloudApiConfig&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;accountId&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;7e3c367b-143a-86e2-b92f-6i414816c39b&quot;&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;workspaceId&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;c506815f-qe83-42dc-b905-re6bcfb68c52&quot;&lt;/span&gt;
    &lt;span class=&quot;token key atrule&quot;&gt;apiKeySecret&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; prefect&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;api&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;key&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;secret
      &lt;span class=&quot;token key atrule&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; PREFECT_API_KEY&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;strong&gt;Each of the three key sections plays a vital role:&lt;/strong&gt;&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Image:&lt;/strong&gt; Specifies the Prefect image used to run the worker. This is not the flow’s image that runs your business logic; that will be defined in the &lt;code&gt;baseJobTemplate&lt;/code&gt;.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Config:&lt;/strong&gt; Defines work pool, queues, worker name, and most importantly, the &lt;code&gt;baseJobTemplate&lt;/code&gt;. An example of a working base job template is shown in the next section.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;cloudApiConfig:&lt;/strong&gt; Provides the cloud workspace details, such as account and workspace IDs. You’ll also need to configure a service account in Prefect Cloud and store its API token as a Kubernetes secret (&lt;code&gt;prefect-api-key-secret&lt;/code&gt; with the key &lt;code&gt;PREFECT_API_KEY&lt;/code&gt;).&lt;/li&gt;&lt;/ul&gt;&lt;h3 id=&quot;example-base-job-template&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/#example-base-job-template&quot; class=&quot;heading-anchor&quot;&gt;Example Base Job Template&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;In a Kubernetes environment, the base job template defines how Prefect spins up infrastructure for each flow run. 
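&lt;/p&gt;&lt;p&gt;A base job template has two parts: &lt;code&gt;variables&lt;/code&gt;, which declares the values a deployment may set together with their defaults, and &lt;code&gt;job_configuration&lt;/code&gt;, which refers to those values through &lt;code&gt;{{ placeholder }}&lt;/code&gt; strings. To make that relationship concrete, here is a minimal Python sketch of how such placeholders could be filled. It is an illustration only, not Prefect’s actual implementation: the &lt;code&gt;fill&lt;/code&gt; and &lt;code&gt;render_job_configuration&lt;/code&gt; helpers, the trimmed template, and the &lt;code&gt;example-org&lt;/code&gt; image names are all hypothetical.&lt;/p&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;from typing import Any, Optional

# Trimmed, hypothetical template: &quot;variables&quot; declares defaults,
# &quot;job_configuration&quot; refers to them as {{ placeholder }} strings.
BASE_JOB_TEMPLATE = {
    &quot;variables&quot;: {
        &quot;properties&quot;: {
            &quot;name&quot;: {&quot;default&quot;: &quot;edp-k8s-job&quot;},
            &quot;image&quot;: {&quot;default&quot;: &quot;ghcr.io/example-org/edp-flows:latest&quot;},
            &quot;namespace&quot;: {&quot;default&quot;: &quot;prefect&quot;},
        }
    },
    &quot;job_configuration&quot;: {
        &quot;name&quot;: &quot;{{ name }}&quot;,
        &quot;namespace&quot;: &quot;{{ namespace }}&quot;,
        &quot;job_manifest&quot;: {
            &quot;spec&quot;: {
                &quot;template&quot;: {&quot;spec&quot;: {&quot;containers&quot;: [{&quot;image&quot;: &quot;{{ image }}&quot;}]}}
            }
        },
    },
}


def fill(node: Any, values: dict) -&gt; Any:
    &quot;&quot;&quot;Recursively replace {{ key }} placeholders inside string values.&quot;&quot;&quot;
    if isinstance(node, dict):
        return {key: fill(value, values) for key, value in node.items()}
    if isinstance(node, list):
        return [fill(item, values) for item in node]
    if isinstance(node, str):
        for key, value in values.items():
            node = node.replace(&quot;{{ &quot; + key + &quot; }}&quot;, str(value))
    return node


def render_job_configuration(template: dict, overrides: Optional[dict] = None) -&gt; dict:
    &quot;&quot;&quot;Merge variable defaults with overrides, then fill the placeholders.&quot;&quot;&quot;
    properties = template[&quot;variables&quot;][&quot;properties&quot;]
    defaults = {key: spec[&quot;default&quot;] for key, spec in properties.items() if &quot;default&quot; in spec}
    return fill(template[&quot;job_configuration&quot;], {**defaults, **(overrides or {})})


# A deployment overrides only the image; name and namespace keep their defaults.
job = render_job_configuration(
    BASE_JOB_TEMPLATE, {&quot;image&quot;: &quot;ghcr.io/example-org/edp-flows:1.2.0&quot;}
)
print(job[&quot;name&quot;], job[&quot;namespace&quot;])
print(job[&quot;job_manifest&quot;][&quot;spec&quot;][&quot;template&quot;][&quot;spec&quot;][&quot;containers&quot;][0][&quot;image&quot;])&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The point of the sketch is the division of labour: the template stays fixed at the work pool level, while each deployment changes only the few values it cares about, such as the image tag.&lt;/p&gt;&lt;p&gt;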
The minimal template below, for a single-pod job, lets you set:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;The job name (helpful for distinguishing jobs and pods in Kubernetes)&lt;/li&gt;&lt;li&gt;The image name and tag used in the pod&lt;/li&gt;&lt;li&gt;The namespace for job creation&lt;/li&gt;&lt;li&gt;Other Kubernetes settings, such as image pull secrets (&lt;code&gt;imagePullSecrets&lt;/code&gt;), the retry limit (&lt;code&gt;backoffLimit&lt;/code&gt;), and job cleanup (&lt;code&gt;ttlSecondsAfterFinished&lt;/code&gt;)&lt;/li&gt;&lt;/ul&gt;&lt;pre class=&quot;language-json&quot;&gt;&lt;code class=&quot;language-json&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;&quot;variables&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token property&quot;&gt;&quot;type&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;object&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token property&quot;&gt;&quot;properties&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;token property&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;token property&quot;&gt;&quot;type&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;string&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;token property&quot;&gt;&quot;title&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Name given to job created by a worker (key: name)&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;token property&quot;&gt;&quot;default&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;edp-k8s-job&quot;&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token property&quot;&gt;&quot;image&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;token property&quot;&gt;&quot;type&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;string&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;token property&quot;&gt;&quot;title&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Image name and tag that will execute flows, to be provided from deployment (key: image)&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;token property&quot;&gt;&quot;default&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;ghcr.io/&amp;lt;your_github_organisation&amp;gt;/edp-flows:&amp;lt;tag&amp;gt;&quot;&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token property&quot;&gt;&quot;namespace&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;token property&quot;&gt;&quot;type&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;string&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;token property&quot;&gt;&quot;title&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Namespace name where jobs will be scheduled (key: namespace)&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;token property&quot;&gt;&quot;default&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;prefect&quot;&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;&quot;job_configuration&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token property&quot;&gt;&quot;env&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token property&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;{{ name }}&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token property&quot;&gt;&quot;labels&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token property&quot;&gt;&quot;command&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token property&quot;&gt;&quot;namespace&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;{{ namespace }}&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token property&quot;&gt;&quot;job_manifest&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;token property&quot;&gt;&quot;kind&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Job&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token property&quot;&gt;&quot;spec&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;token property&quot;&gt;&quot;template&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
          &lt;span class=&quot;token property&quot;&gt;&quot;spec&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;token property&quot;&gt;&quot;volumes&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;token property&quot;&gt;&quot;containers&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;
              &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;token property&quot;&gt;&quot;env&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token property&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;prefect-job&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token property&quot;&gt;&quot;image&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;{{ image }}&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token property&quot;&gt;&quot;imagePullPolicy&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Always&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token property&quot;&gt;&quot;envFrom&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token property&quot;&gt;&quot;volumeMounts&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
              &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
            &lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;token property&quot;&gt;&quot;completions&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;token property&quot;&gt;&quot;parallelism&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;token property&quot;&gt;&quot;tolerations&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;token property&quot;&gt;&quot;restartPolicy&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Never&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;token property&quot;&gt;&quot;imagePullSecrets&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;
              &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;token property&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;reg-creds&quot;&lt;/span&gt;
              &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
            &lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
          &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
        &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;token property&quot;&gt;&quot;backoffLimit&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;token property&quot;&gt;&quot;ttlSecondsAfterFinished&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;7200&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token property&quot;&gt;&quot;metadata&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;token property&quot;&gt;&quot;namespace&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;{{ namespace }}&quot;&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;token property&quot;&gt;&quot;apiVersion&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;batch/v1&quot;&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token property&quot;&gt;&quot;stream_output&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;true&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;You can customize this further as needed, and include the template in your Prefect worker configuration. To deploy a worker, it’s enough to run the helm command:&lt;/p&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;helm upgrade prefect-worker &lt;span class=&quot;token parameter variable&quot;&gt;--install&lt;/span&gt; prefect/prefect-worker &lt;span class=&quot;token punctuation&quot;&gt;&#92;&lt;/span&gt;
            &lt;span class=&quot;token parameter variable&quot;&gt;-n&lt;/span&gt; prefect &lt;span class=&quot;token punctuation&quot;&gt;&#92;&lt;/span&gt;
            &lt;span class=&quot;token parameter variable&quot;&gt;-f&lt;/span&gt; &lt;span class=&quot;token variable&quot;&gt;${{ helm_values_path }&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Once deployed, your worker should appear in the Work Pool section in Prefect Cloud.&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal26&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/BKcAkHKq2_-639.webp 639w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/BKcAkHKq2_-639.jpeg&quot; alt=&quot;prefect work pool&quot; width=&quot;639&quot; height=&quot;294&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;26&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/BKcAkHKq2_-639.webp 639w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/BKcAkHKq2_-639.jpeg&quot; alt=&quot;prefect work pool&quot; width=&quot;639&quot; height=&quot;294&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;The &lt;code&gt;baseJobTemplate&lt;/code&gt; is also exposed as a config map in your Kubernetes cluster. To follow best practices, manage all worker configuration changes as infrastructure-as-code. 
Use GitHub workflows to apply updates automatically, reducing the risk of human error from manual changes in Prefect Cloud.&lt;/p&gt;&lt;p&gt;For more security, assign the service account used to register the worker the “Developer” role, and limit regular developers to read-only access within the Work Pool. These permission settings can be configured in your Prefect Cloud account settings:&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal27&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/uBynYwIUGs-322.webp 322w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/uBynYwIUGs-322.jpeg&quot; alt=&quot;prefect cloud workers permission&quot; width=&quot;322&quot; height=&quot;123&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;27&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/uBynYwIUGs-322.webp 322w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/uBynYwIUGs-322.jpeg&quot; alt=&quot;prefect cloud workers permission&quot; width=&quot;322&quot; height=&quot;123&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;Once your Prefect worker is up and running, you’re ready to register your first deployment.&lt;/p&gt;&lt;h2 id=&quot;prefect-configuration-files&quot;&gt;&lt;a 
href=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/#prefect-configuration-files&quot; class=&quot;heading-anchor&quot;&gt;Prefect Configuration Files&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;The &lt;code&gt;prefect.yaml&lt;/code&gt; file describes base settings for all deployments, with additional instructions for preparing the execution environment for a deployment run. It can be initialized with the &lt;code&gt;prefect init&lt;/code&gt; command, and after filling in the data, you might end up with a file like this:&lt;/p&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; prefect&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;deployments
&lt;span class=&quot;token key atrule&quot;&gt;prefect-version&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; 3.3.7

&lt;span class=&quot;token key atrule&quot;&gt;definitions&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  schedules&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    cron_default&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token important&quot;&gt;&amp;amp;cron_default&lt;/span&gt;
      cron&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;0 0 * * *&quot;&lt;/span&gt;
      timezone&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;UTC&quot;&lt;/span&gt;
      active&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token boolean important&quot;&gt;false&lt;/span&gt;

&lt;span class=&quot;token key atrule&quot;&gt;build&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token null important&quot;&gt;null&lt;/span&gt;
&lt;span class=&quot;token key atrule&quot;&gt;push&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token null important&quot;&gt;null&lt;/span&gt;

&lt;span class=&quot;token key atrule&quot;&gt;pull&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;prefect.deployments.steps.set_working_directory&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      directory&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; /opt/prefect
  &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;prefect.deployments.steps.git_clone&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      repository&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; https&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;//github.com/&amp;lt;your_github_organisation&lt;span class=&quot;token punctuation&quot;&gt;&amp;gt;&lt;/span&gt;/edp&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;flows.git
      access_token&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;{{ prefect.blocks.github-credentials.edp-github-credentials.token }}&quot;&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;prefect.deployments.steps.run_shell_script&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      directory&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;/opt/prefect/edp-flows&quot;&lt;/span&gt;
      script&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;|&lt;/span&gt;
        uv pip install &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;no&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;cache &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt;system .

&lt;span class=&quot;token key atrule&quot;&gt;deployments&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; hello_world
    description&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Test hello-world deployment.&quot;&lt;/span&gt;
    schedules&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token key atrule&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token important&quot;&gt;*cron_default&lt;/span&gt;
        cron&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;5 0 * * *&quot;&lt;/span&gt;
    entrypoint&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; src/edp_flows/flows/hello_world.py&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;hello_world_flow
    parameters&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
      text&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Hello, world!&quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;strong&gt;The most notable sections are:&lt;/strong&gt;&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Definitions:&lt;/strong&gt; Shared properties such as schedules.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;build/push:&lt;/strong&gt; Relevant only if you need to build a fresh image with each deployment; not applicable in our case (Docker images are maintained separately from prefect.yaml).&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pull:&lt;/strong&gt; Clones the repository to ensure each deployment uses the latest code. During development, it can be configured to target specific branches. Finally, it installs all Python modules so they’re available during execution.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Deployments:&lt;/strong&gt; A structured list of deployments, each with customizable parameters, allowing the same flow to be reused across multiple scenarios.&lt;/li&gt;&lt;/ul&gt;&lt;h3 id=&quot;example-prefect-flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/#example-prefect-flow&quot; class=&quot;heading-anchor&quot;&gt;Example Prefect Flow&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;As Prefect flows aren’t covered in detail here, we’ll use a simple “Hello, World!” example for illustration. In your actual use case, this is where you would implement the logic for your first ingestion workflow, tailored to your specific data source and target (such as a data warehouse or lake). Here is &lt;code&gt;hello_world.py&lt;/code&gt;:&lt;/p&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token triple-quoted-string string&quot;&gt;&quot;&quot;&quot;A flow to demonstrate how to log messages in Prefect flows.&quot;&quot;&quot;&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; prefect &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; flow&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; get_run_logger&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; task

&lt;span class=&quot;token decorator annotation punctuation&quot;&gt;@task&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;log_prints&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token boolean&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;print_log_prints&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;text&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token triple-quoted-string string&quot;&gt;&quot;&quot;&quot;Print the given text using the regular Python print() function.

    The `log_prints=True` task parameter forwards print() output to the Prefect logs.
    &quot;&quot;&quot;&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string-interpolation&quot;&gt;&lt;span class=&quot;token string&quot;&gt;f&quot;log_prints=True: &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;text&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;token comment&quot;&gt;# noqa: T201&lt;/span&gt;

&lt;span class=&quot;token decorator annotation punctuation&quot;&gt;@task&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;log_prefect_run_logger&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;text&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token triple-quoted-string string&quot;&gt;&quot;&quot;&quot;Log the given text using Prefect&#39;s run logger.&quot;&quot;&quot;&lt;/span&gt;
    logger &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; get_run_logger&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    logger&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;info&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string-interpolation&quot;&gt;&lt;span class=&quot;token string&quot;&gt;f&quot;Prefect runtime logger: &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;text&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;


&lt;span class=&quot;token decorator annotation punctuation&quot;&gt;@flow&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;hello_world_flow&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;text&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token triple-quoted-string string&quot;&gt;&quot;&quot;&quot;Demonstrate how to log messages in Prefect flows.&quot;&quot;&quot;&lt;/span&gt;
    print_log_prints&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;text&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    log_prefect_run_logger&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;text&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Register the flow with:&lt;/p&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;prefect --no-prompt deploy &lt;span class=&quot;token punctuation&quot;&gt;&#92;&lt;/span&gt;
        &lt;span class=&quot;token parameter variable&quot;&gt;--name&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&lt;span class=&quot;token variable&quot;&gt;$deployment_name&lt;/span&gt;&quot;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;&#92;&lt;/span&gt;
        &lt;span class=&quot;token parameter variable&quot;&gt;--tag&lt;/span&gt; &lt;span class=&quot;token variable&quot;&gt;$tag&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;&#92;&lt;/span&gt;
        &lt;span class=&quot;token parameter variable&quot;&gt;--pool&lt;/span&gt; &lt;span class=&quot;token variable&quot;&gt;$work_pool&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;&#92;&lt;/span&gt;
        --job-variable &lt;span class=&quot;token assign-left variable&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token variable&quot;&gt;$deployment_name&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The registered deployment includes a schedule, but it’s disabled by default. To enable it, either do so manually in Prefect Cloud or use the following commands:&lt;/p&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;prefect deployment schedule &lt;span class=&quot;token function&quot;&gt;ls&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;flow_name&lt;span class=&quot;token operator&quot;&gt;&amp;gt;&lt;/span&gt;/&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;deployment_name&lt;span class=&quot;token operator&quot;&gt;&amp;gt;&lt;/span&gt;
prefect deployment schedule resume &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;flow_name&lt;span class=&quot;token operator&quot;&gt;&amp;gt;&lt;/span&gt;/&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;deployment_name&lt;span class=&quot;token operator&quot;&gt;&amp;gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;After running the first command, you should see a view like this:&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal28&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/mFAqNsbicn-720.webp 720w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/mFAqNsbicn-720.jpeg&quot; alt=&quot;prefect deployment schedule&quot; width=&quot;720&quot; height=&quot;98&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;28&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/mFAqNsbicn-720.webp 720w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/mFAqNsbicn-720.jpeg&quot; alt=&quot;prefect deployment schedule&quot; width=&quot;720&quot; height=&quot;98&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;After enabling the schedule, it should appear as Active, and Prefect will trigger the deployment every day at 00:05 UTC, as set by the &lt;code&gt;5 0 * * *&lt;/code&gt; cron expression. To test a new deployment, you can manually trigger it from Prefect Cloud. 
Once your deployment runs successfully, you will see logs from it like the following:&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal29&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/fI-tbv8ODV-960.webp 960w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/fI-tbv8ODV-960.jpeg&quot; alt=&quot;deployment run prefect&quot; width=&quot;960&quot; height=&quot;411&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;29&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/fI-tbv8ODV-960.webp 960w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/fI-tbv8ODV-960.jpeg&quot; alt=&quot;deployment run prefect&quot; width=&quot;960&quot; height=&quot;411&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;h2 id=&quot;conclusion&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/#conclusion&quot; class=&quot;heading-anchor&quot;&gt;Conclusion&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;With this setup, you now have everything in place to execute your first data ingestion flow. 
You’ve:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;Built a containerized flow execution environment&lt;/li&gt;&lt;li&gt;Deployed a scalable Prefect worker&lt;/li&gt;&lt;li&gt;Defined clean, reusable deployment configurations&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;This architecture gives you flexibility and control across environments and sets the stage for more advanced workflows.&lt;/p&gt;&lt;p&gt;While the example used here is a simple “Hello, World!” flow, the same deployment structure can be applied to your real data ingestion workflows. To run your first actual ingestion pipeline, all you need to do is replace the flow logic with code that connects to your data source and writes to your destination (like a data warehouse or lake). The orchestration, environment, and deployment pieces remain the same.&lt;/p&gt;&lt;h4 id=&quot;whats-next&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/getting-to-your-first-flow-run-prefect-worker-and-deployment-setup/#whats-next&quot; class=&quot;heading-anchor&quot;&gt;What’s next?&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;In the next article, I’ll show you how to automate this entire process using GitHub Actions, turning this manual setup into a streamlined CI/CD pipeline your whole team can rely on.&lt;/p&gt;&lt;/div&gt;
 			</content>
    </entry><entry>
      <title>Roles in the Context of the Analytics Workflow</title>
      <link href="https://thescalableway.com/blog/roles-in-the-context-of-the-analytics-workflow/" />
      <updated>2025-03-27T10:47:00Z</updated>
      <id>https://thescalableway.com/blog/roles-in-the-context-of-the-analytics-workflow/</id>
      <content type="html">
				&lt;nav id=&quot;toc&quot; class=&quot;table-of-contents prose&quot;&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/roles-in-the-context-of-the-analytics-workflow/#key-roles&quot;&gt;Key Roles&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/roles-in-the-context-of-the-analytics-workflow/#measuring-the-impact-of-each-role&quot;&gt;Measuring the Impact of Each Role&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/roles-in-the-context-of-the-analytics-workflow/#understanding-the-need-for-roles&quot;&gt;Understanding the Need for Roles&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/roles-in-the-context-of-the-analytics-workflow/#working-with-an-external-partner&quot;&gt;Working With an External Partner&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/roles-in-the-context-of-the-analytics-workflow/#conclusions&quot;&gt;Conclusions&lt;/a&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/roles-in-the-context-of-the-analytics-workflow/#resources&quot;&gt;Resources&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/nav&gt;&lt;p&gt;&lt;span id=&quot;toc-skipped&quot; class=&quot;visually-hidden&quot;&gt;&lt;/span&gt;&lt;/p&gt;&lt;div class=&quot;flow prose&quot;&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;An analytics workflow documents the journey from raw data to production-ready data models, encompassing development and testing phases. A critical component is governance, implemented through a Pull Request approval process that facilitates regular code reviews and prevents technical debt accumulation. 
This structured approach ensures quality and maintainability while supporting collaborative development.&lt;/p&gt;&lt;p&gt;To manage this governance effectively, several roles are typically involved. It’s becoming increasingly rare to see a single analyst handle the entire end-to-end process of creating a report or data model. Instead, the trend is moving toward a growing number of specialized roles, each with distinct responsibilities.&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal5&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/MpiK_Rgewv-960.webp 960w, https://thescalableway.com/img/MpiK_Rgewv-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/MpiK_Rgewv-960.jpeg&quot; alt=&quot;typical code-based analytics workflow&quot; width=&quot;1600&quot; height=&quot;520&quot; srcset=&quot;https://thescalableway.com/img/MpiK_Rgewv-960.jpeg 960w, https://thescalableway.com/img/MpiK_Rgewv-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;5&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/MpiK_Rgewv-960.webp 960w, https://thescalableway.com/img/MpiK_Rgewv-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/MpiK_Rgewv-960.jpeg&quot; alt=&quot;typical code-based analytics workflow&quot; width=&quot;1600&quot; height=&quot;520&quot; srcset=&quot;https://thescalableway.com/img/MpiK_Rgewv-960.jpeg 960w, https://thescalableway.com/img/MpiK_Rgewv-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;h2 id=&quot;key-roles&quot;&gt;&lt;a 
href=&quot;https://thescalableway.com/blog/roles-in-the-context-of-the-analytics-workflow/#key-roles&quot; class=&quot;heading-anchor&quot;&gt;Key Roles&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;The analytics workflow involves several key roles, with data analysts occupying a particularly central position. Data analysts combine technical skills with deep business domain knowledge, giving them unique insight into business models and challenges.&lt;/p&gt;&lt;p&gt;In contrast, data engineers and data platform engineers typically focus on technical implementation rather than direct business interaction. While aligning the data platform roadmap and investments with business value remains strategically important for leadership, this alignment happens at a higher level and doesn’t directly impact the day-to-day analytics workflow operations.&lt;/p&gt;&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Role&lt;/th&gt;&lt;th&gt;Tasks&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Data Analyst&lt;/td&gt;&lt;td&gt;- Understand business requirements&lt;br&gt;- Analyze data in intermediate and mart layers&lt;br&gt;- Develop SQL queries and transformations&lt;br&gt;- Create and maintain metadata documentation&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Data Platform Engineer&lt;/td&gt;&lt;td&gt;- Monitor and support infrastructure resources&lt;br&gt;- Maintain CI/CD pipelines&lt;br&gt;- Manage network infrastructure&lt;br&gt;- Implement cybersecurity measures&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Data Engineer&lt;/td&gt;&lt;td&gt;- Design and develop data pipelines&lt;br&gt;- Maintain and optimize data flows&lt;br&gt;- Schedule and orchestrate data processing&lt;br&gt;- Implement data ingestion processes&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;h2 id=&quot;measuring-the-impact-of-each-role&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/roles-in-the-context-of-the-analytics-workflow/#measuring-the-impact-of-each-role&quot; 
class=&quot;heading-anchor&quot;&gt;Measuring the Impact of Each Role&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Because each role tackles different problems, its performance and impact should be measured differently. These measures also inform operational rules, such as notification rules and issue and incident prioritization rules, that increase the reliability of production.&lt;/p&gt;&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Role&lt;/th&gt;&lt;th&gt;Measure&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Data Analyst&lt;/td&gt;&lt;td&gt;- Understanding of business domain&lt;br&gt;- Business satisfaction&lt;br&gt;- ROI from data initiatives&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Data Platform Engineer&lt;/td&gt;&lt;td&gt;- Speed of new data platform features&lt;br&gt;- Reliability of the data platform&lt;br&gt;- Data analysts’ support and satisfaction&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Data Engineer&lt;/td&gt;&lt;td&gt;- Speed of new data ingestions&lt;br&gt;- Reliability of data pipelines&lt;br&gt;- Data analysts’ support and satisfaction&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;h2 id=&quot;understanding-the-need-for-roles&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/roles-in-the-context-of-the-analytics-workflow/#understanding-the-need-for-roles&quot; class=&quot;heading-anchor&quot;&gt;Understanding the Need for Roles&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Analytics teams often combine roles or leave them loosely defined. This approach is understandable, and sometimes even beneficial, in the early stages of an analytics initiative. After all, when starting out, the priority is delivering business value quickly, and formal roles and approval processes can slow things down.&lt;/p&gt;&lt;p&gt;However, this lack of clearly defined roles and boundaries typically creates challenges as the analytics function matures. 
Common issues include:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;Blurred lines between exploratory analytics work and production pipeline operations make it difficult to maintain service levels&lt;/li&gt;&lt;li&gt;Insufficient knowledge transfer mechanisms, including limited documentation, unclear onboarding processes, and a lack of backup coverage for key roles&lt;/li&gt;&lt;li&gt;Team friction arising from ambiguous responsibilities and overlapping ownership&lt;/li&gt;&lt;li&gt;An overemphasis on technical tools and implementation details, rather than addressing the more fundamental needs of role clarity and process alignment&lt;/li&gt;&lt;/ul&gt;&lt;h2 id=&quot;working-with-an-external-partner&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/roles-in-the-context-of-the-analytics-workflow/#working-with-an-external-partner&quot; class=&quot;heading-anchor&quot;&gt;Working With an External Partner&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;In this workflow, there are already quite a few roles and tasks, and in reality, even more can be involved. For example, handling data privacy in Europe requires a solid understanding of GDPR. Given the range of responsibilities and the breadth of expertise needed to run a data department effectively, many data and analytics teams choose to rely on external partners. However, this isn’t always straightforward. External teams often create strong and rigid boundaries between themselves (the extended team) and the customer’s in-house team (the internal team), which can hinder collaboration.&lt;/p&gt;&lt;p&gt;One practical way to navigate the tension between collaboration and rigid boundaries is to establish a clear RACI matrix from the very beginning. This matrix serves as a shared reference to define roles and responsibilities, helping both internal and extended teams understand who is Responsible, Accountable, Consulted, and Informed for each task. 
It provides structure without creating silos, enabling smoother handovers and aligned expectations.&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal6&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;figure&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/y2O8aXqAtv-960.webp 960w, https://thescalableway.com/img/y2O8aXqAtv-1498.webp 1498w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/y2O8aXqAtv-960.jpeg&quot; alt=&quot;data analytics team RACI Matrix&quot; title=&quot;Example of RACI Matrix in the Analytics Workflow&quot; width=&quot;1498&quot; height=&quot;678&quot; srcset=&quot;https://thescalableway.com/img/y2O8aXqAtv-960.jpeg 960w, https://thescalableway.com/img/y2O8aXqAtv-1498.jpeg 1498w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;figcaption&gt;Example of RACI Matrix in the Analytics Workflow&lt;/figcaption&gt;&lt;/figure&gt;&lt;/dialog&gt;&lt;button data-index=&quot;6&quot;&gt;&lt;figure&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/y2O8aXqAtv-960.webp 960w, https://thescalableway.com/img/y2O8aXqAtv-1498.webp 1498w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/y2O8aXqAtv-960.jpeg&quot; alt=&quot;data analytics team RACI Matrix&quot; title=&quot;Example of RACI Matrix in the Analytics Workflow&quot; width=&quot;1498&quot; height=&quot;678&quot; srcset=&quot;https://thescalableway.com/img/y2O8aXqAtv-960.jpeg 960w, https://thescalableway.com/img/y2O8aXqAtv-1498.jpeg 1498w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;figcaption&gt;Example of RACI Matrix in the Analytics Workflow&lt;/figcaption&gt;&lt;/figure&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;h2 id=&quot;conclusions&quot;&gt;&lt;a 
href=&quot;https://thescalableway.com/blog/roles-in-the-context-of-the-analytics-workflow/#conclusions&quot; class=&quot;heading-anchor&quot;&gt;Conclusions&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;A well-structured analytics workflow is essential for turning raw data into reliable insights. As analytics initiatives mature, the need for governance, role clarity, and collaboration becomes increasingly important. Specialized roles such as data analysts, data engineers, and platform engineers each contribute unique expertise, and their impact should be measured differently to reflect their responsibilities.&lt;/p&gt;&lt;p&gt;Clearly defined roles and processes—such as code reviews, CI/CD practices, and RACI matrices—not only support governance and maintainability but also foster collaboration across internal and external teams. While early-stage flexibility is useful, long-term success in data and analytics depends on thoughtful structure, cross-functional alignment, and a shared understanding of who does what and why.&lt;/p&gt;&lt;h3 id=&quot;resources&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/roles-in-the-context-of-the-analytics-workflow/#resources&quot; class=&quot;heading-anchor&quot;&gt;Resources&lt;/a&gt;&lt;/h3&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;a href=&quot;https://docs.google.com/spreadsheets/d/1UFddBx-2mKSE8h-TypmYH4sdM7HN2B6TWr0AgWqzLcU/edit?usp=sharing&quot; rel=&quot;noopener&quot;&gt;Example of RACI Matrix&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;
 			</content>
    </entry><entry>
      <title>How to Set Up Data Platform Infrastructure on Google Cloud Platform with Terraform</title>
      <link href="https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/" />
      <updated>2025-03-05T13:31:00Z</updated>
      <id>https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/</id>
      <content type="html">
				&lt;nav id=&quot;toc&quot; class=&quot;table-of-contents prose&quot;&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/#why-choose-a-server-based-approach-with-a-single-vm&quot;&gt;Why Choose a Server-Based Approach with a Single VM?&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/#infrastructure-overview&quot;&gt;Infrastructure Overview&lt;/a&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/#understanding-google-cloud-identity-aware-proxy-iap&quot;&gt;Understanding Google Cloud Identity-Aware Proxy (IAP)&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/#phase-1-securing-the-essentials&quot;&gt;Phase 1: Securing the Essentials&lt;/a&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/#step-1-installing-terraform&quot;&gt;Step 1: Installing Terraform&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/#step-2-gcloud-cli-installation&quot;&gt;Step 2: gcloud CLI installation&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/#step-3-setting-up-gcp-service-account&quot;&gt;Step 3: Setting up GCP Service Account&lt;/a&gt;&lt;/li&gt;&lt;li 
class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/#step-4-downloading-the-json-key-for-the-service-account&quot;&gt;Step 4: Downloading the JSON Key for the Service Account&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/#step-5-activating-the-service-account&quot;&gt;Step 5: Activating the Service Account&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/#step-6-generating-hmac-key-to-buckets&quot;&gt;Step 6: Generating HMAC Key to Buckets&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/#step-7-enabling-compute-engine-api-and-cloud-resource-manager-api&quot;&gt;Step 7: Enabling Compute Engine API and Cloud Resource Manager API&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/#step-8-setting-up-a-remote-state-for-terraform&quot;&gt;Step 8: Setting Up a Remote State for Terraform&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/#step-9-exporting-credentials-and-setting-up-a-new-bucket&quot;&gt;Step 9: Exporting Credentials and Setting up a New Bucket&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a 
href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/#phase-2-installing-and-deploying-infrastructure-with-terraform&quot;&gt;Phase 2: Installing &amp;amp; Deploying Infrastructure with Terraform&lt;/a&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/#terraform-files&quot;&gt;Terraform Files&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/#step-1-infrastructure-provisioning-with-terraform&quot;&gt;Step 1: Infrastructure Provisioning with Terraform&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/#step-2-set-up-verification&quot;&gt;Step 2: Set up Verification&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/nav&gt;&lt;p&gt;&lt;span id=&quot;toc-skipped&quot; class=&quot;visually-hidden&quot;&gt;&lt;/span&gt;&lt;/p&gt;&lt;div class=&quot;flow prose&quot;&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;Setting up a solid, scalable data platform is crucial for organizations looking to get the most out of their data. 
Building upon our previous discussion on architectural considerations for &lt;a href=&quot;https://thescalableway.com/insights/deploying-prefect-on-any-cloud-using-a-single-virtual-machine/&quot; rel=&quot;noopener&quot;&gt;deploying Prefect on various cloud platforms&lt;/a&gt;, this article will walk you through building your data platform infrastructure on Google Cloud Platform (GCP) using Terraform.&lt;/p&gt;&lt;p&gt;Our focus is on a server-based approach that uses a single Virtual Machine (VM), a simple yet powerful starting point for organizations that don’t need to dive into complex source systems or full-blown data warehouses just yet. This approach offers an easy entry point with plenty of room to grow as your data needs evolve.&lt;/p&gt;&lt;p&gt;Because this article uses a substantial amount of code, you can find all the relevant code samples in our &lt;a href=&quot;https://github.com/thescalableway/dataplatform-gcp-terraform&quot; rel=&quot;noopener&quot;&gt;dedicated repository&lt;/a&gt;.&lt;/p&gt;&lt;h2 id=&quot;why-choose-a-server-based-approach-with-a-single-vm&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/#why-choose-a-server-based-approach-with-a-single-vm&quot; class=&quot;heading-anchor&quot;&gt;Why Choose a Server-Based Approach with a Single VM?&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Choosing a server-based approach with a single VM comes with several advantages:&lt;/p&gt;&lt;ol class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Cost-effectiveness:&lt;/strong&gt; A single VM setup is often a more budget-friendly option for initial deployments or smaller-scale projects.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Simplified management:&lt;/strong&gt; Fewer components mean easier maintenance and troubleshooting.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Flexibility:&lt;/strong&gt; This approach offers the ability to easily expand or modify your infrastructure as requirements 
change.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Learning curve:&lt;/strong&gt; For teams new to cloud infrastructure, starting with a single VM can be less overwhelming and serve as a stepping stone to more complex architectures.&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;This guide will walk you through the process of setting up key components of our data platform infrastructure on GCP. You’ll learn how to configure the VPC and subnets, set up Compute Engine instances, configure firewall rules, secure SSH access with Identity-Aware Proxy (IAP), establish internet connectivity with Cloud Router and NAT, and store the state files in Cloud Storage. We’ll also dive into the specifics of GCP’s Identity-Aware Proxy, exploring its crucial role in enhancing the security of our data platform.&lt;/p&gt;&lt;p&gt;By using Terraform to manage infrastructure as code, we ensure that our setup is reproducible, version-controlled, and easy to manage. This not only streamlines the initial deployment but also makes scaling and future updates much more efficient.&lt;/p&gt;&lt;p&gt;Let’s get started on building a solid, scalable data platform infrastructure on GCP—one that will grow with your organization’s data needs.&lt;/p&gt;&lt;h2 id=&quot;infrastructure-overview&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/#infrastructure-overview&quot; class=&quot;heading-anchor&quot;&gt;Infrastructure Overview&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Before we dive into the step-by-step process of setting up your data platform on GCP, let’s take a first look at the key components that make up the environment we’ll be building:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Virtual Private Cloud (VPC)&lt;/strong&gt;: A private network that will serve as the foundation of your environment, providing isolation and security.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Subnet&lt;/strong&gt;: A private subnet where the virtual 
machine will reside.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Compute Engine Virtual Machine&lt;/strong&gt;: The instance where both the GitHub Runner and Prefect Worker will be set up.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Firewall&lt;/strong&gt;: Configured with rules to allow inbound access exclusively through Google Cloud Identity-Aware Proxy (IAP), blocking all other traffic.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;IAP SSH Permissions&lt;/strong&gt;: Enables secure access to the virtual machine.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Cloud Router&lt;/strong&gt;: Provides internet connectivity for the virtual machine.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Cloud NAT&lt;/strong&gt;: Configures a NAT gateway that directs the virtual machine to the Cloud Router for outbound internet access. It also keeps the public IP fixed as long as the Cloud NAT object is not destroyed and remains configured for the same zone.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Cloud Storage&lt;/strong&gt;: Sets up a Google Cloud Storage bucket to store ingested data as Parquet files before transforming and loading it into the database as tables.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal30&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/ITUd3RhFds-960.webp 960w, https://thescalableway.com/img/ITUd3RhFds-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/ITUd3RhFds-960.jpeg&quot; alt=&quot;google cloud platform infrastructure&quot; width=&quot;1600&quot; height=&quot;943&quot; srcset=&quot;https://thescalableway.com/img/ITUd3RhFds-960.jpeg 960w, https://thescalableway.com/img/ITUd3RhFds-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;30&quot;&gt;&lt;picture&gt;&lt;source 
type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/ITUd3RhFds-960.webp 960w, https://thescalableway.com/img/ITUd3RhFds-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/ITUd3RhFds-960.jpeg&quot; alt=&quot;google cloud platform infrastructure&quot; width=&quot;1600&quot; height=&quot;943&quot; srcset=&quot;https://thescalableway.com/img/ITUd3RhFds-960.jpeg 960w, https://thescalableway.com/img/ITUd3RhFds-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;h3 id=&quot;understanding-google-cloud-identity-aware-proxy-iap&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/#understanding-google-cloud-identity-aware-proxy-iap&quot; class=&quot;heading-anchor&quot;&gt;Understanding Google Cloud Identity-Aware Proxy (IAP)&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;To ensure a secure environment, all public access should be completely blocked. With this configuration, resources within the environment can be accessed using two main options:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;VPN Connection&lt;/strong&gt;: In this setup, at least one resource within the VPC must be exposed to the internet to host a VPN endpoint. Alternatively, a separate VPC can be configured solely for VPN purposes, with VPC Network Peering into the main environment. This way, only the VPN-hosting VPC is exposed to the internet, while the main environment remains accessible only internally. Although effective, this configuration is more complex and falls outside the scope of this documentation.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Google Cloud Identity-Aware Proxy (IAP)&lt;/strong&gt;: This option offers a similar secure access model to a VPN but with simplified management through Google Cloud. 
As outlined in the &lt;a href=&quot;https://cloud.google.com/iap/docs/concepts-overview#how_iap_works&quot; rel=&quot;noopener&quot;&gt;official documentation&lt;/a&gt;:&lt;/li&gt;&lt;/ul&gt;&lt;blockquote&gt;&lt;p&gt;&lt;em&gt;“When an application or resource is protected by IAP, it can only be accessed through the proxy by principals, also known as users, who have the correct Identity and Access Management (IAM) role. When you grant a user access to an application or resource by IAP, they’re subject to the fine-grained access controls implemented by the product in use without requiring a VPN. When a user tries to access an IAP-secured resource, IAP performs authentication and authorization checks.”&lt;/em&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;This &lt;a href=&quot;https://cloud.google.com/iap/images/iap-load-balancer.png&quot; rel=&quot;noopener&quot;&gt;diagram from Google&lt;/a&gt; further illustrates the components required to implement this configuration:&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal31&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/2CFZG6gV0g-960.webp 960w, https://thescalableway.com/img/2CFZG6gV0g-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/2CFZG6gV0g-960.jpeg&quot; alt=&quot;google configuration diagram&quot; width=&quot;1600&quot; height=&quot;1257&quot; srcset=&quot;https://thescalableway.com/img/2CFZG6gV0g-960.jpeg 960w, https://thescalableway.com/img/2CFZG6gV0g-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;31&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/2CFZG6gV0g-960.webp 960w, https://thescalableway.com/img/2CFZG6gV0g-1600.webp 
1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/2CFZG6gV0g-960.jpeg&quot; alt=&quot;google configuration diagram&quot; width=&quot;1600&quot; height=&quot;1257&quot; srcset=&quot;https://thescalableway.com/img/2CFZG6gV0g-960.jpeg 960w, https://thescalableway.com/img/2CFZG6gV0g-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;With a solid understanding of Google Cloud Identity-Aware Proxy and its role in managing users and access, let’s now dive into the practical steps for setting it up and implementing it effectively.&lt;/p&gt;&lt;h2 id=&quot;phase-1-securing-the-essentials&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/#phase-1-securing-the-essentials&quot; class=&quot;heading-anchor&quot;&gt;Phase 1: Securing the Essentials&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Before starting the Terraform configuration, make sure you have the following tools and setups in place:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Terraform&lt;/strong&gt;: You’ll need Terraform installed on your local machine. 
This will be your main tool for provisioning infrastructure.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;gcloud CLI&lt;/strong&gt;: The gcloud CLI tool should be installed and configured to interact with your Google Cloud account.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GCP Service Account&lt;/strong&gt;: A Google Cloud Platform service account needs to be created.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Enabled APIs&lt;/strong&gt;: Make sure the Compute Engine API and Cloud Resource Manager API are enabled in your GCP project.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Let’s take a detailed look at these prerequisites:&lt;/p&gt;&lt;h3 id=&quot;step-1-installing-terraform&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/#step-1-installing-terraform&quot; class=&quot;heading-anchor&quot;&gt;Step 1: Installing Terraform&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Terraform can be installed in various ways, which are outlined &lt;a href=&quot;https://developer.hashicorp.com/terraform/install&quot; rel=&quot;noopener&quot;&gt;by HashiCorp here&lt;/a&gt;. For Ubuntu, installation can be done with the following commands:&lt;/p&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;&lt;span class=&quot;token function&quot;&gt;wget&lt;/span&gt; &lt;span class=&quot;token parameter variable&quot;&gt;-O&lt;/span&gt; - https://apt.releases.hashicorp.com/gpg &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;sudo&lt;/span&gt; gpg &lt;span class=&quot;token parameter variable&quot;&gt;--dearmor&lt;/span&gt; &lt;span class=&quot;token parameter variable&quot;&gt;-o&lt;/span&gt; /usr/share/keyrings/hashicorp-archive-keyring.gpg
&lt;span class=&quot;token builtin class-name&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com &lt;span class=&quot;token variable&quot;&gt;&lt;span class=&quot;token variable&quot;&gt;$(&lt;/span&gt;lsb_release &lt;span class=&quot;token parameter variable&quot;&gt;-cs&lt;/span&gt;&lt;span class=&quot;token variable&quot;&gt;)&lt;/span&gt;&lt;/span&gt; main&quot;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;sudo&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;tee&lt;/span&gt; /etc/apt/sources.list.d/hashicorp.list
&lt;span class=&quot;token function&quot;&gt;sudo&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;apt&lt;/span&gt; update &lt;span class=&quot;token operator&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;sudo&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;apt&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;install&lt;/span&gt; terraform&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&quot;step-2-gcloud-cli-installation&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/#step-2-gcloud-cli-installation&quot; class=&quot;heading-anchor&quot;&gt;Step 2: gcloud CLI installation&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Similarly to Terraform, the gcloud CLI can be installed as per the &lt;a href=&quot;https://cloud.google.com/sdk/docs/install&quot; rel=&quot;noopener&quot;&gt;official instructions&lt;/a&gt;. For Ubuntu, run:&lt;/p&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt; 
&lt;span class=&quot;token function&quot;&gt;sudo&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;apt-get&lt;/span&gt; update
&lt;span class=&quot;token function&quot;&gt;sudo&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;apt-get&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;install&lt;/span&gt; apt-transport-https ca-certificates gnupg &lt;span class=&quot;token function&quot;&gt;curl&lt;/span&gt;
&lt;span class=&quot;token function&quot;&gt;curl&lt;/span&gt; https://packages.cloud.google.com/apt/doc/apt-key.gpg &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;sudo&lt;/span&gt; gpg &lt;span class=&quot;token parameter variable&quot;&gt;--dearmor&lt;/span&gt; &lt;span class=&quot;token parameter variable&quot;&gt;-o&lt;/span&gt; /usr/share/keyrings/cloud.google.gpg
&lt;span class=&quot;token builtin class-name&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main&quot;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;sudo&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;tee&lt;/span&gt; &lt;span class=&quot;token parameter variable&quot;&gt;-a&lt;/span&gt; /etc/apt/sources.list.d/google-cloud-sdk.list
&lt;span class=&quot;token function&quot;&gt;sudo&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;apt-get&lt;/span&gt; update &lt;span class=&quot;token operator&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;sudo&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;apt-get&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;install&lt;/span&gt; google-cloud-cli&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;After installation, initialize the gcloud CLI by running the &lt;code&gt;gcloud init&lt;/code&gt; command and signing in through the URL it prints.&lt;/p&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;gcloud init


&lt;span class=&quot;token comment&quot;&gt;# gcloud init&lt;/span&gt;
Welcome&lt;span class=&quot;token operator&quot;&gt;!&lt;/span&gt; This &lt;span class=&quot;token builtin class-name&quot;&gt;command&lt;/span&gt; will take you through the configuration of gcloud.

Your current configuration has been &lt;span class=&quot;token builtin class-name&quot;&gt;set&lt;/span&gt; to: &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;default&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;

You can skip diagnostics next &lt;span class=&quot;token function&quot;&gt;time&lt;/span&gt; by using the following flag:
  gcloud init --skip-diagnostics

Network diagnostic detects and fixes &lt;span class=&quot;token builtin class-name&quot;&gt;local&lt;/span&gt; network connection issues.
Checking network connection&lt;span class=&quot;token punctuation&quot;&gt;..&lt;/span&gt;.done.
Reachability Check passed.
Network diagnostic passed &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt;/1 checks passed&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;.

You must sign &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; to continue. Would you like to sign &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Y/n&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;?  Y

Go to the following &lt;span class=&quot;token function&quot;&gt;link&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; your browser, and complete the sign-in prompts:

    https://accounts.google.com/o/oauth2/auth&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;URL_TO_OPEN_IN_BROWSER&lt;span class=&quot;token operator&quot;&gt;&amp;gt;&lt;/span&gt;

Once finished, enter the verification code provided &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; your browser:
&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;PROVIDE_VERIFICATION_CODE&lt;span class=&quot;token operator&quot;&gt;&amp;gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Subsequently, configure the desired Cloud project, default Compute Region, and Zone for your environment.&lt;/p&gt;&lt;h3 id=&quot;step-3-setting-up-gcp-service-account&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/#step-3-setting-up-gcp-service-account&quot; class=&quot;heading-anchor&quot;&gt;Step 3: Setting up GCP Service Account&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;To create a GCP Service Account, navigate to the &lt;a href=&quot;https://console.cloud.google.com/&quot; rel=&quot;noopener&quot;&gt;Google Cloud Console&lt;/a&gt;, select the correct project, and go to &lt;code&gt;Navigation Menu (3 lines) &amp;gt; IAM &amp;amp; Admin &amp;gt; Service Accounts.&lt;/code&gt;&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal32&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/0OGIJK8OA6-487.webp 487w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/0OGIJK8OA6-487.jpeg&quot; alt=&quot;how to set up a google cloud platform service account&quot; width=&quot;487&quot; height=&quot;326&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;32&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/0OGIJK8OA6-487.webp 487w&quot; 
sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/0OGIJK8OA6-487.jpeg&quot; alt=&quot;how to set up a google cloud platform service account&quot; width=&quot;487&quot; height=&quot;326&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;Click &lt;code&gt;Create Service Account&lt;/code&gt; and provide the required information.&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal33&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/cg275sFJs--552.webp 552w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/cg275sFJs--552.jpeg&quot; alt=&quot;how to create a service account on google cloud platform&quot; width=&quot;552&quot; height=&quot;532&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;33&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/cg275sFJs--552.webp 552w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/cg275sFJs--552.jpeg&quot; alt=&quot;how to create a service account on google cloud platform&quot; width=&quot;552&quot; height=&quot;532&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;When prompted to &lt;strong&gt;Grant this service 
account access to a project&lt;/strong&gt;, select the appropriate role. In this guide, we use the &lt;strong&gt;Owner&lt;/strong&gt; role for simplicity, but it’s advisable to limit permissions to only what’s necessary.&lt;/p&gt;&lt;p&gt;Finally, in the &lt;strong&gt;Grant users access to this service account&lt;/strong&gt; step, assign access to the users who will need to interact with the Kubernetes cluster (not part of this article) or VM. Once done, verify that the service account is correctly set up. Its email should follow this pattern:&lt;/p&gt;&lt;p&gt;&lt;code&gt;{service_account_name}@{project}.iam.gserviceaccount.com&lt;/code&gt;&lt;/p&gt;&lt;h3 id=&quot;step-4-downloading-the-json-key-for-the-service-account&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/#step-4-downloading-the-json-key-for-the-service-account&quot; class=&quot;heading-anchor&quot;&gt;Step 4: Downloading the JSON Key for the Service Account&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;In the service account interface, go to &lt;code&gt;Actions (3 dots) &amp;gt; Manage keys&lt;/code&gt;.&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal34&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/JG7MD0ilyl-742.webp 742w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/JG7MD0ilyl-742.jpeg&quot; alt=&quot;how to set up json key in google cloud platform&quot; width=&quot;742&quot; height=&quot;339&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;34&quot;&gt;&lt;picture&gt;&lt;source 
type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/JG7MD0ilyl-742.webp 742w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/JG7MD0ilyl-742.jpeg&quot; alt=&quot;how to set up json key in google cloud platform&quot; width=&quot;742&quot; height=&quot;339&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;Select &lt;code&gt;Add Key &amp;gt; Create new key &amp;gt; JSON&lt;/code&gt; to download the JSON key file. Keep this file secure: it carries the service account’s permissions and will be required for the Terraform configuration.&lt;/p&gt;&lt;h3 id=&quot;step-5-activating-the-service-account&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/#step-5-activating-the-service-account&quot; class=&quot;heading-anchor&quot;&gt;Step 5: Activating the Service Account&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;After downloading the JSON key, activate the service account locally with the following command:&lt;/p&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;gcloud auth activate-service-account &lt;span class=&quot;token punctuation&quot;&gt;&#92;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;service_account_name&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;@&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;project&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;.iam.gserviceaccount.com &lt;span class=&quot;token punctuation&quot;&gt;&#92;&lt;/span&gt;
--key-file&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;json_file&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;.json&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This step enables the service account for use in the local environment, ensuring access to necessary GCP resources with IAP tunnel functionality.&lt;/p&gt;&lt;h3 id=&quot;step-6-generating-hmac-key-to-buckets&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/#step-6-generating-hmac-key-to-buckets&quot; class=&quot;heading-anchor&quot;&gt;Step 6: Generating HMAC Key to Buckets&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;To enable file uploads to Cloud Storage, some libraries require an HMAC token in addition to a JSON key. To generate an HMAC token:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;Go to &lt;code&gt;Cloud Storage &amp;gt; Settings &amp;gt; Interoperability&lt;/code&gt;.&lt;/li&gt;&lt;li&gt;Under &lt;code&gt;Service account HMAC&lt;/code&gt;, click &lt;code&gt;Create a key for service account&lt;/code&gt;.&lt;/li&gt;&lt;li&gt;Once created, the token will be marked as ‘Active’.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal35&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/eXA4O9qWt9-479.webp 479w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/eXA4O9qWt9-479.jpeg&quot; alt=&quot;how to generate HMAC key in google cloud platform&quot; width=&quot;479&quot; 
height=&quot;392&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;35&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/eXA4O9qWt9-479.webp 479w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/eXA4O9qWt9-479.jpeg&quot; alt=&quot;how to generate HMAC key in google cloud platform&quot; width=&quot;479&quot; height=&quot;392&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;For basic configuration, this step is not required. However, for ingestion, the key must be added to Google Secret Manager to ensure it’s accessible for flow runs.&lt;/p&gt;&lt;h3 id=&quot;step-7-enabling-compute-engine-api-and-cloud-resource-manager-api&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/#step-7-enabling-compute-engine-api-and-cloud-resource-manager-api&quot; class=&quot;heading-anchor&quot;&gt;Step 7: Enabling Compute Engine API and Cloud Resource Manager API&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Before Terraform can interact with GCP, make sure these APIs are enabled for your project:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;a href=&quot;https://console.cloud.google.com/marketplace/product/google/compute.googleapis.com&quot; rel=&quot;noopener&quot;&gt;Compute Engine API&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;https://console.cloud.google.com/marketplace/product/google/cloudresourcemanager.googleapis.com&quot; rel=&quot;noopener&quot;&gt;Cloud Resource Manager API&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;https://console.cloud.google.com/marketplace/product/google-cloud-platform/cloud-storage&quot; rel=&quot;noopener&quot;&gt;Cloud 
Storage&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;If they’re not enabled, head to the Google Cloud Console and enable them.&lt;/p&gt;&lt;h3 id=&quot;step-8-setting-up-a-remote-state-for-terraform&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/#step-8-setting-up-a-remote-state-for-terraform&quot; class=&quot;heading-anchor&quot;&gt;Step 8: Setting Up a Remote State for Terraform&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;By default, Terraform stores its state locally in .tfstate files. While this works for development, for any persistent environment, even if you’re the only one working on the project, it’s crucial to store this state in a centralized and reliable location. A common best practice is to use a Google Cloud Storage (GCS) bucket to keep the state safe and accessible, avoiding potential issues with local file loss or conflicts.&lt;/p&gt;&lt;p&gt;However, Terraform itself cannot create the bucket required for storing its state, leading to what’s called a “chicken-and-egg” problem. The bucket must be created manually before running any Terraform code. Tools like Terragrunt can solve this by simplifying environment management and reducing code duplication &lt;a href=&quot;https://blog.alterway.fr/en/manage-multiple-kubernetes-clusters-on-gke-with-terragrunt.html&quot; rel=&quot;noopener&quot;&gt;(example setup)&lt;/a&gt;. 
For simplicity, this guide does not introduce such tools.&lt;/p&gt;&lt;p&gt;To create the bucket with the &lt;code&gt;gcloud&lt;/code&gt; CLI, first export the necessary credentials and then run the bucket creation commands.&lt;/p&gt;&lt;h3 id=&quot;step-9-exporting-credentials-and-setting-up-a-new-bucket&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/#step-9-exporting-credentials-and-setting-up-a-new-bucket&quot; class=&quot;heading-anchor&quot;&gt;Step 9: Exporting Credentials and Setting up a New Bucket&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;With the service account JSON key in place, export its path so that it can be picked up as Application Default Credentials (ADC):&lt;/p&gt;&lt;p&gt;&lt;code&gt;export GOOGLE_APPLICATION_CREDENTIALS=test-project-32206692d146.json&lt;/code&gt;&lt;/p&gt;&lt;p&gt;Then use the gcloud CLI to create the bucket and grant the service account access to it:&lt;/p&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;gcloud storage buckets create gs://test-project-tfstate &lt;span class=&quot;token parameter variable&quot;&gt;--location&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;us-central1 &lt;span class=&quot;token punctuation&quot;&gt;&#92;&lt;/span&gt;
--uniform-bucket-level-access
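
&lt;span class=&quot;token comment&quot;&gt;# Optional, and an assumption beyond the original steps: enable object versioning&lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;# so that earlier revisions of the state file can be restored if needed.&lt;/span&gt;
gcloud storage buckets update gs://test-project-tfstate &lt;span class=&quot;token parameter variable&quot;&gt;--versioning&lt;/span&gt;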

gcloud storage buckets add-iam-policy-binding gs://test-project-tfstate &lt;span class=&quot;token punctuation&quot;&gt;&#92;&lt;/span&gt;
&lt;span class=&quot;token parameter variable&quot;&gt;--member&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;serviceAccount:test-service-account@test-project.iam.gserviceaccount.com&quot;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;&#92;&lt;/span&gt;
  &lt;span class=&quot;token parameter variable&quot;&gt;--role&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;roles/storage.objectAdmin&quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Once completed, it should be available in the GCP Console. To check it, go to &lt;code&gt;Navigation Menu &amp;gt; Cloud Storage &amp;gt; Buckets&lt;/code&gt;:&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal36&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/DwEN8YF8e3-960.webp 960w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/DwEN8YF8e3-960.jpeg&quot; alt=&quot;how to set up a new bucket on google cloud platform&quot; width=&quot;960&quot; height=&quot;133&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;36&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/DwEN8YF8e3-960.webp 960w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/DwEN8YF8e3-960.jpeg&quot; alt=&quot;how to set up a new bucket on google cloud platform&quot; width=&quot;960&quot; height=&quot;133&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;h2 id=&quot;phase-2-installing-and-deploying-infrastructure-with-terraform&quot;&gt;&lt;a 
href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/#phase-2-installing-and-deploying-infrastructure-with-terraform&quot; class=&quot;heading-anchor&quot;&gt;Phase 2: Installing &amp;amp; Deploying Infrastructure with Terraform&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;To set up the environment, Terraform will handle provisioning all the required GCP resources. By the end of this process, your directory structure will look like this:&lt;/p&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;$ tree

&lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;-- backend.tf
&lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;-- main.tf
&lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;-- provider.tf
&lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;-- test-project-32206692d146.json
&lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;-- variable.tf&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;For managing both DEV and PROD environments, you can duplicate the files as shown:&lt;/p&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;$ tree
├── dev
│   ├── backend.tf
│   ├── main.tf
│   ├── test-project-32206692d146.json
│   ├── provider.tf
│   └── variable.tf
└── prod
    ├── backend.tf
    ├── main.tf
    ├── test-project-32206692d146.json
    ├── provider.tf
    └── variable.tf&lt;/code&gt;&lt;/pre&gt;&lt;h4 id=&quot;terraform-files&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/#terraform-files&quot; class=&quot;heading-anchor&quot;&gt;Terraform Files&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;The main differences between environments lie in the &lt;code&gt;backend.tf&lt;/code&gt; and &lt;code&gt;variable.tf&lt;/code&gt; files. In larger projects, using Terraform modules or tools like Terragrunt is recommended for reusable configurations. However, for simplicity, this example uses code duplication, which is also a valid approach.&lt;/p&gt;&lt;p&gt;The content of &lt;code&gt;provider.tf&lt;/code&gt; should look like this:&lt;/p&gt;&lt;pre class=&quot;language-hcl&quot;&gt;&lt;code class=&quot;language-hcl&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;provider&lt;span class=&quot;token type variable&quot;&gt; &quot;google&quot; &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;region&lt;/span&gt;      &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; var.region
  &lt;span class=&quot;token property&quot;&gt;project&lt;/span&gt;     &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; var.project_name
  &lt;span class=&quot;token property&quot;&gt;credentials&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; file(var.credentials_file)
  &lt;span class=&quot;token property&quot;&gt;zone&lt;/span&gt;        &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; var.zone
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;code&gt;backend.tf&lt;/code&gt; should point to the bucket created in Step 9 of Phase 1, which holds the shared tfstate file. Its values must be hard-coded because the backend block is processed first when running &lt;code&gt;terraform init&lt;/code&gt;, so variables from &lt;code&gt;variable.tf&lt;/code&gt; cannot be referenced here:&lt;/p&gt;&lt;pre class=&quot;language-hcl&quot;&gt;&lt;code class=&quot;language-hcl&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;terraform&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token keyword&quot;&gt;backend&lt;span class=&quot;token type variable&quot;&gt; &quot;gcs&quot; &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token property&quot;&gt;bucket&lt;/span&gt;  &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;test-project-tfstate&quot;&lt;/span&gt;
    &lt;span class=&quot;token property&quot;&gt;prefix&lt;/span&gt;  &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;terraform/state/prod&quot;&lt;/span&gt;
  &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;All variables used in &lt;code&gt;provider.tf&lt;/code&gt; and &lt;code&gt;main.tf&lt;/code&gt; are defined in &lt;code&gt;variable.tf&lt;/code&gt;, as shown below:&lt;/p&gt;&lt;pre class=&quot;language-hcl&quot;&gt;&lt;code class=&quot;language-hcl&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;variable&lt;span class=&quot;token type variable&quot;&gt; &quot;credentials_file&quot; &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;default&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;test-project-32206692d146.json&quot;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;variable&lt;span class=&quot;token type variable&quot;&gt; &quot;environment&quot; &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;default&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;prod&quot;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;variable&lt;span class=&quot;token type variable&quot;&gt; &quot;filesystem&quot; &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;default&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;ext4&quot;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;variable&lt;span class=&quot;token type variable&quot;&gt; &quot;image&quot; &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;default&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;projects/ubuntu-os-cloud/global/images/ubuntu-2404-noble-amd64-v20241115&quot;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;variable&lt;span class=&quot;token type variable&quot;&gt; &quot;ip_cidr_range&quot; &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;default&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;10.202.0.0/24&quot;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;variable&lt;span class=&quot;token type variable&quot;&gt; &quot;machine_type&quot; &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;default&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;c3d-standard-8-lssd&quot;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;variable&lt;span class=&quot;token type variable&quot;&gt; &quot;project_name&quot; &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;default&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;test-project&quot;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;variable&lt;span class=&quot;token type variable&quot;&gt; &quot;region&quot; &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;default&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;us-central1&quot;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;variable&lt;span class=&quot;token type variable&quot;&gt; &quot;service_account&quot; &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;default&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;serviceAccount:test-service-account@test-project.iam.gserviceaccount.com&quot;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;variable&lt;span class=&quot;token type variable&quot;&gt; &quot;zone&quot; &lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;default&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;us-central1-c&quot;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;code&gt;main.tf&lt;/code&gt; defines and initializes all infrastructure components outlined in the &lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/#infrastructure-overview&quot;&gt;Infrastructure overview section&lt;/a&gt;.&lt;/p&gt;&lt;pre class=&quot;language-hcl&quot;&gt;&lt;code class=&quot;language-hcl&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;resource &lt;span class=&quot;token type variable&quot;&gt;&quot;google_compute_network&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;vpc_edp&quot;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
 &lt;span class=&quot;token property&quot;&gt;name&lt;/span&gt;                    &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;vpc-&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token type variable&quot;&gt;project_name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;-&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token type variable&quot;&gt;environment&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&quot;&lt;/span&gt;
 &lt;span class=&quot;token property&quot;&gt;auto_create_subnetworks&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;false&lt;/span&gt;

&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;resource &lt;span class=&quot;token type variable&quot;&gt;&quot;google_compute_subnetwork&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;subnet_edp&quot;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
 &lt;span class=&quot;token property&quot;&gt;name&lt;/span&gt;          &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;subnet-&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token type variable&quot;&gt;project_name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;-&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token type variable&quot;&gt;environment&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&quot;&lt;/span&gt;
 &lt;span class=&quot;token property&quot;&gt;ip_cidr_range&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; var.ip_cidr_range
 &lt;span class=&quot;token property&quot;&gt;network&lt;/span&gt;       &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; google_compute_network.vpc_edp.name
 &lt;span class=&quot;token property&quot;&gt;region&lt;/span&gt;        &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; var.region
 &lt;span class=&quot;token property&quot;&gt;depends_on&lt;/span&gt;    &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;google_compute_network.vpc_edp&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;resource &lt;span class=&quot;token type variable&quot;&gt;&quot;google_compute_instance&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;vm_edp&quot;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
 &lt;span class=&quot;token property&quot;&gt;project&lt;/span&gt;      &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; var.project_name
 &lt;span class=&quot;token property&quot;&gt;zone&lt;/span&gt;         &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; var.zone
 &lt;span class=&quot;token property&quot;&gt;name&lt;/span&gt;         &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token type variable&quot;&gt;project_name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;-&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token type variable&quot;&gt;environment&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;-01&quot;&lt;/span&gt;
 &lt;span class=&quot;token property&quot;&gt;machine_type&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; var.machine_type
 &lt;span class=&quot;token keyword&quot;&gt;boot_disk&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
   &lt;span class=&quot;token property&quot;&gt;auto_delete&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;true&lt;/span&gt;
   &lt;span class=&quot;token keyword&quot;&gt;initialize_params&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
     &lt;span class=&quot;token property&quot;&gt;image&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; var.image
     &lt;span class=&quot;token property&quot;&gt;size&lt;/span&gt;  &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;50&lt;/span&gt;
     &lt;span class=&quot;token property&quot;&gt;type&lt;/span&gt;  &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;pd-ssd&quot;&lt;/span&gt;
   &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
   &lt;span class=&quot;token property&quot;&gt;mode&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;READ_WRITE&quot;&lt;/span&gt;
 &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
 &lt;span class=&quot;token keyword&quot;&gt;scratch_disk&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
   &lt;span class=&quot;token property&quot;&gt;interface&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;NVME&quot;&lt;/span&gt;
 &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
 &lt;span class=&quot;token keyword&quot;&gt;network_interface&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
   &lt;span class=&quot;token property&quot;&gt;network&lt;/span&gt;    &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;vpc-&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token type variable&quot;&gt;project_name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;-&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token type variable&quot;&gt;environment&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&quot;&lt;/span&gt;
   &lt;span class=&quot;token property&quot;&gt;subnetwork&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; google_compute_subnetwork.subnet_edp.name
 &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
 &lt;span class=&quot;token property&quot;&gt;metadata_startup_script&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token heredoc string&quot;&gt;&amp;lt;&amp;lt;-EOT
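   #!/bin/bash
   set -e
   DISK=/dev/disk/by-id/google-local-nvme-ssd-0
   MNT=/mnt/disks/local-nvme-ssd
   # Startup scripts run on every boot, so format only when the disk has no filesystem yet.
   if ! sudo blkid &quot;$DISK&quot;; then
     sudo mkfs.ext4 -F &quot;$DISK&quot;
   fi
   sudo mkdir -p &quot;$MNT&quot;
   mountpoint -q &quot;$MNT&quot; || sudo mount &quot;$DISK&quot; &quot;$MNT&quot;
   sudo chmod a+w &quot;$MNT&quot;

   # Add the fstab entry only once so reboots do not append duplicates.
   grep -q &quot;$MNT&quot; /etc/fstab || echo UUID=$(sudo blkid -s UUID -o value &quot;$DISK&quot;) &quot;$MNT&quot; ext4 discard,defaults,nofail 0 2 | sudo tee -a /etc/fstab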
 EOT&lt;/span&gt;
 &lt;span class=&quot;token property&quot;&gt;depends_on&lt;/span&gt;              &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;google_compute_network.vpc_edp&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;resource &lt;span class=&quot;token type variable&quot;&gt;&quot;google_compute_firewall&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;rules&quot;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
 &lt;span class=&quot;token property&quot;&gt;project&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; var.project_name
 &lt;span class=&quot;token property&quot;&gt;name&lt;/span&gt;    &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;allow-ssh-&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token type variable&quot;&gt;environment&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&quot;&lt;/span&gt;
 &lt;span class=&quot;token property&quot;&gt;network&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;vpc-&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token type variable&quot;&gt;project_name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;-&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token type variable&quot;&gt;environment&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&quot;&lt;/span&gt;

 &lt;span class=&quot;token keyword&quot;&gt;allow&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
   &lt;span class=&quot;token property&quot;&gt;protocol&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;tcp&quot;&lt;/span&gt;
   &lt;span class=&quot;token property&quot;&gt;ports&lt;/span&gt;    &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;22&quot;&lt;/span&gt;, &lt;span class=&quot;token string&quot;&gt;&quot;6443&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
 &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
 &lt;span class=&quot;token property&quot;&gt;source_ranges&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;35.235.240.0/20&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
 &lt;span class=&quot;token property&quot;&gt;depends_on&lt;/span&gt;    &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;google_compute_network.vpc_edp&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;resource &lt;span class=&quot;token type variable&quot;&gt;&quot;google_project_iam_member&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;project&quot;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
 &lt;span class=&quot;token property&quot;&gt;project&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; var.project_name
 &lt;span class=&quot;token property&quot;&gt;role&lt;/span&gt;    &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;roles/iap.tunnelResourceAccessor&quot;&lt;/span&gt;
 &lt;span class=&quot;token property&quot;&gt;member&lt;/span&gt;  &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; var.service_account
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;resource &lt;span class=&quot;token type variable&quot;&gt;&quot;google_compute_router&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;router&quot;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
 &lt;span class=&quot;token property&quot;&gt;project&lt;/span&gt;    &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; var.project_name
 &lt;span class=&quot;token property&quot;&gt;name&lt;/span&gt;       &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;nat-router-&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token type variable&quot;&gt;environment&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&quot;&lt;/span&gt;
 &lt;span class=&quot;token property&quot;&gt;network&lt;/span&gt;    &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;vpc-&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token type variable&quot;&gt;project_name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;-&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token type variable&quot;&gt;environment&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&quot;&lt;/span&gt;
 &lt;span class=&quot;token property&quot;&gt;region&lt;/span&gt;     &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; var.region
 &lt;span class=&quot;token property&quot;&gt;depends_on&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;google_compute_network.vpc_edp&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;resource &lt;span class=&quot;token type variable&quot;&gt;&quot;google_compute_router_nat&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;nat&quot;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
 &lt;span class=&quot;token property&quot;&gt;name&lt;/span&gt;                               &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;router-nat-&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token type variable&quot;&gt;project_name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;-&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token type variable&quot;&gt;environment&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&quot;&lt;/span&gt;
 &lt;span class=&quot;token property&quot;&gt;router&lt;/span&gt;                             &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; google_compute_router.router.name
 &lt;span class=&quot;token property&quot;&gt;region&lt;/span&gt;                             &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; var.region
 &lt;span class=&quot;token property&quot;&gt;nat_ip_allocate_option&lt;/span&gt;             &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;AUTO_ONLY&quot;&lt;/span&gt;
 &lt;span class=&quot;token property&quot;&gt;source_subnetwork_ip_ranges_to_nat&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;ALL_SUBNETWORKS_ALL_IP_RANGES&quot;&lt;/span&gt;

 &lt;span class=&quot;token keyword&quot;&gt;log_config&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
   &lt;span class=&quot;token property&quot;&gt;enable&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;true&lt;/span&gt;
   &lt;span class=&quot;token property&quot;&gt;filter&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;ERRORS_ONLY&quot;&lt;/span&gt;
 &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;resource &lt;span class=&quot;token type variable&quot;&gt;&quot;google_storage_bucket&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;private_bucket&quot;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
 &lt;span class=&quot;token property&quot;&gt;name&lt;/span&gt;          &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token type variable&quot;&gt;project_name&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;-&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token type variable&quot;&gt;environment&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&quot;&lt;/span&gt;
 &lt;span class=&quot;token property&quot;&gt;location&lt;/span&gt;      &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; var.region
 &lt;span class=&quot;token property&quot;&gt;storage_class&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;STANDARD&quot;&lt;/span&gt;

 &lt;span class=&quot;token property&quot;&gt;uniform_bucket_level_access&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;true&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;resource &lt;span class=&quot;token type variable&quot;&gt;&quot;google_storage_bucket_iam_binding&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;bucket_writer&quot;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
 &lt;span class=&quot;token property&quot;&gt;bucket&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; google_storage_bucket.private_bucket.name

 &lt;span class=&quot;token property&quot;&gt;role&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;roles/storage.objectCreator&quot;&lt;/span&gt;
 &lt;span class=&quot;token property&quot;&gt;members&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;
   &lt;span class=&quot;token string&quot;&gt;&quot;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token type variable&quot;&gt;service_account&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&quot;&lt;/span&gt;
 &lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;resource &lt;span class=&quot;token type variable&quot;&gt;&quot;google_storage_bucket_iam_binding&quot;&lt;/span&gt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;bucket_admin&quot;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
 &lt;span class=&quot;token property&quot;&gt;bucket&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; google_storage_bucket.private_bucket.name

 &lt;span class=&quot;token property&quot;&gt;role&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;roles/storage.admin&quot;&lt;/span&gt;
 &lt;span class=&quot;token property&quot;&gt;members&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;
   &lt;span class=&quot;token string&quot;&gt;&quot;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token type variable&quot;&gt;service_account&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&quot;&lt;/span&gt;
 &lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;With the environment outlined, let’s provision the infrastructure by validating, formatting, and applying the configuration.&lt;/p&gt;&lt;h3 id=&quot;step-1-infrastructure-provisioning-with-terraform&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/#step-1-infrastructure-provisioning-with-terraform&quot; class=&quot;heading-anchor&quot;&gt;Step 1: Infrastructure Provisioning with Terraform&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Once the necessary files are prepared, validate and format the configuration:&lt;/p&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;$ terraform validate
Success&lt;span class=&quot;token operator&quot;&gt;!&lt;/span&gt; The configuration is valid.
$ terraform &lt;span class=&quot;token function&quot;&gt;fmt&lt;/span&gt;
main.tf
provider.tf&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Before applying changes, inspect them with the &lt;code&gt;plan&lt;/code&gt; command:&lt;/p&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;terraform plan
&lt;span class=&quot;token comment&quot;&gt;# Check if it&#39;s all good&lt;/span&gt;
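&lt;span class=&quot;token comment&quot;&gt;# Optional variant, not part of the original walkthrough: save the reviewed plan&lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;# to a file and apply exactly that file, instead of the plain apply shown next.&lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;# The file name tfplan is an arbitrary example:&lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;#   terraform plan -out=tfplan&lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;#   terraform apply tfplan&lt;/span&gt;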
terraform apply
&lt;span class=&quot;token comment&quot;&gt;# Enter a value: yes&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;After a few minutes, the environment will be ready. To list the created resources, run:&lt;/p&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;$ terraform state list
google_compute_firewall.rules
google_compute_instance.vm_edp
google_compute_network.vpc_edp
google_compute_router.router
google_compute_router_nat.nat
google_compute_subnetwork.subnet_edp
google_project_iam_member.project&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&quot;step-2-set-up-verification&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/#step-2-set-up-verification&quot; class=&quot;heading-anchor&quot;&gt;Step 2: Set up Verification&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Verify the resources by logging into the GCP Console. Confirm the creation of &lt;strong&gt;VPC and Subnet, Virtual Machine, Firewall Rule, IAP SSH Permission, Cloud Router, and NAT Gateway&lt;/strong&gt;. Navigate to the following sections:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;VPC and Subnet&lt;/strong&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Go to &lt;code&gt;Navigation Menu &amp;gt; VPC Network &amp;gt; VPC Networks&lt;/code&gt;:&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal37&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/fTt7wxmNGG-960.webp 960w, https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/fTt7wxmNGG-1219.webp 1219w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/fTt7wxmNGG-960.jpeg&quot; alt=&quot;how to verify set up on google cloud platform with VPC&quot; width=&quot;1219&quot; height=&quot;144&quot; srcset=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/fTt7wxmNGG-960.jpeg 960w, 
https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/fTt7wxmNGG-1219.jpeg 1219w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;37&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/fTt7wxmNGG-960.webp 960w, https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/fTt7wxmNGG-1219.webp 1219w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/fTt7wxmNGG-960.jpeg&quot; alt=&quot;how to verify set up on google cloud platform with VPC&quot; width=&quot;1219&quot; height=&quot;144&quot; srcset=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/fTt7wxmNGG-960.jpeg 960w, https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/fTt7wxmNGG-1219.jpeg 1219w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;Click on the VPC and check the &lt;code&gt;Subnets&lt;/code&gt; tab:&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal38&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/enz5i7h6Ld-960.webp 960w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; 
src=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/enz5i7h6Ld-960.jpeg&quot; alt=&quot;verifying google cloud platform setup on VPC&quot; width=&quot;960&quot; height=&quot;207&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;38&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/enz5i7h6Ld-960.webp 960w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/enz5i7h6Ld-960.jpeg&quot; alt=&quot;verifying google cloud platform setup on VPC&quot; width=&quot;960&quot; height=&quot;207&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Virtual Machine&lt;/strong&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Navigate to &lt;code&gt;Navigation Menu &amp;gt; Compute Engine &amp;gt; VM instances&lt;/code&gt;:&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal39&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/Rz29rknMsr-960.webp 960w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/Rz29rknMsr-960.jpeg&quot; alt=&quot;how to verify set up on google cloud platform on VM&quot; width=&quot;960&quot; height=&quot;126&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;39&quot;&gt;&lt;picture&gt;&lt;source 
type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/Rz29rknMsr-960.webp 960w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/Rz29rknMsr-960.jpeg&quot; alt=&quot;how to verify set up on google cloud platform on VM&quot; width=&quot;960&quot; height=&quot;126&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Firewall Rule&lt;/strong&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Go to &lt;code&gt;Navigation Menu &amp;gt; VPC Network &amp;gt; Firewall&lt;/code&gt;:&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal40&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/ptoJB2fUhV-960.webp 960w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/ptoJB2fUhV-960.jpeg&quot; alt=&quot;how to verify set up on google cloud platform with Firewall rule&quot; width=&quot;960&quot; height=&quot;343&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;40&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/ptoJB2fUhV-960.webp 960w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; 
src=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/ptoJB2fUhV-960.jpeg&quot; alt=&quot;how to verify set up on google cloud platform with Firewall rule&quot; width=&quot;960&quot; height=&quot;343&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;IAM Role&lt;/strong&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Navigate to &lt;code&gt;Navigation Menu &amp;gt; IAM &amp;amp; Admin &amp;gt; IAM&lt;/code&gt;, and &lt;code&gt;View by roles&lt;/code&gt;:&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal41&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/UQrnGOluqb-871.webp 871w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/UQrnGOluqb-871.jpeg&quot; alt=&quot;how to verify set up on google cloud platform with IAM role&quot; width=&quot;871&quot; height=&quot;272&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;41&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/UQrnGOluqb-871.webp 871w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/UQrnGOluqb-871.jpeg&quot; alt=&quot;how to verify set up on google cloud platform with IAM role&quot; width=&quot;871&quot; 
height=&quot;272&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Cloud NAT gateway&lt;/strong&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Go to &lt;code&gt;Navigation Menu &amp;gt; Network Connectivity &amp;gt; Cloud Routers &amp;gt; Open Cloud Router&lt;/code&gt;:&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal42&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/akM5MKhm_V-421.webp 421w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/akM5MKhm_V-421.jpeg&quot; alt=&quot;how to verify set up on google cloud platform with Cloud NAT gateway&quot; width=&quot;421&quot; height=&quot;105&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;42&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/akM5MKhm_V-421.webp 421w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/akM5MKhm_V-421.jpeg&quot; alt=&quot;how to verify set up on google cloud platform with Cloud NAT gateway&quot; width=&quot;421&quot; height=&quot;105&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Cloud NAT&lt;/strong&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Navigate to &lt;code&gt;Navigation Menu &amp;gt; Network Services &amp;gt; Cloud 
NAT&lt;/code&gt;:&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal43&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/DPj09xSoyH-863.webp 863w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/DPj09xSoyH-863.jpeg&quot; alt=&quot;how to verify set up on google cloud platform with Cloud NAT&quot; width=&quot;863&quot; height=&quot;107&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;43&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/DPj09xSoyH-863.webp 863w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/DPj09xSoyH-863.jpeg&quot; alt=&quot;how to verify set up on google cloud platform with Cloud NAT&quot; width=&quot;863&quot; height=&quot;107&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;SSH Access to Virtual Machine using GCP Console&lt;/strong&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;To test SSH connectivity to the Virtual Machine, go to &lt;code&gt;Navigation Menu &amp;gt; Compute Engine &amp;gt; VM instances&lt;/code&gt; and click on the SSH option for the created VM. Approve the connection when prompted, and you should be logged in.&lt;/p&gt;&lt;p&gt;This verification ensures all components are correctly configured and accessible. 
The next steps in setting up the data platform are to set up a self-hosted GitHub runner and then a Prefect worker.&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;SSH Access to Virtual Machine Using the gcloud CLI&lt;/strong&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;You can access a Virtual Machine securely using only a service account key and the gcloud CLI. Follow these steps to set up and establish SSH access:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;Ensure you have the service account JSON key stored locally.&lt;/li&gt;&lt;li&gt;Log in to your Google Cloud account using the following command:&lt;/li&gt;&lt;/ul&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;gcloud auth activate-service-account \
  test-service-account@test-project.iam.gserviceaccount.com \
  --key-file ~/.config/gcloud.json
gcloud config &lt;span class=&quot;token builtin class-name&quot;&gt;set&lt;/span&gt; project test-project&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Once authenticated, execute the following command to initiate the SSH connection:&lt;/p&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;gcloud compute &lt;span class=&quot;token function&quot;&gt;ssh&lt;/span&gt; &lt;span class=&quot;token variable&quot;&gt;${project_name}&lt;/span&gt;-&lt;span class=&quot;token variable&quot;&gt;${environment}&lt;/span&gt;-01&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;On the first execution, the gcloud CLI will prompt you to generate a private and public SSH key pair. Follow the instructions to create the key pair.&lt;/p&gt;&lt;p&gt;Once the keys are created, access to the virtual machine will be automatically established. Subsequent logins will reuse the existing key pair, simplifying future access.&lt;/p&gt;&lt;h2 id=&quot;conclusion&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/how-to-setup-data-platform-infrastructure-on-google-cloud-platform-with-terraform/#conclusion&quot; class=&quot;heading-anchor&quot;&gt;Conclusion&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Setting up a data platform infrastructure on Google Cloud Platform using Terraform provides a solid foundation for organizations looking to leverage the power of their data. 
This approach offers several key benefits:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;Scalability and Flexibility: The server-based approach with a single VM provides an excellent starting point that can easily be expanded as your data needs grow.&lt;/li&gt;&lt;li&gt;Security: By leveraging Google Cloud’s Identity-Aware Proxy (IAP), we’ve ensured that access to our resources is tightly controlled and secure.&lt;/li&gt;&lt;li&gt;Infrastructure as Code: Using Terraform allows for version-controlled, reproducible infrastructure deployments, making it easier to manage and update your environment over time.&lt;/li&gt;&lt;li&gt;Cost-Effectiveness: Starting with a single VM setup is often more budget-friendly for initial deployments or smaller-scale projects.&lt;/li&gt;&lt;li&gt;Simplified Management: With fewer components to manage initially, maintenance and troubleshooting become more straightforward.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;By following the steps outlined in this guide, you’ve created a robust infrastructure that includes a VPC, subnet, Compute Engine instance, firewall rules, IAP SSH permissions, Cloud Router, Cloud NAT, and Cloud Storage. This setup provides a solid base for running a data platform, including components like a GitHub Runner and Prefect Worker. The process of setting up these additional components will be covered in the next article of this series, building upon the foundation we’ve established here.&lt;/p&gt;&lt;p&gt;Our &lt;a href=&quot;https://github.com/thescalableway/dataplatform-gcp-terraform&quot; rel=&quot;noopener&quot;&gt;dedicated repository&lt;/a&gt; contains all code examples and implementations discussed in this article, which can be accessed for reference and further exploration. 
We encourage you to review the repository for a comprehensive understanding of the concepts presented.&lt;/p&gt;&lt;p&gt;Remember, while this guide provides a strong starting point, it’s crucial to continually assess and adjust your infrastructure to meet your organization’s changing needs and to stay aligned with best practices in cloud computing and data management.&lt;/p&gt;&lt;/div&gt;
 			</content>
    </entry><entry>
      <title>Organizing Networking for Data Platforms: Key Connectivity Options</title>
      <link href="https://thescalableway.com/blog/organizing-networking-for-data-platforms-key-connectivity-options/" />
      <updated>2025-03-05T09:50:00Z</updated>
      <id>https://thescalableway.com/blog/organizing-networking-for-data-platforms-key-connectivity-options/</id>
      <content type="html">
				&lt;nav id=&quot;toc&quot; class=&quot;table-of-contents prose&quot;&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/organizing-networking-for-data-platforms-key-connectivity-options/#data-platform-architecture-and-networking&quot;&gt;Data Platform Architecture and Networking&lt;/a&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/organizing-networking-for-data-platforms-key-connectivity-options/#extract&quot;&gt;Extract&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/organizing-networking-for-data-platforms-key-connectivity-options/#load&quot;&gt;Load&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/organizing-networking-for-data-platforms-key-connectivity-options/#transform&quot;&gt;Transform&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/organizing-networking-for-data-platforms-key-connectivity-options/#data-consumption&quot;&gt;Data Consumption&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/organizing-networking-for-data-platforms-key-connectivity-options/#application-layer-security&quot;&gt;Application Layer Security&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/organizing-networking-for-data-platforms-key-connectivity-options/#connectivity-options&quot;&gt;Connectivity Options&lt;/a&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/organizing-networking-for-data-platforms-key-connectivity-options/#public-access&quot;&gt;Public Access&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a 
href=&quot;https://thescalableway.com/blog/organizing-networking-for-data-platforms-key-connectivity-options/#public-access-with-access-control-list-acl&quot;&gt;Public Access with Access Control List (ACL)&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/organizing-networking-for-data-platforms-key-connectivity-options/#vpc-peering&quot;&gt;VPC Peering&lt;/a&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/organizing-networking-for-data-platforms-key-connectivity-options/#vpc-peering-within-a-single-project&quot;&gt;VPC Peering Within a Single Project&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/organizing-networking-for-data-platforms-key-connectivity-options/#vpc-peering-across-separate-projects&quot;&gt;VPC Peering Across Separate Projects&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/organizing-networking-for-data-platforms-key-connectivity-options/#site-to-site-s2s-vpn&quot;&gt;Site-to-site (S2S) VPN&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/organizing-networking-for-data-platforms-key-connectivity-options/#other-networking-possibilities&quot;&gt;Other Networking Possibilities&lt;/a&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/organizing-networking-for-data-platforms-key-connectivity-options/#mpls-multiprotocol-label-switching&quot;&gt;MPLS (Multiprotocol Label Switching)&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/organizing-networking-for-data-platforms-key-connectivity-options/#dedicated-link&quot;&gt;Dedicated Link&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a 
href=&quot;https://thescalableway.com/blog/organizing-networking-for-data-platforms-key-connectivity-options/#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/nav&gt;&lt;p&gt;&lt;span id=&quot;toc-skipped&quot; class=&quot;visually-hidden&quot;&gt;&lt;/span&gt;&lt;/p&gt;&lt;div class=&quot;flow prose&quot;&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;A poorly designed network can cripple even the most advanced data platform. Slow queries, failed data transfers, and security vulnerabilities often stem from overlooked networking decisions. Yet, networking remains one of the least understood aspects of data architecture.&lt;/p&gt;&lt;p&gt;The Extract, Load, and Transform (ELT) process has become the standard for data integration. It enables organizations to move raw data from source systems to destinations like data warehouses, where it can be analyzed using Business Intelligence (BI) tools. While many aspects of this process deserve attention, networking is a critical yet often underestimated component.&lt;/p&gt;&lt;p&gt;Building a data platform that supports ELT processes requires a clear understanding of &lt;strong&gt;how all components communicate&lt;/strong&gt;. Whether implementing an on-premises solution with open-source tools, leveraging cloud providers, or utilizing SaaS or PaaS solutions, the common thread is the need for seamless connectivity between all elements.&lt;/p&gt;&lt;p&gt;In this article, we’ll explore how to organize networking in data platforms, covering key connectivity options, security considerations, and best practices. 
To lay the groundwork for our discussion, let’s first examine the optimal organization of a data platform.&lt;/p&gt;&lt;h2 id=&quot;data-platform-architecture-and-networking&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/organizing-networking-for-data-platforms-key-connectivity-options/#data-platform-architecture-and-networking&quot; class=&quot;heading-anchor&quot;&gt;Data Platform Architecture and Networking&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;The diagram below presents the reference architecture for the ELT process as a whole, outlining the key components and workflows involved. Each stage has its own phases, with Ingest being part of data extraction, Land being the process of loading data, and Prepare with Model being the transformation.&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal47&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/s6LJnDtkZT-960.webp 960w, https://thescalableway.com/img/s6LJnDtkZT-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/s6LJnDtkZT-960.jpeg&quot; alt=&quot;modular data platform architecture&quot; width=&quot;1600&quot; height=&quot;824&quot; srcset=&quot;https://thescalableway.com/img/s6LJnDtkZT-960.jpeg 960w, https://thescalableway.com/img/s6LJnDtkZT-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;47&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/s6LJnDtkZT-960.webp 960w, https://thescalableway.com/img/s6LJnDtkZT-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/s6LJnDtkZT-960.jpeg&quot; alt=&quot;modular data platform architecture&quot; 
width=&quot;1600&quot; height=&quot;824&quot; srcset=&quot;https://thescalableway.com/img/s6LJnDtkZT-960.jpeg 960w, https://thescalableway.com/img/s6LJnDtkZT-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;To better understand how networking ties into a data platform, let’s examine a second diagram, which shifts focus to the networking aspects of the data platform architecture.&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal48&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/qlCBwG-fKo-960.webp 960w, https://thescalableway.com/img/qlCBwG-fKo-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/qlCBwG-fKo-960.jpeg&quot; alt=&quot;data platform networking&quot; width=&quot;1600&quot; height=&quot;870&quot; srcset=&quot;https://thescalableway.com/img/qlCBwG-fKo-960.jpeg 960w, https://thescalableway.com/img/qlCBwG-fKo-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;48&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/qlCBwG-fKo-960.webp 960w, https://thescalableway.com/img/qlCBwG-fKo-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/qlCBwG-fKo-960.jpeg&quot; alt=&quot;data platform networking&quot; width=&quot;1600&quot; height=&quot;870&quot; srcset=&quot;https://thescalableway.com/img/qlCBwG-fKo-960.jpeg 960w, https://thescalableway.com/img/qlCBwG-fKo-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;The diagram illustrates various components of a data platform, each 
requiring network configuration to ensure smooth and secure data movement. At the core of our setup is &lt;strong&gt;workflow orchestration&lt;/strong&gt;, which manages the data integration process. Tools like &lt;strong&gt;Prefect&lt;/strong&gt;, &lt;strong&gt;Airflow&lt;/strong&gt;, or &lt;strong&gt;Azure Data Factory&lt;/strong&gt; can handle this, running data flows across various stages.&lt;/p&gt;&lt;h4 id=&quot;extract&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/organizing-networking-for-data-platforms-key-connectivity-options/#extract&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Extract&lt;/strong&gt;&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;The initial phase of the &lt;strong&gt;ELT (Extract, Load, Transform)&lt;/strong&gt; process is data extraction. Every data platform needs to gather data from external systems, represented as “data sources” in the diagram. To access resources from a private environment, we need to use a gateway that allows us to reach external resources. This could be:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Internet Gateway&lt;/strong&gt; - Provides access to public resources.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;NAT Gateway&lt;/strong&gt; - Allows resources in private subnets to connect to services outside the private network.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;VPN Gateway&lt;/strong&gt; - Establishes a secure tunnel with private resources within a different network of our organization or a partner.&lt;/li&gt;&lt;/ul&gt;&lt;h4 id=&quot;load&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/organizing-networking-for-data-platforms-key-connectivity-options/#load&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Load&lt;/strong&gt;&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;Once extracted, data needs to be loaded into a central repository—typically a &lt;strong&gt;Data Warehouse&lt;/strong&gt;. 
This can be hosted within the same network as the workflow orchestration tool or exist as an external resource.&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Same Network:&lt;/strong&gt; Configuration is simpler as the same team is likely responsible for setting up both components along with networking.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;External Resource&lt;/strong&gt;: Requires additional networking considerations, but the same principles apply—ensuring secure, reliable connectivity.&lt;/li&gt;&lt;/ul&gt;&lt;h4 id=&quot;transform&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/organizing-networking-for-data-platforms-key-connectivity-options/#transform&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Transform&lt;/strong&gt;&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;The &lt;strong&gt;Transform phase&lt;/strong&gt; follows a similar working pattern, as the workflow orchestration tool needs access to the Data Warehouse. The same resources need to communicate with each other, regardless of whether it’s the Load or Transform phase.&lt;/p&gt;&lt;h4 id=&quot;data-consumption&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/organizing-networking-for-data-platforms-key-connectivity-options/#data-consumption&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Data Consumption&lt;/strong&gt;&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;The final stage is &lt;strong&gt;data consumption&lt;/strong&gt;, where users and tools query the Data Warehouse. Given that sensitive information such as Client Identifying Data (CID) may be stored, secure connections are essential. BI tools, accessible by data platform consumers, need controlled access to the Data Warehouse. 
Such tools are often managed by a different team from those responsible for data gathering, loading, and transformation.&lt;/p&gt;&lt;h2 id=&quot;application-layer-security&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/organizing-networking-for-data-platforms-key-connectivity-options/#application-layer-security&quot; class=&quot;heading-anchor&quot;&gt;Application Layer Security&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;When discussing the networking aspects of data platforms, it’s essential to understand the context within the &lt;strong&gt;OSI (Open Systems Interconnection)&lt;/strong&gt; model. This article primarily focuses on the &lt;strong&gt;Network Layer (Layer 3) and Transport Layer (Layer 4)&lt;/strong&gt;—the backbone of data connectivity. These layers handle IP addressing, routing, and basic connection establishment, forming the foundation for gateways and other networking components.&lt;/p&gt;&lt;p&gt;However, security doesn’t stop at the network level. The &lt;strong&gt;Application Layer (Layer 7)&lt;/strong&gt; plays a critical role in securing data and applications. While this article centers on network infrastructure, robust Layer 7 security is just as important. Common security measures include:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;OAuth&lt;/strong&gt; for secure authorization&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Mutual TLS (mTLS)&lt;/strong&gt; for encrypted, authenticated communication&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Basic authentication&lt;/strong&gt; for simple access control&lt;/li&gt;&lt;li&gt;&lt;strong&gt;API gateways&lt;/strong&gt; for managing and securing API access&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Web Application Firewalls (WAF)&lt;/strong&gt; for protecting against application-level attacks&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Regardless of network configuration, Application Layer security should always be implemented. 
A key principle to remember is that the weaker the Layer 7 security measures, the stronger the network-level controls must be to compensate. This inverse relationship between application-level and network-level security is key to maintaining overall system integrity.&lt;/p&gt;&lt;p&gt;That’s why, while our focus remains on network infrastructure, a holistic approach to data platform security should consider all relevant OSI layers, especially when dealing with sensitive data and critical business intelligence tools.&lt;/p&gt;&lt;p&gt;With this foundation in place, let’s dive into the specific networking options available for securing your data platform.&lt;/p&gt;&lt;h2 id=&quot;connectivity-options&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/organizing-networking-for-data-platforms-key-connectivity-options/#connectivity-options&quot; class=&quot;heading-anchor&quot;&gt;Connectivity Options&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;While there are many possible approaches to networking configurations, we’ll focus on the most common scenarios applicable to the majority of data platform use cases:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Public access&lt;/strong&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Public access with Access Control List (ACL)&lt;/strong&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;VPC peering&lt;/strong&gt; (within a single project or across separate projects)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Site-to-site VPN&lt;/strong&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;We’ll explore each in detail, followed by a brief discussion of additional networking possibilities.&lt;/p&gt;&lt;h3 id=&quot;public-access&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/organizing-networking-for-data-platforms-key-connectivity-options/#public-access&quot; class=&quot;heading-anchor&quot;&gt;Public Access&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Public Access is the &lt;strong&gt;least secure&lt;/strong&gt; networking option, as it does not restrict access at the network 
level. Resources residing in a private network are configured to access the internet, while the target resource has no network security applied. This doesn’t necessarily mean the resource is available to anyone, as Application Layer security may still be in place. However, from a networking perspective, access is unrestricted.&lt;/p&gt;&lt;p&gt;This configuration exposes resources to potential attacks, as malicious actors can easily reach them and attempt to bypass application security. Whenever possible, such unrestricted access should be avoided.&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal49&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/B4Kflo1I1s-960.webp 960w, https://thescalableway.com/img/B4Kflo1I1s-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/B4Kflo1I1s-960.jpeg&quot; alt=&quot;public access network&quot; width=&quot;1600&quot; height=&quot;646&quot; srcset=&quot;https://thescalableway.com/img/B4Kflo1I1s-960.jpeg 960w, https://thescalableway.com/img/B4Kflo1I1s-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;49&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/B4Kflo1I1s-960.webp 960w, https://thescalableway.com/img/B4Kflo1I1s-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/B4Kflo1I1s-960.jpeg&quot; alt=&quot;public access network&quot; width=&quot;1600&quot; height=&quot;646&quot; srcset=&quot;https://thescalableway.com/img/B4Kflo1I1s-960.jpeg 960w, https://thescalableway.com/img/B4Kflo1I1s-1600.jpeg 1600w&quot; 
sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;That said, Public Access remains the best option for &lt;strong&gt;specific, low-risk data sources&lt;/strong&gt;, such as:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;Exchange rates for currencies&lt;/li&gt;&lt;li&gt;Stock market prices&lt;/li&gt;&lt;li&gt;Other publicly accessible data needed in ELT pipelines&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;To mitigate risks associated with Public Access, organizations can implement additional security measures within their private networks. For instance, they can configure a &lt;strong&gt;firewall&lt;/strong&gt; to block access to all public resources except those explicitly whitelisted. This approach adheres to the principle of least privilege, ensuring only necessary connections are allowed.&lt;/p&gt;&lt;p&gt;By implementing such measures, organizations can balance the need for access to public data sources with maintaining a secure network environment.&lt;/p&gt;&lt;h3 id=&quot;public-access-with-access-control-list-acl&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/organizing-networking-for-data-platforms-key-connectivity-options/#public-access-with-access-control-list-acl&quot; class=&quot;heading-anchor&quot;&gt;Public Access with Access Control List (ACL)&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;When a system is publicly available, in addition to securing access through Application Layer security controls, we can implement networking mechanisms to expose the system only to a limited group of servers or users. An &lt;strong&gt;Access Control List&lt;/strong&gt;, often referred to as a whitelist, is a security mechanism implemented on the publicly available target system. 
While the simplest scenario involves allowing access for specific IP addresses, ACLs offer more sophisticated options, including:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;Source and destination IP addresses&lt;/li&gt;&lt;li&gt;Port numbers&lt;/li&gt;&lt;li&gt;Network protocols (e.g., TCP, UDP, ICMP)&lt;/li&gt;&lt;li&gt;Time ranges for when the ACL is active&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal50&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/rmOW1HKnGZ-960.webp 960w, https://thescalableway.com/img/rmOW1HKnGZ-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/rmOW1HKnGZ-960.jpeg&quot; alt=&quot;Public Access with Access Control List network&quot; width=&quot;1600&quot; height=&quot;646&quot; srcset=&quot;https://thescalableway.com/img/rmOW1HKnGZ-960.jpeg 960w, https://thescalableway.com/img/rmOW1HKnGZ-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;50&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/rmOW1HKnGZ-960.webp 960w, https://thescalableway.com/img/rmOW1HKnGZ-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/rmOW1HKnGZ-960.jpeg&quot; alt=&quot;Public Access with Access Control List network&quot; width=&quot;1600&quot; height=&quot;646&quot; srcset=&quot;https://thescalableway.com/img/rmOW1HKnGZ-960.jpeg 960w, https://thescalableway.com/img/rmOW1HKnGZ-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;For ACLs to be effective, the system must have &lt;strong&gt;a fixed IP 
address&lt;/strong&gt;. If this cannot be guaranteed, &lt;strong&gt;more secure alternatives like Site-to-Site VPN&lt;/strong&gt; should be considered.&lt;/p&gt;&lt;p&gt;ACLs can be implemented at multiple levels, including routers, firewalls, or other network devices, providing a layered approach to security. Additionally, ACLs can be used for both inbound and outbound traffic, allowing for fine-grained control over data flow in both directions.&lt;/p&gt;&lt;p&gt;However, while ACLs enhance security, they should not be relied upon as the sole protection mechanism—they work best alongside other security measures, such as authentication, encryption, and regular security audits.&lt;/p&gt;&lt;h3 id=&quot;vpc-peering&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/organizing-networking-for-data-platforms-key-connectivity-options/#vpc-peering&quot; class=&quot;heading-anchor&quot;&gt;VPC Peering&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;For cloud environments, &lt;strong&gt;VPC (Virtual Private Cloud) Peering&lt;/strong&gt; allows direct, private network connections between different cloud resources without exposing traffic to the public internet.&lt;/p&gt;&lt;p&gt;Since cloud providers use different naming conventions (AWS: accounts, Azure: subscriptions, GCP: projects), we’ll use the term “project” to refer to these organizational units.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;VPC peering&lt;/strong&gt; should be considered the default network configuration for resources within the same cloud provider, regardless of the specific implementation option.&lt;/p&gt;&lt;p&gt;The implementation process varies depending on whether the VPCs are located within the same project or separate ones. 
Therefore, we will discuss these scenarios separately to highlight their unique characteristics and requirements.&lt;/p&gt;&lt;h4 id=&quot;vpc-peering-within-a-single-project&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/organizing-networking-for-data-platforms-key-connectivity-options/#vpc-peering-within-a-single-project&quot; class=&quot;heading-anchor&quot;&gt;VPC Peering Within a Single Project&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;Connecting two networks within the same project is a streamlined process. It requires no additional permissions and can be configured entirely from a single account. This peering effectively extends the network, making all resources in the peered network accessible from the first network.&lt;/p&gt;&lt;p&gt;As with other networking options, additional firewall rules or ACLs can be implemented to restrict access directionality or limit connectivity to specific services.&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal51&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/eDMeGFZtME-960.webp 960w, https://thescalableway.com/img/eDMeGFZtME-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/eDMeGFZtME-960.jpeg&quot; alt=&quot;VPC Peering within a single project&quot; width=&quot;1600&quot; height=&quot;894&quot; srcset=&quot;https://thescalableway.com/img/eDMeGFZtME-960.jpeg 960w, https://thescalableway.com/img/eDMeGFZtME-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;51&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/eDMeGFZtME-960.webp 960w, https://thescalableway.com/img/eDMeGFZtME-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img 
loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/eDMeGFZtME-960.jpeg&quot; alt=&quot;VPC Peering within a single project&quot; width=&quot;1600&quot; height=&quot;894&quot; srcset=&quot;https://thescalableway.com/img/eDMeGFZtME-960.jpeg 960w, https://thescalableway.com/img/eDMeGFZtME-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;h4 id=&quot;vpc-peering-across-separate-projects&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/organizing-networking-for-data-platforms-key-connectivity-options/#vpc-peering-across-separate-projects&quot; class=&quot;heading-anchor&quot;&gt;VPC Peering Across Separate Projects&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;When peering VPCs between different projects, additional security and administrative steps are required:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;Cross-project peering permissions must be explicitly granted.&lt;/li&gt;&lt;li&gt;Approval is needed from administrators in both projects.&lt;/li&gt;&lt;li&gt;Firewall rules must be configured in each project to enable cross-project traffic.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Despite these additional requirements, VPC Peering remains the most secure and efficient method for connecting resources within the same cloud provider, offering greater control and reduced exposure compared to internet-based connections.&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal52&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/vQjw6TH34Y-960.webp 960w, https://thescalableway.com/img/vQjw6TH34Y-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/vQjw6TH34Y-960.jpeg&quot; alt=&quot;VPC Peering across separate projects&quot; 
width=&quot;1600&quot; height=&quot;894&quot; srcset=&quot;https://thescalableway.com/img/vQjw6TH34Y-960.jpeg 960w, https://thescalableway.com/img/vQjw6TH34Y-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;52&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/vQjw6TH34Y-960.webp 960w, https://thescalableway.com/img/vQjw6TH34Y-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/vQjw6TH34Y-960.jpeg&quot; alt=&quot;VPC Peering across separate projects&quot; width=&quot;1600&quot; height=&quot;894&quot; srcset=&quot;https://thescalableway.com/img/vQjw6TH34Y-960.jpeg 960w, https://thescalableway.com/img/vQjw6TH34Y-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;h3 id=&quot;site-to-site-s2s-vpn&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/organizing-networking-for-data-platforms-key-connectivity-options/#site-to-site-s2s-vpn&quot; class=&quot;heading-anchor&quot;&gt;Site-to-site (S2S) VPN&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Site-to-Site VPN is a secure networking solution that connects two or more separate networks, typically in different physical locations, enabling them to communicate as if they were directly connected. 
This technology creates an encrypted tunnel over the public internet, allowing organizations to securely link their geographically dispersed offices, data centers, or cloud resources.&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal53&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/eobX_P1cOC-960.webp 960w, https://thescalableway.com/img/eobX_P1cOC-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/eobX_P1cOC-960.jpeg&quot; alt=&quot;site to site vpn&quot; width=&quot;1600&quot; height=&quot;410&quot; srcset=&quot;https://thescalableway.com/img/eobX_P1cOC-960.jpeg 960w, https://thescalableway.com/img/eobX_P1cOC-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;53&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/eobX_P1cOC-960.webp 960w, https://thescalableway.com/img/eobX_P1cOC-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/eobX_P1cOC-960.jpeg&quot; alt=&quot;site to site vpn&quot; width=&quot;1600&quot; height=&quot;410&quot; srcset=&quot;https://thescalableway.com/img/eobX_P1cOC-960.jpeg 960w, https://thescalableway.com/img/eobX_P1cOC-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;Key aspects of Site-to-Site VPN:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;VPN Gateways:&lt;/strong&gt; Specialized devices or software applications are deployed at each network endpoint to act as tunnel terminators.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Encryption:&lt;/strong&gt; Data is encrypted before entering the VPN 
tunnel and decrypted upon reaching its destination, ensuring confidentiality during transit.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Tunneling Protocols:&lt;/strong&gt; Protocols like IPsec establish the secure tunnel and manage encryption/decryption processes.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Routing Configuration:&lt;/strong&gt; Network administrators configure routing to direct traffic through the VPN tunnel instead of sending it unencrypted over the public internet.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;A Site-to-Site VPN is one of the most secure ways to connect resources across different locations. While the tunnel relies on public internet infrastructure, all traffic is encrypted, ensuring that data cannot be decrypted without the secret key used to establish the connection. Because of this, the secret key must be shared securely and should never be transmitted through unencrypted channels. With strong encryption and secure key management, Site-to-Site VPN provides an excellent solution for organizations requiring high levels of data protection and privacy across geographically dispersed networks.&lt;/p&gt;&lt;h3 id=&quot;other-networking-possibilities&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/organizing-networking-for-data-platforms-key-connectivity-options/#other-networking-possibilities&quot; class=&quot;heading-anchor&quot;&gt;Other Networking Possibilities&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;There are more advanced options available for securing network traffic and isolating it from the public internet. Two notable solutions worth mentioning are:&lt;/p&gt;&lt;h4 id=&quot;mpls-multiprotocol-label-switching&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/organizing-networking-for-data-platforms-key-connectivity-options/#mpls-multiprotocol-label-switching&quot; class=&quot;heading-anchor&quot;&gt;MPLS (Multiprotocol Label Switching)&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;MPLS is a packet forwarding technology that operates between Layer 2 and Layer 3 of the OSI model. 
It typically utilizes a dedicated network infrastructure, ensuring no public connection is involved. Implementing MPLS requires finding a vendor capable of leasing physical cables for exclusive use. While more expensive and complex to implement than the previously mentioned options, MPLS offers enhanced security and guaranteed connection speeds.&lt;/p&gt;&lt;h4 id=&quot;dedicated-link&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/organizing-networking-for-data-platforms-key-connectivity-options/#dedicated-link&quot; class=&quot;heading-anchor&quot;&gt;Dedicated Link&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;Cloud providers offer solutions like Google Cloud’s Dedicated Interconnect or AWS Direct Connect, which are faster to implement than MPLS, as the cloud provider handles much of the infrastructure. These options are ideal for establishing physical, private connections between on-premises networks and cloud provider networks. However, they may be excessive for connecting to a single data source on a data platform.&lt;/p&gt;&lt;p&gt;While these options provide additional layers of security and performance, they should be carefully considered based on specific organizational needs and resources.&lt;/p&gt;&lt;h2 id=&quot;conclusion&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/organizing-networking-for-data-platforms-key-connectivity-options/#conclusion&quot; class=&quot;heading-anchor&quot;&gt;Conclusion&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;From public access to VPC peering and site-to-site VPNs, the choice of networking strategy significantly impacts your data platform’s security, performance, scalability, and flexibility. Each option comes with trade-offs that need to be considered.&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;Security should be a primary concern. 
Public access is the least secure option, while site-to-site VPN offers robust protection.&lt;/li&gt;&lt;li&gt;VPC peering provides an excellent balance of performance and security for resources within the same cloud provider.&lt;/li&gt;&lt;li&gt;Access Control Lists (ACLs) offer an additional layer of security for public access scenarios, allowing for fine-grained control.&lt;/li&gt;&lt;li&gt;Application layer security remains crucial regardless of the chosen networking option, complementing network-level protections.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;When designing your data platform’s networking architecture, consider your specific use case, security needs, and scalability requirements. Remember that a comprehensive approach, combining appropriate networking strategies with robust application-level security measures, will provide the most effective protection for your valuable data assets.&lt;/p&gt;&lt;p&gt;As technology evolves, staying updated on &lt;strong&gt;best practices and emerging solutions&lt;/strong&gt; will help ensure your platform remains secure and efficient in the long run.&lt;/p&gt;&lt;/div&gt;
 			</content>
    </entry><entry>
      <title>What is a Modular Data Platform?</title>
      <link href="https://thescalableway.com/blog/what-is-a-modular-data-platform/" />
      <updated>2025-02-10T09:30:00Z</updated>
      <id>https://thescalableway.com/blog/what-is-a-modular-data-platform/</id>
      <content type="html">
				&lt;nav id=&quot;toc&quot; class=&quot;table-of-contents prose&quot;&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/what-is-a-modular-data-platform/#history-of-data-platforms-from-olap-cubes-to-hadoop-to-lakehouses&quot;&gt;History of Data Platforms: From OLAP Cubes to Hadoop to Lakehouses&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/what-is-a-modular-data-platform/#reference-architecture-of-a-modern-and-modular-data-platform&quot;&gt;Reference Architecture of a Modern and Modular Data Platform&lt;/a&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/what-is-a-modular-data-platform/#data-sources&quot;&gt;Data Sources&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/what-is-a-modular-data-platform/#ingestion&quot;&gt;Ingestion&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/what-is-a-modular-data-platform/#landing&quot;&gt;Landing&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/what-is-a-modular-data-platform/#preparation&quot;&gt;Preparation&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/what-is-a-modular-data-platform/#modeling&quot;&gt;Modeling&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/what-is-a-modular-data-platform/#consumption&quot;&gt;Consumption&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/what-is-a-modular-data-platform/#data-cataloging&quot;&gt;Data Cataloging&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/what-is-a-modular-data-platform/#data-orchestration&quot;&gt;Data 
Orchestration&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/what-is-a-modular-data-platform/#understanding-modularity-in-the-context-of-data-platforms&quot;&gt;Understanding Modularity in the Context of Data Platforms&lt;/a&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/what-is-a-modular-data-platform/#components-and-interfaces-before-tools&quot;&gt;Components and Interfaces Before Tools&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/what-is-a-modular-data-platform/#workflows-and-developer-experience&quot;&gt;Workflows and Developer Experience&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/what-is-a-modular-data-platform/#data-governance-and-security&quot;&gt;Data Governance and Security&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/what-is-a-modular-data-platform/#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/nav&gt;&lt;p&gt;&lt;span id=&quot;toc-skipped&quot; class=&quot;visually-hidden&quot;&gt;&lt;/span&gt;&lt;/p&gt;&lt;div class=&quot;flow prose&quot;&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;Modern data analytics can get complicated. With an abundance of tools, conflicting methodologies, and ever-evolving technologies, mistakes can be costly. However, at its core, data analytics remains grounded in a few fundamental principles. 
Understanding these fundamentals while leveraging modular and well-designed data platforms can significantly improve operational efficiency and decision-making.&lt;/p&gt;&lt;h2 id=&quot;history-of-data-platforms-from-olap-cubes-to-hadoop-to-lakehouses&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/what-is-a-modular-data-platform/#history-of-data-platforms-from-olap-cubes-to-hadoop-to-lakehouses&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;History of Data Platforms: From OLAP Cubes to Hadoop to Lakehouses&lt;/strong&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;The evolution of data platforms has been driven by two primary goals:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Doing Analytics Better:&lt;/strong&gt; improving analytics work with more efficient storage and retrieval of business and machine data; moving insights generation closer to the domain experts by improving self-service tools and processes&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Doing Better Analytics:&lt;/strong&gt; increasing the value of analytics by having more and deeper insights; leveraging statistical modelling, machine learning and AI to improve the quality of business decisions&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;SQL, which dates back to the early 1970s, remains at the core of analytics, but the technology around it has advanced significantly.&lt;/p&gt;&lt;p&gt;OLAP cubes emerged in 1993, introducing multi-dimensional analysis. In the mid-2000s, Hadoop revolutionized big data processing, allowing distributed storage and computing. 
More recently, the Lakehouse paradigm has sought to unify the best aspects of data warehouses and data lakes, improving performance, governance, and flexibility.&lt;/p&gt;&lt;h2 id=&quot;reference-architecture-of-a-modern-and-modular-data-platform&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/what-is-a-modular-data-platform/#reference-architecture-of-a-modern-and-modular-data-platform&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Reference Architecture of a Modern and Modular Data Platform&lt;/strong&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;A modern data platform consists of several key components, each playing a crucial role in the data lifecycle. These components enable efficient data movement, transformation, and consumption while ensuring modularity and scalability.&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal58&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/s6LJnDtkZT-960.webp 960w, https://thescalableway.com/img/s6LJnDtkZT-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/s6LJnDtkZT-960.jpeg&quot; alt width=&quot;1600&quot; height=&quot;824&quot; srcset=&quot;https://thescalableway.com/img/s6LJnDtkZT-960.jpeg 960w, https://thescalableway.com/img/s6LJnDtkZT-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;58&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/s6LJnDtkZT-960.webp 960w, https://thescalableway.com/img/s6LJnDtkZT-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/s6LJnDtkZT-960.jpeg&quot; alt width=&quot;1600&quot; height=&quot;824&quot; 
srcset=&quot;https://thescalableway.com/img/s6LJnDtkZT-960.jpeg 960w, https://thescalableway.com/img/s6LJnDtkZT-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;h3 id=&quot;data-sources&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/what-is-a-modular-data-platform/#data-sources&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Data Sources&lt;/strong&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Data sources are the origin of information within an organization. These range from structured databases, APIs, and SaaS applications to unstructured sources such as logs, IoT streams, and social media feeds. Some sources offer modern APIs for easy integration, while others, particularly legacy systems, require extensive workarounds.&lt;/p&gt;&lt;h3 id=&quot;ingestion&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/what-is-a-modular-data-platform/#ingestion&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Ingestion&lt;/strong&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Ingestion refers to the process of transferring data from sources into the platform reliably. This is typically done via scheduled batch jobs, though some architectures incorporate real-time ingestion using event brokers like Kafka.&lt;/p&gt;&lt;p&gt;The ingestion landscape is fragmented, with numerous tools available, such as Azure Data Factory and Fivetran. However, no tool provides connectors for every possible source. Consequently, organizations often need to develop custom connectors, leading to maintenance challenges and dependencies on vendors.&lt;/p&gt;&lt;h3 id=&quot;landing&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/what-is-a-modular-data-platform/#landing&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Landing&lt;/strong&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Landing zones serve as the initial storage layer where raw, unprocessed data is deposited after ingestion. 
This stage ensures that data is captured in its original form, preserving fidelity and enabling downstream transformation.&lt;/p&gt;&lt;p&gt;Storage for this type of data (including lakehouse data) has been standardized around the AWS S3 object storage API. Consequently, most cloud providers now offer their own variations of object storage with APIs closely mirroring AWS S3.&lt;/p&gt;&lt;h3 id=&quot;preparation&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/what-is-a-modular-data-platform/#preparation&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Preparation&lt;/strong&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Since raw data can be messy and inconsistent, preparation is necessary to clean, standardize, and format it for further processing. This stage includes:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;Data masking&lt;/li&gt;&lt;li&gt;Data anonymization&lt;/li&gt;&lt;li&gt;Structuring into standardized formats such as Delta Tables or Apache Iceberg Parquet files&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Note that both data masking and anonymization can also be applied during landing, while the data is still “in transit”, to avoid storing sensitive information on the platform.&lt;/p&gt;&lt;p&gt;Data engineers typically handle this step using workflow tools like Alteryx, Azure Data Factory, or programming languages such as Python.&lt;/p&gt;&lt;h3 id=&quot;modeling&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/what-is-a-modular-data-platform/#modeling&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Modeling&lt;/strong&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Modeling transforms prepared data into well-structured datasets optimized for analytical use. Historically, this was the “T” in ETL (Extract, Transform, Load). 
Today, tools like dbt have popularized the concept of modular and scalable transformation workflows.&lt;/p&gt;&lt;h3 id=&quot;consumption&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/what-is-a-modular-data-platform/#consumption&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Consumption&lt;/strong&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Once modeled, data is consumed in various ways, including:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;Traditional dashboards and Excel reports&lt;/li&gt;&lt;li&gt;Embedded analytics within applications&lt;/li&gt;&lt;li&gt;AI-powered data exploration (e.g., generative AI and natural language querying)&lt;/li&gt;&lt;/ul&gt;&lt;h3 id=&quot;data-cataloging&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/what-is-a-modular-data-platform/#data-cataloging&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Data Cataloging&lt;/strong&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;A data catalog is a comprehensive inventory of an organization’s data assets, documenting their structure, relationships, and usage. It extends beyond datasets to include analytical assets such as dashboards, reports, and Jupyter notebooks, ensuring a unified and well-organized view of available information.&lt;/p&gt;&lt;p&gt;Despite its critical role in data management, data cataloging is often overlooked or deprioritized in analytics projects. However, a well-maintained data catalog is fundamental to effective data governance and security. 
By systematically identifying all data assets, their ownership, and their respective domains, organizations can enhance discoverability, streamline compliance efforts, and facilitate data democratization.&lt;/p&gt;&lt;h3 id=&quot;data-orchestration&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/what-is-a-modular-data-platform/#data-orchestration&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Data Orchestration&lt;/strong&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Data orchestration refers to the automated coordination of ETL processes, from data ingestion and preparation to final modeling for consumption. It ensures that data flows seamlessly across different stages, reducing manual intervention and improving efficiency.&lt;/p&gt;&lt;p&gt;This industry is highly fragmented, with traditional IT approaches relying on UI-based tools such as Talend and Azure Data Factory. More modern methodologies, however, focus on code-driven orchestration using tools like Apache Airflow and Prefect. These newer solutions provide greater flexibility, scalability, and integration capabilities, making them preferred choices for organizations aiming to build robust and automated data pipelines.&lt;/p&gt;&lt;h2 id=&quot;understanding-modularity-in-the-context-of-data-platforms&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/what-is-a-modular-data-platform/#understanding-modularity-in-the-context-of-data-platforms&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Understanding Modularity in the Context of Data Platforms&lt;/strong&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Unlike ERP systems, which are often monolithic, data platforms are inherently modular. The diversity of data workflows and use cases makes it impractical to consolidate everything into a single tool.&lt;/p&gt;&lt;p&gt;Some vendors, such as Databricks and Microsoft Fabric, attempt to provide an all-in-one solution. 
However, even these platforms require integration with external components to cover all aspects of data management.&lt;/p&gt;&lt;h3 id=&quot;components-and-interfaces-before-tools&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/what-is-a-modular-data-platform/#components-and-interfaces-before-tools&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Components and Interfaces Before Tools&lt;/strong&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;The success of a data platform hinges on well-defined interfaces between its components. A common pitfall is over-reliance on a single vendor, leading to inflexible architectures that struggle to adapt to evolving business needs. Organizations should prioritize:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;Well-defined interfaces between tools&lt;/li&gt;&lt;li&gt;A single point for managing access (e.g., using Active Directory groups)&lt;/li&gt;&lt;li&gt;Loose coupling between components to enable flexibility&lt;/li&gt;&lt;/ul&gt;&lt;h3 id=&quot;workflows-and-developer-experience&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/what-is-a-modular-data-platform/#workflows-and-developer-experience&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Workflows and Developer Experience&lt;/strong&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;A streamlined developer experience is crucial for maintaining data platform efficiency. Poorly designed workflows can introduce bottlenecks, reduce productivity, and increase technical debt. 
Best practices include:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;Automating repetitive tasks (e.g., CI/CD for data pipelines)&lt;/li&gt;&lt;li&gt;Enforcing coding standards and documentation&lt;/li&gt;&lt;li&gt;Providing self-service capabilities for data consumers&lt;/li&gt;&lt;/ul&gt;&lt;h3 id=&quot;data-governance-and-security&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/what-is-a-modular-data-platform/#data-governance-and-security&quot; class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Data Governance and Security&lt;/strong&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;With numerous tools and evolving datasets, data governance and security must be proactive rather than reactive. &lt;strong&gt;Traditional IT governance models, which assume static datasets, are insufficient for modern data platforms&lt;/strong&gt;. Without a structured approach to creating and managing new data assets, governance becomes &lt;strong&gt;impossible&lt;/strong&gt;.&lt;/p&gt;&lt;p&gt;Effective data governance requires:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;Clear company policies on data access, privacy, and security.&lt;/li&gt;&lt;li&gt;A solid understanding of analytics workflows to incorporate governance steps and audit reviews seamlessly.&lt;/li&gt;&lt;li&gt;Automated data cataloging to maintain visibility into data assets and their ownership.&lt;/li&gt;&lt;li&gt;A comprehensive inventory of all analytics tools, ensuring each one is correctly configured and continuously monitored for compliance.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;While data governance is straightforward in principle, it requires a structured, realistic approach with well-defined steps to ensure its successful implementation and long-term effectiveness.&lt;/p&gt;&lt;h2 id=&quot;conclusion&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/what-is-a-modular-data-platform/#conclusion&quot; 
class=&quot;heading-anchor&quot;&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Modern data platforms are modular ecosystems that require careful design and governance to be effective. By understanding the historical evolution of data architectures, organizations can make informed decisions about structuring their platforms. Prioritizing interoperability, developer experience, and security ensures a scalable and efficient data operations strategy.&lt;/p&gt;&lt;p&gt;Organizations that embrace modularity and best practices in data management will not only improve operational efficiency but also gain a competitive advantage in an increasingly data-driven world.&lt;/p&gt;&lt;/div&gt;
 			</content>
    </entry><entry>
      <title>Breaking Down Prefect Deployments To Improve The Data Ops Efficiency</title>
      <link href="https://thescalableway.com/blog/breaking-down-prefect-deployments-to-improve-the-data-ops-efficiency/" />
      <updated>2025-01-28T09:45:00Z</updated>
      <id>https://thescalableway.com/blog/breaking-down-prefect-deployments-to-improve-the-data-ops-efficiency/</id>
      <content type="html">
				&lt;nav id=&quot;toc&quot; class=&quot;table-of-contents prose&quot;&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/breaking-down-prefect-deployments-to-improve-the-data-ops-efficiency/#why-observability-matters-in-etl-processes&quot;&gt;Why Observability Matters in ETL Processes&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/breaking-down-prefect-deployments-to-improve-the-data-ops-efficiency/#the-pitfalls-of-a-single-monolithic-flow&quot;&gt;The Pitfalls of a Single Monolithic Flow&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/breaking-down-prefect-deployments-to-improve-the-data-ops-efficiency/#the-case-for-granulated-focused-flows&quot;&gt;The Case for Granulated, Focused Flows&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/breaking-down-prefect-deployments-to-improve-the-data-ops-efficiency/#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/nav&gt;&lt;p&gt;&lt;span id=&quot;toc-skipped&quot; class=&quot;visually-hidden&quot;&gt;&lt;/span&gt;&lt;/p&gt;&lt;div class=&quot;flow prose&quot;&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;When building data platforms, it’s tempting to focus entirely on the technology stack—choosing shiny tools, debating between bulk loads or streaming, and designing storage and infrastructure to meet current needs. Yet, the rush to get data flowing often overshadows a crucial question: &lt;strong&gt;How will we monitor and operate all of this effectively?&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;In the early stages, data projects typically start small: an MVP, one or two data sources, and a couple of flow runs per day. At this scale, operations often feel secondary— issues can be solved on the spot, and data engineering teams are under pressure to deliver data to the end users. 
But as the platform scales, this oversight catches up. Within months, many teams find themselves struggling to manage DataOps, with operational gaps threatening their progress.&lt;/p&gt;&lt;p&gt;Observability and day-to-day functionality are the bedrock of robust, scalable, and maintainable data pipelines. Modern orchestration tools like Prefect excel at breaking down pipelines into smaller, more manageable pieces, making it easier to monitor, troubleshoot, and deploy smoothly. By designing pipelines with intention and visibility in mind, teams can ensure their data platform remains reliable—even as it evolves.&lt;/p&gt;&lt;h2 id=&quot;why-observability-matters-in-etl-processes&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/breaking-down-prefect-deployments-to-improve-the-data-ops-efficiency/#why-observability-matters-in-etl-processes&quot; class=&quot;heading-anchor&quot;&gt;Why Observability Matters in ETL Processes&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Observability is a cornerstone of modern data engineering and operations. As ETL pipelines become critical for decision-making, data teams need deep visibility into pipeline performance and meaningful, actionable logs. The stakes are high—when something goes wrong, time is lost (and as we all know, time is money, or at least that is what they say), and teams are left scrambling to identify issues. At best, this means tedious log analysis and guesswork; at worst—handling complaints from frustrated end-users.&lt;/p&gt;&lt;p&gt;To avoid these pitfalls, observability is a must. It not only ensures transparency with stakeholders but also equips teams to diagnose and address problems efficiently. 
Effective observability hinges on four dimensions:&lt;/p&gt;&lt;ol class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Transparency:&lt;/strong&gt; Understand what each step in the pipeline does, including inputs and outputs.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Traceability:&lt;/strong&gt; Track data as it flows through the pipeline, making it possible to pinpoint where issues arise.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Granularity:&lt;/strong&gt; Drill down to isolate performance bottlenecks, failed tasks, or long-running tasks.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Scalability:&lt;/strong&gt; Expand monitoring and alerting systems to keep pace as the ETL process grows in complexity.&lt;/li&gt;&lt;/ol&gt;&lt;h2 id=&quot;the-pitfalls-of-a-single-monolithic-flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/breaking-down-prefect-deployments-to-improve-the-data-ops-efficiency/#the-pitfalls-of-a-single-monolithic-flow&quot; class=&quot;heading-anchor&quot;&gt;The Pitfalls of a Single Monolithic Flow&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;When starting an ELT project, it’s common to build one or two monolithic flows. These flows often contain dozens of tasks, a number that inevitably grows as the solution scales.&lt;/p&gt;&lt;p&gt;The code then usually looks more or less like this:&lt;/p&gt;&lt;p&gt;&lt;strong&gt;1. Task to fetch a list of tables from MS SQL&lt;/strong&gt;&lt;/p&gt;&lt;pre class=&quot;language-sql&quot;&gt;&lt;code class=&quot;language-sql&quot;&gt;&lt;span class=&quot;token variable&quot;&gt;@task&lt;/span&gt;
def get_table_names&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;conn_str: str&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&amp;gt;&lt;/span&gt; List&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;str&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;:
    &lt;span class=&quot;token string&quot;&gt;&quot;&quot;&quot;
    Connect to an MS SQL database and return a list of tables.
    &quot;&quot;&quot;&lt;/span&gt;
    query &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&quot;&quot;
    SELECT TABLE_NAME
    FROM INFORMATION_SCHEMA.TABLES
    WHERE TABLE_TYPE = &#39;BASE TABLE&#39;
      AND TABLE_CATALOG = DB_NAME()
    &quot;&quot;&quot;&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;with&lt;/span&gt; pyodbc&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;connect&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;conn_str&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;as&lt;/span&gt; conn:
        &lt;span class=&quot;token keyword&quot;&gt;cursor&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; conn&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;cursor&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;cursor&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;execute&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;query&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        results &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;cursor&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;fetchall&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

    table_names &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;row&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;row&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;in&lt;/span&gt; results&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; table_names&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;strong&gt;2. Task to extract data from a specific table into a DataFrame&lt;/strong&gt;&lt;/p&gt;&lt;pre class=&quot;language-sql&quot;&gt;&lt;code class=&quot;language-sql&quot;&gt;&lt;span class=&quot;token variable&quot;&gt;@task&lt;/span&gt;
def extract_table_to_df&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;conn_str: str&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; table_name: str&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&amp;gt;&lt;/span&gt; pd&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;DataFrame:
    &lt;span class=&quot;token string&quot;&gt;&quot;&quot;&quot;
    Run SELECT * on the given table and return a Pandas DataFrame.
    &quot;&quot;&quot;&lt;/span&gt;
    query &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; f&lt;span class=&quot;token string&quot;&gt;&quot;SELECT * FROM {table_name}&quot;&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;with&lt;/span&gt; pyodbc&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;connect&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;conn_str&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;as&lt;/span&gt; conn:
        df &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; pd&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;read_sql&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;query&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; conn&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; df&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;strong&gt;3. Task to write a DataFrame to S3 as a Parquet file&lt;/strong&gt;&lt;/p&gt;&lt;pre class=&quot;language-sql&quot;&gt;&lt;code class=&quot;language-sql&quot;&gt;&lt;span class=&quot;token variable&quot;&gt;@task&lt;/span&gt;
def write_parquet_to_s3&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;df: pd&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;DataFrame&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; bucket: str&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; table_name: str&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;:
    &lt;span class=&quot;token string&quot;&gt;&quot;&quot;&quot;
    Write the given DataFrame as a Parquet file to the specified S3 bucket.
    &quot;&quot;&quot;&lt;/span&gt;

    s3_path &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; f&lt;span class=&quot;token string&quot;&gt;&quot;s3://{bucket}/{table_name}.parquet&quot;&lt;/span&gt;

    df&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;to_parquet&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
        path&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;s3_path&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;engine&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;pyarrow&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;index&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token boolean&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
        storage_options&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;{
            &lt;span class=&quot;token string&quot;&gt;&quot;key&quot;&lt;/span&gt;: get_secret_from_gcsm&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;AWS_ACCESS_KEY_ID&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;     
            &lt;span class=&quot;token string&quot;&gt;&quot;secret&quot;&lt;/span&gt;: get_secret_from_gcsm&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;AWS_SECRET_ACCESS_KEY&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;}&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; s3_path&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;strong&gt;4. Main Flow orchestrating the above tasks&lt;/strong&gt;&lt;/p&gt;&lt;pre class=&quot;language-sql&quot;&gt;&lt;code class=&quot;language-sql&quot;&gt;&lt;span class=&quot;token variable&quot;&gt;@flow&lt;/span&gt;
def ms_sql_to_s3_flow&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
    conn_str: str&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    bucket: str&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;:
    &lt;span class=&quot;token string&quot;&gt;&quot;&quot;&quot;
    A Prefect flow that loads all tables from MS SQL into S3 as Parquet files.
    &quot;&quot;&quot;&lt;/span&gt;
    &lt;span class=&quot;token comment&quot;&gt;# Fetch all table names&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;tables&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; get_table_names&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;conn_str&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;token comment&quot;&gt;# For each table, extract and load&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; table_name &lt;span class=&quot;token operator&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;tables&lt;/span&gt;:
        df &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; extract_table_to_df&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;conn_str&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; table_name&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        write_parquet_to_s3&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;df&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; bucket&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; table_name&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;At first, this approach might seem efficient. A single flow can ingest all objects from a database in one run: straightforward and convenient, right?&lt;/p&gt;&lt;p&gt;Initially, with just 10 objects in the database, it works well enough. But as the source database grows to 100 or more objects, the cracks begin to show. Usually, this approach introduces several significant challenges:&lt;/p&gt;&lt;ol class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Difficult Monitoring:&lt;/strong&gt; A single failure marks the entire flow as failed, forcing data engineers to dig through logs to identify the problematic element.&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal7&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/U7yo_GzWvo-960.webp 960w, https://thescalableway.com/img/U7yo_GzWvo-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/U7yo_GzWvo-960.jpeg&quot; alt=&quot;Single Monolithic Flow difficult monitoring&quot; width=&quot;1600&quot; height=&quot;109&quot; srcset=&quot;https://thescalableway.com/img/U7yo_GzWvo-960.jpeg 960w, https://thescalableway.com/img/U7yo_GzWvo-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;7&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/U7yo_GzWvo-960.webp 960w, https://thescalableway.com/img/U7yo_GzWvo-1600.webp 
1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/U7yo_GzWvo-960.jpeg&quot; alt=&quot;Single Monolithic Flow difficult monitoring&quot; width=&quot;1600&quot; height=&quot;109&quot; srcset=&quot;https://thescalableway.com/img/U7yo_GzWvo-960.jpeg 960w, https://thescalableway.com/img/U7yo_GzWvo-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;ol start=&quot;2&quot; class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Limited Reusability:&lt;/strong&gt; It’s hard to re-run a single table, or only the failed objects, without re-running the entire flow.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Reduced Scheduling Flexibility:&lt;/strong&gt; A monolithic flow may require running all tasks together, even when only a subset of them needs frequent execution.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SLA Reporting:&lt;/strong&gt; Measuring success rates becomes much harder. Reporting on flow run states is unreliable, since a failure in one table out of 1,000 causes the whole flow to be marked as failed. Again, this requires digging into logs to measure performance accurately.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Execution Time:&lt;/strong&gt; Monolithic flows that process tables one by one, like the loop above, run sequentially, so the total run time grows with every table added.&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;In essence, a monolithic approach limits observability, reduces performance, and complicates operations.&lt;/p&gt;&lt;h2 id=&quot;the-case-for-granulated-focused-flows&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/breaking-down-prefect-deployments-to-improve-the-data-ops-efficiency/#the-case-for-granulated-focused-flows&quot; class=&quot;heading-anchor&quot;&gt;The Case for Granulated, Focused Flows&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;When it comes to sizing your ELT flows, trust me: you’d rather fight 100 duck-sized horses than one horse-sized duck. 
In other words, breaking down monolithic flows into smaller, focused units is the key to scaling effectively.&lt;/p&gt;&lt;p&gt;The first step is modularizing the monolithic flow. Ideally, each deployment should represent a single data object. For example, if you’re ingesting data from an SQL database, think about organizing your process to allow for per-table scalability. It takes more time up front, but it splits the complexity into manageable pieces.&lt;/p&gt;&lt;p&gt;With the right tools, this approach is not as complex as it sounds. Prefect allows defining deployments with YAML, leveraging project-level default configurations stored under the &lt;code&gt;definitions&lt;/code&gt; key in the &lt;code&gt;prefect.yaml&lt;/code&gt; file. There are two main ways of using them:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;using the entire value as-is,&lt;/li&gt;&lt;li&gt;using part of the pre-defined values (e.g., overriding only a single parameter).&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal8&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/8LA2FSTlgr-556.webp 556w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/8LA2FSTlgr-556.jpeg&quot; alt=&quot;predefined values for granulated, focused flows&quot; width=&quot;556&quot; height=&quot;377&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;8&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/8LA2FSTlgr-556.webp 556w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/8LA2FSTlgr-556.jpeg&quot; alt=&quot;predefined values for granulated, focused flows&quot; width=&quot;556&quot; 
height=&quot;377&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;This way, you can stick with the pre-defined daily schedule as it is, which makes creating deployments much easier than it might first seem.&lt;/p&gt;&lt;p&gt;Here’s why granular flow deployments are worth the effort:&lt;/p&gt;&lt;ol class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Parallelism:&lt;/strong&gt; Each table flow can run independently, in parallel with the others. If one table experiences performance degradation, it does not immediately affect the rest. And yes, parallelism can be built into a monolithic flow as well, but why spend time reinventing the wheel? The orchestrator can take care of that.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Monitoring and Error Handling:&lt;/strong&gt; If a single table fails, only its flow run fails. This makes it possible to quickly identify the failed table, debug it, and restart only that deployment. It also helps with monitoring the execution time of a particular table and with tracking data quality issues.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Improved Data Quality Testing:&lt;/strong&gt; It’s much easier to enable data quality tests per data object than to rely on universal rules. Which is better: tests tailored to each column of a dataset, or a single check that the dataset is not null?&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Incremental Maintenance and Scalability:&lt;/strong&gt; Modular flows create clear boundaries. Adding or updating flows for new or modified tables doesn’t necessarily affect existing deployments. Each table’s logic is easier to maintain and evolve in isolation.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Version Control:&lt;/strong&gt; Each deployment can be versioned independently. 
This makes testing changes for one table more straightforward and also simplifies the CI/CD implementation.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Team Collaboration:&lt;/strong&gt; Different engineers can own specific deployments, making it easier to distribute responsibility and keep changes localized. Tags are a good way to identify project-related deployments; e.g., a sales tag in Prefect can mark all processes related to sales data.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Granular Scheduling:&lt;/strong&gt; Some tables need to be refreshed three times a day, while others should be reloaded only once a month. The granular approach leaves room to tune each schedule separately.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SLA Reporting:&lt;/strong&gt; Reporting gets simpler, as the real situation is visible at the run level, and a failure means a real failure.&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal9&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/a2G_ARBp8W-887.webp 887w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/a2G_ARBp8W-887.jpeg&quot; alt=&quot;granular flow deployments SLA reporting&quot; width=&quot;887&quot; height=&quot;228&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;9&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/a2G_ARBp8W-887.webp 887w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/a2G_ARBp8W-887.jpeg&quot; alt=&quot;granular flow deployments SLA reporting&quot; width=&quot;887&quot; height=&quot;228&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;h2 id=&quot;conclusion&quot;&gt;&lt;a 
href=&quot;https://thescalableway.com/blog/breaking-down-prefect-deployments-to-improve-the-data-ops-efficiency/#conclusion&quot; class=&quot;heading-anchor&quot;&gt;Conclusion&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;A granular approach to orchestrated deployments is more than just a technical choice: it’s a strategic advantage. By breaking large, monolithic pipelines into focused, modular flows, data teams gain clearer observability, easier troubleshooting, and the flexibility to handle diverse scheduling needs.&lt;/p&gt;&lt;p&gt;Focusing on the key concerns of performance, reliability, and maintainability can help you build a better data solution using a granular approach. Over time, this approach will lead to more predictable, scalable, and maintainable ETL processes.&lt;/p&gt;&lt;/div&gt;
 			</content>
    </entry><entry>
      <title>dlt and Prefect, a Great Combo for Streamlined Data Ingestion Pipelines</title>
      <link href="https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/" />
      <updated>2025-01-27T12:54:00Z</updated>
      <id>https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/</id>
      <content type="html">
				&lt;nav id=&quot;toc&quot; class=&quot;table-of-contents prose&quot;&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#a-short-introduction-to-dlt-and-prefect&quot;&gt;A Short Introduction to dlt and Prefect&lt;/a&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#dlt&quot;&gt;dlt&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#prefect&quot;&gt;Prefect&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#creating-data-connectors-and-pipelines-with-dlt&quot;&gt;Creating Data Connectors and Pipelines with dlt&lt;/a&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#data-pipeline-features&quot;&gt;Data Pipeline Features&lt;/a&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#modularity&quot;&gt;Modularity&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#extensibility&quot;&gt;Extensibility&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#reliability&quot;&gt;Reliability&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a 
href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#security&quot;&gt;Security&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#privacy&quot;&gt;Privacy&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#efficiency&quot;&gt;Efficiency&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#orchestrating-data-pipelines-with-prefect&quot;&gt;Orchestrating Data Pipelines with Prefect&lt;/a&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#orchestration-job-features&quot;&gt;Orchestration Job Features&lt;/a&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#alerting&quot;&gt;Alerting&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#reliability-1&quot;&gt;Reliability&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#secret-management&quot;&gt;Secret Management&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#distributed-processing&quot;&gt;Distributed 
Processing&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#production-workflow&quot;&gt;Production Workflow&lt;/a&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#overview&quot;&gt;Overview&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#configuring-dlt&quot;&gt;Configuring dlt&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#creating-a-dlt-pipeline&quot;&gt;Creating a dlt Pipeline&lt;/a&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#pipeline-design&quot;&gt;Pipeline Design&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#inspecting-the-data-manually&quot;&gt;Inspecting the Data Manually&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#testing-the-pipeline&quot;&gt;Testing the Pipeline&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#creating-a-prefect-flow-and-deployment&quot;&gt;Creating a Prefect Flow and Deployment&lt;/a&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a 
href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#flow-design&quot;&gt;Flow Design&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#handling-pipeline-secrets&quot;&gt;Handling Pipeline Secrets&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#deploying-to-production&quot;&gt;Deploying to Production&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#summary&quot;&gt;Summary&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#next-steps&quot;&gt;Next steps&lt;/a&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#data-transformation&quot;&gt;Data Transformation&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#ready-to-dive-deeper&quot;&gt;Ready to Dive Deeper?&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#footnotes&quot;&gt;Footnotes&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/nav&gt;&lt;p&gt;&lt;span id=&quot;toc-skipped&quot; class=&quot;visually-hidden&quot;&gt;&lt;/span&gt;&lt;/p&gt;&lt;div class=&quot;flow prose&quot;&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Doing data ingestion right is 
hard…&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;Despite advances in data engineering, data ingestion, which includes the Extract and Load (EL) steps of the ELT process, remains a persistent challenge for many data teams.&lt;/p&gt;&lt;p&gt;This difficulty often stems from the real-world limitations of open-source tools, which lead teams to opt for UI-based solutions. While these tools are great for getting started quickly, they often lack the flexibility and scalability required for production-grade data platforms.&lt;br&gt;In the era of AI, UI-based tools face one more limitation: they miss out on most of the benefits of the advanced code generation capabilities of modern LLMs (Large Language Models)&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#footnotes&quot;&gt;[1]&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;Even if teams do decide to use open-source solutions, they often end up creating volumes of low-quality glue code. This in-house software, typically written in a rush by non-professional engineers, often fails to meet essential requirements for modern data platforms, such as EaC (Everything as Code), security, monitoring &amp;amp; alerting, reliability, or extensibility. Such code is also far more brittle and much harder to maintain and modify. Consequently, all modifications to the code (such as adding new features or fixing bugs) take much more time and are far riskier than they should be.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;…but there is light at the end of the tunnel&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;Luckily, in recent years, with the growing adoption of software engineering practices, we’ve seen a professionalization of the data engineering field. 
This has resulted in the creation of a number of high-quality, open-source tools that simplify and improve the quality of data engineering work, such as &lt;a href=&quot;https://dlthub.com/&quot; rel=&quot;noopener&quot;&gt;dlt&lt;/a&gt; and &lt;a href=&quot;https://www.prefect.io/&quot; rel=&quot;noopener&quot;&gt;Prefect&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;In this article, we explore how dlt and Prefect can be seamlessly integrated to implement a best-practice data ingestion component of a modern data platform. Our insights are grounded in real-world experience designing and implementing scalable, code-based data platforms with these open-source tools.&lt;/p&gt;&lt;h2 id=&quot;a-short-introduction-to-dlt-and-prefect&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#a-short-introduction-to-dlt-and-prefect&quot; class=&quot;heading-anchor&quot;&gt;A Short Introduction to dlt and Prefect&lt;/a&gt;&lt;/h2&gt;&lt;h3 id=&quot;dlt&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#dlt&quot; class=&quot;heading-anchor&quot;&gt;dlt&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;&lt;a href=&quot;https://dlthub.com/&quot; rel=&quot;noopener&quot;&gt;dlt&lt;/a&gt; is a Python data ingestion framework enabling data engineers to define connectors and pipelines as code. It offers a rich set of features for building best-practice pipelines and supports both built-in and custom connectors built with regular Python code.&lt;/p&gt;&lt;p&gt;dlt ingests data in &lt;a href=&quot;https://dlthub.com/docs/reference/explainers/how-dlt-works&quot; rel=&quot;noopener&quot;&gt;three stages&lt;/a&gt;: extract, normalize, and load. The &lt;strong&gt;extract&lt;/strong&gt; stage downloads source data to disk. The &lt;strong&gt;normalize&lt;/strong&gt; stage applies light transformations to the data, such as column renaming or datetime parsing. 
The &lt;strong&gt;load&lt;/strong&gt; stage loads the data into the destination system.&lt;/p&gt;&lt;p&gt;Here’s a compact guide to key dlt concepts:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;dlt config&lt;/strong&gt;: dlt can be configured in three ways: through files (&lt;code&gt;config.toml&lt;/code&gt; and &lt;code&gt;secrets.toml&lt;/code&gt;), environment variables, and Python code.&lt;br&gt;Using &lt;code&gt;config.toml&lt;/code&gt; for default settings is recommended, as it’s easy to store the file together with pipeline code in Git. While it can contain some pipeline-level settings as well, its main purpose is to configure global behavior such as logging, parallelization, execution settings, and source or destination configuration common to all pipelines.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Resource&lt;/strong&gt; and &lt;strong&gt;Source&lt;/strong&gt;:&lt;br&gt;A resource is a representation of a single item in a dataset. It can be a file, a database table, a REST API endpoint, etc.&lt;br&gt;A source is a collection of resources, such as a filesystem (e.g., S3), a database, or a REST API.&lt;br&gt;By applying hints to the resource with &lt;code&gt;resource.apply_hints()&lt;/code&gt;, we can configure extraction settings specific to the resource, a pipeline, or a pipeline run, such as primary key, cursor column, column typing, partitioning, etc. With the &lt;code&gt;resource.add_map()&lt;/code&gt; method, we can also apply light transformations to the data (e.g., data masking) before it’s loaded into the destination.&lt;br&gt;dlt is flexible when it comes to working with sources and resources, and it’s easy to use either, depending on the need.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pipeline:&lt;/strong&gt; In dlt, a pipeline describes the flow of data from a source (or resource) to a destination. 
Each pipeline handles a single source&amp;lt;-&amp;gt;destination pair and takes a source or resource as input.&lt;br&gt;Pipelines can be reused to ingest different resources each run. For example, we can have one “Postgres to S3” pipeline, but ingest each Postgres table separately due to different scheduling or configuration needs.&lt;br&gt;A pipeline definition contains pipeline- or pipeline run-specific destination configuration, as well as settings for the load phase of the ingestion. Under the hood, a pipeline run (&lt;code&gt;pipeline.run()&lt;/code&gt;) executes each pipeline step: extract (&lt;code&gt;pipeline.extract()&lt;/code&gt;), normalize (&lt;code&gt;pipeline.normalize()&lt;/code&gt;), and load (&lt;code&gt;pipeline.load()&lt;/code&gt;).&lt;/li&gt;&lt;/ul&gt;&lt;h3 id=&quot;prefect&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#prefect&quot; class=&quot;heading-anchor&quot;&gt;Prefect&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;&lt;a href=&quot;https://www.prefect.io/&quot; rel=&quot;noopener&quot;&gt;Prefect&lt;/a&gt; is a Python data orchestration library that allows data and machine learning engineers to define data workflows (data ingestion, transformation, model training, etc.) as code. It provides a rich set of features to help engineers implement best-practice data orchestration workflows.&lt;/p&gt;&lt;p&gt;Its cloud offering eliminates the historically stressful and labor-intensive maintenance of data orchestration systems.&lt;/p&gt;&lt;p&gt;Let’s unpack the core concepts of Prefect:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Task:&lt;/strong&gt; A task is a single unit of work in a Prefect flow. 
It describes a single step to be executed in the workflow.&lt;br&gt;While it’s possible to implement the logic of the step directly in the task, in most cases, we recommend keeping tasks as thin wrappers around regular Python functions.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Flow:&lt;/strong&gt; A flow is a collection of tasks that defines a data workflow. You can think of it as a graph of tasks describing their relationships (e.g., this task should always run after this one, and this other task should run after that one, but only if it fails).&lt;br&gt;Similar to a dlt pipeline, the same flow can be reused with different sets of parameters. An instance of a flow with specific parameter values is called a &lt;a href=&quot;https://docs.prefect.io/v3/deploy/index&quot; rel=&quot;noopener&quot;&gt;deployment&lt;/a&gt;.&lt;br&gt;In this article, we take advantage of this by using a single &lt;code&gt;extract_and_load()&lt;/code&gt; flow capable of executing any dlt pipeline, depending on the parameters passed to it. As a result, each ingestion becomes a new Prefect deployment rather than a new flow. This has a major consequence: deployments can be defined in YAML, so they don’t require any Python code to be written, and users don’t need to set up a local Python development environment just to, e.g., ingest a new table with an existing pipeline. Instead, we can, for example, expose a simple application that allows non-technical users to create new deployments with a few clicks.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Deployment:&lt;/strong&gt; A deployment is a way to run a flow with a specific set of parameters and environment configuration. 
While most environment configurations in Prefect would typically be defined at the workspace level, deployments allow for overriding some of these settings, including on a per-run basis, which simplifies testing and debugging.&lt;/li&gt;&lt;/ul&gt;&lt;h2 id=&quot;creating-data-connectors-and-pipelines-with-dlt&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#creating-data-connectors-and-pipelines-with-dlt&quot; class=&quot;heading-anchor&quot;&gt;Creating Data Connectors and Pipelines with dlt&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Now that we’ve covered the theoretical underpinnings of dlt and Prefect, it’s time to see these concepts in action. We’ll explore how to implement best-practice dlt pipelines and bring these tools to life.&lt;/p&gt;&lt;h3 id=&quot;data-pipeline-features&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#data-pipeline-features&quot; class=&quot;heading-anchor&quot;&gt;Data Pipeline Features&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Alright, before we dive into the technical part, let’s start with the basics. 
A production-grade data pipeline needs to have several key features:&lt;/p&gt;&lt;ol class=&quot;list&quot;&gt;&lt;li&gt;Modularity: The pipeline should be designed to allow the reuse of components across multiple pipelines.&lt;/li&gt;&lt;li&gt;Extensibility: The pipeline must be upgradeable without disrupting ongoing production jobs.&lt;/li&gt;&lt;li&gt;Reliability: The ability to inspect pipeline execution and quickly identify and resolve issues is crucial.&lt;/li&gt;&lt;li&gt;Security: Proper mechanisms must be in place to securely store and access secrets.&lt;/li&gt;&lt;li&gt;Privacy: Data storage should adhere to privacy regulations, ensuring compliance.&lt;/li&gt;&lt;li&gt;Efficiency: Pipelines must be optimized for cost-effective execution.&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;Data pipelines aren’t one-size-fits-all, and achieving a production-grade pipeline involves ensuring those key features. But how to get there?&lt;/p&gt;&lt;h4 id=&quot;modularity&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#modularity&quot; class=&quot;heading-anchor&quot;&gt;Modularity&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;To achieve modularity, it’s best to split the dlt pipeline code into the following structure:&lt;/p&gt;&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;├── pipelines
│   ├── a_to_c.py
│   ├── b_to_c.py
│   └── utils.py&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;In this structure, &lt;code&gt;a_to_c.py&lt;/code&gt; and &lt;code&gt;b_to_c.py&lt;/code&gt; represent two example pipelines, each moving data from a source system (a or b) to a destination system (c).&lt;/p&gt;&lt;p&gt;The &lt;code&gt;utils.py&lt;/code&gt; file contains common utilities such as a data masking implementation, default configuration for source and destination systems, or default pipeline configuration (except configuration specified in dlt’s &lt;code&gt;config.toml&lt;/code&gt;; for more information, see the dlt config paragraph in &lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#dlt&quot;&gt;the dlt section&lt;/a&gt;).&lt;/p&gt;&lt;h4 id=&quot;extensibility&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#extensibility&quot; class=&quot;heading-anchor&quot;&gt;Extensibility&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;Implementing extensibility goes beyond modularity. The code should also be testable, and ideally, automated testing should be integrated into the CI/CD process.&lt;/p&gt;&lt;p&gt;Since dlt pipelines are implemented using Python, they can be tested with common tools like &lt;code&gt;pytest&lt;/code&gt;. Unit tests should focus on custom utility functions, while integration tests verify the entire pipeline’s behavior.&lt;/p&gt;&lt;p&gt;For integration testing, use a local database or disk drive instead of the target database. 
&lt;a href=&quot;https://duckdb.org/&quot; rel=&quot;noopener&quot;&gt;DuckDB&lt;/a&gt; is a great choice for this purpose, as it’s a lightweight, in-process database that can be used to inspect the loaded data quickly.&lt;/p&gt;&lt;h4 id=&quot;reliability&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#reliability&quot; class=&quot;heading-anchor&quot;&gt;Reliability&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;To maintain trust with data platform users, make sure that when production pipelines fail, you are informed immediately and can recover quickly. While we recommend &lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#alerting&quot;&gt;implementing alerting in the orchestration layer&lt;/a&gt;, pipeline recoverability depends on having access to detailed logs.&lt;/p&gt;&lt;p&gt;Luckily, dlt provides rich built-in logging and error-handling mechanisms. It’s a good idea to also enable &lt;a href=&quot;https://dlthub.com/docs/general-usage/pipeline#display-the-loading-progress&quot; rel=&quot;noopener&quot;&gt;progress monitoring&lt;/a&gt; for additional useful information, such as CPU and memory usage.&lt;/p&gt;&lt;h4 id=&quot;security&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#security&quot; class=&quot;heading-anchor&quot;&gt;Security&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;dlt supports various ways of storing credentials. For local use, secrets can be stored in a &lt;code&gt;.dlt/secrets.toml&lt;/code&gt; file, while production environments may benefit from an external credential store, such as &lt;a href=&quot;https://cloud.google.com/security/products/secret-manager?hl=en&quot; rel=&quot;noopener&quot;&gt;Google Cloud Secret Manager&lt;/a&gt;. 
To accomplish this, you can store the &lt;a href=&quot;https://dlthub.com/docs/walkthroughs/add_credentials#retrieving-credentials-from-google-cloud-secret-manager&quot; rel=&quot;noopener&quot;&gt;secret retrieval utility function&lt;/a&gt; in &lt;code&gt;utils.py&lt;/code&gt; and reuse it within your pipelines.&lt;/p&gt;&lt;p&gt;However, since we’re using Prefect for orchestration, it’s also possible to follow a different path and &lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#secret-management&quot;&gt;use Prefect Secrets to store the credentials&lt;/a&gt;.&lt;/p&gt;&lt;h4 id=&quot;privacy&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#privacy&quot; class=&quot;heading-anchor&quot;&gt;Privacy&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;Data anonymization and/or pseudonymization are crucial to ensure compliance with privacy regulations. Data can be erased/anonymized either:&lt;/p&gt;&lt;ol class=&quot;list&quot;&gt;&lt;li&gt;During the ingestion phase (in which case the original data never enters the destination system)&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal20&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/USynO6REx1-960.webp 960w, https://thescalableway.com/img/USynO6REx1-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/USynO6REx1-960.jpeg&quot; alt=&quot;data masking in data ingestion&quot; width=&quot;1600&quot; height=&quot;335&quot; srcset=&quot;https://thescalableway.com/img/USynO6REx1-960.jpeg 960w, https://thescalableway.com/img/USynO6REx1-1600.jpeg 1600w&quot; 
sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;20&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/USynO6REx1-960.webp 960w, https://thescalableway.com/img/USynO6REx1-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/USynO6REx1-960.jpeg&quot; alt=&quot;data masking in data ingestion&quot; width=&quot;1600&quot; height=&quot;335&quot; srcset=&quot;https://thescalableway.com/img/USynO6REx1-960.jpeg 960w, https://thescalableway.com/img/USynO6REx1-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;ol start=&quot;2&quot; class=&quot;list&quot;&gt;&lt;li&gt;During the transformation phase (in which case private data is stored in one or more layers in the destination system but hidden from the eyes of end users)&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal21&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/TpvfAarwiy-960.webp 960w, https://thescalableway.com/img/TpvfAarwiy-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/TpvfAarwiy-960.jpeg&quot; alt=&quot;data masking in transformation&quot; width=&quot;1600&quot; height=&quot;335&quot; srcset=&quot;https://thescalableway.com/img/TpvfAarwiy-960.jpeg 960w, https://thescalableway.com/img/TpvfAarwiy-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;21&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/TpvfAarwiy-960.webp 960w, https://thescalableway.com/img/TpvfAarwiy-1600.webp 
1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/TpvfAarwiy-960.jpeg&quot; alt=&quot;data masking in transformation&quot; width=&quot;1600&quot; height=&quot;335&quot; srcset=&quot;https://thescalableway.com/img/TpvfAarwiy-960.jpeg 960w, https://thescalableway.com/img/TpvfAarwiy-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;While dlt doesn’t provide built-in anonymization features, it does provide the necessary tools to implement the first option effectively.&lt;/p&gt;&lt;p&gt;For more information, see the &lt;a href=&quot;https://dlthub.com/docs/general-usage/customising-pipelines/pseudonymizing_columns&quot; rel=&quot;noopener&quot;&gt;example&lt;/a&gt; in the official documentation.&lt;/p&gt;&lt;h4 id=&quot;efficiency&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#efficiency&quot; class=&quot;heading-anchor&quot;&gt;Efficiency&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;To ensure pipelines are both cost-effective and high-performing, several optimization techniques can be applied:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Incremental extraction&lt;/strong&gt;&lt;br&gt;Loading data incrementally allows for reducing the amount of data that needs to be &lt;strong&gt;extracted&lt;/strong&gt;. 
Currently, dlt supports incremental extraction for its &lt;a href=&quot;https://dlthub.com/docs/dlt-ecosystem/verified-sources/#core-sources&quot; rel=&quot;noopener&quot;&gt;core sources&lt;/a&gt;: &lt;a href=&quot;https://dlthub.com/docs/general-usage/incremental-loading#incremental-loading-with-a-cursor-field&quot; rel=&quot;noopener&quot;&gt;REST API&lt;/a&gt;, &lt;a href=&quot;https://dlthub.com/docs/walkthroughs/sql-incremental-configuration&quot; rel=&quot;noopener&quot;&gt;SQL database&lt;/a&gt;, and &lt;a href=&quot;https://dlthub.com/docs/dlt-ecosystem/verified-sources/filesystem/basic#5-incremental-loading&quot; rel=&quot;noopener&quot;&gt;filesystem&lt;/a&gt;.&lt;br&gt;Incremental extraction allows us to download only new or modified data.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Write dispositions&lt;/strong&gt;&lt;br&gt;&lt;a href=&quot;https://dlthub.com/docs/general-usage/incremental-loading#choosing-a-write-disposition&quot; rel=&quot;noopener&quot;&gt;Write dispositions&lt;/a&gt; work in tandem with the two extraction methods (full and incremental) to reduce the amount of data that needs to be &lt;strong&gt;loaded&lt;/strong&gt;. For example, if you only extracted new and modified data, you don’t want to overwrite existing data, as that would result in data loss. In such a case, insert the new records and update the existing ones instead.&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;https://dlthub.com/docs/reference/performance#parallelism&quot; rel=&quot;noopener&quot;&gt;&lt;strong&gt;Parallelization&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;dlt allows parallelizing each stage of the pipeline using multithreading or multiprocessing (depending on the stage).&lt;br&gt;In cases where further parallelization is needed (i.e., the workload exceeds the capacity of a single machine), orchestrator-layer parallelization may be required. 
However, this scenario is now rare, as large virtual machines capable of processing petabytes of data are widely available, and dlt can leverage the machine’s resources more efficiently than older tools or typical in-house Python code.&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;https://dlthub.com/docs/reference/performance&quot; rel=&quot;noopener&quot;&gt;&lt;strong&gt;Various other optimizations&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;As the topic of incremental loading can be complex even for seasoned data engineers, we’ve prepared a diagram of all the viable ELT patterns:&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal22&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/uT145YgjSn-960.webp 960w, https://thescalableway.com/img/uT145YgjSn-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/uT145YgjSn-960.jpeg&quot; alt=&quot;Extract load transform patterns&quot; width=&quot;1600&quot; height=&quot;2176&quot; srcset=&quot;https://thescalableway.com/img/uT145YgjSn-960.jpeg 960w, https://thescalableway.com/img/uT145YgjSn-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;22&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/uT145YgjSn-960.webp 960w, https://thescalableway.com/img/uT145YgjSn-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/uT145YgjSn-960.jpeg&quot; alt=&quot;Extract load transform patterns&quot; width=&quot;1600&quot; height=&quot;2176&quot; srcset=&quot;https://thescalableway.com/img/uT145YgjSn-960.jpeg 960w, https://thescalableway.com/img/uT145YgjSn-1600.jpeg 
1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; dlt also provides subtypes of the “merge” disposition, including&amp;nbsp;&lt;a href=&quot;https://dlthub.com/blog/scd2-and-incremental-loading&quot; rel=&quot;noopener&quot;&gt;SCD type 2&lt;/a&gt;; however, for clarity, we did not include these in the diagram. For more information on these subtypes, see the&amp;nbsp;&lt;a href=&quot;https://dlthub.com/docs/general-usage/incremental-loading#merge-incremental-loading&quot; rel=&quot;noopener&quot;&gt;relevant documentation&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;The choice of a specific implementation depends on what is supported by the source and destination systems as well as on how the source data is generated. Ideally, incremental extraction should be used whenever possible. Then, whether you choose the “append” or “merge” write disposition depends on how the data is generated: if you can guarantee that only new records are produced and no existing data is ever modified, you can safely use the “append” disposition; otherwise, use “merge”. Next, you need to check if the destination system handles the disposition you intend to use (e.g., 
some systems don’t support the “merge” disposition).&lt;/p&gt;&lt;p&gt;The following diagram from&amp;nbsp;&lt;a href=&quot;https://dlthub.com/docs/general-usage/incremental-loading#two-simple-questions-determine-the-write-disposition-you-use&quot; rel=&quot;noopener&quot;&gt;dlt’s official documentation&lt;/a&gt;&amp;nbsp;also provides a good overview of when to choose which write disposition:&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal23&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/TixcvWiB7K-960.webp 960w, https://thescalableway.com/img/TixcvWiB7K-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/TixcvWiB7K-960.jpeg&quot; alt=&quot;how to choose a write disposition in dlt&quot; width=&quot;1600&quot; height=&quot;977&quot; srcset=&quot;https://thescalableway.com/img/TixcvWiB7K-960.jpeg 960w, https://thescalableway.com/img/TixcvWiB7K-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;23&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/TixcvWiB7K-960.webp 960w, https://thescalableway.com/img/TixcvWiB7K-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/TixcvWiB7K-960.jpeg&quot; alt=&quot;how to choose a write disposition in dlt&quot; width=&quot;1600&quot; height=&quot;977&quot; srcset=&quot;https://thescalableway.com/img/TixcvWiB7K-960.jpeg 960w, https://thescalableway.com/img/TixcvWiB7K-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;h2 id=&quot;orchestrating-data-pipelines-with-prefect&quot;&gt;&lt;a 
href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#orchestrating-data-pipelines-with-prefect&quot; class=&quot;heading-anchor&quot;&gt;Orchestrating Data Pipelines with Prefect&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Orchestrating data pipelines with Prefect can streamline your workflow and significantly improve efficiency. Let’s dive into the best practices for implementing Prefect flows and how they integrate smoothly with your data pipelines.&lt;/p&gt;&lt;h3 id=&quot;orchestration-job-features&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#orchestration-job-features&quot; class=&quot;heading-anchor&quot;&gt;Orchestration Job Features&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Ideally, the orchestration layer is a thin wrapper over the underlying data pipeline logic. Whenever a feature can be implemented at the pipeline level, it should be implemented there in order to prevent excessive coupling with the orchestration layer and minimize complexity, which simplifies self-service data ingestion.&lt;/p&gt;&lt;p&gt;Here are a few key features that are best handled at the orchestration layer:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;alerting&lt;/li&gt;&lt;li&gt;additional reliability measures&lt;/li&gt;&lt;li&gt;security (specifically, secret management)&lt;/li&gt;&lt;li&gt;distributed processing&lt;/li&gt;&lt;/ul&gt;&lt;h4 id=&quot;alerting&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#alerting&quot; class=&quot;heading-anchor&quot;&gt;Alerting&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;With Prefect, you can set up &lt;a href=&quot;https://docs.prefect.io/v3/automate/events/automations-triggers#manage-automations&quot; rel=&quot;noopener&quot;&gt;alerts&lt;/a&gt;, ensuring you’re notified via Slack, Teams, or email whenever jobs or infrastructure components enter an 
unexpected state.&lt;/p&gt;&lt;h4 id=&quot;reliability-1&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#reliability-1&quot; class=&quot;heading-anchor&quot;&gt;Reliability&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;While we can (and, where possible, should) implement retries and &lt;a href=&quot;https://dlthub.com/docs/general-usage/http/requests#customizing-retry-settings&quot; rel=&quot;noopener&quot;&gt;timeouts&lt;/a&gt; at the pipeline level, Prefect also provides these features at the task and flow levels. Think of this as a last-resort, catch-all mechanism that allows data engineers to ensure timeouts and retries are enforced regardless of how well the dlt pipeline or helper code is written, again lowering the bar for self-service data ingestion.&lt;/p&gt;&lt;h4 id=&quot;secret-management&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#secret-management&quot; class=&quot;heading-anchor&quot;&gt;Secret Management&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;Security is always a top concern, and Prefect’s secret management integrations make it easier to store and handle secrets. Whether it’s Google Cloud Secret Manager or AWS Secrets Manager, Prefect allows you to securely retrieve credentials and pass them to the dlt pipeline. 
This approach ensures that no credentials are stored locally, and administrators have fine-grained control over access by utilizing Prefect’s Role-Based Access Control (RBAC).&lt;/p&gt;&lt;h4 id=&quot;distributed-processing&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#distributed-processing&quot; class=&quot;heading-anchor&quot;&gt;Distributed Processing&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;While any code-based orchestration tool allows for distributed processing, this feature is rarely required at the pipeline level in recent times. Firstly, data ingestion tools such as dlt are capable of efficiently utilizing machine resources, including parallelization and efficient and safe use of memory. Secondly, virtual machines have grown bigger—we can now easily rent VMs with hundreds of cores and hundreds of gigabytes of RAM. Therefore, typically, distributed processing is only required in case we need to run multiple resource-hungry pipelines in parallel.&lt;/p&gt;&lt;h2 id=&quot;production-workflow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#production-workflow&quot; class=&quot;heading-anchor&quot;&gt;Production Workflow&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Now that we’ve outlined the essential features of a production-grade dlt pipeline and Prefect flow, let’s break down the steps of creating and orchestrating data ingestion pipelines in production.&lt;/p&gt;&lt;h3 id=&quot;overview&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#overview&quot; class=&quot;heading-anchor&quot;&gt;Overview&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;The diagram below illustrates the key steps in this production workflow.&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal24&quot;&gt;&lt;button autofocus 
class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/m5ty1gBN3J-960.webp 960w, https://thescalableway.com/img/m5ty1gBN3J-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/m5ty1gBN3J-960.jpeg&quot; alt=&quot;data pipeline workflow&quot; width=&quot;1600&quot; height=&quot;591&quot; srcset=&quot;https://thescalableway.com/img/m5ty1gBN3J-960.jpeg 960w, https://thescalableway.com/img/m5ty1gBN3J-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;24&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/m5ty1gBN3J-960.webp 960w, https://thescalableway.com/img/m5ty1gBN3J-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/m5ty1gBN3J-960.jpeg&quot; alt=&quot;data pipeline workflow&quot; width=&quot;1600&quot; height=&quot;591&quot; srcset=&quot;https://thescalableway.com/img/m5ty1gBN3J-960.jpeg 960w, https://thescalableway.com/img/m5ty1gBN3J-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;ol class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Create a dlt pipeline:&lt;/strong&gt; We start by creating a dlt pipeline (if the one we need doesn’t exist yet). Once the pipeline is finished and tests pass, we can move on to the next step.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Create Prefect deployment&lt;/strong&gt;: We create a Prefect deployment for the pipeline. 
Note that we use Prefect’s &lt;code&gt;prefect.yaml&lt;/code&gt; file together with a single &lt;code&gt;extract_and_load()&lt;/code&gt; flow capable of executing any dlt pipeline, which drastically simplifies this process.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Create a Pull Request:&lt;/strong&gt; We create a pull request with the new deployment. This triggers the CI/CD process.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;DEV environment:&lt;/strong&gt; The deployment is created in the DEV Prefect workspace, and a DEV Docker image is built. We can now manually run the deployment in the Prefect UI, which will execute our pipeline in the DEV environment.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;PROD environment:&lt;/strong&gt; Once we’re happy with the results, we merge the pull request. This triggers a CI/CD job, which creates the deployment in the PROD Prefect workspace and builds a PROD Docker image. The deployment schedule is enabled only at this stage.&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;If the pipeline already exists and only a new table is being ingested, the user only needs to add a few lines of YAML to &lt;code&gt;prefect.yaml&lt;/code&gt; and create a PR.&lt;/p&gt;&lt;h3 id=&quot;configuring-dlt&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#configuring-dlt&quot; class=&quot;heading-anchor&quot;&gt;Configuring dlt&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;While dlt is highly configurable and allows for a lot of customization and optimization, we recommend starting with three useful settings:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;code&gt;runtime.log_level&lt;/code&gt; to enable more logging&lt;/li&gt;&lt;li&gt;&lt;code&gt;normalize.parquet_normalizer.add_dlt_load_id&lt;/code&gt; to add a dlt load ID to the loaded data&lt;/li&gt;&lt;li&gt;&lt;code&gt;normalize.parquet_normalizer.add_dlt_id&lt;/code&gt; to add a unique ID to each row&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The ID settings will make our 
data easier to work with for downstream users, as well as make our loads (especially incremental ones) easier to debug.&lt;/p&gt;&lt;h3 id=&quot;creating-a-dlt-pipeline&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#creating-a-dlt-pipeline&quot; class=&quot;heading-anchor&quot;&gt;Creating a dlt Pipeline&lt;/a&gt;&lt;/h3&gt;&lt;h4 id=&quot;pipeline-design&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#pipeline-design&quot; class=&quot;heading-anchor&quot;&gt;Pipeline Design&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;We start by creating a dlt pipeline, following the best practices detailed in the &lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#creating-data-connectors-and-pipelines-with-dlt&quot;&gt;Creating data connectors and pipelines with dlt&lt;/a&gt; section above.&lt;/p&gt;&lt;p&gt;For testability and modularity, we recommend splitting the pipeline into a resource (source data) and pipeline (journey and destination) parts. 
This way, you can easily test each part separately.&lt;/p&gt;&lt;h4 id=&quot;inspecting-the-data-manually&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#inspecting-the-data-manually&quot; class=&quot;heading-anchor&quot;&gt;Inspecting the Data Manually&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;At any stage of pipeline development, you can manually inspect the loaded data, e.g., by printing it to the console or by checking the database directly.&lt;/p&gt;&lt;h4 id=&quot;testing-the-pipeline&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#testing-the-pipeline&quot; class=&quot;heading-anchor&quot;&gt;Testing the Pipeline&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;For integration testing, you can use DuckDB as a destination system. It’s lightweight and allows you to quickly check ingested data, so you can iterate faster.&lt;/p&gt;&lt;h3 id=&quot;creating-a-prefect-flow-and-deployment&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#creating-a-prefect-flow-and-deployment&quot; class=&quot;heading-anchor&quot;&gt;Creating a Prefect Flow and Deployment&lt;/a&gt;&lt;/h3&gt;&lt;h4 id=&quot;flow-design&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#flow-design&quot; class=&quot;heading-anchor&quot;&gt;Flow Design&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;After the dlt pipeline is working, it’s time to wrap it in a Prefect task and flow. Keep the orchestration layer simple—use a single &lt;code&gt;extract_and_load()&lt;/code&gt; flow for all data ingestion tasks. 
With Prefect deployments handling the pipeline name and arguments, you can set everything up with just a few lines of YAML.&lt;/p&gt;&lt;h4 id=&quot;handling-pipeline-secrets&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#handling-pipeline-secrets&quot; class=&quot;heading-anchor&quot;&gt;Handling Pipeline Secrets&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;Secrets should be passed through a dedicated dictionary parameter, such as &lt;code&gt;secrets&lt;/code&gt;. These secrets should then be extracted from Prefect blocks and forwarded to the dlt pipeline, ensuring they are securely handled.&lt;/p&gt;&lt;h3 id=&quot;deploying-to-production&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#deploying-to-production&quot; class=&quot;heading-anchor&quot;&gt;Deploying to Production&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;A pull request with the new deployment should automatically trigger the CI/CD process in our project repository. We will soon dive deeper into how to implement this process using GitHub Actions in a separate article, so stay tuned!&lt;/p&gt;&lt;h2 id=&quot;summary&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#summary&quot; class=&quot;heading-anchor&quot;&gt;Summary&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Building a modern, scalable data platform starts with mastering data ingestion, which requires tools that are as powerful as they are flexible. 
By combining dlt for efficient, open-source data pipelines with Prefect for orchestration, you can create workflows that are not only production-ready but also streamlined for both developers and data teams.&lt;/p&gt;&lt;p&gt;This approach ensures flexibility, scalability, and cost-effectiveness, making it ideal for modern data platforms while also strategically positioning your platform to excel in the upcoming AI age.&lt;/p&gt;&lt;h2 id=&quot;next-steps&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#next-steps&quot; class=&quot;heading-anchor&quot;&gt;Next steps&lt;/a&gt;&lt;/h2&gt;&lt;h3 id=&quot;data-transformation&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#data-transformation&quot; class=&quot;heading-anchor&quot;&gt;Data Transformation&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;dlt and Prefect (with the help of &lt;a href=&quot;https://www.getdbt.com/&quot; rel=&quot;noopener&quot;&gt;dbt&lt;/a&gt;) are just as good at data transformation as they are at data ingestion. Stay tuned as we explore how to integrate these tools for data transformation in a future article!&lt;/p&gt;&lt;h3 id=&quot;ready-to-dive-deeper&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#ready-to-dive-deeper&quot; class=&quot;heading-anchor&quot;&gt;Ready to Dive Deeper?&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;If you’re ready to build a cutting-edge data platform with dlt and Prefect, &lt;a href=&quot;https://meetings-eu1.hubspot.com/alessio-civitillo/the-scalable-way?uuid=871f7790-0d02-4f3a-9287-3c0f24b53ba2&quot; rel=&quot;noopener&quot;&gt;get in touch&lt;/a&gt;. We offer expert guidance to help you set up every component and provide a fully equipped template Git repository with production-grade code. 
No fluff—just practical, scalable solutions designed to handle real-world challenges and set your data workflows up for long-term success.&lt;/p&gt;&lt;h2 id=&quot;footnotes&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#footnotes&quot; class=&quot;heading-anchor&quot;&gt;Footnotes&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;[1] While more and more UI-based tools add copilot capabilities, they face several fundamental limitations:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;Copilots, while text-based, are limited by the UI tools they are built upon.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Imagine instructing someone to build a complex LEGO castle with only a basic set of blocks. No matter how clearly you explain, the result will always be limited, forcing you to find workarounds.&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;These UI tools often use a custom language to define data pipelines, which adds another layer of complexity.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;As the quality of LLMs is highly reliant on the size and quality of the dataset they’re learning from, it means these assistants cannot reach the same level of fluency as LLMs trained on much more popular languages, such as Python.&lt;/p&gt;&lt;p&gt;Imagine the person you’re instructing to build your LEGO castle has very little experience with LEGO or construction in general. They would struggle to understand basic jargon and construction trade practices, and they would often make mistakes requiring your intervention.&lt;/p&gt;&lt;/div&gt;
 			</content>
    </entry><entry>
      <title>Deploying Prefect on any Cloud Using a Single Virtual Machine</title>
      <link href="https://thescalableway.com/blog/deploying-prefect-on-any-cloud-using-a-single-virtual-machine/" />
      <updated>2025-01-15T09:29:00Z</updated>
      <id>https://thescalableway.com/blog/deploying-prefect-on-any-cloud-using-a-single-virtual-machine/</id>
      <content type="html">
				&lt;nav id=&quot;toc&quot; class=&quot;table-of-contents prose&quot;&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/deploying-prefect-on-any-cloud-using-a-single-virtual-machine/#challenges-with-picking-a-data-platform-architecture&quot;&gt;Challenges With Picking a Data Platform Architecture&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/deploying-prefect-on-any-cloud-using-a-single-virtual-machine/#data-platform-orchestration-the-key-to-seamless-integration&quot;&gt;Data Platform Orchestration: The Key to Seamless Integration&lt;/a&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/deploying-prefect-on-any-cloud-using-a-single-virtual-machine/#data-orchestration-tools&quot;&gt;Data Orchestration Tools&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/deploying-prefect-on-any-cloud-using-a-single-virtual-machine/#what-is-prefect-cloud&quot;&gt;What is Prefect Cloud?&lt;/a&gt;&lt;ol&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/deploying-prefect-on-any-cloud-using-a-single-virtual-machine/#common-struggle-for-prefect-users-deployment&quot;&gt;Common Struggle for Prefect Users: Deployment&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/deploying-prefect-on-any-cloud-using-a-single-virtual-machine/#eternal-dilemma-server-based-or-serverless&quot;&gt;Eternal Dilemma: Server-based or Serverless&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/deploying-prefect-on-any-cloud-using-a-single-virtual-machine/#deployment-options-for-a-server-based-data-platform&quot;&gt;Deployment Options for a Server-based Data Platform&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a 
href=&quot;https://thescalableway.com/blog/deploying-prefect-on-any-cloud-using-a-single-virtual-machine/#recommended-setup-for-getting-started-lightweight-kubernetes-on-a-single-virtual-machine&quot;&gt;Recommended Setup for Getting Started: Lightweight Kubernetes on a Single Virtual Machine&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;flow&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/deploying-prefect-on-any-cloud-using-a-single-virtual-machine/#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/nav&gt;&lt;p&gt;&lt;span id=&quot;toc-skipped&quot; class=&quot;visually-hidden&quot;&gt;&lt;/span&gt;&lt;/p&gt;&lt;div class=&quot;flow prose&quot;&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;Choosing the right data platform architecture is quite a challenge for any organization. It’s a balancing act: you need something that delivers immediate value while staying flexible enough for future growth, all without sacrificing scalability, simplicity, or efficiency.&lt;/p&gt;&lt;p&gt;This article offers a thoughtful guide to the decision-making process behind choosing Prefect with lightweight Kubernetes (K3S) on a single Virtual Machine (VM) with any cloud provider. You’ll explore:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;Why simplicity and flexibility are essential for modern data platforms.&lt;/li&gt;&lt;li&gt;Key considerations for selecting the right data orchestration tool.&lt;/li&gt;&lt;li&gt;Insights into serverless vs server-based execution of Prefect flows.&lt;/li&gt;&lt;li&gt;Approaches to running a server-based Prefect worker&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Rather than a step-by-step tutorial, this guide is designed to help you make solid platform architecture decisions and design a solution tailored to your organization’s unique needs. 
Let’s dive in.&lt;/p&gt;&lt;h2 id=&quot;challenges-with-picking-a-data-platform-architecture&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/deploying-prefect-on-any-cloud-using-a-single-virtual-machine/#challenges-with-picking-a-data-platform-architecture&quot; class=&quot;heading-anchor&quot;&gt;Challenges With Picking a Data Platform Architecture&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;The options for building a data platform are endless, but many fall short. With the rise of affordable cloud storage, expectations have changed, leaving many once-revolutionary legacy systems struggling to keep up. At the same time, new solutions making big claims often fail, either missing critical features or bogging organizations down with unnecessary complexity. For smaller companies, the challenge is even greater—a data platform should drive business value, not require a dedicated team just to maintain it.&lt;/p&gt;&lt;p&gt;Starting small may seem practical, but early shortcuts can turn into major obstacles as the platform grows. Undoing poor architectural choices later is often costly and disruptive. That’s why &lt;strong&gt;choosing a solution that is both simple and scalable from the outset is essential&lt;/strong&gt;.&lt;/p&gt;&lt;p&gt;For decision-makers, this journey begins by stepping back and evaluating both the current state of their team and the platform they rely on. 
The &lt;strong&gt;Data Platform Maturity Curve&lt;/strong&gt; is a helpful framework for this:&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal44&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/GGk32ZeRJa-960.webp 960w, https://thescalableway.com/img/GGk32ZeRJa-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/GGk32ZeRJa-960.jpeg&quot; alt=&quot;data platform maturity&quot; width=&quot;1600&quot; height=&quot;860&quot; srcset=&quot;https://thescalableway.com/img/GGk32ZeRJa-960.jpeg 960w, https://thescalableway.com/img/GGk32ZeRJa-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;44&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/GGk32ZeRJa-960.webp 960w, https://thescalableway.com/img/GGk32ZeRJa-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/GGk32ZeRJa-960.jpeg&quot; alt=&quot;data platform maturity&quot; width=&quot;1600&quot; height=&quot;860&quot; srcset=&quot;https://thescalableway.com/img/GGk32ZeRJa-960.jpeg 960w, https://thescalableway.com/img/GGk32ZeRJa-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;Depending on the organization’s data technology maturity level, your platform must adapt. This article focuses on those in the middle of the curve, where simple scripts and ad-hoc solutions are no longer enough, but advanced features like autoscaling aren’t yet necessary. At this stage, the platform delivers tangible business value and is steadily becoming integral to operations. 
Downtime—whether it lasts hours, a day, or even a week—is growing increasingly expensive.&lt;/p&gt;&lt;p&gt;The goal? A platform that’s lightweight, scalable, and future-ready without overcomplicating things.&lt;/p&gt;&lt;h2 id=&quot;data-platform-orchestration-the-key-to-seamless-integration&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/deploying-prefect-on-any-cloud-using-a-single-virtual-machine/#data-platform-orchestration-the-key-to-seamless-integration&quot; class=&quot;heading-anchor&quot;&gt;Data Platform Orchestration: The Key to Seamless Integration&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Even the best-designed data platform is useless if it’s not integrated. No matter how carefully you choose your architecture, your platform’s success hinges on how well its core components—ingestion, transformation, and serving—work together. These phases can only operate efficiently when they are tightly aligned.&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal45&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/KgwfpAY-HF-960.webp 960w, https://thescalableway.com/img/KgwfpAY-HF-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/KgwfpAY-HF-960.jpeg&quot; alt=&quot;data engineering stages lifecycle&quot; width=&quot;1600&quot; height=&quot;519&quot; srcset=&quot;https://thescalableway.com/img/KgwfpAY-HF-960.jpeg 960w, https://thescalableway.com/img/KgwfpAY-HF-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;45&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/KgwfpAY-HF-960.webp 960w, https://thescalableway.com/img/KgwfpAY-HF-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img 
loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/KgwfpAY-HF-960.jpeg&quot; alt=&quot;data engineering stages lifecycle&quot; width=&quot;1600&quot; height=&quot;519&quot; srcset=&quot;https://thescalableway.com/img/KgwfpAY-HF-960.jpeg 960w, https://thescalableway.com/img/KgwfpAY-HF-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;&lt;em&gt;Adapted from “Fundamentals of Data Engineering: Plan and Build Robust Data Systems” by Joe Reis &amp;amp; Matt Housley&lt;/em&gt;&lt;/p&gt;&lt;p&gt;Early-stage platforms often rely on manual orchestration, which works at first but quickly becomes a bottleneck as data grows and workflows become more complex. Managing workflows, ensuring accuracy, and reducing downtime require a more structured approach.&lt;/p&gt;&lt;p&gt;A few basic improvements can help push the boundaries further. For instance:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;Instead of running all scripts locally, they can be executed on a virtual machine.&lt;/li&gt;&lt;li&gt;Setting up a database helps centralize data.&lt;/li&gt;&lt;li&gt;Basic automation of workflows can be managed with cron jobs in Linux.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;While these incremental improvements help in the short term, significant challenges remain:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Manual code execution&lt;/strong&gt; becomes increasingly error-prone as the scale of operations grows.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Cron jobs&lt;/strong&gt; become difficult to manage as workflows become more complex and interdependent. Debugging failures can quickly turn into a nightmare, especially with cascading issues across multiple flows.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;This is where automated data orchestration becomes the key to streamlining workflows across the entire lifecycle. 
It allows teams to automate, monitor, and scale operations by transforming disconnected processes into a cohesive system, minimizing manual intervention and reducing errors.&lt;/p&gt;&lt;p&gt;Let’s review the most popular options available in the market.&lt;/p&gt;&lt;h3 id=&quot;data-orchestration-tools&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/deploying-prefect-on-any-cloud-using-a-single-virtual-machine/#data-orchestration-tools&quot; class=&quot;heading-anchor&quot;&gt;Data Orchestration Tools&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;The three leading orchestration tools in the market are:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Apache Airflow:&lt;/strong&gt; An open-source and community-driven tool with robust features but a steep learning curve. Managed versions like Google Cloud Composer and Amazon MWAA simplify deployment but tie users to specific cloud providers.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Prefect:&lt;/strong&gt; A modern, cloud-agnostic, and easy-to-configure solution emphasizing scalability, portability, and developer-friendly features that allow for flexible orchestration. Prefect’s architecture also supports running workflows in hybrid environments, seamlessly bridging on-premises and cloud solutions.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Dagster:&lt;/strong&gt; Designed for data-aware orchestration, Dagster prioritizes validation, lineage, and developer productivity, making it ideal for teams handling complex pipelines.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;At The Scalable Way, we have worked with both Airflow and Prefect in a few projects. 
We recommend Prefect for a lightweight setup with fewer deployment concerns.&lt;/p&gt;&lt;h2 id=&quot;what-is-prefect-cloud&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/deploying-prefect-on-any-cloud-using-a-single-virtual-machine/#what-is-prefect-cloud&quot; class=&quot;heading-anchor&quot;&gt;What is Prefect Cloud?&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Prefect Cloud is a fully managed orchestration platform that simplifies running and monitoring Python-based workflows without the overhead of managing infrastructure. It’s well-suited for teams looking to automate data workflows, from ingestion and transformation to serving.&lt;/p&gt;&lt;p&gt;Its strengths include:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Scalability&lt;/strong&gt;: Handles thousands of workflows with ease.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Monitoring and alerting&lt;/strong&gt;: Built-in features simplify issue detection and resolution.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Cloud-agnostic architecture&lt;/strong&gt;: Runs seamlessly across environments, avoiding vendor lock-in.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;By automating the orchestration layer, Prefect Cloud lets teams focus on building robust pipelines rather than maintaining the orchestration backend.&lt;/p&gt;&lt;h3 id=&quot;common-struggle-for-prefect-users-deployment&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/deploying-prefect-on-any-cloud-using-a-single-virtual-machine/#common-struggle-for-prefect-users-deployment&quot; class=&quot;heading-anchor&quot;&gt;Common Struggle for Prefect Users: Deployment&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Adopting Prefect as an orchestrator unlocks many possibilities, but like any powerful tool, it comes with a learning curve. 
Prefect’s flexibility and developer-first approach can initially feel daunting for teams unfamiliar with building solid deployment solutions.&lt;/p&gt;&lt;p&gt;Prefect’s philosophy emphasizes providing tools rather than prescribing solutions, allowing users to adapt its features to their specific needs. While this approach offers flexibility and scalability, it can leave data engineers uncertain about where to start with scalable deployment practices like CI/CD pipelines and autoscaling.&lt;/p&gt;&lt;h3 id=&quot;eternal-dilemma-server-based-or-serverless&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/deploying-prefect-on-any-cloud-using-a-single-virtual-machine/#eternal-dilemma-server-based-or-serverless&quot; class=&quot;heading-anchor&quot;&gt;Eternal Dilemma: Server-based or Serverless&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Another consideration is choosing the right setup for running Prefect flows. There are two primary approaches, each designed to cater to different needs:&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Server-based&lt;/strong&gt;: This requires setting up infrastructure such as virtual machines, lightweight Kubernetes (e.g., K3S), or managed Kubernetes clusters. 
While these setups provide maximum control, scalability, and adaptability, they demand a higher level of expertise and upfront effort.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Serverless&lt;/strong&gt;: Managed solutions like Prefect Cloud’s service or serverless compute options from cloud providers (AWS Fargate, Google Cloud Run, Azure Container Instances) eliminate the need for infrastructure management, making them appealing for simpler workflows.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Serverless solutions, though convenient, are best suited for simpler workflows, as they come with five notable challenges:&lt;/p&gt;&lt;ol class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Startup Overhead&lt;/strong&gt;: Prefect Worker images often have heavy dependencies, increasing flow initialization time. This leads to latency, as serverless platforms can introduce delays between task executions due to event-driven triggers. A long-running server with a persistent Prefect Worker is usually much quicker.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vendor Lock-In&lt;/strong&gt;: Serverless solutions are often tightly integrated with specific cloud providers, making it difficult to migrate workflows across platforms. Even Prefect Work Pools, though useful, have limited functionality at the Pro tier.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Cost Management&lt;/strong&gt;: Serverless can be cost-effective for intermittent workloads, but can become expensive with unpredictable usage patterns. Managing costs is trickier compared to traditional server-based setups.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Limited Control and Security Concerns&lt;/strong&gt;: Serverless architectures limit control over the execution environment, as all logic runs on cloud provider-managed machines. 
This raises security risks, especially for companies dealing with sensitive data or operating in highly regulated industries, due to reduced visibility and potential vulnerabilities in shared infrastructure.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Token Management and Data Access Risks&lt;/strong&gt;: Serverless setups require Prefect to hold a token for accessing cloud resources, creating security risks if mismanaged. Server-based setups mitigate this by reversing the data flow: the server pulls from Prefect, which reduces the risk of data breaches or unintended data exposure.&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;Ultimately, the choice between server-based and serverless depends on the team’s needs and stage of data maturity. However, for most organizations aiming to scale, a Prefect Work Pool running on a long-running server is a more efficient and reliable solution.&lt;/p&gt;&lt;h3 id=&quot;deployment-options-for-a-server-based-data-platform&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/deploying-prefect-on-any-cloud-using-a-single-virtual-machine/#deployment-options-for-a-server-based-data-platform&quot; class=&quot;heading-anchor&quot;&gt;Deployment Options for a Server-based Data Platform&lt;/a&gt;&lt;/h3&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Local Prefect Worker Process&lt;/strong&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Connects directly to Prefect Cloud and serves as an introductory setup to understand Prefect Cloud’s functionality. However, this is not suitable for production scenarios due to limited scalability and resilience.&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Systemd Process on Single or Multiple VMs&lt;/strong&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Runs Prefect flows in Docker containers, providing a lightweight setup that is relatively easy to configure. 
This approach is well-suited to small projects and teams, as Docker limits unnecessary complexity.&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Single VM with Lightweight Kubernetes (K3S)&lt;/strong&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;This option is more complex than a Systemd setup because it introduces Kubernetes and Helm, but these tools make it more scalable and adaptable for future growth. This setup offers flexibility for migration to more robust configurations as project demands increase.&lt;/p&gt;&lt;ul class=&quot;list&quot;&gt;&lt;li&gt;&lt;strong&gt;Managed Kubernetes Cluster&lt;/strong&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The most feature-rich solution: managed Kubernetes supports autoscaling, spot instances, and integrations with tools like Active Directory. It is ideal for comprehensive data platforms. However, this approach adds operational complexity and may be excessive for smaller projects.&lt;/p&gt;&lt;h2 id=&quot;recommended-setup-for-getting-started-lightweight-kubernetes-on-a-single-virtual-machine&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/deploying-prefect-on-any-cloud-using-a-single-virtual-machine/#recommended-setup-for-getting-started-lightweight-kubernetes-on-a-single-virtual-machine&quot; class=&quot;heading-anchor&quot;&gt;Recommended Setup for Getting Started: Lightweight Kubernetes on a Single Virtual Machine&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;The lightweight Kubernetes on a single Virtual Machine (VM) setup strikes an ideal balance between cost efficiency and operational flexibility. By leveraging lightweight Kubernetes (K3S), you gain the core benefits of Kubernetes with significantly reduced overhead, making it perfect for smaller environments or projects with constrained resources. Its streamlined architecture ensures smooth operations without the complexity of managing a full Kubernetes cluster. 
The diagram illustrates a basic architecture that effectively meets most requirements for running Prefect flows in a scalable manner.&lt;/p&gt;&lt;p&gt;&lt;is-land on:idle&gt;&lt;/is-land&gt;&lt;/p&gt;&lt;dialog class=&quot;flow modal46&quot;&gt;&lt;button autofocus class=&quot;button&quot;&gt;Close&lt;/button&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/Q4en1un5bs-960.webp 960w, https://thescalableway.com/img/Q4en1un5bs-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/Q4en1un5bs-960.jpeg&quot; alt=&quot;lightweight data platform setup&quot; width=&quot;1600&quot; height=&quot;1033&quot; srcset=&quot;https://thescalableway.com/img/Q4en1un5bs-960.jpeg 960w, https://thescalableway.com/img/Q4en1un5bs-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/dialog&gt;&lt;button data-index=&quot;46&quot;&gt;&lt;picture&gt;&lt;source type=&quot;image/webp&quot; srcset=&quot;https://thescalableway.com/img/Q4en1un5bs-960.webp 960w, https://thescalableway.com/img/Q4en1un5bs-1600.webp 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://thescalableway.com/img/Q4en1un5bs-960.jpeg&quot; alt=&quot;lightweight data platform setup&quot; width=&quot;1600&quot; height=&quot;1033&quot; srcset=&quot;https://thescalableway.com/img/Q4en1un5bs-960.jpeg 960w, https://thescalableway.com/img/Q4en1un5bs-1600.jpeg 1600w&quot; sizes=&quot;auto&quot;&gt;&lt;/picture&gt;&lt;/button&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;Using Helm charts to deploy the Prefect Worker simplifies orchestration, ensuring seamless integration with existing systems while minimizing manual configurations. 
Helm also makes updates easier, promotes standardization, and reduces deployment errors.&lt;/p&gt;&lt;p&gt;Running everything on a single virtual machine keeps the infrastructure simple yet scalable. If project demands grow, you can easily upgrade the VM or expand to a multi-node cluster without major changes to your architecture. Additionally, this setup simplifies maintenance, provides clear monitoring and debugging paths, and avoids vendor lock-in, preserving flexibility for future enhancements.&lt;/p&gt;&lt;h2 id=&quot;conclusion&quot;&gt;&lt;a href=&quot;https://thescalableway.com/blog/deploying-prefect-on-any-cloud-using-a-single-virtual-machine/#conclusion&quot; class=&quot;heading-anchor&quot;&gt;Conclusion&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Building a modern data platform is no easy task. Success lies in keeping it simple while ensuring flexibility and scalability. With the right tools and setup, like Prefect and lightweight Kubernetes on a single virtual machine, you can create a platform that delivers immediate value and adapts as your needs grow.&lt;/p&gt;&lt;p&gt;By focusing on scalable, modular solutions, you’re not just solving today’s problems—you’re building a platform ready for whatever comes next.&lt;/p&gt;&lt;/div&gt;
 			</content>
    </entry></feed>
