Strangler Fig Pattern: Migrate Legacy Systems
TL;DR
The Strangler Fig pattern enables incremental legacy system modernization by gradually replacing old functionality with new services while keeping the system running. A routing layer intercepts requests and directs them to either the legacy system or the new implementation based on which features have been migrated. This approach avoids the risk of a big-bang rewrite while allowing continuous feature development and revenue generation.
Cheat Sheet: Legacy modernization strategy → Route requests through facade → Gradually replace old with new → Decommission legacy when complete. Used by: Shopify (Rails monolith → services), Netflix (datacenter → AWS), GitHub (MySQL → Spanner).
The Analogy
Imagine renovating a busy restaurant while it’s still serving customers. You can’t close for 18 months to rebuild everything—you’d lose customers and revenue. Instead, you renovate one section at a time: first the kitchen, then the dining room, then the bar. Customers keep eating throughout the process, and you route them to renovated sections as they’re completed. Eventually, the entire restaurant is modernized without ever closing. The Strangler Fig pattern applies this same incremental approach to software systems, keeping the business running while you modernize underneath.
Why This Matters in Interviews
This pattern comes up in interviews about legacy modernization, microservices migration, and technical debt management. Interviewers want to see that you understand the business risk of big-bang rewrites and can design pragmatic migration strategies. Strong candidates discuss routing strategies, feature parity validation, data synchronization challenges, and rollback mechanisms. The pattern demonstrates architectural maturity—knowing when gradual evolution beats revolutionary change.
Core Concept
The Strangler Fig pattern, named after a tropical plant that grows around a host tree and eventually replaces it, is a migration strategy for modernizing legacy systems incrementally. Instead of attempting a risky big-bang rewrite that halts feature development for months or years, you build new functionality in a modern stack while the legacy system continues serving production traffic. A routing layer (often called a facade or proxy) sits in front of both systems and directs requests to the appropriate implementation based on which features have been migrated.
Martin Fowler coined the term in 2004 after observing how strangler figs gradually replace their host trees in nature. In software, this translates to a three-phase process: transform (build new capability), coexist (run old and new in parallel), and eliminate (decommission legacy). The key insight is that you never have a “migration day”—instead, you continuously shift traffic from old to new as each piece is ready, validating functionality and maintaining business continuity throughout.
This approach solves the fundamental problem of legacy modernization: how do you upgrade a system that’s generating revenue and can’t afford downtime? Companies like Shopify, Netflix, and GitHub have used variations of this pattern to migrate from monoliths to microservices, from on-premise to cloud, and from old databases to new ones—all while shipping features and growing their businesses. The pattern’s power lies in its risk reduction through incremental change and continuous validation.
How It Works
Step 1: Establish the Routing Layer
Deploy a facade (reverse proxy, API gateway, or application-level router) in front of your legacy system. Initially, this layer passes 100% of traffic through to the legacy system unchanged. This step validates that the routing layer itself doesn’t introduce problems. At Netflix during their AWS migration, they used Zuul as this routing layer, initially just proxying all requests to their datacenter systems. The routing layer becomes your control plane for the entire migration.
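The routing decision at the heart of this step can be sketched in a few lines. This is a minimal illustration, not a production proxy: the prefix list and backend URLs are hypothetical, and in a real deployment this logic usually lives in NGINX, Envoy, or API gateway configuration rather than application code.

```python
# Minimal facade routing sketch. MIGRATED_PREFIXES and the backend URLs
# are hypothetical placeholders, not from any real system.

MIGRATED_PREFIXES = ("/api/giftcards", "/api/products")

LEGACY_BACKEND = "http://legacy.internal:8080"
NEW_BACKEND = "http://new-service.internal:9090"

def choose_backend(path: str) -> str:
    """Send migrated path prefixes to the new service; everything else to legacy."""
    if any(path.startswith(prefix) for prefix in MIGRATED_PREFIXES):
        return NEW_BACKEND
    return LEGACY_BACKEND
```

As features migrate, the facade changes amount to adding a prefix to the list; nothing about the clients or the legacy system has to change.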
Step 2: Identify Migration Candidates
Analyze your legacy system to identify discrete features or modules to migrate first. Prioritize based on business value, technical risk, and dependencies. Good first candidates are often read-heavy features with clear boundaries (like product catalogs or user profiles) rather than complex transactional workflows. Shopify started their microservices migration with their gift card service—a bounded domain with clear inputs and outputs. Poor candidates include features with deep coupling to legacy data models or critical paths where any failure is catastrophic.
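Teams often make this prioritization concrete with a simple scoring heuristic. The weights and 1-5 scales below are illustrative assumptions, not an established formula; the point is that clear boundaries and low blast radius should outweigh raw business value for early candidates.

```python
def candidate_score(business_value, boundary_clarity, coupling, failure_cost):
    """Illustrative heuristic (assumed weights, 1-5 scales): higher score
    means migrate sooner. Reward value and clean boundaries; penalize
    coupling to legacy data models and critical-path blast radius."""
    return (2 * business_value + 2 * boundary_clarity) - (coupling + failure_cost)

# Hypothetical candidates scored on 1-5 scales.
candidates = {
    "gift_cards":   candidate_score(3, 5, 1, 2),  # clear boundary, low risk
    "checkout":     candidate_score(5, 2, 5, 5),  # valuable but coupled and critical
    "user_profile": candidate_score(3, 4, 2, 2),
}
best = max(candidates, key=candidates.get)
```

Under these assumed weights, the gift-card-style service outranks checkout despite lower business value, matching the guidance above.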
Step 3: Build New Implementation
Develop the new service or module in your target architecture (microservice, serverless function, modern framework). Crucially, build it to be feature-complete with the legacy functionality it’s replacing, including edge cases and error handling. This isn’t a greenfield rewrite—it’s a faithful reimplementation. Include comprehensive testing that validates parity with legacy behavior. GitHub spent months ensuring their new Spanner-based system matched MySQL behavior exactly before routing any production traffic.
Step 4: Implement Dual-Write or Data Sync
If the feature involves data writes, establish a mechanism to keep legacy and new datastores synchronized during the coexistence phase. This might be dual-writing (application writes to both systems), change data capture (CDC streaming from legacy to new), or event-based synchronization. The goal is to ensure the new system has all the data it needs to serve requests. Stripe used Kafka-based CDC to stream data from their monolithic PostgreSQL database to new microservice databases during their decomposition.
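A dual-write layer can be sketched as follows, assuming one common policy (not the only option): the legacy store remains the source of truth, and mirror failures are queued for reconciliation rather than failing the request. Plain dicts stand in for real database clients.

```python
import logging

log = logging.getLogger("migration.dualwrite")

class DualWriter:
    """Dual-write sketch: the legacy store stays the source of truth.

    Assumed policy: a legacy write failure fails the request; a mirror
    failure to the new store is logged and queued for reconciliation.
    The store objects are stand-ins for real database clients.
    """

    def __init__(self, legacy_store, new_store):
        self.legacy = legacy_store
        self.new = new_store
        self.failed_mirrors = []  # keys whose mirror write must be retried

    def write(self, key, value):
        self.legacy[key] = value      # source of truth; exceptions propagate
        try:
            self.new[key] = value     # best-effort mirror
        except Exception as exc:
            log.warning("mirror write failed for %r: %s", key, exc)
            self.failed_mirrors.append(key)

    def reconcile(self):
        """Retry mirror writes that failed earlier (run periodically)."""
        for key in list(self.failed_mirrors):
            self.new[key] = self.legacy[key]
            self.failed_mirrors.remove(key)
```

Real systems add idempotency keys and out-of-band consistency checks on top of this skeleton; the essential shape is the asymmetry between the two writes.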
Step 5: Route Traffic Incrementally
Update the routing layer to send a small percentage of traffic (1-5%) to the new implementation while monitoring for errors, latency differences, and functional discrepancies. Use feature flags or routing rules to control which requests go where—you might route by user ID, geography, or customer tier. Gradually increase the percentage as confidence grows: 5% → 10% → 25% → 50% → 100%. This gradual rollout provides continuous validation and easy rollback if issues arise. Netflix spent months at each percentage level during their AWS migration, validating performance and reliability.
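Percentage rollout by user ID is usually implemented by hashing a stable key into a fixed bucket, so each user sticks with one implementation as the percentage grows. A minimal sketch:

```python
import hashlib

def route_to_new(user_id: str, rollout_percent: int) -> bool:
    """Deterministic percentage rollout: hash the user ID into a 0-99 bucket.

    The bucket is stable per user, so raising rollout_percent only ever adds
    users to the new system; nobody flaps between implementations. A real
    hash (not Python's per-process randomized hash()) keeps buckets stable
    across processes and restarts.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent
```

Stepping through 5% → 10% → 25% → 50% → 100% then reduces to changing one number in a feature-flag store.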
Step 6: Validate and Monitor
Compare responses from legacy and new systems to detect behavioral differences. Some teams run both systems in parallel and compare outputs (shadow mode) before actually routing production traffic. Monitor key metrics: latency, error rates, data consistency, and business KPIs. Set up alerts for discrepancies. This validation phase is critical—you’re proving the new system is truly equivalent to the old one. GitHub ran MySQL and Spanner in parallel for months, comparing query results to ensure correctness.
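Shadow-mode comparison can be sketched as a wrapper that always returns the legacy response while recording any divergence from the new system. The serve_legacy, serve_new, and record_mismatch callables are hypothetical stand-ins for real service calls and a metrics sink.

```python
def shadow_compare(request, serve_legacy, serve_new, record_mismatch):
    """Shadow-mode sketch: legacy answers the user; the new system is compared.

    Errors in the shadowed system must never leak into the user-facing
    response; they are recorded as mismatches instead.
    """
    legacy_resp = serve_legacy(request)
    try:
        new_resp = serve_new(request)
        if new_resp != legacy_resp:
            record_mismatch(request, legacy_resp, new_resp)
    except Exception as exc:  # shadow failures are data, not outages
        record_mismatch(request, legacy_resp, f"shadow error: {exc}")
    return legacy_resp
```

In practice the shadow call is issued asynchronously and the comparison normalizes fields that legitimately differ (timestamps, request IDs) before diffing.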
Step 7: Decommission Legacy
Once 100% of traffic flows to the new implementation and you’ve validated it over time (weeks or months), begin decommissioning the legacy code and infrastructure. Remove the routing logic for that feature, delete old code, and shut down unused services. This is the “strangling” phase—the legacy system gradually shrinks as more features migrate. Eventually, the entire legacy system is replaced and can be fully decommissioned. Shopify took years to complete their migration, but each step reduced their monolith’s footprint and technical debt.
Step 8: Repeat
Move to the next feature or module and repeat the process. Each migration builds organizational muscle and improves your tooling. Early migrations are slow and cautious; later ones move faster as patterns emerge and confidence grows. The key is maintaining momentum—regular, incremental progress rather than sporadic big pushes.
Strangler Fig Migration Flow: Three-Phase Process
graph TB
subgraph Phase 1: Transform
A["Legacy System<br/><i>Monolith</i>"]
B["Routing Layer<br/><i>API Gateway</i>"]
C["New Service<br/><i>Microservice</i>"]
B --"100% traffic"--> A
C -."Built but not serving traffic".-> C
end
subgraph Phase 2: Coexist
D["Legacy System<br/><i>Monolith</i>"]
E["Routing Layer<br/><i>API Gateway</i>"]
F["New Service<br/><i>Microservice</i>"]
G[("Legacy DB")]
H[("New DB")]
E --"1. 80% traffic"--> D
E --"2. 20% traffic"--> F
D --"3. Dual-write"--> G
D --"4. Dual-write"--> H
F --"5. Read/Write"--> H
end
subgraph Phase 3: Eliminate
I["Routing Layer<br/><i>API Gateway</i>"]
J["New Service<br/><i>Microservice</i>"]
K[("New DB")]
L["Legacy System<br/><i>Decommissioned</i>"]
I --"100% traffic"--> J
J --"Read/Write"--> K
L -."Shut down".-> L
end
The strangler fig pattern progresses through three distinct phases: Transform (build new capability), Coexist (run both systems with gradual traffic shift and data synchronization), and Eliminate (decommission legacy). Notice how traffic gradually shifts from 100% legacy to 100% new, with dual-write ensuring data consistency during coexistence.
Gradual Traffic Rollout with Monitoring
graph LR
Client["Client Requests"]
Router["Routing Layer<br/><i>Feature Flags</i>"]
Legacy["Legacy System<br/><i>Monolith</i>"]
New["New Service<br/><i>Microservice</i>"]
Monitor["Monitoring<br/><i>Metrics Comparison</i>"]
Client --"1. Incoming request"--> Router
Router --"2a. 95% traffic<br/>(Week 1)"--> Legacy
Router --"2b. 5% traffic<br/>(Week 1)"--> New
Legacy --"3. Response + metrics"--> Monitor
New --"4. Response + metrics"--> Monitor
Monitor -."5. Compare:<br/>- Error rates<br/>- Latency p50/p99<br/>- Business metrics".-> Monitor
Router --"6. Increase to 10%<br/>(Week 2)"--> New
Router --"7. Increase to 50%<br/>(Week 4)"--> New
Router --"8. Increase to 100%<br/>(Week 8)"--> New
Traffic rollout happens incrementally with continuous monitoring at each stage. The routing layer uses feature flags or percentage-based rules to control traffic distribution, starting at 5% and gradually increasing to 100% over weeks. Monitoring compares error rates, latency, and business metrics between legacy and new systems to validate behavior before increasing traffic.
Key Principles
Incremental Migration Over Big-Bang Rewrites
The fundamental principle is that gradual change is safer and more sustainable than revolutionary change. Big-bang rewrites typically take 2-3x longer than estimated, halt feature development, and risk catastrophic failure if the new system doesn’t work. Incremental migration allows continuous validation, easy rollback, and ongoing feature delivery. When Netscape attempted a complete rewrite of their browser in the late 1990s, they lost market share to Internet Explorer and never recovered. In contrast, when Shopify migrated from a Rails monolith to microservices over five years, they grew from $100M to $1B+ in revenue during the migration because they never stopped shipping features.
Feature Parity Before Traffic Shift
Never route production traffic to a new implementation until it achieves functional parity with the legacy system, including edge cases and error handling. This seems obvious but is frequently violated in practice. Teams get excited about the new system’s capabilities and route traffic before it’s truly ready, leading to production incidents. The discipline is to resist the urge to “just try it” and instead thoroughly validate behavior. GitHub’s migration to Spanner took over a year of parallel running and validation before they routed any production writes, because they understood that data correctness was non-negotiable. Feature parity includes performance characteristics too—if the legacy system handles 10,000 requests per second, the new one must match or exceed that before taking traffic.
Maintain Dual-Write Consistency
During the coexistence phase when both systems are active, data consistency between legacy and new datastores is critical. Inconsistent data leads to confusing user experiences and difficult-to-debug issues. Implement robust synchronization mechanisms with monitoring and alerting. Understand the consistency model you’re providing—eventual consistency is often acceptable, but you need to know the lag and have visibility into sync failures. Stripe’s migration involved complex dual-write logic where writes went to both the monolith’s PostgreSQL and new microservice databases, with reconciliation jobs detecting and fixing inconsistencies. They invested heavily in tooling to monitor sync lag and data divergence.
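Reconciliation jobs of this kind reduce to a diff between snapshots of the two stores. A toy version, using dicts keyed by primary key as a stand-in for paging through real databases:

```python
def find_divergence(legacy_rows, new_rows):
    """Toy reconciliation check: diff two store snapshots keyed by primary key.

    Returns the keys a repair pass would need to touch. Real jobs page
    through both databases and compare in batches.
    """
    legacy_keys, new_keys = set(legacy_rows), set(new_rows)
    return {
        "missing_in_new": legacy_keys - new_keys,
        "extra_in_new": new_keys - legacy_keys,
        "mismatched": {k for k in legacy_keys & new_keys
                       if legacy_rows[k] != new_rows[k]},
    }
```

Alerting on the size of these three sets over time gives the visibility into sync failures that the coexistence phase demands.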
Gradual Traffic Shifting with Observability
Route traffic incrementally (1% → 5% → 10% → 25% → 50% → 100%) with comprehensive monitoring at each stage. This gradual approach provides continuous validation and limits blast radius if problems occur. The observability is as important as the routing—you need metrics, logs, and traces that let you compare legacy and new system behavior. Netflix’s migration to AWS involved sophisticated traffic shaping where they could route specific customer cohorts or request types to validate different scenarios. They maintained detailed dashboards showing error rates, latency percentiles, and business metrics for both systems side-by-side.
Reversibility and Rollback Capability
Design every migration step to be reversible. If the new system exhibits problems, you should be able to route traffic back to the legacy system within seconds or minutes. This requires keeping the legacy system operational and data-synchronized even after routing significant traffic to the new system. The ability to roll back quickly reduces migration risk and allows aggressive traffic shifting—if you can undo a change instantly, you can be bolder about trying it. GitHub maintained the ability to fail back to MySQL for months after routing 100% of traffic to Spanner, only decommissioning MySQL after extended validation. This safety net allowed them to move confidently.
Deep Dive
Types / Variants
Proxy-Based Strangler (Infrastructure Layer)
In this variant, a reverse proxy or API gateway sits at the network edge and routes requests based on URL patterns, headers, or other request attributes. This is the most common implementation because it’s technology-agnostic—the routing layer doesn’t need to understand application logic. Tools like NGINX, Envoy, or cloud-native API gateways (AWS API Gateway, Google Cloud Load Balancer) handle the routing.
When to use: When you’re migrating between different technology stacks or want to keep routing logic separate from application code.
Pros: Clean separation of concerns, easy to configure and monitor, works across any backend technology.
Cons: Limited to request-level routing (can’t route based on business logic), requires network-level deployment.
Example: A retail company migrating from a PHP monolith to Node.js microservices uses NGINX to route /api/products/* to the new service while routing everything else to the legacy system.
Application-Level Strangler (Code Layer)
The routing logic lives within the application code itself, often using feature flags or conditional logic to decide whether to call legacy code or new code. This approach is common when migrating within the same codebase—for example, refactoring a monolith’s internals or replacing specific modules.
When to use: When you’re modernizing within the same application boundary or need routing decisions based on business logic (user attributes, A/B test groups, data characteristics).
Pros: Fine-grained control over routing, can make decisions based on application state, no additional infrastructure.
Cons: Routing logic couples with application code, harder to manage across distributed teams, can create complex conditional logic.
Example: An e-commerce platform uses feature flags to route 10% of checkout requests to a new payment processing module while 90% still use the legacy code path, with routing decisions based on user ID hashing.
Event-Driven Strangler (Async Layer)
Instead of synchronous request routing, this variant uses event streams to gradually migrate functionality. The legacy system publishes events to a message bus (Kafka, RabbitMQ, AWS EventBridge), and new services consume those events to build their own state and handle processing. Over time, more consumers shift from legacy to new implementations.
When to use: When migrating event-driven or asynchronous systems, or when you want to avoid tight coupling between legacy and new systems.
Pros: Loose coupling, natural fit for event-driven architectures, enables parallel processing.
Cons: Eventual consistency challenges, complex event schema evolution, harder to reason about system state.
Example: A logistics company migrates their order processing system by having both legacy and new services consume order events from Kafka, with the new service gradually taking over order fulfillment responsibilities.
Database-Level Strangler (Data Layer)
This variant focuses on migrating data storage incrementally, often using change data capture (CDC) to keep legacy and new databases synchronized while applications gradually shift to the new datastore. The routing happens at the data access layer—reads and writes gradually shift from old to new database.
When to use: When the primary challenge is data migration rather than application logic, or when modernizing database technology (SQL to NoSQL, on-premise to cloud).
Pros: Focuses on the often-hardest part of migration (data), enables independent application and data migration.
Cons: Complex data synchronization, potential for data inconsistency, requires careful transaction handling.
Example: A SaaS company migrates from PostgreSQL to DynamoDB by using AWS DMS to continuously replicate data, while application code gradually shifts reads to DynamoDB and eventually writes, validating data consistency throughout.
Branch by Abstraction (Code Refactoring)
A variant where you introduce an abstraction layer in the codebase that can delegate to either old or new implementations. This is technically a strangler pattern applied at the code level. You create an interface, implement it with the legacy code, deploy that, then add a new implementation and gradually switch callers to use it.
When to use: When refactoring within a single codebase or replacing specific libraries/frameworks.
Pros: Type-safe routing (compiler helps), clear abstraction boundaries, good for library/framework migrations.
Cons: Requires code changes throughout the application, can be tedious for large codebases.
Example: A team migrating from an old ORM to a new one creates a data access interface, wraps the old ORM in that interface, then gradually implements the interface with the new ORM while switching callers one module at a time.
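The ORM-migration example above can be sketched as follows: an abstract repository interface, the legacy implementation wrapped behind it, and the new implementation added alongside. All class and method names here are illustrative, with plain dicts standing in for ORM sessions.

```python
from abc import ABC, abstractmethod

class UserRepository(ABC):
    """The abstraction seam: callers depend on this, never on an ORM directly."""
    @abstractmethod
    def find(self, user_id: str) -> dict: ...

class LegacyOrmRepository(UserRepository):
    """Step 1: wrap the existing data access behind the interface and ship it."""
    def __init__(self, legacy_session):
        self.session = legacy_session  # stand-in for the old ORM session
    def find(self, user_id):
        return self.session[user_id]

class NewOrmRepository(UserRepository):
    """Step 2: add the new implementation; switch callers one module at a time."""
    def __init__(self, new_session):
        self.session = new_session  # stand-in for the new ORM session
    def find(self, user_id):
        return self.session[user_id]
```

Once every caller depends on UserRepository, swapping implementations is a construction-time decision, and deleting LegacyOrmRepository at the end is the code-level "strangling".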
Strangler Fig Pattern Variants: Four Implementation Approaches
graph TB
subgraph Proxy-Based Strangler
P1["Client"]
P2["Reverse Proxy<br/><i>NGINX/Envoy</i>"]
P3["Legacy<br/><i>PHP</i>"]
P4["New Service<br/><i>Node.js</i>"]
P1 --"HTTP Request"--> P2
P2 --"/api/old/*"--> P3
P2 --"/api/new/*"--> P4
end
subgraph Application-Level Strangler
A1["Client"]
A2["Application<br/><i>Feature Flags</i>"]
A3["Legacy Code<br/><i>Old Module</i>"]
A4["New Code<br/><i>New Module</i>"]
A1 --"Request"--> A2
A2 --"if flag=old"--> A3
A2 --"if flag=new"--> A4
end
subgraph Event-Driven Strangler
E1["Event Producer"]
E2["Message Bus<br/><i>Kafka</i>"]
E3["Legacy Consumer"]
E4["New Consumer"]
E1 --"Publish event"--> E2
E2 --"Subscribe"--> E3
E2 --"Subscribe"--> E4
end
subgraph Database-Level Strangler
D1["Application"]
D2[("Legacy DB<br/><i>MySQL</i>")]
D3["CDC Pipeline<br/><i>Debezium</i>"]
D4[("New DB<br/><i>DynamoDB</i>")]
D1 --"Write"--> D2
D2 --"Stream changes"--> D3
D3 --"Replicate"--> D4
D1 -."Gradually shift reads".-> D4
end
Four common strangler fig variants address different migration scenarios. Proxy-based routing works at the network layer for cross-stack migrations. Application-level routing uses feature flags for in-process decisions. Event-driven strangling decouples systems through message buses. Database-level strangling uses CDC to migrate data storage incrementally. Choose based on your migration context and constraints.
Trade-offs
Gradual Migration vs. Big-Bang Rewrite
Gradual migration (Strangler Fig) means slower time-to-completion but continuous business value and lower risk. You might spend 2-3 years migrating incrementally, but you ship features throughout and can stop/pivot if priorities change. Big-bang rewrite means faster theoretical completion (12-18 months) but halted feature development, all-or-nothing risk, and frequent timeline overruns.
Decision framework: Choose gradual migration when the system is revenue-generating and can’t afford feature freeze, when requirements are evolving, or when the team lacks confidence in estimating the full rewrite scope. Choose big-bang only when the legacy system is truly unsalvageable (security vulnerabilities, unmaintainable code, no one understands it) and the business can afford the risk and feature freeze. In practice, gradual migration is almost always the right choice for production systems.
Proxy-Based vs. Application-Level Routing
Proxy-based routing keeps routing logic separate from application code, making it easier to manage and monitor, but limits routing decisions to request attributes (URL, headers). Application-level routing enables fine-grained decisions based on business logic (user attributes, data characteristics, A/B test groups) but couples routing with application code and can create complex conditionals.
Decision framework: Use proxy-based routing when migrating between different technology stacks, when routing rules are simple (URL-based), or when you want centralized traffic management. Use application-level routing when you need business-logic-based decisions, when migrating within the same codebase, or when routing rules are complex and change frequently. Many systems use both—proxy for coarse-grained routing (service-level) and application-level for fine-grained routing (feature-level).
Dual-Write vs. Change Data Capture (CDC)
Dual-write means the application writes to both legacy and new datastores, ensuring immediate consistency but requiring application changes and careful transaction handling. CDC means capturing changes from the legacy database and streaming them to the new datastore, avoiding application changes but introducing replication lag and eventual consistency.
Decision framework: Use dual-write when you need strong consistency, when replication lag is unacceptable, or when the application already controls write logic. Use CDC when you can tolerate eventual consistency (seconds to minutes lag), when you want to avoid application changes, or when the legacy system is difficult to modify. Consider hybrid approaches—dual-write for critical paths, CDC for bulk data.
Feature-by-Feature vs. Service-by-Service Migration
Feature-by-feature migration means extracting individual features (user authentication, product search, checkout) one at a time, which provides clear business value but may require complex data synchronization if features share data. Service-by-service migration means extracting entire bounded contexts (user service, product service, order service), which provides cleaner boundaries but may require migrating multiple features simultaneously.
Decision framework: Use feature-by-feature when business priorities are clear and you want to deliver value incrementally, when features are relatively independent, or when the team is small. Use service-by-service when you have clear domain boundaries, when you’re building a microservices architecture, or when you have multiple teams that can work in parallel.
Shadow Mode vs. Direct Traffic Routing
Shadow mode means running the new system in parallel with the legacy system, comparing outputs but not serving production responses, which provides thorough validation but doubles resource usage and doesn’t validate real user impact. Direct traffic routing means actually serving production responses from the new system (starting with small percentages), which validates real behavior but carries more risk.
Decision framework: Use shadow mode first for critical systems where correctness is paramount (financial transactions, healthcare data), when you have resources to run both systems, or when you need to validate complex business logic. Move to direct traffic routing after shadow mode validation, starting with non-critical paths or internal users. Many migrations use shadow mode for initial validation (days to weeks) then shift to direct traffic routing with gradual rollout.
Common Pitfalls
Underestimating Data Migration Complexity
Teams often focus on migrating application logic while treating data migration as an afterthought, only to discover that data synchronization is the hardest part. Legacy databases have accumulated years of data quirks—inconsistent formats, orphaned records, implicit relationships not captured in schemas. When Uber migrated from their monolithic PostgreSQL database to microservice-specific databases, they discovered thousands of data quality issues that had been masked by application logic.
How to avoid: Start with data analysis early—understand data volumes, quality issues, and dependencies. Build data synchronization and validation tooling before migrating application logic. Plan for data cleanup and transformation as part of the migration. Run data consistency checks continuously during the coexistence phase. Budget 40-50% of migration effort for data-related work.
Insufficient Monitoring and Observability
Migrating without comprehensive monitoring is like flying blind—you can’t tell if the new system is working correctly or where problems originate. Teams route traffic to the new system and hope for the best, only discovering issues through user complaints. When GitHub migrated to Spanner, they built extensive monitoring comparing MySQL and Spanner query results, latency distributions, and data consistency. This observability was critical for catching subtle bugs.
How to avoid: Instrument both legacy and new systems with identical metrics before starting migration. Build dashboards comparing key metrics side-by-side (error rates, latency percentiles, throughput). Implement automated alerting for discrepancies. Use distributed tracing to understand request flows across both systems. Consider shadow mode testing where you run both systems and compare outputs. The monitoring investment pays for itself by catching issues early.
Neglecting Rollback Mechanisms
Teams get excited about the new system and decommission the legacy system too quickly, losing the ability to roll back if problems emerge. Or they build routing logic that’s easy to route forward but difficult to route backward. When a major retailer migrated their checkout system, they shut down the legacy system after routing 100% of traffic to the new one, only to discover a critical bug during Black Friday. Without the ability to roll back, they suffered hours of downtime.
How to avoid: Keep the legacy system operational and data-synchronized for weeks or months after routing 100% of traffic. Build routing logic that’s symmetric—as easy to route back as forward. Test rollback procedures regularly during the migration. Define clear criteria for when it’s safe to decommission legacy (time-based, incident-free period, business milestone). Maintain runbooks for emergency rollback scenarios.
Feature Parity Gaps
Teams migrate the “happy path” functionality but miss edge cases, error handling, or rarely-used features that exist in the legacy system. Users encounter unexpected behavior or missing functionality after migration. When Shopify extracted their gift card service, they initially missed several edge cases around partial redemptions and refunds that existed in the monolith, causing customer support issues.
How to avoid: Conduct thorough analysis of legacy system behavior, including error handling and edge cases. Review support tickets and bug reports to identify rarely-used features. Build comprehensive test suites that validate parity, not just core functionality. Run both systems in parallel (shadow mode) and compare outputs for a representative sample of production traffic. Involve domain experts and QA in validation. Accept that 100% parity may be impossible—document known differences and their business impact.
Ignoring Organizational and Process Changes
Technical migration is only half the battle—teams also need to adapt processes, documentation, on-call procedures, and knowledge. Developers who understood the legacy system may struggle with the new architecture. When Netflix migrated to AWS, they had to retrain hundreds of engineers on cloud-native patterns and build new operational procedures.
How to avoid: Invest in training and documentation alongside technical migration. Update runbooks and operational procedures for the new system. Establish clear ownership and on-call responsibilities. Build internal tooling and abstractions that make the new system approachable. Consider pairing experienced and new team members during migration. Celebrate migration milestones to maintain momentum and morale. Recognize that organizational change is as important as technical change.
Data Synchronization Challenges and Solutions
graph TB
subgraph Problem: Data Inconsistency
App1["Application"]
Legacy1[("Legacy DB<br/><i>PostgreSQL</i>")]
New1[("New DB<br/><i>DynamoDB</i>")]
App1 --"1. Write"--> Legacy1
Legacy1 -."2. CDC lag: 30s".-> New1
New1 -."3. Stale read!".-> App1
end
subgraph Solution 1: Dual-Write
App2["Application"]
Legacy2[("Legacy DB")]
New2[("New DB")]
App2 --"1a. Write"--> Legacy2
App2 --"1b. Write"--> New2
App2 --"2. Read from new"--> New2
end
subgraph Solution 2: CDC with Monitoring
App3["Application"]
Legacy3[("Legacy DB")]
CDC["CDC Pipeline<br/><i>Debezium</i>"]
New3[("New DB")]
Monitor["Consistency Checker<br/><i>Reconciliation Job</i>"]
App3 --"1. Write"--> Legacy3
Legacy3 --"2. Stream changes"--> CDC
CDC --"3. Replicate (lag: 5s)"--> New3
Monitor --"4. Compare data"--> Legacy3
Monitor --"5. Compare data"--> New3
Monitor -."6. Alert on discrepancies".-> Monitor
end
subgraph Solution 3: Event Sourcing
App4["Application"]
Events["Event Stream<br/><i>Kafka</i>"]
Legacy4[("Legacy DB")]
New4[("New DB")]
App4 --"1. Publish event"--> Events
Events --"2. Consume"--> Legacy4
Events --"3. Consume"--> New4
end
Data synchronization during coexistence is the hardest strangler fig challenge. The problem shows CDC lag causing stale reads. Solution 1 (dual-write) provides immediate consistency but requires application changes. Solution 2 (CDC with monitoring) avoids app changes but introduces lag—mitigated by consistency checkers that detect and fix discrepancies. Solution 3 (event sourcing) decouples systems but requires event schema management. Choose based on consistency requirements and system constraints.
Math & Calculations
Migration Timeline Estimation
Estimating how long a strangler fig migration will take requires understanding the scope and team capacity. Use this framework:
Formula: Total Time = (Number of Features × Average Feature Migration Time) / (Team Size × Parallel Work Factor)
Variables:
- Number of Features: Count of discrete features or services to migrate (e.g., 50 features)
- Average Feature Migration Time: Weeks per feature including build, test, validate, and stabilize (e.g., 3 weeks)
- Team Size: Number of engineers working on migration (e.g., 5 engineers)
- Parallel Work Factor: How many features can be migrated simultaneously without conflicts (typically 0.5-0.7, accounting for dependencies and shared resources)
Worked Example: A company wants to migrate a monolith with 40 distinct features to microservices. Based on their first few migrations, each feature takes an average of 4 weeks (1 week build, 1 week test, 2 weeks gradual rollout and validation). They have a team of 6 engineers, and estimate they can work on 2-3 features in parallel (Parallel Work Factor = 0.6).
Total Time = (40 features × 4 weeks) / (6 engineers × 0.6) = 160 / 3.6 ≈ 44 weeks ≈ 11 months
This is the technical migration time. Add 20-30% buffer for unexpected issues, data migration complexity, and organizational overhead. Realistic timeline: 13-15 months.
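The timeline formula and buffer above can be expressed as a small helper. The function and its parameter names are illustrative, not from any standard library; the numbers reproduce the worked example (40 features, 4 weeks each, 6 engineers, factor 0.6).

```python
def migration_timeline_weeks(features, weeks_per_feature, team_size,
                             parallel_factor, buffer=0.25):
    """Total Time = (features × weeks_per_feature) / (team_size × parallel_factor),
    plus a risk buffer for data complexity and organizational overhead."""
    base = (features * weeks_per_feature) / (team_size * parallel_factor)
    return base, base * (1 + buffer)

base, buffered = migration_timeline_weeks(40, 4, 6, 0.6)
# base ≈ 44.4 weeks; with a 25% buffer ≈ 55.6 weeks
```

Re-run the estimate after the first few migrations—measured per-feature time is far more reliable than the initial guess.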
Traffic Shift Risk Calculation
When gradually routing traffic, calculate the blast radius at each percentage:
Formula: Additional Failures = Total Requests × Traffic Percentage × (New Failure Rate − Legacy Failure Rate)
Worked Example: A system serves 10 million requests per day. You’re routing 5% of traffic to the new system. If the new system has a 2% error rate (compared to 0.1% in legacy), how many additional failed requests occur?
Affected Requests = 10M × 0.05 × (0.02 - 0.001) = 10M × 0.05 × 0.019 = 9,500 additional failures per day
This calculation helps you decide if the error rate is acceptable or if you need to fix issues before increasing traffic. At 5%, 9,500 failures might be acceptable for a non-critical system. At 50%, that becomes 95,000 failures—likely unacceptable.
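The blast-radius arithmetic is trivial but worth encoding so it can gate rollout automation. A sketch, with the function name being an assumption of this example:

```python
def additional_failures(total_requests, traffic_pct, new_error_rate,
                        legacy_error_rate):
    """Extra failed requests caused by routing traffic_pct of traffic
    to a new system with a higher error rate than legacy."""
    return total_requests * traffic_pct * (new_error_rate - legacy_error_rate)

# Worked example from the text: 10M requests/day, 2% vs 0.1% error rate.
at_5_pct = additional_failures(10_000_000, 0.05, 0.02, 0.001)   # ~9,500/day
at_50_pct = additional_failures(10_000_000, 0.50, 0.02, 0.001)  # ~95,000/day
```

A rollout controller could refuse to increase the traffic percentage whenever this number exceeds an agreed budget.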
Data Synchronization Lag
For CDC-based migrations, calculate acceptable replication lag:
Formula: Max Acceptable Lag = Business Requirement / Safety Factor
Worked Example: An e-commerce system requires that inventory updates appear in the new system within 30 seconds to prevent overselling. With a safety factor of 2× (to account for spikes), the CDC pipeline must maintain:
Max Acceptable Lag = 30 seconds / 2 = 15 seconds average lag
If your CDC pipeline shows 20-second average lag with 40-second p99, you need to optimize before routing write traffic to the new system. This might mean adding CDC capacity, optimizing transformation logic, or accepting eventual consistency with compensating mechanisms (like inventory reservations).
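A simple readiness check built on this formula might look like the sketch below. The p99 criterion (p99 must stay within the raw business requirement) is one reasonable interpretation, not a standard rule—tune it to your own SLOs.

```python
def max_acceptable_lag(business_requirement_s, safety_factor=2.0):
    """Max Acceptable Lag = Business Requirement / Safety Factor."""
    return business_requirement_s / safety_factor

def cdc_ready_for_writes(avg_lag_s, p99_lag_s, business_requirement_s,
                         safety_factor=2.0):
    """Average lag must clear the derived budget, and tail (p99) lag must
    at least stay within the raw business requirement."""
    budget = max_acceptable_lag(business_requirement_s, safety_factor)
    return avg_lag_s <= budget and p99_lag_s <= business_requirement_s
```

With the e-commerce numbers from the text (20s average, 40s p99 against a 30s requirement), the check fails—matching the conclusion that the pipeline needs optimization first.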
Real-World Examples
Shopify: Rails Monolith to Microservices (2015-2020)
Shopify’s migration from a Ruby on Rails monolith to microservices is one of the most well-documented strangler fig implementations. In 2015, their monolith had grown to millions of lines of code and was becoming difficult to scale and maintain. Rather than attempting a rewrite, they used the strangler fig pattern to incrementally extract services over five years. They started with bounded contexts that had clear boundaries—gift cards, tax calculations, and shipping rates. Each extraction followed a consistent pattern: build the new service, implement dual-write to keep data synchronized, route a small percentage of traffic, validate behavior, and gradually increase traffic. Interesting detail: Shopify built sophisticated tooling called “Resiliency” that automatically compared responses between the monolith and new services, catching behavioral differences before they impacted customers. By 2020, they had extracted dozens of services while growing from $100M to over $1B in revenue, proving that you can modernize while scaling. The key to their success was treating migration as a product—with dedicated teams, clear metrics, and continuous investment.
Netflix: Datacenter to AWS (2008-2016)
Netflix’s eight-year migration from their datacenter to AWS is perhaps the most famous strangler fig example at scale. After a major database corruption incident in 2008, they decided to move to the cloud but couldn’t afford to halt feature development during migration. They used a strangler fig approach where they built new services in AWS while keeping existing services in the datacenter, with Zuul (their edge service) routing traffic between the two environments. They migrated service by service, starting with non-critical systems like movie encoding and recommendations, saving the most critical services (billing, streaming) for last when they had the most experience. Interesting detail: Netflix ran in a hybrid datacenter/cloud environment for years, with sophisticated traffic management that could shift load based on availability and performance. They completed the migration in 2016, shutting down their last datacenter. The migration enabled Netflix to scale from 20 million to over 200 million subscribers during this period. Their success demonstrated that strangler fig works even for massive, complex systems if you’re patient and disciplined.
GitHub: MySQL to Spanner (2021-2023)
GitHub’s migration from MySQL to Google Cloud Spanner showcases strangler fig for database modernization. As GitHub scaled to over 100 million repositories, their sharded MySQL infrastructure was reaching operational limits. They needed global distribution and stronger consistency guarantees. Rather than a big-bang database migration, they used strangler fig with extensive validation. They implemented dual-write where writes went to both MySQL and Spanner, with background jobs validating data consistency. They ran in shadow mode for months, comparing query results between MySQL and Spanner to ensure correctness. They gradually shifted read traffic to Spanner (1% → 5% → 10% → 50% → 100%) over many months, monitoring for performance and correctness issues. Interesting detail: GitHub built a sophisticated “consistency checker” that continuously compared data between MySQL and Spanner, alerting on any discrepancies. They maintained the ability to failback to MySQL for months after routing 100% of traffic to Spanner, only decommissioning MySQL after extended validation. The migration improved their global latency and simplified operations, but took over two years of careful, incremental work. This example shows that strangler fig requires patience—rushing leads to data corruption and outages.
Netflix AWS Migration Architecture (2008-2016)
graph TB
subgraph Internet
Users["Users<br/><i>Streaming Clients</i>"]
end
subgraph Edge Layer
Zuul["Zuul Gateway<br/><i>Traffic Router</i>"]
end
subgraph Netflix Datacenter - Legacy
DC_API["API Services<br/><i>Java</i>"]
DC_Billing["Billing<br/><i>Critical</i>"]
DC_DB[("Oracle DB")]
end
subgraph AWS Cloud - New
AWS_Encoding["Encoding Service<br/><i>Migrated First</i>"]
AWS_Recommend["Recommendations<br/><i>Migrated Second</i>"]
AWS_Streaming["Streaming API<br/><i>Migrated Later</i>"]
AWS_DB[("Cassandra")]
end
Users --"1. Request"--> Zuul
Zuul --"2a. Route non-critical<br/>(2010-2012)"--> AWS_Encoding
Zuul --"2b. Route critical<br/>(2008-2014)"--> DC_Billing
Zuul --"2c. Route streaming<br/>(2014-2016)"--> AWS_Streaming
DC_API --> DC_DB
DC_Billing --> DC_DB
AWS_Encoding --> AWS_DB
AWS_Recommend --> AWS_DB
AWS_Streaming --> AWS_DB
DC_DB -."Data sync".-> AWS_DB
Netflix’s eight-year migration to AWS demonstrates strangler fig at massive scale. Zuul gateway routed traffic between datacenter and cloud, starting with non-critical services (encoding, recommendations) and saving critical systems (billing, streaming) for last. They ran hybrid for years, gradually shifting services to AWS while maintaining business continuity. Completed in 2016, enabling Netflix to scale from 20M to 200M+ subscribers.
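The core routing decision an edge gateway like Zuul makes can be sketched as deterministic percentage routing: hash a stable key (here, the user id) into a bucket and compare it against a per-feature rollout percentage. This is a generic illustration of the technique, not Netflix's actual implementation; the `MIGRATED_ROLLOUT` table and function names are hypothetical.

```python
import hashlib

# Hypothetical per-feature rollout percentages (0 = fully legacy, 100 = fully new)
MIGRATED_ROLLOUT = {
    "encoding": 100,
    "recommendations": 100,
    "streaming": 25,
    "billing": 0,            # critical path, still on legacy
}

def backend_for(feature: str, user_id: str) -> str:
    """Hash the (feature, user) pair into a 0-99 bucket and route to the
    new system only if the bucket falls under the rollout percentage.
    The hash makes routing deterministic: a user stays on one backend
    as the percentage ramps up."""
    pct = MIGRATED_ROLLOUT.get(feature, 0)
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "new" if bucket < pct else "legacy"
```

Deterministic bucketing (rather than random sampling per request) matters because it keeps a user's session on one backend, avoiding inconsistencies when the two systems briefly disagree.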
Interview Expectations
Mid-Level
What You Should Know: Understand the basic concept of strangler fig—that it’s an incremental migration strategy using a routing layer to gradually shift traffic from legacy to new systems. Be able to explain why it’s safer than big-bang rewrites (continuous validation, easy rollback, ongoing feature development). Describe the basic steps: establish routing layer, build new implementation, route traffic gradually, validate, decommission legacy. Understand the importance of feature parity and data synchronization during migration.
Bonus Points: Discuss specific routing mechanisms (reverse proxy, API gateway, feature flags). Mention real-world examples like Netflix or Shopify. Explain how you’d monitor and validate during migration (metrics comparison, shadow mode testing). Discuss rollback strategies and why they’re important. Show awareness of data migration challenges and dual-write patterns.
Senior
What You Should Know: Everything from mid-level plus deep understanding of tradeoffs between different strangler fig variants (proxy-based vs. application-level, synchronous vs. event-driven). Explain how to choose migration order (business value, technical risk, dependencies). Discuss data synchronization strategies in detail (dual-write, CDC, event sourcing) with their consistency implications. Describe comprehensive monitoring and validation approaches including shadow mode, traffic comparison, and automated testing. Understand organizational aspects—how to maintain team velocity during migration, how to manage technical debt in both systems.
Bonus Points: Discuss specific challenges you’ve faced in migrations and how you solved them. Explain capacity planning for running both systems simultaneously. Describe how to handle schema evolution and API versioning during migration. Discuss rollback procedures and incident response when migrations go wrong. Show understanding of when strangler fig is NOT appropriate (truly unsalvageable legacy systems, greenfield projects). Explain how to measure migration progress and communicate it to stakeholders. Discuss cost implications of running dual systems and how to optimize.
Staff+
What You Should Know: Everything from senior level plus strategic thinking about migration as a multi-year program. Understand how to structure teams and organizations around migration efforts. Explain how to balance migration work with feature development and technical debt. Discuss risk management frameworks for large-scale migrations including blast radius analysis, rollback procedures, and incident response. Understand financial implications—cost of running dual systems, opportunity cost of migration effort, business value of modernization. Be able to design migration strategies for complex scenarios (distributed systems, multi-region deployments, regulatory compliance requirements).
Distinguishing Signals: Demonstrate experience leading large-scale migrations (multi-year, multiple teams). Discuss how you’ve built organizational consensus around migration approaches and maintained momentum over long periods. Explain how you’ve balanced technical perfection with pragmatic progress—knowing when “good enough” is better than perfect. Show understanding of how migrations fit into broader technical strategy (cloud adoption, microservices architecture, platform modernization). Discuss how you’ve built reusable tooling and patterns that accelerate subsequent migrations. Explain how you’ve handled migrations that went wrong and the lessons learned. Show ability to communicate migration progress and value to non-technical stakeholders (executives, board members).
Common Interview Questions
Q: When would you NOT use the strangler fig pattern?
60-second answer: Don’t use strangler fig when the legacy system is truly unsalvageable (security vulnerabilities, no one understands it, can’t be kept running safely) or when you’re building something fundamentally different that doesn’t map to the old system. Also avoid it for small systems where the overhead of dual-running exceeds the migration effort, or when business requirements have changed so much that feature parity doesn’t make sense.
2-minute answer: Strangler fig isn’t appropriate in several scenarios. First, when the legacy system is so broken that keeping it running is dangerous—active security vulnerabilities, data corruption risks, or complete lack of understanding. In these cases, the risk of continuing to run the legacy system exceeds the risk of a rewrite. Second, when you’re building something fundamentally different that doesn’t map to the old system’s features—like moving from a monolithic desktop app to a SaaS platform with different user models. Third, for small systems where the overhead of building routing infrastructure, maintaining dual systems, and synchronizing data exceeds the effort of just rewriting it. If the legacy system is 5,000 lines of code, strangler fig is overkill. Fourth, when business requirements have changed dramatically and feature parity with the legacy system doesn’t serve users—sometimes you need to rethink the product, not just replatform it. Finally, when you lack the organizational discipline for incremental migration—if your team can’t maintain momentum over months or years, or if stakeholders demand immediate results, strangler fig’s gradual approach won’t work. In these cases, consider alternatives like buying a replacement product, building a minimal viable replacement, or accepting the technical debt.
Red flags: Saying you’d always use strangler fig regardless of context, or claiming big-bang rewrites are never appropriate. Good engineers understand that strangler fig is a tool, not a religion.
Q: How do you handle data consistency during the coexistence phase?
60-second answer: Implement dual-write (application writes to both datastores) or change data capture (stream changes from legacy to new). Monitor synchronization lag and data discrepancies continuously. Accept eventual consistency for most use cases, but identify critical paths that need strong consistency. Build reconciliation jobs that detect and fix inconsistencies. Test rollback scenarios to ensure you can failback to legacy if needed.
2-minute answer: Data consistency during coexistence is often the hardest part of strangler fig. You have several options, each with tradeoffs. Dual-write means the application writes to both legacy and new datastores, providing immediate consistency but requiring application changes and careful transaction handling—if one write fails, how do you handle it? Change data capture (CDC) means capturing changes from the legacy database and streaming them to the new datastore, avoiding application changes but introducing replication lag (seconds to minutes). Event sourcing means publishing events that both systems consume, providing loose coupling but requiring event schema management. For most systems, I recommend starting with CDC for bulk data and dual-write for critical paths. Implement comprehensive monitoring—track synchronization lag, compare data between systems, alert on discrepancies. Build reconciliation jobs that periodically compare data and fix inconsistencies. Accept eventual consistency for most use cases (product catalogs, user profiles) but identify critical paths that need strong consistency (financial transactions, inventory). Test rollback scenarios—if you route traffic back to legacy, ensure data written to the new system is synchronized back. The key is visibility—you need to know when data diverges and have tools to fix it.
Red flags: Claiming you can maintain perfect consistency without tradeoffs, or not discussing monitoring and reconciliation. Experienced engineers know data consistency is hard and plan for it.
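A reconciliation job of the kind described above reduces, at its core, to a three-way diff between snapshots of the two stores. This sketch assumes both snapshots fit in memory as dicts keyed by primary key—real checkers page through key ranges instead.

```python
def reconcile(legacy_rows: dict, new_rows: dict) -> dict:
    """Compare snapshots keyed by primary key and classify discrepancies:
    - missing:  present in legacy, absent from the new store
    - stale:    present in both, but values differ
    - orphaned: present in the new store only (e.g. failed rollback sync)
    """
    missing = [k for k in legacy_rows if k not in new_rows]
    stale = [k for k in legacy_rows
             if k in new_rows and new_rows[k] != legacy_rows[k]]
    orphaned = [k for k in new_rows if k not in legacy_rows]
    return {"missing": missing, "stale": stale, "orphaned": orphaned}
```

In practice each bucket feeds a different response: `missing` and `stale` rows are re-copied from legacy (the source of truth), while `orphaned` rows usually signal a bug in the write path and warrant an alert rather than silent deletion.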
Q: How do you decide which features to migrate first?
60-second answer: Prioritize based on three factors: business value (high-value features first to show progress), technical risk (low-risk features first to build confidence), and dependencies (independent features first to avoid coupling). Start with read-heavy, bounded features like catalogs or profiles. Avoid complex transactional workflows or tightly-coupled features early on. Build organizational muscle with early migrations before tackling critical paths.
2-minute answer: Migration order is critical to success. I use a framework considering business value, technical risk, and dependencies. Business value means features that deliver clear benefits—performance improvements, new capabilities, reduced operational cost. Migrating high-value features early demonstrates progress to stakeholders and maintains momentum. Technical risk means likelihood of problems—start with low-risk features to build confidence and learn patterns before tackling critical paths. Low-risk features are typically read-heavy (product catalogs, user profiles), have clear boundaries, and aren’t on critical paths where failures are catastrophic. Dependencies mean how coupled the feature is to other parts of the system—independent features can be migrated without coordinating across teams or systems. In practice, I often start with a “pathfinder” feature—something moderately valuable but low-risk that lets us build tooling and patterns. For example, migrating a product catalog service before checkout. This builds organizational muscle—routing infrastructure, monitoring, data synchronization—that accelerates subsequent migrations. Avoid migrating complex transactional workflows (checkout, payments) or tightly-coupled features (authentication, authorization) early on. Save the most critical, complex features for last when you have the most experience. Document lessons learned from each migration and refine your approach. The goal is steady progress with increasing velocity, not heroic efforts that burn out the team.
Red flags: Suggesting you’d migrate the most critical features first to “get them out of the way,” or not considering dependencies and coupling. Experienced engineers know that early migrations are learning exercises.
Q: How do you measure success and know when the migration is complete?
60-second answer: Track technical metrics (percentage of traffic on new system, number of features migrated, legacy system resource usage), business metrics (feature velocity, operational cost, incident rates), and organizational metrics (team satisfaction, knowledge distribution). Migration is complete when 100% of traffic routes to the new system, the legacy system is decommissioned, and the team is confident operating the new system. Decommissioning typically happens weeks to months after 100% of traffic is routed.
2-minute answer: Measuring migration success requires multiple dimensions. Technical metrics include percentage of traffic routed to new system (goal: 100%), number of features migrated (track progress), legacy system resource usage (should decrease), and error rates/latency (should match or improve on legacy). Business metrics include feature velocity (should maintain or increase during migration), operational cost (should decrease as legacy infrastructure is decommissioned), and incident rates (should not increase). Organizational metrics include team satisfaction (migrations are hard—is the team burned out?), knowledge distribution (how many people understand the new system?), and hiring/retention (can you attract talent with modern tech?). Set clear milestones: 25% traffic migrated, 50%, 100%, legacy decommissioned. Celebrate these milestones to maintain momentum. Migration completion isn’t when you route 100% of traffic—it’s when you decommission the legacy system. This typically takes weeks to months after routing all traffic because you need confidence in the new system. Define criteria for decommissioning: X weeks incident-free, Y% of team trained on new system, Z business milestones achieved. Maintain the ability to rollback until you’re truly confident. Finally, conduct a retrospective—what worked, what didn’t, what would you do differently? Document lessons learned for future migrations. The goal isn’t just technical migration—it’s organizational transformation.
Red flags: Only tracking technical metrics without business or organizational context, or claiming migration is complete as soon as 100% of traffic is routed. Experienced engineers know that decommissioning legacy safely takes time.
Q: What’s your approach to testing during a strangler fig migration?
60-second answer: Use multiple testing strategies: unit tests for new code, integration tests for new system behavior, shadow mode testing (run both systems, compare outputs), gradual traffic rollout (1% → 100%), and production monitoring. Build automated comparison tools that detect behavioral differences. Test rollback procedures regularly. Accept that you can’t catch everything—comprehensive monitoring is your safety net.
2-minute answer: Testing during strangler fig requires a multi-layered approach because you’re validating not just that the new system works, but that it works identically to the legacy system. Start with traditional testing—unit tests for new code, integration tests for new system behavior, end-to-end tests for critical paths. But these aren’t sufficient because they don’t validate parity with legacy. Implement shadow mode testing where you run both systems in parallel and compare outputs. Route production traffic to both systems (or replay traffic), compare responses, and alert on differences. This catches subtle behavioral differences that tests miss. GitHub ran shadow mode for months during their Spanner migration, comparing query results between MySQL and Spanner. Build automated comparison tools—don’t rely on manual inspection. Next, use gradual traffic rollout as a testing strategy. Route 1% of production traffic to the new system and monitor for errors, latency differences, and user complaints. Each percentage increase (5%, 10%, 25%, 50%, 100%) is a validation checkpoint. Use feature flags or routing rules to control which traffic goes where—you might route internal users first, then low-value customers, then everyone. Test rollback procedures regularly—can you route traffic back to legacy quickly if problems occur? Run chaos engineering experiments—what happens if the new system fails? Finally, accept that you can’t catch everything in testing. Comprehensive production monitoring is your safety net. Instrument both systems with identical metrics, set up alerting for discrepancies, and have runbooks for common issues. The goal is defense in depth—multiple testing layers that catch different types of problems.
Red flags: Relying only on traditional testing without validating parity with legacy, or not testing rollback procedures. Experienced engineers know that production is the ultimate test.
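The shadow-mode idea from the answer above can be sketched as a wrapper: serve every request from legacy, call the new system in the shadow, and record divergences instead of surfacing them. The handler signatures here are hypothetical; real deployments usually run the shadow call asynchronously and normalize responses (timestamps, ordering) before comparing.

```python
import json

def shadow_compare(request, legacy_handler, new_handler, mismatches: list):
    """Serve from legacy; exercise the new system in shadow mode.
    Any difference or error from the new system is recorded for later
    analysis and never affects the response the user sees."""
    legacy_resp = legacy_handler(request)
    try:
        new_resp = new_handler(request)
        # Canonical JSON makes dict key ordering irrelevant to the comparison.
        if json.dumps(new_resp, sort_keys=True) != json.dumps(legacy_resp, sort_keys=True):
            mismatches.append(
                {"request": request, "legacy": legacy_resp, "new": new_resp})
    except Exception as exc:
        mismatches.append({"request": request, "error": repr(exc)})
    return legacy_resp   # the user always gets the legacy answer
```

The mismatch log is the deliverable: an empty log over weeks of production traffic is the parity evidence that justifies shifting real traffic.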
Red Flags to Avoid
“We’ll do a big-bang rewrite, it’ll only take 6 months”
Why it’s wrong: Big-bang rewrites almost always take 2-3x longer than estimated, halt feature development, and carry catastrophic risk if the new system doesn’t work. History is littered with failed rewrites—Netscape, Healthcare.gov’s initial launch, countless enterprise projects. The strangler fig pattern exists specifically because big-bang rewrites fail so often.
What to say instead: “Big-bang rewrites are high-risk. I’d recommend an incremental migration using the strangler fig pattern, where we gradually replace functionality while maintaining business continuity. This takes longer overall but delivers value continuously and reduces risk. We can start with a high-value, low-risk feature to build momentum and learn patterns before tackling critical paths.”
“We’ll migrate everything at once to avoid maintaining two systems”
Why it’s wrong: The whole point of strangler fig is accepting the temporary cost of running two systems to reduce migration risk. Trying to avoid this by migrating everything simultaneously defeats the purpose and reintroduces big-bang risk. The dual-system period is an investment in safety and validation.
What to say instead: “Running two systems temporarily is the cost of safe migration. The alternative—migrating everything at once—carries unacceptable risk. We’ll minimize the dual-system period by migrating efficiently, but we won’t rush it. The ability to rollback to the legacy system is our safety net. Once we’re confident in the new system, we’ll decommission legacy.”
“We don’t need to monitor during migration, we’ll just test thoroughly”
Why it’s wrong: Testing can’t catch everything, especially subtle behavioral differences between legacy and new systems. Production traffic has edge cases and patterns that tests miss. Comprehensive monitoring is essential for detecting issues early and validating that the new system truly matches legacy behavior.
What to say instead: “Testing is necessary but not sufficient. We need comprehensive monitoring that compares legacy and new system behavior in production—error rates, latency, throughput, and business metrics. I’d implement shadow mode testing where we run both systems and compare outputs, plus gradual traffic rollout with monitoring at each stage. Production is the ultimate test.”
“We’ll migrate the most critical features first to get them out of the way”
Why it’s wrong: Critical features are the riskiest to migrate because failures have the biggest impact. Strangler fig works best when you start with low-risk features to build organizational muscle and learn patterns before tackling critical paths. Migrating critical features first is like learning to ski on a black diamond run.
What to say instead: “I’d start with moderately valuable but low-risk features—read-heavy, bounded domains like product catalogs or user profiles. This lets us build routing infrastructure, monitoring, and data synchronization patterns with limited blast radius. Once we have experience and confidence, we’ll tackle critical features like checkout or payments. The goal is to build organizational muscle before taking big risks.”
“Data migration is straightforward, we’ll just copy the database”
Why it’s wrong: Data migration is often the hardest part of strangler fig. Legacy databases have accumulated years of data quirks, inconsistencies, and implicit relationships. Simply copying data doesn’t address synchronization during the coexistence phase, data quality issues, or schema differences between legacy and new systems.
What to say instead: “Data migration is typically the most complex part. We need to analyze data quality, understand implicit relationships, and design synchronization mechanisms for the coexistence phase—dual-write or CDC. We’ll need reconciliation jobs to detect and fix inconsistencies, and comprehensive monitoring of data sync lag. I’d budget 40-50% of migration effort for data-related work and start data analysis early.”
Key Takeaways
- Strangler Fig enables safe legacy modernization by incrementally replacing old functionality with new services while keeping the system running, using a routing layer to gradually shift traffic. This eliminates the catastrophic risk of big-bang rewrites while maintaining business continuity and feature development.
- The pattern follows a three-phase process: transform (build new capability), coexist (run old and new in parallel with traffic gradually shifting), and eliminate (decommission legacy). Success requires discipline—feature parity before routing traffic, comprehensive monitoring, and maintaining rollback capability throughout.
- Data consistency during coexistence is the hardest challenge, requiring dual-write or change data capture to keep legacy and new datastores synchronized. Accept eventual consistency for most use cases but identify critical paths needing strong consistency, and build reconciliation tooling to detect and fix discrepancies.
- Migration order matters critically: start with high-value, low-risk, independent features to build organizational muscle before tackling complex, critical paths. Early migrations are learning exercises that establish patterns and tooling for subsequent migrations.
- Comprehensive monitoring and gradual rollout are your safety nets: instrument both systems identically, compare behavior continuously, and route traffic incrementally (1% → 5% → 10% → 50% → 100%). Maintain the ability to rollback quickly if problems occur, and don’t decommission legacy until you’re confident in the new system.
Related Topics
Prerequisites: Understanding API Gateway and Reverse Proxy patterns is essential since they often serve as the routing layer in strangler fig implementations. Familiarity with Event-Driven Architecture helps when implementing event-based strangler variants. Knowledge of Database Replication and Change Data Capture is critical for data synchronization strategies.
Related Patterns: Blue-Green Deployment and Canary Deployment share the gradual traffic shifting approach but operate at deployment rather than migration timescales. Circuit Breaker is often used to protect against failures when routing to new services. Anti-Corruption Layer helps isolate legacy and new systems during coexistence.
Follow-up Topics: After understanding strangler fig, explore Microservices Architecture since strangler fig is often used to decompose monoliths into microservices. Study Service Mesh for advanced traffic management in microservices environments. Learn about Feature Flags for fine-grained traffic control during migration.