How to Refactor Legacy Code Safely
How to Refactor Legacy Code Safely
Legacy code breaks in production more often during refactoring than during normal operation. The reason isn't that the old code is particularly fragile — it's that refactoring without proper safeguards transforms "code that works but is hard to modify" into "code that's broken and hard to fix." The difference between a successful refactoring that improves developer velocity and a failed one that creates production incidents comes down to process, not just coding skill.
This article covers the specific techniques that allow you to refactor legacy systems without breaking them. You'll learn how to add tests to untested code, how to make changes incrementally when big-bang rewrites aren't an option, and how to manage risk when modifying code nobody fully understands. The focus is on production systems where downtime has real costs and where "it works on my machine" doesn't count as success.
We'll work through the complete process from identifying refactoring targets to deploying changes safely, with emphasis on the checkpoints that prevent disasters.
Why Legacy Code Refactoring Fails
Most refactoring failures follow predictable patterns. Understanding these patterns helps you avoid them. The most common failure mode: attempting to improve code structure and fix bugs simultaneously. When you combine these activities, you can't determine whether production issues stem from structural changes or bug fixes, making rollback decisions impossible.
The second major failure mode: insufficient test coverage before starting. Teams often try to refactor first and add tests later, but this sequence is backwards. Tests on legacy code serve as change detectors, not quality indicators. They don't need to verify the code is correct — they need to verify your changes didn't alter behavior. If you refactor first, you have no baseline to test against.
The third failure mode: over-ambition. Teams decide to "modernize the authentication system" or "clean up the database layer," which are architectures, not tasks. These initiatives stall because they're too large to complete in one change set. The code exists in a half-refactored state for months, which is worse than not refactoring at all. Successful legacy refactoring happens in deployable increments, where each step improves the codebase without leaving it in a broken intermediate state.
The Characterization Testing Approach
Characterization tests document what code currently does, not what it should do. This distinction is essential for legacy refactoring. You're not trying to verify correctness — you're trying to create a behavioral snapshot so you can detect when refactoring changes behavior.
The process: identify a function or module you want to refactor, write tests that exercise it with various inputs, record the outputs (even if they're wrong), and codify those outputs as test assertions. If the code has a bug where it returns -1 for invalid input instead of throwing an error, your characterization test asserts that it returns -1. Once refactoring is complete and tests pass, then you can fix bugs — but you do that as a separate, intentional change.
// Legacy function with unclear behavior
function calculateDiscount(price, customerType, itemCount) {
if (customerType == 'premium') {
if (itemCount > 10) {
return price * 0.8;
}
return price * 0.9;
} else if (customerType == 'regular') {
if (itemCount > 5) {
return price * 0.95;
}
}
return price;
}
// Characterization tests - document actual behavior
describe('calculateDiscount - current behavior', () => {
test('premium customer with 11 items gets 20% off', () => {
expect(calculateDiscount(100, 'premium', 11)).toBe(80);
});
test('premium customer with 5 items gets 10% off', () => {
expect(calculateDiscount(100, 'premium', 5)).toBe(90);
});
test('regular customer with 6 items gets 5% off', () => {
expect(calculateDiscount(100, 'regular', 6)).toBe(95);
});
test('regular customer with 3 items gets no discount', () => {
expect(calculateDiscount(100, 'regular', 3)).toBe(100);
});
test('unknown customer type gets no discount', () => {
expect(calculateDiscount(100, 'corporate', 10)).toBe(100);
});
// Document edge cases even if behavior seems wrong
test('negative price returns negative discounted price', () => {
expect(calculateDiscount(-100, 'premium', 11)).toBe(-80);
});
});
Notice the last test documents unexpected behavior (negative prices aren't validated). The test doesn't fix it — it makes it visible. After refactoring, you know if your changes accidentally altered this behavior. Then you can decide whether to keep it, fix it, or leave it for later.
The Strangler Fig Pattern for Large Refactorings
The Strangler Fig pattern involves building new functionality alongside old code, gradually routing traffic to the new implementation, and removing old code once the new version handles all cases. This is the primary technique for refactoring systems too large to change in a single deployment.
The pattern works in three phases. First, create an abstraction layer that can route requests to either old or new implementation. Second, implement new code behind this abstraction and route a small percentage of traffic to it. Third, gradually increase traffic to the new implementation while monitoring for issues, eventually removing the old code entirely.
// Old implementation - do not modify
function legacyUserValidation(userData) {
// Complex validation logic accumulated over years
if (!userData.email) return false;
if (userData.email.indexOf('@') === -1) return false;
if (userData.age && userData.age < 13) return false;
// ... dozens more checks
return true;
}
// New implementation - improved structure
class UserValidator {
validate(userData) {
const errors = [];
if (!this.isValidEmail(userData.email)) {
errors.push('Invalid email');
}
if (!this.isValidAge(userData.age)) {
errors.push('Must be 13 or older');
}
return {
isValid: errors.length === 0,
errors
};
}
isValidEmail(email) {
return email && email.includes('@') && email.includes('.');
}
isValidAge(age) {
return !age || age >= 13;
}
}
// Abstraction layer - controls which implementation is used
class UserValidationRouter {
constructor() {
this.newValidator = new UserValidator();
this.useNewValidator = this.shouldUseNewValidator();
}
validate(userData) {
// Feature flag controls gradual rollout
if (this.useNewValidator) {
const result = this.newValidator.validate(userData);
// Shadow mode: compare with old implementation
const legacyResult = legacyUserValidation(userData);
if (result.isValid !== legacyResult) {
// Log discrepancy for investigation
this.logDiscrepancy(userData, result, legacyResult);
}
return result.isValid;
}
return legacyUserValidation(userData);
}
shouldUseNewValidator() {
// Start with 5% traffic, increase gradually
return Math.random() < 0.05;
}
logDiscrepancy(userData, newResult, oldResult) {
console.error('Validation mismatch', {
userId: userData.id,
newResult,
oldResult
});
}
}
// Application code remains unchanged
const validator = new UserValidationRouter();
const isValid = validator.validate(userData);
The shadow mode comparison is critical. It allows you to deploy the new implementation to production and compare results against the old implementation without changing behavior. Discrepancies reveal edge cases your tests missed. Once discrepancies stop appearing and you're confident in the new implementation, you remove the feature flag and delete the old code.
Dependency Breaking Techniques
Legacy code's primary characteristic is tight coupling — functions that depend on databases, external APIs, file systems, and other functions in ways that make isolated testing impossible. You can't write characterization tests if running the code requires a production database connection. Dependency breaking is the process of introducing seams that allow you to substitute test doubles for real dependencies.
Extract and Override: Extract dependency code into a method, create a test subclass that overrides that method. This works when you can't modify the dependency but can modify the class structure.
// Before - hard to test
class OrderProcessor {
processOrder(orderId) {
const order = database.query('SELECT * FROM orders WHERE id = ?', [orderId]);
const payment = stripeAPI.charge(order.amount, order.customerId);
emailService.send(order.customerEmail, 'Order confirmed');
return order;
}
}
// After - testable through subclassing
class OrderProcessor {
processOrder(orderId) {
const order = this.fetchOrder(orderId);
const payment = this.chargePayment(order.amount, order.customerId);
this.sendConfirmation(order.customerEmail);
return order;
}
// Extract dependencies as protected methods
protected fetchOrder(orderId) {
return database.query('SELECT * FROM orders WHERE id = ?', [orderId]);
}
protected chargePayment(amount, customerId) {
return stripeAPI.charge(amount, customerId);
}
protected sendConfirmation(email) {
emailService.send(email, 'Order confirmed');
}
}
// Test subclass
class TestableOrderProcessor extends OrderProcessor {
constructor(public mockOrder, public mockPayment) {
super();
}
protected fetchOrder(orderId) {
return this.mockOrder;
}
protected chargePayment(amount, customerId) {
return this.mockPayment;
}
protected sendConfirmation(email) {
// Don't send real emails in tests
}
}
// Now we can test
const processor = new TestableOrderProcessor(
{ id: 1, amount: 100, customerId: 'cus_123', customerEmail: '[email protected]' },
{ id: 'ch_123', status: 'succeeded' }
);
const result = processor.processOrder(1);
Parameter injection: Add parameters that allow passing dependencies instead of accessing them directly. Start with defaults so existing code doesn't break, then override in tests.
// Before
function generateReport() {
const data = database.query('SELECT * FROM sales');
const formatted = formatData(data);
filesystem.write('/reports/sales.txt', formatted);
}
// After - dependencies injected with defaults
function generateReport(
dataSource = database,
formatter = formatData,
writer = filesystem
) {
const data = dataSource.query('SELECT * FROM sales');
const formatted = formatter(data);
writer.write('/reports/sales.txt', formatted);
}
// Testing becomes possible
function testGenerateReport() {
const mockData = [{ product: 'Widget', sales: 100 }];
const mockSource = { query: () => mockData };
const mockFormatter = (data) => JSON.stringify(data);
const mockWriter = { write: jest.fn() };
generateReport(mockSource, mockFormatter, mockWriter);
expect(mockWriter.write).toHaveBeenCalledWith(
'/reports/sales.txt',
JSON.stringify(mockData)
);
}
The Mikado Method for Complex Refactorings
The Mikado Method is a systematic approach to refactoring when you discover that your initial change requires multiple prerequisite changes. The name comes from Mikado pickup sticks — you identify which changes you can make safely without disturbing others.
The process: attempt the refactoring you want to make, note what breaks, revert the changes, identify the prerequisites needed to make it work, tackle those prerequisites first, then retry the original refactoring. You're building a dependency graph of changes and executing them in reverse dependency order.
Example scenario: You want to extract a user authentication function from a monolithic controller, but discover it's tightly coupled to session management, logging, and database code.
- Attempt the extraction: Try to move the authentication logic to a separate module. Note all the breakages.
- Identify dependencies: Authentication depends on SessionManager, Logger, and Database. These need to be injectable before extraction is possible.
- Revert: Undo the extraction attempt. Commit what you have (nothing) so you have a clean slate.
- Break dependencies individually: First, make SessionManager injectable. Test. Commit. Then Logger. Test. Commit. Then Database. Test. Commit.
- Retry extraction: Now that dependencies are injectable, extract authentication. This time it works.
The key is reverting when you discover complications instead of pushing forward. Attempting to fix all discovered issues in a single change leads to large, risky pull requests that are impossible to review properly. The Mikado Method breaks that large change into a series of small, safe changes.
Incremental Type Safety in Untyped Codebases
Adding TypeScript to a large JavaScript codebase is a common refactoring challenge. The all-or-nothing approach — convert everything to strict TypeScript — fails in production systems. The incremental approach works: add types file by file, starting with the most problematic areas.
Configure TypeScript to allow gradual migration:
// tsconfig.json
{
"compilerOptions": {
"allowJs": true, // Allow .js files
"checkJs": false, // Don't type-check .js files yet
"strict": false, // Start with loose checking
"noImplicitAny": false, // Allow implicit any initially
"strictNullChecks": false // Enable later
},
"include": ["src/**/*"]
}
Then enable stricter checking file by file using per-file directives:
// @ts-check at the top of a .js file enables basic checking
// @ts-check
function calculateTotal(items) {
return items.reduce((sum, item) => sum + item.price, 0);
}
// Convert the most error-prone files to .ts first
// user.ts
interface User {
id: string;
email: string;
role: 'admin' | 'user';
}
export function isAdmin(user: User): boolean {
return user.role === 'admin';
}
// Other files can still use .js and import the typed module
// dashboard.js
import { isAdmin } from './user';
// Gets type checking even though this file is .js
if (isAdmin(currentUser)) {
// ...
}
The strategy: identify the 10-20 files that cause the most bugs (usually involving data transformation, API boundaries, or complex business logic), convert those to TypeScript with proper types, then expand outward. Each conversion makes the codebase incrementally safer without requiring a complete halt on feature development.
Refactoring Under Feature Flags
Feature flags allow you to merge refactored code to production while keeping it inactive until you're ready to enable it. This separates deployment from release, reducing the pressure of "this must work perfectly because we're deploying to all users immediately."
class FeatureFlags {
static isEnabled(flagName) {
// In production, this checks a config service or database
// In tests, you can override
const flags = {
'new-payment-processor': false,
'improved-search': false,
'refactored-auth': false
};
return flags[flagName] || false;
}
}
// Old payment processing
function processPaymentOld(orderId) {
// Legacy implementation
}
// New payment processing (refactored)
function processPaymentNew(orderId) {
// Improved implementation
}
// Application code uses feature flag
function processPayment(orderId) {
if (FeatureFlags.isEnabled('new-payment-processor')) {
return processPaymentNew(orderId);
}
return processPaymentOld(orderId);
}
The deployment sequence: merge the new implementation with the flag off, deploy to production, enable the flag for internal users only, monitor for issues, gradually increase the percentage of real users using the new code, eventually remove the flag and old implementation.
This approach converts binary risk (works or doesn't) into gradual risk (works for 5%, then 20%, then 50% of users). If issues appear at 20%, you disable the flag for everyone immediately without needing to deploy a rollback.
Database Refactoring Strategies
Database schema changes are particularly risky in legacy systems because you can't easily revert them — data has been written in the new format. The safe approach uses a multi-phase migration that maintains backward compatibility.
Example: Renaming a column
Phase 1: Add new column, write to both
// Migration 1: Add new column
ALTER TABLE users ADD COLUMN email_address VARCHAR(255);
// Update application code to write to both columns
function updateUser(userId, email) {
database.query(`
UPDATE users
SET email = ?, email_address = ?
WHERE id = ?
`, [email, email, userId]);
}
// Backfill existing data
UPDATE users SET email_address = email WHERE email_address IS NULL;
Phase 2: Switch reads to new column
// Update application code to read from new column
function getUser(userId) {
return database.query(`
SELECT id, email_address as email, name
FROM users
WHERE id = ?
`, [userId]);
}
// Still writing to both columns for safety
Phase 3: Remove old column
// After monitoring shows no issues, stop writing to old column
function updateUser(userId, email) {
database.query(`
UPDATE users
SET email_address = ?
WHERE id = ?
`, [email, userId]);
}
// Migration 2: Drop old column (do this weeks later, not immediately)
ALTER TABLE users DROP COLUMN email;
Each phase is independently deployable and reversible. If Phase 2 causes issues, you revert the application code (not the database) to read from the old column. The new column still exists and is being populated, so you haven't lost anything.
| Refactoring Type | Risk Level | Recommended Approach |
|---|---|---|
| Pure refactoring (no behavior change) | Low | Characterization tests + direct deployment |
| Algorithm replacement | Medium | Feature flags + shadow mode comparison |
| Database schema changes | High | Multi-phase migration with backward compatibility |
| Architecture changes | High | Strangler Fig pattern over months |
| External API changes | Medium-High | Adapter pattern + gradual cutover |
Testing Strategy for Legacy Code
The testing pyramid (many unit tests, fewer integration tests, minimal end-to-end tests) doesn't work for legacy refactoring. You need the inverse: start with end-to-end tests that verify the system works as a whole, add integration tests for major workflows, and only write unit tests for code you're actively modifying.
End-to-end tests are expensive to maintain but invaluable during legacy refactoring because they catch the issues that characterization tests miss — integration problems, timing issues, state management bugs. You're not trying to achieve 80% test coverage. You're trying to detect when your changes break production workflows.
// End-to-end test for critical user journey
describe('User purchase flow', () => {
test('user can complete purchase from cart to confirmation', async () => {
// Setup
const user = await createTestUser();
await loginAs(user);
// Add items to cart
await addToCart('product-123');
await addToCart('product-456');
// Go to checkout
await navigateToCheckout();
expect(await getCartTotal()).toBe(150.00);
// Enter payment info
await fillPaymentForm({
cardNumber: '4242424242424242',
expiry: '12/25',
cvc: '123'
});
// Complete purchase
await clickButton('Complete Purchase');
// Verify confirmation
await waitForUrl('/order-confirmation');
const orderNumber = await getOrderNumber();
expect(orderNumber).toMatch(/^ORD-\d+$/);
// Verify email sent (check test email service)
const emails = await getEmailsFor(user.email);
expect(emails).toContainEqual(
expect.objectContaining({
subject: 'Order Confirmation',
body: expect.stringContaining(orderNumber)
})
);
});
});
This test exercises the entire purchase flow. If your refactoring breaks checkout, payment processing, order creation, or email sending, this test fails. It doesn't tell you exactly what broke, but it tells you that something broke before you deploy to production.
Monitoring and Observability During Refactoring
Code-level testing catches what you anticipated. Monitoring catches what you didn't. During legacy refactoring, you should have elevated monitoring that specifically tracks the refactored code paths.
Add temporary instrumentation that you'll remove after the refactoring is stable:
function processPayment(orderId) {
const startTime = Date.now();
const implementation = FeatureFlags.isEnabled('new-payment-processor')
? 'new'
: 'old';
try {
const result = implementation === 'new'
? processPaymentNew(orderId)
: processPaymentOld(orderId);
// Track success metrics
metrics.increment('payment.success', { implementation });
metrics.timing('payment.duration', Date.now() - startTime, { implementation });
return result;
} catch (error) {
// Track failure metrics
metrics.increment('payment.failure', {
implementation,
errorType: error.name
});
// Log additional context during migration period
logger.error('Payment processing failed', {
implementation,
orderId,
error: error.message,
stack: error.stack
});
throw error;
}
}
This instrumentation lets you compare the new implementation against the old on multiple dimensions: success rate, latency, error types. If the new implementation has 99.5% success rate while the old has 99.8%, that's a regression you need to investigate before full rollout.
When to Stop Refactoring
Refactoring can become self-perpetuating — each improvement reveals more areas that could be improved. Knowing when to stop is as important as knowing how to start. The decision criteria: does this refactoring enable a specific business capability or reduce a specific operational problem?
Good reasons to continue refactoring: it's blocking a customer-requested feature, it's causing production incidents, it's making onboarding new developers prohibitively slow, or it's creating security vulnerabilities. Bad reasons: the code doesn't follow current best practices, it uses an outdated library that still works, or it's "not clean enough."
Legacy code that works, has no security issues, and doesn't block feature development should be left alone. Documentation is cheaper than refactoring. If the code is confusing but stable, write comments explaining how it works. If it's difficult to modify but rarely needs modification, leave it. Save refactoring effort for code that you need to change frequently.
FAQ
Should I refactor and add features in the same pull request?
No. Separate refactoring from feature work into different pull requests. When you combine them, reviewers can't determine if behavior changes are intentional (features) or accidental (refactoring bugs). More importantly, if the pull request causes production issues, you won't know whether to revert the entire change or just the feature portion. Refactor first in one PR, then add features in a subsequent PR. If the feature requires refactoring to implement, do the minimal refactoring needed, deploy it, verify stability, then add the feature.
How do I convince management to allocate time for refactoring?
Connect refactoring to business outcomes, not code quality. "We need to clean up the authentication code" gets rejected. "The authentication code is causing 30% of our customer support tickets and blocking the enterprise SSO feature that three major prospects are requesting" gets approved. Quantify the cost of not refactoring in terms management cares about: customer churn, support costs, delayed features, or engineer retention. Most importantly, don't ask for dedicated refactoring time — include small refactorings as part of feature work.
What if the legacy code has no tests and no documentation?
Start by adding characterization tests for the specific code you need to change, not the entire system. If you're refactoring the payment processing function, you only need tests for payment processing, not the entire application. Run the code with various inputs, record outputs, codify those as tests. For documentation, read the code to understand what it does, then write tests that demonstrate that understanding. The tests become executable documentation. If you can't understand what the code does, find someone who does and pair with them to write the first tests.
How do I refactor code that I can't run locally?
This is common with legacy systems that depend on production databases, third-party APIs, or complex infrastructure. Your options: create a staging environment that mirrors production, use Docker to containerize dependencies, or extract the code into a testable module by breaking external dependencies. The Extract and Override technique works well here — you pull external dependencies into methods you can override in tests. This allows testing the logic without requiring the actual dependencies to be available.
Should I rewrite from scratch or refactor incrementally?
Rewrite only when the cost of incremental refactoring exceeds the cost of rebuilding, which is rarer than developers think. Rewrites fail because they underestimate how much business logic is embedded in legacy code — edge cases, workarounds for client-specific issues, and undocumented requirements. If the legacy system handles production traffic successfully, it encodes valuable knowledge. Rewrites throw that away. Refactor incrementally unless the technology stack is completely obsolete (like a system written in a language with no available developers) or the system is so small that rewriting takes less than a month.
How can I refactor safely when I don't understand all the code?
You don't need to understand all the code, only the code you're changing. Use characterization tests to create a safety net, make small changes, and verify tests still pass. If you're refactoring function A that calls function B, you don't need to understand B's internals — you just need to ensure your changes to A don't change how it calls B. The Mikado Method helps here: attempt the change, see what breaks, revert, understand those specific broken pieces, and try again. You build understanding incrementally as needed, not comprehensively upfront.
What's the right balance between refactoring and shipping features?
Refactor as part of feature work, not instead of it. The Boy Scout Rule applies: leave code slightly better than you found it. If you're adding a feature that touches a poorly structured file, improve that file as part of the feature work. Don't refactor files you're not touching. This approach ensures refactoring happens continuously in small doses rather than in large, disruptive initiatives. The balance: spend 20-30% of feature development time on preparatory refactoring and cleanup, not 50/50.
How do I handle refactoring when multiple developers are working on the same codebase?
Coordinate major refactorings through team communication, not through code conflicts. If you're planning a large refactoring, announce it to the team so others can avoid working in that area or coordinate their changes. Use feature flags to keep refactored code merged but inactive, allowing other developers' work to proceed normally. For ongoing work, establish a pattern: whoever touches a file owns cleaning it up. This distributes refactoring work across the team and prevents conflicts.
Should I update dependencies as part of refactoring?
No. Dependency updates are separate from code refactoring. Update dependencies in isolated pull requests where you can verify the update didn't break anything. Combining dependency updates with code refactoring creates an impossible debugging situation — if tests fail, is it because of your code changes or the dependency update? Update dependencies first, verify everything works, then refactor. Or refactor first, verify everything works, then update dependencies. Never do both simultaneously.
How long should a refactoring pull request take to review?
If it takes more than 30 minutes to review, it's too large. Break it into smaller pieces. Large refactoring PRs don't get reviewed carefully — reviewers skim them and approve because thoroughly reviewing 2000 lines of changes is unrealistic. Small PRs (under 400 lines) get reviewed properly. If your refactoring requires 2000 lines of changes, break it into 5 PRs of 400 lines each, where each PR leaves the code in a working state. Sequential small changes are safer than one large change.
Conclusion
Safe legacy refactoring is fundamentally about risk management, not code quality. The techniques covered — characterization testing, Strangler Fig pattern, dependency breaking, Mikado Method, feature flags, and multi-phase database migrations — all serve the same purpose: making changes incrementally in ways that allow you to detect and revert problems before they affect users.
The key insight: refactoring legacy code is not about writing better code, it's about making the codebase easier to modify without breaking it. Code that's ugly but works and doesn't need to change shouldn't be refactored. Code that needs frequent modification but resists change is your highest priority target. Focus refactoring effort where it enables business value, not where it satisfies engineering aesthetics.
Success in legacy refactoring comes from discipline — making one change at a time, testing after each change, reverting when you discover complexity, and deploying incrementally. Teams that follow these practices refactor successfully. Teams that try to fix everything at once create production incidents.