Building Production-Grade Encompass Integrations: Patterns for Unreliable Third-Party Systems

Building a production-grade Encompass integration requires defensive architecture and pragmatic tradeoffs. Learn about saga orchestration, lock-aware queuing, webhook reconciliation, and operational resilience patterns that apply to any unreliable third-party system.

Dan Alvare

If you work in mortgage technology, you know Encompass. It's the dominant loan origination system in the industry, which means if you're building mortgage software, you're almost certainly integrating with it. On the surface, Encompass provides APIs and webhooks that promise straightforward integration. In practice, building a production-grade Encompass integration requires defensive architecture, sophisticated error handling, and pragmatic tradeoffs between reliability and complexity.

Over the past 3 months, I've led the rebuild of our company's Encompass integration from scratch. The legacy system was fragile and unreliable—loans would get stuck, data would drift out of sync, and manual intervention was almost a daily occurrence. The new architecture handles hundreds of loans with bidirectional sync, graceful failure handling, and automated reconciliation. More importantly, the patterns we developed apply far beyond Encompass to any integration with unreliable third-party systems.

This article covers the architectural decisions, implementation patterns, and lessons learned from building a production-ready Encompass integration that prioritizes reliability without sacrificing user experience.

The Challenge: Why Encompass Integration Is Complex

Encompass presents several challenges that make reliable integration harder than typical REST APIs:

Exclusive locking: When a loan is being edited in Encompass—either by a user or another system—it holds an exclusive lock that prevents other writes. Lock times can extend for minutes or even hours if a user leaves a loan open in their browser. Any write attempt during this period fails immediately, requiring retry logic and queue management.

Webhook unreliability: Encompass provides webhooks for create, update, delete, lock, and unlock events. However, these webhooks can be delayed, arrive out of order, contain duplicates, or fail to arrive entirely. Building on webhook delivery alone guarantees eventual data inconsistency.

Silent failures: Some Encompass API calls return HTTP 200 with no error indication but fail to perform the requested operation. The most notable example from my experience is setting e-consent, which can silently fail due to business rules or internal bugs. Without explicit validation, these failures go undetected.

Cascading updates: Encompass business rules can trigger chains of updates. For example, changing a borrower's income might recalculate debt-to-income ratios, which triggers compliance checks, which fires multiple webhooks. Without proper de-duplication and rate limiting, these cascades can create runaway update loops.

General instability: Encompass is prone to transient errors, timeouts, and occasional outages. Defensive programming isn't optional—it's required for any production system.

These problems aren't unique to Encompass. Many enterprise third-party systems exhibit similar characteristics: chatty APIs, unreliable webhooks, undocumented failure modes, and integration patterns that assume happy-path scenarios. The patterns discussed here generalize to any system with these properties.

Architecture Overview

Our integration uses an event-driven architecture with saga orchestration for loan creation and queue-based processing for bidirectional sync. We maintain local state to track integration status, lock state, and pending operations, which allows us to make decisions without constantly querying Encompass.
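
To make the local state concrete, here is a minimal sketch of the per-loan integration record such a design might keep. The field names, enum values, and the single pending-update counter are illustrative assumptions, not our actual schema.

```csharp
using System;

// One row per loan in our own database, so most sync decisions need no Encompass call.
// Names are illustrative, not the actual schema.
public enum IntegrationStatus { Pending, Synced, PartiallyFailed, Failed }

public class EncompassIntegrationState
{
    public Guid LoanId { get; set; }                              // our POS loan identifier
    public string? EncompassLoanId { get; set; }                  // null until the creation saga succeeds
    public IntegrationStatus Status { get; set; }
    public bool IsLocked { get; set; }                            // maintained from lock/unlock webhooks
    public DateTimeOffset? LockStateUpdatedAtUtc { get; set; }    // used to detect stale lock state
    public DateTimeOffset? LastSyncedModifiedAtUtc { get; set; }  // compared against Encompass's modified date
    public int PendingOutboundUpdates { get; set; }               // updates queued while the loan is locked
}
```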

The system has three primary flows:

  1. Loan creation: Multi-step saga orchestration using MassTransit
  2. POS → Encompass sync: Queue-based updates with lock management
  3. Encompass → POS sync: Webhook-driven updates with de-duplication and reconciliation

All three flows share common patterns: retry logic with backoff, circuit breakers for isolation, and multiple layers of reconciliation to catch failures.

Loan Creation: Saga Orchestration with Partial Failures

Creating a loan in Encompass requires multiple sequential steps: creating the loan entity, associating borrowers with Consumer Connect (Encompass's borrower portal), setting e-consent preferences, and triggering notifications. Each step can fail independently, and some failures are more critical than others.

We use MassTransit's saga pattern to orchestrate this workflow. Sagas provide durable state management and automatic retry logic, which is essential when individual steps can fail for minutes or hours before succeeding.

Critical vs Non-Critical Steps

The most important architectural decision was determining which steps are critical versus nice-to-have. This was driven by business requirements, not technical constraints. The primary requirement is getting loan data into Encompass so the loan officer can begin working on the application. Consumer Connect association and e-consent are important but secondary—if they fail, the loan can still proceed while we retry these steps in the background or manually.

This prioritization led us to implement "partial failure" states in our sagas. If the loan is successfully created but Consumer Connect association fails, the saga completes with a "Partially Failed" status. The partial failure triggers separate retry logic with different backoff strategies, and if it ultimately fails, it creates an automated support ticket for manual intervention.

[Diagram: loan creation saga. Step 1 (create loan in Encompass) is critical: transient errors retry behind a circuit breaker, while permanent errors create a support ticket and fail the saga. Step 2 (Consumer Connect association) retries with backoff up to 5 attempts; if it still fails, the saga marks a partial failure for background retry and continues. Step 3 (set e-consent) runs a recursive set-and-validate loop up to 5 attempts; on failure it marks a partial failure for manual intervention and continues. Step 4 (send notifications) retries on error, and on success the saga completes and updates the integration status.]
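
Expressed as a MassTransit state machine, the critical versus non-critical split looks roughly like the sketch below. The message contracts, state names, and the single partial-failure flag are illustrative assumptions, and the activities that actually call Encompass are omitted; the real saga carries more state and schedules background retries before settling on a partial failure.

```csharp
using System;
using MassTransit;

// Illustrative message contracts, not the real ones.
public record LoanCreationRequested(Guid CorrelationId) : CorrelatedBy<Guid>;
public record LoanCreatedInEncompass(Guid CorrelationId) : CorrelatedBy<Guid>;
public record ConsumerConnectAssociated(Guid CorrelationId) : CorrelatedBy<Guid>;
public record ConsumerConnectAssociationFailed(Guid CorrelationId) : CorrelatedBy<Guid>;

// Saga instance: durable state MassTransit persists between steps.
public class LoanCreationState : SagaStateMachineInstance
{
    public Guid CorrelationId { get; set; }
    public string CurrentState { get; set; } = default!;
    public bool ConsumerConnectPartiallyFailed { get; set; }   // records the "partially failed" outcome
}

public class LoanCreationStateMachine : MassTransitStateMachine<LoanCreationState>
{
    public State CreatingLoan { get; private set; } = default!;
    public State AssociatingBorrowers { get; private set; } = default!;
    public State SettingEConsent { get; private set; } = default!;

    public Event<LoanCreationRequested> Requested { get; private set; } = default!;
    public Event<LoanCreatedInEncompass> LoanCreated { get; private set; } = default!;
    public Event<ConsumerConnectAssociated> Associated { get; private set; } = default!;
    public Event<ConsumerConnectAssociationFailed> AssociationFailed { get; private set; } = default!;

    public LoanCreationStateMachine()
    {
        InstanceState(x => x.CurrentState);

        // Step 1 is critical: the saga cannot proceed until the loan exists in Encompass.
        Initially(When(Requested).TransitionTo(CreatingLoan));

        During(CreatingLoan,
            When(LoanCreated).TransitionTo(AssociatingBorrowers));

        // Step 2 is non-critical: on failure, mark the partial failure and keep the saga moving.
        During(AssociatingBorrowers,
            When(Associated).TransitionTo(SettingEConsent),
            When(AssociationFailed)
                .Then(ctx => ctx.Saga.ConsumerConnectPartiallyFailed = true)
                .TransitionTo(SettingEConsent));
    }
}
```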

Setting e-consent in Encompass revealed a particularly problematic failure mode: the API would return HTTP 200, but the e-consent fields wouldn't actually be set. This appeared to be triggered by certain business rule configurations or internal Encompass bugs, and it occurred unpredictably.

We discovered this through production failures. Loan officers would report that e-consent wasn't enabled despite our logs showing successful API calls. After isolating the issue, we implemented a write-then-validate pattern specifically for e-consent.

The solution: After setting e-consent, we immediately query the loan to verify the value was actually set. If it wasn't, we recursively retry up to five times. In production, most issues resolve on the second attempt, suggesting this is primarily a timing or business rule problem in Encompass rather than a permanent configuration issue.

[Diagram: e-consent write-then-validate loop. The set e-consent request retries with backoff on 4xx/5xx errors. On HTTP 200, we query Encompass to verify the value was actually set. If it wasn't (a silent failure), we recursively repeat the set-and-validate cycle up to 5 attempts; after the fifth failure, the step marks a partial failure and creates a support ticket.]
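
The pattern itself is small enough to show in a few lines. The sketch below assumes a hypothetical IEncompassClient wrapper with placeholder method names and a boolean e-consent value; the essential point is that the read-back value, not the HTTP status, decides whether the step completed.

```csharp
using System;
using System.Threading.Tasks;

public interface IEncompassClient
{
    Task SetEConsentAsync(string loanId, bool eConsent);   // may return 200 without persisting
    Task<bool> GetEConsentAsync(string loanId);            // reads back the actual field value
}

public static class EConsentWriter
{
    public static async Task<bool> SetAndValidateAsync(
        IEncompassClient client, string loanId, bool desired, int maxAttempts = 5)
    {
        for (var attempt = 1; attempt <= maxAttempts; attempt++)
        {
            await client.SetEConsentAsync(loanId, desired);

            // Write-then-validate: trust the read-back value, not the HTTP status.
            if (await client.GetEConsentAsync(loanId) == desired)
                return true;                                // verified: continue the saga

            // Most silent failures resolve on the second attempt; back off briefly and retry.
            await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt)));
        }

        return false;   // partial failure: create a support ticket for manual intervention
    }
}
```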

Separating e-consent into its own saga step improved both reliability and code organization. The isolated write-validate-retry logic was cleaner, reusable for non-owning borrower scenarios, and less failure-prone than bundling it with the loan creation payload.

Circuit Breakers and Bulkheads

Because Encompass is prone to transient errors, we implemented circuit breakers and bulkheads to prevent cascading failures and resource exhaustion.

Circuit breakers protect against overwhelming Encompass during outages. We use per-tenant partitioned circuit breakers set at a 50% failure threshold—higher than we'd normally set for a reliable API, but appropriate given Encompass's baseline error rate. When a circuit breaker opens, we create scheduled retries every 60 seconds with exponential backoff. This prevents hammering Encompass during outages while ensuring we eventually succeed once the system recovers.
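
A per-tenant breaker along these lines can be built with Polly's advanced circuit breaker. The 50% threshold and the 60-second break reflect the configuration described above; the exception types, sampling window, and minimum throughput are assumptions.

```csharp
using System;
using System.Collections.Concurrent;
using System.Net.Http;
using Polly;
using Polly.CircuitBreaker;

public sealed class TenantCircuitBreakers
{
    // One breaker per tenant, so one lender's Encompass outage never opens another tenant's circuit.
    private readonly ConcurrentDictionary<string, AsyncCircuitBreakerPolicy> _breakers = new();

    public AsyncCircuitBreakerPolicy For(string tenantId) =>
        _breakers.GetOrAdd(tenantId, _ =>
            Policy
                .Handle<HttpRequestException>()
                .Or<TimeoutException>()
                .AdvancedCircuitBreakerAsync(
                    failureThreshold: 0.5,                        // open at a 50% failure rate...
                    samplingDuration: TimeSpan.FromSeconds(60),   // ...measured over this window...
                    minimumThroughput: 10,                        // ...given at least this many calls
                    durationOfBreak: TimeSpan.FromSeconds(60)));  // probe again after 60 seconds
}
```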

Bulkheads isolate resources to prevent one tenant's problems from affecting others. We implement two types of bulkheads:

Thread pool isolation: We limit concurrent Encompass API calls per tenant based on their rate limits. Encompass defaults to 30 concurrent calls per lender environment, so we enforce this limit at our application layer. If a tenant reaches their concurrency limit, additional requests queue with backpressure rather than being rejected. We can't afford to lose update requests in a system where data consistency is critical.

Queue capacity limits: Each tenant has isolated queue capacity for both outbound updates and inbound webhook processing. This prevents a single tenant experiencing high load (bulk imports, cascading business rule updates) from consuming all available queue resources and impacting other tenants' integration performance.
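
A sketch of both bulkheads is below, assuming the 30-call default and an arbitrary queue capacity. A bounded channel with Wait full mode gives backpressure, so producers wait when a tenant's queue fills rather than losing updates.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Channels;

public sealed record OutboundUpdate(Guid LoanId, IReadOnlyDictionary<string, object?> Fields);

public sealed class TenantBulkheads
{
    private readonly ConcurrentDictionary<string, SemaphoreSlim> _apiSlots = new();
    private readonly ConcurrentDictionary<string, Channel<OutboundUpdate>> _queues = new();

    // Thread pool isolation: cap concurrent Encompass calls at the tenant's limit (30 by default).
    public SemaphoreSlim ApiSlotsFor(string tenantId) =>
        _apiSlots.GetOrAdd(tenantId, _ => new SemaphoreSlim(30, 30));

    // Queue isolation: each tenant gets bounded capacity; a full queue makes producers wait
    // (backpressure) rather than rejecting or losing updates.
    public Channel<OutboundUpdate> QueueFor(string tenantId) =>
        _queues.GetOrAdd(tenantId, _ => Channel.CreateBounded<OutboundUpdate>(
            new BoundedChannelOptions(capacity: 1_000) { FullMode = BoundedChannelFullMode.Wait }));
}
```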

These isolation patterns proved essential in production. When one tenant's Encompass environment experiences an outage or their business rules trigger cascading updates, other tenants continue operating normally. The per-tenant circuit breakers and bulkheads contain the blast radius of any Encompass issues.

POS → Encompass: Managing Locks and Update Queues

Encompass's exclusive locking mechanism is one of the most challenging aspects of the integration. When a loan officer opens a loan in Encompass, the system holds an exclusive lock for the duration of their session, which can be hours if they leave their browser open. Any attempt to update the loan during this period fails immediately.

Local Lock State Management

Rather than discovering locks by attempting writes and handling failures, we maintain local lock state synchronized via Encompass lock and unlock webhooks. When a loan is locked, we queue any pending updates. When it unlocks, we process the queue.

This approach has several advantages:

  • No failed API calls due to lock conflicts
  • Updates can be consolidated while queued
  • We can provide user feedback about lock status without querying Encompass
  • Reduces load on Encompass by batching updates

The queue processes in FIFO order to maintain update sequencing. When possible, we consolidate queued requests: if request 1 updates the borrower name, request 2 updates the borrower email, and request 3 updates the borrower name again, we merge them into a single request where the last value wins for duplicate properties, yielding one payload with request 3's borrower name and request 2's borrower email.
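
A sketch of that consolidation step, assuming queued updates are field-level patches keyed by field name: merging in FIFO order makes the last write win per field.

```csharp
using System.Collections.Generic;

public static class UpdateConsolidator
{
    // Merge queued patches in FIFO order; for duplicate fields the later value wins, so the
    // example above yields request 3's borrower name and request 2's borrower email in one payload.
    public static Dictionary<string, object?> Consolidate(
        IEnumerable<IReadOnlyDictionary<string, object?>> queuedPatches)
    {
        var merged = new Dictionary<string, object?>();
        foreach (var patch in queuedPatches)
            foreach (var (field, value) in patch)
                merged[field] = value;
        return merged;
    }
}
```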

Conflict Resolution

If a loan remains locked for an extended period and conflicting updates occur in both systems, we treat Encompass as the source of truth. For example, if a loan is locked for 60 minutes and the borrower name is updated in both Encompass (at minute 0) and our POS (at minute 30), the Encompass value takes precedence when we finally sync.

This was a business decision, not a technical constraint. Once a loan reaches Encompass, it becomes the system of record. Our POS exists to collect initial data and provide borrower-facing features, but Encompass is where loan officers perform the bulk of their work.
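
Expressed as code, the precedence rule is a one-liner; the FieldChange shape here is hypothetical.

```csharp
using System;

public sealed record FieldChange(string Field, object? Value, DateTimeOffset ChangedAtUtc);

public static class ConflictResolver
{
    // Encompass is the system of record once the loan exists there.
    public static object? Resolve(FieldChange? encompassChange, FieldChange? posChange) =>
        encompassChange is not null
            ? encompassChange.Value          // Encompass wins, even if the POS change is newer
            : posChange?.Value;              // only the POS changed the field: keep its value
}
```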

Reconciliation and Recovery

Webhooks aren't perfectly reliable, so we can't trust lock state based solely on webhook delivery. We run background jobs every hour to reconcile lock state and detect missed updates:

  • Lock status reconciliation: If a loan's lock timestamp hasn't updated in an unusually long period, we query the Encompass API to verify actual lock state
  • Modified date comparison: We query modified dates in both our database and Encompass to detect missed updates, triggering sync for any loans that show more recent Encompass modifications
[Diagram: hourly background job. One branch finds stale lock timestamps, queries Encompass to verify actual lock status, and updates local lock state. The other compares modified dates between our database and Encompass and syncs any missed updates.]
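
A sketch of the hourly job is below, written against hypothetical repository and client abstractions; the four-hour staleness threshold is an arbitrary placeholder.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public interface ILoanRepository
{
    Task<IReadOnlyList<LoanRecord>> GetLoansWithStaleLockStateAsync(TimeSpan olderThan, CancellationToken ct);
    Task<IReadOnlyList<LoanRecord>> GetActiveLoansAsync(CancellationToken ct);
    Task UpdateLockStateAsync(Guid loanId, bool isLocked, CancellationToken ct);
    Task EnqueueInboundSyncAsync(Guid loanId, CancellationToken ct);
}

public interface IEncompassQueryClient
{
    Task<bool> IsLoanLockedAsync(string encompassLoanId, CancellationToken ct);
    Task<DateTimeOffset> GetLastModifiedAsync(string encompassLoanId, CancellationToken ct);
}

public sealed record LoanRecord(Guid Id, string EncompassLoanId, DateTimeOffset? LastSyncedModifiedAtUtc);

public sealed class HourlyReconciliationJob
{
    private readonly ILoanRepository _repo;
    private readonly IEncompassQueryClient _encompass;

    public HourlyReconciliationJob(ILoanRepository repo, IEncompassQueryClient encompass)
    {
        _repo = repo;
        _encompass = encompass;
    }

    public async Task RunAsync(CancellationToken ct)
    {
        // 1. Lock-state reconciliation: verify loans whose lock timestamp looks stale.
        foreach (var loan in await _repo.GetLoansWithStaleLockStateAsync(TimeSpan.FromHours(4), ct))
        {
            var isLocked = await _encompass.IsLoanLockedAsync(loan.EncompassLoanId, ct);
            await _repo.UpdateLockStateAsync(loan.Id, isLocked, ct);
        }

        // 2. Modified-date comparison: re-sync loans Encompass touched after our last sync.
        foreach (var loan in await _repo.GetActiveLoansAsync(ct))
        {
            var remoteModified = await _encompass.GetLastModifiedAsync(loan.EncompassLoanId, ct);
            if (loan.LastSyncedModifiedAtUtc is null || remoteModified > loan.LastSyncedModifiedAtUtc)
                await _repo.EnqueueInboundSyncAsync(loan.Id, ct);
        }
    }
}
```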

For immediate user needs, we provide a manual sync button in our admin UI. If a user reports data inconsistencies, support staff can trigger an immediate sync without waiting for background job execution.

Encompass → POS: Webhook Processing and De-duplication

Encompass fires webhooks for create, update, and delete events. However, these webhooks only contain the loan ID—we must query the API to retrieve actual loan data. This creates opportunities for optimization through batching and de-duplication.

Handling Webhook Volume

Encompass can be extremely chatty. Bulk imports, business rule changes, or cascading updates can generate hundreds of webhooks in seconds for the same loan. Processing each webhook individually would overwhelm both our system and Encompass's API.

Our solution: Queue all incoming webhooks with a slight delay before processing begins. This delay window allows us to collect multiple webhooks for the same loan and consolidate them into a single API fetch. Since each webhook carries only a loan ID and the last write wins, duplicate webhooks for the same loan can be dropped entirely; we keep only the most recent entry in the queue.

[Diagram: inbound webhook pipeline. Each webhook (loan ID only) enters a queue, a delay window collects and de-duplicates entries, duplicates for the same loan are dropped, and the survivor drives a single API fetch, data mapping with user matching by email, a database write, and asynchronous task generation.]
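
A sketch of the buffer behind the delay window: later webhooks for the same loan simply overwrite earlier ones, so each drain yields at most one API fetch per loan.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

public sealed record LoanWebhook(string LoanId, string EventType, DateTimeOffset ReceivedAtUtc);

public sealed class WebhookBuffer
{
    // Keyed by loan ID: the latest webhook replaces any earlier ones still waiting.
    private readonly ConcurrentDictionary<string, LoanWebhook> _latestByLoanId = new();

    public void Add(LoanWebhook webhook) => _latestByLoanId[webhook.LoanId] = webhook;

    // Called when the delay window elapses; each surviving entry triggers one API fetch.
    public IReadOnlyList<LoanWebhook> Drain()
    {
        var batch = new List<LoanWebhook>();
        foreach (var loanId in _latestByLoanId.Keys)
            if (_latestByLoanId.TryRemove(loanId, out var webhook))
                batch.Add(webhook);
        return batch;
    }
}
```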

Rate Limiting and Recursive Update Prevention

We've observed cases where updating Encompass triggers business rules that fire webhooks back to us, which could trigger another update, creating infinite loops. While Encompass attempts to prevent these loops, it's not something we rely on.

We implement several defensive layers:

  • De-duplication: Drop duplicate loan IDs from the queue before processing
  • Rate limiting: If the same loan generates more than three webhook events per second, we throttle processing and log the anomaly
  • Logging and alerting: Track webhook patterns to detect runaway updates (e.g., the same loan updating once per second for an hour)

These safeguards have proven essential. We don't trust Encompass or any third-party system to handle even basic protections reliably.
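
A sketch of the per-loan throttle is below. The one-second sliding window and three-event limit come from the rule above; everything else is an assumption.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

public sealed class PerLoanWebhookThrottle
{
    private readonly ConcurrentDictionary<string, Queue<DateTimeOffset>> _recent = new();

    // Returns true when a loan exceeds three webhook events within one second;
    // the caller throttles processing and logs the anomaly for alerting.
    public bool ShouldThrottle(string loanId, DateTimeOffset now, int maxPerSecond = 3)
    {
        var window = _recent.GetOrAdd(loanId, _ => new Queue<DateTimeOffset>());
        lock (window)
        {
            while (window.Count > 0 && now - window.Peek() > TimeSpan.FromSeconds(1))
                window.Dequeue();                       // drop events outside the sliding window
            window.Enqueue(now);
            return window.Count > maxPerSecond;
        }
    }
}
```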

User Matching and Task Generation

When processing loan updates from Encompass, we match users by email and add them to the loan in our system. If matched users have confirmed emails, we send notifications that they've been added. Task generation for new borrowers happens asynchronously and doesn't block the sync process.

We intentionally don't auto-create user accounts from Encompass data. Email addresses in Encompass can be incorrect or premature, so we require loan officers to manually invite borrowers. This prevents spam and accidental invitations to wrong addresses.

Encompass can either hard delete or move loans to a trash folder (soft delete). In both cases, we implement soft deletes in our system to preserve audit trails and enable potential recovery.

Operational Resilience

Reliable integration requires more than just good architecture. It requires operational visibility and multiple layers of verification.

Multiple Reconciliation Layers

We don't trust any single mechanism for data consistency. Our reconciliation strategy includes:

  • Webhook-driven updates for real-time sync (primary path)
  • Background jobs checking modified dates and lock status (catches missed webhooks)
  • Manual sync buttons in the admin UI (immediate user-driven recovery)
  • Automated support tickets for failures requiring manual intervention

Each layer catches failures the previous layer might miss. This defense-in-depth approach has proven essential for maintaining data consistency despite Encompass's unreliability.

Monitoring and Visibility

We use Datadog for engineering monitoring, tracking saga success rates, circuit breaker states, queue depths, and webhook processing latency. For operational needs, we built an admin UI with per-loan filtering that shows integration status, logs all API interactions, and provides manual override controls.

This dual approach serves different audiences: engineers need aggregate metrics and alerting, while support staff need loan-specific details and the ability to trigger immediate actions.

Lessons Learned and Best Practices

Building this integration taught us several principles that apply to any unreliable third-party system:

Don't trust third-party APIs. Although unlikely, even HTTP 200 responses can represent failures. Always validate critical operations explicitly, especially for systems with poor documentation or known bugs.

Business-driven failure prioritization. Not all failures are equal. Understanding which operations are truly critical versus nice-to-have allows you to design partial failure states that balance reliability with user experience.

Queue-based patterns for ordering and consolidation. When dealing with high-volume updates or long-running locks, queues provide natural points for consolidation, de-duplication, and backpressure management.

Local state management reduces API dependency. Maintaining lock state, integration status, and pending operations locally allows you to make decisions without constant API queries, improving both performance and resilience.

Multiple reconciliation layers catch what webhooks miss. Webhooks are never perfectly reliable. Background jobs that periodically reconcile state are essential for long-term data consistency.

Operational visibility and manual overrides are essential. No matter how good your automation is, there will be edge cases that require human intervention. Provide your support team with the visibility and tools they need to resolve issues quickly.

Rate limiting and circuit breakers prevent cascading failures. Defensive mechanisms aren't just for your system—they protect the third-party system from being overwhelmed by your integration as well.

Conclusion

Encompass integration requires defensive architecture and pragmatic tradeoffs. The patterns described here—saga orchestration with partial failures, lock-aware queuing, webhook reconciliation, and operational resilience—aren't specific to Encompass. They apply to any integration with unreliable third-party systems, particularly in regulated industries where enterprise software often prioritizes features over API reliability.

The key is balancing reliability, user experience, and operational overhead. Perfect reliability is impossible with systems like Encompass, but well-designed architecture can provide the resilience and observability needed for production use.

If you're building integrations with unreliable third-party systems—whether in mortgage tech or elsewhere—I'd be interested in hearing about the patterns you've found effective. Feel free to connect with me on LinkedIn or reach out to discuss approaches to these challenges.