Building Production-Grade Encompass Integrations: Patterns for Unreliable Third-Party Systems
Building a production-grade Encompass integration requires defensive architecture and pragmatic tradeoffs. Learn about saga orchestration, lock-aware queuing, webhook reconciliation, and operational resilience patterns that apply to any unreliable third-party system.
If you work in mortgage technology, you know Encompass. It's the dominant loan origination system in the industry, which means if you're building mortgage software, you're almost certainly integrating with it. On the surface, Encompass provides APIs and webhooks that promise straightforward integration. In practice, building a production-grade Encompass integration requires defensive architecture, sophisticated error handling, and pragmatic tradeoffs between reliability and complexity.
Over the past 3 months, I've led the rebuild of our company's Encompass integration from scratch. The legacy system was fragile and unreliable—loans would get stuck, data would drift out of sync, and manual intervention was almost a daily occurrence. The new architecture handles hundreds of loans with bidirectional sync, graceful failure handling, and automated reconciliation. More importantly, the patterns we developed apply far beyond Encompass to any integration with unreliable third-party systems.
This article covers the architectural decisions, implementation patterns, and lessons learned from building a production-ready Encompass integration that prioritizes reliability without sacrificing user experience.
The Challenge: Why Encompass Integration Is Complex
Encompass presents several challenges that make reliable integration harder than typical REST APIs:
Exclusive locking: When a loan is being edited in Encompass—either by a user or another system—it holds an exclusive lock that prevents other writes. Lock times can extend for minutes or even hours if a user leaves a loan open in their browser. Any write attempt during this period fails immediately, requiring retry logic and queue management.
Webhook unreliability: Encompass provides webhooks for create, update, delete, lock, and unlock events. However, these webhooks can be delayed, arrive out of order, contain duplicates, or fail to arrive entirely. Building on webhook delivery alone guarantees eventual data inconsistency.
Silent failures: Some Encompass API calls return HTTP 200 with no error indication but fail to perform the requested operation. The most notable example from my experience is setting e-consent, which can silently fail due to business rules or internal bugs. Without explicit validation, these failures go undetected.
Cascading updates: Encompass business rules can trigger chains of updates. For example, changing a borrower's income might recalculate debt-to-income ratios, which triggers compliance checks, which fires multiple webhooks. Without proper de-duplication and rate limiting, these cascades can create runaway update loops.
General instability: Encompass is prone to transient errors, timeouts, and occasional outages. Defensive programming isn't optional—it's required for any production system.
These problems aren't unique to Encompass. Many enterprise third-party systems exhibit similar characteristics: chatty APIs, unreliable webhooks, undocumented failure modes, and integration patterns that assume happy-path scenarios. The patterns discussed here generalize to any system with these properties.
Architecture Overview
Our integration uses an event-driven architecture with saga orchestration for loan creation and queue-based processing for bidirectional sync. We maintain local state to track integration status, lock state, and pending operations, which allows us to make decisions without constantly querying Encompass.
The system has three primary flows:
- Loan creation: Multi-step saga orchestration using MassTransit
- POS → Encompass sync: Queue-based updates with lock management
- Encompass → POS sync: Webhook-driven updates with de-duplication and reconciliation
All three flows share common patterns: retry logic with backoff, circuit breakers for isolation, and multiple layers of reconciliation to catch failures.
Loan Creation: Saga Orchestration with Partial Failures
Creating a loan in Encompass requires multiple sequential steps: creating the loan entity, associating borrowers with Consumer Connect (Encompass's borrower portal), setting e-consent preferences, and triggering notifications. Each step can fail independently, and some failures are more critical than others.
We use MassTransit's saga pattern to orchestrate this workflow. Sagas provide durable state management and automatic retry logic, which is essential when individual steps can fail for minutes or hours before succeeding.
Critical vs Non-Critical Steps
The most important architectural decision was determining which steps are critical versus nice-to-have. This was driven by business requirements, not technical constraints. The primary requirement is getting loan data into Encompass so the loan officer can begin working on the application. Consumer Connect association and e-consent are important but secondary—if they fail, the loan can still proceed while we retry these steps in the background or manually.
This prioritization led us to implement "partial failure" states in our sagas. If the loan is successfully created but Consumer Connect association fails, the saga completes with a "Partially Failed" status. The partial failure triggers separate retry logic with different backoff strategies, and if it ultimately fails, it creates an automated support ticket for manual intervention.
[Flow diagram: loan creation saga. Step 1 creates the loan in Encompass (critical: transient errors retry behind a circuit breaker, permanent errors fail the saga and create a support ticket). Step 2 associates Consumer Connect (non-critical: up to 5 retries with backoff, then a partial failure with background retry). Step 3 sets e-consent via recursive set-and-validate, up to 5 attempts (non-critical: failure marks a partial failure requiring manual intervention). Step 4 sends notifications, after which the saga completes and updates the integration status.]
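To make the partial-failure idea concrete, here is a heavily trimmed sketch of what a MassTransit (v8-style) state machine with a "Partially Failed" state can look like. All contract, state, and property names are illustrative rather than our actual implementation, and the e-consent and notification steps are omitted for brevity.

```csharp
using MassTransit;

// Illustrative message contracts; events correlate by the saga's CorrelationId.
public record LoanCreationRequested(Guid CorrelationId) : CorrelatedBy<Guid>;
public record LoanCreated(Guid CorrelationId, string LoanId) : CorrelatedBy<Guid>;
public record ConsumerConnectAssociated(Guid CorrelationId) : CorrelatedBy<Guid>;
public record ConsumerConnectFailed(Guid CorrelationId, string Reason) : CorrelatedBy<Guid>;

public class LoanCreationState : SagaStateMachineInstance
{
    public Guid CorrelationId { get; set; }
    public string CurrentState { get; set; } = string.Empty;
    public string? EncompassLoanId { get; set; }
}

public class LoanCreationStateMachine : MassTransitStateMachine<LoanCreationState>
{
    public State CreatingLoan { get; private set; } = null!;
    public State AssociatingConsumerConnect { get; private set; } = null!;
    public State PartiallyFailed { get; private set; } = null!;
    public State Completed { get; private set; } = null!;

    public Event<LoanCreationRequested> LoanRequested { get; private set; } = null!;
    public Event<LoanCreated> LoanCreated { get; private set; } = null!;
    public Event<ConsumerConnectAssociated> CcAssociated { get; private set; } = null!;
    public Event<ConsumerConnectFailed> CcFailed { get; private set; } = null!;

    public LoanCreationStateMachine()
    {
        InstanceState(x => x.CurrentState);

        Initially(
            When(LoanRequested)
                .TransitionTo(CreatingLoan));

        During(CreatingLoan,
            // Critical step succeeded: remember the Encompass loan ID and move on.
            When(LoanCreated)
                .Then(ctx => ctx.Saga.EncompassLoanId = ctx.Message.LoanId)
                .TransitionTo(AssociatingConsumerConnect));

        During(AssociatingConsumerConnect,
            When(CcAssociated)
                .TransitionTo(Completed),
            // Non-critical step failed after retries: record a partial failure
            // instead of failing the saga; background retry or a support ticket
            // takes over from here.
            When(CcFailed)
                .TransitionTo(PartiallyFailed));
    }
}
```

The important design choice is that `PartiallyFailed` is a terminal-but-recoverable state: the loan officer can keep working while a separate process retries the secondary steps.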
The E-Consent Problem
Setting e-consent in Encompass revealed a particularly problematic failure mode: the API would return HTTP 200, but the e-consent fields wouldn't actually be set. This appeared to be triggered by certain business rule configurations or internal Encompass bugs, and it occurred unpredictably.
We discovered this through production failures. Loan officers would report that e-consent wasn't enabled despite our logs showing successful API calls. After isolating the issue, we implemented a write-then-validate pattern specifically for e-consent.
The solution: After setting e-consent, we immediately query the loan to verify the value was actually set. If it wasn't, we recursively retry up to five times. In production, most issues resolve on the second attempt, suggesting this is primarily a timing or business rule problem in Encompass rather than a permanent configuration issue.
[Flow diagram: e-consent write-then-validate. Send the e-consent update to the Encompass API; 4xx/5xx errors retry with backoff. On HTTP 200, query Encompass to verify the value was actually set. If it wasn't (a silent failure), recursively retry the set-and-validate cycle while the retry count is below 5; after 5 attempts, mark a partial failure and create a support ticket.]
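In code, the write-then-validate loop is small. The sketch below is a simplification: `IEncompassApiClient`, `SetEConsentAsync`, and `GetLoanAsync` are hypothetical stand-ins for whatever wrapper sits around the Encompass API (not the actual SDK surface), and it uses a loop with a short delay where the production logic retries recursively.

```csharp
// Hypothetical client abstraction over the Encompass API (not the real SDK surface).
public interface IEncompassApiClient
{
    Task SetEConsentAsync(string loanId, bool accepted, CancellationToken ct);
    Task<LoanSnapshot> GetLoanAsync(string loanId, CancellationToken ct);
}

public record LoanSnapshot(string LoanId, bool EConsentAccepted);

public static class EConsentWriter
{
    // Write-then-validate: set e-consent, read the loan back, and verify the
    // value actually stuck. An HTTP 200 alone is not proof the write succeeded.
    public static async Task<bool> SetEConsentWithValidationAsync(
        IEncompassApiClient client, string loanId, int maxAttempts = 5,
        CancellationToken ct = default)
    {
        for (var attempt = 1; attempt <= maxAttempts; attempt++)
        {
            await client.SetEConsentAsync(loanId, accepted: true, ct);

            var loan = await client.GetLoanAsync(loanId, ct);
            if (loan.EConsentAccepted)
                return true; // verified: the field is really set

            // Silent failure: wait briefly, then repeat the set + validate cycle.
            await Task.Delay(TimeSpan.FromSeconds(2 * attempt), ct);
        }

        // Attempts exhausted: caller marks the saga step as a partial failure
        // and opens a support ticket for manual intervention.
        return false;
    }
}
```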
Separating e-consent into its own saga step improved both reliability and code organization. The isolated write-validate-retry logic was cleaner, reusable for non-owning borrower scenarios, and less failure-prone than bundling e-consent with the loan creation payload.
Circuit Breakers and Bulkheads
Because Encompass is prone to transient errors, we implemented circuit breakers and bulkheads to prevent cascading failures and resource exhaustion.
Circuit breakers protect against overwhelming Encompass during outages. We use per-tenant partitioned circuit breakers set at a 50% failure threshold—higher than we'd normally set for a reliable API, but appropriate given Encompass's baseline error rate. When a circuit breaker opens, we create scheduled retries every 60 seconds with exponential backoff. This prevents hammering Encompass during outages while ensuring we eventually succeed once the system recovers.
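As a rough illustration of the thresholds involved, a per-tenant circuit breaker with a 50% failure ratio could be expressed with Polly's resilience pipelines. The options and the tenant-keyed registry below are assumptions for the sketch, not our exact production configuration.

```csharp
using System.Collections.Concurrent;
using System.Net.Http;
using Polly;
using Polly.CircuitBreaker;

// One circuit breaker per tenant, so one tenant's Encompass outage doesn't
// open the breaker for everyone else.
public class TenantCircuitBreakers
{
    private readonly ConcurrentDictionary<string, ResiliencePipeline> _pipelines = new();

    public ResiliencePipeline For(string tenantId) =>
        _pipelines.GetOrAdd(tenantId, _ => new ResiliencePipelineBuilder()
            .AddCircuitBreaker(new CircuitBreakerStrategyOptions
            {
                FailureRatio = 0.5,                        // tolerate Encompass's baseline error rate
                SamplingDuration = TimeSpan.FromSeconds(30),
                MinimumThroughput = 10,                    // don't trip on a handful of calls
                BreakDuration = TimeSpan.FromSeconds(60),  // probe again roughly every minute
                ShouldHandle = new PredicateBuilder()
                    .Handle<HttpRequestException>()
                    .Handle<TimeoutException>()
            })
            .Build());
}

// Usage sketch: wrap each Encompass call in the tenant's pipeline.
// await breakers.For(tenantId).ExecuteAsync(async ct => await CallEncompassAsync(ct), cancellationToken);
```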
Bulkheads isolate resources to prevent one tenant's problems from affecting others. We implement two types of bulkheads:
Thread pool isolation: We limit concurrent Encompass API calls per tenant based on their rate limits. Encompass defaults to 30 concurrent calls per lender environment, so we enforce this limit at our application layer. If a tenant reaches their concurrency limit, additional requests queue with backpressure rather than being rejected. We can't afford to lose update requests in a system where data consistency is critical.
Queue capacity limits: Each tenant has isolated queue capacity for both outbound updates and inbound webhook processing. This prevents a single tenant experiencing high load (bulk imports, cascading business rule updates) from consuming all available queue resources and impacting other tenants' integration performance.
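The first kind of bulkhead, the per-tenant concurrency cap, can be as simple as a keyed semaphore. The sketch below is an illustration, with a default of 30 to match Encompass's documented default; callers above the limit queue with backpressure rather than being rejected.

```csharp
using System.Collections.Concurrent;

// Per-tenant bulkhead: cap concurrent Encompass calls, queue the rest.
public class TenantBulkhead
{
    private readonly ConcurrentDictionary<string, SemaphoreSlim> _limits = new();
    private readonly int _maxConcurrency;

    public TenantBulkhead(int maxConcurrency = 30) => _maxConcurrency = maxConcurrency;

    public async Task<T> RunAsync<T>(string tenantId,
        Func<CancellationToken, Task<T>> encompassCall, CancellationToken ct = default)
    {
        var gate = _limits.GetOrAdd(tenantId, _ => new SemaphoreSlim(_maxConcurrency));

        // WaitAsync queues callers (backpressure) instead of failing fast,
        // because dropping an update request would mean lost data.
        await gate.WaitAsync(ct);
        try
        {
            return await encompassCall(ct);
        }
        finally
        {
            gate.Release();
        }
    }
}
```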
These isolation patterns proved essential in production. When one tenant's Encompass environment experiences an outage or their business rules trigger cascading updates, other tenants continue operating normally. The per-tenant circuit breakers and bulkheads contain the blast radius of any Encompass issues.
POS → Encompass: Managing Locks and Update Queues
Encompass's exclusive locking mechanism is one of the most challenging aspects of the integration. When a loan officer opens a loan in Encompass, the system holds an exclusive lock for the duration of their session, which can stretch to hours if they leave the browser open. Any attempt to update the loan during this period fails immediately.
Local Lock State Management
Rather than discovering locks by attempting writes and handling failures, we maintain local lock state synchronized via Encompass lock and unlock webhooks. When a loan is locked, we queue any pending updates. When it unlocks, we process the queue.
This approach has several advantages:
- No failed API calls due to lock conflicts
- Updates can be consolidated while queued
- We can provide user feedback about lock status without querying Encompass
- Reduces load on Encompass by batching updates
The queue processes in FIFO order to maintain update sequencing. When possible, we consolidate queued requests: if request 1 updates the borrower name, request 2 updates the borrower email, and request 3 updates the borrower name again, we merge them into a single request where the last value wins for duplicate properties, producing one payload with request 3's borrower name and request 2's borrower email.
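Consolidation itself is just a last-write-wins merge over the queued field updates. A minimal sketch, assuming each queued update is represented as a field/value dictionary:

```csharp
// Merge queued updates for a locked loan into one payload.
// Later updates win for duplicate field keys (last write wins).
public static class UpdateConsolidator
{
    public static Dictionary<string, object?> Consolidate(
        IEnumerable<Dictionary<string, object?>> queuedUpdatesInFifoOrder)
    {
        var merged = new Dictionary<string, object?>();
        foreach (var update in queuedUpdatesInFifoOrder)
        {
            foreach (var (field, value) in update)
            {
                merged[field] = value; // later value overwrites earlier one
            }
        }
        return merged;
    }
}

// Example: request 1 sets BorrowerName, request 2 sets BorrowerEmail, and
// request 3 sets BorrowerName again -> one payload containing request 3's
// BorrowerName and request 2's BorrowerEmail.
```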
Conflict Resolution
If a loan remains locked for an extended period and conflicting updates occur in both systems, we treat Encompass as the source of truth. For example, if a loan is locked for 60 minutes and the borrower name is updated in both Encompass (at minute 0) and our POS (at minute 30), the Encompass value takes precedence when we finally sync.
This was a business decision, not a technical constraint. Once a loan reaches Encompass, it becomes the system of record. Our POS exists to collect initial data and provide borrower-facing features, but Encompass is where loan officers perform the bulk of their work.
Reconciliation and Recovery
Webhooks aren't perfectly reliable, so we can't trust lock state based solely on webhook delivery. We run background jobs every hour to reconcile lock state and detect missed updates:
- Lock status reconciliation: If a loan's lock timestamp hasn't updated in an unusually long period, we query the Encompass API to verify actual lock state
- Modified date comparison: We query modified dates in both our database and Encompass to detect missed updates, triggering sync for any loans that show more recent Encompass modifications
[Flow diagram: hourly reconciliation job. One branch checks for stale lock timestamps, queries Encompass to verify the actual lock status, and updates local lock state. The other branch compares modified dates between our database and Encompass, queries Encompass for recent modifications, and syncs any missed updates.]
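The modified-date half of this reconciliation reduces to a timestamp comparison: any loan whose Encompass copy is newer than our last sync likely missed a webhook and gets re-queued. A minimal sketch, assuming we track a last-synced timestamp per loan locally:

```csharp
// Given locally tracked loans and a lookup of Encompass's last-modified date,
// return the loan IDs whose Encompass copy is newer than our last sync.
public static class ModifiedDateReconciler
{
    public static IEnumerable<string> FindLoansNeedingResync(
        IEnumerable<(string LoanId, DateTimeOffset LastSyncedAt)> trackedLoans,
        Func<string, DateTimeOffset> getEncompassLastModified)
    {
        foreach (var (loanId, lastSyncedAt) in trackedLoans)
        {
            // Encompass shows a newer modification than we have: re-sync it.
            if (getEncompassLastModified(loanId) > lastSyncedAt)
                yield return loanId;
        }
    }
}
```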
For immediate user needs, we provide a manual sync button in our admin UI. If a user reports a data inconsistency, support staff can trigger an immediate sync without waiting for the next background job run.
Encompass → POS: Webhook Processing and De-duplication
Encompass fires webhooks for create, update, and delete events. However, these webhooks only contain the loan ID—we must query the API to retrieve actual loan data. This creates opportunities for optimization through batching and de-duplication.
Handling Webhook Volume
Encompass can be extremely chatty. Bulk imports, business rule changes, or cascading updates can generate hundreds of webhooks in seconds for the same loan. Processing each webhook individually would overwhelm both our system and Encompass's API.
Our solution: Queue all incoming webhooks with a slight delay before processing begins. This delay window lets us collect multiple webhooks for the same loan and consolidate them into a single API fetch. Since each webhook carries only the loan ID and the last write wins, duplicate webhooks for the same loan are dropped and only one entry for that loan remains in the queue.
[Flow diagram: webhook processing pipeline. Incoming create/update/delete webhooks are added to a queue carrying only the loan ID, then held in a delay window to collect and de-duplicate. Duplicate loan IDs are dropped; the remaining entries trigger a single fetch of loan data from the Encompass API, followed by data mapping with user matching by email, a database write, and async task generation.]
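The delay window and de-duplication can be modeled as a keyed buffer: webhooks for the same loan ID collapse into a single pending entry, and processing starts only once the window elapses. A sketch under those assumptions (the class and member names are illustrative):

```csharp
using System.Collections.Concurrent;

// Collect webhook loan IDs during a short delay window and collapse duplicates,
// so a burst of webhooks for one loan results in a single API fetch.
public class WebhookDedupBuffer
{
    private readonly ConcurrentDictionary<string, DateTimeOffset> _pending = new();
    private readonly TimeSpan _delayWindow;
    private readonly Func<string, Task> _processLoan;

    public WebhookDedupBuffer(TimeSpan delayWindow, Func<string, Task> processLoan)
        => (_delayWindow, _processLoan) = (delayWindow, processLoan);

    // Duplicate loan IDs are dropped: TryAdd keeps only one pending entry per
    // loan, and the eventual fetch returns the latest data anyway (last write wins).
    public void Enqueue(string loanId) => _pending.TryAdd(loanId, DateTimeOffset.UtcNow);

    // Called periodically by a background worker (for example, every second).
    public async Task FlushDueAsync()
    {
        var dueBefore = DateTimeOffset.UtcNow - _delayWindow;
        foreach (var (loanId, receivedAt) in _pending)
        {
            if (receivedAt <= dueBefore && _pending.TryRemove(loanId, out _))
                await _processLoan(loanId); // one fetch per loan, regardless of webhook count
        }
    }
}
```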
Rate Limiting and Recursive Update Prevention
We've observed cases where updating Encompass triggers business rules that fire webhooks back to us, which could trigger another update, creating infinite loops. While Encompass attempts to prevent these loops, it's not something we rely on.
We implement several defensive layers:
- De-duplication: Drop duplicate loan IDs from the queue before processing
- Rate limiting: If the same loan generates more than three webhook events per second, we throttle processing and log the anomaly
- Logging and alerting: Track webhook patterns to detect runaway updates (e.g., the same loan updating once per second for an hour)
These safeguards have proven essential. We don't trust Encompass or any third-party system to handle even basic protections reliably.
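The per-loan rate limit mentioned above is a simple sliding-count check; anything over the threshold is throttled and logged as an anomaly. A rough sketch, assuming the three-events-per-second threshold:

```csharp
using System.Collections.Concurrent;

// Detect runaway webhook loops: more than N events per second for the same
// loan is throttled and logged as an anomaly.
public class LoanWebhookRateLimiter
{
    private readonly ConcurrentDictionary<string, Queue<DateTimeOffset>> _recent = new();
    private readonly int _maxEventsPerSecond;

    public LoanWebhookRateLimiter(int maxEventsPerSecond = 3)
        => _maxEventsPerSecond = maxEventsPerSecond;

    public bool ShouldThrottle(string loanId)
    {
        var now = DateTimeOffset.UtcNow;
        var window = _recent.GetOrAdd(loanId, _ => new Queue<DateTimeOffset>());

        lock (window)
        {
            // Drop events older than one second, then check the remaining count.
            while (window.Count > 0 && now - window.Peek() > TimeSpan.FromSeconds(1))
                window.Dequeue();

            window.Enqueue(now);
            return window.Count > _maxEventsPerSecond;
        }
    }
}
```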
User Matching and Task Generation
When processing loan updates from Encompass, we match users by email and add them to the loan in our system. If matched users have confirmed emails, we send notifications that they've been added. Task generation for new borrowers happens asynchronously and doesn't block the sync process.
We intentionally don't auto-create user accounts from Encompass data. Email addresses in Encompass can be incorrect or premature, so we require loan officers to manually invite borrowers. This prevents spam and accidental invitations to wrong addresses.
Encompass can either hard delete or move loans to a trash folder (soft delete). In both cases, we implement soft deletes in our system to preserve audit trails and enable potential recovery.
Operational Resilience
Reliable integration requires more than just good architecture. It requires operational visibility and multiple layers of verification.
Multiple Reconciliation Layers
We don't trust any single mechanism for data consistency. Our reconciliation strategy includes:
- Webhook-driven updates for real-time sync (primary path)
- Background jobs checking modified dates and lock status (catches missed webhooks)
- Manual sync buttons in the admin UI (immediate user-driven recovery)
- Automated support tickets for failures requiring manual intervention
Each layer catches failures the previous layer might miss. This defense-in-depth approach has proven essential for maintaining data consistency despite Encompass's unreliability.
Monitoring and Visibility
We use Datadog for engineering monitoring, tracking saga success rates, circuit breaker states, queue depths, and webhook processing latency. For operational needs, we built an admin UI with per-loan filtering that shows integration status, logs all API interactions, and provides manual override controls.
This dual approach serves different audiences: engineers need aggregate metrics and alerting, while support staff need loan-specific details and the ability to trigger immediate actions.
Lessons Learned and Best Practices
Building this integration taught us several principles that apply to any unreliable third-party system:
Don't trust third-party APIs. Rare as it may be, even an HTTP 200 response can represent a failure. Always validate critical operations explicitly, especially for systems with poor documentation or known bugs.
Business-driven failure prioritization. Not all failures are equal. Understanding which operations are truly critical versus nice-to-have allows you to design partial failure states that balance reliability with user experience.
Queue-based patterns for ordering and consolidation. When dealing with high-volume updates or long-running locks, queues provide natural points for consolidation, de-duplication, and backpressure management.
Local state management reduces API dependency. Maintaining lock state, integration status, and pending operations locally allows you to make decisions without constant API queries, improving both performance and resilience.
Multiple reconciliation layers catch what webhooks miss. Webhooks are never perfectly reliable. Background jobs that periodically reconcile state are essential for long-term data consistency.
Operational visibility and manual overrides are essential. No matter how good your automation is, there will be edge cases that require human intervention. Provide your support team with the visibility and tools they need to resolve issues quickly.
Rate limiting and circuit breakers prevent cascading failures. Defensive mechanisms aren't just for your system—they protect the third-party system from being overwhelmed by your integration as well.
Conclusion
Encompass integration requires defensive architecture and pragmatic tradeoffs. The patterns described here—saga orchestration with partial failures, lock-aware queuing, webhook reconciliation, and operational resilience—aren't specific to Encompass. They apply to any integration with unreliable third-party systems, particularly in regulated industries where enterprise software often prioritizes features over API reliability.
The key is balancing reliability, user experience, and operational overhead. Perfect reliability is impossible with systems like Encompass, but well-designed architecture can provide the resilience and observability needed for production use.
If you're building integrations with unreliable third-party systems—whether in mortgage tech or elsewhere—I'd be interested in hearing about the patterns you've found effective. Feel free to connect with me on LinkedIn or reach out to discuss approaches to these challenges.
