Reliability

CustomerNode is a multi-tenant SaaS platform that customers run their day-to-day work on. This page describes how we operate the platform for availability, how we respond when something goes wrong, and how we communicate with customers during and after incidents.

Availability commitment

Specific availability commitments — including any service-level agreement and associated remedies — are documented in your enterprise agreement and Master Services Agreement. This page describes our operational approach; it does not create or modify contractual SLAs.

Platform availability: All systems operational

99.99% over the trailing 5 days · as of 2026-07-11

2026-07-06 · 100.00%
2026-07-07 · 100.00%
2026-07-08 · 99.93%
2026-07-09 · 100.00%
2026-07-10 · 100.00%

Measured by our own monitoring from production endpoint health checks. The measurement is gap-aware: windows where monitoring itself was unavailable count as downtime rather than being excluded. This is an operational measurement, not a contractual SLA.
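To illustrate what a gap-aware measurement means in practice, here is a minimal sketch in Python. It assumes one health check per fixed time slot; the function name, the data shape, and the default slot length are illustrative assumptions, not a description of production code. The key point is that the denominator is the number of expected slots, so a slot with no data lowers the result instead of disappearing from it.

```python
from datetime import datetime, timedelta, timezone

def gap_aware_availability(results, start, end, interval_s=60):
    """Percent of expected check slots in [start, end) with a passing check.

    `results` maps a slot start time to True (passed) or False (failed).
    A slot with no entry means monitoring produced no data; it counts as
    downtime instead of being dropped from the denominator.
    """
    step = timedelta(seconds=interval_s)
    expected = 0
    passed = 0
    slot = start
    while slot < end:
        expected += 1
        if results.get(slot) is True:
            passed += 1
        slot += step
    if expected == 0:
        raise ValueError("window contains no check slots")
    return 100.0 * passed / expected

# Synthetic day: 1,440 one-minute slots, with one slot missing entirely.
day_start = datetime(2026, 1, 1, tzinfo=timezone.utc)
day_end = day_start + timedelta(days=1)
results = {day_start + timedelta(minutes=m): True for m in range(1440)}
del results[day_start + timedelta(minutes=300)]  # simulated monitoring gap

print(f"{gap_aware_availability(results, day_start, day_end):.2f}%")  # prints 99.93%
```

In this synthetic example a single missing one-minute slot out of 1,440 yields 99.93% for the day. A measurement that excluded the gap would report 100%.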

Integration Reliability Core

Reliability at CustomerNode spans two layers, operated together as the Integration Reliability Core:

Platform availability — our services. Public endpoints are health-checked continuously on a 60-second cadence; the measured availability above reflects real endpoint reachability.
Third-party resilience — their services. Outbound calls to integrations (Salesforce, Slack, Jira Service Management, Google Calendar, and others) run through a durable queue: a call that fails because a third party is briefly unavailable is automatically retried with backoff, so a hiccup on their side does not silently drop your data.
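The retry behavior described for outbound integration calls can be sketched as follows. This is an illustrative Python example of retrying with jittered exponential backoff, not the platform's implementation: `TransientError`, `backoff_delays`, `call_with_retry`, and every default value are hypothetical. The sketch covers only the retry loop; the durability a queue provides (persisting the job between attempts) is out of scope here.

```python
import random
import time

class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. timeouts)."""

def backoff_delays(max_attempts, base_s=1.0, cap_s=60.0):
    """Upper bound on the wait before each retry: base * 2^n, capped."""
    return [min(cap_s, base_s * 2 ** n) for n in range(max_attempts - 1)]

def call_with_retry(call, max_attempts=5, base_s=1.0, cap_s=60.0,
                    sleep=time.sleep, rng=random.random):
    """Run `call`, retrying transient failures with jittered backoff.

    Re-raises the last TransientError once attempts are exhausted, so the
    caller (for example a queue worker) can park the job rather than drop it.
    """
    delays = backoff_delays(max_attempts, base_s, cap_s)
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise
            # Wait a random fraction of the bound so that many queued jobs
            # do not all retry at the same instant.
            sleep(delays[attempt] * rng())

# Usage: a simulated third-party call that fails twice, then succeeds.
attempts = {"n": 0}

def flaky_send():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientError("third party briefly unavailable")
    return "delivered"

print(call_with_retry(flaky_send, sleep=lambda s: None))  # prints "delivered"
```

Capping the delay keeps a long outage from pushing retries arbitrarily far apart, and re-raising after the final attempt leaves the decision about the job to the caller.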

Availability approach

Production infrastructure runs in independently audited cloud environments with redundancy at the compute, storage, and network layers. Critical infrastructure components are designed with redundancy and failover mechanisms intended to reduce the likelihood of customer-visible downtime. Component restarts and rolling deploys are routine operational events, not incidents.

Architecture for resilience

Managed edge: public-facing endpoints sit behind a managed edge layer that performs TLS termination, traffic filtering, and rate limiting.
Stateless application tier: the application is horizontally scalable; instances can be replaced and rescheduled independently.
Replicated storage: primary databases use managed services with redundancy and point-in-time recovery.
Health monitoring: production services are regularly monitored, and failed instances are removed from rotation automatically.
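As a rough illustration of removing failed instances from rotation, the Python sketch below tracks consecutive failed health checks per instance and stops routing to an instance once it crosses a threshold. The `Pool` and `Instance` classes, the threshold of three failures, and the rule that one passing check restores an instance are all assumptions made for the example; managed load balancers expose their own configuration for this.

```python
from dataclasses import dataclass, field

@dataclass
class Instance:
    name: str
    consecutive_failures: int = 0
    in_rotation: bool = True

@dataclass
class Pool:
    unhealthy_after: int = 3  # assumed threshold, for illustration only
    instances: dict = field(default_factory=dict)

    def add(self, name):
        self.instances[name] = Instance(name)

    def record_check(self, name, healthy):
        """Apply one health-check result to the named instance."""
        inst = self.instances[name]
        if healthy:
            # A passing check resets the counter and restores the instance.
            inst.consecutive_failures = 0
            inst.in_rotation = True
        else:
            inst.consecutive_failures += 1
            if inst.consecutive_failures >= self.unhealthy_after:
                inst.in_rotation = False

    def routable(self):
        """Names of instances that should currently receive traffic."""
        return sorted(n for n, i in self.instances.items() if i.in_rotation)

pool = Pool()
pool.add("app-1")
pool.add("app-2")
for _ in range(3):
    pool.record_check("app-2", healthy=False)
print(pool.routable())  # prints ['app-1']
```

Because the application tier is stateless, taking an instance out of rotation in this way does not require moving any session data first.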

Backup & recovery

Production data is backed up on a regular schedule. Backups are encrypted at rest and stored separately from primary infrastructure. We periodically test restoration procedures to validate recovery objectives. Recovery point and recovery time objectives are documented internally and shared with enterprise customers under NDA on request.

Maintenance windows

The platform supports rolling deployments for most changes without customer-visible downtime. When a change requires customer-visible downtime — possible for certain database migrations or infrastructure cutovers — we schedule it during a low-traffic window and provide advance notice to affected customers via email and in-app notice.

Incident response

We operate a documented incident response process covering detection, triage, containment, eradication, recovery, and post-incident review.

Detection: automated monitoring of error rates, latency, and authentication anomalies; supplemented by customer reports.
Triage & severity: incidents are classified by customer impact (SEV-1 platform-wide outage, SEV-2 partial degradation, SEV-3 limited impact).
Containment: on-call engineers stabilize the platform before pursuing root cause; this may include traffic shedding or feature degradation to preserve core functionality.
Recovery: we restore service and validate that customer-visible behavior is correct before declaring resolution.
Post-incident review: material incidents are reviewed internally and the findings drive corrective actions tracked to completion.

Customer notification

During material platform incidents, CustomerNode communicates with affected customers on a commercially reasonable cadence appropriate to the severity and scope of the incident.

Proactive updates: we initiate communication once incident scope is confirmed, rather than waiting on customer reports.
Ongoing communication: while an incident is open, we provide updates as the situation evolves.
Post-resolution summary: following material incidents, we share with affected customers a summary of what happened, customer-visible impact, and the corrective actions we are taking.
Security incidents involving customer data: handled under the separate notification commitments described in our Data Processing Addendum.

Business continuity & disaster recovery

Our business continuity plan addresses scenarios including cloud infrastructure disruption, loss of a critical subprocessor, and loss of access to office or personnel infrastructure. The plan is reviewed on a recurring schedule. A summary of recovery objectives and validated scenarios is available to current and prospective customers under NDA during security review.

Subprocessor & dependency monitoring

Our reliability depends in part on third-party services (see Subprocessors). We monitor the health of critical subprocessors and have documented fallback or degradation strategies for the dependencies whose loss would otherwise be customer-visible.

Continuous monitoring

In addition to managed-infrastructure monitoring, we operate an internal monitoring service, the CustomerNode Watchdog, that evaluates production health continuously at short intervals. It checks compute, memory, and storage capacity; platform services; public endpoints; the primary database and cache; TLS certificates; backup freshness and integrity; critical third-party dependencies; and security signals such as host file integrity, authentication anomalies, and unexpected network activity.

A condition must persist before it generates an alert, which limits noise from transient events, and on-call engineers are notified when an issue is confirmed. The monitoring service is itself supervised by an independent heartbeat, so a failure of the monitoring service is detected rather than silent. Scheduled daily and weekly reports record an overall health score, capacity and usage trends, backup and certificate status, application error rates and latency, and security observations. The service surfaces information for engineers to act on; remediation follows the incident response process described above.
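The two ideas in the paragraph above, alerting only on persistent conditions and supervising the monitor with a heartbeat, can be sketched in Python as follows. This is an illustrative example under stated assumptions: `PersistentCondition`, `heartbeat_stale`, the 120-second hold time, and the timestamps are hypothetical and do not describe the Watchdog's actual logic or thresholds.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PersistentCondition:
    """Fires only after a condition has held continuously for `hold_s` seconds."""
    hold_s: float
    first_seen: Optional[float] = None
    alerted: bool = False

    def observe(self, failing: bool, now: float) -> bool:
        """Record one evaluation; return True exactly when an alert is due."""
        if not failing:
            # Condition cleared: reset so a later failure starts a new timer.
            self.first_seen = None
            self.alerted = False
            return False
        if self.first_seen is None:
            self.first_seen = now
        if not self.alerted and now - self.first_seen >= self.hold_s:
            self.alerted = True
            return True
        return False

def heartbeat_stale(last_beat: float, now: float, max_silence_s: float) -> bool:
    """True when the monitor itself has been silent for too long."""
    return now - last_beat > max_silence_s

cond = PersistentCondition(hold_s=120)
# (time in seconds, failing?) pairs: a blip at t=0 clears at t=60,
# then a sustained failure starts at t=120.
timeline = [(0, True), (60, False), (120, True), (180, True), (240, True), (300, True)]
fired = [t for t, failing in timeline if cond.observe(failing, t)]
print(fired)  # prints [240]
```

The transient failure at t=0 never alerts, and the sustained failure alerts once, at t=240, after holding for the full 120 seconds. The heartbeat check would run in a separate supervisor, so that a stopped monitor shows up as a stale heartbeat instead of as silence.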

The examples below are generated from the same report templates we use internally, populated with synthetic data:

Nightly health report (sample)

The end-of-day report: overall health score, capacity and usage trends, services, backups, certificates, application metrics, and security posture.

View sample →

Error digest (sample)

The aggregated application-error report: grouped exceptions, counts, affected paths, and sample diagnostics from a recent window.

View sample →

These are internal operational artifacts, not a customer-facing feature, shown here with synthetic data to illustrate how we monitor service quality. Live production reports are not publicly available.

Status & transparency

Customers who need real-time visibility into platform health, or notification of in-progress incidents, may request to be added to our incident notification list at [email protected]. A public status page is on our roadmap; until it is live, in-progress SEV-1 and SEV-2 incidents are communicated by email to affected tenants and via in-app notices where possible.
