Splunk Observability Cloud degradation
Incident Report for Splunk Observability Cloud US2
Resolved
As of 7pm PST the last systems have been brought back to nominal operations and this incident has been resolved. Splunk APM data will be unavailable or will appear incomplete for the time range between 4pm and 7pm PST.
Posted Jan 08, 2022 - 19:22 PST
Monitoring
Our cloud provider has confirmed that the recovery we've been seeing is real, and their networking issues have been resolved. We are continuing to work on bringing all systems back to a stable state. At this point, most of the Splunk Observability Cloud offerings are operating nominally.

Splunk APM's Monitoring MetricSets have recovered to real-time processing, and we're actively working on bringing the rest of the APM trace data processing pipelines back online for Troubleshooting MetricSets and raw trace data and trace search. As a part of this effort, and to accelerate the recovery of those pipelines and bring the most important real-time visibility back to our customers, we will be skipping over the data ingested during the incident; a precise time range of the lost data will be provided when the incident is resolved.
Posted Jan 08, 2022 - 19:03 PST
Update
We are still waiting on further updates from our cloud provider in this region, but we are starting to see early signs of stability and recovery. The Splunk Observability Cloud Web Interface is operational and responding normally, datapoint ingest is operational, and metric timeseries based charts, dashboards and detectors are functional.

Splunk APM Monitoring MetricSets have started their recovery and should be back to real-time within the next 30 minutes. We are working on bringing the Troubleshooting MetricSets and raw trace processing pipelines back online.
Posted Jan 08, 2022 - 18:45 PST
Identified
We have identified that the outage is related to network issues at our cloud provider (Status - https://status.cloud.google.com/incidents/NMcnk6aE8xMHHwRGmyry). We are working with the cloud provider to resolve it and ensure the availability of our services.

The current impact on Splunk Observability Cloud includes:
- The web interface and login have degraded performance, and logging in may require multiple attempts.
- Charts may fail to load, or may load slowly or intermittently.
- Detectors are not alerting in real time.
- Processing of Splunk APM trace spans is delayed, leading to Troubleshooting MetricSets, Monitoring MetricSets, and raw traces not being available or not representing the most current data.
- Small amounts of trace data were lost at the onset of the incident, and we may drop a small amount of data to bring the system back to real time once the cloud provider network issue is resolved. Data ingest for both datapoints and traces is otherwise not affected at this time.
Posted Jan 08, 2022 - 17:00 PST
Investigating
We are investigating an issue impacting several systems of the Splunk Observability Cloud. The web interface, including dashboards and charts, may be slow to load. Processing of Splunk APM trace spans is delayed, leading to Troubleshooting MetricSets, Monitoring MetricSets and raw traces not being available or not representing the most current data. Small amounts of trace data were lost at the onset of the incident, but data ingest for both datapoints and traces is otherwise not affected at this time.
Posted Jan 08, 2022 - 15:36 PST
This incident affected: Splunk Observability Cloud Web Interface and Splunk APM (Splunk APM Monitoring MetricSets, Splunk APM Troubleshooting MetricSets, Splunk APM Trace Data, Splunk APM Tag Spotlight).