Data Engineering

From SCADA Historian to ML Pipeline: Architecture Patterns for Midstream Operations

12 min read Midstreamly Engineering Team

Getting from raw SCADA historian data to a working machine learning pipeline for rotating equipment anomaly detection is not a software problem — it is a data engineering problem that happens to involve software. The historian connection is rarely the hard part. The hard parts are tag selection and data quality validation, time-series alignment when data comes from multiple systems with different scan rates, and establishing a clean labeling scheme for supervised model components. This article documents the architecture patterns that hold up in midstream field deployments, based on what we've seen across gas processing and pipeline pump station environments in the Texas and Louisiana Gulf Coast corridor.

Starting Point: What Your AVEVA PI Historian Actually Stores

OSIsoft PI System (now AVEVA PI) is the dominant historian in midstream operations. Most large operators run PI Server versions 2018 or later with PI Asset Framework (PI AF) hierarchies that map tags to physical asset context. For a rotating equipment condition monitoring project, the PI AF hierarchy is your roadmap — it tells you how tags are organized relative to physical assets (compressor train, pump set, driver) and reveals the naming conventions for the tags you care about.

The PI System does not store every scan. Exception reporting at the interface and compression at the archive (the swinging-door algorithm) both discard values that stay within a configured deviation of the recent signal, so a new value is written only when the signal moves by more than that threshold or a configured maximum interval elapses. This is critical to understand before you start pulling historian data for ML feature engineering. A bearing temperature tag configured with a 0.5°C exception deviation will only write values when temperature changes by that amount, meaning that during stable operation you may have gaps of hours in the stored data, not because the sensor stopped reporting but because the value didn't move. A naive time-series ML pipeline that forward-fills these gaps will produce misleading features: long flat runs understate variance and turn slow drift into artificial steps. The correct approach is either to configure vibration-relevant tags to archive at regular intervals (for example, by tightening or disabling the exception and compression settings on those points), or to configure explicit periodic retrieval in your data extraction layer.
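As a rough illustration of periodic retrieval done on the client side, the sketch below resamples sparse archived values onto a fixed grid with linear interpolation and flags points that sit inside an unusually long archive gap. The function name, the `(timestamp_seconds, value)` tuple layout, and the `max_gap` parameter are assumptions made for this example, not part of any PI client library.

```python
from bisect import bisect_right

def resample_linear(samples, start, end, step, max_gap):
    """Resample sparse (t_seconds, value) pairs onto a regular grid.

    samples must be sorted by time with unique timestamps.
    Returns a list of (t, value, stale) tuples, where stale is True when
    the surrounding archive gap is longer than max_gap seconds.
    """
    times = [t for t, _ in samples]
    out = []
    t = start
    while t <= end:
        i = bisect_right(times, t)  # count of samples at or before t
        if i == 0:
            # No archived value yet: emit a gap rather than inventing data.
            out.append((t, None, True))
        elif i == len(samples):
            # Past the last archived value: hold it, flag it if it is old.
            t0, v0 = samples[-1]
            out.append((t, v0, (t - t0) > max_gap))
        else:
            # Between two archived values: interpolate linearly.
            (t0, v0), (t1, v1) = samples[i - 1], samples[i]
            frac = (t - t0) / (t1 - t0)
            out.append((t, v0 + frac * (v1 - v0), (t1 - t0) > max_gap))
        t += step
    return out
```

Carrying the `stale` flag into feature extraction lets downstream code exclude or down-weight windows dominated by long archive gaps instead of treating interpolated values as measurements.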

For vibration-specific tags, if your protection system (Bently Nevada System 1, Emerson AMS, or similar) has a historian connector writing to PI, the scan rate is typically configurable. Vibration overall values (RMS velocity, displacement p-p) are often logged at 1–60 second intervals. High-frequency spectral data is generally not stored in the process historian — it lives in the machinery management software's own database and requires a separate extraction strategy.

Tag Selection Strategy

A common mistake in early-stage condition monitoring deployments is to pull every available tag and try to build models on the full dataset. For a compressor train, that might be 300–500 PI tags. Models built on 500 features, without selection guided by domain knowledge, tend to pick up correlations that are artifacts of operational patterns rather than early indicators of mechanical degradation.

A structured tag selection approach for a centrifugal compressor train starts with three categories:

  • Vibration tags: Radial vibration X/Y on each bearing (proximity probe, mils p-p), casing vibration on bearing housing (velocity mm/s or acceleration g RMS), axial position (mils), phase reference (keyphasor).
  • Process condition tags: Suction/discharge pressure and temperature, molecular weight (if available), inlet flow, speed (RPM from the governor or directly from a Bently Nevada speed input), interstage pressures for multi-stage units.
  • Condition indicators: Bearing temperatures (both drive and non-drive end), lube oil header pressure and temperature, seal gas differential pressure, motor current (for motor-driven compressors).

This typically produces 30–60 tags per compressor train that cover the failure modes of interest, rather than 500 tags covering every process variable ever instrumented. The domain knowledge used in this selection is what separates a condition monitoring pipeline from a general-purpose data lake.
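One way to make the three categories above explicit is a small pattern map applied to the tag list exported from PI AF. The sketch below is only illustrative: the tag naming convention (`K101.VIB.DE.X` and so on) and every pattern in `TAG_PATTERNS` are hypothetical, and a real site would substitute its own conventions.

```python
from fnmatch import fnmatchcase

# Hypothetical naming patterns; replace with the site's own tag conventions.
TAG_PATTERNS = {
    "vibration": ["*.VIB.*", "*.AXPOS", "*.KPH"],
    "process": ["*.PSUC", "*.PDIS", "*.TSUC", "*.TDIS", "*.FLOW", "*.SPEED"],
    "condition": ["*.TBRG.*", "*.LOP", "*.LOT", "*.SGDP", "*.IMOT"],
}

def select_tags(all_tags):
    """Group tags by category and drop anything matching no pattern."""
    selected = {category: [] for category in TAG_PATTERNS}
    for tag in all_tags:
        for category, patterns in TAG_PATTERNS.items():
            if any(fnmatchcase(tag, p) for p in patterns):
                selected[category].append(tag)
                break  # a tag belongs to at most one category
    return selected
```

Keeping the pattern map in version control gives a reviewable record of why each tag is in the model's input set.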

Time-Series Windowing and Feature Extraction

For anomaly detection on rotating equipment, the input to an ML model is not a single-point value — it is a feature vector computed over a rolling window of historical data. The window length needs to balance responsiveness (short windows detect recent changes faster) against noise sensitivity (longer windows smooth out transient process upsets that shouldn't trigger alerts).

In practice, a 24-hour rolling window with 1-hour stride produces a feature vector update rate that works well for the failure modes most relevant to midstream rotating equipment. Bearing defect frequency energy tends to grow over days to weeks, not hours, so a 24-hour window captures meaningful trend information while updating frequently enough to be operationally useful. For surge detection on centrifugal compressors — where the relevant dynamics operate on a seconds-to-minutes timescale — this window is far too long, and a separate near-real-time pipeline with 1–5 minute windows is more appropriate.
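The windowing itself is simple to sketch. The generator below assumes samples already sit on a regular grid as sorted `(timestamp_seconds, value)` pairs; the function name and defaults (24-hour window, 1-hour stride) are illustrative choices that mirror the numbers above.

```python
from bisect import bisect_right

def rolling_windows(samples, window_s=24 * 3600, stride_s=3600):
    """Yield (window_end, values) for each full trailing window.

    Each window covers the half-open interval (window_end - window_s, window_end].
    """
    times = [t for t, _ in samples]
    if not times:
        return
    end = times[0] + window_s  # first point at which a full window exists
    while end <= times[-1]:
        lo = bisect_right(times, end - window_s)
        hi = bisect_right(times, end)
        yield end, [v for _, v in samples[lo:hi]]
        end += stride_s
```

For the near-real-time surge pipeline, the same function could be called with `window_s` between 60 and 300 seconds and a correspondingly short stride.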

Feature extraction from the time-series window should include: statistical features (mean, variance, RMS, kurtosis, crest factor) on vibration channels; operating point features (mean speed, mean flow, speed variance to detect load cycling); and, where spectral data is available, energy in defined frequency bands corresponding to bearing defect frequencies and harmonics. The operating point features serve double duty — they are predictive features themselves, and they define the normalization context for vibration features (a vibration RMS of 0.45 mm/s means different things at 50% load versus 100% load).
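A minimal sketch of the statistical features for one vibration channel over one window follows. The function name and the returned dictionary keys are assumptions for illustration; kurtosis here is the plain fourth standardized moment (about 3 for Gaussian data), not excess kurtosis.

```python
import math

def vibration_features(x):
    """Compute basic statistical features for one window of values."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n          # population variance
    rms = math.sqrt(sum(v * v for v in x) / n)
    peak = max(abs(v) for v in x)
    m4 = sum((v - mean) ** 4 for v in x) / n           # fourth central moment
    return {
        "mean": mean,
        "variance": var,
        "rms": rms,
        "kurtosis": m4 / var ** 2 if var > 0 else 0.0,
        "crest_factor": peak / rms if rms > 0 else 0.0,  # peak over RMS
    }
```

Operating point features such as mean speed and speed variance can be computed with the same function on the speed channel and stored alongside the vibration features, so that the model sees vibration in its load context.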

OT Network Architecture: Read-Only Access and the Firewall Problem

OT/IT network architecture in midstream facilities varies enormously. Some operators have relatively open IT/OT network boundaries; others maintain strict Purdue Model segmentation, where the control and operations levels (Level 2 SCADA and the Level 3 historian) are separated from the Level 4 business network by a data diode or a DMZ. Designing your data pipeline architecture for the most restrictive case, and then relaxing constraints where the operator's environment allows, is the only approach that works across the fleet.

The access pattern for historian-based condition monitoring should be read-only, via PI Web API (for modern PI 2016+ installations) or an OPC DA/UA connector for older installations. PI Web API serves authenticated HTTPS requests over TCP/443, which is often the only port allowed through OT network firewalls. The alternative, direct PI SDK connections over the native PI port TCP/5450, requires a more permissive firewall rule that many OT security teams are reluctant to open.
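As a hedged illustration of the read-only pattern, the helper below only builds the GET URL for PI Web API's interpolated stream endpoint (`streams/{webId}/interpolated` with `startTime`, `endTime`, and `interval`). The host name, the Web ID value, and the function name are placeholders; authentication, Web ID lookup, and the HTTP call itself are deliberately left out.

```python
from urllib.parse import quote, urlencode

def interpolated_url(base_url, web_id, start="*-24h", end="*", interval="1m"):
    """Build a read-only (GET) URL for evenly spaced historian values.

    base_url is the PI Web API root, e.g. "https://<host>/piwebapi".
    start, end, and interval use PI time syntax ("*" means now).
    """
    query = urlencode({"startTime": start, "endTime": end, "interval": interval})
    return f"{base_url.rstrip('/')}/streams/{quote(web_id, safe='')}/interpolated?{query}"

# Placeholder host and Web ID, for illustration only.
url = interpolated_url("https://pi.example.com/piwebapi", "ABC123")
```

Because the extraction layer only ever issues GET requests, the service account can be limited to read permission on the relevant points.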

For facilities that prohibit any inbound connection to the OT network, an on-premise data collector agent (lightweight software installed in the OT DMZ or on the PI server host) can queue data and push batches to a cloud processing endpoint over an outbound-only HTTPS connection. This avoids the need for any inbound firewall rules to the OT environment, which satisfies the security posture of most midstream operators with formal OSHA 29 CFR 1910.119 PSM programs that treat OT network integrity as a process safety control.
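The queue-and-push behavior of such an agent can be sketched as a bounded store-and-forward buffer. The class name and parameters below are hypothetical, and `send` stands in for whatever outbound HTTPS POST the deployment actually uses; it is assumed to return True on success.

```python
from collections import deque

class StoreAndForward:
    """Buffer records locally and push them in batches when the link is up."""

    def __init__(self, send, batch_size=500, max_buffer=100_000):
        self._send = send              # callable(list_of_records) -> bool
        self._batch_size = batch_size
        # Bounded buffer: the oldest records are dropped if the link stays down.
        self._buffer = deque(maxlen=max_buffer)

    def enqueue(self, record):
        self._buffer.append(record)

    def pending(self):
        return len(self._buffer)

    def flush(self):
        """Send batches oldest-first; stop at the first failure. Returns count sent."""
        sent = 0
        while self._buffer:
            size = min(self._batch_size, len(self._buffer))
            batch = [self._buffer[i] for i in range(size)]
            try:
                ok = self._send(batch)
            except OSError:            # network errors leave the data queued
                ok = False
            if not ok:
                break
            for _ in range(size):      # remove only after confirmed delivery
                self._buffer.popleft()
            sent += size
        return sent
```

Records are removed only after a batch is acknowledged, so a dropped connection delays delivery rather than losing data, up to the buffer limit.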

Feature Store and Model Serving Architecture

Once the data pipeline is established and feature extraction is running, the ML model deployment question is less complex than it first appears. For the anomaly detection use case — flagging deviations from established normal behavior — isolation forest models trained on 14–21 days of normal-operation baseline data provide a good starting point that doesn't require failure labels (which are scarce in rotating equipment datasets). The isolation forest scores each feature vector by how isolated it is from the normal-operation distribution; scores above a threshold trigger alert states.
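To show what the scoring step does, here is a compact pure-Python sketch of the isolation forest idea, using the original formulation in which scores closer to 1 are more anomalous. It is a teaching example under simplifying assumptions (tuple feature vectors, fixed subsample size, hypothetical function names); a production pipeline would more likely use a maintained library implementation.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def _c(n):
    """Average path length of an unsuccessful binary search tree lookup."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def _build(rows, depth, limit, rng):
    """Recursively split rows on a random feature at a random value."""
    if depth >= limit or len(rows) <= 1:
        return ("leaf", len(rows))
    dim = rng.randrange(len(rows[0]))
    lo = min(r[dim] for r in rows)
    hi = max(r[dim] for r in rows)
    if lo == hi:
        return ("leaf", len(rows))
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[dim] < split]
    right = [r for r in rows if r[dim] >= split]
    return ("node", dim, split,
            _build(left, depth + 1, limit, rng),
            _build(right, depth + 1, limit, rng))

def _path_length(x, node):
    depth = 0
    while node[0] == "node":
        _, dim, split, left, right = node
        node = left if x[dim] < split else right
        depth += 1
    return depth + _c(node[1])  # adjust for points left unsplit in the leaf

def fit_forest(baseline, n_trees=100, sample_size=64, seed=0):
    """Fit on normal-operation feature vectors (a list of equal-length tuples)."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline vectors")
    rng = random.Random(seed)
    size = min(sample_size, len(baseline))
    limit = math.ceil(math.log2(size))
    trees = [_build(rng.sample(baseline, size), 0, limit, rng)
             for _ in range(n_trees)]
    return trees, size

def anomaly_score(x, forest):
    """Score in (0, 1]; vectors that isolate quickly score closer to 1."""
    trees, size = forest
    mean_path = sum(_path_length(x, t) for t in trees) / len(trees)
    return 2.0 ** (-mean_path / _c(size))
```

The alert threshold is then a tuning decision: it can be set from the distribution of scores on the baseline period itself, for example at a high percentile, and revisited as operators review alerts.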

For remaining useful life (RUL) modeling, that is, predicting time-to-failure for assets showing progressive degradation, LSTM (Long Short-Term Memory) neural networks require labeled failure events and clean run-to-failure sequences. These are harder to obtain from operational data because operators intervene before failure in most cases, truncating the time series before it reaches failure state. Weibull parametric survival models work better with the censored data typical of midstream equipment fleets, where most observations are right-censored (the machine was still running when last observed). The choice between LSTM and Weibull for RUL is less a modeling choice than a data availability choice, covered in more depth in a separate article on RUL modeling methods.
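To make the censoring point concrete, the sketch below fits a two-parameter Weibull by maximum likelihood when some units were still running at last observation. It uses the standard profile-likelihood form: for a given shape k, the scale has a closed-form estimate, and k is found by bisection. The function name, the search bounds for k, and the input layout are assumptions for this example.

```python
import math

def fit_weibull_censored(durations, failed, tol=1e-9):
    """Fit Weibull shape and scale from right-censored run times.

    durations: positive run times (e.g. hours) per unit.
    failed: True if the unit failed at that time, False if it was still
            running when last observed (right-censored).
    Returns (shape, scale).
    """
    d = sum(failed)
    if d == 0:
        raise ValueError("need at least one observed failure")
    t_max = max(durations)
    t = [x / t_max for x in durations]   # rescale so t**k cannot overflow
    mean_log_fail = sum(math.log(x) for x, f in zip(t, failed) if f) / d

    def score(k):
        # Profile score for k; censored units enter both sums below,
        # but only failures enter mean_log_fail.
        num = sum(x ** k * math.log(x) for x in t)
        den = sum(x ** k for x in t)
        return num / den - 1.0 / k - mean_log_fail

    lo, hi = 1e-3, 100.0                 # assumed search range for the shape
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if score(mid) < 0:
            lo = mid
        else:
            hi = mid
    k = (lo + hi) / 2.0
    scale = t_max * (sum(x ** k for x in t) / d) ** (1.0 / k)
    return k, scale
```

Dropping the censored units, or treating their last-observed times as failures, would bias the fitted life downward, which is why the censoring flag matters for fleets where most machines are repaired before they fail.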

The practical recommendation for a growing midstream operator standing up a condition monitoring data pipeline for the first time: build the historian connection and feature extraction layer first, run it for 30–60 days to validate data quality, then add the anomaly model. The temptation to start with the ML model and then figure out the data pipeline is the most common failure mode we see in field deployments. It produces models that overfit to data artifacts rather than equipment behavior. We are not saying the modeling step is trivial — it is not. We are saying that no model survives contact with poorly validated data, and the historian integration and feature extraction layer is where data quality problems are caught and corrected before they corrupt your models.
