An anomaly is any data point or suspicious event that stands out from the baseline or expected pattern. When data unexpectedly deviates from the established dataset, it can be an early sign of a system malfunction, a security breach, or a backup configuration change.
Examples include unexpected data deletions, modifications, or excessive data insertions.
Anomalies do not always signify an issue, but each one is worth investigating to understand why the deviation occurred and whether the data point is valid when compared against the baseline or training set.
Importance
- Identify ransomware attacks sooner
- Limit downtime
- Detect and limit data loss early
- Better understanding of changes in the environment
Challenges
Anomaly detection is only valuable if it finds true anomalies, which means the system must be trained before it can be useful. Otherwise, the system can raise an excessive number of alerts/anomalies, beyond what one could feasibly investigate.
Retraining the anomaly detection system helps re-establish the baseline.
Anomaly detection approach
To detect anomalies, the required data is collected periodically. This data is used both to train the model and to detect anomalies.
Currently, QoreStor collects this data at 5-minute intervals.
Data collection starts as soon as QoreStor is installed or upgraded to 7.5.0. Data collected during the first 3 months (90 days) is used to establish a baseline. Once the baseline is established, anomalies are detected against it. Every 30 days thereafter, the baseline is re-established using the most recent 90 days of data.
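A minimal sketch of this rolling-window schedule, assuming training is keyed off the install/upgrade date; the function name and date handling here are illustrative, not part of QoreStor:

```python
from datetime import date, timedelta

TRAINING_DAYS = 90  # initial baseline uses 90 days of data
RETRAIN_DAYS = 30   # baseline is re-established every 30 days

def baseline_window(start: date, today: date):
    """Return the 90-day window the current baseline is built from,
    or None while the initial 90-day training is still in progress."""
    first_baseline = start + timedelta(days=TRAINING_DAYS)
    if today < first_baseline:
        return None  # still collecting the first 90 days of data
    # Completed 30-day retraining cycles since the first baseline.
    cycles = (today - first_baseline).days // RETRAIN_DAYS
    end = first_baseline + timedelta(days=cycles * RETRAIN_DAYS)
    return end - timedelta(days=TRAINING_DAYS), end

# Example: roughly 7 months after install, the baseline covers the
# most recent 90 days of data.
print(baseline_window(date(2024, 1, 1), date(2024, 8, 1)))
# (datetime.date(2024, 4, 30), datetime.date(2024, 7, 29))
```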
Anomaly detection categories
Anomaly detection is divided into the following categories: System-level, Container-level, and Storage-group level.
System-level
Anomalies: The following system-level anomalies are detected:
- System-level login authentication failures/anomalies
- QoreStor UI authentication failures/anomalies
- Protocol (OST/RDS) authentication failures/anomalies
- OS audit process stopped – reported as soon as the OS audit process either stops or stops/pauses logging to the audit files because of low disk space or another issue
- High or excessive system load average
- Disk I/O (diskio) anomalies reported for the repository/metadata/enclosure filesystems:
- An excessive number of I/O operations executed
- An excessive average wait time observed
Authentication-related anomalies do not require training because fixed thresholds are used. For example, if login authentication for the “root” user fails 3 or more times from a host “abc”, it is reported as an anomaly.
The anomaly report shows the expected value and the observed/current value, along with the date range.
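This threshold check can be illustrated with a minimal sketch, assuming failed attempts are grouped per (user, host) pair; the threshold of 3 follows the example above, and all names here are illustrative:

```python
from collections import Counter

FAIL_THRESHOLD = 3  # per the example above: 3 or more failures

def auth_anomalies(failed_logins):
    """failed_logins: iterable of (user, host) pairs, one per failed
    attempt. Returns the pairs whose failure count meets the threshold."""
    counts = Counter(failed_logins)
    return [(user, host, n) for (user, host), n in counts.items()
            if n >= FAIL_THRESHOLD]

# Example: three failed "root" logins from host "abc" are reported.
events = [("root", "abc"), ("root", "abc"), ("root", "abc"),
          ("admin", "xyz")]
print(auth_anomalies(events))  # [('root', 'abc', 3)]
```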
Load average
During the training period, the load average is collected every 15 minutes. Once the training period is over, the maximum load average value is computed. The value chosen is the largest one that occurs a certain number of times (the occurrences need not be consecutive); this is called the trained maximum value.
Reporting
If the current load average exceeds the trained maximum value for at least 30 consecutive minutes, it is reported as a load-average anomaly.
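A minimal sketch of both steps, assuming 15-minute sampling applies during detection as well as training; the minimum occurrence count and all names are illustrative:

```python
def trained_max(samples, min_occurrences=3):
    """Largest load-average value that occurred at least
    `min_occurrences` times during training (need not be consecutive)."""
    repeated = [v for v in set(samples)
                if samples.count(v) >= min_occurrences]
    return max(repeated) if repeated else max(samples)

def is_load_anomaly(recent, trained, sample_minutes=15, window_minutes=30):
    """True if every sample in the last `window_minutes` exceeds the
    trained maximum, i.e. it was exceeded for 30 consecutive minutes."""
    needed = window_minutes // sample_minutes  # 2 consecutive samples
    window = recent[-needed:]
    return len(window) == needed and all(v > trained for v in window)

# Example: trained maximum 4.0; the last two 15-minute samples exceed it.
print(is_load_anomaly([3.2, 4.5, 4.8], trained=4.0))  # True
```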
Diskio
During training, the total number of I/O operations and the average wait time are collected periodically for the repository, metadata, and enclosure filesystems. Once training is over, the maximum total number of I/O operations is recorded. The average wait time is trained using a linear regression model and saved for later reference.
Detection/Reporting
Once training is over, the current total number of I/O operations is compared with the maximum I/O count recorded during training. If it exceeds that maximum, an anomaly is reported.
Similarly, the current average wait time is computed and compared with the expected average wait time predicted by the linear regression model. If the observed wait time exceeds the expected value for 15 minutes or more within the last hour, it is reported as an anomaly.
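A minimal sketch of both checks, assuming the wait time is modeled as a linear function of the I/O count (the actual regression features are not documented here); the coefficients, sampling interval, and names are illustrative:

```python
def fit_wait_model(io_counts, wait_times):
    """Least-squares fit of wait_time ~ a * io_count + b over training data."""
    n = len(io_counts)
    mean_x = sum(io_counts) / n
    mean_y = sum(wait_times) / n
    a = (sum((x - mean_x) * (y - mean_y)
             for x, y in zip(io_counts, wait_times))
         / sum((x - mean_x) ** 2 for x in io_counts))
    return a, mean_y - a * mean_x

def diskio_anomalies(current_io, trained_max_io, last_hour, model,
                     sample_minutes=5):
    """last_hour: (io_count, observed_wait) samples from the past hour.
    Flags excess I/O if the current count exceeds the trained maximum,
    and excess wait if the observed wait exceeded the model's prediction
    for 15 minutes or more."""
    a, b = model
    alerts = []
    if current_io > trained_max_io:
        alerts.append("excess I/O operations")
    minutes_over = sum(sample_minutes
                       for io, wait in last_hour if wait > a * io + b)
    if minutes_over >= 15:
        alerts.append("excess average wait time")
    return alerts
```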
The diskio report is generated with the following anomaly details:
- Filesystem path on which the anomaly was reported
- Expected and current total number of I/O operations, or expected and current average wait time, along with the date range
CLI: The ‘system’ CLI can be used to configure the above anomaly detections. Refer to the QoreStor Command Line Reference Guide for more information.
Report: For system-level authentication anomalies, the report shows the following fields:
- Client name – the client from which the failed authentication originated
- User name – the username used in the authentication attempt
- Failed count – the number of failed attempts
- Failed start/end time – the period during which the failed attempts occurred
Container-level
QoreStor detects anomalies related to data ingest, data overwrites, and data expiry at the container level. To do this, the following metrics are collected for each container at regular intervals.
Ingest and overwrite: detects the backup pattern and data size.
- Number of bytes ingested into the container across clients within each interval
- Number of bytes overwritten across clients within each interval
Expiry
Files deleted – the number of files/images deleted across clients. The total size of all deleted files is also tracked internally. Data collected over 30-minute intervals is used for anomaly detection.
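A minimal sketch of this per-container collection, assuming raw events are aggregated into fixed 30-minute buckets; the event shape and all names are illustrative:

```python
from collections import defaultdict

BUCKET_SECONDS = 30 * 60  # 30-minute aggregation interval

def aggregate(events):
    """events: (epoch_seconds, container, kind, nbytes) tuples, where
    kind is 'ingest', 'overwrite', or 'delete'. Returns per-container,
    per-bucket totals used later for anomaly detection."""
    buckets = defaultdict(lambda: {"ingest": 0, "overwrite": 0,
                                   "files_deleted": 0, "bytes_deleted": 0})
    for ts, container, kind, nbytes in events:
        key = (container, ts // BUCKET_SECONDS)
        if kind == "delete":
            buckets[key]["files_deleted"] += 1        # count of deleted files
            buckets[key]["bytes_deleted"] += nbytes   # sum of their sizes
        else:
            buckets[key][kind] += nbytes
    return buckets
```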
NOTE: Even if a container has been removed, its anomalies can still be queried through the CLI or UI.
CLI: The container CLI can be used to tune/set anomaly detection metrics. Refer to the QoreStor Command Line Reference Guide for more information.
Anomaly settings applied at the storage group level are automatically inherited by all containers in that group unless explicitly disabled at the individual container level.
Report: The container-level anomaly report shows the following anomaly types:
- Ingest – bytes ingested and the corresponding savings (not in line with the training period/dataset)
- Overwrite – total bytes overwritten by backups (not expected per the training set)
- Expiry – the number of files deleted and the sum of their sizes (not expected per the training set)
- Start/end time – the period during which the anomaly occurred in the container
Storage-group level
At the storage group level, savings anomalies are detected. Savings are further classified into the following sub-categories:
- Savings – dedupe – reported if the total post-dedupe bytes fall outside the trained range
- Savings – compression – reported if the total post-compression bytes fall outside the trained range
CLI: The storage_group CLI can be used to set anomaly detection metrics at the storage group level. Refer to the storage_group command in the QoreStor Command Line Reference Guide.
Report: The storage group anomaly report shows the following metrics:
- Anomaly type – Savings
- Anomaly sub-type – deduplication or compression
To decide whether this is an anomaly, the following parameters are used (see the sketch after this list):
- Current value of deduplication or compression bytes
- Minimum and maximum expected values, i.e., the expected range
- Difference from the range, i.e., from the nearest minimum or maximum value
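A minimal sketch of this range check (all names are illustrative):

```python
def savings_anomaly(current, expected_min, expected_max):
    """Flags a savings anomaly when the current post-dedupe or
    post-compression byte count falls outside the expected range,
    reporting the difference from the nearest bound."""
    if expected_min <= current <= expected_max:
        return None  # within the trained range: not an anomaly
    nearest = expected_min if current < expected_min else expected_max
    return {"current": current,
            "expected_range": (expected_min, expected_max),
            "difference": abs(current - nearest)}

# Example: post-dedupe bytes well above the trained range.
print(savings_anomaly(12_000, 5_000, 9_000))
# {'current': 12000, 'expected_range': (5000, 9000), 'difference': 3000}
```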
Retraining
Automatic retraining: Once the initial training completes (after 3 months of data), retraining runs every month using the most recent 3 months of data. This keeps the training data current and helps tune the baseline. Retraining happens for both containers and storage groups.
Manual retraining: The ocamltrain CLI can be used for on-demand retraining. Refer to the QoreStor Command Line Reference Guide for more information.
Alerts
The following alerts/events are raised related to anomaly detection:
- When statistics collection is not running
- When the anomaly detection service is not running
- When the OS audit process stops running (provided OS authentication anomaly detection is enabled at the system level)
Reports
Anomaly reports are shown in the QoreStor UI and can also be queried using the ocamlreport CLI. Email notifications can also be configured with the following CLI command so that anomalies are sent as they are detected.
/opt/qorestor/bin/email_anomalies --configure
Refer to the QoreStor Command Line Reference Guide for more details.