Using AI for database administration automation – extracting useful information from log files

Posted in: DBA Lounge, Oracle, Technical Track

When talking about automation for Oracle Database Administration, besides automation of software installation and patching, with the use of techniques from machine learning, we can also try to automate a few other tasks. This is especially true since it is well instrumented and generates a lot of data around its operation. For example, we can try to automate some routine tasks we do while troubleshooting.

 

As troubleshooting starts with the database being in trouble, we initially need to determine what the actual trouble (problem) is. One thing we do about this is go into log and trace files related to the database instance. In a Real Application Cluster (RAC) environments this gets more complex and tedious as there is a number of log/trace files related to the clusterware, as well as the database instances,  which would have useful data about the problem at hand. Imagine being notified that there is a problem with a RAC database running on an Exadata system running on 6 compute nodes. What log files should you check first, on which node and what should you look for to initially understand what is going on?

 

Intelligent monitoring

What if the monitoring system was a bit smarter and just showed you, on a single page, relevant extracts from all log files/trace files at once? This might be useful to help you quickly understand what is actually going on at the moment and will also help for later root cause analysis.

 

The theory behind it

So one thing I thought might be useful and tried to do a proof-of-concept solution for, is a machine learning/deep learning model called sequence to sequence (seq2seq). This is usually used for language translation, text summarization, speech recognition and automated question answering. My idea is to use in a so-called auto-encoder configuration in this case.

 

As a bit of ML/DL background, at a very high level, sequence models learn sequences (of numbers, text, audio..) and are able to predict the next item in a sequence. Seq2seq model learns to represent a sequence as a lower dimension vector (done by the encoder) while the decoder learns to generate another sequence from that vector. So a translation engine is given a number of sentences (sequences) in one language and another. For an autoencoder we give it the same sentence as a source and as a target, so it basically learns to recreate sentences similar to the ones given during the training.

 

So how is this useful for our log files? The emphasis, in the previous sentence, is it can recreate sequences that it has come across during the training process. So what if we train the model with a log file/trice files while the system was in a normal state/operation? Line by line, it will learn to recreate those lines and lines similar to it. If all of sudden it comes across a line that is very different from anything the model came across during training, it will not be able to recreate it accurately i.e. it will generate a sentence (sequence) that will be vastly different from the original line in the log file it has come across. So this is in a way anomaly detection in a text.

 

We can then compare the source (original) line from the log file and compare it to the one generated by this model, and if we see that it is very different we will report it as an anomaly in the log file. Since the model was trained in a normal state of the system operation, such an anomaly should be something out of ordinary and possibly useful for the troubleshooting.

 

Some of those log lines will have nothing to do with an issue or a problem, they will just be something different and out of norm from the log lines seen during model training so we may need to do additional filtering of those by doing a supervised classification, which would of course require someone to go through such extracts from log files and classify them as being important or not.

 

To make the output more useful, I used a ML clustering algorithm called Mean Shift to automatically group those log lines that happened around the same time (using the timestamp from the log. Grouped like that we can treat those clusters like separate incidents that happened around some time (central point timestamp in the specific cluster of log lines). The grouping can be improved by grouping by some specific words or phrases by using the same algorithm.

 

Basic proof-of-concept implementation 

To more rapidly experiment with this I have used the ML/DL service in AWS called SageMaker which offers prebuilt Seq2Seq algorithm which can be used out of the box:

 

I have experimented using the Oracle High Availability Service daemon trace files (ohasd.trc) but it should be able to work on any log/trace file (of course it would need to be trained to “understand” that type of file.

 

Model training

SageMaker offers an example Jupyter notebook with all steps required to train and use their Seq2Seq model, so only some steps in that need to be adjusted for it to be used on data you want to for training, but also to create an endpoint which will be used for inference later.

 

We can upload some ohasd.trc file which should be from a time where the system behaved normally without any incidents.

We need to split the ohasd.trc file for training and validation:

 

head -n 32750 ohasd.trc > ohasd.trc.train
tail -n 8188 ohasd.trc > ohasd.trc.test

 

Then tokenize the trace files  (split in separate words and convert them to numbers) so they can be ingested into the se2seq model. AWS provides a python script for that, but I have modified it so it just uses tokens that don’t contain special characters and numbers, so it will be easier for the ML model to be trained on it (have fewer items in the vocabulary and smaller sentences which would be easier for the seq2seq model to recreate):

 

python3 create_vocab_proto.py

 

We load the files to S3, so SageMaher can use it.

Create a training job with these parameters can be created and run:

create_training_params = {
        ..
        "HyperParameters": {
        "max_seq_len_source": "20",
        "max_seq_len_target": "20",
        "optimized_metric": "bleu",
        "batch_size": "256",        "checkpoint_frequency_num_batches": "1000",
        "rnn_num_hidden": "512",
        "num_layers_encoder": "3",
        "num_layers_decoder": "3",
        "num_embed_source": "512",
        "num_embed_target": "512",
        "checkpoint_threshold": "3"
...

As seen we are optimizing for the metric BLEU, which is metric used to compare two sentences for their similarity. After creating and running the training job an endpoint can be created so this model can be used over an HTTP request from anywhere, so we would not need to install python ML libraries on the actual database server. The endpoints run on a ec2 instance in the background (which we don’t manage) but we are billed as long as the endpoint is up and running for that ec2 instance.

 

Some results

Created a script that would read lines from the ohasd.rc file and send them to the created endpoint SageMaker endpoint can be used just as normal REST API call using http.

To see if the model being trained would be able to actually filter out some interesting lines from the log, I have intentionally killed the ASM instance by killing some of it’s background processes so it would crash.

The algorithm still showed quite a number of log lines, that it thinks are unusual, so to make the output more useful, as mentioned previously, using  Mean Shift (used scikit-learn python library). It grouped lines with a nearby timestamp, so basically events that occurred near each other, so supposedly belong to a same incident. Some sample output is shown below. The column “Cluster n:”  indicates to which group (cluster) of lines that lines is part of – meaning to which related event, where “n” os the number of that group.

 

Anomalies detected and grouped into Cluster 0 seems not to be something that we might care about so we can additionally training a classifier to just ignore anomalies like that. For  “Cluster 2” seems to be the ASM going down. This can be further developed, so the system automatically summarizes in a high level note, what had happened, or even provide some root cause, so when the DBA connects to troubleshoot, he/she will know ehere to start and what is going on:

 

Cluster 0: |ORIG: | 2018-01-02 17:47:21.379 :OHASDMAIN:57868352:  OHASD params []
Cluster 0: |ORIG: | 2018-01-02 17:47:21.379 :OHASDMAIN:57868352:  Socket cleanup:0x49bde30
Cluster 0: |ORIG: | 2018-01-02 17:47:21.379 :OHASDMAIN:57868352:  Got [0] potential names
Cluster 0: |ORIG: | 2018-01-02 17:47:21.383 :  OCRRAW:57868352: proprioo: for disk 0 (/u01/grid/cdata/localhost/ip-172-31-86-123.olr), id match (1), total id sets, (1) need recover (0), my votes (0), total votes (0), commit_lsn (1), lsn (1)
Cluster 0: |ORIG: | 2018-01-02 17:47:21.383 :  OCRRAW:57868352: proprioo: my id set: (1777565868, 1028247821, 0, 0, 0)
Cluster 0: |ORIG: | 2018-01-02 17:47:21.383 :  OCRRAW:57868352: proprioo: 1st set: (1777565868, 1028247821, 0, 0, 0)
Cluster 0: |ORIG: | 2018-01-02 17:47:21.383 :  OCRRAW:57868352: proprioo: 2nd set: (0, 0, 0, 0, 0)
Cluster 0: |ORIG: | 2018-01-02 17:47:21.387 :  OCRAPI:57868352: a_init:18: Thread init successful
Cluster 0: |ORIG: | 2018-01-02 17:47:21.387 :  OCRAPI:57868352: a_init:19: Client init successful
Cluster 0: |ORIG: | 2018-01-02 17:47:21.388 :OHASDMAIN:57868352:  Version compatibility check passed: Software Version: 12.2.0.1.0 Release Version: 12.2.0.1.0 Active Version: 12.2.0.1.0
Cluster 0: |ORIG: | 2018-01-02 17:47:21.392 : CRSMAIN:57868352:  Logging level for Module: GIPCBASE 0
Cluster 0: |ORIG: | 2018-01-02 17:47:21.396 :   CRSPE:57868352: ...done : 0
Cluster 0: |ORIG: | 2018-01-02 17:47:21.396 :OHASDMAIN:57868352:  Initializing ubglm...

Cluster 2: |ORIG: | 2018-02-02 13:00:44.538 :    AGFW:2113812224: {0:7:16} Verifying msg rid = ora.asm ip-172-31-86-123 1
Cluster 2: |ORIG: | 2018-02-02 13:00:44.538 :    AGFW:2113812224: {0:7:16} Received state LABEL change for ora.asm ip-172-31-86-123 1 [old label  = Started, new label = Abnormal Termination]
Cluster 2: |ORIG: | 2018-02-02 13:00:44.538 :   CRSPE:2101204736: {0:7:16} State change received from ip-172-31-86-123 for ora.asm ip-172-31-86-123 1
Cluster 2: |ORIG: | 2018-02-02 13:00:44.539 :   CRSPE:2101204736: {0:7:16} Processing unplanned state change for [ora.asm ip-172-31-86-123 1]
Cluster 2: |ORIG: | 2018-02-02 13:00:44.539 :   CRSPE:2101204736: {0:7:16} Scheduled local recovery for [ora.asm ip-172-31-86-123 1]
Cluster 2: |ORIG: | 2018-02-02 13:00:44.540 :   CRSPE:2101204736: {0:7:16} RI [ora.asm ip-172-31-86-123 1] new internal state: [CLEANING] old value: [STABLE]
Cluster 2: |ORIG: | 2018-02-02 13:00:44.540 :   CRSPE:2101204736: {0:7:16} state change vers moved to 6 for RI:ora.asm ip-172-31-86-123 1
Cluster 2: |ORIG: | 2018-02-02 13:00:44.540 :   CRSPE:2101204736: {0:7:16} Sending message to agfw: id = 50979
Cluster 2: |ORIG: | 2018-02-02 13:00:44.540 :   CRSPE:2101204736: {0:7:16} CRS-2679: Attempting to clean 'ora.asm' on 'ip-172-31-86-123'
Cluster 2: |ORIG: | 2018-02-02 13:00:44.547 :    AGFW:2113812224: {0:7:16} ora.orcl.db 1 1 received state from probe request. Old state = ONLINE, New state = ONLINE
Cluster 2: |ORIG: | 2018-02-02 13:00:44.547 :    AGFW:2113812224: {0:7:16} ora.orcl.db 1 1 received state from probe request. Old state = ONLINE, New state = ONLINE
Cluster 2: |ORIG: | 2018-02-02 13:00:44.547 :   CRSPE:2101204736: {0:7:17} ora.DATA.dg ip-172-31-86-123 1: uptime exceeds uptime threshold , resetting restart count
Cluster 2: |ORIG: | 2018-02-02 13:00:44.547 :   CRSPE:2101204736: {0:7:17} Scheduled local recovery for [ora.DATA.dg ip-172-31-86-123 1]
Cluster 2: |ORIG: | 2018-02-02 13:00:44.552 :    AGFW:2113812224: {0:7:17} ora.orcl.db 1 1 received state from probe request. Old state = ONLINE, New state = ONLINE
Cluster 2: |ORIG: | 2018-02-02 13:00:46.551 :    AGFW:2113812224: {0:7:18} Verifying msg rid = ora.orcl.db 1 1
Cluster 2: |ORIG: | 2018-02-02 13:00:46.551 :    AGFW:2113812224: {0:7:18} Received state LABEL change for ora.orcl.db 1 1 [old label  = Open,HOME=/u01/app/oracle/product/12.2.0/db, new label = Abnormal Termination,HOME=/u01/app/oracle/product/12.2.0/db]
Cluster 2: |ORIG: | 2018-02-02 13:00:46.552 :   CRSPE:2101204736: {0:7:18} State change received from ip-172-31-86-123 for ora.orcl.db 1 1

 

email

Interested in working with Gorjan? Schedule a tech call.

2 Comments. Leave new

Intressting technique. But I think in this use case AI / ML is not really needed.
More interessting would be a use case like:
Identifing the statements (AWR) which get slower as data is growing. So I could predict when the SLA get violated.

Reply
Gorjan Todorovski
April 2, 2018 11:59 am

Something like that is coming in my next blog post :)

Reply

Leave a Reply

Your email address will not be published. Required fields are marked *