
An Oracle-based approach to the "taxi fare" prediction problem - episode 2

This is the second part of the series on the Taxi Fare prediction problem from an Oracle perspective. You can read Episode 1 here. In this episode, I will put our model to work; that is, I will show several queries that predict taxi fares based on the features previously selected. I will also review model performance metrics. Remember, performance in this case does not refer to response time, but to fitness, or how accurate the predictions are expected to be. Finally, I will comment on the pros and cons of this approach.

Show me the... results!

Well, now that we have a model, what we all want is to see how it fares (see what I did there?). Oracle includes predictive functions as part of the Advanced Analytics option, under the Data Mining SQL functions. For this exercise, I will use the PREDICTION function because, well, we are predicting stuff. This function has a relatively simple but powerful syntax that supports grouping options, cost matrix data and predictors for regression, classification and anomaly detection problems. It also has two modes of operation, regular syntax and analytic syntax: simply put, the regular syntax scores data with an existing model, while the analytic syntax "creates a model on the fly". Let's see a simple example showing both the actual fare amount and the one predicted by the model, just for the first 10 rows in the data. First, using the linear regression model created on the 54-million-row training table, we can see that the percentage variation is quite high, as much as 49%, even for this small sample.
SQL> WITH PREDICTION AS 
  2 (SELECT FARE_AMOUNT, PREDICTION (LIN_REG_TAXI_FARE_3 USING *) PREDICTED_FARE
  3 FROM TRAIN_DISTANCE_BUCKET_3
  4 WHERE CASEID < 11
  5 )
  6 SELECT FARE_AMOUNT, 
  7 ROUND(PREDICTED_FARE,2) PREDICTED_FARE, 
  8 ROUND(ABS(FARE_AMOUNT-PREDICTED_FARE),2) ABS_DIFF, 
  9 ROUND(ABS(FARE_AMOUNT-PREDICTED_FARE)/FARE_AMOUNT*100,2) "%DIFF"
  10 FROM PREDICTION;
 
 FARE_AMOUNT PREDICTED_FARE   ABS_DIFF      %DIFF
 ----------- -------------- ---------- ----------
         6.5           8.81       2.31      35.48
         5.7           8.53       2.83      49.71
        10.9          12.25       1.35       12.4
          18          18.64        .64       3.54
         6.1           7.92       1.82      29.89
          13          12.64        .36       2.81
         8.5          10.56       2.06       24.2
           9          10.99       1.99      22.07
         6.5           8.67       2.17      33.36
         9.3          10.35       1.05      11.26
 
 10 rows selected. 
 
 Elapsed: 00:00:04.275
 
And now for the support vector machine model, created on the 2-million-row training table. Based on these few cases alone, it looks like the SVM model is more accurate than the GLM one.
SQL> WITH PREDICTION AS 
  2 (SELECT FARE_AMOUNT, PREDICTION (LIN_REG_TAXI_FARE_3_SVM USING *) PREDICTED_FARE
  3 FROM TRAIN_DISTANCE_BUCKET_3
  4 WHERE CASEID < 11
  5 )
  6 SELECT FARE_AMOUNT, 
  7 ROUND(PREDICTED_FARE,2) PREDICTED_FARE, 
  8 ROUND(ABS(FARE_AMOUNT-PREDICTED_FARE),2) ABS_DIFF, 
  9 ROUND(ABS(FARE_AMOUNT-PREDICTED_FARE)/FARE_AMOUNT*100,2) "%DIFF"
  10 FROM PREDICTION;
 
 FARE_AMOUNT PREDICTED_FARE   ABS_DIFF      %DIFF
 ----------- -------------- ---------- ----------
         6.5           7.79       1.29      19.91
         5.7           7.24       1.54      26.96
        10.9          11.99       1.09      10.04
          18           20.4        2.4      13.32
         6.1           6.38        .28       4.58
          13          12.51        .49       3.75
         8.5            9.9        1.4      16.43
           9           9.61        .61       6.73
         6.5           7.31        .81      12.53
         9.3           8.62        .68       7.32
 
 10 rows selected. 
 
 Elapsed: 00:00:05.736
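 
Before moving on, here is a quick sketch of the analytic syntax mentioned earlier. No pre-built model is referenced: Oracle trains a transient model over the rows defined by the OVER () clause and scores those same rows in the same statement. The row filter below is arbitrary and only meant to keep the transient model small; in a real case you would also exclude identifier columns such as CASEID from the USING list.
-- "Create a model on the fly": a transient regression model for
-- FARE_AMOUNT is built over the rows returned by the query (the
-- empty OVER () window) and used to score those same rows.
SELECT CASEID, FARE_AMOUNT,
       ROUND(PREDICTION(FOR FARE_AMOUNT USING *) OVER (), 2) PREDICTED_FARE
FROM   TRAIN_DISTANCE_BUCKET_3
WHERE  CASEID < 10001;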
 
Another interesting function that Oracle provides is the PREDICTION_DETAILS function. This function returns an XML string describing each prediction, that is, the top N attributes that had the most influence on it and their relative weights. Let's see another simple example with the GLM:
SQL> SELECT FARE_AMOUNT, 
  2  ROUND(PREDICTION (LIN_REG_TAXI_FARE_3 USING *),2) PREDICTED_FARE, 
  3  PREDICTION_DETAILS (LIN_REG_TAXI_FARE_3 USING *) DETAILS
  4  FROM TRAIN_DISTANCE_BUCKET_3
  5  WHERE CASEID < 11;
 
 FARE_AMOUNT PREDICTED_FARE DETAILS
 ----------- -------------- ------------------------------------------------------------
         6.5           8.81
         5.7           8.53
        10.9          12.25
          18          18.64
         6.1           7.92
          13          12.64
         8.5          10.56
           9          10.99
         6.5           8.67
         9.3          10.35
 
 [DETAILS XML output not reproduced here]
 
 10 rows selected.
 
 Elapsed: 00:00:04.176
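 
Rather than reading the XML returned by PREDICTION_DETAILS by eye, the details can be unpacked with standard XML functions. The sketch below assumes the Details/Attribute structure with name, weight and rank attributes that the PREDICTION_DETAILS documentation describes; if your release returns the details as a character string rather than XMLType, wrap the column in XMLTYPE() before passing it to XMLTABLE.
-- Unpack the top attributes and their weights from the PREDICTION_DETAILS XML.
SELECT d.caseid,
       x.attr_name,
       x.attr_weight,
       x.attr_rank
FROM  (SELECT caseid,
              PREDICTION_DETAILS(lin_reg_taxi_fare_3 USING *) details
       FROM   train_distance_bucket_3
       WHERE  caseid < 4) d,
      XMLTABLE('/Details/Attribute'
               PASSING d.details
               COLUMNS attr_name   VARCHAR2(30) PATH '@name',
                       attr_weight NUMBER       PATH '@weight',
                       attr_rank   NUMBER       PATH '@rank') x
ORDER BY d.caseid, x.attr_rank;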
 
Something to note in the GLM details is that Oracle does not always choose the same attributes for each row, and a larger number of attributes does not seem to imply higher accuracy. But this is only a small portion of the data and may not be representative. Now, the same with the SVM model:
SQL> SELECT FARE_AMOUNT, ROUND(PREDICTION (LIN_REG_TAXI_FARE_3_SVM USING *),2) PREDICTED_FARE, 
  2 PREDICTION_DETAILS (LIN_REG_TAXI_FARE_3_SVM USING *) DETAILS
  3 FROM TRAIN_DISTANCE_BUCKET_3
  4 WHERE CASEID <11; 
 
 FARE_AMOUNT PREDICTED_FARE DETAILS
 ----------- -------------- ------------------------------------------------------------
         6.5           7.79
         5.7           7.24
        10.9          11.99
          18           20.4
         6.1           6.38
          13          12.51
         8.5            9.9
           9           9.61
         6.5           7.31
         9.3           8.62
 
 [DETAILS XML output not reproduced here]
 
 10 rows selected.
 
 Elapsed: 00:00:04.346
 
Again, we can see that Oracle chooses different attributes for each case. In any event, once the model has been created, it is quite easy to use SQL, either in batch mode or in real time, to obtain predictions.
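
For the real-time case, a single new ride can be scored by supplying attribute values directly in the USING clause. The attribute names below are only assumptions about the feature set prepared in Episode 1; the USING list has to match the columns the model was actually trained on.
-- Hypothetical real-time scoring of a single new ride. The attribute
-- names are illustrative placeholders, not the model's confirmed inputs.
SELECT ROUND(
         PREDICTION(LIN_REG_TAXI_FARE_3
                    USING 2.4 AS TRIP_DISTANCE,
                          1   AS PASSENGER_COUNT,
                          18  AS PICKUP_HOUR), 2) PREDICTED_FARE
FROM   DUAL;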

Model fitness

There are several methods to evaluate the accuracy of a regression model. The Oracle documentation mentions two commonly used ones, Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE), and provides SQL to calculate them. So let's do it, first for the GLM and then for the SVM model.
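For reference, denoting the actual fare by $y_i$, the predicted fare by $\hat{y}_i$ and the number of scored rows by $n$, the two metrics computed below are

$$\mathrm{RMSE}=\sqrt{\tfrac{1}{n}\sum_{i=1}^{n}(\hat{y}_i-y_i)^2}
\qquad\qquad
\mathrm{MAE}=\tfrac{1}{n}\sum_{i=1}^{n}\lvert \hat{y}_i-y_i\rvert$$
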
-- First for the GLM model
 SQL> with prediction as 
  2 (SELECT FARE_AMOUNT, PREDICTION (Lin_reg_taxi_fare_3 USING *) PREDICTED_FARE
  3 FROM train_distance_bucket_3)
  4 select SQRT(AVG((predicted_fare - fare_amount) * (predicted_fare - fare_amount))) rmse, 
  5 AVG(ABS(predicted_fare - fare_amount)) mae
  6* from prediction p;
       RMSE        MAE
 ---------- ----------
 6.66524393 2.96620921
 
 Elapsed: 00:01:39.752
 
 -- Then for the SVM model
 SQL> with prediction as 
  2 (SELECT FARE_AMOUNT, PREDICTION (LIN_REG_TAXI_FARE_3_SVM USING *) PREDICTED_FARE
  3 FROM train_distance_bucket_3 )
  4 select SQRT(AVG((predicted_fare - fare_amount) * (predicted_fare - fare_amount))) rmse, 
  5 AVG(ABS(predicted_fare - fare_amount)) mae
  6 from prediction p ;
 
       RMSE        MAE
 ---------- ----------
 7.76230946 2.39739187
 
 Elapsed: 00:01:42.077
 
With the data above, we can say that the GLM model is better in terms of overall error (smaller RMSE), while based on the MAE the SVM model is slightly better. Note that both calculations were performed on the 54-million-row table, although the SVM model was trained with only 2 million rows of data. While Oracle indicates that the SVM algorithm is more complex and powerful than GLM, for this particular data set it shows worse performance. This may be because SVM is better suited to data sets with a large number of attributes, which is not the case here. Those of you who have worked with the original Taxi Fare problem may know that the RMSE for a basic algorithm is approximately 8, so getting a value of 6.6 is quite an improvement.

Model fitness obtained with the Data Miner

As I explained in Episode 1, the Data Miner extension for SQL Developer provides tools to create models like the ones I've created with PL/SQL, but through a GUI used to define what it calls a data workflow. To keep things simple, I've created a very basic workflow that starts with the same data table I used before, and had Data Miner create both a GLM and an SVM regression model with its own default parameters.

[Figure: Simple workflow to create GLM and SVM models]

This simple workflow reads the data from the table, obtains basic statistics about it (Explore Data node) and generates a GLM and an SVM model with default parameters (Regress Build 1 node). The workflow can be executed immediately or scheduled to run later, which is useful if the base data changes regularly and you want to keep the models trained with the newest data. Once the workflow has been executed, we can review each model's information and, quite interestingly, a comparison of both models. Like so:

[Figure: GLM vs SVM model comparison]

As we can see on the graph, both models perform similarly, but the SVM model shows a slightly worse predictive confidence of 22%, compared with 32% for the GLM model, which is not that good either. This shows that the models do not fit this particular data set very well. It may be a matter of training with more data, or of preparing the data in a different way, to obtain better results. I may add a third episode to this series with attempts at improving the results.

Final thoughts

And this is pretty much it. In this two-part series, I've shown that the Taxi Fare problem can be solved using Oracle tools, more specifically the Advanced Analytics option. I've prepared the data, created a couple of models and obtained predictions, all without using any external tool and with basic SQL and PL/SQL, which was the initial objective of the experiment. The Data Miner extension is useful, and I probably missed many of its features, but it is not required to achieve what I was looking for. It is true that the algorithms I used do not seem to fit this particular problem very well, but that may simply be down to a lack of training data, an algorithm configuration that is not properly tuned, or a training data set that is not good enough.

So what are the advantages of this approach?

The whole idea of the experiment was to do it within Oracle and nothing more. It worked out, so the main advantage I can see is that a person, call them a DBA, data scientist or developer, who knows SQL and, more importantly, understands the business data, can build a model in a relatively easy way. The second benefit is that the data is already there: daily ETL and DWH processes are surely in place, and adding one more flow to shape the data to the needs of our model should be, again, relatively easy. Once the model has been created, it can be used with simple SQL queries on new data as it arrives, either in real time or in batch processing, depending on our needs.
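
As a rough sketch of the batch side, scoring can simply be appended to the nightly load. The table names NEW_RIDES and FARE_PREDICTIONS below are hypothetical and not objects created in this series:
-- Hypothetical batch-scoring step run after the daily ETL load;
-- it scores the newly arrived rides with the GLM built earlier.
INSERT INTO fare_predictions (caseid, predicted_fare, scored_at)
SELECT caseid,
       PREDICTION(lin_reg_taxi_fare_3 USING *),
       SYSTIMESTAMP
FROM   new_rides;
COMMIT;

Wrapped in a small PL/SQL procedure, a step like this can be scheduled with DBMS_SCHEDULER alongside the rest of the ETL jobs.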

Fine, what's the trick then?

Well, creating a model can be very expensive in compute resources, depending on the amount of data used to train it. In one of my tests, the SVM model took nine hours to build on a 54-million-row data set running on an OCI VM.Standard2.2 instance. That is only 2 OCPUs and 32 GB of RAM, and surely any enterprise-level server, or an Exadata, can do it much faster, but you see my point. If the DWH server is already burdened with daily ETL loads and reporting, adding model creation on top may be a bit too much.

There is also the matter of licensing. On top of the Enterprise Edition license fees, and everything else the company may already be using, the Advanced Analytics option has to be licensed as well. This may not be a big deal with a ULA in place, or it may be just a drop in the ocean depending on the licensing budget, but it is a factor to consider.

And that's all, folks. Should you have any questions, use the comments section and I'll be happy to answer them.
