Implementing and evaluating various machine learning models for pipe burst prediction

By accurately predicting pipe bursts, it is possible to schedule pipe maintenance and rehabilitation and to improve the level of service in water distribution networks (WDNs). In this study we implement five artificial intelligence and machine learning regression models, namely multivariate adaptive regression splines (MARS), M5' regression tree (M5'), least square support vector regression (LS-SVR), fuzzy regression based on c-means clustering (FCMR) and regressive convolution neural network with support vector regression (RCNN-SVR), for predicting the pipe burst rate, and we evaluate the performance of these models. The most effective parameters for the regression models are pipe age, diameter, depth of installation, length, and average and maximum hydraulic pressure. The collected data include 158 cases for polyethylene (PE) and 124 cases for asbestos cement (AC) pipes during 2012-2019. The results indicate that the RCNN-SVR model performs best at pipe burst rate (PBR) prediction.


Introduction
Water distribution networks (WDNs) are critical infrastructures. The objective of WDNs is to provide water of the desired quantity, quality and pressure to consumers. However, pipe failure, which is the progressive effect of physical, operational and weather-related factors, may prevent the WDN from achieving these goals (Kakoudakis, 2019). A pipe bursts when the residual strength of a deteriorated pipe can no longer resist the force inflicted on it (Berardi et al., 2008). Pipe burst prediction helps to prioritize the maintenance, repair, rehabilitation and replacement of pipes after assessing and forecasting pipe propensity to burst. In addition, pipe burst prediction can be used for budget allocation and cost analysis of dynamic or static design of water distribution networks. In the literature, there are typically two categories of methods for modelling pipe bursts: physical and statistical (Grigg, 2007; Rajani and Kleiner, 2001). Physical models are developed to understand the physical process of pipe deterioration. In these models, the items that may affect pipe bursts include environmental conditions, quality of manufacturing, installation procedure, internal and external loads, surrounding soil, ground traffic, etc. (Wilson et al., 2015). The physical mechanisms of pipe burst are complex and not well understood, and limited data are available on the breakage failure modes due to inspection difficulty and a lack of historical data (Rajani and Kleiner, 2001). Statistical methods model pipe bursts based on historical data. The assumption of these models is that pipes with similar specifications and working environments will experience similar deterioration patterns (Kleiner and Rajani, 2010).

(Preprint: https://doi.org/10.5194/dwes-2021-7, Drinking Water Engineering and Science Discussions, open access. Discussion started: 29 March 2021. © Author(s) 2021. CC BY 4.0 License.)
Since physical models developed for pipe failure prediction are complicated and expensive, they can only be applied to a limited number of pipes. Statistical models based on historical data, by contrast, are less expensive and have broad applications.
The goal of any data analysis is to extract accurate estimates from the raw information. One of the most substantial and typical issues is whether there is a statistical relationship between a response variable (Y) and explanatory variables (Xi). One way to address this issue is to employ regression analysis in order to model that relationship (Alexopoulos, 2010). Various pipe failure prediction methods have been proposed, such as physical models (Randall-Smith et al., 1992), multivariate adaptive regression splines (Kutylowska, 2019), artificial neural networks (Achim et al., 2007; Kutylowska, 2017), support vector machines (Kutylowska, 2018), fuzzy logic (Rajani and Tesfamariam, 2007), neuro-fuzzy systems (Christodoulou et al., 2004; Tabesh et al., 2009) and evolutionary polynomial regression (Berardi et al., 2008). In this research, pipe length (L), diameter (Dim), average hydraulic pressure (Pa), maximum hydraulic pressure (Pm), age (A) and installation depth (ID) are used as inputs of the regression models, and the pipe burst rate (PBR) is obtained as the output. In addition, the correlation between these factors and PBR has been investigated. After implementing the various artificial intelligence and machine learning models, namely multivariate adaptive regression splines (MARS), M5' regression tree (M5'), least square support vector regression (LS-SVR), fuzzy regression based on c-means clustering (FCMR) and regressive convolution neural network with support vector regression (RCNN-SVR), on a real water distribution network, the corresponding predicted PBR values have been evaluated to find the best model-based prediction method.

Methodology
In order to implement the regression models, six input variables, consisting of pipe diameter, length, age, depth of installation, and average and maximum hydraulic pressure, have been used. The output of all the mentioned prediction models is the PBR. PBR values are calculated using the following equation:

PBR = (number of annual pipe bursts) / (pipe length (km))     (1)

The collected data have been split into training and test sets by random sampling. 85% of the data have been selected for training and the rest have been used to test the models. Using several evaluation indices, the testing dataset is used to evaluate the performance of the models on future unseen data. Further analysis is performed to investigate the Pearson correlation between PBR and the variables.
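The split described above can be sketched as follows. The real Joopar data are not public, so the feature matrix and PBR vector below are hypothetical random stand-ins with the same shape (282 burst cases, six features):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-ins for the dataset: 282 burst cases (158 PE + 124 AC)
# with six features (L, Dim, Pa, Pm, A, ID) and a PBR target.
X = rng.random((282, 6))
y = rng.random(282)

# 85% / 15% random split by shuffling indices, as described in the text.
idx = rng.permutation(len(y))
n_train = int(0.85 * len(y))          # 239 training cases
train_idx, test_idx = idx[:n_train], idx[n_train:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
```

Shuffling the index array rather than the rows keeps features and targets aligned with a single permutation.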

Description of regression models
Regression analysis is a form of predictive modelling technique which investigates the relationship between a dependent (target) variable and independent variable(s) (predictors). In this section, five multivariate regression models for pipe failure prediction are implemented and discussed.

Multivariate adaptive regression splines (MARS)
The MARS model expresses the target as a weighted sum of basis functions:

f(X) = β_0 + Σ_{m=1}^{M} β_m B_m(X),    B_m(X) = Π_{k=1}^{K_m} [S_{k,m} (x_{V(k,m)} − t_{k,m})]_+

where M is the number of basis functions (BFs) in the model, which is adjusted at the first step; "+" means the argument is a truncated power function; K_m is the knot quantity; S_{k,m} is +1 or −1, which shows the BF's direction; V(k,m) is the variable label; and t_{k,m} is the cut-off point. The BFs represent the relationship between the knots using the reflected pair of hockey stick functions (f):

f(x) = max(0, x − c)   or   f(x) = max(0, c − x)

Here, c is a threshold value that denotes the knot, where the behavior of the function changes. This model searches over the space of all inputs and predictor values (knots) as well as the interactions between variables. During this search, an increasingly larger number of basis functions are added to the model to minimize a lack-of-fit criterion. As a result, MARS automatically determines the most important independent variables as well as the most significant interactions among them. Note that the search for the best predictor and knot location is performed iteratively: the predictors and knot locations contributing most to the model are selected first, and at the end of each iteration the introduction of an interaction is checked for possible model improvements.
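The reflected hinge pair at the heart of MARS can be sketched as below; the coefficients and knot in the toy two-term model are hypothetical, not fitted values from the paper:

```python
import numpy as np

def hinge_pair(x, c):
    """Reflected pair of hockey stick functions around knot c:
    max(0, x - c) and max(0, c - x)."""
    x = np.asarray(x, dtype=float)
    return np.maximum(0.0, x - c), np.maximum(0.0, c - x)

def mars_predict(x, b0, b1, b2, c):
    """Toy two-term MARS-style model:
    f(x) = b0 + b1*max(0, x - c) + b2*max(0, c - x)."""
    h_pos, h_neg = hinge_pair(x, c)
    return b0 + b1 * h_pos + b2 * h_neg
```

Each hinge is zero on one side of the knot, so the fitted function is piecewise linear with a slope change exactly at c.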
The obtained BFs for the Joopar WDN for AC and PE pipes are as follows. AC pipes:

M5' model tree (M5')
The M5 tree is a decision tree learner for regression problems introduced by Quinlan (1992). The M5 tree has three main types of nodes: decision nodes, leaf nodes and a root node. A decision node has two or more branches, each representing values of the attributes. A leaf node represents a decision on the numerical target, and the topmost decision node in a tree is called the root node. The model is established as a binary decision tree with linear regression functions in the leaf nodes, which sets a relationship between independent and dependent variables (Rahimikhoob et al., 2013). Wang and Witten (1997) proposed M5' as an improved version of M5. The M5' method builds a tree in three phases: a growing phase, a pruning phase and a smoothing phase. In the growing phase, the dataset is split on different attributes. The standard deviation for each branch is then calculated and subtracted from the standard deviation before the split. The result is the standard deviation reduction (SDR), which is based on the decrease in standard deviation after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest standard deviation reduction. The SDR is given by Quinlan (1992):

SDR = sd(K) − Σ_i (|K_i| / |K|) × sd(K_i)

where K represents the set of examples that reaches the node, and K_i and sd represent the subset of examples that has the ith outcome of the potential split and the standard deviation, respectively (Wang et al., 2010).
At the end of the first phase, there is a large tree that overfits the data, so a pruning phase must be employed. In this phase, the tree is pruned back from each leaf until an estimate of the expected error at each node cannot be reduced any further (Wang et al., 2010). Finally, the smoothing phase is performed to compensate for the sharp discontinuities that inevitably occur between adjacent linear models at the leaves of the pruned tree, particularly for models constructed from a small number of training examples (Ditthakit et al., 2012). In this phase, the adjacent linear equations are updated so that the predicted outputs for neighboring input vectors corresponding to different equations become close in value. This process substantially increases the accuracy of prediction (Witten and Frank, 2005).
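The SDR split criterion from the growing phase can be sketched as:

```python
import numpy as np

def sdr(parent, subsets):
    """Standard deviation reduction for a candidate split:
    SDR = sd(K) - sum(|K_i|/|K| * sd(K_i))."""
    parent = np.asarray(parent, dtype=float)
    weighted_child_sd = sum(
        len(s) / len(parent) * np.asarray(s, dtype=float).std()
        for s in subsets
    )
    return parent.std() - weighted_child_sd
```

A split that produces homogeneous child subsets drives the weighted child standard deviation toward zero, maximizing the SDR.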

Fuzzy c-regression (FCR)
The fuzzy c-regression (FCR) model was introduced by Hathaway and Bezdek (1993). This method is an extension of the fuzzy c-means approach, which is one of the most popular clustering methods. It performs classification based on the iterative minimization of the following objective function and constraints (Bezdek et al., 1984; Bezdek, 1981; Dave, 1992):

J(μ, V) = Σ_{i=1}^{c} Σ_{j=1}^{n} μ_{i,j}^q D_{i,j}^2

subject to:

Σ_{i=1}^{c} μ_{i,j} = 1,   j = 1, …, n

where i ∈ {1, …, c}, j ∈ {1, …, n}, n is the number of data points, c is the number of clusters, μ is the fuzzy membership matrix, q is the fuzzifier (q > 1), V is the cluster center vector, X is a data vector and D_{i,j} is the distance between observation x_j and cluster center v_i. By using a Lagrangian multiplier, V and μ can be obtained by optimizing the objective function above:

v_i = Σ_{j=1}^{n} μ_{i,j}^q x_j / Σ_{j=1}^{n} μ_{i,j}^q

μ_{i,j} = 1 / Σ_{k=1}^{c} (D_{i,j} / D_{k,j})^{2/(q−1)}

The membership values μ_{i,j} are initialized randomly, and the cluster centers v_i and memberships are then iteratively updated with the two equations above until either the maximum number of iterations is reached or the maximum change in μ_{i,j} becomes less than or equal to a specified threshold ε (Ameer et al., 2008). q is normally set to 2, as this is considered the best value for the fuzzifier. Finally, weighted least squares is used for the regression model, in which the weights are the membership values of the training data; for each cluster, the regression coefficients (β_i) are calculated as:

β_i = (X^T W_i X)^{-1} X^T W_i Y

where Y is the observed PBR, X contains the explanatory variables and W_i = diag{μ_i} over all training observations. Then, using the calculated v_i, the membership values of the test data are used for prediction:

ŷ(x_test(j)) = Σ_{i=1}^{c} μ_{i,j} x_test(j) β_i

where x_test(j) is the jth test observation and μ_{i,j} is its membership in cluster i, computed from its distances to the cluster centers v_i using the membership update equation above.
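A minimal sketch of the two building blocks above (fuzzy c-means clustering and per-cluster weighted least squares), assuming Euclidean distances, random initialization and hypothetical default hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def fcm(X, c=2, q=2.0, eps=1e-5, max_iter=100):
    """Fuzzy c-means: alternate center and membership updates until the
    maximum membership change is <= eps or max_iter is reached."""
    n = len(X)
    mu = rng.random((c, n))
    mu /= mu.sum(axis=0)                     # memberships sum to 1 per point
    for _ in range(max_iter):
        w = mu ** q
        V = (w @ X) / w.sum(axis=1, keepdims=True)       # cluster centers
        D = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2) + 1e-12
        inv = D ** (-2.0 / (q - 1.0))
        mu_new = inv / inv.sum(axis=0)  # mu_ij = 1/sum_k (D_ij/D_kj)^(2/(q-1))
        converged = np.abs(mu_new - mu).max() <= eps
        mu = mu_new
        if converged:
            break
    return V, mu

def weighted_ls(X, y, w):
    """Per-cluster weighted least squares with an intercept column:
    beta = (X^T W X)^-1 X^T W y."""
    Xa = np.column_stack([np.ones(len(X)), X])
    W = np.diag(w)
    return np.linalg.solve(Xa.T @ W @ Xa, Xa.T @ W @ y)
```

With uniform weights, `weighted_ls` reduces to ordinary least squares; in FCR each cluster's memberships serve as the weights.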

Least-squares support vector regression (LS-SVR)
Support vector machines (Vapnik, 1995, 1998a, 1998b) were introduced for solving pattern recognition problems. The SVM variant used to estimate regression is called support vector regression (SVR), which has been applied to various prediction problems. This method maps the data x into a high-dimensional feature space using a non-linear mapping and performs linear regression in this space:

y(x) = w^T φ(x) + b
Here, b ∈ R, and w is found by minimizing the following objective function (ζ) with constraints:

min ζ(w, ξ) = (1/2) w^T w + (γ/2) Σ_{i=1}^{l} ξ_i^2

subject to  y_i = w^T φ(x_i) + b + ξ_i,  i = 1, …, l

where y is the observed PBR, l is the number of observations, Z = (φ(x_1), φ(x_2), …, φ(x_l)), in which φ is a mapping to some higher (possibly infinite) dimensional Hilbert space (H), ξ = (ξ_1, ξ_2, …, ξ_l)^T is a vector of slack variables, and γ is a positive real regularization parameter.

The Lagrangian function for the optimization problem is:

L(w, b, ξ; α) = ζ(w, ξ) − Σ_{i=1}^{l} α_i (w^T φ(x_i) + b + ξ_i − y_i)

where α is a vector of Lagrange multipliers. Setting the derivatives of L to zero gives the following set of linear equations:

w = Σ_{i=1}^{l} α_i φ(x_i),   Σ_{i=1}^{l} α_i = 0,   α_i = γ ξ_i,   w^T φ(x_i) + b + ξ_i = y_i

By eliminating w and ξ, one can obtain the following linear system:

[ 0    1_l^T ] [ b ]   [ 0 ]
[ 1_l  H     ] [ α ] = [ y ]

where H = K + γ^{-1} I_l and K = Z^T Z, defined elementwise as K_{i,j} = φ(x_i)^T φ(x_j) = κ(x_i, x_j), with κ(·,·) a kernel function. The solution can be found in three steps: (1) solve η and ν from Hη = 1_l and Hν = y; (2) compute s = 1_l^T η; (3) set b = (1_l^T ν)/s and α = ν − ηb.

Regressive convolution neural network and SVR (RCNN-SVR)
Zhang and Li (2018) proposed the regressive convolution neural network with support vector regression (RCNN-SVR), a hybrid model in which convolutional layers extract features from the input variables and a support vector regression model, in place of the network's output layer, performs the final regression on the extracted features.

The performance of the models is assessed with the root mean square error (RMSE), normalized mean square error (NMSE) and mean absolute percentage error (MAPE):

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (PBR_obs,i − PBR_pre,i)^2 )

NMSE = (1/n) Σ_{i=1}^{n} (PBR_obs,i − PBR_pre,i)^2 / var(PBR_obs)

MAPE = (100/n) Σ_{i=1}^{n} |(PBR_obs,i − PBR_pre,i) / PBR_obs,i|

where n is the total number of observed data, PBR_obs is the observed value of PBR, PBR_pre is the predicted value of PBR, mean(PBR_obs) is the mean of the observed PBR values (used in computing the variance) and var(PBR_obs) is the variance of the observed PBR values.
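The three evaluation indices can be sketched as:

```python
import numpy as np

def rmse(obs, pre):
    """Root mean square error."""
    return float(np.sqrt(np.mean((obs - pre) ** 2)))

def nmse(obs, pre):
    """Mean square error normalized by the variance of the observations."""
    return float(np.mean((obs - pre) ** 2) / np.var(obs))

def mape(obs, pre):
    """Mean absolute percentage error (observations must be non-zero)."""
    return float(100.0 * np.mean(np.abs((obs - pre) / obs)))
```

Because NMSE divides by the variance of the observed PBR, a value of 1 corresponds to a model no better than predicting the observed mean.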

Case study
The WDN of Joopar city is selected as the case study for pipe failure prediction. Joopar, at an altitude of 1893 m above sea level, is located about 25 km south of Kerman, Iran. It has an area of 12 km² and serves 2622 water subscribers with 51.6 km of water distribution pipes (Figures 3 and 4). The network, with a lifespan of more than 50 years, was built in its early days with asbestos cement pipes and later extended with polyethylene. In this case study, 158 cases of pipe failure for polyethylene (PE) pipes with diameters of 29.4-101.4 mm and 124 cases for asbestos cement (AC) pipes with diameters of 100-200 mm have been used as the regression model datasets, collected by the author during 2012-2019. As mentioned, diameter, length, installation depth, age, and maximum and average hydraulic pressure of pipes are considered the main variables that influence the PBR. Figure 5 gives a graphical representation of these pipe features for the burst cases.

Results and discussion
Pearson correlation coefficients between PBR and the pipe burst features have been determined to confirm the suggested relationship between the age, diameter, and maximum and average pressure of pipes and PBR. The performance of the models has been assessed by calculating several error criteria that help to find the best regression model.

Correlation coefficients
The linear relationships in the collected data are measured with Pearson correlation coefficients. The obtained results show the correlation between diameter and PBR. Based on local investigations, it has been found that old asbestos cement pipes can bear a pressure higher than the present pressure in the network. The findings show a strong positive correlation between age and PBR, verifying that PBR increases as pipes age. There are also positive correlation coefficients between both P_avg and P_max and PBR.
According to equation (1), PBR is inversely related to length, and because of the low variation of the failure statistics with length during the investigation period, a large negative correlation between length and PBR can be seen for both PE and AC pipes.
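The Pearson coefficient used throughout this section can be sketched as:

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: covariance of x and y over the product of
    their standard deviations, in [-1, 1]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum()))
```

A coefficient near +1 (e.g. age vs. PBR) indicates a strong positive linear relationship, while a value near −1 (e.g. length vs. PBR) indicates a strong inverse one.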

Evaluation of regression model performance
According to the mentioned regression techniques, data-driven pipe burst models were set up for the asbestos cement and polyethylene pipes in Joopar WDN.

Conclusion
Failure of pipes in water distribution networks (WDNs) is an inevitable event that leads to numerous issues. Prediction of pipe bursts helps to optimize budget allocation and maintenance planning. This paper compared and evaluated five artificial intelligence and machine learning methods: multivariate adaptive regression splines (MARS), M5' regression tree (M5'), least square support vector regression (LS-SVR), fuzzy regression based on c-means clustering (FCMR) and regressive convolution neural network with support vector regression (RCNN-SVR). RCNN-SVR is the most accurate prediction model, with the lowest values of RMSE, NMSE and MAPE, and can effectively predict the burst rate. The positive correlation coefficient between age and PBR is high in the approximately 50-year-old AC pipes and low in the PE pipes. The analyses also show a positive correlation between pressure and PBR for both PE and AC pipes. As length is one of the main parameters in the PBR formula, the correlation between length and PBR is evident.