Predicting turbidity and Aluminum in drinking water treatment plants using Hybrid Network (GA-ANN) and GEP

Turbidity is the most important parameter for checking the status of drinking water. It is an integrated parameter: high turbidity values indicate high values of other water quality parameters. Coagulation and flocculation are the most essential processes for removing turbidity in drinking water treatment plants. Using alum coagulants increases the residual aluminum in treated water, which has been linked to the pathogenesis of Alzheimer's disease. In this paper, a hybrid algorithm (GA-ANN) was used to predict turbidity values at the drinking water purification plant in Al Qusayr. The models were constructed from raw water data (raw water turbidity, pH, conductivity, temperature, and coagulant dose) to predict the turbidity of the water leaving the plant. Several models were built and the fitness of each was evaluated; the network with the highest fitness was selected, and a hybrid prediction network was then constructed. The selected network predicted outlet turbidity with high accuracy, with a correlation coefficient of 0.9940 and a root mean square error of 0.1078. In addition, four equations for determining residual aluminum were obtained using the gene expression method; the best equation gave very good accuracy, with RMSE = 0.02 and R = 0.9.


Introduction:
Due to population growth around the globe, the scarcity of water resources has become a serious issue in recent decades (Haghiri et al 2018). Surface water is one of the main sources of drinking water, but it is usually unsafe to use without treatment.

There are many types of drinking water purification plants, depending on the characteristics of the raw water. Still, their primary goal is to produce safe water that does not contain pathogenic microorganisms or toxic compounds.
Turbidity is an important indicator of drinking water quality (Chenyu and Haiping 2020). It is a loss of water clarity caused by the presence of suspended or colloidal particles.

The excessive presence of colloids in raw water sources can cause many problems. It may disturb the operation of the water treatment plant (WTP) and can be very costly (ABIDEEN, M 2016). Water turbidity impedes proper disinfection and increases the required disinfectant dose. Therefore, turbidity measurement is a basic test for assessing water quality and an important operational parameter for evaluating the stability of the drinking water purification plant's operation.

Suspended matter tends to settle to the bottom of the water body. However, fine particles with dimensions from 3-4 down to 0.1 microns, and colloidal particles with dimensions of 0.1-0.001 microns, remain suspended in the water.
Thus, only particles with dimensions greater than 30-50 microns are suitable for removal by natural sedimentation. Table 1 shows the gravity sedimentation period for different impurities in one meter of water at a temperature of 20℃, based on Stokes's Law.
Residual aluminum increases water turbidity and may also have health effects on consumers. Aluminum hydroxide can build up on the inside of pipes, reducing flow, and aluminum has been suggested as a causal factor in Alzheimer's disease.
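The sedimentation periods based on Stokes's Law can be approximated with a short calculation. The following is a minimal sketch, not the source of Table 1: it assumes spherical particles, a particle density of 2650 kg/m³, and water at 20℃ with a density of 998.2 kg/m³ and a dynamic viscosity of 1.002e-3 Pa·s. All function names and default values are illustrative assumptions.

```python
def stokes_settling_velocity(diameter_m, particle_density=2650.0,
                             water_density=998.2, viscosity=1.002e-3, g=9.81):
    """Terminal settling velocity (m/s) of a small sphere under Stokes's Law.

    The default densities (kg/m^3) and viscosity (Pa*s) are assumed values
    for a mineral particle in water at 20 degrees C.
    """
    return g * (particle_density - water_density) * diameter_m ** 2 / (18.0 * viscosity)


def settling_time_s(depth_m, diameter_m, **kwargs):
    """Time in seconds for a particle to settle through depth_m of still water."""
    return depth_m / stokes_settling_velocity(diameter_m, **kwargs)


# Settling time through one meter of water for a few particle diameters.
for d_um in (50, 10, 1, 0.1):
    hours = settling_time_s(1.0, d_um * 1e-6) / 3600.0
    print(f"{d_um:>5} um: {hours:.2f} h")
```

Because the velocity scales with the square of the diameter, a tenfold reduction in particle size lengthens the settling time a hundredfold, which is why fine and colloidal particles cannot be removed by natural sedimentation alone. Stokes's Law holds only for slow, laminar settling of small particles.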
-The process of coagulation and flocculation of water impurities has been used for a long time. Coagulation is the process of increasing the size of dispersed particles by using materials that are able to neutralize their charge.
Although traditional modeling has been used to describe biological processes, it has been built by writing equations related to the speed of microorganism growth, substrate consumption, and product formation. Because

Coagulation is a complicated nonlinear process because many physical and chemical variables influence it (Kim, C. M., & Parnichkun, 2017). Modeling a complex process such as those in water treatment plants is not easy, owing to the nonlinear processes involved (physical, chemical, biological, and biochemical) (S.I. Abba et al, 2019). Modeling is usually done in one of the following ways:
1-Numeric or deterministic methods.
2-Data-Driven methods: methods based on data.
Data-based techniques have received a lot of attention recently, owing to their ease of implementation in process monitoring and their fewer underlying requirements (Haghiri et al 2018). In recent years, a variety of artificial intelligence (AI) techniques have been used to model complex nonlinear water treatment processes, because they offer many benefits for data modeling when an output must be determined explicitly and accurately from an input through an unknown physical process (Kim, C, & Parnichkun, M. 2017). The Bansong drinking water treatment plant was modeled with a hybrid of k-means clustering and an adaptive neuro-fuzzy inference system (ANFIS). Raw water quality data were classified into four clusters according to their properties by k-means clustering, and the ANFIS model was then built. The research indicated that k-means-ANFIS models can be used as a robust tool during rainy seasons, which are the most challenging period of operation (RMSE = 0.0572, R = 0.9) (Kim, C, & Parnichkun, M. 2017). In another study (J. Tomperi et al), residual aluminum in drinking water was predicted using Multiple Linear Regression (MLR) and Artificial Neural Network (ANN) models.

According to that study, the variables that affected the amount of residual aluminum were raw water temperature, raw water KMnO4, and the PAC/KMnO4 (Poly-Aluminum Chloride/Potassium permanganate) ratio. Both methods gave good results, but alum was not used in the plant on which the study based its data, so its results cannot be applied to plants that use alum.
In another study (Daghbandan, A et al), polyelectrolyte, pH, turbidity, polyaluminium chloride, temperature, and electrical conductivity were used as input parameters to a GMDH model that predicted aluminum and turbidity in a drinking water plant. That study gave very good results; in the present work, the turbidity of raw water was added to the inputs.
While the methods described above can be used to predict residual turbidity and aluminum, they have drawbacks that should be considered.

Artificial neural networks have low reproducibility because their weights and biases are initialized randomly. Therefore, it is useful to use hybrid methods to improve their performance.
This study adjusted the architecture of the ANN using genetic algorithms.
Previous studies used ANN and MLR methods for residual aluminum.
Many recent studies have reported promising results with GEP, so it is appropriate for modeling in many fields because of its strong ability to model nonlinear relations. That is why it was used here to predict residual aluminum.

The raw water from the Orontes River enters the plant, and the amount of water drawn for the station is controlled at the outlet.
All chemical additives are added before the main distributor. The water is then distributed to 4 circular settling tanks, each with a diameter of 31 m, and then to the sand filters: 20 filters of the double-exposed type, which work by gravity and are equipped with a filter layer of up to 1.5 m.

-Basic components of an artificial neural network
It consists of the following basic components, or at least some of them: input layer, output layer, hidden layers, and interconnections (weights) (WILMOT, et al 2005).

Genetic Algorithms (GAs): (MELANIE, M. 1996)
A genetic algorithm is a smart algorithm that can be used to find and improve solutions to complex problems. It is an efficient search method based on the principles of natural selection and genetics.
Genetic algorithms have been successfully applied to find acceptable (near-ideal) solutions in many scientific fields, including the medical and engineering sciences, because they greatly reduce the time and effort required of program designers. The method was first proposed on the basis of Darwin's theory of evolution: it works in the same way that a population in evolution abandons undesirable members and produces genetically modified offspring. No functional relationship is assumed at the start of the method's operation.

Gene Expression
This method optimizes the model's construction as well as its components.

Natural selection and genetic recombination are the ideas that underpin these adaptive machine methods.
These computer techniques produce entities (chromosomes representing potential solutions) that learn about the given problem and, given enough time to simulate, adapt to their environments (objective function and constraints).
As opposed to conventional modeling methods, GP has the advantage of not assuming any a priori functional form of the solution. In a typical regression method, for example, the model structure is specified in advance (which is difficult to do in general) and the model coefficients are then calculated. GP uses a genetic algorithm as its basic search strategy, but differs from a standard GA in that it usually works on parse trees rather than bit strings. A parse tree is constructed using a terminal set (the problem's variables) and a function set (the simple operators that make up the function gene 2005).

The basic steps for this approach are the following:
1-Create a new mating pool from the current population by selecting solutions based on their fitness (as determined by the evaluation/objective function).
2-Apply the genetic operators (crossover, mutation) to selected mating pairs from the mating pool, resulting in offspring structures that can be incorporated into the new population.
3-Replace the current population with the new population.
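The three steps above can be sketched as a generic evolutionary loop. This is a minimal illustration rather than the implementation used in the study: the real-valued encoding, tournament selection, single-point crossover, Gaussian mutation, all rates, and the toy fitness function are assumptions chosen only to keep the example short.

```python
import random


def evolve(fitness, n_genes, pop_size=30, generations=50,
           crossover_rate=0.8, mutation_rate=0.1, seed=0):
    """Maximize `fitness` over chromosomes of n_genes values in [0, 1]."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(n_genes)] for _ in range(pop_size)]

    def tournament():
        # Pick two random individuals and keep the fitter one.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        # Step 1: build the mating pool by fitness-based selection.
        mating_pool = [tournament() for _ in range(pop_size)]
        # Step 2: apply crossover and mutation to mating pairs.
        offspring = []
        for i in range(0, pop_size - 1, 2):
            p1, p2 = mating_pool[i][:], mating_pool[i + 1][:]
            if n_genes > 1 and rng.random() < crossover_rate:
                cut = rng.randrange(1, n_genes)
                p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (p1, p2):
                for g in range(n_genes):
                    if rng.random() < mutation_rate:
                        child[g] = min(1.0, max(0.0, child[g] + rng.gauss(0.0, 0.1)))
                offspring.append(child)
        # Step 3: the new population replaces the current one.
        population = offspring
    return max(population, key=fitness)


def toy_fitness(chromosome):
    """Hypothetical objective: best when every gene equals 0.5."""
    return -sum((x - 0.5) ** 2 for x in chromosome)


best = evolve(toy_fitness, n_genes=3)
print(best, toy_fitness(best))
```

In GEP, the same loop operates on head/tail chromosomes that encode expression trees rather than on plain vectors of numbers. The sketch expects an even `pop_size`.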
Each chromosome has at least a head and a tail. The head contains function nodes and terminal nodes, whereas the tail contains only terminal nodes. The number of tail genes is linked to the number of head genes by the following equation:
$t = h(n - 1) + 1$
where t is the number of tail genes, h is the number of head genes, and n is the largest number of arguments taken by any function in the function set.
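As a small worked example of the head-tail relation, the standard GEP sizing rule sets the tail length to t = h(n - 1) + 1, where n is the largest number of arguments of any function used. The sketch below assumes that rule; the function names and the example values (a head of 10 genes and binary operators only) are illustrative and are not taken from the study.

```python
def tail_length(head_length, max_arity):
    """Tail length t = h * (n - 1) + 1 for head length h and maximum arity n."""
    return head_length * (max_arity - 1) + 1


def gene_length(head_length, max_arity):
    """Total number of positions in one gene (head plus tail)."""
    return head_length + tail_length(head_length, max_arity)


# Example: head of 10 genes with binary operators such as +, -, *, /.
print(tail_length(10, 2), gene_length(10, 2))  # 11 21
```

Sizing the tail this way guarantees that enough terminals are available to complete any expression encoded in the head, so every chromosome decodes to a valid expression.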

Methodology:
-For the prediction of turbidity, a hybrid ANN-GA model was developed.
The data were preprocessed: conflicting values, outliers, and missing values were excluded, because they hinder proper training of the neural network and greatly affect its performance.
First, a genetic algorithm was used to determine which input parameters to use.
The fitness function used to compare input patterns was the inverse of the error of a generalized regression neural network (GRNN) on the test set.
The GRNN trains rapidly and is sensitive to the explanatory inputs for the dependent variable adopted in the study. The parameters initially used as inputs were chosen according to their importance for the turbidity of the treated water. Raw water turbidity affects coagulation and settling rates. Conductivity is an online measure of water quality. Temperature affects reaction rates and settling rates. The pH determines solubility. The alum dose affects turbidity by neutralizing negatively charged particles.
After the inputs were determined, a hybrid method combining neural networks and genetic algorithms was used. The genetic algorithm was used to obtain the optimal parameters of the neural model (number of neurons and activation functions) and to accelerate convergence by giving the network appropriate initial weights before training, which makes training more efficient and improves generalization capacity.
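The GRNN-based fitness used for input selection can be sketched as follows. This is a hedged illustration, not the code used in the study: a GRNN is implemented here as Gaussian kernel regression, the smoothing parameter `sigma` is an assumed value, RMSE is assumed as the error measure, and the inputs are assumed to be scaled to comparable ranges beforehand. The function names are hypothetical.

```python
import numpy as np


def grnn_predict(x_train, y_train, x_query, sigma=0.1):
    """GRNN output: Gaussian-weighted average of the training targets."""
    # Squared Euclidean distance between every query and training point.
    d2 = ((x_query[:, None, :] - x_train[None, :, :]) ** 2).sum(axis=2)
    weights = np.exp(-d2 / (2.0 * sigma ** 2))
    return (weights @ y_train) / np.maximum(weights.sum(axis=1), 1e-12)


def subset_fitness(mask, x_train, y_train, x_test, y_test, sigma=0.1):
    """Inverse of the GRNN test-set RMSE for the inputs switched on in mask."""
    cols = np.flatnonzero(mask)
    if cols.size == 0:
        return 0.0  # a chromosome that selects no inputs is given zero fitness
    pred = grnn_predict(x_train[:, cols], y_train, x_test[:, cols], sigma)
    rmse = float(np.sqrt(np.mean((pred - y_test) ** 2)))
    return 1.0 / (rmse + 1e-12)
```

Each GA chromosome is then a binary mask over the candidate inputs, and masks that keep the informative inputs receive a higher fitness. A GRNN suits this role because it needs no iterative training, so many candidate masks can be scored quickly.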
The work was done in MATLAB 2012a. A feedforward network was created with two layers in all cases: one hidden layer and an output layer containing a single neuron representing the predicted turbidity of the water leaving the station. The bias was treated as the weight of an additional constant input, and the number of hidden neurons was assumed to range between 10 and 50. Table 5 shows the parameters of the ANN-GA used in the study. To improve the initial weights of the network, a primitive population consisting of a set of chromosomes is first generated at random. The fitness function is then evaluated for each individual, the best individuals are selected, the genetic operators (crossover and mutation) are applied to them, and the evaluation is repeated for the members of the new generation. Once the ideal weights are obtained, the artificial neural network is trained using backpropagation with the Levenberg-Marquardt (LM) algorithm, and the mean square error is calculated and compared with the permissible error. The process stops when the number of generations is exhausted or the permissible error is reached, and the best architecture and best RMSE are then reported.
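To make the chromosome encoding concrete, the sketch below shows one possible way to decode two genes into a hidden-layer size between 10 and 50 and an activation function, and to count the weights that the GA would have to initialize. The encoding, the function names, and the list of candidate activation functions are assumptions for illustration; the study does not state which encoding or transfer functions were used, and its implementation was in MATLAB rather than Python.

```python
# Candidate transfer functions: MATLAB names, listed here only as an assumption.
ACTIVATIONS = ("tansig", "logsig", "purelin")


def decode_architecture(genes, min_neurons=10, max_neurons=50):
    """Map two genes in [0, 1] to (hidden neurons, hidden activation)."""
    neurons = min_neurons + round(genes[0] * (max_neurons - min_neurons))
    index = min(int(genes[1] * len(ACTIVATIONS)), len(ACTIVATIONS) - 1)
    return neurons, ACTIVATIONS[index]


def n_weights(n_inputs, n_hidden, n_outputs=1):
    """Weights of a one-hidden-layer network, counting each bias as the
    weight of an additional constant input."""
    return (n_inputs + 1) * n_hidden + (n_hidden + 1) * n_outputs


neurons, activation = decode_architecture([0.375, 0.1])
print(neurons, activation, n_weights(5, neurons))  # 25 tansig 176
```

Under this encoding, a network with 5 inputs, 25 hidden neurons, and 1 output has 176 adjustable weights, which would be the length of the weight part of a chromosome if the GA also supplies the initial weights.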
-The data set used in the study comprised 300 observations: 70% were randomly selected as the training set, 15% as the validation set, and 15% as the testing set. The number of iterations was 500 and each network was retrained 5 times in order to obtain the best neural network architecture. The training results for the different networks are shown in Fig. 3.
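The random 70/15/15 partition of the 300 observations can be reproduced with a few lines. This is a minimal sketch under the assumption of a simple random permutation; the function name and the seed are illustrative, and the study itself performed the split in MATLAB.

```python
import numpy as np


def split_indices(n_samples, train_frac=0.70, val_frac=0.15, seed=0):
    """Randomly partition sample indices into training, validation, and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(round(train_frac * n_samples))
    n_val = int(round(val_frac * n_samples))
    return order[:n_train], order[n_train:n_train + n_val], order[n_train + n_val:]


train_idx, val_idx, test_idx = split_indices(300)
print(len(train_idx), len(val_idx), len(test_idx))  # 210 45 45
```

Fixing the seed makes the partition repeatable, which helps when comparing candidate architectures on the same data.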

Figure 3. MSE Results of various networks
From the above figure, the best architecture for the neural network is (5-25-1), as it has the lowest MSE value. The next step was to train the best network with the LM algorithm, which showed good network performance in the training phase.

Gene Expression models:
In the first stage of model building, the data were entered and a set of functions was selected to create the primary chromosomes. For this, 70% of the data available from the drinking water purification plant in Quseir was used to train the model.
The remaining 30% of the data was used for validation.
In the next stage, the fitness function was chosen; the RMSE was used to evaluate the efficiency of the model.
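The two statistics reported throughout the paper, the root mean square error (RMSE) and the correlation coefficient (R), can be computed as in the sketch below. The function names are illustrative, and R is assumed to be the Pearson correlation coefficient.

```python
import numpy as np


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))


def pearson_r(observed, predicted):
    """Pearson correlation coefficient between observed and predicted values."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.corrcoef(observed, predicted)[0, 1])
```

A lower RMSE indicates a closer fit, so a GEP run that uses RMSE as its fitness criterion favors equations with smaller residuals on the training data.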

Results and Discussion:
The best architecture for the neural network was (5-25-1), trained with the LM algorithm.
The statistical measures are shown in Table 6. There is good agreement between the observed values and those produced by the neural model; the (5-25-1) network was able to predict the maximum and minimum values in the series.

Figure 7. Frequency distribution of errors.
A sensitivity analysis was also conducted on the effect of omitting each network input (Alum dose, T, pH, Conductivity, Tur-in) on the prediction of outlet turbidity (Tur-out). The results in Table 7 indicate that the most influential input is the coagulant dose, followed by inlet turbidity, temperature, conductivity, and pH, respectively.

For the residual aluminum models, the input parameters were selected manually. The first models were built using the inputs: Then different models were built by deleting one or two of the input parameters. The best model was Model 1, with the following parameters: turbidity, conductivity, temperature, alum dose, and pH.
+ Turbidity ((conductivity − 100.607) × ((t − pH) − 4.55))) + Conductivity
As shown, models with the parameters of Model 1 are able to predict residual aluminum with very good accuracy.
Compared with previous studies, the GEP method is clearly very good at simulating the drinking water treatment process, and it should be used to model and predict other parameters related to drinking water.

Data-based modeling methods offer valuable tools for process modeling and control in drinking water treatment plants and provide a good alternative to conventional methods based on numerical simulation. Many articles have studied modeling methods for predicting the coagulant dose, but few have studied the prediction of turbidity and residual aluminum.

Articles that addressed turbidity have used k-means clustering with an adaptive neuro-fuzzy inference system (Kim, C. M., & Parnichkun, 2017) and GMDH (Daghbandan, A et al). In this research, a GA was used to obtain the optimal structure of an artificial neural network for predicting the turbidity of the treated water, since turbidity is a key parameter for evaluating the stability of the plant process.
To this end, a GA with a specific encoding scheme was applied to the evolutionary design of the neural networks used in predicting turbidity.
The selected parameters were similar to those of an earlier study (Kim, C. M., & Parnichkun, 2017), with the alum dose added as an input.
The input parameters have a significant effect on model accuracy; the parameters with the greatest effect on turbidity were alum dose, raw water turbidity, temperature, conductivity, and pH, respectively. A GUI was also developed for ease of use by different operators.
The study also produced good prediction models for residual aluminum, created using the gene expression method with a few important variables.
In contrast to another article (J. Tomperi et al), MLR could not give a good equation in this case.
The residual aluminum in drinking water plants can be predicted with very good accuracy using the resulting nonlinear equation. The results differ when different inputs are used; the best inputs were alum dose, raw water turbidity, temperature, conductivity, and pH, with very similar results when conductivity was omitted.
The prediction models could be improved by using a different selection of variables or different modeling methods. Nevertheless, the results of this study are promising, and the models obtained can be used as an early warning tool to give information about the water treatment plant.

Acknowledgments
This work was supported by Alquser Drinking Water Treatment plant.

Data availability
The datasets of the current study are available from the authors upon reasonable request.