Imputation of Missing Meteorological Data with Evolutionary and Machine Learning Methods (Case Study: Long-term Monthly Precipitation and Temperature of Mashhad)

Document Type: Research Article


1 Ferdowsi University of Mashhad

2 Ferdowsi University of Mashhad


Introduction: Temperature and precipitation are two of the main variables in meteorology and climatology, and they are basic inputs to water resource management. The length of the record plays a pivotal role in the accurate analysis of these variables. Observations from Iran's first synoptic station, beginning in 1330 (1951), are available on the Iranian Meteorological Organization website. Historical monthly precipitation and temperature records for five stations in Iran (Mashhad, Isfahan, Tehran, Bushehr, and Jask) are available from 1880 onward, but with gaps. These data were measured by the embassies of the United States and Britain from the Qajar period onward and were recorded in the World Weather Records books. Most of the missing monthly values fall in the period around World War II (1941-1949). Because these gaps shorten the usable record, estimating the missing values accurately is very important. The current research aimed to estimate the missing monthly temperature and precipitation values at the Mashhad station. Stations in neighboring countries were selected as predictor variables on the basis of their distance to Mashhad, the strength of their relationship with the Mashhad record, and the completeness of their data since 1880. Monthly precipitation at Ashgabat, Sarakhs, Kooshkah, Bayram Ali, Kerki, and Repetek in Turkmenistan was used as the set of independent variables for reconstructing the missing rainfall at Mashhad. Likewise, the temperatures of Ashgabat, Bayram Ali, Gudan, Sarakhs, and Tajan were used to restore the monthly temperature of the Mashhad station. Ten multiple regression models were fitted to the monthly rainfall of the Mashhad station and six to its monthly temperature; the parameters of these models were then optimized with a genetic algorithm and an ant colony algorithm. In addition, an artificial neural network (multilayer perceptron, MLP) and support vector regression were implemented to simulate the monthly precipitation and temperature data of Mashhad.
Materials and Methods: In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables (or "predictors"). The genetic algorithm (GA) is a metaheuristic inspired by the process of natural selection and belongs to the larger class of evolutionary algorithms (EA). Genetic algorithms are commonly used to generate high-quality solutions to optimization and search problems by relying on bio-inspired operators such as mutation, crossover, and selection. The ant colony optimization algorithm (ACO) is a probabilistic technique for solving computational problems that can be reduced to finding good paths through graphs. It belongs to the family of ant colony algorithms within swarm intelligence methods and is a metaheuristic optimization technique. Artificial neural networks are one of the main tools used in machine learning. As the "neural" part of their name suggests, they are brain-inspired systems intended to replicate the way humans learn. Neural networks consist of input and output layers and, in most cases, one or more hidden layers whose units transform the input into something the output layer can use. They are well suited to finding patterns that are too complex or too numerous for a human programmer to extract and teach the machine to recognize. In machine learning, support vector machines (SVMs, also called support vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier (although methods such as Platt scaling exist to use SVM in a probabilistic classification setting). Support vector regression (SVR), the variant used in this study, applies the same principles to the prediction of continuous values.
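The idea of refining regression coefficients with a genetic algorithm can be illustrated with a minimal sketch. It is not the authors' implementation: the data are hypothetical, and the function names (`rmse`, `predict`, `genetic_refine`), the operators (truncation selection, uniform crossover, Gaussian mutation), and all parameter values are assumptions chosen only for illustration.

```python
import math
import random


def rmse(observed, predicted):
    """Root mean square error between two equal-length sequences."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def predict(coeffs, rows):
    """Linear model: coeffs[0] is the intercept, coeffs[1:] weight each neighbor station."""
    return [coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], row)) for row in rows]


def genetic_refine(start, rows, target, generations=200, pop_size=40, sigma=0.5, seed=0):
    """Refine coefficients with a small elitist genetic algorithm (illustrative settings)."""
    rng = random.Random(seed)

    def fitness(coeffs):
        return rmse(target, predict(coeffs, rows))

    # Initial population: the starting coefficients plus randomly perturbed copies.
    population = [list(start)]
    population += [[g + rng.gauss(0, sigma) for g in start] for _ in range(pop_size - 1)]

    for _ in range(generations):
        population.sort(key=fitness)
        parents = population[: pop_size // 2]  # selection: keep the better half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            # Mutation: each gene is perturbed with probability 0.2.
            child = [g + rng.gauss(0, sigma) if rng.random() < 0.2 else g for g in child]
            children.append(child)
        population = parents + children

    return min(population, key=fitness)


# Hypothetical monthly totals (mm) at two neighbor stations and at the target station.
neighbors = [(10.0, 12.0), (0.0, 1.0), (25.0, 30.0), (5.0, 4.0), (40.0, 35.0), (15.0, 20.0)]
target = [14.0, 2.0, 33.0, 6.0, 44.0, 21.0]
start = [0.0, 0.5, 0.5]  # rough initial guess standing in for fitted regression coefficients

best = genetic_refine(start, neighbors, target)
print("RMSE before:", round(rmse(target, predict(start, neighbors)), 3))
print("RMSE after: ", round(rmse(target, predict(best, neighbors)), 3))
```

Because the better half of each generation is carried over unchanged, the best RMSE in the population can never get worse than that of the starting coefficients. An ant colony variant would replace the crossover and mutation steps with pheromone-guided sampling of candidate coefficients.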
Results and Discussion: In the first stage, several multiple regression models were fitted to monthly precipitation (with fit coefficients ranging from 0.63 to 0.81) and six to monthly temperature (0.986-0.993). GA and ACO were then applied to improve the accuracy of the selected regression models by optimizing their parameters. In the next stage, ANN and SVR were each used separately to estimate the missing monthly values. Finally, the results of the previous stages were compared using the root mean square error (RMSE), and the best models were applied to determine the missing monthly temperature and precipitation values of Mashhad. The results showed that the genetic algorithm and the ant colony algorithm increase the accuracy of the estimated missing rainfall data significantly more than the other methods. The lowest RMSE among the regression models is 9.8 mm. The genetic algorithm reduces this criterion to 2.56 mm, and the ant colony algorithm reduces it to 2.559 mm.
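The final step, comparing candidate models by RMSE and using the best one to fill the gaps, can be sketched as follows. This is a simplified illustration under stated assumptions: the series, the candidate predictions, and the names `select_and_impute`, `model_a`, and `model_b` are hypothetical, and the RMSE is computed on all observed months rather than on a held-out validation set.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between two equal-length sequences."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def select_and_impute(series, candidates):
    """Pick the candidate with the lowest RMSE on observed months and fill the gaps with it.

    series:     observed values, with None marking a missing month
    candidates: mapping of model name -> predictions for every month
    """
    known = [i for i, value in enumerate(series) if value is not None]
    observed = [series[i] for i in known]
    scores = {
        name: rmse(observed, [pred[i] for i in known])
        for name, pred in candidates.items()
    }
    best = min(scores, key=scores.get)
    # Observed months are kept; only the gaps take the best model's prediction.
    filled = [candidates[best][i] if value is None else value
              for i, value in enumerate(series)]
    return best, scores, filled


# Hypothetical monthly precipitation (mm) with two gaps, and two candidate models.
series = [12.0, None, 30.0, 8.0, None, 20.0]
candidates = {
    "model_a": [10.0, 5.0, 33.0, 9.0, 15.0, 18.0],
    "model_b": [12.5, 4.0, 29.0, 8.5, 17.0, 20.5],
}

best, scores, filled = select_and_impute(series, candidates)
print(best, filled)
```

Here `model_b` has the smaller error on the four observed months, so its predictions fill the two gaps while the observed values are left untouched.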
Conclusion: Comparing the above methods for restoring temperature and precipitation shows that the evolutionary methods (GA and ACO) are best for estimating missing monthly precipitation, whereas the machine learning methods (ANN and SVR) are best for imputing missing monthly temperature data.


Volume 33, Issue 2 - Serial Number 64
May and June 2019
Pages 361-377
  • Receive Date: 14 July 2018
  • Revise Date: 14 January 2019
  • Accept Date: 28 April 2019
  • First Publish Date: 22 June 2019