Year: 2020 | Volume: 10 | Issue: 5 | Page: 222-229
Present status and future forecast of COVID-19 in India using time series modelling
Rohit Patawa1, Pramendra Singh Pundir1, Puneet Kumar Gupta2
1 Department of Statistics, University of Allahabad, Allahabad, Uttar Pradesh, India
2 ICFAI Business School (IBS), The ICFAI University, Dehradun, Uttarakhand, India
Date of Submission: 30-Apr-2020
Date of Decision: 31-Aug-2020
Date of Acceptance: 30-Sep-2020
Date of Web Publication: 29-Oct-2020
Dr. Puneet Kumar Gupta
ICFAI Business School (IBS), The ICFAI University, Dehradun, Uttarakhand
Source of Support: None, Conflict of Interest: None
Background and Aim: The aim of this study is to analyse and forecast the coronavirus disease 2019 (COVID-19) cases in India, which may help the government and residents of India to mitigate the effect of COVID-19.
Material and Methods: Univariate time series modelling has been used to forecast COVID-19 cases in India. The models are built to predict the number of confirmed cases, recovered cases, and deaths based on the data available from 7th March, 2020 to 28th August, 2020, with forecasts evaluated from 30th June, 2020 onwards. The behaviour of the forecasts is discussed at each round and compared with the original values using error measures such as mean absolute percentage error and root mean squared error.
Results: Among the models used for fitting, namely autoregressive integrated moving average and exponential smoothing with additive or multiplicative trend, exponential smoothing with multiplicative error and multiplicative trend was found to be more appropriate than the others. More importantly, at all stages of the forecast, the overall forecast error was <5%, which indicates good forecast accuracy.
Conclusions: The present study may be valuable for Indian governments in the direction of making policies and accordingly taking suitable actions to mitigate the spread of COVID-19 cases in India.
Keywords: Autoregressive integrated moving average, COVID-19, exponential smoothing, forecast
|How to cite this article:|
Patawa R, Pundir PS, Gupta PK. Present status and future forecast of COVID-19 in India using time series modelling. Curr Med Res Pract 2020;10:222-9
| Introduction|| |
Coronavirus disease 2019 (COVID-19) is an infectious disease caused by the most recently discovered member of the coronavirus family, a group of viruses that may cause illness in animals or humans. In humans, several coronaviruses are known to cause respiratory infections ranging from the common cold to more severe diseases such as Middle East respiratory syndrome (MERS) and severe acute respiratory syndrome (SARS). The most recently discovered coronavirus is SARS-CoV-2. This new virus and the disease it causes were unknown before the outbreak in Wuhan, China, in late December 2019, and the disease had affected >225 countries and territories around the world as of 28th August, 2020. Previous calamities mostly affected smaller regions and diminished gradually or stopped with time, whereas the speed and extent of the spread of this coronavirus were largely unanticipated. Despite great developments in medicine and technology, scientists and pharmaceutical companies are still in the process of developing effective drugs and vaccines.
On 11th March, 2020, the WHO declared the outbreak a pandemic; the same was then announced in India, where more than a dozen states invoked the Epidemic Diseases Act, 1897. On 22nd March, 2020, India observed a 14-hour voluntary Janata Curfew, followed by a 21-day lockdown of the entire nation as announced by the Indian Prime Minister, Mr. Narendra Modi. As of 28th August, 2020, India had 3,387,500 confirmed cases and 742,023 active cases, and had reported 61,529 deaths. The lockdown was imposed not with the idea of eradicating the virus, but as a measure to create a second window of opportunity to contain it by deploying, training and upgrading the health-care facilities in the country to treat and isolate patients.
During the battle against COVID-19 in India, health and clinical researchers working on data-based modelling have also played an important role by forecasting the number of cases, beds and health equipment that may be needed in the future to mitigate the spread of the virus. In this regard, numerous studies have so far estimated key parameters of the spread of COVID-19, such as the number of confirmed cases, doubling time and mortality rate, using a number of statistical methods. Recently, specific techniques have been proposed to identify untraced contacts, undetected international cases, or the actual number of infected cases and the extent of the spread, based on analytical tools. In view of the continuous improvement in health facilities and national measures related to the spread of COVID-19, several scholars have analysed the effects of such changes by means of statistical reasoning and stochastic simulation.
In these statistical studies, forecasting based on time series modelling has received comparatively little attention, although all of them have contributed to a more detailed picture of the epidemic. In view of the above, the present study focuses on forecasting COVID-19 cases in India, which can help the government and residents of India to mitigate the effect of COVID-19. Univariate time series modelling has been used to forecast the COVID-19 cases in India. The models are built to predict the number of confirmed cases, recovered cases, and deaths based on the data available from 7th March, 2020 to 28th August, 2020, with forecasts evaluated from 30th June, 2020 onwards.
| Methods|| |
The present study mainly highlights the analysis of cumulative confirmed cases of COVID-19 in India. There are three main variables of interest, each observed on a daily basis: total infected cases, total recovered cases and the total number of deaths. The data were retrieved from the repository of the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (https://github.com/CSSEGISandData/COVID-19, accessed on August 29, 2020). The daily testing data were retrieved from the daily COVID-19 testing updates for India published by ICMR (https://www.icmr.gov.in/, accessed on August 28, 2020).
The dataset contains information on all three variables for global cases (starting from 22nd January, 2020), from which we extracted the data for India only. As the number of cases started increasing rapidly from about 7th March, 2020, the analysis covers the period from 7th March, 2020 to 28th August, 2020. Apart from the total number of infected cases, the present study also examines the behaviour of the total number of recovered cases and the total number of deaths in India. All three variables are plotted in [Figure 1]; it can be observed that all three increase approximately exponentially with time.
|Figure 1: Cumulative confirmed, deaths and recovered cases from COVID-19 in India on daily basis|
Two methods of time series modelling, namely autoregressive integrated moving average (ARIMA) modelling and exponential smoothing (ES), have been adopted to analyse and forecast the total number of infected cases of COVID-19. Before describing the methodology, we give a basic description of the time series models used in the prediction:
Autoregressive integrated moving average model
ARIMA models can capture a versatile range of patterns, including trend-stationary and unit-root (difference-stationary) behaviour, and are among the most commonly used techniques for modelling and forecasting time series data. The AR part of ARIMA stands for autoregressive, i.e., a regression of the variable of interest on its own lagged values. The MA part, i.e., moving average, represents the current value of the variable of interest as a linear combination of current and lagged error terms, whereas the integrated (I) part refers to differencing the variable of interest to handle nonstationarity in the data. In general, nonseasonal ARIMA models are denoted by ARIMA (p, d, q), where p, d and q are non-negative integers denoting the order of the autoregressive part, the degree of differencing and the order of the moving average part, respectively. Let $Y_t$ be the time series observed at index (time) t; then the ARIMA (p, d, q) model can be written as:
$$\nabla^d Y_t = \sum_{i=1}^{p} \alpha_i \nabla^d Y_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}$$
where $\alpha_i$, i = 1, 2, ..., p, and $\theta_j$, j = 1, 2, ..., q, denote the coefficients of the autoregressive and moving average parts, respectively, $\varepsilon_t$ are the error terms, and $\nabla^d$ denotes differencing the time series d times.
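As a minimal sketch of the "integrated" part of the model, the following Python function applies first-order differencing d times to a series. The function name `difference` and the sample values are illustrative assumptions and are not taken from the analysis reported here.

```python
def difference(series, d=1):
    """Apply first-order differencing d times (the 'I' in ARIMA)."""
    values = list(series)
    for _ in range(d):
        # Each pass replaces the series by successive differences y[t] - y[t-1].
        values = [curr - prev for prev, curr in zip(values, values[1:])]
    return values


# Hypothetical cumulative counts with a quadratic shape: two passes of
# differencing leave a constant series.
cumulative = [1, 4, 9, 16, 25]
print(difference(cumulative, d=1))  # [3, 5, 7, 9]
print(difference(cumulative, d=2))  # [2, 2, 2]
```

Each pass shortens the series by one observation, which is why a model with d = 2 effectively works with two fewer data points.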
Exponential smoothing
ES can capture a variety of forecasting patterns depending on the nature of the trend and seasonality (i.e., whether they are additive or multiplicative). Along with ARIMA modelling, ES is one of the most commonly used techniques for smoothing time series, and it is also used in signal processing. It is an extension of numerical analysis that was first suggested by Robert Goodell Brown for forecasting in the statistical literature and later extended by Holt. Let $\{y_t\}$ be the sequence of time series data; the simple ES method is written as:
$$s_0 = y_0$$
$$s_t = \alpha y_t + (1-\alpha)\, s_{t-1}, \quad t > 0$$
where $s_t$ is the smoothed value (the output of the ES fit) at time t and α is the smoothing factor, 0 ≤ α ≤ 1. Further extensions have been developed to handle trend and seasonality in data; these are known as double ES, or Holt's trend method, and triple ES, or the Holt-Winters seasonal method, respectively. Both components (trend and seasonality) can be multiplicative or additive in nature. The ES family is especially suitable for short series and has shown good forecast accuracy over several forecasting problems. ARIMA and ES methods can handle nonstationary and nonlinear data, respectively.
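The simple ES recursion can be sketched in a few lines of Python. This is an illustration only: the function name `simple_exponential_smoothing` and the sample values are assumptions, and the smoothing factor is supplied by hand rather than estimated from data.

```python
def simple_exponential_smoothing(y, alpha):
    """Return the smoothed series s, where s[0] = y[0] and
    s[t] = alpha * y[t] + (1 - alpha) * s[t-1] for t > 0."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if not y:
        return []
    smoothed = [float(y[0])]
    for value in y[1:]:
        # Weighted average of the new observation and the previous smoothed value.
        smoothed.append(alpha * value + (1.0 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical observations, smoothed with alpha = 0.5.
print(simple_exponential_smoothing([10, 20, 30], alpha=0.5))  # [10.0, 15.0, 22.5]
```

A value of α close to 1 makes the smoothed series follow the latest observations closely, whereas a value close to 0 gives more weight to the past.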
| Results|| |
Before fitting the models to the dataset, some basic assumptions were checked. It can be seen from [Figure 1] that all the variables increase exponentially. The plot of the daily arrival of new cases against their growth rate suggested the use of the integrated term in ARIMA and of multiplicative methods in ES. The data were also found to be non-seasonal after plotting the autocorrelation function (ACF) and partial ACF and observing their behaviour over lag periods. The log-likelihood, Akaike information criterion and Bayesian information criterion, as well as some other measures (mean absolute percentage error [MAPE] and root mean squared error [RMSE]), were used to select the better forecasting model. In every round, three forecasting methods were employed (ARIMA, ES with additive trend and ES with multiplicative trend). Cross-validation was also carried out by splitting the available points into train and test sets to check the performance of the fitted models.
Initially, all the models were fitted to the first 115 data points and used to predict fifteen data points ahead. The performance measures were calculated for both the train and test sets to find the best model; then the next fifteen data points were added to the training set and the following fifteen data points were predicted, and so on. Finally, a 15-day-ahead forecast was made starting from the day after the final available data point, i.e., from 29th August, 2020. Among all the models fitted, ES with multiplicative error and multiplicative trend was found to be more appropriate than the others. The fitted models with their respective estimated parameters and the performance measures in each round of forecast are presented in [Table 1]. The behaviour of the forecasts is discussed at each round and compared with the original values (where available). Error measures (MAPE and RMSE) are also given in the table. The round-wise details of the forecasts are as follows:
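For reference, the two error measures named above can be computed as in the following Python sketch. The function names and the sample values are illustrative assumptions, not figures from this study.

```python
import math


def mape(actual, forecast):
    """Mean absolute percentage error, expressed in percent."""
    ratios = [abs(a - f) / abs(a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(ratios) / len(ratios)


def rmse(actual, forecast):
    """Root mean squared error, in the units of the data."""
    squared = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical actual and forecasted values.
actual = [100, 200]
forecast = [110, 190]
print(round(mape(actual, forecast), 2))  # 7.5
print(round(rmse(actual, forecast), 2))  # 10.0
```

MAPE is scale-free, which makes it comparable across rounds as the cumulative counts grow, whereas RMSE is expressed in numbers of cases and therefore tends to grow with the level of the series.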
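The expanding-window scheme described above can be sketched as follows. The helper `rolling_origin_splits` is a hypothetical name, and the total of 175 daily points is an assumption obtained by counting the days from 7th March to 28th August 2020 inclusive.

```python
def rolling_origin_splits(n_points, initial_train, horizon):
    """Return (train_end, test_end) index pairs for an expanding window.

    For each pair, the training set is y[:train_end] and the test set is
    y[train_end:test_end].
    """
    splits = []
    train_end = initial_train
    while train_end + horizon <= n_points:
        splits.append((train_end, train_end + horizon))
        # Fold the evaluated points into the training set for the next round.
        train_end += horizon
    return splits


# 175 daily points, 115 initial training points, 15-day forecast horizon.
print(rolling_origin_splits(175, 115, 15))
# [(115, 130), (130, 145), (145, 160), (160, 175)]
```

The four pairs correspond to the four rounds that can be compared with observed values; the fifth round uses all 175 points for fitting and has no test set.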
First round of forecasts: 30th June, 2020 to 14th July 2020
For the first round of forecasts, 115 data points from 7th March to 29th June, 2020, were analysed. The forecasts for fifteen points ahead from 30th June, 2020 (with their 90% prediction intervals) are presented in [Figure 2] in brown colour (intervals in grey colour); the values on the y-axis are log-scaled. The RMSE for this forecast period is about 18,552 and the MAPE is around 2%. All the forecasted points are close to their respective original values, and the estimates are positively biased. The mean forecast of confirmed cases 15 days ahead, i.e., on 14th July 2020, is 972,549, with a 90% prediction interval ranging from around 4 lakhs to 19 lakhs. The actual number of confirmed cases on 14th July 2020 was 936,181; hence, the observed forecast error for this point is 36,368 (an absolute percentage error of about 4%). It can be seen that the forecast is positively biased for this time point and that the actual number of cases lies within the 90% prediction interval. During this period, the daily arrival of new cases rose from around 18,000 cases per day on 30th June 2020 to around 29,000 on 14th July 2020 (from [Figure 3]). Over the same period, the number of COVID-19 tests per day also increased from 2 lakhs to 3 lakhs.
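The final-day figures quoted for this round can be checked directly. The symbols $e$ (forecast error), $\hat{y}$ (mean forecast), $y$ (actual value) and APE (absolute percentage error) are introduced here only for illustration:

```latex
e = \hat{y} - y = 972{,}549 - 936{,}181 = 36{,}368,
\qquad
\mathrm{APE} = \frac{|e|}{y} \times 100 = \frac{36{,}368}{936{,}181} \times 100 \approx 3.9\%
```

A positive $e$ corresponds to an over-forecast, i.e., the positive bias described for this time point.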
|Figure 2: Cumulative actual confirmed cases of COVID-19 in India, with forecast and prediction intervals (log-scaled y-axis)|
Second round of forecasts: July 15, 2020 to July 29, 2020
For this round of forecasting, we increased the number of data points used for model fitting by taking data up to 14th July 2020, giving 130 data points. Again, on the basis of the fitted model, 15-day-ahead forecasts (with their 90% prediction intervals) were obtained and plotted in [Figure 2] in green and yellow colour, respectively. The RMSE and MAPE for this 15-day forecast period are 27,330 and around 1.7%. The mean forecasted value for 29th July 2020 is 1,538,997, whereas the actual number of confirmed cases on this date was 1,581,963 (an absolute percentage error of about 2.7%); the forecast is negatively biased but close to the true value. For this round, all the forecasted values are close to, and negatively biased relative to, their respective actual values, with absolute percentage errors of around 1%–3%. During this period, the rate of arrival of new positive cases was around 50,000 cases per day (from [Figure 3]). From [Table 1], it can be seen that the trend coefficient (the value of beta for the multiplicative trend) is smaller than in the previous round, which shows that the growth in the arrival of new cases was slightly lower than in the previous time span but still multiplicative in nature.
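To illustrate what a multiplicative trend means in practice, the sketch below implements the textbook point-forecast recursions for exponential smoothing with a multiplicative trend. It is not the fitting procedure used in this study: the function name, the initial values and the hand-picked smoothing parameters are all assumptions, whereas a full model would estimate the parameters and initial states from the data.

```python
def holt_multiplicative_trend(y, alpha, beta, horizon):
    """Point forecasts from exponential smoothing with a multiplicative trend.

    Recursions (level l, growth ratio b):
        l[t] = alpha * y[t] + (1 - alpha) * l[t-1] * b[t-1]
        b[t] = beta * (l[t] / l[t-1]) + (1 - beta) * b[t-1]
    The h-step-ahead forecast is l[T] * b[T] ** h.
    """
    if len(y) < 2:
        raise ValueError("need at least two observations")
    # Assumed initial states: first observation and first observed growth ratio.
    level, growth = float(y[0]), y[1] / y[0]
    for value in y[1:]:
        previous_level = level
        level = alpha * value + (1.0 - alpha) * previous_level * growth
        growth = beta * (level / previous_level) + (1.0 - beta) * growth
    return [level * growth ** h for h in range(1, horizon + 1)]


# Hypothetical series growing by exactly 10% per step: the forecasts
# continue the same geometric growth.
series = [100.0, 110.0, 121.0, 133.1]
forecasts = holt_multiplicative_trend(series, alpha=0.8, beta=0.2, horizon=2)
print([round(f, 2) for f in forecasts])  # [146.41, 161.05]
```

In this formulation the forecast is multiplied by the estimated growth ratio at every step, so even a small change in that ratio compounds over a 15-day horizon.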
Third round of forecasts: July 30, 2020 to August 13, 2020
The third set of forecasts with their prediction intervals was produced using the data set up to 29th July 2020, comprising 145 data points. Forecasts and their 90% prediction intervals are depicted in [Figure 2] in blue and sky blue colour, respectively. The RMSE of the forecasts for this 15-day period is 29,385 and the MAPE is about 1%. Furthermore, the actual cases lie within the prediction intervals. We observe that the actual values for this period closely follow the mean forecast and that almost all the forecasts are positively biased relative to the respective actual values. At the end of this period (i.e., on 13th August 2020), we recorded a positive error of around 61,000 cases (an absolute percentage error of about 2.5%). From [Figure 3], around 53,000 new cases were recorded on 30th July 2020, around 300 more than on the previous date; during this 15-day period, the daily arrival of new cases increased rapidly to 65,000 cases per day, but the growth of new cases was irregular. The number of tests per day increased from 6.5 lakhs to 8.5 lakhs, but this increase was also irregular, which drove the growth of new cases in this period.
Fourth round of forecasts: 14th August 2020 to 28th August 2020
In this round of forecasts, 160 data points, up to 13th August, 2020, were used for analysis. The 15-day-ahead forecasts and prediction intervals are shown in [Figure 2] in purple and green colour, respectively. The forecasts are positively biased, and the biases are larger than in the previous rounds, with an RMSE of about 51,943 and a MAPE of 1.4%. The mean forecast for 28th August 2020 is around 35.5 lakhs, whereas the actual number of confirmed cases on this date was 34.5 lakhs. The absolute percentage error on this date is about 3%, which is higher than in the previous rounds. At the start of this 15-day span, the daily arrival of new cases was 65,000, which rose to 77,000 cases per day on the last day of the period. However, during this interval, the growth in the arrival of new cases was not regular: it first decreased to 55,000 cases per day, then started increasing again and ended with a daily number of new cases >75,000. Furthermore, the number of tests conducted per day varied irregularly between around 7.5 lakhs and 10 lakhs. Hence, almost all the forecasts are positively biased, and the bias increased with time, reaching around 84,000 cases on the last day of the forecast, i.e., 28th August, 2020. Still, we observe that all the actual values lie within the prediction interval.
Fifth round of forecasts: 29th August 2020 to 12th September 2020
The final set of forecasts and prediction intervals was produced using the most recent data available, up to 28th August 2020. In the fitted model, the slope is higher than in the previous fitted models with multiplicative trend. These forecasts and their 90% prediction intervals are presented in [Figure 2] in red and pink colour, respectively. The forecast for the next day, i.e., 29th August 2020, is around 34.4 lakhs cases, with a 90% prediction interval ranging from about 33 lakhs to 37 lakhs; for the last day of this round, i.e., 12th September 2020, the mean forecast is about 47 lakhs, with a 90% prediction interval ranging from about 27 lakhs to about 83 lakhs. In the last 5 days of the previous round, the daily arrival of new cases was between 67,000 and 78,000, so if daily new cases continue to increase with the same trend, the actual cases will be close to the forecasted values. If the daily number of tests increases, it is most likely that the arrival of new cases will also increase, in line with its previous behaviour in the Indian scenario.
Recoveries from COVID-19
In addition to the forecasting, we have also examined the behaviour of recovered cases. [Figure 4] shows total recovered cases as a percentage of total cumulative confirmed cases, as well as the number of recovered cases per death, over time. Both series are computed from 7th March, 2020 to 28th August, 2020. At the start, both lines fluctuate considerably because there were few recoveries and deaths. At the end of May, i.e., on 29th May 2020, a jump can be seen in both measures because on this date a large number of cases recovered in Maharashtra and around 12,000 cases recovered across India, far more than on any previous date. From the last week of June, both lines again started increasing in a similar manner. At the end of June, the recovery percentage was around 60% and the recovery rate per death was 20 cases. By 28th August, 2020, the recovery percentage had reached around 76%, an increase of around 16 percentage points since the end of June, and the recovery rate per death was 42, an increase of about 22 recoveries per death. From these, it can also be seen that the two ratios are strongly related. However, from the end of June, the number of recovered cases per death increased much more than in previous months and also more than the recovery percentage. Hence, it can be concluded that the recovery percentage, as well as the recovery rate per death, in India is continuously increasing, which is a good sign in comparison to other countries affected by this pandemic.
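The two ratios can be reproduced from the national totals reported for 28th August 2020 (3,387,500 confirmed cases, 742,023 active cases and 61,529 deaths). The sketch below assumes that every confirmed case is either active, recovered or deceased, so that recoveries can be obtained by subtraction; the function name `recovery_ratios` is hypothetical.

```python
def recovery_ratios(confirmed, active, deaths):
    """Return (recovered, recovery percentage, recoveries per death)."""
    # Assumption: confirmed = active + recovered + deaths.
    recovered = confirmed - active - deaths
    recovery_percentage = 100.0 * recovered / confirmed
    recoveries_per_death = recovered / deaths
    return recovered, recovery_percentage, recoveries_per_death


recovered, pct, per_death = recovery_ratios(3_387_500, 742_023, 61_529)
print(recovered)            # 2583948
print(round(pct, 1))        # 76.3
print(round(per_death, 1))  # 42.0
```

Under this assumption, the results agree with the recovery percentage of around 76% and the roughly 42 recoveries per death quoted for the end of August.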
|Figure 4: Recoveries as a percentage of cumulative confirmed cases and recovered cases per death over time.|
| Discussion|| |
In this study, we have considered the forecast of the spread of COVID-19 in India by analysing openly accessible data from 7th March 2020 to 28th August 2020 using univariate time series models, which assume that the data are accurate and that past patterns will continue to apply. As of 28th August, 2020, India had 3,387,500 confirmed cases, 742,023 active cases and 61,529 deaths. The actual figures always lay between the lower and upper bounds of the prediction intervals obtained at the various stages. However, by the time of writing the manuscript, cured cases had also started increasing very rapidly, implying a slightly flatter curve of active cases. We hope that the present study will be a valuable tool for authorities and Indians in making policies and accordingly taking suitable actions to mitigate the spread of COVID-19 cases in India.
In addition to forecasting, we have also paid some attention to the behaviour of recovered cases. [Figure 4] shows total recovered cases as a percentage of total cumulative confirmed cases, as well as the number of recovered cases per death, over time. Both series are shown from 7th March 2020 to 28th August 2020. At the start, both lines fluctuate considerably because there were few recoveries and deaths, but from the end of March both lines become steadier. At that point, the recovery percentage was very low (around 8%); it had increased to 76% by the end of August 2020. Moreover, the ratio of recovered cases to deaths also increased from 4 to around 42 recoveries per death over the same period.
| Conclusion|| |
The outbreak of COVID-19 has had broad and profound impacts globally. So far, India seems to be in a better situation than many developed countries, but the increasing number of new cases reported daily in India is causing very serious problems in various respects. The new cases and deaths outside India clearly indicate that the COVID-19 outbreak may have tragic results globally if the necessary mitigation measures are not implemented in time. Assessing the trend of COVID-19 cases may help in controlling the disease by informing suitable policies, especially with respect to local challenges.
Financial support and sponsorship
None.
Conflicts of interest
There are no conflicts of interest.
| References|| |
Coronavirus Disease (COVID-19) Pandemic. Ministry of Health and Family Welfare, New Delhi, India. COVID-19 India; 2020. Available from: https://www.mohfw.gov.in/. [Last accessed on 2020 Aug 28].
Yang Y, Lu Q, Liu M, Wang Y, Zhang A, Jalali N, et al. Epidemiological and clinical features of the 2019 novel coronavirus outbreak in China. medRxiv 2020. Available from: https://doi.org/10.1101/2020.02.10.20021675. [Last accessed on 2020 Aug 28].
Zhao S, Lin Q, Ran J, Musa SS, Yang G, Wang W, et al. Preliminary estimation of the basic reproduction number of novel coronavirus (2019-nCoV) in China, from 2019 to 2020: A data-driven analysis in the early phase of the outbreak. Int J Infect Dis 2020;92:214-7.
Nishiura H, Linton NM, Akhmetzhanov AR. Serial interval of novel coronavirus (COVID-19) infections. Int J Infect Dis 2020;93:284-6.
Lai S, Bogoch I, Ruktanonchai N, Watts A, Lu X, Yang W, et al. Assessing spread risk of Wuhan novel coronavirus within and beyond China, January-April 2020: A travel network-based modelling study. medRxiv 2020. [doi: 10.1101/2020.02.04.20020479].
Nishiura H, Jung SM, Linton NM, Kinoshita R, Yang Y, Hayashi K, et al. The extent of transmission of novel coronavirus in Wuhan, China, 2020. J Clin Med 2020;9:1-5.
De Salazar PM, Niehus R, Taylor A, Buckee C, Lipsitch M. Using predicted imports of 2019-nCoV cases to determine locations that may not be identifying all imported cases. medRxiv 2020. Available from: https://www.medrxiv.org/content/10.1101/2020.02.;4:v2. [Last accessed on 2020 Aug 28].
Zhao H, Man S, Wang B, Ning Y. Epidemic size of novel coronavirus-infected pneumonia in the Epicenter Wuhan: Using data of five-countries' evacuation action. medRxiv 2020. Available from: https://doi.org/10.1101/2020.02.12.20022285. [Last accessed on 2020 Aug 28].
Nishiura H, Kobayashi T, Miyama T, Suzuki A, Jung S, Hayashi K, et al. Estimation of the asymptomatic ratio of novel coronavirus (2019-nCoV) infections among passengers on evacuation flights. medRxiv 2020. Available from: https://doi.org/10.1101/20200.02.03.20020248. [Last accessed on 2020 Aug 28].
Kucharski AJ, Russell TW, Diamond C, Liu Y, Edmunds J, Funk S, et al. Early dynamics of transmission and control of COVID-19: A mathematical modelling study. Lancet Infect Dis 2020;20:553-8.
Chinazzi M, Davis JT, Ajelli M, Gioannini C, Litvinova M, Merler S, et al. The effect of travel restrictions on the spread of the 2019 novel coronavirus (COVID-19) outbreak. Science 2020;368:395-400.
Hellewell J, Abbott S, Gimma A, Bosse NI, Jarvis CI, Russell TW, et al. Feasibility of controlling COVID-19 outbreaks by isolation of cases and contacts. Lancet Glob Health 2020;8:e488-96.
Quilty BJ, Clifford S, Flasche S, Eggo RM, CMMID nCoV working group. Effectiveness of airport screening at detecting travellers infected with novel coronavirus (2019-nCoV). Euro Surveill. Available from: 10.2807/1560-7917.ES.2020.25.5.2000080. [Last accessed on 2020 Aug 28].
Zeng T, Zhang Y, Li Z, Liu X, Qiu B. Predictions of 2019-nCoV transmission ending via comprehensive methods. arXiv 2020. Available from: https://arxiv.org/abs/2002.04945. [Last accessed on 2020 Aug 28].
Box GE, Jenkins GM, Reinsel GC, Ljung GM. Time Series Analysis: Forecasting and Control. Hoboken, NJ: John Wiley & Sons; 2008
Brown RG. Exponential smoothing for predicting demand. Oper Res 1957;5:145.
Holt CC. Forecasting trends and seasonal by exponentially weighted moving averages. ONR Memo 1957;52.
Holt CC. Forecasting seasonal and trends by exponentially weighted moving averages. Int J Forecast 2004;20:5-10.
Winters PR. Forecasting sales by exponentially weighted moving averages. Manage Sci 1960;6:324-42.
Makridakis S, Spiliotis E, Assimakopoulos V. The M4 competition: 100,000 time series and 61 forecasting methods. Int J Forecast 2020;36:54-74.