Using Machine Learning to Predict the Spread of Mosquito-Borne Disease in Tropical Cities

This project aims to predict local epidemics of dengue fever using environmental data collected by U.S. Federal Government agencies. Using historical data for San Juan, Puerto Rico and Iquitos, Peru, predictions were generated with the machine learning tools Google TensorFlow and Facebook Prophet. Raw data is available through DrivenData.


Mosquito-borne Disease

Mosquito-borne disease is a common scourge throughout the world, causing hundreds of millions of illnesses and millions of deaths every year. Although these diseases are most prevalent in tropical environments, more than 3,000 species of mosquitoes are found across the globe. Mosquito-borne illnesses can be caused by bacteria, viruses, or parasites transmitted to humans when the mosquito bites. Dengue is a viral disease spread by female mosquitoes of the genus Aedes, principally Aedes aegypti. Other diseases transmitted by mosquitoes include malaria, West Nile virus, Zika, chikungunya, and yellow fever. The spread of mosquito-borne disease is related to climate variables such as temperature and precipitation. Although the relationship is complex, climate change may produce changes in mosquito prevalence and disease burden that will have significant public health implications worldwide.



Dengue Fever

Dengue fever is a mosquito-borne viral disease that occurs in tropical and sub-tropical parts of the world. In mild cases, symptoms are similar to the flu: fever, rash, and muscle and joint pain. In severe cases, dengue fever can cause severe bleeding, low blood pressure, and even death. In recent years, dengue fever has been spreading. Historically, the disease has been most prevalent in Southeast Asia and the Pacific islands. Today, many of the nearly half-billion cases per year occur in Latin America.

This project examined two tropical cities affected by dengue: San Juan, Puerto Rico and Iquitos, Peru. Both cities have about 400,000 people. San Juan is a coastal city. Iquitos is an inland city in the Amazon Rainforest.



Machine Learning Approach

Our goal was to predict the number of dengue cases each week in each location based on environmental variables describing changes in temperature, precipitation, and vegetation. Predictions were generated based on a dataset with weekly cases from 1991 to 2007 for San Juan and 2001 to 2009 for Iquitos.

Two prediction models were used. Facebook Prophet forecasts time series data based on an additive model in which non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. TensorFlow is a deep learning library developed by Google, with which we built a simple Long Short-Term Memory (LSTM) model.
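For reference, the additive model that Prophet's authors describe is usually written as the decomposition below. The symbols are the conventional ones from Prophet's own documentation, not notation used elsewhere in this project: g(t) is the non-linear trend, s(t) the periodic seasonality, h(t) the holiday effects, and the last term the error left unexplained.

```latex
y(t) = g(t) + s(t) + h(t) + \varepsilon_t
```

Here y(t) would be the number of dengue cases reported in week t.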


Predictions for San Juan, Puerto Rico


San Juan Results Discussion

Although we had data from 1991 to 2008 for San Juan, the period before 2000 showed unreliable trends, large outliers, and large fluctuations in the number of infected individuals, so the Facebook Prophet time series model was trained on data from 2000 onward. The Google TensorFlow LSTM model was not noticeably affected by these fluctuations and used the complete dataset. The two models process the data in completely different ways and therefore require different preprocessing; this difference is the reason the Prophet model can output predictions for the entire dataset while the TensorFlow model outputs predictions only for the last year.

To visualize how well each model generalizes to unknown values, we held out the last year of the San Juan dataset and used it to test model accuracy. For San Juan, the Facebook Prophet model had a training mean absolute error (MAE) of 5.67 cases per week and a testing MAE of 14.64 cases per week; the TensorFlow model had a testing MAE of 23.31 cases per week. With true values reaching 170 cases in the week of October 1, 2008, an MAE of about 15 on the testing data isn’t problematic, but it still leaves plenty of room for improvement.

A major challenge in this machine learning competition is that the environmental factors do not carry enough information to capture what could be causing the big spikes in the number of cases (as shown in the above graph in 2005 and 2007). Competition rules limited our ability to bring in extra datasets, but data on travel or socioeconomic fluctuations might allow machine learning models to more effectively capture a relationship that explains the larger dengue outbreaks.
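The holdout evaluation described above (cut off the last year, then score predictions with MAE) can be sketched in a few lines of plain Python. This is a minimal illustration rather than the project's actual code: the function names, the 52-week holdout length, the made-up case counts, and the mean-value baseline forecast are all assumptions chosen for the example.

```python
from typing import List, Sequence, Tuple


def split_last_year(
    series: Sequence[float], weeks_per_year: int = 52
) -> Tuple[List[float], List[float]]:
    """Hold out the final `weeks_per_year` observations as a test set."""
    if len(series) <= weeks_per_year:
        raise ValueError("series must be longer than the holdout window")
    return list(series[:-weeks_per_year]), list(series[-weeks_per_year:])


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average of |actual - predicted| across all weeks."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up weekly case counts covering two years (NOT the competition data).
weekly_cases = [float(10 + (week % 52) // 4) for week in range(104)]
train, test = split_last_year(weekly_cases)

# Placeholder forecast (an assumption): predict the training mean every week.
baseline = sum(train) / len(train)
test_mae = mean_absolute_error(test, [baseline] * len(test))
print(f"train weeks={len(train)}, test weeks={len(test)}, test MAE={test_mae:.2f}")
```

Because MAE is expressed in cases per week, it can be compared directly against the size of the weekly counts, which is how the 15-versus-170 comparison above should be read.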

Predictions for Iquitos, Peru


Iquitos Results Discussion

Traditional time series models essentially operate by finding repeating patterns over different periods (such as year, month, week, or hour) and then combining these patterns with an overall trend to create predictions. As the above graph shows, Iquitos generally has its larger spikes in cases in the earlier months of the year. In 2008, however, there was a large dengue outbreak in October, followed by Iquitos’s more typical increase in cases in January and February of 2009. This large outlier and change in pattern threw off the model’s predictions by adding a huge amount of uncertainty to the trend at the end of the training data. To deal with this, we removed extreme outliers, which we defined as values more than three times the interquartile range above the third (upper) quartile. Removing these outliers reduced the overall accuracy on the training data, but it seemed to allow the model to project the trend into the future with more certainty.

As with San Juan, Iquitos had an unreliable trend in the earlier years of the dataset, so the Facebook Prophet model was trained on data from January 2005 to June 2009 and then tested for prediction accuracy on the last year of data available. The Facebook Prophet model had a training MAE of 5.29 cases per week and a testing MAE of 3.45 cases per week, and the TensorFlow model had a testing MAE of 2.79 cases per week. It may seem that the model predicts the test year better than the training years, but the testing year contains no outliers and mean-based measures are sensitive to outliers, so the lower testing MAE is to be expected.
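The outlier rule described above (drop any week whose count is more than three times the interquartile range above the third quartile) can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the function names and toy counts are hypothetical, and the "inclusive" quartile method is an arbitrary choice, since different quartile conventions give slightly different thresholds.

```python
import statistics
from typing import List, Sequence, Tuple


def extreme_outlier_threshold(cases: Sequence[float], multiplier: float = 3.0) -> float:
    """Return Q3 + multiplier * IQR, the cutoff above which a week is dropped."""
    # Quartile method is an assumption; the source does not say which was used.
    q1, _median, q3 = statistics.quantiles(cases, n=4, method="inclusive")
    return q3 + multiplier * (q3 - q1)


def remove_extreme_outliers(cases: Sequence[float]) -> List[Tuple[int, float]]:
    """Keep (week_index, count) pairs whose count does not exceed the cutoff."""
    threshold = extreme_outlier_threshold(cases)
    return [(week, count) for week, count in enumerate(cases) if count <= threshold]


# Made-up weekly counts with one large spike (NOT the competition data).
toy_cases = [0, 1, 2, 3, 4, 5, 6, 7, 8, 100]
kept = remove_extreme_outliers(toy_cases)
print("threshold:", extreme_outlier_threshold(toy_cases))
print("weeks kept:", [week for week, _ in kept])
```

Keeping the week index alongside each count preserves the position of the gap in the series, so a model that tolerates missing observations can still be fit on the remaining weeks.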


Conclusion

As seen in both of the above graphs, which plot the time series predictions over the true training and testing values, neither machine learning model is able to accurately predict large spikes in dengue fever. Neither model is perfectly tuned, since fine-tuning a machine learning model takes quite a bit of hypothesis testing and therefore time, but there seems to be an obvious need to bring in more related data in order to identify potential causal factors for the big spikes in the number of infected individuals per week. Even though a correlation coefficient isn’t a direct measurement of a feature variable’s impact on a target variable’s outcome, it still gives a good indication of which features can help a machine learning model predict more accurately. San Juan’s environmental features have correlation coefficients ranging from -0.12 to 0.19 with respect to the number of cases per week, and Iquitos’s range from -0.13 to 0.23. This shows that a relationship between environmental factors and the number of dengue cases per week exists, but none of the variables has a proportional or inverse relationship strong enough to explain the large fluctuations in the trend.
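A correlation range like the ones quoted above can be obtained by computing a correlation coefficient between each environmental feature and the weekly case counts. The sketch below assumes the Pearson coefficient, which the text does not specify; the feature names and values are invented for illustration and are not taken from the competition data.

```python
import math
from typing import Dict, Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_x * var_y)


# Made-up values; the feature names are hypothetical placeholders.
weekly_cases = [4, 9, 3, 12, 7, 5]
features: Dict[str, Sequence[float]] = {
    "temperature_c": [26.1, 27.4, 25.8, 27.9, 26.0, 27.2],
    "precipitation_mm": [12.0, 3.5, 20.1, 8.4, 15.2, 1.0],
}

correlations = {name: pearson_r(values, weekly_cases) for name, values in features.items()}
for name, r in sorted(correlations.items(), key=lambda item: item[1]):
    print(f"{name}: r = {r:+.2f}")
print(f"range: {min(correlations.values()):+.2f} to {max(correlations.values()):+.2f}")
```

A coefficient near zero only rules out a strong linear relationship, so lagged or non-linear effects of temperature and precipitation could still matter even when values fall in a narrow band like the ones reported here.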