Predicting Overall Health of Communities

Introduction:

A food desert is defined as an area with low grocery store access. Living in one of these areas is often thought to be associated with poor health. Low-income individuals may be more likely to rely on walking or public transportation to get around, and if they work full time and are living paycheck to paycheck, they might not have much time or money to spare. It is no surprise that these individuals often have to resort to fast food restaurants and other unhealthy options simply because these options are more affordable and accessible. Healthy food options end up being a privilege, available and affordable mostly to higher-income individuals. However, healthy living is necessary for a long and comfortable life. People who want to eat a healthy, well-rounded diet should have that option regardless of economic background.

Originally, I wanted to study how socioeconomic and geographical factors could be used to predict a community’s quality of health, but as I further explored the dataset, I decided to focus more on the relationship between health and systemic issues in society, such as race and poverty, as well as the effectiveness of government programs and local infrastructure in improving the health of a community.

Dataset/Data Cleaning:

The dataset I used is the Food Environment Atlas provided by the United States Department of Agriculture (USDA). This dataset is available to download from the USDA website as an xls file.

After loading the file into a Jupyter Notebook, I ended up with eleven tables in total. Two of them covered the populations of the states and counties in the United States. The remaining tables contained county-level information: access and proximity to the grocery store, food store availability, restaurant availability and expenditures, food assistance program participation, food insecurity rates, food prices and taxes, local food and farming, health and physical activity, and socioeconomic factors.

In the health table, there were two candidate measures of a county’s overall health: diabetes rate and obesity rate. To decide which one to use, I created correlation matrices showing the strength of the relationships between those two variables and all of the possible features. After noticing that diabetes rate consistently tended to have a higher correlation with other variables than obesity rate did, I decided to use diabetes rate as the measure of the overall health of a county.
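The comparison between the two candidate targets can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the notebook code behind this project: the column names are hypothetical, and the fake data is deliberately constructed so that diabetes rate tracks the features more closely, so the output demonstrates the mechanics rather than the finding.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the merged county data; all column names are hypothetical.
rng = np.random.default_rng(0)
n = 300
poverty = rng.normal(15, 5, n)
fitness = rng.normal(0.1, 0.03, n)
counties = pd.DataFrame({
    "poverty_rate": poverty,
    "fitness_per_1000": fitness,
    # Built by construction to follow the features closely (low noise).
    "diabetes_rate": 5 + 0.4 * poverty - 20 * fitness + rng.normal(0, 1, n),
    # Built by construction to follow the features loosely (high noise).
    "obesity_rate": 25 + 0.2 * poverty - 10 * fitness + rng.normal(0, 3, n),
})

targets = ["diabetes_rate", "obesity_rate"]
features = counties.drop(columns=targets)

# One column of Pearson correlations per candidate target, one row per feature.
comparison = pd.DataFrame({t: features.corrwith(counties[t]) for t in targets})

# Mean absolute correlation summarizes how strongly each target tracks the features.
summary = comparison.abs().mean()
print(comparison.round(2))
print(summary.round(2))
```

Averaging absolute correlations is only one possible summary; comparing the two columns feature by feature is closer to reading the heatmaps side by side.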

Data Analysis:

First, I used Plotly to visualize diabetes rate across the country. From this, I noticed that counties in the Southeast tended to have higher diabetes rates.

Next, I wanted to explore the relationship between diabetes rate and the rest of the variables. I added the diabetes rate and obesity rate columns to all of the other tables besides the state and county tables. For each of these tables, I created a correlation matrix, which gave me the correlation coefficient between every pair of variables in the table. I visualized these matrices using seaborn heatmaps. To select features for my model, I used the correlation matrices to find the variables whose correlation coefficient with diabetes rate was less than -0.27 or greater than 0.27. I chose the boundary of 0.27 because I only wanted to look at variables that were at least moderately correlated with diabetes rate, and I assumed that a coefficient of around 0.3 counted as moderate. I acknowledge that this was not the best practice to follow, so in the future, I would choose this number more methodically.
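A minimal sketch of this thresholding step is shown below. The helper name `select_by_correlation`, the column names, and the synthetic table are all assumptions made for illustration; only the 0.27 cutoff comes from the analysis above.

```python
import numpy as np
import pandas as pd

THRESHOLD = 0.27  # cutoff described in the write-up


def select_by_correlation(table, target, threshold=THRESHOLD):
    """Return the columns whose Pearson correlation with `target`
    is above `threshold` or below `-threshold`."""
    corr = table.corr()[target].drop(target)
    return corr[corr.abs() > threshold].index.tolist()


# Synthetic table with hypothetical column names.
rng = np.random.default_rng(1)
n = 500
strong = rng.normal(size=n)
diabetes = 2 * strong + rng.normal(size=n)
table = pd.DataFrame({
    "strong_positive": strong,
    "strong_negative": -diabetes + rng.normal(size=n),
    "unrelated": rng.normal(size=n),
    "diabetes_rate": diabetes,
})

selected = select_by_correlation(table, "diabetes_rate")
print(selected)
```

Using the absolute value keeps strongly negative correlations as well as strongly positive ones, which matches the two-sided rule described above.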

From the access and proximity to the grocery store table, the variables “Households, no car & low access to store (%), 2015” and “Black, low access to store (%), 2015” were selected. In this case, a household is considered “low access” if it is not located close to a supermarket. From the restaurant availability and expenditures table, “Full-service restaurants/1,000 pop, 2014,” “Expenditures per capita, fast food, 2012*,” and “Expenditures per capita, restaurants, 2012*” were selected.

The one variable selected from the store availability table was “SNAP-authorized stores/1,000 pop, 2016.” Two more SNAP variables were selected from the food assistance programs table: “SNAP participants (% pop), 2016*” and “SNAP benefits per capita, 2015.” The Supplemental Nutrition Assistance Program, or SNAP, provides assistance to low-income individuals, allowing them to purchase food at participating stores. Other food assistance program variables included “Students eligible for free lunch (%), 2014” and “School Breakfast Program participants (% pop), 2015*.” High diabetes rates were associated with high values of all of these variables. This makes it appear as if these programs are not effective in promoting better health. However, the positive associations could reflect other characteristics of the people who participate in these programs, such as low income, rather than an effect of the programs themselves.

Two variables were chosen from the food prices and taxes table. The USDA defines “Price of sodas/national average, 2010**” to be “[regional] average price of sodas relative to the national average price.” Higher diabetes rates appear to be associated with lower soda prices. The “Price of low-fat milk/price of sodas, 2010**” is defined to be the “[ratio] of the regional average price of low-fat milk to the regional average price of sodas relative to the national average price ratio.” Higher diabetes rates appear to be associated with higher milk to soda price ratios.

One physical activity variable, “Recreation & fitness facilities/1,000 pop, 2014,” was selected. Not unexpectedly, counties with better access to fitness facilities tend to have lower diabetes rates.

Two food insecurity features were selected: “Household food insecurity (%, three-year average), 2013-15*” and “Household very low food security (%, three-year average), 2013-15*.” Of the socioeconomic variables, there were three race-related variables: “% Black, 2010,” “% Hispanic, 2010,” and “% Asian, 2010.” There were also five income-related variables: “Median household income, 2015,” “Poverty rate, 2015,” “Persistent-poverty counties, 2010,” “Child poverty rate, 2015,” and “Persistent-child-poverty counties, 2010.” “Persistent-poverty counties, 2010” and “Persistent-child-poverty counties, 2010” were both indicator variables for whether or not a county was categorized as having persistent poverty. According to the USDA, a county is classified as having persistent poverty if “20 percent or more of residents were poor as measured by each of the 1980, 1990, 2000 censuses, and 2007-11 American Community Survey 5-year average.”

In total, I selected 23 features to use for my model. The only tables I did not select features from were the state table, the county table, and the local food and farming table. I combined all of these features into a single dataframe and created one last correlation matrix heatmap, which I used to look at the relationships among the features themselves. If two features are highly correlated with each other, they may provide redundant information. For example, poverty rate and child poverty rate are very highly correlated. Knowing which features are possibly redundant allows me to decide whether to remove one of them or combine them in some way. Although I did look at which variable pairs had a high correlation, I decided to still try out the model without removing or changing any of the features.
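As a rough sketch, candidate redundant pairs can also be listed programmatically instead of being read off the heatmap. The 0.9 cutoff, the helper name `highly_correlated_pairs`, and the synthetic columns below are assumptions chosen for illustration, not values from this project.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(features, cutoff=0.9):
    """List (feature_a, feature_b, r) for every pair of columns whose
    absolute Pearson correlation exceeds `cutoff` (an assumed value)."""
    corr = features.corr()
    cols = list(corr.columns)
    pairs = []
    # Only visit the upper triangle so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = float(corr.iloc[i, j])
            if abs(r) > cutoff:
                pairs.append((cols[i], cols[j], round(r, 3)))
    return pairs


# Synthetic features; child poverty is built to follow poverty closely.
rng = np.random.default_rng(2)
n = 400
poverty = rng.normal(15, 5, n)
features = pd.DataFrame({
    "poverty_rate": poverty,
    "child_poverty_rate": 1.4 * poverty + rng.normal(0, 2, n),
    "fitness_per_1000": rng.normal(0.1, 0.03, n),
})

pairs = highly_correlated_pairs(features)
print(pairs)
```

A list like this only flags candidates; whether to drop one feature of a pair or combine them remains a modeling decision.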

Model:

I decided to start with a linear regression model using scikit-learn. The data was split into training, validation, and test sets. The features in my X were the 23 features identified during the exploratory data analysis using the 0.27 correlation coefficient threshold. After fitting the model to the training data, I used root mean square error (RMSE) to evaluate it. The training RMSE was about 1.358. I also performed five-fold cross validation, which resulted in an error of about 1.304. Because the cross validation error was no higher than the training error, overfitting was most likely not something that I needed to worry about. On the validation set, the error was about 1.316. Due to time constraints, I did not refine the model further and went ahead and evaluated it on the test set. The testing error was about 1.388 and the R² value was about 0.686. Since diabetes rate is a percentage, an RMSE of 1.388 means that predictions were typically off by roughly 1.4 percentage points, which I consider fairly low. The R² value of 0.686 means that about 68.6% of the variance in diabetes rate across the test counties could be explained by the model.
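The workflow above can be sketched roughly as follows. This is an illustration on synthetic data, so the numbers it prints are unrelated to the results reported here. The 60/20/20 split proportions, the random seeds, and the `rmse` helper are assumptions; the write-up does not state how the split was made.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the feature matrix and diabetes rate.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = 10 + X @ np.array([1.5, -1.0, 0.5, 0.0, 0.0]) + rng.normal(scale=1.3, size=1000)

# Assumed 60/20/20 split: hold out 20% for testing, then 25% of the rest for validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0
)


def rmse(y_true, y_pred):
    """Root mean square error, in the same units as the target."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


model = LinearRegression().fit(X_train, y_train)
train_rmse = rmse(y_train, model.predict(X_train))

# Five-fold cross validation on the training set. scikit-learn reports the
# negated RMSE so that larger scores are better; flip the sign back.
cv_scores = cross_val_score(
    LinearRegression(), X_train, y_train,
    cv=5, scoring="neg_root_mean_squared_error",
)
cv_rmse = float(-cv_scores.mean())

val_rmse = rmse(y_val, model.predict(X_val))
test_rmse = rmse(y_test, model.predict(X_test))
test_r2 = model.score(X_test, y_test)  # coefficient of determination

print(f"train={train_rmse:.3f} cv={cv_rmse:.3f} val={val_rmse:.3f}")
print(f"test={test_rmse:.3f} r2={test_r2:.3f}")
```

Keeping the test set untouched until the very end, as in the sketch, is what makes the test RMSE and R² an honest estimate of performance on unseen counties.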

Future Plans:

One thing that I wanted to try but did not get around to implementing was regularization. In the future, I would like to try Lasso and Ridge models. I could also experiment with different methods of feature selection, and regularization is one way to do that. For my current model, I selected features based on the correlation coefficient between each feature and diabetes rate, keeping only features with a correlation greater than 0.27 or less than -0.27. However, this boundary was chosen fairly arbitrarily. As mentioned previously, I would choose it more systematically, if I use this method to choose features at all. I hope to explore different and more exhaustive methods of feature selection, and I could make better use of cross validation both for selecting features and for refining the model. I could also combine redundant features or transform variables that might have a nonlinear relationship with the target variable. Finally, I could do further research to determine whether there is a variable that better represents the overall health of a community, which might require combining multiple health-related variables in some way. There are many improvements that I can make to my analysis, but it could be beneficial to wait, because a lot of updated data may become available after this year’s census.
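To illustrate how regularization could double as feature selection, here is a small sketch using scikit-learn's `Lasso` on synthetic data. It has not been run on the Food Environment Atlas, and the penalty strength `alpha=0.2` is an arbitrary assumption; in practice it would be tuned, for example with cross validation via `LassoCV`.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first two of five features influence the target.
rng = np.random.default_rng(7)
n = 500
X = rng.normal(size=(n, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Standardize first so the L1 penalty treats every feature on the same scale.
X_scaled = StandardScaler().fit_transform(X)

# alpha is an assumed, untuned penalty strength.
lasso = Lasso(alpha=0.2).fit(X_scaled, y)

# Features whose coefficients were not shrunk all the way to zero are "kept".
kept = [i for i, coef in enumerate(lasso.coef_) if coef != 0.0]
print(lasso.coef_.round(2))
print(kept)
```

Ridge regression shrinks coefficients without setting them exactly to zero, so it would help with correlated features such as the poverty variables but would not select features on its own.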

Conclusion:

There are still many improvements that I would like to implement in the future, but even with the progress I have made so far, I can begin to answer my initial question. High minority group population percentages, high poverty rates, low access to health-promoting facilities, and high participation in government food programs seem to be more prevalent in communities with poor health. These factors may be useful in predicting the overall health of a community. Being aware of these factors can help us decide how to improve the health of individuals across the nation.

Semester

Spring 2020

Researcher

Jacqueline Yu
