Among the 16,700 data sets on data.gov, two stood out to me as particularly interesting.
My goal: to see whether there is a correlation between heart disease mortality rates among 35-year-olds and youth tobacco rates among middle and high school students.
The main programming language I used in this research was R, and the packages I installed were ggmap, mapproj, leaflet (an R interface to the Leaflet JavaScript library), ggplot2, and dplyr.
Fall 2017
Mee Kyoung Seo
I cleaned the CSV files by removing all NA values and grouping the values by state for both data sets. To use the leaflet package, I had to merge the two data sets, the heart mortality data and the youth tobacco rate data, with a table of longitudes and latitudes so that the result was in a format Leaflet could use.
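The cleaning and merging steps can be sketched roughly as follows. This is a minimal illustration in base R, not the original script: the data frames, column names (`state`, `mortality_rate`, `tobacco_rate`, `lat`, `lng`), and all values, including the coordinates, are hypothetical placeholders rather than the real data.gov fields.

```r
# Hypothetical toy data; column names and values are assumptions,
# not the actual data.gov headers or figures.
heart <- data.frame(
  state = c("CA", "CA", "NY", "NY", "TX"),
  mortality_rate = c(300, NA, 320, 340, 400)
)
tobacco <- data.frame(
  state = c("CA", "NY", "NY", "TX", "TX"),
  tobacco_rate = c(10, 12, NA, 18, 20)
)
# Placeholder coordinates for illustration only.
coords <- data.frame(
  state = c("CA", "NY", "TX"),
  lat = c(37, 43, 31),
  lng = c(-120, -75, -100)
)

# Drop rows that contain missing values.
heart_clean <- na.omit(heart)
tobacco_clean <- na.omit(tobacco)

# Average each measure within a state.
heart_by_state <- aggregate(mortality_rate ~ state, data = heart_clean, FUN = mean)
tobacco_by_state <- aggregate(tobacco_rate ~ state, data = tobacco_clean, FUN = mean)

# Join the two state summaries, then attach the coordinates.
merged <- merge(heart_by_state, tobacco_by_state, by = "state")
merged <- merge(merged, coords, by = "state")
print(merged)
```

The same steps could be written with dplyr's `group_by()`, `summarise()`, and `inner_join()`. Base R is used here only to keep the sketch free of package dependencies.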
The two maps coded below will show:
I first visualized the relationship by grouping the data by state. I then computed Pearson's product-moment correlation coefficient, which measures the strength and direction of the linear relationship between two variables, and obtained a value of 0.48 (rounded to two decimal places). I chose this method because it directly quantifies the linear relationship between the two columns.
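As a rough illustration of the calculation, the sketch below computes Pearson's coefficient from its definition and checks it against base R's `cor()` and `cor.test()`. The two vectors are made-up values, not the state-level data from this project, so the result will not match 0.48.

```r
# Hypothetical state-level values, for illustration only.
tobacco_rate <- c(10, 12, 15, 18, 20, 22)
mortality_rate <- c(300, 330, 310, 360, 350, 400)

# Pearson's r from its definition: the sum of products of deviations
# divided by the square root of the product of summed squared deviations.
dx <- tobacco_rate - mean(tobacco_rate)
dy <- mortality_rate - mean(mortality_rate)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# The same coefficient from base R, plus a significance test.
r_builtin <- cor(tobacco_rate, mortality_rate, method = "pearson")
result <- cor.test(tobacco_rate, mortality_rate, method = "pearson")

print(round(r_manual, 2))
print(result$p.value)
```

`cor.test()` also reports a confidence interval for the coefficient, which helps in judging how much weight a moderate value deserves when there are only about fifty state-level observations.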
It was interesting to see a moderate positive linear relationship between youth tobacco rates and heart mortality rates. There are several possible reasons for this, the most obvious being that states with higher youth tobacco rates likely have a greater proportion of people who smoke. Because smoking increases the risk of cardiovascular disease, this may lead to higher heart mortality rates. In addition, states with higher youth tobacco rates may have fewer smoking regulations, leading to greater exposure to smoke (air pollution). Some states have stricter bans on tobacco in public spaces (notably 25 of them, such as California, Hawaii, Massachusetts, Michigan, and New York), which may also help explain why states differ so much in smoking rates and heart mortality rates.