
American Sexual Education Policy and the LGBT Community



Introduction

As groups like the Human Rights Campaign (HRC) have found, inclusive sexual education has a profound impact on the LGBT community. Personally, having grown up in Arkansas and then moved to California, I found being part of the LGBT community vastly different in the two places. In assessing what cultural and educational factors could have contributed to this difference, I decided to examine the HRC's claim and analyze sexual education policies across the United States. Thus, this project began as an exploration of the relationship between sexual education policy, its various dimensions, and LGBT community presence across the US. Using Python, I employed several classification modeling techniques, along with A/B testing and Recursive Feature Elimination (RFE) for feature selection, to evaluate the statistical significance of individual policy features and the relationships between policies. The Naive Bayes, RFE-informed Naive Bayes, and centroid classification schemes had maximum replicable accuracies of 80%, 85%, and 55%, respectively.

Data Collection and Analysis

I mainly used three datasets in my project: Missouri.edu's collection of national sexual education data, Guttmacher.org's tables of state-based sexual education data, and Williamsinstitute.law.ucla.edu's summary of LGBT populations, drawn from Gallup surveys and the 2020 data.census.gov data. The Williams Institute data had to be entered manually, while the Missouri and Guttmacher data were downloaded as CSV files and uploaded to the notebook environment such that each row contained all sexual education policy data for a given state. The CSV data was somewhat incomplete, so all NaN values were replaced with 0s where appropriate, and rows containing a majority of zeroes were dropped, omitting a few states with insufficient population data. Then, to pair each state's population with its sexual education policies, the tables were joined by state, yielding the complete dataframe.
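The cleaning and joining steps above can be sketched in pandas as follows. The column names here are hypothetical stand-ins; the real datasets use Guttmacher's and the Williams Institute's own headers.

```python
import pandas as pd

# Hypothetical policy table (one row per state); None marks missing CSV entries.
policy = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona"],
    "Sex Ed Mandated": [1, None, 1],
    "HIV Ed Mandated": [1, 1, None],
})
# Hypothetical population table entered from the Williams Institute summary.
population = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona"],
    "LGBT %": [3.1, 3.9, 4.6],
})

# Replace missing policy entries with 0 (policy absent), then join on state.
policy = policy.fillna(0)
df = policy.merge(population, on="State")
print(df.shape)  # (3, 4)
```

In the actual project, rows that were mostly zeroes would then be dropped before modeling.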

Since almost all of the data features were binary (states either implemented a policy or did not), the data was not immediately easy to visualize. To account for this, I set out to assess which policy features were most significant, but first I combined 5 policy features into a new feature called the "Religious Pressure Index" on a scale of 0 to 6, and combined 6 more features into another new feature called the "Inclusivity / SVSH Index" on a scale of 0 to 7. These features weren't used for classification, but simply for seeing trends. Using them, I was able to get some sense of the relationship between LGBT population (%) and major policies (see fig. 1.1 and 1.2 below). While some relationship between LGBT % and the two indices was apparent, there remained a concern about how well these indices could predict LGBT % as a continuous variable, given that they were built on binary features.
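A composite index like these can be built as a row-wise sum of its binary component columns. The column names below are illustrative, not the project's actual feature names:

```python
import pandas as pd

# Hypothetical binary policy columns standing in for the real features.
df = pd.DataFrame({
    "Abstinence Stressed": [1, 0],
    "Parental Consent Required": [1, 0],
    "Cannot Promote Religion": [0, 1],
})

# A composite index is the row-wise sum of its binary components
# (some features may be inverted before summing, depending on direction).
religion_cols = ["Abstinence Stressed", "Parental Consent Required"]
df["Religious Pressure Index"] = df[religion_cols].sum(axis=1)
print(df["Religious Pressure Index"].tolist())  # [2, 0]
```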





Additionally, it is worth noting that the distribution of LGBT population percentage across all 50 states wasn't uniform, but approximately normal (see fig. 1.3). Next, I set out to establish which features were the most significant: first by Principal Component Analysis (PCA), a dimensionality-reduction technique, to assess how many policy dimensions really affected the data, and then by Recursive Feature Elimination (RFE), a feature selection tool, to assess which policies mattered most in modeling. Finally, to account for the binary nature of the policy data, I decided to use a binary classification of LGBT population %. This yielded a more promising relationship, so I proceeded with this binary classification model in mind, in essence using all policy features to label a state "0" or "1" for low or high LGBT population. Notably, the distributions for the Religious Pressure index vary less than those of the Inclusivity / SVSH index, and the Religious Pressure index also has much clearer means, but both still had a large amount of overlap between classes (see fig. 2.1 and 2.2).
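One simple way to produce the low/high labels is a median split on LGBT %, a minimal sketch of the binarization step (the project's exact cutoff is an assumption here):

```python
import pandas as pd

# Hypothetical state-level LGBT percentages
lgbt_pct = pd.Series([3.1, 4.0, 5.6, 2.9, 4.8])

# Label a state 1 ("high") if its LGBT % exceeds the median, else 0 ("low")
labels = (lgbt_pct > lgbt_pct.median()).astype(int)
print(labels.tolist())  # [0, 0, 1, 0, 1]
```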



Feature Selection, Classification, and Feature Analysis

I began classification with Principal Component Analysis on the data. Promisingly, the first 4 principal components accounted for 76% of the variance in the data.
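The explained-variance check can be reproduced with scikit-learn's `PCA`; the random binary matrix below is a stand-in for the real 50-state policy matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data: 50 states, 12 binary policy features
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(50, 12)).astype(float)

# Fit 4 components and report the fraction of total variance they capture
pca = PCA(n_components=4)
pca.fit(X)
print(pca.explained_variance_ratio_.sum())
```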



With this knowledge, I built a centroid classification algorithm (see fig. 3.2 for the centroids plotted as stars) that takes a new data point (a state), computes its Euclidean distance to each of the two classes' centroids in principal-component space, and assigns it to the closer class. This model was wildly inaccurate (see fig. 3.3), leading me to conclude that I needed both to understand which policy features were most significant and to employ a less strict boundary in my classification model.
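The nearest-centroid rule is short enough to sketch directly; this is a generic illustration on toy 2D data, not the project's exact code:

```python
import numpy as np

def centroid_classify(X_train, y_train, x_new):
    """Assign x_new to the class whose centroid is nearest (Euclidean)."""
    classes = np.unique(y_train)
    centroids = [X_train[y_train == c].mean(axis=0) for c in classes]
    dists = [np.linalg.norm(x_new - c) for c in centroids]
    return classes[int(np.argmin(dists))]

# Toy points in a 2D principal-component space, two classes
X = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [2.8, 3.2]])
y = np.array([0, 0, 1, 1])
print(centroid_classify(X, y, np.array([2.9, 2.9])))  # 1
```

The "strict boundary" is visible here: every point is forced to the nearer centroid, with no notion of uncertainty or feature weighting.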

From there, I used a Naive Bayes classification model. This model computes, for each class, the probability that a data point's (state's) features would arise under that class, and assigns the point to the class with the higher probability. I trained the Naive Bayes classifier on 70% of the state data and used the remaining 30% to test it (see fig. 3.3). Motivated by the success of this model, I decided to build a third model based on the most significant features (policies) in the data set. Using Lasso and Ridge regression models on the continuous data, I found the "HIV/STI Includes Sexual Orientation" feature to be the most significant policy feature in both models. Additionally, I used scikit-learn's Recursive Feature Elimination (RFE) algorithm to assess which policy features were most significant in the binary classification problem.
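A minimal sketch of the 70/30 Naive Bayes split and the RFE step, using synthetic stand-in data and an assumed logistic-regression base estimator for RFE (any estimator exposing coefficients works):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Stand-in policy matrix and low/high labels for 50 states
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(50, 12)).astype(float)
y = (X[:, 0] + rng.random(50) > 1.0).astype(int)

# 70/30 train/test split, as in the project
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
clf = GaussianNB().fit(X_tr, y_tr)
print(clf.score(X_te, y_te))

# RFE recursively drops the weakest features until 5 remain
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)
print(np.flatnonzero(rfe.support_))  # indices of the 5 retained features
```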

Using the 5 most common "top" features from RFE, Lasso regression, and Ridge regression, I built a third classifier, the "RFE-informed classifier," also using Naive Bayes. This classifier slightly outperformed the plain Naive Bayes classifier (56.8% vs. 55.8% average accuracy over 10,000 trials), which suggests there may be some dependence within the data that causes the Naive Bayes classifier to underperform. Additionally, because the data set is very small, individual classifications could be wildly inaccurate, leading to the relatively large variance for all three classifiers (see fig. 3.3).



As a final feature-selection curiosity, I explored A/B and permutation testing to gauge the statistical significance of some of the most important features. Permutation testing shuffles the classifications of all of the states some 1,000 times and computes the distribution of the difference of means of a given policy feature across those shuffles, allowing me to compare the observed value against a distribution generated from the real data. For "HIV/STI: Sexual Orientation," the result was statistically significant, with an average p-value of 0.047, meaning the observed difference was at least as extreme as only 4.7% of the random permutations. Similarly, "Cannot Promote Religion" saw a p-value of 0.15; it wasn't statistically significant at a 0.05 cutoff, but was certainly still of note (see fig. 4.1 and 4.2).
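The permutation test described above can be sketched as follows, with random stand-in values for the policy feature and the low/high labels:

```python
import numpy as np

rng = np.random.default_rng(0)
feature = rng.integers(0, 2, 50).astype(float)  # binary policy values per state
labels = rng.integers(0, 2, 50)                 # low (0) / high (1) LGBT class

# Observed difference of the feature's mean between the two classes
observed = feature[labels == 1].mean() - feature[labels == 0].mean()

# Shuffle the labels 1000 times and recompute the difference each time
diffs = []
for _ in range(1000):
    shuffled = rng.permutation(labels)
    diffs.append(feature[shuffled == 1].mean() - feature[shuffled == 0].mean())

# Two-sided p-value: fraction of shuffles at least as extreme as observed
p = np.mean(np.abs(diffs) >= abs(observed))
print(p)
```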



Conclusion

Primarily, there seems to be a correlation between policy features and LGBT population, even if it isn't tremendously strong. Additionally, the "HIV/STI Includes Sexual Orientation" policy feature was shown by several methods to be the strongest predictor of LGBT population: it was both the only statistically significant feature under permutation testing and the most important feature in both Ridge and Lasso regression.

Since the sample size of the study was necessarily small (there are only 50 states), future studies could draw on more time periods and policy data to reduce the large classification variance caused by sample size. Given this inaccuracy, it's also possible that the data is confounded by factors such as political orientation or regional culture across the United States. Future studies could account for this by following a longer timeline, tracking policy over enough time to map changes and understand what effect sexual education policy has independent of other factors.

The RFE-informed Naive Bayes classifier consistently predicted most accurately, achieving accuracy above 75% more than 30% of the time, compared to accuracy closer to 70% for the unmodified Naive Bayes. Still, this classification shows that sexual education policy is somewhat predictive of LGBT population, and specifically that policy surrounding sexual orientation and HIV/STI education is noticeably correlated with LGBT population.

References:

  1. https://www.hrc.org/resources/a-call-to-action-lgbtq-youth-need-inclusive-sex-education
  2. https://iq.opengenus.org/gaussian-naive-bayes/
  3. https://builtin.com/data-science/step-step-explanation-principal-component-analysis
  4. https://www.census.gov/en.html
  5. https://library.missouri.edu/guides/data/arc-educ/
  6. https://www.guttmacher.org/state-policy/explore/sex-and-hiv-education