
Trace Data



Project Description (by Luke Dai)

This semester, SAAS collaborated with Trace Data, a data security company, to analyze JSON files using natural language processing. Our objective was to extract the meaning within JSON files by any method possible. Unlike regular books or newspapers, JSON files contain shorthand, acronyms, and other strings that stand in for English phrases (e.g., ‘mgr’ for ‘manager’ or ‘IPAddr’ for ‘IP Address’). This gap between what humans can decipher and what machines can interpret is the core problem: there are no established methods for machines to understand these JSON files. We were given ten thousand JSON files of varied content from different websites to analyze, but only 120 of them came with tags describing their contents. We therefore split our research into three directions: using the JSON tags, the frequency of keys, and the structure of the JSON. We had eight people in total, whom I split into pairs and assigned to each direction, with frequency getting two pairs. In the end, we were able to roughly extract concepts from the JSON files without the need for labeled data.

First, Philip Kabranov and I worked on using the 120 labeled JSON files to build a classifier. Each of the 120 files had one or more labels, like payment, revenue, or user, indicating the subject of part of the file. Because we received this data relatively late in the semester, we had only three weeks to work on it, so I opted for the standard statistical models we use in our classes. In short, we tried and failed with several of them: Naive Bayes, Support Vector Machines, Cross-Correlation, and Decision Trees. The fundamental flaw was in the probability assumption: each label describes only a portion of a file, so it is impossible to tell which keys relate to which category, and the presence or absence of any single key tells us little about the probability of the file belonging to that category. Combined with the small amount of data, this meant none of the above models could classify the files. In the end, our best model computed, for each category, the probability that a file's keys belong to it, took the maximum, and assigned that label only if the maximum probability passed a certain threshold. This approach is robust because of its simplicity, and the threshold can be adjusted depending on whether one wants more specificity or more sensitivity.
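A minimal sketch of that thresholded maximum-probability classifier is below. It is one plausible reading of the description, not the team's actual code: the way the per-key probabilities are estimated, the averaging over keys, and the threshold value of 0.5 are all illustrative assumptions.

```python
from collections import defaultdict

def train_key_probabilities(labeled_jsons):
    """Estimate P(category | key) from the labeled files.

    labeled_jsons: list of (keys, labels) pairs, where keys is the set of
    keys appearing anywhere in one JSON file and labels its category tags.
    """
    key_category_counts = defaultdict(lambda: defaultdict(int))
    key_totals = defaultdict(int)
    for keys, labels in labeled_jsons:
        for key in keys:
            key_totals[key] += 1
            for label in labels:
                key_category_counts[key][label] += 1
    return {
        key: {cat: n / key_totals[key] for cat, n in cats.items()}
        for key, cats in key_category_counts.items()
    }

def classify(keys, key_probs, threshold=0.5):
    """Average each category's per-key probability over the file's keys,
    take the maximum, and only emit a label if it clears the threshold."""
    scores = defaultdict(float)
    known = [k for k in keys if k in key_probs]
    if not known:
        return None  # no key has ever been seen in training
    for key in known:
        for cat, p in key_probs[key].items():
            scores[cat] += p / len(known)
    best_cat, best_score = max(scores.items(), key=lambda kv: kv[1])
    return best_cat if best_score >= threshold else None
```

Raising the threshold trades sensitivity for specificity: fewer files get labeled, but the labels that survive are more reliable.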

Moving from supervised to unsupervised learning, two pairs of my team used the frequency of keys to cluster the JSON files. The first pair, Amal Bhatnagar and Yehchan Yoo, used tf-idf to map the key frequencies into a vector space, then ran K-Means to find n clusters in that space, grouping the files by common concepts. They also tried Fuzzy K-Means and excluding different layers of the JSON file, but the best result came from hard K-Means over all keys in the file. They found that the files clustered by source: online orders, biological research, web publications, and so on. Meanwhile, the other pair, Amanda Ma and Pengyuan Chen, used Latent Dirichlet Allocation to find common topics within the files. They were able to extract concepts such as music, movies, education, and social media, and to organize the files according to the “percentage” of content belonging to each topic. Overall, using the frequency of keys proved very successful at extracting meaning from the JSON files and could be very useful for autonomously organizing large datasets.
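A sketch of both frequency-based pipelines with scikit-learn, assuming each JSON file has been flattened into a space-separated string of its keys. The sample documents, cluster count, and topic count are illustrative choices, not the team's exact configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

# Each "document" is the concatenated keys of one JSON file (hypothetical examples).
docs = [
    "user id name email mgr",
    "order payment total currency",
    "track artist album genre",
]

# Pipeline 1: tf-idf + hard K-Means, grouping files by key usage.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
clusters = kmeans.fit_predict(X)

# Pipeline 2: LDA over raw key counts, giving each file a topic mixture
# (the "percentage" of its content belonging to each topic).
counts = CountVectorizer()
C = counts.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_mix = lda.fit_transform(C)  # each row sums to 1: per-file topic proportions

print(clusters)   # hard cluster assignment per file
print(topic_mix)  # soft topic proportions per file
```

Note the two pipelines vectorize differently on purpose: K-Means works well on tf-idf weights, while LDA expects raw term counts.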

Finally, our most complicated model used the structure of the JSON file itself to extract meaning. That structure is essentially the tree formed by the JSON's nested dictionaries: because JSONs are dictionaries, we can construct a relationship tree between keys. The final pair, Michael Chau and Sahil Rao, researched Node2Vec, which turns a tree or graph into a vector space. This vector space of keys can teach us where certain keys belong along the dimension of “location within a JSON file.” With this knowledge, I believe future research into predicting the location of keys would let machines understand the relationships between keys, such as why name or id always sits underneath user data. In conclusion, this structural analysis opens the door for further machine learning models to fully extract the meaning of keys.
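A sketch of that structural approach, using networkx to build the key-relationship graph and the community node2vec package to embed it. The sample document and every hyperparameter here are illustrative assumptions, not the team's setup.

```python
import networkx as nx
from node2vec import Node2Vec  # pip install node2vec

def add_edges(obj, parent, graph):
    """Walk a JSON-like object and add parent-child edges between keys."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            if parent is not None:
                graph.add_edge(parent, key)
            add_edges(value, key, graph)
    elif isinstance(obj, list):
        for item in obj:
            add_edges(item, parent, graph)

# Hypothetical document: "name" and "id" sit underneath "user".
doc = {"user": {"id": 1, "name": "a", "address": {"city": "b"}}}
graph = nx.Graph()
add_edges(doc, None, graph)

# Embed keys by their position in the tree via biased random walks.
n2v = Node2Vec(graph, dimensions=16, walk_length=10, num_walks=50, workers=1)
model = n2v.fit(window=4, min_count=1)
print(model.wv.most_similar("id"))  # keys that occupy similar structural roles
```

Keys that tend to appear in the same neighborhoods of the tree end up close together in the embedding space, which is exactly the signal a future key-location predictor would train on.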

Looking back on the Fall 2019 semester, I believe we not only provided an adequate way to extract concepts and topics from JSON files, but also laid a stepping stone toward extracting the exact meaning of keys within them. This research is by no means complete: there are many applications still to explore, as well as further work to build on top of what we did this semester. As a final note, I would like to thank my entire team for their hard work, and thank you for reading!

Presentation

Semester

Fall 2019

Project Manager

Luke Dai