Research Post # 2

Aug 23

I’ve just finished my second week at the research internship I am doing at a university neuroscience lab. This week I was assigned my project, and it is really interesting!

A research group, separate from ours, wrote a data scraper to find nearly every mention, in any scientific publication, of a certain kind of gene and interaction found in neurons. The scraper returned two datasets: one containing all the genes and their specific attributes, and one containing interactions between pairs of genes and the attributes of each interaction.

The gene data contains 19,867 genes, and each one has four attributes: its disease-association category (whether or not it is associated with disease), the number of publications it appeared in, its ortholog count, and the specific species the gene was present in. For clarification, orthologs are genes that originated from a common ancestor and were separated by a speciation event, resulting in that gene's presence in two species. This data covers nine species: Homo sapiens, C. elegans, Mus musculus, Danio rerio, S. pombe, Rattus norvegicus, Xenopus tropicalis, Drosophila melanogaster, and S. cerevisiae. In the dataset, there are 11,424 genes that are not associated with disease, 4,796 genes that are associated with disease, and 3,647 genes that were returned with an unknown category.

The gene interaction data contains 1,048,576 connections and seven main attributes, including the interaction category, the interaction type, the method the interaction was found by, the number of organisms the interaction is present in, the specific organisms the interaction appeared in, and whether the interaction is bidirectional.

I have been performing exploratory data analysis on the datasets in pandas to understand the distributions of the various attributes, their frequencies, and their unique values, and to produce visual plots. Next, I will be coding the simplest form of a visualization of the datasets. The brain can be thought of as a graph or network, with nodes and edges. I will be using the genes as nodes and the gene interactions as edges for this visualization.

To begin, I will take the first 10,000 edges and every node that those edges connect. This data will be sufficient to create an adjacency matrix (which is a data structure used to represent graphs). Matrices can be thought of as a list of lists. Essentially, an adjacency matrix has the same list of nodes along both the horizontal and vertical axes, and for an undirected graph it is symmetric. If there is an edge between two nodes, there is a 1 at their intersection, and if there is no edge, there is a 0. This adjacency matrix will be combined with a one-dimensional array containing a 0 or 1 for every node, representing non-disease-associated or disease-associated. This will be visualized in the graph using different colors.

Since this is the simplest visualization of the graph, the unknown nodes are going to be ignored. Using this graph, we will train a model and eventually apply it to the unknown genes to predict the probability that each one is disease-associated.
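
To make the adjacency matrix and label array described above more concrete, here is a minimal sketch in plain Python. It is only an illustration, not the lab's actual code: the gene names, the toy edges, the `build_graph` helper, and the choice to drop any edge that touches an unknown gene are all assumptions made for this example.

```python
# Toy data, made up for illustration only (not from the real datasets).
# Each edge is a pair of gene names; each label is 1 (disease-associated),
# 0 (not disease-associated), or None (unknown category).
edges = [("geneA", "geneB"), ("geneB", "geneC"), ("geneA", "geneD")]
labels = {"geneA": 1, "geneB": 0, "geneC": 1, "geneD": None}


def build_graph(edges, labels, limit=10_000):
    """Build a symmetric adjacency matrix and a 0/1 label list.

    Hypothetical helper: takes the first `limit` edges, ignores edges that
    touch a gene with an unknown label, and treats every edge as undirected.
    """
    subset = edges[:limit]

    # Assumption: "ignoring unknown nodes" means dropping their edges too.
    known = [
        (a, b)
        for a, b in subset
        if labels.get(a) is not None and labels.get(b) is not None
    ]

    # Every node that appears in a kept edge, in a fixed (sorted) order.
    nodes = sorted({gene for edge in known for gene in edge})
    index = {gene: i for i, gene in enumerate(nodes)}

    # A matrix as a list of lists, filled with 0 (no edge) to start.
    n = len(nodes)
    matrix = [[0] * n for _ in range(n)]
    for a, b in known:
        i, j = index[a], index[b]
        matrix[i][j] = 1  # edge from a to b
        matrix[j][i] = 1  # mirror entry keeps the matrix symmetric

    # One 0/1 label per node, in the same order as the matrix rows.
    y = [labels[gene] for gene in nodes]
    return nodes, matrix, y


nodes, matrix, y = build_graph(edges, labels)
for gene, row, label in zip(nodes, matrix, y):
    print(gene, row, "label:", label)
```

Row and column i of the matrix both refer to `nodes[i]`, and `y[i]` is that node's label, so the two structures stay lined up when they are handed to a plotting or modeling step.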

The goal of this project is to develop and train a machine learning model to predict the probability that the unknown genes are associated with disease. Knowing this information can help medical professionals in a number of ways, including understanding the underlying biological mechanisms of diseases, identifying therapeutic targets, and developing drugs. The genes themselves can even serve as biomarkers for diagnostic testing, potentially allowing personalized treatments.

See you next week.

Anusha Kumar
