Characterizing An Unknown Gene: Analysis Of AGI At5g52420
INTRODUCTION
According to the literature, advances in biotechnology and genetic technologies such as sequencing have generated extensive biological data and, with it, a clearer scientific picture. This information plays a key role in characterizing genetic sequences and their corresponding proteins by comparison with previously characterized sequences. This study therefore investigates a gene of unknown function by examining protein domains and protein-protein interaction data. The transmembrane protein sequence of the Arabidopsis thaliana gene At5g52420 (protein accession number NP_568771.1) was analyzed in this lab report. This gene is involved in biological processes in 24 plant structures and 13 growth stages. It is expressed in the leaf apex, shoot system, carpel, petal, root, stem, pollen, seed, sepal, flower, and other structures. It is also expressed during the flowering stage, the plant embryo globular stage, petal differentiation, and other growth stages. Its product is annotated as located in the endoplasmic reticulum and as an integral component of membrane. The gene is located on chromosome 5 at coordinates 21281569 to 21282806 bp. It has 1 transcript, 68 orthologs, and 2 paralogs. The protein is 242 amino acids long, with an isoelectric point of 6.8 and a molecular weight of 26736.7 Da.
According to BioGRID, this gene has 141 physical interactions, all of them from high-throughput experiments. As stated above, the purpose of this study is to identify the expression patterns of the sequence of interest from Arabidopsis thaliana, to examine its protein-protein interaction (PPI) data, and to perform co-expression analysis followed by gene ontology (GO) enrichment analysis using GeneMANIA and agriGO, in order to hypothesize the function of the studied gene.
METHODS
(Protein) Domain Analysis
Several web-based tools were used to examine the protein domains of the representative sequence. InterProScan was used because it offers a convenient way to search the Pfam profile HMM database together with other profile and motif databases. It does not, however, include the Conserved Domain Database (CDD), so CDD was searched separately. First, the FASTA sequence was retrieved from NCBI, pasted into the CDD search box, and submitted. Second, the FASTA sequence was submitted to InterProScan with default settings, and the result was exported in SVG format. In addition, SMART (Simple Modular Architecture Research Tool) was run on the sequence with the signal peptide and internal repeat options selected for further domain analysis.
Protein-Protein Interactions
BioGRID (Biological General Repository for Interaction Datasets) was used to display non-redundant physical and genetic interactions. The gene identifier At5g52420 was searched with Arabidopsis thaliana as the organism, and the “network” button was then clicked to display the BioGRID network graph. Cytoscape, a general platform for visualizing complex networks, was used as a second tool to visualize the BioGRID data by retrieving the records for this AGI ID from the public database. GeneMANIA was used to find genes related to the input gene; it draws on a very large set of functional association data and requires only the AGI ID and the organism (Arabidopsis). It can also be accessed through the Cytoscape app. The Arabidopsis Interactions Viewer was used to view predicted and experimentally determined PPIs in Arabidopsis. The AGI ID was submitted with the options to query protein-protein interactions from BAR and to query interactions from BioGRID and IntAct selected. DIP (Database of Interacting Proteins) lists protein pairs that are known to interact with each other; it was queried by BLAST search with the FASTA sequence. STRING was opened by clicking the interaction image on the SMART result page from the domain analysis step, which showed the 3D interaction network of the query protein.
Coexpression Tools
Expression Angler was used for coexpression analysis. It calculates the correlation coefficient between the expression vector of every gene and that of the query gene. The AGI ID was entered, the output was limited to the top 50 coexpressed genes, the AtGenExpress Tissue Compendium was selected, and the query was submitted. The top 50 AGIs were retrieved for further analysis in agriGO. Choosing “view formatted data set after median centering and normalization” on the output page displayed a heatmap of the median-centered and normalized expression levels of the 50 co-expressed genes together with the query gene. ATTED-II is a gene coexpression database; it was used to obtain information on the expression levels of thousands of genes simultaneously, which helps define coexpression patterns. After the gene was searched by its AGI ID, the locus page and the coexpressed gene list were examined. The table was then sorted by MR tissue and exported as CSV, and the top 50 AGIs were saved for GO enrichment analysis. To compare the top 50 genes from these two online coexpression tools, the Bio-Analytic Resource (BAR) Venn Selector tool was used. After the data from both tools were entered, the duplicate AGIs were examined to identify the genes shared between the two sets.
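The correlation-based ranking described above can be illustrated with a minimal Python sketch. It assumes that the correlation coefficient is the Pearson coefficient and that expression data are available as a dictionary of equal-length vectors; the function names, the toy expression values, and the gene labels GENE_A to GENE_C are hypothetical and are not taken from Expression Angler or ATTED-II.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant vector has no defined correlation; treat as 0
    return cov / sqrt(var_x * var_y)


def top_coexpressed(expression, query, n=50):
    """Rank all other genes by correlation with the query gene, highest first."""
    query_vector = expression[query]
    scores = [
        (gene, pearson(query_vector, vector))
        for gene, vector in expression.items()
        if gene != query
    ]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:n]


# Toy expression values across four samples (made up for illustration only).
expression = {
    "At5g52420": [1.0, 2.0, 3.0, 4.0],
    "GENE_A": [2.0, 4.0, 6.0, 8.0],  # rises with the query gene
    "GENE_B": [4.0, 3.0, 2.0, 1.0],  # falls as the query gene rises
    "GENE_C": [1.0, 3.0, 2.0, 4.0],  # partly follows the query gene
}

ranked = top_coexpressed(expression, "At5g52420", n=2)
for gene, r in ranked:
    print(f"{gene}\t{r:.2f}")
```

With real data, each vector would hold the expression values of one gene across all samples of the chosen compendium, and n would be set to 50 as in this report.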
Gene Ontology (GO) Enrichment Analysis
agriGO, a GO analysis toolkit and database for the agricultural community, was used to run Singular Enrichment Analysis (SEA). The Affymetrix ATH1 Genome Array was chosen as the reference because, for statistical testing, the query genes can only be compared against the genes present on the platform that was used to measure expression levels. In the advanced options, the hypergeometric test was chosen as the statistical test, with a significance level of 0.05 and a minimum of 3 mapping entries. In the first round, the 50 AGIs from BAR Expression Angler were used; in the second round, the top 50 AGIs from ATTED-II were used.
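To make the statistical test concrete, the sketch below computes a one-sided hypergeometric p-value for a single GO term. It is an illustration of the general test only, not agriGO's implementation; the function name and all counts in the example are hypothetical, and no multiple-testing correction is applied.

```python
from math import comb


def hypergeom_enrichment_p(background_size, background_hits, query_size, query_hits):
    """Probability of seeing at least `query_hits` annotated genes in the query.

    background_size: genes on the reference platform (e.g. the array)
    background_hits: reference genes annotated with the GO term
    query_size:      genes in the query list
    query_hits:      query genes annotated with the GO term
    """
    upper = min(background_hits, query_size)
    total = comb(background_size, query_size)
    tail = sum(
        comb(background_hits, i) * comb(background_size - background_hits, query_size - i)
        for i in range(query_hits, upper + 1)
    )
    return tail / total


# Hypothetical counts, for illustration only: a reference of 20,000 genes,
# 1,000 of them annotated with the term, and 10 of 50 query genes annotated.
p_value = hypergeom_enrichment_p(20000, 1000, 50, 10)
print(f"p = {p_value:.3g}, significant at 0.05: {p_value < 0.05}")
```

Restricting the reference to the genes on the measurement platform, as done here with the ATH1 array, matters because the background size and background hits both enter the calculation directly.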
Expression Analysis
The developmental map and the seed map, which showed the maximum expression, were displayed by searching At5g52420 as the primary gene ID in the Arabidopsis electronic Fluorescent Pictograph (eFP) Browser. This tool was chosen to examine the expression pattern of the gene of interest across different data sources in Arabidopsis.
RESULTS
Domain Analysis
No conserved domain was identified by any of the online tools used for domain analysis. The multiple sequence alignment (MSA) from the first lab, however, showed some conserved regions. The absence of domain hits may therefore reflect a lack of database entries and research on this gene rather than a true absence of conserved domains. SMART confidently predicted 5 transmembrane regions and 2 low-complexity regions. It also listed five orthologs across different taxonomic groups: eukaryotes (superkingdom), Viridiplantae (kingdom), Streptophyta (phylum), Brassicales (order), and all organisms (no rank). The InterProScan result page did not show any predicted protein family membership, homologous superfamilies, domains, repeats, or GO term predictions. It showed only features from different sources, such as transmembrane, TMhelix, cytoplasmic, non-cytoplasmic, and mobidb-lite.
Protein-Protein Interactions
BioGRID reported 141 physical interactions (100% high throughput) and no genetic interactions. All of these interactions were therefore identified in experiments in which the gene product (protein) physically interacted with another protein. The absence of genetic interactions could mean that no experiment has yet tested whether changes to another gene influence the gene of interest, or vice versa. BioGRID listed one GO cellular component term, endoplasmic reticulum, supported by IDA (inferred from direct assay) evidence. After the interactions were sorted by evidence, SOB3 (At1g76500), sugar transporter ERD6-like 5 (At1g54730), embryo defective 1923 protein (EMB1923), and a putative membrane protein (At3g42725) were among the first 10 interactions. SOB3 was the first interaction in the list; in the two-hybrid experiment it served as the bait, which is expressed as a DNA-binding protein (DBP) fusion. The DIP results were not useful for this study because they all concerned Homo sapiens (human) and Drosophila melanogaster.
The STRING interaction graph also showed that SOB3 has a specific physical interaction with the query protein. According to the Arabidopsis Interactions Viewer graph, the product of the gene studied in this lab report is located in the membrane and interacts mostly with proteins that are also located in the membrane. GeneMANIA identified a high level of coexpression, one localization, and one significant physical interaction, between AHL29 and the query protein. The undirected coexpressed gene network obtained from ATTED-II showed five significant coexpression links and an expanded coexpression network map. SOB3 and AHL29 were therefore the two most important proteins with which the studied gene product interacts.
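The BioGRID tallies reported above (interaction type and throughput) could be reproduced from a downloaded interaction table with a short script. The sketch below assumes a tab-delimited export with columns named "Experimental System Type" and "Throughput"; these column names, the function name, and the toy rows are assumptions made for illustration and should be checked against the header of the actual file.

```python
import csv
import io
from collections import Counter


def tally_interactions(tsv_text):
    """Count interactions by (interaction type, throughput) in a tab-delimited table."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = Counter()
    for row in reader:
        # Column names are assumed; adjust them to match the real export.
        interaction_type = row["Experimental System Type"].strip().lower()
        throughput = row["Throughput"].strip().lower()
        counts[(interaction_type, throughput)] += 1
    return counts


# Toy rows with placeholder partner names (not real BioGRID records).
toy_export = (
    "Interactor A\tInteractor B\tExperimental System Type\tThroughput\n"
    "At5g52420\tPARTNER_1\tphysical\tHigh Throughput\n"
    "At5g52420\tPARTNER_2\tphysical\tHigh Throughput\n"
    "PARTNER_3\tAt5g52420\tphysical\tHigh Throughput\n"
)

counts = tally_interactions(toy_export)
for (interaction_type, throughput), n in sorted(counts.items()):
    print(f"{interaction_type}\t{throughput}\t{n}")
```

A table in which every row falls under physical and high throughput, with no genetic rows, would correspond to the pattern described for At5g52420.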
GO Enrichment Analysis
According to the biological process and cellular component GO categories, the co-expressed genes were enriched for oxidation reduction and membranes, respectively. The cellular component terms centered on the cell membrane, and the biological process terms were most significant for metabolic processes. No molecular function GO terms were available.
Expression Analysis
The eFP Browser showed the expression level of the gene in various Arabidopsis organs and cell types and in response to various stimuli. Seed had the maximum expression (fig. 3C). Under the tissue-specific search, expression was shown in pistil tissue consisting primarily of ovaries. According to the eFP and Expression Angler results, the highest expression levels occur in seed stages 8 to 10 without siliques. This gene could therefore have an important role in the Arabidopsis seed.
Coexpression Tools
Expression Angler showed the highest expression in seed stages 8, 9, and 10. The ATTED-II locus page showed five genes directly connected to the target gene in the microarray-based network: At1g17100 (SOUL heme-binding family protein), At5g37680 (ADP-ribosylation factor-like A1A), At3g22530, At1g69800 (cystathionine beta-synthase (CBS) protein), and At1g45230 (protein of unknown function (DUF3223)). The Bio-Analytic Resource Venn Selector tool found only 12 AGI IDs common to the two datasets, and some of these were annotated only as expressed proteins. Only the ATTED-II list produced results in agriGO; the Expression Angler list did not return any result.
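The overlap between the two top-50 lists is a plain set intersection, which can be sketched as follows. The function names are hypothetical, and the identifiers are deliberately fake placeholders (Arabidopsis has no chromosome 0); they are not the genes returned by either tool.

```python
def normalize(agi):
    """Upper-case and trim an AGI ID so that At1g01010 and AT1G01010 match."""
    return agi.strip().upper()


def compare_gene_lists(first, second):
    """Return the genes shared by both lists and the genes unique to each."""
    first_set = {normalize(g) for g in first if g.strip()}
    second_set = {normalize(g) for g in second if g.strip()}
    return {
        "shared": sorted(first_set & second_set),
        "only_first": sorted(first_set - second_set),
        "only_second": sorted(second_set - first_set),
    }


# Placeholder identifiers for illustration only.
list_one = ["At0g00010", "At0g00020", "At0g00030"]
list_two = ["AT0G00020", "at0g00030 ", "At0g00040"]

result = compare_gene_lists(list_one, list_two)
print(len(result["shared"]), "shared:", ", ".join(result["shared"]))
```

Normalizing case and whitespace before comparing avoids missing matches when the two tools format AGI IDs differently.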
DISCUSSION
The Arabidopsis thaliana gene At5g52420 is expressed, which suggests that it is functional. Its gene product is a cell-to-cell mobile RNA, that is, an RNA synthesized in one cell or tissue and transported to another cell or tissue that may be adjacent, neighboring, or distant. One study reported 2006 genes that produce mobile RNAs, which move between different organs under normal or nutrient-limiting conditions. Plasmodesmata are cytoplasmic channels that facilitate the intercellular movement of molecules ranging from small to complex, such as proteins and different kinds of RNA species, between neighboring cells across their cell walls. Each of these channels consists of cytoplasm between the endoplasmic reticulum and the plasma membrane. Photoassimilates, the energy-storing monosaccharides produced by photosynthesis, move from shoot to root in Arabidopsis. Another name for this protein-coding gene model is phosphate starvation-induced gene Interacting Root-Cell Enriched 3 (PRCE3). This gene may be necessary for plant growth because phosphorus is a crucial component of nucleic acids, ATP, and membrane phospholipids, and plants take it up as inorganic phosphate (Pi) from the soil. There may therefore be a connection between what is known about this gene and its highest expression in seed stages. According to the GO enrichment analysis, this gene could be involved in metabolic processes (i.e., oxidation and reduction),