URC 2023 Abstract
Shanthan Sudhini
Efficient Retrieval and Preprocessing of Prokaryotic Genomic Data for Large-Scale Analysis
Efficient retrieval and preprocessing of metadata and DNA sequences are essential for large-scale prokaryotic genomic analysis because records in NCBI databases often contain inconsistencies, missing values, and redundancies. Without proper cleaning and structuring, downstream analyses may be compromised. This study extracts DNA sequences and associated metadata for prokaryotic families from NCBI's nucleotide (NT) and BioSample databases, cleans and standardizes the metadata, and stores the processed data in an SQLite database. To ensure scalability and reliability, robust error logging and HTTP request management strategies are implemented for large-scale data retrieval. Data acquisition uses Biopython's Bio.Entrez module to systematically query the NT and BioSample databases. Metadata cleaning involves handling missing values, removing duplicate records, and standardizing critical attributes such as taxonomy, geographic origin, host information, and isolation source, yielding a uniform dataset for downstream computational analysis. Parallel processing, using multiprocessing in standard computing environments and MPI on high-performance computing clusters, reduces data processing time by approximately 50%. The retrieved nucleotide sequences and metadata are stored in the SQLite database, while the corresponding FASTA files are systematically organized for future use. This framework provides a scalable and efficient solution for prokaryotic genomic data mining and improves data accessibility for bioinformatics research. The structured dataset will support machine-learning applications and large-scale comparative analyses of prokaryotic genomes. Future directions include further optimization and the integration of additional genomic repositories to expand data coverage.
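As an illustration of the error-logging, metadata-cleaning, and storage steps described above, the following is a minimal Python sketch that uses only the standard library. It is not the study's code: the field names, the list of placeholder strings treated as missing, the table schema, and the helper names (`with_retries`, `clean_records`, `store_records`) are all assumptions made for illustration, and the retrieval step through Bio.Entrez is not shown.

```python
import logging
import sqlite3
import time

logger = logging.getLogger("retrieval_sketch")

# Placeholder strings treated as missing values (an assumed list).
MISSING = {"", "na", "n/a", "missing", "not collected", "not applicable", "unknown"}
# Hypothetical metadata fields, chosen to mirror the attributes named in the abstract.
FIELDS = ("accession", "taxonomy", "geo_origin", "host", "isolation_source")


def with_retries(fetch, attempts=3, delay=0.5):
    """Call fetch(); on failure, log the error and retry with a growing delay."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as exc:  # real code would catch HTTP/URL errors specifically
            logger.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise
            time.sleep(delay * attempt)


def clean_record(raw):
    """Trim whitespace and map placeholder strings to None for every field."""
    cleaned = {}
    for field in FIELDS:
        value = (raw.get(field) or "").strip()
        cleaned[field] = None if value.lower() in MISSING else value
    return cleaned


def clean_records(raw_records):
    """Clean each record, drop those without an accession, and remove duplicates."""
    seen, result = set(), []
    for raw in raw_records:
        record = clean_record(raw)
        accession = record["accession"]
        if accession is None or accession in seen:
            continue  # keep only the first record seen for each accession
        seen.add(accession)
        result.append(record)
    return result


def store_records(conn, records):
    """Insert cleaned records into an SQLite table keyed by accession."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metadata ("
        "accession TEXT PRIMARY KEY, taxonomy TEXT, geo_origin TEXT, "
        "host TEXT, isolation_source TEXT)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO metadata VALUES "
        "(:accession, :taxonomy, :geo_origin, :host, :isolation_source)",
        records,
    )
    conn.commit()
```

In this sketch, a caller would wrap each network request in `with_retries`, pass the parsed records through `clean_records`, and hand the result to `store_records` with a connection from `sqlite3.connect`. Keying the table on the accession means repeated runs do not insert the same record twice, which is one simple way to keep a large retrieval job restartable.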
Presenter: 406
Shanthan Sudhini, Sophomore, Edward E. Whitacre Jr. College of Engineering, Texas Tech University
Abstract: A406
Impact Area: Technology
Session: A - Tues. April 1, 10:00 AM, TTU Museum Sculpture Garden
Project Author(s)
Shanthan Sudhini, Era Sharma, Amanda M. V. Brown
Mentor
Amanda M. V. Brown, Biological Sciences, TTU College of Arts & Sciences