The Big Data Department of CNIC has made new progress in supporting the construction of microbiology databases and analysis systems - Computer Network Information Center, Chinese Academy of Sciences


Date: Oct 28, 2021

The Big Data Department of CNIC, together with the Institute of Microbiology, CAS, and other teams, has made new progress in building microbiology databases and analysis systems, and has jointly published two related research results in the internationally renowned academic journal "Nucleic Acids Research".

Type strains are strains preserved in a pure state when microorganisms are named, classified, recorded, and published. They are the standard reference materials for microbial taxonomy and are also ideal tools for biotechnology research, so they have significant value for both science and industry. Type strains have long been scattered across more than 100 collection centers around the world, and they are extremely precious resources for each center. In 2018, the Institute of Microbiology of CAS initiated a global type strain genome sequencing program to select and sequence type strains (including bacteria, archaea, and culturable fungi) from global microbial resource collections that have not yet been sequenced. The program is expected to complete more than 10,000 genomes within 5 years. So far, 26 microbial resource collections from more than 12 countries, including ATCC in the United States, JCM and NBRC in Japan, and KCTC in South Korea, have formally joined the program, and it has produced important interim results.

Submission pages and results of the analysis for the two pipelines, provided by gcType

The Big Data Department of CNIC and the team of researcher Ma Juncai at the Institute of Microbiology, CAS, have built the Global Catalogue of Type Strain (gcType), which integrates more than 13,944 genomes of 16,701 validly published prokaryotes. gcType is currently the most comprehensive and fully featured data platform for type strain genome data, providing users with one-stop data management, genome annotation, and new species identification analysis. The results of the cooperation were published in the internationally renowned academic journal Nucleic Acids Research.

As the novel coronavirus continues to spread around the world, its genome continues to mutate. Existing SARS-CoV-2 databases, in addition to collecting and displaying data, offer functions such as virus typing and traceability analysis, and so provide important information for monitoring and tracking the global epidemic. However, as mutations are studied in greater depth, their functional impact has gradually become the focus of attention. Multiple variant strains with enhanced infectivity, including Alpha, Beta, and Delta, have been discovered in many countries and regions. The risk of immune escape may reduce the protection offered by disease control measures, affect the applicability of diagnostic methods, and accelerate the spread of the epidemic. Databases that focus only on data collection and display therefore cannot meet future needs. A big-data-based system for virus mutation assessment and systematic early warning is needed to evaluate and interpret the impact of mutations that have occurred or may occur in the future, so that effective epidemic prevention and control strategies can be formed.

Features of the variations evaluation and prewarning system (VarEPS) portal

The Big Data Department of CNIC, in cooperation with researcher Ma Juncai at the Institute of Microbiology, CAS, and other teams, has released the SARS-CoV-2 Variation Evaluation and Prewarning System, referred to as the VarEPS database. VarEPS is the world's first system for multi-dimensional risk assessment and early warning of known and virtual variants in the SARS-CoV-2 genome. Working from the perspectives of genomics and structural biology, VarEPS evaluates mutations along several dimensions, including the frequency of mutation sites, the difficulty of nucleotide mutations, the difficulty of amino acid substitutions, the effect of mutations on protein secondary structure, and the effects of single amino acid mutations, so as to comprehensively analyze how known mutations and potential virtual mutations affect the function of the virus. On this basis, the system uses an artificial intelligence classifier to group mutant strains by transmissibility and by affinity for neutralizing antibodies, enabling risk assessment and early warning based on virus sequences. The results of the cooperation were published in the internationally renowned academic journal Nucleic Acids Research.

For details, please contact Ms. Meng Zhen (zhenm99@cnic.cn).

