Guizhou Nonghua Biotechnology Co., Ltd. * Bioinformatics Resources for Bacterial Genome Data Mining and Bioinformatic Analysis

Bioinformatics resources
Bio-Linux (a Linux system with a large amount of bioinformatics software preinstalled); for detailed installation information, refer to Ubuntu
Genome comparison software:
	Mauve: multiple genome alignment;
	MUMmer: ultra-fast alignment of large-scale DNA and protein sequences, used to 1) compare a pair of genomes rapidly; 2) infer the synteny of genomes; 3) perform SNP analysis; 4) reorder the contigs of a draft genome against a reference genome;
	Local BLAST: a very useful tool for genome comparison, e.g. identifying specific genes and the core genome across multiple genomes;
	ACT (Artemis Comparison Tool): comparison of a pair of genomes;
Query the Number of Genome Sequences:
	●Draft (WGS) Genome ●Complete Genome
Genome Habitat Information Retrieval:
	●IMG database
Genome Metadata Retrieval:
	●GOLD
Genome Blast
	●TBLASTN: search translated nucleotide databases using a protein query
Web Based Genome Comparison
	●mGenomeSubtractor. If you use this server in your study, please cite this reference: Shao Y, He X, Harrison EM, Tai C, Ou HY, Rajakumar K, Deng Z. mGenomeSubtractor: a web-based tool for parallel in silico subtractive hybridization analysis of multiple bacterial genomes. Nucleic Acids Res. 2010 Jul;38(Web Server issue):W194-200
Several powerful software tools for inferring genomic flux patterns
	ClonalFrame: Inference of bacterial microevolution using multilocus sequence data
	ClonalFrameML: Efficient inference of recombination in whole bacterial genomes
	GenoPlast: Inferring genomic flux in bacteria
	ClonalOrigin: Inference of homologous recombination in bacteria using whole genome sequences
	Download from https://github.com/xavierdidelot/ClonalOrigin/wiki/Install
	(Note: the installation and usage instructions come from Xavier Didelot: https://github.com/xavierdidelot/ClonalOrigin/wiki/Usage)
Phylogenetic tree building
	PhyML, RAxML, MEGA
Sequence alignment
	ClustalW, MUSCLE
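Perl scripts for sequence analysis and genome comparison
	cds_faa.pl: extract CDS amino acid sequences from a genome annotation file (GenBank format);
	cds_fna.pl: extract CDS nucleotide sequences from a genome annotation file (GenBank format);
	fasta_header.pl: extract the sequence IDs from a multi-sequence FASTA file;
	Usage for these Perl scripts
	ANI_bacth.pl: batch calculation of genome average nucleotide identity (ANI)
	extractseuquence.pl: extract sequences by sequence ID
	ANI_bacth_patall.pl: batch calculation of genome average nucleotide identity (ANI), split for parallel computation on multiple CPUs
	blocksplit.pl: split a file at a specific marker
	ani_vs.pl: process JSpecies results and output them as text
	cntPairwiseDiffs.pl
	Bacth_muscle.pl: run MUSCLE with multiple threads
	cntSeqLen.pl
	catfasta2phyml.pl
	Concat_Sequence.pl
	Cluster_blast.pl
	gbf2tbl.pl: extract the 5-column annotation information from a gbk file, for genome submission
	core_blast.pl
	test.pl
	hash_extract.pl: take BLAST results as input, convert them, and output the sequence file of the core genome shared by N genomes
	process.pl
	intersection.pl: take the intersection of multiple file lists
	multi-thread.pl
	parallel_work.pl: multi-threaded execution
	multiple_tm.pl
	extract_protein_dir.pl: batch-extract CDS genes from the gbk files in a directory
	FASTA_ID_mofied.pl
	extract_rRNA_dir.pl: batch-extract 16S rRNA genes from the gbk files in a directory
	extract_fas_from_gbk.pl
	extract_rRNA_sequence.pl: batch-extract rRNA genes (5S, 16S, 23S) from the gbk files in a directory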
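As a rough illustration of what scripts such as fasta_header.pl (list sequence IDs) and extractseuquence.pl (extract sequences by ID) do, here is a minimal Python sketch. It is not the Perl code listed above, which is not shown on this page; the function names `read_fasta`, `list_ids` and `extract_by_id` and the example records are hypothetical, and the sketch assumes the sequence ID is the first whitespace-delimited token after ">".

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (sequence ID, sequence) pairs from FASTA-formatted lines."""
    seq_id = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # Assumption: the ID is the first token of the header line.
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            chunks = []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def list_ids(lines: Iterable[str]) -> List[str]:
    """Return all sequence IDs in file order."""
    return [seq_id for seq_id, _ in read_fasta(lines)]


def extract_by_id(lines: Iterable[str], wanted: Iterable[str]) -> Dict[str, str]:
    """Return the sequences whose IDs are in `wanted`."""
    wanted_set = set(wanted)
    return {sid: seq for sid, seq in read_fasta(lines) if sid in wanted_set}


# Hypothetical example records, used only for illustration.
EXAMPLE_FASTA = [
    ">geneA hypothetical protein",
    "ATGAAA",
    "CCCTAA",
    ">geneB",
    "ATGTTTTAG",
]

if __name__ == "__main__":
    print(list_ids(EXAMPLE_FASTA))                  # ['geneA', 'geneB']
    print(extract_by_id(EXAMPLE_FASTA, ["geneA"]))  # {'geneA': 'ATGAAACCCTAA'}
```

For a real file, the same functions can be given an open file handle, e.g. `with open("genome.fasta") as fh: ids = list_ids(fh)`, where the file name is a placeholder.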
Homologous gene detection tools: accurate orthology recognition is an essential step in comparative genomics research
	Proteinortho: Detection of (Co-)orthologs in large-scale analysis. Proteinortho significantly reduces the amount of memory required for orthology analysis compared with existing tools (OrthoMCL and Multi-Paranoid). It finds co-orthologs in large data sets containing different species and is specifically designed to handle hundreds of species, with millions of proteins, at the same time. However, it still requires BLAST+, Perl, and Python libraries to run. It is available as version 5.1, last updated in April 2016.
	New Tools in Orthology Analysis: A Brief Review of Promising Perspectives
	NORTH: a highly accurate and scalable Naive Bayes based ORTHologous gene cluster prediction algorithm
	RAFTS3G: an efficient and versatile clustering software to analyses in large protein datasets
	FastOrtho: a reimplementation of the OrthoMCL program that does not require the use of databases or Perl
	Ortholog-Finder: A Tool for Constructing an Ortholog Data Set. Ortholog-Finder is a tool developed to build ortholog data sets for phylogenetic analysis. Results published by its developers suggest that it can tolerate gene loss after gene duplication and HGT, because most of the phylogenetic trees are accurately reproduced even when these events occur. The algorithm is written in Perl, is compatible with Linux/Unix platforms, and needs BLAST, ClustalW, MAFFT, and BioPerl to run. However, this program does not support maximum likelihood or Bayesian methods.
	orthAgogue: an agile tool for the rapid prediction of orthology relations. orthAgogue was developed to predict orthology among large data sets. It is available in 32-bit and 64-bit versions, up to version 1.0.2 (last updated in July 2013). The program relies on the Intel Threading Building Blocks library (Intel TBB) and on the C Minimal Perfect Hashing Library (CMPH). Thread-count settings and an overlap threshold are among the features that contribute to the speed of the pipeline. It is relatively easy to use because it does not rely on very elaborate prerequisites; however, it needs input files in the tabular format generated by BLAST.
	SwiftOrtho: A fast, memory-efficient, multiple genome orthology classifier
	Discovery of multi-operon colinear syntenic blocks in microbial genomes
	DOOR: a prokaryotic operon database for genome analyses and functional inference
	Tracing the ancestry of operons in bacteria
	CSBFinder: discovery of colinear syntenic blocks across thousands of prokaryotic genomes
	OrthoVenn2: a web server for whole-genome comparison and annotation of orthologous clusters across multiple species. OrthoVenn uses interactive Venn diagrams to display clusters. It is a web-server-based tool that searches for orthology between multiple sequences among different species. The user can select up to six species and analyze them against the data set available on the OrthoVenn website. The tool reports gene ontology information for the protein functions of the respective clusters. The pipeline uses several methods to infer groups, such as all-vs-all BLAST, MCL, and even a predictor of hypothetical proteins, which makes the tool somewhat slow and limits the number of queries that can be analyzed.
	Orthonome: a new pipeline for predicting high quality orthologue gene sets applicable to complete and draft genomes. Orthonome was developed to bring more accuracy to orthology methods. Its pipeline relies on the MSOAR method to classify groups of homologs. It was shown to have advantages on complete and draft Drosophilid genomes, and it combines multiple pipelines to improve the accuracy and recall of ortholog assignments. It is available as a web server, which limits the number of inputs. Furthermore, the quality of the genome assembly and annotations affects its performance, so it is convenient to use reference genomes as a comparative data set.
	ORCAN: a web-based meta-server for real-time detection and functional annotation of orthologs
	A brief review of software tools for pangenomics
	PorthoMCL: parallel orthology prediction using MCL for the realm of massive genome availability. PorthoMCL is a fast tool with low requirements for identifying orthologous and paralogous groups in genomes. It is much faster and more modular than OrthoMCL while using the same mathematical foundation to investigate orthology. PorthoMCL can facilitate comparative genomics analysis across a large number of sequenced genomes, but it still requires BLAST, Perl, and Python packages to run.
	BPGA: an ultra-fast pan genome analysis pipeline
	The real cost of sequencing: scaling computation to keep pace with data generation
	PanOCT: automated clustering of orthologs using conserved gene neighborhood for pan-genomic analysis of bacterial strains and closely related species. PanOCT is distinctive in that it was developed to avoid the traditional graph-based methods of ortholog detection, and it is considered a high-output tool. It uses a conserved gene neighborhood (CGN) strategy to improve the accuracy of the clusters generated for pan-genomic analysis of prokaryotic species or closely related strains. As a consequence, it has difficulty detecting groups when organisms are evolutionarily too far apart. It uses Perl packages and BLAST+, and it is limited to analyses of up to 25 genomes, provided that the hardware has at least 14 GB of RAM available. It closely resembles clustering tools, with several useful execution options such as an e-value threshold, an identity cut-off, creation of files with paralog groups, a BLAST score-normalization histogram, and the window size on either side of a match used for CGN, among others. The developers consider these options necessary because co-orthologs sharing the same genomic context are preferred when forming orthogroups, so additional information indicating co-ortholog relationships needs to be reported. It is currently available as version 3.23 (last updated in July 2016).
	OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. OrthoFinder was developed to solve fundamental biases in genome comparisons, improving the accuracy of orthologous group inference. It runs as a single command that takes a directory of FASTA files (one per species) as input and, with the help of statistical algorithms, generates output files containing the genes in the orthologous groups of these species. This approach is interesting because it seeks to minimize gene-length bias, previously undetected in ortholog clustering, resulting in significant improvements in the accuracy of the results. It is now at version 1.1.10 (updated in September 2017) and depends on the following packages: Python, BLAST+, the MCL graph clustering algorithm, MAFFT, and FastTree.
	Orthograph: a versatile tool for mapping coding nucleotide sequences to clusters of orthologous genes. Orthograph relies on its own algorithm to address the shortcomings of earlier implementations of reciprocal best hit (RBH) strategies through graph-based mapping, while maintaining the high sensitivity and accuracy of the RBH approach. It needs several software packages, such as BLAST, SWIPE, MAFFT, and MySQL, to run.
	Hieranoid: hierarchical orthology inference. Hieranoid uses a hierarchical approach to infer orthology, using the bit-score method based on the InParanoid algorithm. It computes the orthology graph of a protein from a global alignment of the sequence with a minimum overlap of 50%. This, however, entails a disadvantage: orthologs with extensive domain rearrangements may be missed. Future plans announced by the developers include the use of domain information to fix this shortcoming. The most interesting thing about this pipeline is that the similarity search can be run with either BLAST or the Usearch algorithm.
	MorFeus: a web-based program to detect remotely conserved orthologs using symmetrical best hits and orthology network scoring. MorFeus aims to detect orthologs when orthology relationships among evolutionarily distant sequences are difficult to find. It runs on the web, but a sequence can serve as input only if it has a RefSeq ID. The configuration options allow the user to pick a particular kingdom (Archaea, Bacteria, Fungi, Metazoa) or the whole data set; the default e-value is 100, and the output is sent to the email address registered by the user. Unfortunately, the local version has not been updated since March 2014, and it requires pre-installed software such as Python, Biopython, NetworkX, gnuplot, and BLAST+, in addition to user registration via the web.
	OrthoInspector: comprehensive orthology analysis and visual exploration. OrthoInspector offers a single simple and fast algorithm to detect orthology and in-paralogy. It is currently available as version 2.21 (updated in August 2015). However, its prerequisites make it cumbersome to operate: for example, an input file in XML format, the BLASTP+ package, a data set created in PostgreSQL or MySQL, and the Java package.
	GeM-Pro: a tool for genome functional mining and microbial profiling. GeM-Pro is a new tool for gene mining and functional profiling of bacteria. It initially identifies homologous genes using BLAST and then applies three filtering steps to select orthologous gene pairs. The first uses BLAST score values to identify trivial paralogs. The second uses the shared identity percentages of the trivial paralogs found as internal witnesses of non-orthology to set orthology cutoff values. The third uses conditional probabilities of orthology and non-orthology to define new cutoffs and generate information supporting the orthology assignments. Additionally, a subsidiary tool called q-GeM was developed to mine traits of interest using logistic regression (LR) or linear discriminant analysis (LDA) classifiers. q-GeM uses computing resources more efficiently than GeM-Pro but needs an initial classified set of homologous genes to train the LR and LDA classifiers. Hence, q-GeM can be used to analyze a new set of strains with available genome sequences without rerunning a complete GeM-Pro analysis. Finally, GeM-Pro and q-GeM perform a synteny analysis to evaluate the integrity and genomic arrangement of specific pathways of interest and infer their presence. The tools were applied to more than 2 million homologous pairs encoded by Bacillus strains, generating statistically supported predictions of trait content. The different patterns of encoded traits of interest were successfully used to perform a descriptive bacterial profiling.
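Several of the tools above take BLAST tabular output as input, and some build on reciprocal best hits (RBH). The following Python sketch shows the basic RBH idea only; it is a simplification and not the algorithm of any tool listed here. It assumes the default 12-column BLAST tabular layout (`-outfmt 6`: query ID in column 1, subject ID in column 2, bit score in column 12), keeps the first hit on bit-score ties, and uses hypothetical function names and made-up example rows.

```python
from typing import Dict, Iterable, List, Tuple


def best_hits(rows: Iterable[str]) -> Dict[str, str]:
    """Map each query ID to the subject ID with the highest bit score."""
    best: Dict[str, Tuple[float, str]] = {}
    for row in rows:
        if not row.strip() or row.startswith("#"):
            continue
        fields = row.rstrip("\n").split("\t")
        # Assumption: default BLAST tabular column order (-outfmt 6).
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, subject)
    return {query: subject for query, (_, subject) in best.items()}


def reciprocal_best_hits(a_vs_b: Iterable[str],
                         b_vs_a: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


def make_row(query: str, subject: str, bitscore: int) -> str:
    """Build a made-up 12-column tabular row; only IDs and bit score matter."""
    return "\t".join([query, subject, "90.0", "100", "10", "0",
                      "1", "100", "1", "100", "1e-30", str(bitscore)])


# Hypothetical hits between genome A (a1..a3) and genome B (b1, b2).
A_VS_B = [make_row("a1", "b1", 550), make_row("a1", "b2", 150),
          make_row("a2", "b2", 450), make_row("a3", "b1", 90)]
B_VS_A = [make_row("b1", "a1", 550), make_row("b2", "a2", 450),
          make_row("b2", "a1", 150)]

if __name__ == "__main__":
    print(reciprocal_best_hits(A_VS_B, B_VS_A))  # [('a1', 'b1'), ('a2', 'b2')]
```

In this example a3 hits b1, but b1's best hit is a1, so the a3/b1 pair is not reciprocal and is dropped. The tools listed above add further criteria (co-orthologs, gene neighborhood, length normalization) on top of this kind of comparison.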

Curriculum vitae for Dr. Xiangyang Li