Janssen Genomics
technical
demystifying bioinformatics

Approaches to characterising biological macromolecular data

Here you will find explanations, help and background material for the most common algorithms and methods that we deploy.

 

Most bio-macromolecules are polymers that can be represented as sequences of their constituent components: sugars for polysaccharides, amino acids for proteins, and nucleotides for DNA.
The sequential order in which a protein's amino acids or DNA's nucleotides are strung together gives that molecule its properties. When characterising a new gene or protein, a usual first step is to look in the sequence for compositional patterns of known function, origin, or relationship. The molecule can then be further characterised by examining its physicochemical properties, overall compositional bias, position in a family cluster, predicted structure and sub-cellular localisation. The reverse process is also possible: a relevant new gene can be identified in a database or genome by searching for the desired properties. Previously unknown genes can thus be found by their own properties or by the properties of their protein product, a process crucial to describing biochemical pathways, finding evolutionary homologs, or discovering antigens. The sections below give explanations, background and technical details of how functional, relational or evolutionary information is gained from biological sequences.
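
As a small illustration of treating a macromolecule as a sequence and measuring its overall compositional bias, the sketch below counts residue frequencies and the GC fraction of a DNA string. The function names and the example sequence are made up for illustration; this is a minimal sketch, not a description of any particular tool.

```python
from collections import Counter


def composition(seq):
    """Return the fraction of each residue (letter) in a sequence."""
    seq = seq.upper()
    counts = Counter(seq)
    total = len(seq)
    # An empty sequence yields an empty dictionary.
    return {residue: n / total for residue, n in counts.items()}


def gc_content(dna):
    """Return the fraction of G and C nucleotides in a DNA sequence."""
    dna = dna.upper()
    if not dna:
        return 0.0
    return (dna.count("G") + dna.count("C")) / len(dna)


example_dna = "ATGGCGCGTTAA"  # hypothetical sequence, for illustration only
print(composition(example_dna))
print(gc_content(example_dna))
```

The same `composition` function works unchanged on a protein sequence written in the one-letter amino acid code.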

 

The subject areas covered are:

  • probabilistic models
  • motifs and modules
  • comparative genomics
  • inferred phylogenies

 

bioinformatics software standards

We strongly support a bioinformatics standard. Click here for more information.

Similarity search strategies

One of the most fundamental tasks in analysing a sequence is searching databases for similar sequences, which usually provides the first clues as to whether the sequence belongs to an already studied gene/protein family. If the sequence is similar to another sequence, the two may be homologous (i.e. descended from a common ancestral sequence). Knowing the function of a similar or homologous sequence will often give a good indication of the identity of the unknown sequence.
Several strategies exist for finding similar sequences, ranging from direct alignment of raw sequence data to searching databases with probabilistic models (the latter will be covered in another section below).

Heuristic alignment algorithms
A number of dynamic programming algorithms exist that are guaranteed to find the optimal alignment of two sequences under a given scoring scheme. However, these approaches are too computationally intensive for exhaustive searches of today's large genome and proteome databases. To reduce computing time, heuristic approaches have been developed that trade some sensitivity for much faster searches.
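
To make the cost of dynamic programming concrete, the sketch below computes the best local alignment score of two sequences in the style of the Smith-Waterman algorithm. The scoring values (match, mismatch and a linear gap penalty) are arbitrary assumptions chosen for illustration; real searches normally use substitution matrices and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b with a linear gap penalty.

    Only two rows of the dynamic programming matrix are kept, because
    the score (not the alignment itself) is all that is returned.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                # start a new local alignment here
                prev[j - 1] + s,  # align a[i-1] with b[j-1]
                prev[j] + gap,    # gap in b
                curr[j - 1] + gap # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Every cell of the matrix is visited once, so the running time grows with the product of the two sequence lengths. Repeating this for every sequence in a large database is what makes the exact approach impractical.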

Basic Local Alignment
BLAST is probably the best known and most widely used heuristic tool for searching sequence databases. The BLAST package provides programs for finding high scoring local alignments between a query sequence and a target database. A very readable guide to BLAST can be found in the freely available paper:

Pertsemlidis A, Fondon JW 3rd. Having a BLAST with bioinformatics (and avoiding BLASTphemy). Genome Biol. 2001;2(10):REVIEWS2002. Epub 2001 Sep 27.
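
The general idea behind this kind of heuristic can be sketched as "seed and extend": find short exact word matches between query and target, then extend each match in both directions without gaps until the score drops too far below its best value. The code below is a deliberately simplified illustration of that idea, not the BLAST implementation. BLAST itself also uses neighbourhood words scored with a substitution matrix, gapped extension and alignment statistics, none of which are shown. All function names, parameter names and default values here are assumptions made for the example.

```python
def word_index(seq, w):
    """Map every word of length w in seq to its start positions."""
    index = {}
    for i in range(len(seq) - w + 1):
        index.setdefault(seq[i:i + w], []).append(i)
    return index


def extend(query, target, qi, ti, step, match, mismatch, x_drop):
    """Ungapped extension from (qi, ti) in direction step (+1 or -1).

    Returns (best score gained, number of residues used to reach it).
    Stops once the running score falls more than x_drop below the best.
    """
    score = best = best_len = length = 0
    while 0 <= qi < len(query) and 0 <= ti < len(target):
        score += match if query[qi] == target[ti] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break
        qi += step
        ti += step
    return best, best_len


def seed_and_extend(query, target, w=4, match=1, mismatch=-2, x_drop=3):
    """Return (score, query start, target start, length) of the best
    ungapped local match that grows out of an exact word hit."""
    index = word_index(target, w)
    best_hit = (0, 0, 0, 0)
    for qi in range(len(query) - w + 1):
        for ti in index.get(query[qi:qi + w], []):
            right, rlen = extend(query, target, qi + w, ti + w, 1,
                                 match, mismatch, x_drop)
            left, llen = extend(query, target, qi - 1, ti - 1, -1,
                                match, mismatch, x_drop)
            score = w * match + left + right
            if score > best_hit[0]:
                best_hit = (score, qi - llen, ti - llen, llen + w + rlen)
    return best_hit


score, q_start, t_start, length = seed_and_extend("GGACGTACGTTT",
                                                  "CCCACGTACGAAA")
print(score, "GGACGTACGTTT"[q_start:q_start + length])
```

Because only positions that share a word are examined, most of the dynamic programming matrix is never touched, which is where the speed comes from. The price is that similarities without a shared word are missed.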

FASTA
Another widely used heuristic approach to sequence searching is the FASTA package, which uses a multistep method for finding high scoring local alignments. FASTA identifies gapped alignments through a three step process that starts from short local word matches. The last stage of FASTA uses standard dynamic programming, producing scores directly comparable to those of full local alignment algorithms and generally achieving greater sensitivity than BLAST. Details can be found in:

Pearson WR. Flexible sequence similarity searching with the FASTA3 program package. Methods Mol Biol. 2000;132:185-219.
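
The starting point of such a multistep method, short word matches, can be sketched as follows. Each shared word at query position i and target position j lies on the diagonal i - j of the alignment matrix, so diagonals that collect many word matches mark regions worth passing on to the later, more expensive stages. This is only an illustration of the first step under simplifying assumptions; the function names and the word length parameter `ktup` are chosen for the example, and the rescoring, joining and dynamic programming stages are not shown.

```python
from collections import Counter


def diagonal_hits(query, target, ktup=2):
    """Count exact word matches of length ktup on each diagonal i - j."""
    positions = {}
    for j in range(len(target) - ktup + 1):
        positions.setdefault(target[j:j + ktup], []).append(j)

    diagonals = Counter()
    for i in range(len(query) - ktup + 1):
        for j in positions.get(query[i:i + ktup], []):
            diagonals[i - j] += 1
    return diagonals


def best_diagonals(query, target, ktup=2, top=3):
    """Return up to `top` (diagonal, word match count) pairs, best first."""
    return diagonal_hits(query, target, ktup).most_common(top)


# Hypothetical protein fragments: the target carries the query with a
# two-residue offset, so all word matches fall on diagonal -2.
print(best_diagonals("MKVLAAG", "PPMKVLAAG"))
```

A smaller `ktup` finds more word matches and is more sensitive, at the cost of more diagonals to examine.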

A comparison of similarity search programs

Bioinformatics. 2003 Dec 12;19(18):2456-60.

Proc Natl Acad Sci U S A. 1998 May 26;95(11):6073-8.

 


Sequence families

To be continued soon.

 

 


Copyright 2004 - 2005 Janssen Genomics  bioinformatics   http://janssen-genomics.com