Low complexity filtering
The server filters your query sequence for regions of low compositional
complexity by default. Low complexity regions commonly give
spuriously high scores that reflect compositional bias rather than
significant position-by-position alignment. Filtering can eliminate
these potentially confounding matches (e.g., hits against proline-rich
regions or poly-A tails) from the BLAST reports, leaving regions whose
BLAST statistics reflect the specificity of their pairwise alignment.
Queries searched with the blastn program are filtered with DUST. Other
programs use SEG.
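
As a rough illustration of what "low compositional complexity" means, the
Python sketch below scores fixed-length windows by the Shannon entropy of
their letter composition and reports the windows that fall below a cutoff.
It is not the DUST or SEG algorithm. The function names, window length, and
threshold are assumptions chosen only for this example.

```python
import math
from collections import Counter


def window_entropy(window: str) -> float:
    """Shannon entropy (bits per letter) of the letter composition of a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def low_complexity_windows(seq: str, window: int = 12, threshold: float = 2.2):
    """Return start offsets of windows whose entropy is below the threshold.

    Illustrative only: the default window and threshold are arbitrary
    assumptions, not the parameters used by DUST or SEG.
    """
    return [
        i
        for i in range(len(seq) - window + 1)
        if window_entropy(seq[i:i + window]) < threshold
    ]
```

A window made of a single repeated letter scores 0 bits, while a window that
uses four letters equally scores 2 bits, so a homopolymer run such as a
poly-A tail falls below any reasonable cutoff.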
Low complexity sequence found by a filter program is replaced with the
letter "N" in nucleotide sequences (e.g., "NNNNNNNNNNNNN") and the letter "X"
in protein sequences (e.g., "XXXXXXXXX"). Users may turn off filtering by
using the "Filter" option on the "Advanced options for the BLAST server" page.
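
The substitution described above can be sketched in Python as follows. This
is a minimal illustration, not the server's code: the function name
mask_regions and the half-open (start, end) interval format are assumptions,
and the regions are presumed to come from a separate filter program.

```python
def mask_regions(seq: str, regions, is_protein: bool) -> str:
    """Replace each half-open (start, end) region with the masking letter.

    Uses "X" for protein sequences and "N" for nucleotide sequences.
    The interval format is an assumption made for this example.
    """
    mask_char = "X" if is_protein else "N"
    chars = list(seq)
    for start, end in regions:
        # Clamp to the sequence bounds so out-of-range intervals are harmless.
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)
```

For example, mask_regions("MKPPPPPPLV", [(2, 8)], is_protein=True) returns
"MKXXXXXXLV", which keeps the sequence length and coordinates unchanged.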
References for the DUST program:
Tatusov, R. L. and D. J. Lipman, in preparation.
Hancock, J. M. and J. S. Armstrong (1994). SIMPLE34: an
improved and enhanced implementation for VAX and Sun computers of the
SIMPLE algorithm for analysis of clustered repetitive motifs in nucleotide
sequences. Comput Appl Biosci 10:67-70.
References for the SEG program:
Wootton, J. C. and S. Federhen (1993). Statistics of
local complexity in amino acid sequences and sequence databases.
Computers & Chemistry 17:149-163.
Wootton, J. C. and S. Federhen (1996). Analysis of
compositionally biased regions in sequence databases. Methods in Enzymology
266:554-571.
Reference for the role of filtering in search strategies:
Altschul, S. F., M. S. Boguski, W. Gish, and
J. C. Wootton (1994). Issues in searching molecular sequence
databases. Nat Genet 6:119-129.