In the context of Christopher Noble's return to basics, I would appreciate some clarification of the methods employed to characterize the HIV genome.
On the HIV Sequence Database FAQ webpage (http://hiv-web.lanl.gov/content/hiv-db/HTML/FAQ.html), we are told that:
"A consensus sequence is a sequence of the most common nucleotide or amino acid at each position in an alignment. We generally use a 50% cut-off, such that at least 50% of the sequences have the same character at this position, or else we replace the character with a question mark. Another way to create a consensus is to take the most frequently occurring character, even if it is not the majority."
"Consensus sequences are built from an alignment. The alignment itself might be dominated by one type of sequence, such as subtype B sequences from the United States. So in general a consensus sequence is not the same as the common ancestor of the sequences, although in some cases it can approximate an ancestral sequence."
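To make sure I am reading the two quoted rules correctly, here is a minimal sketch of how I understand them, written in Python. It is my own illustration and not the database's actual code: the function name `consensus`, its parameters, and the toy alignment are hypothetical, and I have assumed that every character in a column (including any dash) is simply counted like any other.

```python
from collections import Counter


def consensus(alignment, cutoff=0.5, unknown="?"):
    """Build a consensus string from equal-length aligned sequences.

    cutoff=0.5 follows the quoted "at least 50%" rule; cutoff=None
    switches to the quoted alternative of taking the most frequent
    character even when it is not a majority.
    Assumption: dashes are counted like any other character.
    """
    if not alignment:
        return ""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    result = []
    for column in zip(*alignment):  # one tuple of characters per position
        char, count = Counter(column).most_common(1)[0]
        if cutoff is None or count / len(column) >= cutoff:
            result.append(char)
        else:
            result.append(unknown)  # no character reaches the cut-off
    return "".join(result)


# Hypothetical toy alignment, not real HIV data.
toy_alignment = ["ACGT", "ACGT", "ACGA", "ATGC", "ACTG"]

print(consensus(toy_alignment))               # 50% cut-off rule
print(consensus(toy_alignment, cutoff=None))  # most-frequent-character rule
```

In this toy case the last position is shared by only two of five sequences, so the two rules give different results there: a question mark under the 50% cut-off, and the most frequent character under the alternative rule.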
On a linked page, in regard to "INTERPRETING THE FORMAT OF CONSENSUS SEQUENCES", we are told that:
"If most positions in an alignment are dashes inserted to maintain the alignment, in the consensus sequence no amino acid is put in that position."
Therefore, question marks and dashes do not appear to be equivalent. Would you agree that dashes indicate a complete lack of information for a particular nucleotide position? If so, what is the source for the original alignments on which consensus genomes are built?
Do individual researchers complete the process of determining a consensus genome prior to submission of the sequence to the HIV database? If so, is there a minimum requirement for the number of sequences used?
In the example above, there is only one nucleotide position which is reported as less than 100% consistent, but more than 50%. How common is this occurrence in practice, as a percentage of the nucleotide positions in the string being reported? What percentage typically fails to satisfy the 50% cut-off?
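To be concrete about the percentages I am asking for, the tally I have in mind could be sketched as follows. Again this is only an illustration under my own assumptions: the function names, the returned keys, and the toy alignment are hypothetical, and dashes are counted like any other character.

```python
from collections import Counter


def column_agreement(alignment):
    """For each position, the fraction of sequences sharing the most common character."""
    return [
        Counter(column).most_common(1)[0][1] / len(column)
        for column in zip(*alignment)
    ]


def summarize(alignment, cutoff=0.5):
    """Percentage of positions that are unanimous, majority-only, or below the cut-off."""
    fractions = column_agreement(alignment)
    total = len(fractions)
    if total == 0:
        raise ValueError("alignment has no positions")
    unanimous = sum(1 for f in fractions if f == 1.0)
    below = sum(1 for f in fractions if f < cutoff)
    majority = total - unanimous - below  # at or above the cut-off, but not 100%
    return {
        "unanimous_pct": 100 * unanimous / total,
        "majority_pct": 100 * majority / total,
        "below_cutoff_pct": 100 * below / total,
    }


# Hypothetical toy alignment, not real HIV data.
toy_alignment = ["ACGT", "ACGT", "ACGA", "ATGC", "ACTG"]
print(summarize(toy_alignment))
```

The "majority_pct" figure corresponds to my first question (positions less than 100% consistent but meeting the cut-off), and "below_cutoff_pct" to my second.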
Assuming a certain number of sequences as the minimum basis for a consensus genome, what degree of homology is demonstrated among different consensus genomes derived from the same patient? Is this typically attempted?
It appears that when researchers discuss the variability of HIV, they may be thinking of "less than 100% consistency at a certain nucleotide position", "less than 50%", some range of variation among consensus genomes (either between or within patients), or any number of observations of mutation or evolution over time. Are there any nomenclature conventions that ensure a common context?
Brian Foley has previously stated that, "Almost all regions of the HIV-1 or HIV-2 or SIV genome are 'novel' or unique to the individual viral isolate from which they are obtained. The pol gene evolves more slowly than the env gene over time, but all regions of the genome evolve at some rate. Only very short regions, such as the Lys-tRNA primer-binding site, and the polypurine tract, are highly conserved in all SIVs and HIVs. These short regions are less than 50 bp in length."
Would it be accurate to describe these "less than 50 bp" sequences as absolutely consistent in all HIV and SIV genomes ever reported? If less than 100%, by how much?
How many of these 50 bp or so sequences are there, and with which genes are they associated?
Thanks in advance. I think your answers may help to ensure we're all talking about the same thing in regard to genome variability and identity.
Competing interests: None declared