ET-TDT (Evolutionary Tree-Transmission Disequilibrium Test)

DOS Programs Files and Examples

Programs (MS-DOS format):   inferhap.exe     ettdt.exe
Examples:   Example 1       Example 2

Instructions

Overview

ET-TDT is a procedure to combine the benefits of measured haplotype analysis with TDT to find which haplotypes or groups of haplotypes, as defined in an evolutionary tree, are responsible for increased (or decreased) relative risk for a genetic condition. The genetics and statistics are discussed in ``Transmission/Disequilibrium Test Meets Measured Haplotype Analysis: Family-Based Association Analysis Guided by Evolution of Haplotypes'' by Howard Seltman, Kathryn Roeder, and B. Devlin, Am. J. Hum. Genet., 68:1250-1263, 2001. The abstract (and full text for those with a subscription) can be found at the AJHG web site. The DOS-compatible software is designed to use closely spaced markers from trios of mother, father and affected child to parse a cladogram depicting the evolutionary relationships among the haplotypes into regions of homogeneous relative risk. Missing marker data is allowed.

After the researcher constructs the cladogram, e.g. using PAUP software, the analysis takes place in two steps. Step one, performed by the inferhap program, infers the haplotypes of the trios. Three output files are produced: an ``unambiguous'' file holds the haplotypes for the trios which can be inferred unambiguously, the ``ambiguous'' file holds the various possibilities for each trio which cannot be inferred unambiguously, and the ``counts'' file specifies the number of unambiguous trios, the number of ambiguous trios, and the number of possible haplotype combinations consistent with each ambiguous trio.

Step two is the actual ET-TDT analysis, conducted by the ettdt program using the output files of step one as its input. The ettdt program constructs a representation of the initial cladogram from your description of the edges connecting the haplotype nodes. It then iterates the following procedure. An attempt is made to collapse each terminal node into its adjacent node. A collapse is rejected if the score test for a common parameter for the two nodes, conditional on the collapsing of the cladogram up to the previous iteration, is rejected at the alpha level (default 0.05), optionally with a Bonferroni correction (applied by default). Note that after the first iteration, some nodes may contain multiple haplotypes. In fact, it may be desirable to collapse some infrequent nodes before the first iteration. Once the collapse of an edge is rejected, no further attempts are made to collapse it. The iterations proceed until there are no more terminal nodes.

Performing the haplotype inference step

The marker file is a plain ASCII text file with one line per trio. Each line consists of three blocks, first for the two parents (in either order) and then for the offspring. The blocks are separated by at least one space or tab. Each block is made of a number of sub-blocks equal to the number of different markers tested. The sub-blocks are separated by at least one space or tab. Each sub-block consists of exactly two marker codes separated by space(s) or tab(s). The marker codes identify the two alleles found for that person at that location. The marker codes can be of any form as long as there are no embedded spaces. Homozygous markers must be included twice. The default code for a missing marker is ``-1'', but any other code may be used if specified in the program command line option ``mi=''. A missing parent is coded as 2M missing value codes, where M is the number of markers in a haplotype.
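The line layout described above can be illustrated with a small Python sketch. It is not part of the ET-TDT software; the function names and the sample line are hypothetical, and the only facts taken from the description are the three blocks per line, the two codes per marker, whitespace separation, and the default missing code ``-1''.

```python
from typing import List, Tuple

Genotype = List[Tuple[str, str]]  # one (allele, allele) pair per marker


def parse_trio_line(line: str, n_markers: int) -> Tuple[Genotype, Genotype, Genotype]:
    """Split one marker-file line into parent, parent and offspring blocks."""
    codes = line.split()  # any run of spaces or tabs separates marker codes
    block_size = 2 * n_markers  # two allele codes per marker
    if len(codes) != 3 * block_size:
        raise ValueError(
            f"expected {3 * block_size} marker codes, found {len(codes)}"
        )

    def block(start: int) -> Genotype:
        chunk = codes[start:start + block_size]
        return [(chunk[i], chunk[i + 1]) for i in range(0, block_size, 2)]

    return block(0), block(block_size), block(2 * block_size)


def is_missing_parent(genotype: Genotype, missing: str = "-1") -> bool:
    """A missing parent is coded as 2M missing-value codes."""
    return all(a == missing and b == missing for a, b in genotype)


# Hypothetical trio with M = 2 markers; the second parent is missing.
line = "1 2  3 3\t-1 -1  -1 -1\t1 1  3 4"
parent_a, parent_b, child = parse_trio_line(line, n_markers=2)
```

Because the blocks are positional, a line with the wrong number of codes cannot be split reliably, which is why the sketch raises an error instead of guessing.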

A separate plain ASCII file, the haplotype file, must list all possible haplotypes in the data. This file has one line per haplotype (say, H of them), and M+1 columns. The first column is the haplotype code; in the current implementation the haplotype codes must run from ``0'' to ``H-1'', although this requirement will be relaxed in future versions. The remaining M columns give the marker codes that define the haplotype, one per marker.
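A haplotype file can be checked against these rules before running inferhap. The following Python sketch is only an illustration: the function name and sample data are hypothetical, and it assumes that the columns are whitespace-separated and that the M columns after the haplotype code hold the marker codes.

```python
def check_haplotype_file(text: str, n_markers: int) -> dict:
    """Return {haplotype code: marker codes} after basic format checks."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    for row in rows:
        if len(row) != n_markers + 1:  # haplotype code plus M columns
            raise ValueError(f"expected {n_markers + 1} columns: {row}")
    codes = [row[0] for row in rows]
    expected = {str(i) for i in range(len(rows))}  # "0" .. "H-1"
    if len(set(codes)) != len(codes) or set(codes) != expected:
        raise ValueError("haplotype codes must run from 0 to H-1")
    return {row[0]: row[1:] for row in rows}


# Hypothetical haplotype list with H = 3 haplotypes and M = 2 markers.
sample = "0 1 3\n1 2 3\n2 1 4\n"
haplotypes = check_haplotype_file(sample, n_markers=2)
```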

If the marker file is called ``marker.dat'' and the haplotype file is called ``haplist'', then simply running inferhap at the DOS command prompt will create the three output files: counts.dat, unamb.dat and amb.dat. These files are the input to the ``ettdt'' program.

There are several optional run-string parameters for the ``inferhap'' program. Only the first two letters of each option are required. They may be included in any order, so, e.g., the following commands are equivalent:

inferhap ha=g6pd.hap ma=g6pd mi=na pre=
inferhap pre= ha=g6pd.hap ma=g6pd mi=na

The major run-string parameters are:

argument           default    meaning
ha[plotypefile]=   haplist    name of the haplotype file
ma[rkerfile]=      (blank)    middle characters of the marker file name (after prefix; before suffix)
mi[ssingcode]=     -1         code for a missing marker
pr[efix]=          marker     prefix for marker file name
su[ffix]=          .dat       suffix for marker file name
re[code]=          (blank)    list of haplotype codes that are pre-collapsed
dr[op]=            1000       number of ambiguities that trigger dropping a trio

The use of the markerfile=XX argument, in addition to changing the name of the marker file read by ``inferhap'', changes the names of the output files to countsXX.dat, unambXX.dat and ambXX.dat. This may be useful for keeping track of markers and inferred haplotypes from multiple sources.
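The naming rules can be summarised in a few lines of Python. This is a sketch of the convention as described, not code from inferhap; the function name is hypothetical, and it assumes the output files always end in ``.dat'' regardless of the suffix= argument, since only the countsXX.dat form is documented.

```python
def inferhap_file_names(markerfile: str = "", prefix: str = "marker",
                        suffix: str = ".dat"):
    """Return the marker file name and the three output file names."""
    marker = f"{prefix}{markerfile}{suffix}"  # prefix + middle + suffix
    outputs = [f"{stem}{markerfile}.dat" for stem in ("counts", "unamb", "amb")]
    return marker, outputs


# With no arguments the defaults give marker.dat and counts.dat etc.
default_marker, default_outputs = inferhap_file_names()

# Mirrors the earlier example "inferhap ha=g6pd.hap ma=g6pd mi=na pre=".
g6pd_marker, g6pd_outputs = inferhap_file_names(markerfile="g6pd", prefix="")
```

Under these assumptions the second call reads g6pd.dat and writes countsg6pd.dat, unambg6pd.dat and ambg6pd.dat.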

The command ``inferhap help'' gives a short description of the file types and the run-string parameters.

The recode= argument is used to manually specify which (infrequent) haplotypes are recoded as (more frequent) haplotypes using the syntax recode=a:b,c:d etc. where a:b means haplotype "a" is included in a haplotype group with "b". Be sure the second haplotype is not recoded elsewhere in the recode= argument, e.g. a:b,b:c is invalid. On the other hand it is OK to use, e.g. a:b,c:b. The new haplotypes and haplotype groups will automatically be recoded using a 0 to H'-1 scheme (where H' is the number of haplotype groups in the pre-collapsed cladogram). A printout of the new codings is produced to aid in construction of the (now condensed) haplotype file for use with the ``ettdt'' program.
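The validity rule for recode= can be checked before running the program. The Python sketch below is an illustration only; the function name is hypothetical and it covers just the rule stated above (a target haplotype must not itself be recoded), not the automatic renumbering that inferhap performs.

```python
def parse_recode(arg: str) -> dict:
    """Parse 'a:b,c:d' into {source: target} and reject chained recodes."""
    mapping = {}
    for pair in arg.split(","):
        source, target = pair.split(":")
        mapping[source] = target
    # A target that is also a source (a:b,b:c) is invalid.
    chained = sorted(t for t in set(mapping.values()) if t in mapping)
    if chained:
        raise ValueError(f"target haplotypes recoded elsewhere: {chained}")
    return mapping


shared_target = parse_recode("a:b,c:b")  # allowed: two sources, one target
```

Calling parse_recode("a:b,b:c") raises ValueError, matching the invalid case described above.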

The drop= argument drops any trio that has at least that many possible haplotype combinations. This allows very uninformative trios to be excluded.

Performing the ET-TDT step

The input to the ``ettdt'' program consists of the three files produced by the ``inferhap'' program plus the ``cladogram file''. The cladogram file is a plain ASCII text file with one line per edge of the cladogram diagram. Each line consists of the haplotype codes for the two nodes at either end of the edge, separated by space(s) or tab(s). The edges may be listed in any order, and the nodes for each edge may be listed in either order.
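The cladogram file format is simple enough to read with a few lines of Python. The sketch below is not part of ettdt; the function names and sample edges are hypothetical, and it assumes that a ``terminal'' node is one with exactly one neighbour, which is the usual meaning for a tree.

```python
from collections import defaultdict


def read_cladogram(text: str) -> dict:
    """Build {node: set of neighbours} from one 'a b' edge per line."""
    neighbours = defaultdict(set)
    for line in text.splitlines():
        fields = line.split()  # spaces or tabs
        if not fields:
            continue
        if len(fields) != 2:
            raise ValueError(f"expected two haplotype codes: {line!r}")
        a, b = fields
        neighbours[a].add(b)  # edges are undirected, so record both ways
        neighbours[b].add(a)
    return dict(neighbours)


def terminal_nodes(neighbours: dict) -> list:
    """Nodes with a single neighbour (assumed meaning of 'terminal')."""
    return sorted(n for n, adj in neighbours.items() if len(adj) == 1)


# Hypothetical five-haplotype cladogram: 0-1-2 with 3 and 4 hanging off 2.
edges = "0 1\n1 2\n3 2\n2\t4\n"
tree = read_cladogram(edges)
leaves = terminal_nodes(tree)
```

Because the adjacency is stored symmetrically, the order of the edges and of the two nodes within each edge does not matter, as the format allows.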

Simply running ettdt performs the ET-TDT analysis on the data in counts.dat, unamb.dat and amb.dat according to the cladogram described in the file ``cladogram''. The optional run-string parameters are listed here. As for ``inferhap'' only two letters are needed, and they may be included in any order.

argument          default     meaning
cl[adefile]=      cladogram   name of cladogram file
fi[lenum]=        (blank)     XX in markerXX.dat for marker file name
al[pha]=          0.05        type 1 error level desired
bo[nferroni]=     1           0 means don't use Bonferroni correction
ve[rbose]=        0           1 prints more detail
re[jectids]=      (blank)     codes for brief output format (see below)
th[isthetaeq1]=   0           haplotype code that is considered to have theta=1

The brief output format allows you to specify 1 or more (say, R) edges to monitor. Then the output consists only of R+1 numbers, where the first number is the total number of rejected edge collapses, and the additional numbers are zero or one to indicate collapse or no collapse for each of the R specified edges. This output format is useful for simulation studies. The argument for ``rejectids='' is easiest to specify after running the program with ``ve=1''. The argument takes the form re=a:b[,c:d[...]] where each comma separated sub-argument is a separate edge to test, and the codes separated by a colon are the first (or only) haplotype codes listed in the ``ve=1'' output for the first two ``Check'' output items that are separated by an ``x''. (See example below.)
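For simulation studies the brief output can be post-processed with a short script. The Python sketch below is an assumption-laden illustration: the function names are hypothetical, the R+1 numbers are assumed to be whitespace-separated, and a value of 1 is read as ``no collapse'' (collapse rejected), following the order in which the two outcomes are listed above.

```python
def build_rejectids(edges) -> str:
    """Format monitored edges as the re=a:b,c:d run-string argument."""
    return "re=" + ",".join(f"{a}:{b}" for a, b in edges)


def parse_brief_output(text: str, edges):
    """Return (total rejections, {edge: True if its collapse was rejected})."""
    values = [int(token) for token in text.split()]
    if len(values) != len(edges) + 1:  # R + 1 numbers expected
        raise ValueError(f"expected {len(edges) + 1} numbers, got {len(values)}")
    total, flags = values[0], values[1:]
    return total, {edge: bool(flag) for edge, flag in zip(edges, flags)}


# Hypothetical monitored edges and a hypothetical line of brief output.
monitored = [("1", "4"), ("3", "1")]
argument = build_rejectids(monitored)
total_rejected, rejected = parse_brief_output("3 1 0", monitored)
```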

The default ``ve=0'' output format only prints the final cladogram in a one line per node format. A sample final cladogram output is

Cladogram:
 O1: (1,2) -> [4,3]  theta=2.21
 O3: (3) -> [1]  theta=1.01
 O4: (4,5,0) -> [6,1]  theta=1
 O6: (6,7,8) -> [4]  theta=2.08

This cladogram has four ``open'' nodes. (Open nodes must remain uncollapsed.) The numbers in parentheses mean that the node nominally labelled ``O1'' contains haplotypes 1 and 2. The numbers in square brackets mean that this node is connected to nodes ``O4'' and ``O3''. The estimated theta for this node is 2.21. Further study of this cladogram printout reveals that the collapsed cladogram has the form (3)-(1,2)-(4,5,0)-(6,7,8), where the nodes consisting of haplotypes 1 and 2 and of 6, 7 and 8 have higher haplotype relative risk.
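A printout in this format can be read back into a data structure for further processing. The Python sketch below is based only on the sample shown above; the regular expression and function name are hypothetical, and other ettdt output may differ in layout.

```python
import re

# One node per line: O<label>: (haplotypes) -> [neighbours]  theta=<value>
NODE_LINE = re.compile(
    r"O(?P<label>[^:\s]+):\s*\((?P<haps>[^)]*)\)\s*->\s*"
    r"\[(?P<adj>[^\]]*)\]\s*theta=(?P<theta>\S+)"
)


def parse_final_cladogram(text: str) -> dict:
    """Return {node label: haplotypes, neighbour labels, theta}."""
    nodes = {}
    for line in text.splitlines():
        match = NODE_LINE.search(line)
        if match is None:
            continue  # skips the "Cladogram:" header and blank lines
        nodes[match["label"]] = {
            "haplotypes": match["haps"].split(","),
            "neighbours": match["adj"].split(","),
            "theta": float(match["theta"]),
        }
    return nodes


sample = """Cladogram:
 O1: (1,2) -> [4,3]  theta=2.21
 O3: (3) -> [1]  theta=1.01
 O4: (4,5,0) -> [6,1]  theta=1
 O6: (6,7,8) -> [4]  theta=2.08
"""
nodes = parse_final_cladogram(sample)

# Every listed connection should appear from both ends.
symmetric = all(
    label in nodes[other]["neighbours"]
    for label, node in nodes.items()
    for other in node["neighbours"]
)
```

The symmetry check confirms that the parsed connections agree with the chain (3)-(1,2)-(4,5,0)-(6,7,8) read off the sample by hand.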

Using the optional ``ve=1'' format the printout first shows the initial cladogram using the labels ``T'' for a terminal node (will attempt to collapse in this iteration), ``O'' for a node that will remain open, and a blank for nodes that are neither terminal nor already checked. The printout then shows collapses checked for the current iteration, the score test p-value and whether collapsing is accepted or rejected. At the end of each iteration a new cladogram is printed. If there are still terminal nodes, another iteration is performed.