After the researcher constructs the cladogram, e.g. using PAUP software, the analysis takes place in two steps. Step one, performed by the inferhap program, infers the haplotypes of the trios. Three output files are produced: the ``unambiguous'' file holds the haplotypes for the trios that can be inferred unambiguously, the ``ambiguous'' file holds the various possibilities for each trio that cannot be inferred unambiguously, and the ``counts'' file specifies the number of unambiguous trios, the number of ambiguous trios, and the number of possible haplotype combinations consistent with each ambiguous trio.
Step two is the actual ET-TDT analysis, conducted by the ettdt program using the output files of step one as its input. The ettdt program constructs a representation of the initial cladogram from your description of the edges connecting the haplotype nodes. It then iteratively performs the following procedure. An attempt is made to collapse each terminal node into its adjacent node. A collapse is rejected if the score test for a common parameter for the two nodes, conditional on the collapsing of the cladogram up to the previous iteration, is rejected at the alpha level (default 0.05), possibly using a Bonferroni correction (default yes). Note that after the first iteration, some nodes may contain multiple haplotypes. In fact, it may be desirable to collapse some infrequent nodes before the first iteration. Once the collapse of an edge is rejected, no further attempts are made to collapse it. The iterations proceed until there are no more terminal nodes.
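The iteration described above can be sketched in Python. This is only an illustration, not the ettdt source: the function `collapse_cladogram` and the callable `p_value` (a stand-in for the score test) are hypothetical names, a ``terminal'' node is assumed to be one with exactly one edge that has not yet been tested and rejected, and the Bonferroni divisor is assumed to be the number of collapses attempted in the current iteration.

```python
def collapse_cladogram(edges, p_value, alpha=0.05, bonferroni=True):
    """Iteratively collapse terminal nodes (illustrative sketch only)."""
    groups = {}    # node id -> set of haplotype codes held by that node
    untested = {}  # node id -> neighbours across edges still eligible for collapse
    fixed = {}     # node id -> neighbours across edges whose collapse was rejected
    for u, v in edges:
        for n in (u, v):
            groups.setdefault(n, {n})
            untested.setdefault(n, set())
            fixed.setdefault(n, set())
        untested[u].add(v)
        untested[v].add(u)

    while True:
        # ASSUMPTION: terminal = exactly one edge not yet rejected.
        attempts = {}
        for node in sorted(untested):
            if len(untested[node]) == 1:
                (neighbour,) = untested[node]
                # Key by edge so a two-node component is tested only once.
                attempts.setdefault(frozenset((node, neighbour)), (node, neighbour))
        if not attempts:
            break
        # ASSUMPTION: Bonferroni divides alpha by the attempts in this iteration.
        threshold = alpha / len(attempts) if bonferroni else alpha
        # Score every attempt against the cladogram as it stood when the
        # iteration began, before any of this iteration's merges are applied.
        scored = [(t, nb, p_value(frozenset(groups[t]), frozenset(groups[nb])))
                  for t, nb in attempts.values()]
        for t, nb, p in scored:
            untested[t].discard(nb)
            untested[nb].discard(t)
            if p < threshold:
                # Collapse rejected: the edge stays and is never retried.
                fixed[t].add(nb)
                fixed[nb].add(t)
            else:
                # Collapse accepted: fold the terminal node into its neighbour.
                groups[nb] |= groups.pop(t)
                for other in fixed.pop(t):
                    fixed[other].discard(t)
                    fixed[other].add(nb)
                    fixed[nb].add(other)
                del untested[t]
    return groups, fixed


# Toy stand-in for the score test: haplotypes in the same made-up risk class
# are never distinguished, those in different classes always are.
risk_class = {3: "a", 1: "b", 2: "b", 4: "c", 5: "c", 0: "c", 6: "d", 7: "d", 8: "d"}

def toy_p_value(a, b):
    return 1.0 if risk_class[min(a)] == risk_class[min(b)] else 0.001

toy_edges = [(3, 1), (2, 1), (1, 4), (5, 4), (0, 4), (4, 6), (7, 6), (8, 6)]
groups, fixed = collapse_cladogram(toy_edges, toy_p_value)
```

With the toy inputs, `groups` holds the haplotypes of each surviving node and `fixed` holds the edges that remain between them. A real analysis would replace `toy_p_value` with the score test computed from the trio data.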
A separate plain ASCII file, the haplotype file, must list all possible haplotypes in the data. This file has one line per haplotype (say, H of them), and M+1 columns. The first column is the haplotype code; in the current implementation the haplotype codes must run from ``0'' to ``H-1'', although this requirement will be relaxed in future versions.
If the marker file is called ``marker.dat'' and the haplotype file is called ``haplist'', then simply running inferhap at the DOS command prompt will create the three output files: counts.dat, unamb.dat and amb.dat. These files are the input to the ``ettdt'' program.
There are several optional run-string parameters for the ``inferhap'' program. Only the first two letters of each option are required. They may be included in any order, so, e.g., the following commands are equivalent:

    inferhap ha=g6pd.hap ma=g6pd mi=na pre=
    inferhap pre= ha=g6pd.hap ma=g6pd mi=na

The major run-string parameters are:

| argument | default | meaning |
|---|---|---|
| ha[plotypefile]= | haplist | name of the haplotype file |
| ma[rkerfile]= | (blank) | middle characters of the marker file name (after prefix; before suffix) |
| mi[ssingcode]= | -1 | code for a missing marker |
| pr[efix]= | marker | prefix for marker file name |
| su[ffix]= | .dat | suffix for marker file name |
| re[code]= | | list of haplotype codes that are pre-collapsed |
| dr[op]= | 1000 | number of ambiguities that trigger dropping a trio |
The use of the markerfile=XX argument, in addition to changing the name of the marker file read by ``inferhap'', changes the names of the output files to countsXX.dat, unambXX.dat and ambXX.dat. This may be useful for keeping track of markers and inferred haplotypes from multiple sources.
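How the run-string options combine into file names can be sketched as follows. This is a minimal Python illustration, not the inferhap parser: `parse_run_string` and `file_names` are hypothetical helpers, options are assumed to be matched on their first two letters only, and the empty default for `recode` is an assumption because the table gives none.

```python
# Defaults copied from the table above; all values are kept as strings.
DEFAULTS = {
    "haplotypefile": "haplist",
    "markerfile": "",        # "(blank)" in the table
    "missingcode": "-1",
    "prefix": "marker",
    "suffix": ".dat",
    "recode": "",            # ASSUMPTION: no default is given in the table
    "drop": "1000",
}


def parse_run_string(args):
    """Resolve key=value arguments, matching each key on its first two letters."""
    by_two_letters = {name[:2]: name for name in DEFAULTS}
    options = dict(DEFAULTS)
    for arg in args:
        key, sep, value = arg.partition("=")
        if not sep or key[:2] not in by_two_letters:
            raise ValueError(f"unrecognised argument: {arg!r}")
        options[by_two_letters[key[:2]]] = value
    return options


def file_names(options):
    """Build the marker file name and the three output file names."""
    xx = options["markerfile"]
    return {
        "marker": options["prefix"] + xx + options["suffix"],
        "counts": f"counts{xx}.dat",
        "unamb": f"unamb{xx}.dat",
        "amb": f"amb{xx}.dat",
    }


first = parse_run_string("ha=g6pd.hap ma=g6pd mi=na pre=".split())
second = parse_run_string("pre= ha=g6pd.hap ma=g6pd mi=na".split())
names = file_names(first)
```

Under these assumptions the two equivalent command lines give identical options, and an empty `pre=` together with `ma=g6pd` makes the marker file name `g6pd.dat`.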
The command inferhap help gives a short description of the file types and the run-string parameters.
The recode= argument is used to manually specify which (infrequent) haplotypes are recoded as (more frequent) haplotypes using the syntax recode=a:b,c:d etc. where a:b means haplotype "a" is included in a haplotype group with "b". Be sure the second haplotype is not recoded elsewhere in the recode= argument, e.g. a:b,b:c is invalid. On the other hand it is OK to use, e.g. a:b,c:b. The new haplotypes and haplotype groups will automatically be recoded using a 0 to H'-1 scheme (where H' is the number of haplotype groups in the pre-collapsed cladogram). A printout of the new codings is produced to aid in construction of the (now condensed) haplotype file for use with the ``ettdt'' program.
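The validity rule for recode= can be checked mechanically. The Python sketch below is an illustration only: `parse_recode` and `renumber` are hypothetical helpers, and the order in which the new 0 to H'-1 codes are assigned (ascending order of the surviving original codes) is an assumption, so the printout from ``inferhap'' remains the authority on the actual new codings.

```python
def parse_recode(spec):
    """Parse 'a:b,c:d' into {a: b, c: d} and enforce the rule described above."""
    mapping = {}
    for item in spec.split(","):
        source, sep, target = item.partition(":")
        if not sep:
            raise ValueError(f"expected a:b, got {item!r}")
        source, target = int(source), int(target)
        if source in mapping:
            raise ValueError(f"haplotype {source} is recoded more than once")
        mapping[source] = target
    for source, target in mapping.items():
        # a:b,b:c is invalid: the second haplotype may not itself be recoded.
        if target in mapping:
            raise ValueError(f"haplotype {target} is a target but is also recoded")
    return mapping


def renumber(n_haplotypes, mapping):
    """Map each original code 0..H-1 to a new code 0..H'-1.

    ASSUMPTION: surviving haplotypes receive new codes in ascending order
    of their original codes; the real program may order them differently.
    """
    for code in list(mapping) + list(mapping.values()):
        if not 0 <= code < n_haplotypes:
            raise ValueError(f"haplotype code {code} is out of range")
    new_code = {}
    for h in range(n_haplotypes):
        if h not in mapping:
            new_code[h] = len(new_code)
    return {h: new_code[mapping.get(h, h)] for h in range(n_haplotypes)}


recoding = renumber(5, parse_recode("3:1,4:1"))
```

Here haplotypes 3 and 4 join the group of haplotype 1, leaving H' = 3 groups.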
The drop= argument drops any trio that has at least that many possible haplotype combinations. This allows very uninformative cases to be excluded.
Simply running ettdt performs the ET-TDT analysis on the data in counts.dat, unamb.dat and amb.dat according to the cladogram described in the file ``cladogram''. The optional run-string parameters are listed here. As with ``inferhap'', only the first two letters of each option are needed, and the options may be given in any order.
| argument | default | meaning |
|---|---|---|
| cl[adefile]= | cladogram | name of cladogram file |
| fi[lenum]= | (blank) | XX in markerXX.dat for marker file name |
| al[pha]= | 0.05 | type 1 error level desired |
| bo[nferroni]= | 1 | 0 means don't use Bonferroni correction |
| ve[rbose]= | 0 | 1 prints more detail |
| re[jectids]= | | codes for brief output format (see below) |
| th[isthetaeq1]= | 0 | haplotype code that is considered to have theta=1 |
The brief output format allows you to specify 1 or more (say, R) edges to monitor. Then the output consists only of R+1 numbers, where the first number is the total number of rejected edge collapses, and the additional numbers are zero or one to indicate collapse or no collapse for each of the R specified edges. This output format is useful for simulation studies. The argument for ``rejectids='' is easiest to specify after running the program with ``ve=1''. The argument takes the form re=a:b[,c:d[...]] where each comma separated sub-argument is a separate edge to test, and the codes separated by a colon are the first (or only) haplotype codes listed in the ``ve=1'' output for the first two ``Check'' output items that are separated by an ``x''. (See example below.)
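For a simulation study, the brief output of many runs can be tallied with a few lines of Python. This is a sketch under stated assumptions: the R+1 numbers are assumed to be whitespace-separated on one line, a flag of 1 is read as ``no collapse'' (collapse rejected) following the order of the description above, and `parse_brief` and `rejection_rates` are hypothetical helper names.

```python
def parse_brief(line, edges):
    """Split one brief-output line into (total rejections, {edge: flag})."""
    numbers = [int(token) for token in line.split()]
    if len(numbers) != len(edges) + 1:
        raise ValueError(f"expected {len(edges) + 1} numbers, got {len(numbers)}")
    total, flags = numbers[0], numbers[1:]
    if any(flag not in (0, 1) for flag in flags):
        raise ValueError("edge flags must be 0 or 1")
    return total, dict(zip(edges, flags))


def rejection_rates(lines, edges):
    """Fraction of replicates in which each monitored edge was not collapsed."""
    counts = {edge: 0 for edge in edges}
    for line in lines:
        _, flags = parse_brief(line, edges)
        for edge, flag in flags.items():
            counts[edge] += flag
    return {edge: counts[edge] / len(lines) for edge in edges}


# Made-up output lines from three simulated runs with re=1:4,6:4
monitored = ["1:4", "6:4"]
rates = rejection_rates(["2 1 0", "3 1 1", "0 0 0"], monitored)
```

The first number on each line counts rejections over the whole cladogram, so it can exceed the sum of the monitored flags.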
The default ``ve=0'' output format prints only the final cladogram, with one line per node. A sample final cladogram output is:

    Cladogram:
    O1: (1,2) -> [4,3] theta=2.21
    O3: (3) -> [1] theta=1.01
    O4: (4,5,0) -> [6,1] theta=1
    O6: (6,7,8) -> [4] theta=2.08

This cladogram has four ``open'' nodes. (Open nodes must remain uncollapsed.) The numbers in parentheses mean that the node nominally labelled ``O1'' contains haplotypes 1 and 2. The numbers in square brackets mean that this node is connected to nodes ``O4'' and ``O3''. The estimated theta for this node is 2.21. Further study of this cladogram printout reveals that the collapsed cladogram has the form (3)-(1,2)-(4,5,0)-(6,7,8), where the nodes consisting of haplotypes 1 and 2 and of 6, 7 and 8 have higher haplotype relative risk.
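Reading the final cladogram printout into a data structure makes the node connections explicit. The Python sketch below parses the sample output shown above; the regular expression is inferred from that single sample, so the exact spacing and label format of real ettdt output are assumptions, and `parse_cladogram` and `cladogram_edges` are hypothetical helper names.

```python
import re

# One node entry looks like: O1: (1,2) -> [4,3] theta=2.21
NODE_RE = re.compile(
    r"(?P<flag>[A-Za-z]?)(?P<num>\d+):\s*"
    r"\((?P<haps>[\d,\s]*)\)\s*->\s*"
    r"\[(?P<links>[\d,\s]*)\]\s*"
    r"theta=(?P<theta>\S+)"
)


def _int_list(text):
    return [int(token) for token in text.split(",") if token.strip()]


def parse_cladogram(text):
    """Return {node number: {'haplotypes', 'links', 'theta'}} from a printout."""
    nodes = {}
    for match in NODE_RE.finditer(text):
        nodes[int(match["num"])] = {
            "haplotypes": _int_list(match["haps"]),
            "links": _int_list(match["links"]),
            "theta": float(match["theta"]),
        }
    return nodes


def cladogram_edges(nodes):
    """Collect each connection once as a sorted (low, high) pair of node numbers."""
    return {tuple(sorted((num, link)))
            for num, node in nodes.items() for link in node["links"]}


sample = """Cladogram:
O1: (1,2) -> [4,3] theta=2.21
O3: (3) -> [1] theta=1.01
O4: (4,5,0) -> [6,1] theta=1
O6: (6,7,8) -> [4] theta=2.08
"""
nodes = parse_cladogram(sample)
edges = cladogram_edges(nodes)
```

For the sample, `edges` contains the three connections 1-3, 1-4 and 4-6, which is the chain (3)-(1,2)-(4,5,0)-(6,7,8) described above.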
Using the optional ``ve=1'' format the printout first shows the initial cladogram using the labels ``T'' for a terminal node (will attempt to collapse in this iteration), ``O'' for a node that will remain open, and a blank for nodes that are neither terminal nor already checked. The printout then shows collapses checked for the current iteration, the score test p-value and whether collapsing is accepted or rejected. At the end of each iteration a new cladogram is printed. If there are still terminal nodes, another iteration is performed.