
The localized haplotype-cluster model uses variable-order Markov chains (VOMCs) to create
an empirical model for haplotype probabilities that adapts to the changing structure of
linkage disequilibrium (LD) across the genome. By clustering partial haplotypes based on
the Markov property as represented by a directed acyclic graph (DAG), the model is able
to take advantage of context-sensitive conditional independencies to improve estimates of
haplotype frequencies while still respecting the dependencies induced by LD. We introduce a
method for training such models using regularized likelihood functions to prevent overfitting,
along with a cross-validation method for selecting the regularization parameter that accounts
for the high probability of out-of-sample haplotypes not accommodated by the model. When
applied to dense single nucleotide polymorphism (SNP) markers from population data, our
method obtains a better-fitting and more parsimonious model than the leading method.
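
As a rough illustration of how a VOMC represented by a DAG assigns a probability to a haplotype, the factorization can be sketched as follows. The notation is an assumption made for this sketch and is not taken from the text: h_m denotes the allele at marker m, M the number of markers, and c_{m-1} a hypothetical map that sends a partial haplotype to its cluster (a node of the DAG at level m-1).

```latex
\[
  P(h_1, \dots, h_M)
    = \prod_{m=1}^{M}
      P\bigl(h_m \mid c_{m-1}(h_1, \dots, h_{m-1})\bigr)
\]
% Assumption: c_0 maps the empty haplotype to the root node of the DAG.
% Partial haplotypes merged into the same node share one conditional
% distribution for the next allele, so the effective order of the chain
% varies with position and context.
```

Under this reading, each merge of two partial haplotypes into one node removes a set of free parameters, which is how the clustering can yield a more parsimonious model.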

In addition, we note that these models represent a VOMC defined in a single direction
along the genome, which ignores the LD structure that could be represented by conditional
independencies in the opposite direction. Therefore, fitting the model to the same data in the
reverse direction along the genome usually results in different haplotype frequency estimates,
which is an undesirable property for genomic models. We develop a method of reconciling
two DAG models fit in opposite directions along the genome that takes advantage of the
differing LD structure represented in both models to derive a new bidirectional model.

When trying to detect segments of identity by descent (IBD) among individuals, background
LD can be a source of noise that obfuscates haplotypic similarity due to recent
coancestry. Methods of IBD segment detection that do not account for LD can have a
high false positive rate. We introduce a method for IBD segment detection using a hidden
Markov model (HMM) that incorporates a DAG model in the hidden layer to adjust for LD.
Unlike similar methods, ours models the full set of 15 IBD states among the four chromosomes
of two individuals. When applied to simulated dense SNP marker data, our method
provides more accurate IBD segment detection than other leading methods.
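
One way to see where the count of 15 comes from is to treat each IBD state as a partition of the four chromosomes into groups that are mutually IBD. This is an assumption about how the states are defined, offered as a minimal sketch rather than as the method's implementation. The chromosome labels A1, A2, B1, B2 and the function name are hypothetical.

```python
from typing import Iterator, List


def set_partitions(items: List[str]) -> Iterator[List[List[str]]]:
    """Yield every partition of `items` into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Place `first` into each existing block in turn...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ...or give it a block of its own.
        yield [[first]] + partition


# Hypothetical labels: two chromosomes for each of individuals A and B.
chromosomes = ["A1", "A2", "B1", "B2"]
ibd_states = list(set_partitions(chromosomes))

# Chromosomes within a block are taken to be IBD with one another.
print(len(ibd_states))  # 15
```

The partition with four singleton blocks corresponds to no IBD among the chromosomes, and the single-block partition to all four being IBD.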