blood samples or stored blood samples (or extracted DNA or RNA) made available by a biorepository. Depending on the goal of the study and its design, genome‐wide genotyping, sequencing, gene expression, or epigenetic analysis may be performed on these samples. Some studies may be able to re‐use stored genotype or sequence data available from public repositories (such as dbGaP [https://www.ncbi.nlm.nih.gov/gap] or the European Genome‐phenome Archive [https://www.ebi.ac.uk/ega/home]) or from prior studies of the sample being used. The technologies and approaches to these molecular experiments are covered in Chapter 10. In each case, it is important to formulate a quality control plan to detect potential laboratory errors such as sample switches, failed genotyping probes, sequencing errors, and batch effects. When possible, coordinating laboratory analysis with initial analytic quality control is optimal for finding and correcting such errors. If archived genomic data are being used, careful review of the initial quality control protocols and further checks (when possible) in the subsequent analysis are recommended.
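As one illustration of what an item in such a quality control plan might look like, the following minimal Python sketch flags genotyping probes and samples with low call rates. The data, the identifiers (`rsA`, `S1`, and so on), the function names, and the thresholds are all hypothetical and chosen only to keep the example small; an actual study would set its thresholds in its own quality control protocol and would typically use dedicated software.

```python
# Hypothetical toy data: rows are samples, columns are SNPs.
# Genotypes are coded as allele counts (0, 1, 2); None marks a failed call.

def call_rates(genotypes):
    """Return (per_sample, per_snp) call rates for a rectangular genotype matrix."""
    n_samples = len(genotypes)
    n_snps = len(genotypes[0])
    per_sample = [sum(g is not None for g in row) / n_snps for row in genotypes]
    per_snp = [
        sum(row[j] is not None for row in genotypes) / n_samples
        for j in range(n_snps)
    ]
    return per_sample, per_snp


def flag_low_call_rate(rates, labels, threshold):
    """Return the labels whose call rate falls strictly below the threshold."""
    return [label for label, rate in zip(labels, rates) if rate < threshold]


samples = ["S1", "S2", "S3", "S4"]
snps = ["rsA", "rsB", "rsC"]
genotypes = [
    [0, 1, 2],
    [1, None, 0],
    [0, None, 1],
    [2, None, None],
]

per_sample, per_snp = call_rates(genotypes)

# Illustrative thresholds only (assumptions, not recommended values).
failed_snps = flag_low_call_rate(per_snp, snps, threshold=0.75)
failed_samples = flag_low_call_rate(per_sample, samples, threshold=0.5)

print("Probes flagged:", failed_snps)
print("Samples flagged:", failed_samples)
```

A check of this kind addresses only failed probes and poor-quality samples; detecting sample switches or batch effects would require additional comparisons (for example, against recorded sex or across genotyping batches).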
Statistical Analysis
The analysis of genetic and phenotypic data for a complex trait is multifaceted and depends on the research question, study design, genomic data available, and phenotypic characteristics. Methods to analyze these data are under constant development, and new approaches are continuously being released. Therefore, the analytic strategy for a genomic study must be reviewed periodically and revised if necessary to take advantage of newly developed approaches. Depending on the study design, the analytic plan may include linkage analysis (Chapter 6) in families or association studies in families or population samples (Chapters 8 and 9). These approaches are not mutually exclusive – a design may start with a linkage analysis of large families followed by association analysis within regions of linkage. Similarly, other multi‐stage studies conduct a GWAS of individual SNPs (Chapter 9) and then incorporate gene–gene and gene–environment interactions to identify additional genetic loci. Additionally, “data mining” approaches may be applied to these datasets to extract even more genetic information using data reduction techniques, set‐based tests, and pathway analyses. These more complex analyses are discussed in detail in Chapter 11.
Bioinformatics
The large amount of information generated by any genomic study of a complex trait requires careful attention to quality control, efficient and secure storage, and compliance with data‐sharing requirements and privacy protections. These activities require a well‐designed and secure database system. Such systems have evolved over time from text files to relational databases, to large‐scale “data warehouses.” Such datasets also require large‐scale processing power with ample attached storage to facilitate linkage and association studies. High‐throughput sequencing in particular requires a large amount of storage and computational power for genome alignment (or assembly) and base calling. For multi‐site studies, these resources may need to be accessible from multiple locations, with levels of access and security that depend on each user’s role in the study and need to access other sites’ information. In addition to maintaining local resources for a study, a bioinformatics team also must be familiar with many different public sources of genomic data (e.g. UCSC and Ensembl browsers, ENCODE databases, sequence repositories, dbGaP) and be able to submit results to public repositories for sharing with the wider research community. These issues are discussed in more detail in Chapter 7.
Follow‐up
Variant Detection
Once a single gene (or region) is implicated by a screen (linkage or association), it is necessary to examine it for potentially functional variations that might explain the linkage or association signal. For positional cloning efforts, this generally consisted of sequencing the minimum candidate region and identifying mutations that segregated with the trait in families. For complex traits, this effort is more difficult, and the variant being sought may be a more common, yet functional, polymorphism. Several strategies, including haplotype analysis, conditional analysis, and exhaustive sequencing, may be used in this case. The analyses required for such efforts are discussed in Chapters 8 and 9. However, statistical analysis of a single dataset only goes so far to establish a trait‐associated variant. Additional studies, including replication in independent datasets and functional studies in cellular and animal models, may be required to ultimately determine if a variant influences the biology underlying the complex trait.
Replication
The literature on most complex traits is at this point littered with initial reports of allelic or genotypic associations that cannot be replicated at all (or are replicated in a small minority of studies). Reproducibility of findings in independent samples is a critical characteristic most investigators seek when weighing the evidence for a trait‐associated variant. Because of this, most studies (particularly those seeking government or foundation funding) now include a plan for replication of findings in a second dataset. These replication datasets should be independent of the initial finding (e.g. do not overlap with the discovery dataset) and be assessed in similar fashion (e.g. phenotype definitions agree, ascertainment is similar, genetic analysis is comparable). This does not mean that the datasets must be from the same population – indeed, demonstrating replication across populations (e.g. European, Asian, and African) for a common complex trait locus may add strength to the study. However, for rare variants, cross‐population replication might be more difficult (due to population‐specific alleles); for such studies, replication in a second sample from the same population would be desirable.
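Two of the requirements above, independence of the datasets and agreement of the findings, can be checked mechanically. The following Python sketch is a simplified illustration under stated assumptions: the sample identifiers, effect estimates, and function names are hypothetical, and the replication criterion used here (same direction of effect and a replication p-value below a chosen significance level) is one common convention rather than a rule given in this chapter.

```python
def overlapping_samples(discovery_ids, replication_ids):
    """Return sample identifiers present in both datasets (ideally none)."""
    return sorted(set(discovery_ids) & set(replication_ids))


def replicates(discovery_beta, replication_beta, replication_p, alpha=0.05):
    """Simplified criterion (an assumption): the effect has the same sign in
    both datasets and the replication p-value is below alpha."""
    same_direction = discovery_beta * replication_beta > 0
    return same_direction and replication_p < alpha


# Hypothetical identifiers and association results for a single variant.
discovery_ids = ["D001", "D002", "D003", "X100"]
replication_ids = ["R001", "R002", "X100"]

shared = overlapping_samples(discovery_ids, replication_ids)
print("Samples in both datasets:", shared)

consistent = replicates(discovery_beta=0.30, replication_beta=0.22, replication_p=0.01)
opposite = replicates(discovery_beta=0.30, replication_beta=-0.22, replication_p=0.01)
print("Same direction, significant:", consistent)
print("Opposite direction, significant:", opposite)
```

Any shared identifiers would need to be removed from one dataset before the replication analysis. Comparability of phenotype definitions and ascertainment cannot be verified this way and still requires review of the study protocols.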
Functional Studies
While most disease gene discovery efforts have claimed success based on finding variants that segregate with traits in pedigrees or polymorphisms significantly associated with the trait in population samples, this is, strictly, not sufficient evidence. More conclusive is evidence arising from biological systems (e.g. cultured cells, animal models, or human blood and tissue samples) that the trait can be either induced by introduction of the allele or ameliorated by blocking the action of the allele. In genetically complex traits, where the responsible variation may be a common polymorphism, it is even more critical that such evidence be found before success is declared.
Tests in biological systems can be of several types. Perhaps the most common is to test the action of the gene in a model organism, such as mouse, zebrafish, or fruit fly. With transgenic models, the proposed trait‐associated variant is introduced into the germline of the organism and the resulting offspring are examined for evidence of the abnormal phenotype. With knockout models, the action of the gene in question is eliminated and the offspring are examined for evidence of an abnormal phenotype. Similar experiments can be performed in cultured cells, where the introduction of the variant (or gene knockout) is easier. However, finding the appropriate cell line and determining the appropriate cellular phenotype corresponding to the trait may be difficult. Recent advances in generating relevant cellular models have utilized induced pluripotent stem cell (iPSC) technology, by which cells (blood, fibroblast) from an individual with a phenotype and genotype of interest can be reprogrammed and differentiated to a cell type of interest (such as neuron or retinal pigment epithelium). Such cells might be closer to the affected tissue type and have more recognizable phenotypes due to the genetic variant under study. A further advance incorporates gene editing technology (e.g. CRISPR/Cas9) into the approach, whereby an established iPSC line can be edited to introduce (or correct) a variant of interest.