Methods For Detecting Copy Number Variation (CNV)
Despite the importance of CNVs, a systematic and standardized way of detecting them is still in the early phases of development. One of the main reasons for this is that identifying CNVs is a challenging task. For instance, CNVs often lie in highly repetitive regions of the genome, which makes it difficult to pinpoint their exact location (Li). Other difficulties include the varying robustness of different CNV-calling algorithms, differentiating between the mechanisms of CNV formation, and the different types of CNVs (such as deletions, insertions, duplications, and inversions) (Li). Quantitative PCR, paralog-ratio testing, and molecular copy number counting are examples of established methods used to validate or replicate CNVs at single or multiple targeted loci (Li). However, to identify CNVs genome-wide, more powerful, high-throughput platforms are required (Li). For these reasons, high-resolution array platforms, originally used for SNP detection and gene expression studies, have been adapted to detect CNVs (Li). Comparative genomic hybridization (CGH), with platforms developed by NimbleGen and Agilent, was the first array-based method for detecting CNVs (Li). With CGH, target DNA and reference DNA are labelled with different fluorescent dyes (Li).
The fluorescence ratio of the two dyes is measured along the chromosomal region of interest, making it possible to see, via in situ hybridization, the gain or loss of fluorescence of the target in comparison to the reference. The probes are cDNA or long synthetic oligonucleotides, used to investigate genome-wide distributions or unique and repetitive regions. Although CGH provides high sensitivity and specificity, its drawbacks include low throughput and a spatial resolution of 5-10 Mb, which is relatively coarse. SNP arrays, such as those provided by Affymetrix and Illumina, “use short base-pair sequences to capture fragments of DNA and to infer copy number based on hybridization intensities without the reference sample being co-hybridized to a target sample” (Li). This allows for simultaneous SNP genotyping, and the amount of sample DNA required is significantly lower than for CGH. The downside of this detection method is that probe coverage is limited to the distribution of informative SNPs (Li). To improve CNV detection sensitivity and accuracy, some companies such as Agilent and NimbleGen have designed CNV-specific arrays, while Illumina and Affymetrix have updated their probe selection to include non-SNP sites. For example, the Affymetrix Genome-Wide Human SNP Array 6.0 chip has ~946k non-SNP probes in addition to its 907k SNP probes, of which 140k are CNV-specific.
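To make the intensity-ratio idea concrete, the following minimal Python sketch (hypothetical probe intensities; assumes only NumPy) computes per-probe log2 ratios of target versus reference fluorescence, which is the basic quantity that downstream array-CGH analyses operate on.

```python
import numpy as np

def cgh_log2_ratios(target_intensity, reference_intensity, pseudocount=1.0):
    """Per-probe log2(target/reference) fluorescence ratios for array CGH.

    Values near 0 suggest equal copy number, positive values a gain in the
    target sample, and negative values a loss. The pseudocount guards
    against division by zero for probes with very low signal.
    """
    target = np.asarray(target_intensity, dtype=float) + pseudocount
    reference = np.asarray(reference_intensity, dtype=float) + pseudocount
    return np.log2(target / reference)

# Toy example: probes 2-4 show roughly 2x signal in the target (a putative gain).
target = [510, 495, 1050, 980, 1100, 505, 490]
reference = [500, 500, 500, 500, 500, 500, 500]
print(np.round(cgh_log2_ratios(target, reference), 2))
```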
For the analysis of whole-genome CNV data, the main steps are normalization, probe-level modelling and segmentation, and association analysis. Normalization reduces extraneous sources of variation such as GC content, bias arising from differences in probe binding affinities, and spatial artifacts (Li). Probe-level modelling is performed at the single-locus and multi-locus levels. Single-locus modelling combines the probes that measure a single fragment to generate a raw copy number for that fragment (Li). Multi-locus modelling measures the copy number of an entire region; this is achieved by creating a meta-probe set from the raw copy numbers of neighbouring fragments or DNA probe loci (Li). For segmentation, log ratios of probe intensities are used to locate breakpoints, which separate neighbouring regions of differing copy number (Li). Finally, statistical algorithms are used in association analysis to look for possible effects of the CNV of interest (Li). CGH analysis software includes circular binary segmentation (CBS) and Gain and Loss Analysis of DNA (GLAD). In short, these programs look for stretches of high or low log ratios, which in turn reveal changes in copy number (Li).

With SNP array data, the normalized signal intensities of arbitrary alleles A and B will increase or decrease if there is a duplication or deletion, respectively. “For CNV estimation, a and b are transformed into R = a + b and θ = arctan(a/b)/(π/2), so that R measures the combined signal intensity of two alleles and θ measures the relative allelic intensity ratio; Log R ratio (LRR) is defined as log2(Robserved/Rexpected), in which Rexpected is measured from reference samples. B allele frequency (BAF) is the normalized measure of relative signal intensity ratio of two alleles” (Li). Some of the most widely used packages for CNV analysis from SNP arrays are Genotyping Console (Affymetrix) and BeadStudio (Illumina). Additional tools such as QuantiSNP, PennCNV, GenoCNV, and MixHMM are based on hidden Markov models in which hidden states represent the underlying copy number of probes (Li). Many more tools have been developed as well; however, they are not used as extensively as those mentioned above.
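As an illustration of the quoted transformation, the sketch below (hypothetical allele intensities; NumPy only) computes R, θ, and the Log R ratio from normalized A and B allele intensities. It follows the formula as quoted; note that some published descriptions of the Illumina pipeline define θ as arctan(B/A)/(π/2), and the BAF here is only a crude proxy, so this should be read as a sketch of the idea rather than the vendors' exact method.

```python
import numpy as np

def lrr_baf(a, b, r_expected):
    """Illustrative LRR/BAF-style summaries from normalized allele intensities.

    Follows the transformation quoted in the text:
        R     = a + b                     (combined signal intensity)
        theta = arctan(a / b) / (pi / 2)  (relative allelic intensity ratio;
                                           some references use arctan(b / a))
        LRR   = log2(R_observed / R_expected), with R_expected taken from
                reference samples of known copy number.

    The BAF returned here is a simple proxy, b / (a + b); production
    pipelines instead interpolate theta against canonical genotype clusters.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    r_observed = a + b
    theta = np.arctan2(a, b) / (np.pi / 2)
    lrr = np.log2(r_observed / np.asarray(r_expected, dtype=float))
    baf = b / r_observed
    return lrr, baf, theta

# Toy example: three SNPs with expected diploid intensity R_expected = 2.0.
a = [1.0, 2.0, 0.1]   # A-allele intensities (hypothetical)
b = [1.0, 1.9, 0.1]   # B-allele intensities (hypothetical)
lrr, baf, theta = lrr_baf(a, b, r_expected=[2.0, 2.0, 2.0])
print("LRR:", np.round(lrr, 2))   # ~0 (normal), >0 (gain), <0 (loss)
print("BAF:", np.round(baf, 2))
```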
With the decreasing cost of next-generation sequencing (NGS), CNV research through whole-genome sequencing (WGS) and whole-exome sequencing (WES) has become more accessible. An important advantage of using NGS data for CNV detection is that CNV breakpoints can be identified with greater accuracy (at the base-pair level), which can help in identifying the mutational mechanism and functional impact of a CNV (Li). The most common methods for CNV discovery are clone-based sequencing, split-read mapping, paired-end read mapping, read-depth analysis, and mated short-read analysis (Li). Each method has its advantages and disadvantages in terms of read length, sequencing coverage, and average span between read pairs, which affect its quality and accuracy; for this reason, many of these approaches are used hand in hand (Li). Please refer to Figure x for more details on the individual analyses. CNV calling using short-read alignments relies on depth of coverage (DOC) and/or breakpoints. DOC is positively correlated with copy number, which helps in the discovery of large CNVs, while breakpoints help detect SV boundaries and short indels. Most algorithms use only one of the two signals; however, a hidden Markov model developed by Shen and colleagues uses both to detect medium-sized deletions at low coverage. Other methods, such as AGE, use a Smith-Waterman algorithm with gap extension, while others, such as CNVnator, use a mean-shift approach.
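To illustrate the read-depth signal described above, the minimal sketch below (hypothetical binned read counts; NumPy only) flags windows whose depth deviates strongly from the genome-wide expectation as candidate gains or losses. This is only the basic idea that read-depth callers such as CNVnator build on; real tools add GC correction, segmentation, and statistical testing.

```python
import numpy as np

def flag_doc_candidates(window_counts, log2_threshold=0.6):
    """Flag candidate CNV windows from binned depth of coverage (DOC).

    Because DOC is roughly proportional to copy number, windows whose
    log2(observed / expected) depth exceeds the threshold are reported as
    putative gains, and those below -threshold as putative losses.
    """
    counts = np.asarray(window_counts, dtype=float)
    expected = np.median(counts)                      # genome-wide baseline
    log2_ratio = np.log2((counts + 1) / (expected + 1))
    gains = np.where(log2_ratio >= log2_threshold)[0]
    losses = np.where(log2_ratio <= -log2_threshold)[0]
    return log2_ratio, gains, losses

# Toy example: 1 kb windows; windows 3-4 look duplicated, window 7 deleted.
counts = [100, 98, 102, 205, 198, 101, 99, 45, 103, 100]
ratios, gains, losses = flag_doc_candidates(counts)
print("candidate gains at windows:", gains)    # -> [3 4]
print("candidate losses at windows:", losses)  # -> [7]
```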