Title: | Match Cases to Controls Based on Genotype Principal Components |
---|---|
Description: | Matches cases to controls based on genotype principal components (PC). In order to produce better results, matches are based on the weighted distance of PCs where the weights are equal to the % variance explained by that PC. A weighted Mahalanobis distance metric (Kidd et al. (1987) <DOI:10.1016/0031-3203(87)90066-5>) is used to determine matches. |
Authors: | Derek W. Brown [aut, cre] |
Maintainer: | Derek W. Brown <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.3.3 |
Built: | 2025-03-11 04:30:52 UTC |
Source: | https://github.com/machiela-lab/pcamatchr |
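The description above refers to a weighted Mahalanobis distance between individuals' PCs. The following base R sketch shows one common way to write such a distance, under the assumption that each PC difference is scaled by its weight before the usual Mahalanobis quadratic form is applied. The function name `weighted_mahalanobis`, the toy PC matrix, and the weights are hypothetical and are not taken from the package; the package's exact formulation may differ.

```r
# Hypothetical helper (not part of the package): weighted Mahalanobis distance
# between two PC vectors x and y, given an inverse covariance matrix S_inv and
# one weight per PC. This particular weighting scheme is an assumption.
weighted_mahalanobis <- function(x, y, S_inv, w) {
  d <- w * (x - y)                      # scale each PC difference by its weight
  sqrt(drop(t(d) %*% S_inv %*% d))      # Mahalanobis quadratic form
}

# Toy data: 20 individuals with two simulated PCs (made-up values)
set.seed(1)
pcs <- matrix(rnorm(40), ncol = 2, dimnames = list(NULL, c("PC1", "PC2")))

w <- c(0.8, 0.2)            # hypothetical weights, e.g. share of variance explained
S_inv <- solve(cov(pcs))    # inverse of the sample covariance of the PCs

# Distance between the first two individuals
d_weighted <- weighted_mahalanobis(pcs[1, ], pcs[2, ], S_inv, w)
d_weighted
```

With an identity covariance and unit weights the function reduces to the ordinary Euclidean distance, which is a quick way to sanity-check it.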
A sample dataset containing the first 20 eigenvalues calculated from 2504 individuals in the Phase 3 data release of the 1000 Genomes Project. The principal component analysis was conducted using PLINK.
eigenvalues_1000G
A data frame with 20 rows and 1 variable:
eigen_values: calculated eigenvalues
Machiela Lab
eigenvalues_1000G
genome_values <- eigenvalues_1000G
values <- c(genome_values)$eigen_values
A sample dataset containing all the eigenvalues calculated from 2504 individuals in the Phase 3 data release of the 1000 Genomes Project. The principal component analysis was conducted using PLINK.
eigenvalues_all_1000G
A data frame with 2504 rows and 1 variable:
eigen_values: calculated eigenvalues
Machiela Lab
eigenvalues_all_1000G
genome_values <- eigenvalues_all_1000G
values <- c(genome_values)$eigen_values
Weighted matching of controls to cases using PCA results.
match_maker(
  PC = NULL,
  eigen_value = NULL,
  data = NULL,
  ids = NULL,
  case_control = NULL,
  num_controls = 1,
  num_PCs = NULL,
  eigen_sum = NULL,
  exact_match = NULL,
  weight_dist = TRUE,
  weights = NULL
)
PC | Individual-level principal components. |
eigen_value | Computed eigenvalue for each PC. Used as the numerator to calculate the percent variance explained by each PC. |
data | Data frame containing id and case/control status. Optionally includes covariate data for exact matching. |
ids | The unique id variable contained in both "PC" and "data". |
case_control | The case/control status variable. |
num_controls | The number of controls to match to each case. Default is 1:1 matching. |
num_PCs | The total number of PCs calculated within the PCA. Can be used as the denominator to calculate the percent variance explained by each PC. Default is 1000. |
eigen_sum | The sum of all possible eigenvalues within the PCA. Can be used as the denominator to calculate the percent variance explained by each PC. |
exact_match | Optional variables contained in the data frame on which to perform exact matching (e.g. sex, race). |
weight_dist | When set to TRUE, matches are produced based on the PC-weighted Mahalanobis distance. Default is TRUE. |
weights | Optional user-defined weights used to compute the weighted Mahalanobis distance metric. |
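The arguments above describe the percent variance explained by each PC as an eigenvalue divided by a denominator. A minimal sketch of that arithmetic in base R, assuming the denominator is the sum of all eigenvalues; the numbers below are made up for illustration and the variable `pct_var` is hypothetical, not an object returned by the package.

```r
# Hypothetical eigenvalues for the first five PCs (illustrative values only)
eigen_value <- c(12.5, 6.1, 2.3, 1.8, 1.2)

# Hypothetical sum of all eigenvalues from the full PCA
eigen_sum <- 250

# Percent variance explained by each PC: eigenvalue / total * 100
pct_var <- eigen_value / eigen_sum * 100
names(pct_var) <- paste0("PC", seq_along(eigen_value))
pct_var
```

Because earlier PCs have larger eigenvalues, they receive larger weights and therefore contribute more to the distance used for matching.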
A list of matches and weights.
# Create PC data frame by subsetting provided example dataset
pcs <- as.data.frame(PCs_1000G[,c(1,5:24)])

# Create eigenvalues vector using example dataset
eigen_vals <- c(eigenvalues_1000G)$eigen_values

# Create full eigenvalues vector using example dataset
all_eigen_vals <- c(eigenvalues_all_1000G)$eigen_values

# Create covariate data frame
cov_data <- PCs_1000G[,c(1:4)]

# Generate a case status variable using ESN 1000 Genome population
cov_data$case <- ifelse(cov_data$pop=="ESN", c(1), c(0))

# With 1 to 1 matching
if(requireNamespace("optmatch", quietly = TRUE)){
  library(optmatch)
  match_maker(PC = pcs,
              eigen_value = eigen_vals,
              data = cov_data,
              ids = c("sample"),
              case_control = c("case"),
              num_controls = 1,
              eigen_sum = sum(all_eigen_vals),
              weight_dist = TRUE
  )
}
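To give a feel for what matching on weighted PC distance means, the base R sketch below pairs each case with its nearest unused control. It deliberately swaps in a simple greedy nearest-neighbour rule and a weighted Euclidean distance; it is not the matching procedure used by match_maker(), which relies on the optmatch package. All data values, the weights `w`, and the object names are hypothetical.

```r
# Toy data: 3 cases and 5 controls described by two PCs (made-up values)
pcs <- rbind(
  case1 = c(0.0, 0.0), case2 = c(1.0, 1.0), case3 = c(-1.0, 0.5),
  ctrl1 = c(0.1, 0.0), ctrl2 = c(0.9, 1.2), ctrl3 = c(-1.1, 0.4),
  ctrl4 = c(2.0, 2.0), ctrl5 = c(-2.0, -2.0)
)
is_case <- grepl("^case", rownames(pcs))
w <- c(0.7, 0.3)  # hypothetical per-PC weights

# Weighted Euclidean distances between every case (rows) and control (columns):
# scaling each column by sqrt(w) makes squared differences count w times
scaled <- sweep(pcs, 2, sqrt(w), `*`)
d <- as.matrix(dist(scaled))[is_case, !is_case, drop = FALSE]

# Greedy 1:1 matching: each case takes its nearest still-unused control
matches <- character(0)
available <- colnames(d)
for (case_id in rownames(d)) {
  best <- available[which.min(d[case_id, available])]
  matches[case_id] <- best
  available <- setdiff(available, best)
}
matches
```

A greedy rule depends on the order in which cases are processed, which is one reason an optimal matching routine is generally preferable for real analyses.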
A sample dataset containing information about population, gender, and the first 20 principal components calculated from 2504 individuals in the Phase 3 data release of the 1000 Genomes Project. The principal component analysis was conducted using PLINK.
PCs_1000G
A data frame with 2504 rows and 24 variables:
sample ID number
three letter designation of 1000 Genomes reference population
three letter designation of 1000 Genomes reference super population
gender of individual
principal component 1
principal component 2
principal component 3
principal component 4
principal component 5
principal component 6
principal component 7
principal component 8
principal component 9
principal component 10
principal component 11
principal component 12
principal component 13
principal component 14
principal component 15
principal component 16
principal component 17
principal component 18
principal component 19
principal component 20
https://www.internationalgenome.org
head(PCs_1000G)
genome_PC <- PCs_1000G

# Create PCs
PC <- as.data.frame(genome_PC[,c(1,5:24)])
head(PC)
Function to plot matches from match_maker output
plot_maker(
  data = NULL,
  x_var = NULL,
  y_var = NULL,
  case_control = NULL,
  line = T,
  ...
)
data | match_maker output |
x_var | Principal component 1 |
y_var | Principal component 2 |
case_control | Case or control status |
line | draw line |
... | Arguments passed to |
None
# run match_maker()

# Create PC data frame by subsetting provided example dataset
pcs <- as.data.frame(PCs_1000G[,c(1,5:24)])

# Create eigenvalues vector using example dataset
eigen_vals <- c(eigenvalues_1000G)$eigen_values

# Create full eigenvalues vector using example dataset
all_eigen_vals <- c(eigenvalues_all_1000G)$eigen_values

# Create covariate data frame
cov_data <- PCs_1000G[,c(1:4)]

# Generate a case status variable using ESN 1000 Genome population
cov_data$case <- ifelse(cov_data$pop=="ESN", c(1), c(0))

# With 1 to 1 matching
if(requireNamespace("optmatch", quietly = TRUE)){
  library(optmatch)
  match_maker_output <- match_maker(PC = pcs,
                                    eigen_value = eigen_vals,
                                    data = cov_data,
                                    ids = c("sample"),
                                    case_control = c("case"),
                                    num_controls = 1,
                                    eigen_sum = sum(all_eigen_vals),
                                    weight_dist = TRUE
  )

  # run plot_maker()
  plot_maker(data = match_maker_output,
             x_var = "PC1",
             y_var = "PC2",
             case_control = "case",
             line = TRUE)
}
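For readers who want to see the general idea of such a plot without running the package, the base R graphics sketch below draws cases and controls on two PCs and connects matched pairs with line segments. This is an illustration under assumptions: the data frame `matched`, its `pair` column, and the interpretation of the line as a link between a case and its matched control are hypothetical and do not describe plot_maker()'s internals.

```r
# Toy matched pairs (made-up PC values); 'pair' links a case to its control
matched <- data.frame(
  PC1  = c(0.0, 0.1, 1.0, 0.9),
  PC2  = c(0.0, 0.0, 1.0, 1.2),
  case = c(1, 0, 1, 0),
  pair = c(1, 1, 2, 2)
)

# Split into cases and controls, aligning controls to the order of the cases
cases    <- matched[matched$case == 1, ]
controls <- matched[matched$case == 0, ]
controls <- controls[match(cases$pair, controls$pair), ]

# Scatter plot of all individuals, coloured by case/control status
plot(matched$PC1, matched$PC2,
     col = ifelse(matched$case == 1, "red", "blue"),
     pch = 19, xlab = "PC1", ylab = "PC2")

# One segment per matched pair, from the case to its control
segments(cases$PC1, cases$PC2, controls$PC1, controls$PC2, col = "grey50")
legend("topleft", legend = c("case", "control"),
       col = c("red", "blue"), pch = 19)
```

Short segments indicate that a case and its control sit close together in PC space, which is the visual signature of a good match.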