| Title: | Risk Analysis of Genomic Copy Number Variation |
|---|---|
| Description: | Provides a complete seven-step workflow for copy number variation (CNV) analysis applicable to any disease or condition where samples with genomic copy number data is available. Supports built-in grading and risk stratification presets for seven major cancers (viz. prostate, breast, colorectal, lung, cervical, lymphoma, melanoma) based on clinically validated systems including ISUP Grade Groups, Nottingham Grading System, Dukes staging, IASLC TNM, FIGO, Ann Arbor/Lugano classification, and Breslow depth. Generalizable to other disease types. An automatic mode derives a normalised Risk Score from the data using min-max normalisation and adaptive binning. Custom user-defined thresholds are supported for any other disease type. Downstream functions for CNV aberration detection, recurrence analysis, gene annotation, CNV matrix generation, and CNV-RNA expression correlation are disease-type agnostic. |
| Authors: | Ashok Palaniappan [aut, cre] (ORCID: <https://orcid.org/0000-0003-2841-9527>), Priyanka Ramesh [aut], Ida Titus [aut], Sangeetha Muthamilselvan [aut] |
| Maintainer: | Ashok Palaniappan <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.0 |
| Built: | 2026-06-06 07:15:36 UTC |
| Source: | https://github.com/cran/RiskyCNV |
Reads a CNV (Copy Number Variation) data file and identifies genomic segments showing significant aberrations (gains or losses) based on a user-defined effect size threshold. Results are split by chromosome and returned as a named list.
aberration(cnv_data_file, effect_size = 0.3)aberration(cnv_data_file, effect_size = 0.3)
cnv_data_file |
Character. Path to the CNV data file
(whitespace-delimited, with a header). Must contain columns:
|
effect_size |
Numeric. Threshold for calling aberrations. Segments
with |
Segments with Segment_Mean between -effect_size and
effect_size (inclusive) are considered neutral and excluded from
the output. The default threshold of 0.3 is widely used in TCGA-based
CNV analyses. This function is cancer-type agnostic and can be applied
to CNV data from any solid tumour.
A named list where each element corresponds to a chromosome
(e.g., "1", "2", ...) and contains a data frame of
aberrant segments for that chromosome. Each data frame includes the
columns: Chromosome, Start, End,
Num_Probes, Segment_Mean, Sample,
Aberration (Gain or Loss), and Aberration_Code
(1 = Gain, 0 = Loss).
Mermel CH, et al. (2011). GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers. Genome Biol, 12(4):R41.
cnv_file <- system.file("extdata", "cnv_data.txt", package = "RiskyCNV") aberrations <- aberration( cnv_data_file = cnv_file, effect_size = 0.3 ) print(aberrations[["2"]])cnv_file <- system.file("extdata", "cnv_data.txt", package = "RiskyCNV") aberrations <- aberration( cnv_data_file = cnv_file, effect_size = 0.3 ) print(aberrations[["2"]])
Finds the overlap between a gene annotation file and a recurrent CNV file using genomic ranges, and annotates each CNV region with the corresponding gene symbol. Requires the GenomicRanges package.
annotate( genes_file, risk_file, output_dir = ".", seqnames_field_genes = "Chr", start_field_genes = "Start", end_field_genes = "End", gene_symbol_field = "GeneSymbol", seqnames_field_risk = "Chr", start_field_risk = "Start", end_field_risk = "End", sample_field = "Sample", segment_mean_field = "Segment_Mean" )annotate( genes_file, risk_file, output_dir = ".", seqnames_field_genes = "Chr", start_field_genes = "Start", end_field_genes = "End", gene_symbol_field = "GeneSymbol", seqnames_field_risk = "Chr", start_field_risk = "Start", end_field_risk = "End", sample_field = "Sample", segment_mean_field = "Segment_Mean" )
genes_file |
Character. Path to the gene annotation CSV file. Must contain chromosome, start, end, and gene symbol columns (see parameters below for defaults). |
risk_file |
Character. Path to the recurrent CNV CSV file (e.g.,
the file path returned by |
output_dir |
Character. Directory where the annotated CSV will be
saved. Default is the current directory ( |
seqnames_field_genes |
Character. Column name for chromosome in
the gene file. Default is |
start_field_genes |
Character. Column name for start position in
the gene file. Default is |
end_field_genes |
Character. Column name for end position in the
gene file. Default is |
gene_symbol_field |
Character. Column name for gene symbols in
the gene file. Default is |
seqnames_field_risk |
Character. Column name for chromosome in
the CNV file. Default is |
start_field_risk |
Character. Column name for start position in
the CNV file. Default is |
end_field_risk |
Character. Column name for end position in the
CNV file. Default is |
sample_field |
Character. Column name for sample IDs in the CNV
file. Default is |
segment_mean_field |
Character. Column name for segment mean
values in the CNV file. Default is |
This function uses GenomicRanges::findOverlaps with
type = "within" to find genes that fall entirely within each
CNV region. This function is cancer-type agnostic and can be applied
to CNV data from any solid tumour with a compatible gene annotation
reference file.
A data frame containing annotated CNV regions with columns:
Sample, GeneSymbol, Segment_Mean, Chr,
Start, End. The result is also written to a timestamped
CSV file in output_dir.
genes_file <- system.file("extdata", "gene_annotation.csv", package = "RiskyCNV") cnv_file <- system.file("extdata", "annotated_cnv.csv", package = "RiskyCNV") annotated <- annotate( genes_file = genes_file, risk_file = cnv_file, output_dir = tempdir() ) head(annotated)genes_file <- system.file("extdata", "gene_annotation.csv", package = "RiskyCNV") cnv_file <- system.file("extdata", "annotated_cnv.csv", package = "RiskyCNV") annotated <- annotate( genes_file = genes_file, risk_file = cnv_file, output_dir = tempdir() ) head(annotated)
Reads a CSV file containing sample metadata and assigns each sample to a risk category based on a specified scoring column. Supports built-in presets for seven major disease types, fully custom user-defined risk boundaries, or automatic classification using a normalised Risk Score derived from the data itself.
classify_risk( file_path, column_name, disease_type = "auto", n_groups = 3, score_min = NULL, score_max = NULL, risk_groups = NULL, output_dir = NULL )classify_risk( file_path, column_name, disease_type = "auto", n_groups = 3, score_min = NULL, score_max = NULL, risk_groups = NULL, output_dir = NULL )
file_path |
Character. Path to the input CSV file containing sample metadata. |
column_name |
Character. Name of the column containing the grading or staging score (e.g., Gleason score, Nottingham score, TNM stage). |
disease_type |
Character. Disease type for built-in preset risk
groupings. Supported values: |
n_groups |
Integer. Number of risk groups to create. Only used when
|
score_min |
Numeric or NULL. Minimum possible value of the score. If NULL (default), automatically detected from the data. |
score_max |
Numeric or NULL. Maximum possible value of the score. If NULL (default), automatically detected from the data. |
risk_groups |
Named list of functions. Required only when
|
output_dir |
Character or NULL. Directory to save the output CSV file. If NULL (default), output is saved in the same directory as the input file. |
When disease_type = "auto", the function computes a normalised
Risk Score for each sample using min-max normalisation:
The Risk Score ranges from 0 (lowest risk) to 1 (highest risk). Risk group boundaries are then determined automatically:
If the score distribution is approximately symmetric (skewness
between -0.5 and +0.5), equal-width boundaries are used, dividing
the 0-1 range into n_groups equal intervals.
If the score distribution is skewed (skewness outside -0.5 to +0.5), quantile-based boundaries are used, ensuring approximately equal numbers of samples per group.
The splitting method chosen is reported via a message. Risk group labels
are generated automatically based on n_groups.
Built-in presets use clinically validated risk stratification systems:
D'Amico classification (D'Amico et al., 1998): low_risk (<=6), intermediate_risk (7), high_risk (>=8).
Nottingham Prognostic Index (Galea et al., 1992): low_risk (3-5), intermediate_risk (6-7), high_risk (8-9).
Dukes-based risk (Dukes, 1932): low_risk (A), intermediate_risk (B/C), high_risk (D).
TNM stage-based (Goldstraw et al., 2016): low_risk (I), intermediate_risk (II/III), high_risk (IV).
FIGO stage-based (Bhatla et al., 2019): low_risk (I), intermediate_risk (II/III), high_risk (IV).
Ann Arbor/Lugano (Cheson et al., 2014): limited (I/II), advanced (III/IV).
Breslow depth (Breslow, 1970): low_risk (<=1.0mm), intermediate_risk (1.0-4.0mm), high_risk (>4.0mm).
A named list where each element corresponds to a risk group and contains the sample IDs belonging to that group. The number of elements matches the number of risk groups detected or specified.
D'Amico AV, et al. (1998). Biochemical outcome after radical prostatectomy. JAMA, 280(11):969-974.
Galea MH, et al. (1992). The Nottingham prognostic index. Breast Cancer Res Treat, 22(3):207-219.
Dukes CE. (1932). The classification of cancer of the rectum. J Pathol Bacteriol, 35:323-332.
Goldstraw P, et al. (2016). The IASLC Lung Cancer Staging Project. J Thorac Oncol, 11(1):39-51.
Bhatla N, et al. (2019). Revised FIGO staging for carcinoma of the cervix uteri. Int J Gynaecol Obstet, 145(1):129-135.
Cheson BD, et al. (2014). The Lugano Classification. J Clin Oncol, 32(27):3059-3068.
Breslow A. (1970). Thickness and depth of invasion in the prognosis of cutaneous melanoma. Ann Surg, 172(5):902-908.
# Auto mode - let the function decide risk grouping (any disease) sample_file <- system.file("extdata", "sample_data.csv", package = "RiskyCNV") result <- classify_risk( file_path = sample_file, column_name = "gleason_score", disease_type = "auto", n_groups = 3, output_dir = tempdir() ) print(names(result)) # Prostate cancer preset result_prostate <- classify_risk( file_path = sample_file, column_name = "gleason_score", disease_type = "prostate", output_dir = tempdir() ) print(result_prostate$low_risk) # Custom risk groups for any disease result_custom <- classify_risk( file_path = "samples.csv", column_name = "risk_score", disease_type = "custom", risk_groups = list( "low_risk" = function(x) x <= 5, "high_risk" = function(x) x > 5 ), output_dir = tempdir() )# Auto mode - let the function decide risk grouping (any disease) sample_file <- system.file("extdata", "sample_data.csv", package = "RiskyCNV") result <- classify_risk( file_path = sample_file, column_name = "gleason_score", disease_type = "auto", n_groups = 3, output_dir = tempdir() ) print(names(result)) # Prostate cancer preset result_prostate <- classify_risk( file_path = sample_file, column_name = "gleason_score", disease_type = "prostate", output_dir = tempdir() ) print(result_prostate$low_risk) # Custom risk groups for any disease result_custom <- classify_risk( file_path = "samples.csv", column_name = "risk_score", disease_type = "custom", risk_groups = list( "low_risk" = function(x) x <= 5, "high_risk" = function(x) x > 5 ), output_dir = tempdir() )
Computes Pearson correlations between CNV segment means and RNA expression values for each gene present in both datasets. RNA data is log2-transformed prior to analysis. Three result files are written: all correlations, those with p-value < 0.05, and those with both p-value < 0.05 and correlation coefficient > 0.8.
correlate_with_expr(cnv_file, rna_file)correlate_with_expr(cnv_file, rna_file)
cnv_file |
Character. Path to the CNV matrix CSV file (output of
|
rna_file |
Character. Path to the RNA expression CSV file. Rows are genes; first column is gene names; remaining columns are sample IDs (trimmed to 12 characters for TCGA-style matching). |
Sample IDs in the RNA file are trimmed to 12 characters to match
TCGA-style identifiers. Infinite values from log2(0) are
replaced with 0. Pearson correlation is computed using
stats::cor.test with use = "complete.obs". This
function is cancer-type agnostic.
A named list with three data frames:
All computed Pearson correlations with
columns gene, cor_val, p.value.
Subset where p.value < 0.05.
Subset where p.value < 0.05 AND
cor_val > 0.8.
Results are also written to CSV files in the temporary directory.
Chin L, et al. (2011). Making sense of cancer genomic data. Genes Dev, 25(6):534-555.
cnv_file <- system.file("extdata", "cnv_matrix.csv", package = "RiskyCNV") rna_file <- system.file("extdata", "rna_data.csv", package = "RiskyCNV") results <- correlate_with_expr( cnv_file = cnv_file, rna_file = rna_file ) head(results$all_correlations)cnv_file <- system.file("extdata", "cnv_matrix.csv", package = "RiskyCNV") rna_file <- system.file("extdata", "rna_data.csv", package = "RiskyCNV") results <- correlate_with_expr( cnv_file = cnv_file, rna_file = rna_file ) head(results$all_correlations)
Takes an annotated CNV CSV file (output of annotate) and
reshapes it into a wide-format matrix where rows are samples, columns
are gene symbols, and values are mean segment means. Duplicate
sample-gene combinations are resolved by taking the mean.
create_CNVMatrix(input_file)create_CNVMatrix(input_file)
input_file |
Character. Path to the input CSV file containing
columns |
Duplicate Sample-GeneSymbol combinations are summarised
by taking their mean Segment_Mean before pivoting, avoiding
list-column issues in the output. This function is cancer-type agnostic.
A data frame in wide format with samples as rows and gene
symbols as columns. Missing values are represented as NA.
The matrix is also saved as a timestamped CSV file in the temporary
directory.
annot_file <- system.file("extdata", "annotated_cnv.csv", package = "RiskyCNV") cnv_mat <- create_CNVMatrix(annot_file) dim(cnv_mat) head(cnv_mat)annot_file <- system.file("extdata", "annotated_cnv.csv", package = "RiskyCNV") cnv_mat <- create_CNVMatrix(annot_file) dim(cnv_mat) head(cnv_mat)
Reads a CSV file containing sample metadata and classifies each sample into grade or stage groups based on a specified scoring column. Supports built-in presets for seven major disease types, fully custom user-defined thresholds, or automatic classification using a normalised Risk Score derived from the data itself.
extract_metadata( file_path, column_name, disease_type = "auto", pattern_col = NULL, n_groups = 3, group_type = "grade", score_min = NULL, score_max = NULL, thresholds = NULL, output_dir = NULL )extract_metadata( file_path, column_name, disease_type = "auto", pattern_col = NULL, n_groups = 3, group_type = "grade", score_min = NULL, score_max = NULL, thresholds = NULL, output_dir = NULL )
file_path |
Character. Path to the input CSV file containing sample metadata. |
column_name |
Character. Name of the column containing the grading or staging score (e.g., Gleason score, Nottingham score, TNM stage). |
disease_type |
Character. Disease type for built-in preset thresholds.
Supported values: |
pattern_col |
Character or NULL. Only used when
|
n_groups |
Integer. Number of grade or stage groups to create. Only
used when |
group_type |
Character. Type of group labels to generate. Only used
when
|
score_min |
Numeric or NULL. Minimum possible value of the score. If NULL (default), automatically detected from the data. |
score_max |
Numeric or NULL. Maximum possible value of the score. If NULL (default), automatically detected from the data. |
thresholds |
Named list of functions. Required only when
|
output_dir |
Character or NULL. Directory to save the output CSV file. If NULL (default), output is saved in the same directory as the input file. |
For prostate cancer, an optional pattern_col parameter allows
accurate distinction between Grade Group 2 (Gleason 3+4=7) and Grade
Group 3 (Gleason 4+3=7) using the primary histological pattern column.
Prostate cancer Grade Group 2 vs Grade Group 3 distinction:
Both Grade Group 2 (Gleason 3+4=7) and Grade Group 3 (Gleason 4+3=7) have the same total Gleason score of 7, making them indistinguishable from the total score alone. The primary histological pattern determines the correct assignment:
Primary pattern 3 + secondary pattern 4 → Grade Group 2
Primary pattern 4 + secondary pattern 3 → Grade Group 3
Supply the name of the primary pattern column via pattern_col
(typically "pattern1") to enable this distinction. If
pattern_col is not supplied, all Gleason 7 samples are assigned
to Grade Group 2 and a message is shown.
Auto mode:
When disease_type = "auto", the function computes a normalised
Risk Score for each sample using min-max normalisation:
The Risk Score ranges from 0 (lowest risk) to 1 (highest risk). Group boundaries are determined automatically based on distribution skewness:
Symmetric distribution (skewness between -0.5 and +0.5): equal-width boundaries
Skewed distribution (skewness outside -0.5 to +0.5): quantile-based boundaries
A named list where each element corresponds to a grade or stage group and contains the sample IDs belonging to that group.
Epstein JI, et al. (2016). The 2014 ISUP Consensus Conference on Gleason Grading. Am J Surg Pathol, 40(2):244-252.
Elston CW & Ellis IO. (1991). Pathological prognostic factors in breast cancer. Histopathology, 19(5):403-410.
Dukes CE. (1932). The classification of cancer of the rectum. J Pathol Bacteriol, 35:323-332.
Goldstraw P, et al. (2016). The IASLC Lung Cancer Staging Project. J Thorac Oncol, 11(1):39-51.
Bhatla N, et al. (2019). Revised FIGO staging for carcinoma of the cervix uteri. Int J Gynaecol Obstet, 145(1):129-135.
Cheson BD, et al. (2014). The Lugano Classification. J Clin Oncol, 32(27):3059-3068.
Breslow A. (1970). Thickness and depth of invasion in the prognosis of cutaneous melanoma. Ann Surg, 172(5):902-908.
sample_file <- system.file("extdata", "sample_data.csv", package = "RiskyCNV") # Prostate preset — without pattern column (Grade Group 2 and 3 merged) result <- extract_metadata( file_path = sample_file, column_name = "gleason_score", disease_type = "prostate", output_dir = tempdir() ) print(names(result)) # Prostate preset — with pattern column (Grade Group 2 and 3 distinguished) result_full <- extract_metadata( file_path = sample_file, column_name = "gleason_score", disease_type = "prostate", pattern_col = "pattern1", output_dir = tempdir() ) print(names(result_full)) # Auto mode result_auto <- extract_metadata( file_path = sample_file, column_name = "gleason_score", disease_type = "auto", n_groups = 3, group_type = "grade", output_dir = tempdir() ) print(names(result_auto)) # Custom thresholds result_custom <- extract_metadata( file_path = sample_file, column_name = "gleason_score", disease_type = "custom", thresholds = list( "Stage I" = function(x) x <= 6, "Stage II" = function(x) x == 7, "Stage III" = function(x) x == 8, "Stage IV" = function(x) x > 8 ), output_dir = tempdir() ) print(names(result_custom))sample_file <- system.file("extdata", "sample_data.csv", package = "RiskyCNV") # Prostate preset — without pattern column (Grade Group 2 and 3 merged) result <- extract_metadata( file_path = sample_file, column_name = "gleason_score", disease_type = "prostate", output_dir = tempdir() ) print(names(result)) # Prostate preset — with pattern column (Grade Group 2 and 3 distinguished) result_full <- extract_metadata( file_path = sample_file, column_name = "gleason_score", disease_type = "prostate", pattern_col = "pattern1", output_dir = tempdir() ) print(names(result_full)) # Auto mode result_auto <- extract_metadata( file_path = sample_file, column_name = "gleason_score", disease_type = "auto", n_groups = 3, group_type = "grade", output_dir = tempdir() ) print(names(result_auto)) # Custom thresholds result_custom <- extract_metadata( file_path = sample_file, column_name = "gleason_score", disease_type = "custom", thresholds = list( "Stage I" = function(x) x <= 6, "Stage II" = function(x) x == 7, "Stage III" = function(x) x == 8, "Stage IV" = function(x) x > 8 ), output_dir = tempdir() ) print(names(result_custom))
Filters a CNV data file for samples belonging to a specified risk group and identifies genomic regions that recur across multiple samples above a given threshold. Results are saved as a CSV file.
recurrent(x, risk_level, cnv_data_file, threshold = 2)recurrent(x, risk_level, cnv_data_file, threshold = 2)
x |
A named list of sample ID vectors, as returned by
|
risk_level |
Character. The risk group to analyse. Must be a name
present in |
cnv_data_file |
Character. Path to the CNV data file
(whitespace-delimited, with a header). Must contain columns:
|
threshold |
Numeric. Minimum number of samples a CNV region must
appear in to be considered recurrent. Default is |
Sample IDs in the CNV file are trimmed to 12 characters and hyphens are
replaced with dots to match standard TCGA-style identifiers. The output
CSV is saved inside a timestamped subdirectory under recurrent_cnv/
in the temporary directory. This function is cancer-type agnostic.
Character. The file path of the saved CSV file containing the recurrent CNV regions for the specified risk group.
sample_file <- system.file("extdata", "sample_data.csv", package = "RiskyCNV") cnv_file <- system.file("extdata", "cnv_data.txt", package = "RiskyCNV") risk_result <- classify_risk( file_path = sample_file, column_name = "gleason_score", disease_type = "prostate", output_dir = tempdir() ) output_path <- recurrent( x = risk_result, risk_level = "low_risk", cnv_data_file = cnv_file, threshold = 2 ) print(output_path)sample_file <- system.file("extdata", "sample_data.csv", package = "RiskyCNV") cnv_file <- system.file("extdata", "cnv_data.txt", package = "RiskyCNV") risk_result <- classify_risk( file_path = sample_file, column_name = "gleason_score", disease_type = "prostate", output_dir = tempdir() ) output_path <- recurrent( x = risk_result, risk_level = "low_risk", cnv_data_file = cnv_file, threshold = 2 ) print(output_path)
Provides a complete seven-step workflow for copy number variation (CNV) analysis applicable to any disease or condition where genomic copy number data is available. The package supports three classification approaches: built-in grading and risk stratification presets for seven major disease types (prostate, breast, colorectal, lung, cervical, lymphoma, melanoma) based on clinically validated scoring systems; an automatic mode that derives a normalised Risk Score from the data itself using min-max normalisation and adaptive binning; and fully user-defined custom thresholds for any disease type not covered by the presets. Downstream functions for CNV aberration detection, recurrence analysis, gene annotation, CNV matrix generation, and CNV-RNA expression correlation are disease-type agnostic and work with any genomic dataset.
The recommended analysis pipeline proceeds in seven steps:
extract_metadata — Classify samples into grade or
stage groups based on a clinical scoring parameter. Supports
disease-specific presets, auto mode, and custom thresholds.
classify_risk — Stratify samples into risk
categories. Supports disease-specific presets, auto mode, and
custom thresholds.
aberration — Detect CNV gains and losses from
segmented CNV data using a user-defined effect size threshold.
recurrent — Identify CNV regions that recur
across multiple samples within a risk group.
annotate — Annotate recurrent CNV regions with
gene symbols using genomic range overlaps.
create_CNVMatrix — Construct a sample x gene
CNV expression matrix.
correlate_with_expr — Compute Pearson correlations
between CNV profiles and RNA expression data.
Built-in clinically validated thresholds for prostate (ISUP/Gleason), breast (Nottingham), colorectal (Dukes), lung (IASLC/TNM), cervical (FIGO), lymphoma (Ann Arbor/Lugano), and melanoma (Breslow depth).
Automatically computes a normalised Risk Score using min-max normalisation. Group boundaries are determined adaptively based on distribution skewness — equal-width for symmetric distributions, quantile-based for skewed distributions. Works for any numeric scoring column without prior knowledge of the scoring system.
Users supply their own threshold functions for any disease type, scoring system, or number of groups.
prostate (ISUP/Gleason), breast (Nottingham/NGS), colorectal (Dukes), lung (IASLC/TNM), cervical (FIGO), lymphoma (Ann Arbor/Lugano), and melanoma (Breslow depth).
Maintainer: Ashok Palaniappan [email protected] (ORCID)
Authors:
Priyanka Ramesh
Ida Titus
Sangeetha Muthamilselvan
Epstein JI, et al. (2016). Am J Surg Pathol, 40(2):244-252.
Elston CW & Ellis IO. (1991). Histopathology, 19(5):403-410.
Dukes CE. (1932). J Pathol Bacteriol, 35:323-332.
Goldstraw P, et al. (2016). J Thorac Oncol, 11(1):39-51.
Bhatla N, et al. (2019). Int J Gynaecol Obstet, 145(1):129-135.
Cheson BD, et al. (2014). J Clin Oncol, 32(27):3059-3068.
Breslow A. (1970). Ann Surg, 172(5):902-908.