defineClonesScoper - Assigning Ig sequences into clonal groups


The defineClonesScoper function provides a computational pipline for assigning Ig sequences into clonal groups sharing same V gene, J gene, and junction length.


defineClonesScoper(db, model = c("identical", "hierarchical",
"spectral"), method = c("nt", "aa", "single", "average", "complete",
"novj", "vj"), germline_col = "GERMLINE_IMGT",
sequence_col = "SEQUENCE_IMGT", junction_col = "JUNCTION",
v_call_col = "V_CALL", j_call_col = "J_CALL",
clone_col = c("clone_id", "CLONE"), targeting_model = NULL,
len_limit = NULL, first = FALSE, cdr3 = FALSE, mod3 = FALSE,
max_n = NULL, threshold = NULL, base_sim = 0.95, iter_max = 1000,
nstart = 1000, nproc = 1, verbose = FALSE, log_verbose = FALSE,
out_dir = ".", summerize_clones = FALSE)


data.frame containing sequence data.
one of the "identical", "hierarchical", or "spectral". See Details for description.
one of the "nt", "aa", "single", "average", "complete", "novj", or "vj". See Details for description.
character name of the column containing the germline or reference sequence.
character name of the column containing input sequences.
character name of the column containing junction sequences. Also used to determine sequence length for grouping.
character name of the column containing the V-segment allele calls.
character name of the column containing the J-segment allele calls.
one of the "CLONE" or "clone_id" for the output column name containing the clone ids.
TargetingModel object. Only applicable if model = "spectral" and method = "vj". See Details for description.
IMGT_V object defining the regions and boundaries of the Ig sequences. If NULL, mutations are counted for entire sequence. Only applicable if model = "spectral" and method = "vj".
specifies how to handle multiple V(D)J assignments for initial grouping. If TRUE only the first call of the gene assignments is used. If FALSE the union of ambiguous gene assignments is used to group all sequences with any overlapping gene calls.
if TRUE removes 3 nts from both ends of "junction_col" (converts IMGT junction to CDR3 region). if TRUE remove junction_col(s) with length less than 7 nts.
if TRUE removes junction_col(s) with number of nucleotides not modulus of 3.
The maximum number of N’s to permit in the junction sequence before excluding the record from clonal assignment. Note, under model "hierarchical" and method "single" non-informative positions can create artifactual links between unrelated sequences. Use with caution. Default is set to be "NULL" for no action.
the distance threshold for clonal grouping if model = "hierarchical"; or the upper-limit cut-off if model = "spectral".
required similarity cut-off for sequences in equal distances from each other. Only applicable if model = "spectral".
the maximum number of iterations allowed for kmean clustering step.
the number of random sets chosen for kmean clustering initialization.
number of cores to distribute the function over.
if TRUE report a summary of each step cloning process; if FALSE process cloning silently.
if TRUE write verbose logging to a file in out_dir.
specify the output directory to save log_verbose. The input file directory is used if this is not specified.
if TRUE performs a series of analysis to assess the clonal landscape. See Value for description.


For summerize_clones = FALSE, a modified data.frame with clone identifiers in the clone_col column. For summerize_clones = TRUE returns a list containing:

  • db: modified db data.frame with clone identifiers in the clone_col column.
  • vjl_group_summ: data.frame of clones summary, e.g. size, V-gene, J-gene, junction lentgh, and so on.
  • inter_intra: data.frame containing minimum inter (between) and maximum intra (within) clonal distances.
  • eff_threshold: effective cut-off separating the inter (between) and intra (within) clonal distances.
  • plot_inter_intra: ggplot histogram of inter (between) versus intra (within) clonal distances. The effective threshold is shown with a horizental dashed-line.

If log_verbose = TRUE, it will write verbose logging to a file in the current directory or the specified out_dir.


defineClonesScoper provides a computational platform to explore the B cell clonal relationships in high-throughput Adaptive Immune Receptor Repertoire sequencing (AIRR-seq) data sets. Three models are included which perform clustering among sequences of B cell receptors (BCRs, also referred to as Immunoglobulins, (Igs)) that share the same V gene, J gene and junction length:

  • model = "identical": defines clones among identical junctions. Available method(s) are: (1) "nt" (nucleotide based clustering) and (2) "aa" (amino acid based clustering).
  • model = "hierarchical": hierarchical clustering-based method for partitioning sequences into clones. Availabe agglomeration method(s) are: (1) "single", (2) "average", and (3) "complete". The fixed threshold (a numeric scalar where the tree should be cut) must be provided.
  • model = "spectral": provides an unsupervised pipline for assigning Ig sequences into clonal groups. If method = "novj", clonal relationships are inferred using an adaptive threshold that indicates the level of similarity among junction sequences in a local neighborhood. If method = "vj": clonal relationships are inferred not only based on the junction region homology, but also takes into account the mutation profiles in the V and J segments. germline_col and sequence_col must be provided. Mutation counts are determined by comparing the input sequences (in the column specified by sequence_col) to the effective germline sequence (calculated from sequences in the column specified by germline_col). Not mandatory, but the influence of SHM hot- and cold-spot biases in the clonal inference process will be noted if a SHM targeting model is provided through argument targeting_model (see createTargetingModel for more technical details).


results <- defineClonesScoper(ExampleDb, 
model="hierarchical", method="single", 
threshold=0.15, summerize_clones=TRUE)