PLoS One. 2011;6(7):e21681.
doi: 10.1371/journal.pone.0021681. Epub 2011 Jul 8.

An evaluation protocol for subtype-specific breast cancer event prediction

Herman M J Sontrop et al. PLoS One. 2011.

Abstract

In recent years, increasing evidence has appeared that breast cancer may not constitute a single disease at the molecular level, but rather comprises a heterogeneous set of subtypes. This suggests that, instead of building a single monolithic predictor, better predictors might be constructed that solely target samples of a designated subtype, which are believed to represent more homogeneous sets of samples. An unavoidable drawback of developing subtype-specific predictors, however, is that stratification by subtype drastically reduces the number of samples available for their construction. As numerous studies have indicated that sample size is an important factor in predictor construction, it is questionable whether the potential benefit of subtyping can outweigh the drawback of a severe loss in sample size. Factors such as unequal class distributions and differences in the number of samples per subtype further complicate comparisons. We present a novel experimental protocol that facilitates a comprehensive comparison between subtype-specific predictors and predictors that do not take subtype information into account. Emphasis lies on careful control of sample size as well as class and subtype distributions. The methodology is applied to a large breast cancer compendium involving over 1500 arrays, using a state-of-the-art subtyping scheme. We show that the resulting subtype-specific predictors outperform those that do not take subtype information into account, especially when sample size considerations are taken into account.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Conceptual overview of the stratification protocol.
1) Toy sample set comprising three subtypes (blue, red and green); lighter (darker) shades indicate positive (negative) cases. 2) Stratified split (by class label and subtype) of the data into a training set and a validation set. Various partitions are created for each set separately. The yellow dashed line illustrates the strict separation of training (top) and validation (bottom) parts. 3) The most refined partition involves a single subtype per part. The typed version (tp) partitions the training set into parts stratified by class label and subtype. The untyped (un) counterpart involves parts stratified by class label only; however, each untyped part contains the same number of positive and negative training samples as its typed counterpart. Here lighter (darker) open circles represent positive (negative) cases. Alternative partitions can be constructed by pooling some or all of the initial parts, as depicted in 4) and 5). On each training part a separate predictor is constructed, which is evaluated on a specific set of validation samples. Note that paired typed and untyped predictors are evaluated on the same set of validation samples. 5) A special case in which the typed and untyped training sets are identical and equal the overall training set. This set is used to construct the baseline predictor. The untyped predictors associated with partitions 1 and 2 represent down-scaled versions of the baseline and serve to assess the influence of sample size.
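The pairing of typed and untyped training parts described in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy counts, the 50/50 split fraction, and the helper names `make_toy`, `stratified_split`, `typed_part` and `untyped_part` are all assumptions made for illustration, and the untyped part is drawn here by simple random sampling within each class.

```python
import random
from collections import Counter, defaultdict

def make_toy():
    """Hypothetical toy set: three subtypes with (positive, negative) counts."""
    counts = {"blue": (6, 10), "red": (4, 8), "green": (2, 6)}
    samples = []
    for subtype, (n_pos, n_neg) in counts.items():
        for label, n in ((1, n_pos), (0, n_neg)):
            for _ in range(n):
                samples.append({"id": len(samples), "subtype": subtype, "label": label})
    return samples

def stratified_split(samples, train_frac=0.5, seed=0):
    """Split into training and validation sets, stratified by (subtype, label)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for s in samples:
        strata[(s["subtype"], s["label"])].append(s)
    train, valid = [], []
    for key in sorted(strata):
        group = strata[key][:]
        rng.shuffle(group)
        cut = round(len(group) * train_frac)
        train.extend(group[:cut])
        valid.extend(group[cut:])
    return train, valid

def typed_part(train, subtypes):
    """Typed part: only training samples of the given subtype(s)."""
    return [s for s in train if s["subtype"] in subtypes]

def untyped_part(train, typed, seed=0):
    """Untyped counterpart: ignores subtype but matches the typed part's
    number of positive and negative training samples (assumed sampling scheme)."""
    rng = random.Random(seed)
    wanted = Counter(s["label"] for s in typed)
    part = []
    for label in sorted(wanted):
        pool = [s for s in train if s["label"] == label]
        part.extend(rng.sample(pool, wanted[label]))
    return part

train, valid = stratified_split(make_toy())
typed = typed_part(train, {"red"})
untyped = untyped_part(train, typed)
# Both predictors would be evaluated on the same validation samples.
valid_red = [s for s in valid if s["subtype"] == "red"]
print(Counter(s["label"] for s in typed), Counter(s["label"] for s in untyped))
```

Because the untyped part has the same size and class balance as the typed part, any performance difference between the two resulting predictors can be attributed to subtype information rather than to sample size or class distribution.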
Figure 2. Partitioning scheme.
The Hasse diagram depicts all possible partitions (grey ovals) of an example breast cancer subtype set whose four elements represent the subtypes lumA, lumB, Her2 and basal, respectively. White ovals indicate parts. The lines represent a move from one partition to another by either merging two parts (bottom to top) or splitting one part into two parts (top to bottom). The top layer depicts the coarsest partition, in which all elementary types have been pooled into a single part, making it essentially untyped. The bottom layer represents the most refined partition, i.e. one part for each elementary subtype. For each distinct part a separate predictor is constructed. The partition in the top layer is used for baseline predictor construction. In this example [formula], [formula], [formula] and [formula].
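The partitions in the Hasse diagram can be enumerated with a short recursive Python sketch. The function name `set_partitions` and the variable names are illustrative assumptions; only the four subtype names come from the caption.

```python
from collections import Counter

def set_partitions(items):
    """Yield every partition of `items` as a list of parts (each part a list)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Add `first` to each existing part in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or give `first` a part of its own.
        yield [[first]] + partition

SUBTYPES = ["lumA", "lumB", "Her2", "basal"]
partitions = list(set_partitions(SUBTYPES))

# Number of partitions per layer of the diagram (keyed by number of parts).
by_size = Counter(len(p) for p in partitions)

# Distinct parts across all partitions; one predictor is built per distinct part.
distinct_parts = {frozenset(part) for p in partitions for part in p}

print(len(partitions))       # 15
print(sorted(by_size.items()))  # [(1, 1), (2, 7), (3, 6), (4, 1)]
print(len(distinct_parts))   # 15
```

For four elementary subtypes this gives 15 partitions, arranged in layers of 1, 7, 6 and 1 partitions with one, two, three and four parts respectively. The number of distinct parts is also 15, since every non-empty subset of the four subtypes occurs as a part in some partition.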
Figure 3. Stratification toy example.
For a detailed explanation, see the running text.
Figure 4. Bird's eye view of evaluation protocol.
For additional details, see the running text. 1) Stratified split, with respect to class label and subtype, of the complete data set into a training set and a validation set. 2) Construction of the typed training sets. 3) Construction of the untyped training sets. 4) Baseline predictor construction. 5) Typed predictor construction. 6) Untyped predictor construction. 7) Stratification of the validation set by subtype. 8) Invoke the baseline predictor on the validation samples. 9) Invoke the typed predictors on their associated validation samples. 10) Invoke the matching untyped predictors on the same validation sets. Steps 1–10 are repeated for all folds. 11–13) Subtype-specific performance estimation based on the aggregated event predictions (over all folds) per subtype, as made by the baseline (11), typed (12), and untyped (13) predictors. 14–16) Overall performance estimation based on the aggregated event predictions over all folds made by the baseline (14), typed (15), and untyped (16) predictors.
Figure 5. Overall performance overview for all partitions.
Overview of overall performance for the 15 distinct partitions of the elementary subtype set, whose elements represent the subtypes lumA, lumB, Her2 and basal, respectively (Figure 2). The left panel corresponds to experiments involving the balanced compendia, while the right panel corresponds to experiments involving the full unbalanced compendium. In each panel the top numbers indicate the number of parts in each of the partitions, while the bottom line identifies the precise makeup of the various partitions, e.g. the notation B[formula]H[formula]La.Lb indicates a partition into three parts, with separate basal and Her2 groups and a combined luminal group. In each panel the coarsest partition is situated at the far left and corresponds to the baseline predictor (indicated in bold), that is, a single predictor that targets all samples. The most refined partition is situated at the far right and uses a separate predictor for each elementary subtype. A horizontal dotted line indicates the performance of the baseline predictors. Vertical dotted lines group the partitions by their number of parts, as indicated by the top numbers. Results represent averages over 100 repeats. Rows represent seven frequently used performance indicators: area under curve (auc), balanced accuracy (bar), sensitivity (sen), specificity (spc), accuracy (acc), positive predictive value (ppv) and negative predictive value (npv). Performance for typed predictors is indicated with a dot, performance for untyped predictors with a cross.
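The seven indicators listed in this caption can be computed from true labels and predicted scores as in the Python sketch below. It assumes the standard textbook definitions (balanced accuracy as the mean of sensitivity and specificity, AUC as the fraction of positive/negative pairs ranked correctly, ties counted as one half); the caption does not state the exact formulas used, and the function names, the 0.5 decision threshold and the example data are hypothetical.

```python
import math

def safe_div(num, den):
    """Return num / den, or NaN when the denominator is zero."""
    return num / den if den else math.nan

def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return safe_div(wins, len(pos) * len(neg))

def indicators(labels, scores, threshold=0.5):
    """Compute auc, bar, sen, spc, acc, ppv and npv (threshold is an assumption)."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    sen = safe_div(tp, tp + fn)
    spc = safe_div(tn, tn + fp)
    return {
        "auc": auc(labels, scores),
        "bar": (sen + spc) / 2,
        "sen": sen,
        "spc": spc,
        "acc": safe_div(tp + tn, tp + tn + fp + fn),
        "ppv": safe_div(tp, tp + fp),
        "npv": safe_div(tn, tn + fn),
    }

# Hypothetical aggregated predictions: 3 positive and 4 negative cases.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
result = indicators(labels, scores)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

With unequal class sizes, as in the unbalanced compendium, accuracy alone can be misleading, which is one reason to report balanced accuracy and AUC alongside it.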
