Integration of Multiple Data Types for Analysis of Cancer Cell Lines

  • Dmitriy Sonkin

    Student thesis: Doctoral Thesis


    Cancer cell lines play a critical part in oncology research. The advances in the understanding of cancer biology achieved over the last few decades would have been virtually impossible without cancer cell lines as research models. A better understanding of the molecular properties of such models is therefore a crucial element of cancer research. Recently, collaborations between the Sanger Institute and the Massachusetts General Hospital Cancer Center, and between the Broad Institute and the Novartis Institutes for BioMedical Research Inc., generated mRNA expression, copy number, microRNA expression, sequencing and compound sensitivity data for each cell line in collections covering almost a thousand of the available cancer cell lines. Such data provide a rich resource for gaining insights into tumor biology; however, they also highlight the need for additional approaches to the integrative analysis of multiple data types. The work presented in this thesis is a significant contribution to the efforts to make use of all available data per sample, and of existing biological knowledge, to the greatest possible extent. In particular, this work covers two research topics: tumor suppressor gene status and gene set activity analysis on a sample-by-sample basis.

    Tumor suppressor genes play a major role in the etiology of human cancer, and typically achieve a tumor-promoting effect upon complete functional inactivation. Bi-allelic inactivation of tumor suppressors may occur through genetic mechanisms (such as loss-of-function mutations or DNA loss), epigenetic mechanisms (such as promoter methylation or histone modifications), signaling mechanisms, or a combination of these. Prior to the work presented in this thesis, no nomenclature system existed to capture the complexity of the functional status of tumor suppressor genes, and correspondingly no computational framework existed to generate such a status. To address this deficiency, a comprehensive nomenclature system and computational framework were developed for the assessment of the functional “status” of tumor suppressor genes. The framework uses several orthogonal genomic data types, including mutation data, copy number, LOH and expression. Through correlation with additional data types (compound sensitivity and gene set activity), it is shown that this integrative method, which accounts for multiple mechanisms of tumor suppressor gene inactivation, provides a more accurate assessment of tumor suppressor gene status than can be inferred from expression, copy number, or mutation alone. Applying this comprehensive and systematic computational framework led to a marked improvement in the annotation of TP53 status across an extensive collection of cancer cell lines. Identifying cell lines with high-confidence wild-type TP53 status provides a critically important foundation for efforts to identify a signature that predicts sensitivity to inhibitors of MDM2-driven degradation of TP53.
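
As a rough illustration of how several data types might be combined into a single status call, the following is a minimal sketch only. The abstract does not specify the thesis's nomenclature or decision rules, so the class `GeneEvidence`, the function `call_status`, the status labels, and the rule order are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GeneEvidence:
    """Hypothetical per-gene, per-cell-line evidence record."""
    lof_mutation: bool    # loss-of-function mutation detected
    copy_number: int      # estimated number of gene copies (2 = diploid)
    loh: bool             # loss of heterozygosity at the locus
    expression_low: bool  # expression below some chosen cutoff


def call_status(ev: GeneEvidence) -> str:
    """Toy rule set; NOT the thesis's actual nomenclature."""
    # Both copies deleted: no functional allele can remain.
    if ev.copy_number == 0:
        return "inactivated: homozygous deletion"
    # One allele mutated and the other lost (LOH or single copy).
    if ev.lof_mutation and (ev.loh or ev.copy_number == 1):
        return "inactivated: mutation with loss of second allele"
    # Mutation on one allele, second allele apparently intact.
    if ev.lof_mutation:
        return "uncertain: heterozygous mutation"
    # No genetic lesion, but the gene appears silenced.
    if ev.expression_low:
        return "likely inactivated: low expression"
    return "no evidence of inactivation"


print(call_status(GeneEvidence(True, 2, True, False)))
```

The point of the sketch is only that a call based on any single field (for example mutation alone) would miss cases that the other fields reveal, such as homozygous deletion or silencing without mutation.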

    An approach to gene set activity analysis on a sample-by-sample basis is covered in this thesis, along with its application to the extensive collection of cancer cell lines. The underlying implementation is used in part to establish a pSTAT5 mRNA expression signature in hematopoietic cancer cell lines. This signature can potentially make it possible to identify patients who may benefit from JAK inhibitor(s), based on JAK-STAT signaling.
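
One common way to score gene set activity per sample is to average the z-scored expression of the set's member genes. The sketch below assumes that scoring scheme purely for illustration; the abstract does not describe the method actually used in the thesis, and the function name `gene_set_activity` and the input layout are hypothetical.

```python
from statistics import mean, pstdev


def gene_set_activity(expr, gene_set):
    """Mean per-gene z-score of the gene set, computed for each sample.

    expr: {gene: {sample: expression value}} (assumed layout)
    gene_set: iterable of gene names
    Returns {sample: activity score}.
    """
    samples = sorted(next(iter(expr.values())))
    z_by_sample = {s: [] for s in samples}
    for gene in gene_set:
        if gene not in expr:
            continue  # gene not measured; skip it
        values = [expr[gene][s] for s in samples]
        mu, sd = mean(values), pstdev(values)
        if sd == 0:
            continue  # constant gene carries no per-sample signal
        for s in samples:
            z_by_sample[s].append((expr[gene][s] - mu) / sd)
    # Samples with no usable genes get a neutral score of 0.0.
    return {s: (mean(z) if z else 0.0) for s, z in z_by_sample.items()}


# Made-up values, for illustration only.
toy_expr = {
    "GENE_A": {"line1": 1.0, "line2": 3.0},
    "GENE_B": {"line1": 2.0, "line2": 6.0},
}
toy_scores = gene_set_activity(toy_expr, ["GENE_A", "GENE_B"])
print(toy_scores)
```

Because each sample receives its own score, the result can be correlated directly with other per-sample data such as compound sensitivity.
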
    Date of Award: Dec 2013
    Original language: English
    Supervisors: Tatiana Tatarinova & Denis Murphy
