Information Systems & Service Design

Data Science Lifecycle Refinement with a Mapping of the Corresponding Data Science Tool Landscape

    Problem Description

    The Cross-Industry Standard Process for Data Mining (CRISP-DM) is the most common data science lifecycle method (KDNuggets, 2014; Shearer, 2000) applied by power users in industry. It covers the stages (1) Business Understanding, (2) Data Understanding, (3) Data Preparation, (4) Modelling, (5) Evaluation, and (6) Deployment. In practice, it is usually necessary to move back and forth between the cyclically ordered stages.

    CRISP-DM originated in 1996 and has accumulated the following shortcomings over time: (1) lack of support for unstructured content; (2) no consideration of (a) collaboration, (b) information sharing, and (c) self-service (Bani-Hani et al., 2017). For example, phase (6) could be split into visualization and reporting, which relates to shortcomings (b) and (c). Additionally, the tool landscape does not support diverse user groups well along the stages of the data science lifecycle, and there is no consensus on how the stages should be implemented. In the meantime, researchers as well as big tech and consulting companies have enhanced and refined data science lifecycle methods.

    Goal of Thesis

    The overall goal of the thesis is to extend CRISP-DM into a more dynamic and flexible methodology for cross-disciplinary teams that supports reciprocal iterations between different user roles. To this end, the student shall harmonize available data science lifecycle methods and map the corresponding Self-Service Business Intelligence & Analytics (SSBI&A) tool landscape to the refined stages. To obtain a more general result, tools shall be grouped into classes rather than mapped individually. Collaboration with the student working on the thesis “BI&A Role Discovery” [Link: https://issd.iism.kit.edu/thesis_1560.php] is welcome, so that SSBI&A roles can be considered in the data science lifecycle.

    Work Packages

    1. Conduct a literature review considering data science lifecycles like CRISP-DM
    2. Investigate the data science tool landscape and derive tool classes
    3. Allocate tool classes to each identified data science lifecycle
    4. Create a new, harmonized data science lifecycle and map corresponding tool classes

    Skills required

    • High intrinsic motivation and proper time management
    • Interest in Data Science; experience in applying data science lifecycle methods appreciated
    • Fluent English (as the thesis has to be written in English)

    Benefits

    • Excellent overview of data science lifecycles and their corresponding tools
    • Setting a standard for data science lifecycles
    • Insights into which tools are actually in use and in demand in industry
    • Possibility to publish the results

    Contact

    If you are interested, send me an email with a short motivation statement, your CV, and your current transcript of records. If you have questions beforehand, do not hesitate to contact me.

    sven.michalczyk@kit.edu

    References

    Bani-Hani, I., Deniz, S., and Carlsson, S. 2017. “Enabling Organizational Agility Through Self-Service Business Intelligence: The Case of a Digital Marketplace,” in Proceedings of the 21st Pacific Asia Conference on Information Systems (PACIS) (148).

    KDNuggets. 2014. “What main methodology are you using for your analytics, data mining, or data science projects?,” (available at https://www.kdnuggets.com/polls/2014/analytics-data-mining-data-science-methodology.html; retrieved April 10, 2019).

    Shearer, C. 2000. “The CRISP-DM Model: The New Blueprint for Data Mining,” Journal of Data Warehousing (5:4), pp. 13–22.