data science thesis structure

Thesis/Capstone for Master's in Data Science | Northwestern SPS - Northwestern School of Professional Studies


Data Science

Capstone and thesis overview.

Capstone and thesis are similar in that they both represent a culminating, scholarly effort of high quality. Both should clearly state a problem or issue to be addressed. Both will allow students to complete a larger project and produce a product or publication that can be highlighted on their resumes. Students should consider the factors below when deciding whether a capstone or thesis may be more appropriate to pursue.

A capstone is a practical or real-world project that can emphasize preparation for professional practice. A capstone is more appropriate if:

you don't necessarily need or want the experience of the research process or writing a big publication
you want more input on your project, from fellow students and instructors
you want more structure to your project, including assignment deadlines and due dates
you want to complete the project or graduate in a timely manner

A student can enroll in MSDS 498 Capstone in any term. However, capstone specialization courses can provide a unique student experience and may be offered only twice a year.

A thesis is an academic-focused research project with broader applicability. A thesis is more appropriate if:

you want to get a PhD or other advanced degree and want the experience of the research process and writing for publication
you want to work individually with a specific faculty member who serves as your thesis adviser
you are more self-directed, are good at managing your own projects with very little supervision, and have a clear direction for your work
you have a project that requires more time to pursue

Students can enroll in MSDS 590 Thesis as long as there is an approved thesis project proposal, identified thesis adviser, and all other required documentation at least two weeks before the start of any term.

From Faculty Director, Thomas W. Miller, PhD

“Capstone projects and thesis research give students a chance to study topics of special interest to them. Students can highlight analytical skills developed in the program. Work on capstone and thesis research projects often leads to publications that students can highlight on their resumes.”

A thesis is an individual research project that usually takes two to four terms to complete. Capstone course sections, on the other hand, represent a one-term commitment.

Students need to evaluate their options prior to choosing a capstone course section because capstones vary widely from one instructor to the next. There are both general and specialization-focused capstone sections. Some capstone sections offer individual research projects, others offer team research projects, and a few give students a choice of individual or team projects.

Students should refer to the SPS Graduate Student Handbook for more information regarding registration for either MSDS 590 Thesis or MSDS 498 Capstone.

Capstone Experience

If students wish to engage with an outside organization to work on a project for capstone, they can refer to this checklist and lessons learned for some helpful tips.

Capstone Checklist

Start early — set aside a minimum of one to two months prior to the capstone quarter to determine the industry and modeling interests.
Networking — pitch your idea to potential organizations for projects and focus on the business benefits you can provide.
Permission request — make sure your final project can be shared with others in the course and the information can be made public.
Engagement — engage with the capstone professor prior to and immediately after getting the dataset to ensure appropriate scope for the 10 weeks.
Teambuilding — recruit team members who have similar interests for the type of project during the first week of the course.

Capstone Lessons Learned

Access to company data can take longer than expected; not having this access before or at the start of the term can severely delay progress
The project timeline should align with the coursework timeline as closely as possible
Designate one point of contact (POC) for business-facing communication to ensure streamlined messages and more effective time management with the organization
Manage expectations on both sides: for the business, the work is pro bono; for students, it does not guarantee internship or job opportunities
Data security or masking that is not completed in time can jeopardize the opportunity entirely

Publication of Work

Northwestern University Libraries offers an option for students to publish their master’s thesis or capstone in Arch, Northwestern’s open access research and data repository.

Benefits of publishing your thesis:

Your work will be indexed by search engines and discoverable by researchers around the world, extending your work’s impact beyond Northwestern
Your work will be assigned a Digital Object Identifier (DOI) to ensure perpetual online access and to facilitate scholarly citation
Your work will help accelerate discovery and increase knowledge in your subject domain by adding to the global corpus of public scholarly information

Get started:

Visit Arch online
Log in with your NetID
Describe your thesis: title, author, date, keywords, rights, license, subject, etc.
Upload your thesis or capstone PDF and any related supplemental files (data, code, images, presentations, documentation, etc.)
Select a visibility: Public, Northwestern-only, Embargo (i.e. delayed release)
Save your work to the repository

Your thesis manuscript or capstone report will then be published on the MSDS page. You can view other published work here.

For questions or support in publishing your thesis or capstone, please contact [email protected].

Instructions for MSc Thesis

Before the thesis.

Before you start work on your thesis, it is important to put some thought into the choice of topic and familiarize yourself with the criteria and procedure. To do that, follow these steps, in this order:

Step 0: Read the university instructions.

Read the MSc thesis instructions and grading criteria on the university website. Computer Science Master's program: [link]. Data Science Master's program: [link].

Step 1: Choose a topic.

Choose a topic among the ones listed on the group's webpage [link].

You can also propose your own topic. In this case, you must explain what the main contribution of the thesis will be and identify at least one scientific publication that is related to the topic you propose.

Step 2: Contact us.

Submit the application form [link] to let us know of your interest in doing your thesis in the group. Note: If you contact us, please be ready to start work on the thesis within one month.

Step 3: Agree on the topic.

We have a brief discussion about the topic and devise a high-level plan for thesis work and content. We also discuss a start date, when you start work on the thesis. In addition, you should contact a second evaluator for the thesis.

Thesis timeline

Below you find the milestones after you have started work on the thesis. In parentheses, you find an estimate of when each milestone occurs. The thesis work ends when you submit it for approval. The total duration from start to end of the thesis should be about four months.

Milestone #0: Thesis outline (at most 3 weeks from the start).

You create a first outline of the thesis. The outline should contain the titles of the chapters, along with a (tentative) list of sections and contents. An indicative template for the outline is shown below on this page.

Milestone #1: A draft with first results (about 2 months from start).

All chapters should contain some readable content (not necessarily polished). Most importantly, some results should already be described. Ideally, you should be able to complete and refine the results within one more month.

Milestone #2: A draft with all results (about 1 month before the end).

Most content should now be in the draft. Some polishing remains and some results may still be refined. Notify the second evaluator that you are near the end of the thesis work. Optionally, you may send the thesis draft and receive preliminary comments from the second evaluator.

Milestone #3: Submit the thesis for approval (end of thesis work).

You will receive a grade and comments after the next program board's meeting.
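The milestone estimates above can be turned into indicative calendar dates. The following is a minimal sketch, not part of the official instructions: it assumes a "month" is roughly 30 days, and the function name `thesis_milestones` and the example start date are hypothetical.

```python
from datetime import date, timedelta

# Assumption: one "month" is approximated as 30 days, since the
# instructions only give rough estimates ("about 2 months", etc.).
DAYS_PER_MONTH = 30


def thesis_milestones(start: date, total_months: int = 4) -> dict[str, date]:
    """Return indicative milestone dates for a thesis that starts on `start`."""
    end = start + timedelta(days=total_months * DAYS_PER_MONTH)
    return {
        # Milestone #0: at most 3 weeks from the start
        "outline": start + timedelta(weeks=3),
        # Milestone #1: about 2 months from the start
        "first_results": start + timedelta(days=2 * DAYS_PER_MONTH),
        # Milestone #2: about 1 month before the end
        "all_results": end - timedelta(days=DAYS_PER_MONTH),
        # Milestone #3: submission for approval, end of thesis work
        "submission": end,
    }


if __name__ == "__main__":
    # Hypothetical start date, for illustration only.
    for name, due in thesis_milestones(date(2025, 2, 3)).items():
        print(f"{due.isoformat()}  {name}")
```

The dates are planning aids only; the actual deadlines are the ones agreed with the supervisor.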

Supervision

What you can expect from the supervisor:

Comments for the thesis draft after each milestone (see timeline above) and, if necessary, a meeting.
Suggestions for how to proceed in cases when you encounter a major hurdle.

In addition, you are welcome to participate in the group meetings and discuss your thesis work with other group members.

Note however that one of the grading criteria for the thesis is whether you worked independently -- and in the end, the thesis should be your own work.

Template for Thesis Outline

Below you find a suggested template for the outline of the thesis. You may adapt it to your work, of course (e.g., change chapter titles or structure).

Abstract

A summary of the thesis that mentions the broader topic of the thesis and why it is important; the research question or technical problem addressed by the thesis; the main thesis contributions (e.g., data gathering, developed methods and algorithms, experimental evaluation) and results.

Chapter 1: Introduction

The introduction should motivate the thesis and give a longer summary. It should be written in a way that allows anyone in your program to understand it, even if they are not experts in the topic.

What is the broader topic of the thesis?
Why is it important?
What research question(s) or technical problems does the thesis address?
What are the most related works from the literature on the topic? How does the thesis differ from what has already been done?
What are the main thesis contributions (e.g., data gathering, developed methods and algorithms, experimental evaluation)?
What are the results?

Chapter 2: Related literature

Organize this chapter in sections, with one section for each research area that is related to your thesis. For each research area, cite all the publications that are related to your topic, and describe at least the most important of them.

Chapter 3: Preliminaries

In this chapter, place the information that is necessary for you to describe the contributions and results of the thesis. It may be different from thesis to thesis, but could include sections about:

Setting. Define the terms and notation you will be using. State any assumptions you make across the thesis.
Background on Methods. Describe existing methods from the literature (e.g., algorithms or ML models) that you use for your work.
Data (esp. for a Data Science thesis). If the main contribution is data analysis, then describe the data here, before the analysis.

Chapter 4: Methodological contribution

For a Computer Science thesis, this part typically describes the algorithm(s) developed for the thesis. For a Data Science thesis, this part typically describes the method for the analysis.

Chapter 5: Results

This chapter describes the results obtained when the methods of Chapter 4 are used on data.

For a Computer Science thesis, this part typically describes the performance of the developed algorithm(s) on various synthetic and real datasets. For a Data Science thesis, this part typically describes the findings of the analysis.

The chapter should also describe what insights are obtained from the results.

Chapter 6: Conclusion

Summarize the contribution of the thesis.
Provide an evaluation: are the results conclusive, are there limitations in the contribution?
How would you extend the thesis, what can be done next on the same topic?

Thesis Option

Data Science master’s students can choose to satisfy the research experience requirement by selecting the thesis option. Students will spend the majority of their second year working on a substantial data science project that culminates in the submission and oral defense of a master’s thesis. While all thesis projects must be related to data science, students are given leeway in finding a project in a domain of study that fits with their background and interest.

All students choosing the thesis option must find a research advisor and submit a thesis proposal by mid-April of their first year of study. Thesis proposals will be evaluated by the Data Science faculty committee and only those students whose proposals are accepted will be allowed to continue with the thesis option.

To account for the time spent on thesis research, students choosing the thesis option are able to substitute AC 302 for three required courses: the Capstone and two "free" elective courses (as defined in the final bullet point on the degree requirement page).



Data science master's theses.

The Master of Science in Data Science program requires the successful completion of 12 courses to obtain a degree. These requirements cover six core courses, a leadership or project management course, two required courses corresponding to a declared specialization, two electives, and a capstone project or thesis. This collection contains a selection of master's theses or capstone projects by MSDS graduates.


MSc in Data Science, Project Guide, 2018-2019


Introduction

The project is an essential component of the Masters course. It is a substantial piece of full-time independent research in some area of data science. You will carry out your project under the individual supervision of a member of CDT staff.

The project will occupy a large part of your time during the Spring semester, and 100% of your time from late May/early June — once your examinations have completed — until mid-August. A dissertation describing the work must be submitted by a deadline in mid-August.

Choosing a Project

You are expected to choose a project at the end of Semester 1. Students are expected to find their own projects in consultation with supervisors. To help with this, staff will post some project ideas in late October. These will be indicative of their areas of interest, but they shouldn't be interpreted as a fixed menu; they are simply the starting point for discussion. The procedure for project selection is:

You should identify some research areas that interest you, on the basis of your coursework so far, your independent reading, the guest lectures in IRDS, and the set of project ideas proposed by staff in late October.
Arrange meetings with supervisors in those research areas to discuss potential MSc projects. Often supervisors will have several potential project ideas in mind, but you should of course bring up any potential directions that you have been thinking about.
IMPORTANT: Once you have identified a project and supervisor who is willing to take you on, you will need to fill out a brief form identifying the topic and supervisor. The deadline for this is the 12th of December, 2018.
The project proposals will all be reviewed for suitability by the CDT project coordinator. However, your proposal is not a contract and we are not going to hold you to it. It should simply represent a good-faith attempt to identify a topic of mutual interest to you and your supervisor.

Schedule and Important Dates

The overall schedule is: You will meet with supervisors during Semester 1 and select a project shortly after Semester 1 classes end. Once you have selected a project, we recommend that you get a head start on your project over the winter break. During Semester 2, you will work approximately 50% on coursework and 50% on your project. After classes end in Semester 2, you will have a revision period for your exams — during this period we recommend that you focus on your exams. Once the exams complete, you should return to your project work, spending 100% time on it until the final deadline in mid-August.

Here are the important dates and deadlines for 2018-19:

November -- You should start meeting with potential MSc supervisors now (if you have not begun already)
12 December 2018 -- MSc project selections due (RTDS students).
11 January 2019, noon -- Interim Report due (RTDS+ students).
1 March 2019, noon -- Interim Report due (RTDS students).
April - May 2019 -- Revision period and exams. During this period we would not expect you to be making much progress on your project
late May 2019 -- Begin full time work on project.
mid-August 2019 (exact date TBD, probably 16 Aug) -- Deadline for submission of dissertation.
October 2019 -- Board of Examiners meets and marks announced

Supervision

As part of choosing a project, you will also choose a supervisor. Your supervisor gives technical advice and also assists you in planning the research. Students should expect approximately weekly meetings with their supervisor. Backup supervisors may be allocated to cover periods of absence of the supervisor, if necessary.

Interim Report

At the beginning of March (or early January for RTDS+), you will submit an interim report about how your project has gone so far. This should be 6-8 pages. This report will not form part of the mark; it is solely for feedback, so it is in your best interest to complete it. The report should describe the research problem that you are considering, explain why it is important, what methods you expect to use, how you expect to evaluate your results, what results you have been able to obtain so far, and what your plans are for the summer. You should write this in such a way that you can re-use the text in your final MSc project report.

Relationship to Your PhD Project

The MSc project is designed to be a first research project that prepares you for the more extended work that you will do in your PhD. The project is intended to be novel research — we hope that in some cases the MSc projects will lead to publishable results, although this is not required and will not always be possible, depending on the nature of the project. Your supervisor should help you identify a topic that has the potential to lead into a larger PhD project, should you decide to continue research in the area.

That said, it is not required that your PhD research be in the same area as your MSc research. Some students will indeed continue their PhD work with the same research area and supervisor as their MSc. Others will choose a different PhD supervisor. Both of these outcomes are expected and are perfectly fine.

Of course if you do already have a good idea about your intended PhD topic, you will want to take this into account when selecting your MSc topic — whether it be to choose a topic in the same area, or to choose a topic that will provide you with complementary experience.

Projects with External Collaborators

Some students may wish to undertake a project which relates to the activities of one of our external partners. Alternatively, some projects that supervisors suggest to you may have a natural relationship with one of the CDT partners. This is encouraged. A student undertaking such a project will still need to find an academic supervisor who is willing to take on the project. During the project phase, students working on such projects have both an academic supervisor and a designated contact at the partner organization.

We strongly encourage you to discuss your projects with other students, talk informally about your progress, and get advice from your peers about any issues. Last year this happened as part of the CDT Tea meetings; this year, we will discuss whether to continue this or to have more formal tutorials.

The Dissertation

Title page with abstract.
Introduction: an introduction to the document, clearly stating the hypothesis or objective of the project, motivation for the work and the results achieved. The structure of the remainder of the document should also be outlined.
Background: background to the project, previous work, exposition of relevant literature, setting of the work in the proper context. This should contain sufficient information to allow the reader to appreciate the contribution you have made.
Description of the work undertaken: this may be divided into chapters describing the conceptual design work and the actual implementation separately. Any problems or difficulties and the suggested solutions should be mentioned. Alternative solutions and their evaluation should also be included.
Analysis or Evaluation: results and their critical analysis should be reported, whether the results conform to expectations or otherwise and how they compare with other related work. Where appropriate, evaluation of the work against the original objectives should be presented.
Conclusion: concluding remarks and observations, unsolved problems, suggestions for further work.
Bibliography.

In addition, the dissertation must be accompanied by a statement declaring that the student has read and understood the University's plagiarism guidelines.

In the acknowledgments section of your dissertation, in addition to thanking anyone that you wish, you should also acknowledge the funding sources that have supported you during the year. Please follow these instructions for acknowledging your funding sources. You should get to know them well as you will also need to follow them for every paper that you publish during your PhD.

Students should write as they go, but should also budget several weeks towards the end of the project to focus on writing. Where appropriate the dissertation may additionally contain appendices in which relevant program listings, experimental data, circuit diagrams, formal proofs, etc. may be included. However, students should keep in mind that they are marked on the quality of the dissertation, not its length.

The dissertation must be word-processed using either LaTeX or a system with similar capabilities. The LaTeX thesis template can be found via the local packages web page. You don't have to use these packages, but your thesis must match the style (i.e., font size, text width etc) shown in the sample output for an Informatics thesis.

Computing Resources

Many projects will require computing resources. Please see the CDT handbook for information about what computing resources are available to CDT students.

If a project requires anything more, this needs to be requested at the time of writing the proposal, and the supervisor needs to explicitly ask for additional resources if necessary (start by talking to the CDT projects organizer, below).

Technical problems during project work are only considered for resources we provide. No technical support, compensation for lost data, or extensions for time lost due to technical problems with external hardware and software will be given, except where this is explicitly stated as part of a project specification and adequately resourced at the start of the project.

Students must submit their project by the deadline in mid-August (see above). Students need to submit a hard copy, an electronic copy, and a software archive, as detailed below.

Hard Copy. Two printed copies of the dissertation, bound with the soft covers provided by the School, must be submitted to the ITO before the deadline.
Electronic Copy. Students must follow the instructions for how to submit their project electronically. Please use the online submission form that is linked from there.
Software. Students are required to preserve any software they have generated (source, object, and make files), together with any associated data that has been accumulated. When you submit the electronic copy of your thesis you will also be asked to provide an archive file (tar or zip) containing all the project materials. You should create a directory, for example named PROJECT, in your file space specifically for the purpose. Please follow the accepted practice of creating a README file which documents your files and their function. This directory should be compressed and then submitted, together with the electronic version of the thesis, via the online submission webpage. See these instructions for how to submit your project electronically.
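The archiving step described above could be scripted along the following lines. This is a minimal sketch using Python's standard `tarfile` module, not the School's official procedure: the function name `archive_project`, the gzip compression, and the output file name `PROJECT.tar.gz` are assumptions made for illustration.

```python
import tarfile
from pathlib import Path


def archive_project(project_dir: Path, archive_path: Path) -> Path:
    """Compress `project_dir` into a gzip-compressed tar archive.

    Refuses to run if the directory has no README file, since the
    guide asks for one documenting the files and their function.
    """
    # Accept README, README.txt, README.md, and similar names.
    if not any(p.is_file() for p in project_dir.glob("README*")):
        raise FileNotFoundError(f"{project_dir} has no README documenting its files")
    # Write the archive outside project_dir so it does not include itself.
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps paths inside the archive relative to the directory name.
        tar.add(project_dir, arcname=project_dir.name)
    return archive_path


if __name__ == "__main__":
    # Hypothetical directory name taken from the example in the text.
    project = Path("PROJECT")
    if project.is_dir():
        print(archive_project(project, Path("PROJECT.tar.gz")))
```

The resulting file is what would be uploaded alongside the electronic copy of the thesis; a zip archive would serve equally well, since the guide accepts either.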

Project Assessment

Understanding of the problem
Completion of the work
Quality of the work
Quality of the dissertation
Knowledge of the literature
Critical evaluation of previous work
Critical evaluation of own work
Justification of design decisions
Solution of conceptual problems
Amount of work
Evidence of outstanding merit e.g. originality
Inclusion of material worthy of publication

The project involves both the application of skills learned in the past and the acquisition of new skills. It allows students to demonstrate their ability to organise and carry out a major piece of work according to sound scientific and engineering principles. The types of activity involved in each project will vary but all will typically share the following features:

Research the literature and gather background information
Analyse requirements, compare alternatives and specify a solution
Design and implement the solution
Experiment and evaluate the solution
Develop written and oral presentation skills

You may have noticed that there is both a 90pt version of the project (RTDS) and a 120pt version (RTDS+). The 120pt version is for students who have a previous Master's degree in an area relating to data science along with a clear project and a supervisor in mind when they arrive, and therefore want to take fewer classes and a larger project. If you wish to choose this option, you must speak to the CDT Year 1 organizer during course registration; see the MSc by Research Course Handbook for more information.

The RTDS+ project works the same as the RTDS project, except that: (a) You are expected to have selected a supervisor by 21 September; (b) You should commence work on your project part-time in the autumn; (c) You should submit an interim report by 11 January; and (d) The markers will look to see evidence of more work or a more advanced project, commensurate to the additional amount of time you have had. For example, a larger project might make a larger research contribution, apply more advanced methodology, contain more extensive experimental evaluation, etc.

This page is currently maintained by Adam Lopez.


How to structure a thesis

A typical thesis structure


Starting a thesis can be daunting. There are so many questions in the beginning:

How do you actually start your thesis?
How do you structure it?
What information should the individual chapters contain?

Each educational program has different demands on your thesis structure, which is why asking directly for the requirements of your program should be a first step. However, there is not much flexibility when it comes to structuring your thesis.

Abstract: a brief overview of your entire thesis.

Literature review: an evaluation of previous research on your topic that includes a discussion of gaps in the research and how your work may fill them.

Methods: outlines the methodology that you are using in your research.

Thesis: a large paper, or multi-chapter work, based on a topic relating to your field of study.

The abstract is the overview of your thesis and generally very short. This section should highlight the main contents of your thesis “at a glance” so that someone who is curious about your work can get the gist quickly. Take a look at our guide on how to write an abstract for more info.

Tip: Consider writing your abstract last, after you’ve written everything else.

The introduction to your thesis gives an overview of its basics or main points. It should answer the following questions:

Why is the topic being studied?
How is the topic being studied?
What is being studied?

In answering the first question, you should know what your personal interest in this topic is and why it is relevant. Why does it matter?

To answer the "how", you should briefly explain how you are going to reach your research goal. Some prefer to answer that question in the methods chapter, but you can give a quick overview here.

And finally, you should explain "what" you are studying. You can also give background information here.

You should rewrite the introduction one last time when the writing is done to make sure it connects with your conclusion. Learn more about how to write a good thesis introduction in our thesis introduction guide.

A literature review is often part of the introduction, but it can be a separate section. It is an evaluation of previous research on the topic showing that there are gaps that your research will attempt to fill. A few tips for your literature review:

Use a wide array of sources
Show both sides of the coin
Make sure to cover the classics in your field
Present everything in a clear and structured manner

For more insights on lit reviews, take a look at our guide on how to write a literature review.

The methodology chapter describes the methods you chose to gather data, explains how the data is analyzed, and justifies why you chose that methodology. It shows how your choice of design and research methods is suited to answering your research question.

Make sure to also explain what the pitfalls of your approach are and how you have tried to mitigate them. Discussing where your study might come up short can give you more credibility, since it shows the reader that you are aware of its limitations.

Tip: Use graphs and tables, where appropriate, to visualize your results.

The results chapter outlines what you found out in relation to your research questions or hypotheses. It generally contains the facts of your research and does not include a lot of analysis, because that happens mostly in the discussion chapter.

Clearly visualize your results, using tables and graphs, especially when summarizing, and be consistent in your way of reporting. This means sticking to one format to help the reader evaluate and compare the data.

The discussion chapter includes your own analysis and interpretation of the data you gathered, comments on your results, and explains what they mean. This is your opportunity to show that you have understood your findings and their significance.

Point out the limitations of your study, provide explanations for unexpected results, and note any questions that remain unanswered.

The conclusion is probably your most important chapter. This is where you highlight that your research objectives have been achieved. You can also reiterate any limitations to your study and make suggestions for future research.

Remember to check if you have really answered all your research questions and hypotheses in this chapter. Your thesis should be tied up nicely in the conclusion and show clearly what you did, what results you got, and what you learned. Discover how to write a good conclusion in our thesis conclusion guide.

At the end of your thesis, you’ll have to compile a list of references for everything you’ve cited above. Ideally, you should keep track of everything from the beginning. Otherwise, this could be a mammoth and pretty laborious task to do.

Consider using a reference manager like Paperpile to format and organize your citations. Paperpile allows you to organize and save your citations for later use and cite them in thousands of citation styles directly in Google Docs, Microsoft Word, or LaTeX.

🔲 Introduction

🔲 Literature review

🔲 Discussion

🔲 Conclusion

🔲 Reference list

The basic elements of a thesis are: Abstract, Introduction, Literature Review, Methods, Results, Discussion, Conclusion, and Reference List.

It's recommended to start a thesis by writing the literature review first. This way you learn more about the sources, before jumping to the discussion or any other element.

It's recommended to write the abstract of a thesis last, once everything else is done. This way you will be able to provide a complete overview of your work.

Usually, the discussion is the longest part of a thesis. In this part you are supposed to point out the limitations of your study, provide explanations for unexpected results, and note any questions that remain unanswered.

The order of the basic elements of a thesis is: 1. Abstract, 2. Introduction, 3. Literature Review, 4. Methods, 5. Results, 6. Discussion, 7. Conclusion, and 8. Reference List.

Chair of Data Science

Master and Bachelor Theses

We provide topics for theses in

Bachelor Internet Computing and Bachelor Computer Science
Master Computer Science
Master AI Engineering

To conduct a bachelor or master thesis at the chair, we expect students to have successfully participated in courses offered at the chair. Ideally, you have successfully completed one of our advanced seminars or labs and want to continue on a similar topic. In these cases, you can talk directly with the lecturer of the lab or seminar and ask about possibilities for a thesis.

If you do not talk directly to one of the lecturers about a thesis, you should look at the Stud.IP Group AI Announcements (https://studip.uni-passau.de/studip/dispatch.php/course/details?sem_id=4d229d8f38a46587ab171699ff24c22e&again=yes). We publish topics there in case spots for thesis students are left at the chair. However, please note that this is rarely the case. Please also note that we will not answer any e-mails requesting or suggesting a master thesis topic (especially from industry). Please also do not contact employees at the chair for master thesis topics.

The Thesis Process at Our Chair

Pursuing a thesis at our Chair is organised in the following two phases:

Preparation Phase (0-1 Months): After you have decided to work on a topic, you will be given up to one month to prepare your work in detail if needed. That means you must specify the research questions to be answered in detail, analyze the literature, and plan all steps necessary to answer your research questions. The result is either a written exposé for bachelor students or a presentation given in the "Oberseminar Medieninformatik" for master students. Only after a successful presentation and an accepted exposé will we allow you to register the thesis with the student secretary.
Implementation Phase (3-6 Months): After registering your thesis, you have up to 6 months to complete it and hand it in to the student secretary. You can of course complete it earlier. You must agree with your advisor on a schedule for meetings and other working modalities. We also expect you to be proactive, i.e. if there are problems, it is your responsibility to engage with the supervisor.

Individual Steps in Detail

Requirements for the Written Thesis:

The thesis can be written in English or German.
The structure should follow scientific rules. See our guide for details.
A master thesis should range between 60 and 80 pages (excluding appendix, table of contents, lists of figures/tables, and references), depending on the complexity of the subject. The layout must be concise and reasonably dense (e.g. no filling of pages with figures).
A bachelor thesis should range between 30 and 40 pages (excluding appendix, table of contents, lists of figures/tables, and references), depending on the complexity of the subject.
Selected master theses will be made completely available via the OPUS 4 server of the university.
All source code and data sets developed must be made Open Source (exceptions are source code with sensitive material or IPR violations), preferably via Github.com or Zenodo.
Students will create a repository at the FIM Gitlab for managing source code, experiments, and the written thesis. The supervisor of the thesis has to be added as a collaborator to the project in order to ease communication.

There is also a (still incomplete) guide on how to conduct a thesis at our chair. The guide should give you an idea of what is expected from you and how to break the goal of getting your grade down into smaller sub-units.

Department of Computer Science

Thesis Projects and Research in Data Science

The Master's thesis is a mandatory course of the Master's program in Data Science. The thesis is supervised by a professor from the data science faculty list.

Research in Data Science is a core elective for students in Data Science under the supervision of a data science professor.

Research in Data Science

The project is independent work under the supervision of a member of the faculty in data science.

Only students who have passed at least one core course in Data Management and Processing, and one core course in Data Analysis can start with a research project.

Before starting, the project must be registered in mystudies, and a project description must be submitted to the studies administration by e-mail at the start of the project (see Contact for the address).

Master's Thesis

The Master's Thesis requires 6 months of full time study/work, and we strongly discourage you from attending any courses in parallel. We recommend that you acquire all course credits before the start of the Master’s thesis. The topic for the Master’s thesis must be chosen within Data Science.

Before starting a Master’s thesis, it is important to agree with your supervisor on the task and the assessment scheme. Both have to be documented thoroughly. You electronically register the Master’s thesis in mystudies.

It is possible to complete the Master’s thesis in industry provided that a professor involved in the Data Science Master’s program supervises the thesis and your tutor approves it.

Further details on internal regulations of the Master’s thesis can be downloaded from the following website: www.inf.ethz.ch/studies/forms-and-documents.html.

Overview Master's Theses Projects

Chair of Programming Methodology

Prof. Dr. Martin Vechev

Institute for Computing Platforms

Prof. Dr. Gustavo Alonso
Prof. Dr. Torsten Hoefler
Prof. Dr. Ana Klimovic
Prof. Dr. Timothy Roscoe

Institute for Machine Learning

Prof. Dr. Valentina Boeva
Prof. Dr. Joachim Buhmann
Prof. Dr. Ryan Cotterell
Prof. Dr. Menna El-Assady
Prof. Dr. Niao He
Prof. Dr. Thomas Hofmann
Prof. Dr. Andreas Krause
Prof. Dr. Fernando Perez Cruz
Prof. Dr. Gunnar Rätsch
Prof. Dr. Mrinmaya Sachan
Prof. Dr. Bernhard Schölkopf
Prof. Dr. Julia Vogt

Institute for Pervasive Computing

Prof. Dr. Otmar Hilliges

Institute of Computer Systems

Prof. Dr. Markus Püschel

Institute of Information Security

Prof. Dr. David Basin
Prof. Dr. Srdjan Capkun
Prof. Dr. Florian Tramèr

Institute of Theoretical Computer Science

Prof. Dr. Bernd Gärtner

Institute of Visual Computing

Prof. Dr. Markus Gross
Prof. Dr. Marc Pollefeys
Prof. Dr. Olga Sorkine
Prof. Dr. Siyu Tang

Disney Research Zurich

Prof. Dr. Robert Sumner

Automatic Control Laboratory

Prof. Dr. Florian Dörfler
Prof. Dr. John Lygeros

Communication Technology Laboratory

Prof. Dr. Helmut Bölcskei

Computer Engineering and Networks Laboratory

Prof. Dr. Laurent Vanbever
Prof. Dr. Roger Wattenhofer

Computer Vision Laboratory

Prof. Dr. Ender Konukoglu
Prof. Dr. Luc Van Gool
Prof. Dr. Fisher Yu

Institute for Biomedical Engineering

Prof. Dr. Klaas Enno Stephan

Integrated Systems Laboratory

Prof. Dr. Luca Benini
Prof. Dr. Christoph Studer

Signal and Information Processing Laboratory (ISI)

Prof. Dr. Amos Lapidoth
Prof. Dr. Hans-Andrea Loeliger

D-MATH does not publish Master's Theses projects. In case of interest contact the professor directly.

FIM - Institute for Mathematical Research

Prof. Dr. Alessio Figalli

Financial Mathematics

Prof. Dr. Josef Teichmann

Institute for Operations Research

Prof. Dr. Robert Weismantel
Prof. Dr. Rico Zenklusen

RiskLab Switzerland

Prof. Dr. Patrick Cheridito
Prof. Dr. Mario Valentin Wüthrich

Seminar for Applied Mathematics

Prof. Dr. Rima Alaifari
Prof. Dr. Siddhartha Mishra

Seminar for Statistics

Prof. Dr. Afonso Bandeira
Prof. Dr. Peter Bühlmann
Prof. Dr. Yuansi Chen
Prof. Dr. Nicolai Meinshausen
Prof. Dr. Jonas Peters
Prof. Dr. Johanna Ziegel

Law, Economics, and Data Science Group

Prof. Dr. Elliott Ash (D-GESS)

Institute for Geodesy and Photogrammetry

Prof. Dr. Konrad Schindler (D-BSSE)

The plan in Data Science leads to the Master of Science degree. This plan is designed to equip students with the capability of integrating a wide spectrum of interdisciplinary knowledge and skills to uncover and utilize data to produce, apply, and communicate value-adding intelligence for organizations and society, in various key technical, analytical, architectural, and managerial positions.

Requirements

Data Science Basic Preparation

Students entering the Master of Science in Data Science program should hold a Bachelor’s Degree with a background in programming in Python or equivalent, data structures, and probability and statistics. Students without the necessary background can take the following 3 foundation courses.

Foundation Course Requirements

DASC 5031 - Python for Data Science Credit Hours: 3
DASC 5032 - Data Structures for Data Science Credit Hours: 3
STAT 3334 - Probability and Statistics for Scientists and Engineers Credit Hours: 3

Additional Information

None of the above courses may apply to the graduate degree.

Students may select from the thesis option or the extended course work option. The thesis option requires 33 credit hours of graduate work and the extended course work option requires 36 credit hours.

Core Requirements (21 hours)

The following courses, or approved substitutions, are required for both the thesis option and extended course work options

CSCI 5388 - Big Data Analytics Credit Hours: 3
CSCI 5833 - Data Mining: Tools and Techniques Credit Hours: 3
DASC 5133 - Introduction to Data Science Credit Hours: 3
DASC 5231 - Visualization in Data Science Credit Hours: 3
DASC 5333 - Database Systems for Data Science Credit Hours: 3
STAT 5135 - Applied Statistical Methods Credit Hours: 3
STAT 5532 - Linear Models and Regression Analysis Credit Hours: 3

Data Science Thesis Option (12 hours)

6 hours of approved related courses

6 hours of DASC 6939 - Master’s Thesis Research Credit Hours: 3

Students interested in pursuing the thesis option are encouraged to take DASC 5939 (Independent Study in Data Science) during their first year, in order to write up their thesis proposals (with the sponsorship of a faculty adviser). All electives must be approved before enrolling.

Data Science Extended Course Work Option (15 hours)

Students desiring to follow the extended course work option must successfully complete the capstone project course (DASC 6838) and 12 hours of approved electives.

DASC 6838 must be taken after completion of the required core and during the last 12 hours. All electives must be approved before enrolling.
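
The credit-hour totals for the two options follow from adding the 21 core hours to each option's own hours. The short Python sketch below checks that arithmetic against the stated totals of 33 and 36; the variable and function names are illustrative only.

```python
# Credit hours as stated in the program description above.
CORE_HOURS = 21
OPTION_HOURS = {
    "thesis": 12,                # 6 hours of approved related courses + 6 hours of thesis research
    "extended_course_work": 15,  # capstone project course + 12 hours of approved electives
}


def total_hours(option):
    """Return the total graduate credit hours for the given degree option."""
    return CORE_HOURS + OPTION_HOURS[option]


totals = {option: total_hours(option) for option in OPTION_HOURS}
print(totals)
```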

Data Science Specializations

Business Analytics Specialization

CINF 5231 - Strategic Information Systems Credit Hours: 3
CSCI 5832 - Financial Data Mining Credit Hours: 3
DSCI 5131 - Business Analytics I Credit Hours: 3
ISAM 5330 - Management Information Systems Credit Hours: 3
CINF 5432 - Data Warehousing and Business Intelligence Credit Hours: 3
MGMT 6135 - Data Visualization and Communication Credit Hours: 3

Cloud and Big Data Solutions Specialization

DASC 5335 - Deep Learning Credit Hours: 3
CSCI 5355 - Internet of Things (IoT) Credit Hours: 3
CSCI 5336 - Machine Learning Credit Hours: 3

Bioinformatics Specialization

BIOL 4341 - Biochemistry I Credit Hours: 3
BIOT 5733 - Bioinformatics Credit Hours: 3
CSCI 5933 - Computational Bioinformatics Credit Hours: 3

BIOL 4341 and BIOL 4351 have other BIOL courses as prerequisites. A student who has already taken this course in undergraduate study should take an additional elective course below.

Bioinformatics Elective

BIOT 5011 - Methods of Biotechnology Discussions Credit Hours: 1
BIOT 5021 - Methods of Biotechnology Credit Hours: 2
BIOT 5431 - Genomic Analysis Credit Hours: 3
BIOT 5331 - Stem Cell Biotechnology Credit Hours: 3
BIOT 5433 - Marine Biotechnology Seminar Credit Hours: 3
BIOT 5535 - Environmental Biotechnology Credit Hours: 3

Analytics Insight

10 Best Research and Thesis Topic Ideas for Data Science in 2022

These research and thesis topics for data science will ensure more knowledge and skills for both students and scholars

Handling practical video analytics in a distributed cloud: With increased dependency on the internet, sharing videos has become a mode of data and information exchange. The role of the implementation of the Internet of Things (IoT), telecom infrastructure, and operators is huge in generating insights from video analytics. In this perspective, several questions need to be answered, like the efficiency of the existing analytics systems, the changes about to take place if real-time analytics are integrated, and others.
Smart healthcare systems using big data analytics: Big data analytics plays a significant role in making healthcare more efficient, accessible, and cost-effective. Big data analytics enhances the operational efficiency of smart healthcare providers by providing real-time analytics. It enhances the capabilities of the intelligent systems by using short-span data-driven insights, but there are still distinct challenges that are yet to be addressed in this field.
Identifying fake news using real-time analytics: The circulation of fake news has become a pressing issue in the modern era. The data gathered from social media networks might seem legit, but sometimes they are not. The sources that provide the data are unauthenticated most of the time, which makes it a crucial issue to be addressed.
Secure federated learning with real-world applications: Federated learning is a technique that trains an algorithm across multiple decentralized edge devices and servers. This technique can be adopted to build models locally, but whether it can be deployed at scale across multiple platforms with high-level security is still unclear.
Big data analytics and its impact on marketing strategy: The advent of data science and big data analytics has entirely redefined the marketing industry. It has helped enterprises by offering valuable insights into their existing and future customers. But several issues, such as the existence of surplus data, integrating complex data into customers’ journeys, and complete data privacy, remain largely unexplored and need immediate attention.
Impact of big data on business decision-making: Present studies signify that big data has transformed the way managers and business leaders make critical decisions concerning the growth and development of the business. It allows them to access objective data and analyse the market environments, enabling companies to adapt rapidly and make decisions faster. Working on this topic will help students understand the present market and business conditions and help them analyse new solutions.
Implementing big data to understand consumer behaviour: In understanding consumer behaviour, big data is used to analyse the data points depicting a consumer’s journey after buying a product. Data gives a clearer picture in understanding specific scenarios. This topic will help understand the problems that businesses face in utilizing the insights and develop new strategies in the future to generate more ROI.
Applications of big data to predict future demand and forecasting: Predictive analytics in data science has emerged as an integral part of decision-making and demand forecasting. Working on this topic will enable the students to determine the significance of high-quality historical data analysis and the factors that drive higher demand in consumers.
The importance of data exploration over data analysis: Exploration enables a deeper understanding of the dataset, making it easier to navigate and use the data later. Intelligent analysts must understand and explore the differences between data exploration and analysis and use them according to specific needs to fulfill organizational requirements.
Data science and software engineering: Software engineering and development are a major part of data science. Skilled data professionals should learn and explore the possibilities of the various technical and software skills for performing critical AI and big data tasks.
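
For the federated learning topic in the list above, the idea of training across decentralized devices can be illustrated with a toy aggregation step. The sketch below uses federated averaging, one common aggregation scheme, chosen here only for illustration; it is not taken from the article. The function name, client parameters, and sample counts are hypothetical, and a real system would add secure aggregation, network communication, and repeated rounds of local training.

```python
def federated_average(client_weights, client_sizes):
    """Average client parameter vectors, weighted by each client's local sample count."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    averaged = [0.0] * n_params
    for weights, size in zip(client_weights, client_sizes):
        for i, w in enumerate(weights):
            # Clients with more local data contribute proportionally more.
            averaged[i] += w * size / total
    return averaged


# Hypothetical parameters reported by three devices after local training.
clients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sizes = [10, 30, 60]
global_weights = federated_average(clients, sizes)
print(global_weights)
```

Only the parameter vectors and sample counts leave each device in this sketch; the raw training data stays local, which is the property that makes the security questions raised in the topic worth studying.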

Thesis – Structure, Example and Writing Guide

Definition:

A thesis is a scholarly document that presents a student’s original research and findings on a particular topic or question. It is usually written as a requirement for a graduate degree program and is intended to demonstrate the student’s mastery of the subject matter and their ability to conduct independent research.

History of Thesis

The concept of a thesis can be traced back to ancient Greece, where it was used as a way for students to demonstrate their knowledge of a particular subject. However, the modern form of the thesis as a scholarly document used to earn a degree is a relatively recent development.

The origin of the modern thesis can be traced back to medieval universities in Europe. During this time, students were required to present a “disputation” in which they would defend a particular thesis in front of their peers and faculty members. These disputations served as a way to demonstrate the student’s mastery of the subject matter and were often the final requirement for earning a degree.

In the 17th century, the concept of the thesis was formalized further with the creation of the modern research university. Students were now required to complete a research project and present their findings in a written document, which would serve as the basis for their degree.

The modern thesis as we know it today has evolved over time, with different disciplines and institutions adopting their own standards and formats. However, the basic elements of a thesis – original research, a clear research question, a thorough review of the literature, and a well-argued conclusion – remain the same.

Structure of Thesis

The structure of a thesis may vary slightly depending on the specific requirements of the institution, department, or field of study, but generally, it follows a specific format.

Here’s a breakdown of the structure of a thesis:

Title Page

This is the first page of the thesis that includes the title of the thesis, the name of the author, the name of the institution, the department, the date, and any other relevant information required by the institution.

Abstract

This is a brief summary of the thesis that provides an overview of the research question, methodology, findings, and conclusions.

Table of Contents

This page provides a list of all the chapters and sections in the thesis and their page numbers.

Introduction

This chapter provides an overview of the research question, the context of the research, and the purpose of the study. The introduction should also outline the methodology and the scope of the research.

Literature Review

This chapter provides a critical analysis of the relevant literature on the research topic. It should demonstrate the gap in the existing knowledge and justify the need for the research.

Methodology

This chapter provides a detailed description of the research methods used to gather and analyze data. It should explain the research design, the sampling method, data collection techniques, and data analysis procedures.

Results

This chapter presents the findings of the research. It should include tables, graphs, and charts to illustrate the results.

Discussion

This chapter interprets the results and relates them to the research question. It should explain the significance of the findings and their implications for the research topic.

Conclusion

This chapter summarizes the key findings and the main conclusions of the research. It should also provide recommendations for future research.

References

This section provides a list of all the sources cited in the thesis. The citation style may vary depending on the requirements of the institution or the field of study.

Appendices

This section includes any additional material that supports the research, such as raw data, survey questionnaires, or other relevant documents.

How to write Thesis

Here are some steps to help you write a thesis:

Choose a Topic: The first step in writing a thesis is to choose a topic that interests you and is relevant to your field of study. You should also consider the scope of the topic and the availability of resources for research.
Develop a Research Question: Once you have chosen a topic, you need to develop a research question that you will answer in your thesis. The research question should be specific, clear, and feasible.
Conduct a Literature Review: Before you start your research, you need to conduct a literature review to identify the existing knowledge and gaps in the field. This will help you refine your research question and develop a research methodology.
Develop a Research Methodology: Once you have refined your research question, you need to develop a research methodology that includes the research design, data collection methods, and data analysis procedures.
Collect and Analyze Data: After developing your research methodology, you need to collect and analyze data. This may involve conducting surveys, interviews, experiments, or analyzing existing data.
Write the Thesis: Once you have analyzed the data, you need to write the thesis. The thesis should follow a specific structure that includes an introduction, literature review, methodology, results, discussion, conclusion, and references.
Edit and Proofread: After completing the thesis, you need to edit and proofread it carefully. You should also have someone else review it to ensure that it is clear, concise, and free of errors.
Submit the Thesis: Finally, you need to submit the thesis to your academic advisor or committee for review and evaluation.

Example of Thesis

Example of Thesis template for Students:

Title of Thesis

Table of Contents:

Chapter 1: Introduction

Chapter 2: Literature Review

Chapter 3: Research Methodology

Chapter 4: Results

Chapter 5: Discussion

Chapter 6: Conclusion

References:

Appendices:

Note: That’s just a basic template, but it should give you an idea of the structure and content that a typical thesis might include. Be sure to consult with your department or supervisor for any specific formatting requirements they may have. Good luck with your thesis!

Application of Thesis

A thesis is an important academic document that serves several purposes. Here are some of the applications of a thesis:

Academic Requirement: A thesis is a requirement for many academic programs, especially at the graduate level. It is an essential component of the evaluation process and demonstrates the student’s ability to conduct original research and contribute to the knowledge in their field.
Career Advancement: A thesis can also help in career advancement. Employers often value candidates who have completed a thesis as it demonstrates their research skills, critical thinking abilities, and their dedication to their field of study.
Publication: A thesis can serve as a basis for future publications in academic journals, books, or conference proceedings. It provides the researcher with an opportunity to present their research to a wider audience and contribute to the body of knowledge in their field.
Personal Development: Writing a thesis is a challenging task that requires time, dedication, and perseverance. It provides the student with an opportunity to develop critical thinking, research, and writing skills that are essential for their personal and professional development.
Impact on Society: The findings of a thesis can have an impact on society by addressing important issues, providing insights into complex problems, and contributing to the development of policies and practices.

Purpose of Thesis

The purpose of a thesis is to present original research findings in a clear and organized manner. It is a formal document that demonstrates a student’s ability to conduct independent research and contribute to the knowledge in their field of study. The primary purposes of a thesis are:

To Contribute to Knowledge: The main purpose of a thesis is to contribute to the knowledge in a particular field of study. By conducting original research and presenting their findings, the student adds new insights and perspectives to the existing body of knowledge.
To Demonstrate Research Skills: A thesis is an opportunity for the student to demonstrate their research skills. This includes the ability to formulate a research question, design a research methodology, collect and analyze data, and draw conclusions based on their findings.
To Develop Critical Thinking: Writing a thesis requires critical thinking and analysis. The student must evaluate existing literature and identify gaps in the field, as well as develop and defend their own ideas.
To Provide Evidence of Competence: A thesis provides evidence of the student’s competence in their field of study. It demonstrates their ability to apply theoretical concepts to real-world problems, and their ability to communicate their ideas effectively.
To Facilitate Career Advancement: Completing a thesis can help the student advance their career by demonstrating their research skills and dedication to their field of study. It can also provide a basis for future publications, presentations, or research projects.

When to Write Thesis

The timing for writing a thesis depends on the specific requirements of the academic program or institution. In most cases, the opportunity to write a thesis is offered at the graduate level, but there may be exceptions.

Generally, students should plan to write their thesis during the final year of their graduate program. This allows sufficient time for conducting research, analyzing data, and writing the thesis. It is important to start planning the thesis early and to identify a research topic and research advisor as soon as possible.

In some cases, students may be able to write a thesis as part of an undergraduate program or as an independent research project outside of an academic program. In such cases, it is important to consult with faculty advisors or mentors to ensure that the research is appropriately designed and executed.

Writing a thesis is time-consuming and requires significant effort and dedication, so plan accordingly and allocate sufficient time for conducting research, analyzing data, and writing.

Characteristics of Thesis

The characteristics of a thesis vary depending on the specific academic program or institution. However, some general characteristics of a thesis include:

Originality: A thesis should present original research findings or insights. It should demonstrate the student’s ability to conduct independent research and contribute to the knowledge in their field of study.
Clarity: A thesis should be clear and concise. It should present the research question, methodology, findings, and conclusions in a logical and organized manner. It should also be well-written, with proper grammar, spelling, and punctuation.
Research-Based: A thesis should be based on rigorous research, which involves collecting and analyzing data from various sources. The research should be well-designed, with appropriate research methods and techniques.
Evidence-Based: A thesis should be based on evidence, which means that all claims made in the thesis should be supported by data or literature. The evidence should be properly cited using appropriate citation styles.
Critical Thinking: A thesis should demonstrate the student’s ability to critically analyze and evaluate information. It should present the student’s own ideas and arguments, and engage with existing literature in the field.
Academic Style: A thesis should adhere to the conventions of academic writing. It should be well-structured, with clear headings and subheadings, and should use appropriate academic language.

Advantages of Thesis

There are several advantages to writing a thesis, including:

Development of Research Skills: Writing a thesis requires extensive research and analytical skills. It helps to develop the student’s research skills, including the ability to formulate research questions, design and execute research methodologies, collect and analyze data, and draw conclusions based on their findings.
Contribution to Knowledge: Writing a thesis provides an opportunity for the student to contribute to the knowledge in their field of study. By conducting original research, they can add new insights and perspectives to the existing body of knowledge.
Preparation for Future Research: Completing a thesis prepares the student for future research projects. It provides them with the necessary skills to design and execute research methodologies, analyze data, and draw conclusions based on their findings.
Career Advancement: Writing a thesis can help to advance the student’s career. It demonstrates their research skills and dedication to their field of study, and provides a basis for future publications, presentations, or research projects.
Personal Growth: Completing a thesis can be a challenging and rewarding experience. It requires dedication, hard work, and perseverance. It can help the student to develop self-confidence, independence, and a sense of accomplishment.

Limitations of Thesis

There are also some limitations to writing a thesis, including:

Time and Resources: Writing a thesis requires a significant amount of time and resources. It can be a time-consuming and expensive process, as it may involve conducting original research, analyzing data, and producing a lengthy document.
Narrow Focus: A thesis is typically focused on a specific research question or topic, which may limit the student’s exposure to other areas within their field of study.
Limited Audience: A thesis is usually only read by a small number of people, such as the student’s thesis advisor and committee members. This limits the potential impact of the research findings.
Lack of Real-World Application: Some thesis topics may be highly theoretical or academic in nature, which may limit their practical application in the real world.
Pressure and Stress: Writing a thesis can be a stressful and pressure-filled experience, as it may involve meeting strict deadlines, conducting original research, and producing a high-quality document.
Potential for Isolation: Writing a thesis can be a solitary experience, as the student may spend a significant amount of time working independently on their research and writing.

About the author

Muhammad Hassan

Researcher, Academic Writer, Web developer


Structuring your thesis

The best structure for your HDR thesis will depend on your discipline and the research you aim to communicate.

Before you begin writing your thesis, make sure you've read our advice on thesis preparation for information on the requirements you'll need to meet.

Once you've done this, you can begin to think about how to structure your thesis. To help you get started, we've outlined a basic structure below, but the requirements for your discipline may be different.

If you need help determining a suitable structure:

read other theses in your discipline – you can search for UQ theses on the Library website. For prime examples, search for theses that received commendations from their examiners
check with your advisor.

A basic thesis structure includes the following sections:

Abstract
Introduction and literature review
Methods
Results or findings
Discussion
Conclusion.

An abstract is a summary of your entire thesis and should provide a complete overview of the thesis, including your key results and findings.

An abstract is different to your introduction, and shouldn't be used to advertise your thesis — it should provide enough information to allow readers to understand what they'll learn by reading the thesis.

Your abstract should answer the following questions:

What did you do?
How did you do it?
Why was it worth doing?
What were the key results?
What are the implications or significance of the results?

As your abstract will have a word limit, you may be unable to answer every question in detail. If you find yourself running out of words, make sure you include your key findings before other information.

All theses require introductions and literature reviews, but the structure and location of these can vary.

In some cases, your literature review will be incorporated into the introduction. You may also review literature in other parts of your thesis, such as in the methods section.

Other options for structuring an introduction and literature review include:

a brief introductory chapter with a longer, separate literature review chapter
a long introductory chapter with a brief introductory section followed by literature review sections
a brief introductory chapter with detailed literature reviews relevant to the topic of each chapter provided separately in each chapter — this is common in a thesis comprised of publications.

If you have a separate introduction and literature review, they should complement, not repeat, each other.

The introduction should outline the background and significance of the broad area of study, as well as your:

general aims – what you intend to contribute to the understanding of a topic
specific objectives – which particular aspects of that topic you'll be investigating
the rationale for proceeding in the way that you did
your motivation or the justification for your research – the level of detail can vary depending on how much detail you will be including in a literature review.

The literature review should provide a more detailed analysis of research in the field, and present more specific aims or hypotheses for your research. What's expected for a literature review varies depending on your:

program – a PhD thesis requires a more extensive literature review than an MPhil thesis
discipline – analyse well-written examples from your discipline to learn the conventions for content and structure.

To get some ideas about how to structure and integrate your literature review, look at how to write a literature review and an example analysis of a literature review, or talk to your advisor.

A possible structure for your methods section is to include an introduction that provides a justification and explanation of the methodological approach you chose, followed by relevant sub-sections. Some standard sub-sections of a methods chapter include:

Participants
Procedures.

How the methods section is structured can depend on your discipline, so review other theses from your discipline for ideas for structure.

Regardless of structure, the methods section should explain:

how you collected and analysed your data – you only need to include enough detail that another expert in the field could repeat what you've done (you don't have to detail field standard techniques or tests)
why you chose to collect specific data
how this data will help you to answer your research questions
why you chose the approach you went with.

You may want to present your results separately to your discussion. If so, use the results section to:

specify the data you collected and how it was prepared for analysis
describe the data analysis (e.g. define the type of statistical test that was applied to the data)
describe the outcome of the analysis
present a summary and descriptive statistics in a table or graph.
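
As a rough illustration of the last point, the sketch below computes descriptive statistics for two groups and prints them as a plain-text table. It uses only Python's standard `statistics` module. The group names and values are invented for illustration, and the helper names `describe` and `format_table` are hypothetical; which statistics you report should follow the conventions of your discipline.

```python
import statistics

# Hypothetical measurements for two experimental conditions (illustrative only).
samples = {
    "baseline": [0.71, 0.74, 0.69, 0.73, 0.72],
    "proposed": [0.78, 0.81, 0.77, 0.80, 0.79],
}

def describe(values):
    """Return common descriptive statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample standard deviation
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }

def format_table(groups):
    """Render one row of descriptive statistics per group as plain text."""
    header = (
        f"{'group':<10}{'n':>4}{'mean':>8}{'sd':>8}"
        f"{'median':>8}{'min':>8}{'max':>8}"
    )
    rows = [header]
    for name, values in groups.items():
        s = describe(values)
        rows.append(
            f"{name:<10}{s['n']:>4}{s['mean']:>8.3f}{s['sd']:>8.3f}"
            f"{s['median']:>8.3f}{s['min']:>8.3f}{s['max']:>8.3f}"
        )
    return "\n".join(rows)

print(format_table(samples))
```

A table like this belongs in the results section; interpretation of any differences between the groups belongs in the discussion.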

Use tables and figures effectively

Reports usually include tables, graphs and other graphics to present data and supplement the text. To learn how to design and use these elements effectively, see our guides to:

incorporating tables, figures, statistics and equations (PDF, 1.2MB)
graphic presentation (PDF, 2.9MB).

Use the discussion section to:

comment on your results and explain what they mean
compare, contrast and relate your results back to theory or the findings of other studies
identify and explain any unexpected results
identify any limitations to your research and any questions that your research was unable to answer
discuss the significance or implications of your results.

If you find that your research ends up in a different direction to what you intended, it can help to explicitly acknowledge this and explain why in this section.

Use the conclusion section to:

emphasise that you've met your research aims
summarise the main findings of your research
restate the limitations of your research and make suggestions for further research.

In some cases, the discussion and conclusion sections can be combined. Check with your advisor if you want to combine these sections.


Survey Paper
Open access
Published: 01 July 2020

Cybersecurity data science: an overview from machine learning perspective

Iqbal H. Sarker (ORCID: orcid.org/0000-0003-1740-5517), A. S. M. Kayes, Shahriar Badsha, Hamed Alqahtani, Paul Watters & Alex Ng

Journal of Big Data volume 7, Article number: 41 (2020)


Abstract

In a computing context, cybersecurity is undergoing massive shifts in technology and operations, and data science is driving the change. Extracting security incident patterns or insights from cybersecurity data and building a corresponding data-driven model is the key to making a security system automated and intelligent. To understand and analyze the actual phenomena with data, various scientific methods, machine learning techniques, processes, and systems are used, which is commonly known as data science. In this paper, we focus on and briefly discuss cybersecurity data science, where the data is gathered from relevant cybersecurity sources and the analytics complement the latest data-driven patterns to provide more effective security solutions. The concept of cybersecurity data science makes the computing process more actionable and intelligent than traditional approaches in the domain of cybersecurity. We then discuss and summarize a number of associated research issues and future directions. Furthermore, we provide a machine learning based multi-layered framework for cybersecurity modeling. Overall, our goal is not only to discuss cybersecurity data science and relevant methods but also to focus on their applicability to data-driven intelligent decision making for protecting systems from cyber-attacks.

Introduction

Due to the increasing dependency on digitalization and the Internet-of-Things (IoT) [ 1 ], various security incidents such as unauthorized access [ 2 ], malware attack [ 3 ], zero-day attack [ 4 ], data breach [ 5 ], denial of service (DoS) [ 2 ], and social engineering or phishing [ 6 ] have grown at an exponential rate in recent years. For instance, in 2010 there were fewer than 50 million unique malware executables known to the security community. By 2012 that number had doubled to around 100 million, and in 2019 there were more than 900 million, a number that is likely to grow, according to the statistics of the AV-TEST institute in Germany [ 7 ]. Cybercrime and attacks can cause devastating financial losses and affect organizations and individuals alike. It is estimated that a data breach costs 8.19 million USD in the United States and 3.9 million USD on average [ 8 ], and that the annual cost of cybercrime to the global economy is 400 billion USD [ 9 ]. According to Juniper Research [ 10 ], the number of records breached each year is expected to nearly triple over the next 5 years. Thus, it is essential that organizations adopt and implement a strong cybersecurity approach to mitigate the loss. According to [ 11 ], the national security of a country depends on businesses, government, and individual citizens having access to highly secure applications and tools, and on the capability to detect and eliminate cyber-threats in a timely way. Therefore, effectively identifying cyber incidents, whether previously seen or unseen, and intelligently protecting the relevant systems from such cyber-attacks is a key issue to be solved urgently.
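
To make the growth claim concrete, the following sketch computes the compound annual growth rate implied by the malware counts quoted above. The text only gives bounds ("less than", "around", "more than"), so the point values used here are rounded assumptions, and the function name `annual_growth_rate` is hypothetical. Under these assumptions each interval works out to roughly 35 to 45 percent growth per year.

```python
# Approximate counts of known malware executables quoted in the text,
# in millions. The text gives bounds rather than exact figures, so these
# point values are rounded assumptions.
known_malware_millions = {2010: 50, 2012: 100, 2019: 900}

def annual_growth_rate(start_year, end_year, counts):
    """Compound annual growth rate between two years, as a fraction."""
    years = end_year - start_year
    ratio = counts[end_year] / counts[start_year]
    return ratio ** (1 / years) - 1

for start, end in [(2010, 2012), (2012, 2019), (2010, 2019)]:
    rate = annual_growth_rate(start, end, known_malware_millions)
    print(f"{start}-{end}: about {rate:.1%} per year")
```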

Fig. 1. Popularity trends of data science, machine learning and cybersecurity over time, where the x-axis represents the timestamp information and the y-axis represents the corresponding popularity values

Cybersecurity is a set of technologies and processes designed to protect computers, networks, programs and data from attack, damage, or unauthorized access [ 12 ]. In recent days, cybersecurity is undergoing massive shifts in technology and its operations in the context of computing, and data science (DS) is driving the change, where machine learning (ML), a core part of “Artificial Intelligence” (AI) can play a vital role to discover the insights from data. Machine learning can significantly change the cybersecurity landscape and data science is leading a new scientific paradigm [ 13 , 14 ]. The popularity of these related technologies is increasing day-by-day, which is shown in Fig. 1 , based on the data of the last five years collected from Google Trends [ 15 ]. The figure represents timestamp information in terms of a particular date in the x-axis and corresponding popularity in the range of 0 (minimum) to 100 (maximum) in the y-axis. As shown in Fig. 1 , the popularity indication values of these areas are less than 30 in 2014, while they exceed 70 in 2019, i.e., more than double in terms of increased popularity. In this paper, we focus on cybersecurity data science (CDS), which is broadly related to these areas in terms of security data processing techniques and intelligent decision making in real-world applications. Overall, CDS is security data-focused, applies machine learning methods to quantify cyber risks, and ultimately seeks to optimize cybersecurity operations. Thus, the purpose of this paper is for those academia and industry people who want to study and develop a data-driven smart cybersecurity model based on machine learning techniques. Therefore, great emphasis is placed on a thorough description of various types of machine learning methods, and their relations and usage in the context of cybersecurity. 
This paper does not describe all of the different techniques used in cybersecurity in detail; instead, it gives an overview of cybersecurity data science modeling based on artificial intelligence, particularly from machine learning perspective.

The ultimate goal of cybersecurity data science is data-driven intelligent decision making from security data for smart cybersecurity solutions. CDS represents a partial paradigm shift from traditional well-known security solutions such as firewalls, user authentication and access control, and cryptography systems, which might not be effective for today’s needs in the cyber industry [ 16 , 17 , 18 , 19 ]. The problem is that these are typically handled statically by a few experienced security analysts, with data management done in an ad-hoc manner [ 20 , 21 ]. However, as an increasing number of cybersecurity incidents in the different forms mentioned above continuously appear over time, such conventional solutions have encountered limitations in mitigating cyber risks. As a result, numerous advanced attacks are created and spread very quickly throughout the Internet. Although several researchers use various data analysis and learning techniques to build cybersecurity models, which are summarized in the “ Machine learning tasks in cybersecurity ” section, a comprehensive security model based on the effective discovery of security insights and the latest security patterns could be more useful. To address this issue, we need to develop more flexible and efficient security mechanisms that can respond to threats and update security policies to mitigate them intelligently and in a timely manner. To achieve this goal, it is necessary to analyze a massive amount of relevant cybersecurity data generated from various sources, such as network and system sources, and to discover insights or proper security policies in an automated manner with minimal human intervention.

Analyzing cybersecurity data and building the right tools and processes to successfully protect against cybersecurity incidents goes beyond a simple set of functional requirements and knowledge about risks, threats or vulnerabilities. For effectively extracting the insights or the patterns of security incidents, several machine learning techniques, such as feature engineering, data clustering, classification, and association analysis, or neural network-based deep learning techniques can be used, which are briefly discussed in “ Machine learning tasks in cybersecurity ” section. These learning techniques are capable to find the anomalies or malicious behavior and data-driven patterns of associated security incidents to make an intelligent decision. Thus, based on the concept of data-driven decision making, we aim to focus on cybersecurity data science , where the data is being gathered from relevant cybersecurity sources such as network activity, database activity, application activity, or user activity, and the analytics complement the latest data-driven patterns for providing corresponding security solutions.

The contributions of this paper are summarized as follows.

We first make a brief discussion on the concept of cybersecurity data science and relevant methods to understand its applicability towards data-driven intelligent decision making in the domain of cybersecurity. For this purpose, we also make a review and brief discussion on different machine learning tasks in cybersecurity, and summarize various cybersecurity datasets highlighting their usage in different data-driven cyber applications.

We then discuss and summarize a number of associated research issues and future directions in the area of cybersecurity data science, that could help both the academia and industry people to further research and development in relevant application areas.

Finally, we provide a generic multi-layered framework of the cybersecurity data science model based on machine learning techniques. In this framework, we briefly discuss how the cybersecurity data science model can be used to discover useful insights from security data and making data-driven intelligent decisions to build smart cybersecurity systems.

The remainder of the paper is organized as follows. “ Background ” section summarizes background of our study and gives an overview of the related technologies of cybersecurity data science. “ Cybersecurity data science ” section defines and discusses briefly about cybersecurity data science including various categories of cyber incidents data. In “ Machine learning tasks in cybersecurity ” section, we briefly discuss various categories of machine learning techniques including their relations with cybersecurity tasks and summarize a number of machine learning based cybersecurity models in the field. “ Research issues and future directions ” section briefly discusses and highlights various research issues and future directions in the area of cybersecurity data science. In “ A multi-layered framework for smart cybersecurity services ” section, we suggest a machine learning-based framework to build cybersecurity data science model and discuss various layers with their roles. In “ Discussion ” section, we highlight several key points regarding our studies. Finally, “ Conclusion ” section concludes this paper.

Background

In this section, we give an overview of the related technologies of cybersecurity data science including various types of cybersecurity incidents and defense strategies.

Cybersecurity

Over the last half-century, the information and communication technology (ICT) industry has evolved greatly and is now ubiquitous and closely integrated with modern society. Thus, protecting ICT systems and applications from cyber-attacks has become a major concern for security policymakers in recent days [ 22 ]. The act of protecting ICT systems from various cyber-threats or attacks has come to be known as cybersecurity [ 9 ]. Several aspects are associated with cybersecurity: measures to protect information and communication technology; the raw data and information it contains and their processing and transmitting; associated virtual and physical elements of the systems; the degree of protection resulting from the application of those measures; and eventually the associated field of professional endeavor [ 23 ]. Craigen et al. defined “cybersecurity as a set of tools, practices, and guidelines that can be used to protect computer networks, software programs, and data from attack, damage, or unauthorized access” [ 24 ]. According to Aftergood et al. [ 12 ], “cybersecurity is a set of technologies and processes designed to protect computers, networks, programs and data from attacks and unauthorized access, alteration, or destruction”. Overall, cybersecurity is concerned with understanding diverse cyber-attacks and devising corresponding defense strategies that preserve several properties, defined below [ 25 , 26 ].

Confidentiality is a property used to prevent the access and disclosure of information to unauthorized individuals, entities or systems.

Integrity is a property used to prevent any modification or destruction of information in an unauthorized manner.

Availability is a property used to ensure timely and reliable access of information assets and systems to an authorized entity.
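
As a small illustration of the integrity property, the sketch below records a cryptographic digest of some data and later checks whether the data still matches it. This is only one common mechanism, and it detects modification rather than preventing it. The record contents and the helper names `digest` and `is_unmodified` are hypothetical. The check is also only meaningful if the recorded digest is stored where an attacker cannot replace it; otherwise a keyed construction such as an HMAC would be needed.

```python
import hashlib
import hmac

def digest(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()

def is_unmodified(data: bytes, expected_digest: str) -> bool:
    """Check data against a digest recorded earlier.

    hmac.compare_digest performs a constant-time comparison.
    """
    return hmac.compare_digest(digest(data), expected_digest)

# Hypothetical record; the digest is stored while the data is known to be good.
original = b"account=1234;balance=100"
recorded = digest(original)

# The same record after an unauthorized change to the balance.
tampered = b"account=1234;balance=900"

print(is_unmodified(original, recorded))
print(is_unmodified(tampered, recorded))
```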

The term cybersecurity applies in a variety of contexts, from business to mobile computing, and can be divided into several common categories. These are: network security, which mainly focuses on securing a computer network from cyber attackers or intruders; application security, which is concerned with keeping software and devices free of risks or cyber-threats; information security, which mainly considers the security and privacy of relevant data; and operational security, which includes the processes of handling and protecting data assets. Typical cybersecurity systems are composed of network security systems and computer security systems containing a firewall, antivirus software, or an intrusion detection system [ 27 ].

Cyberattacks and security risks

The risk associated with any attack typically depends on three security factors: threats, i.e., who is attacking; vulnerabilities, i.e., the weaknesses they are attacking; and impacts, i.e., what the attack does [ 9 ]. A security incident is an act that threatens the confidentiality, integrity, or availability of information assets and systems. Several types of cybersecurity incidents may result in security risks to an organization’s systems and networks or to an individual [ 2 ]. These are:

Unauthorized access, which describes the act of accessing a network, systems, or data without authorization, resulting in a violation of a security policy [ 2 ];

Malware, also known as malicious software, is any program or software that is intentionally designed to cause damage to a computer, client, server, or computer network, e.g., botnets. Examples of different types of malware include computer viruses, worms, Trojan horses, adware, ransomware, spyware, malicious bots, etc. [ 3 , 26 ]. Ransom malware, or ransomware, is an emerging form of malware that prevents users from accessing their systems, personal files, or devices, and then demands an anonymous online payment in order to restore access;

Denial-of-Service is an attack meant to shut down a machine or network, making it inaccessible to its intended users by flooding the target with traffic that triggers a crash. The Denial-of-Service (DoS) attack typically uses one computer with an Internet connection, while distributed denial-of-service (DDoS) attack uses multiple computers and Internet connections to flood the targeted resource [ 2 ];

Phishing is a type of social engineering, a term used for a broad range of malicious activities accomplished through human interactions, in which a fraudulent attempt is made to obtain sensitive information such as banking and credit card details, login credentials, or personally identifiable information by disguising oneself as a trusted individual or entity via an electronic communication such as email, text, or instant message [ 26 ];

Zero-day attack is the term used to describe the threat of an unknown security vulnerability for which either no patch has been released or of which the application developers were unaware [ 4 , 28 ].

Besides the attacks mentioned above, privilege escalation [ 29 ], password attack [ 30 ], insider threat [ 31 ], man-in-the-middle [ 32 ], advanced persistent threat [ 33 ], SQL injection attack [ 34 ], cryptojacking attack [ 35 ], web application attack [ 30 ] etc. are well-known security incidents in the field of cybersecurity. A data breach, also known as a data leak, is another type of security incident, which involves the unauthorized access of data by an individual, application, or service [ 5 ]. Thus, all data breaches are security incidents, but not all security incidents are data breaches. Most data breaches occur in the banking industry, involving credit card numbers and personal information, followed by the healthcare sector and the public sector [ 36 ].

Cybersecurity defense strategies

Defense strategies are needed to protect data or information, information systems, and networks from cyber-attacks or intrusions. More granularly, they are responsible for preventing data breaches or security incidents and monitoring and reacting to intrusions, which can be defined as any kind of unauthorized activity that causes damage to an information system [ 37 ]. An intrusion detection system (IDS) is typically represented as “a device or software application that monitors a computer network or systems for malicious activity or policy violations” [ 38 ]. The traditional well-known security solutions such as anti-virus, firewalls, user authentication, access control, data encryption and cryptography systems, however, might not be effective for today’s needs in the cyber industry [ 16 , 17 , 18 , 19 ]. An IDS, on the other hand, addresses these issues by analyzing security data from several key points in a computer network or system [ 39 , 40 ]. Moreover, intrusion detection systems can be used to detect both internal and external attacks.

Intrusion detection systems fall into different categories according to their usage scope. For instance, host-based intrusion detection systems (HIDS) and network intrusion detection systems (NIDS) are the most common types, with scopes ranging from single computers to large networks. A HIDS monitors important files on an individual system, while a NIDS analyzes and monitors network connections for suspicious traffic. Similarly, based on methodology, signature-based IDS and anomaly-based IDS are the most well-known variants [ 37 ].

Signature-based IDS: A signature can be a predefined string, pattern, or rule that corresponds to a known attack. In a signature-based IDS, an attack is detected when a matching pattern is identified. Examples of signatures include known patterns or byte sequences in network traffic, or sequences used by malware. To detect attacks, anti-virus software uses such sequences or patterns as signatures while performing the matching operation. Signature-based IDS is also known as knowledge-based or misuse detection [ 41 ]. This technique can efficiently process a high volume of network traffic; however, it is strictly limited to known attacks. Thus, detecting new or unseen attacks is one of the biggest challenges faced by signature-based systems.

Anomaly-based IDS: The concept of anomaly-based detection overcomes the issues of signature-based IDS discussed above. In an anomaly-based intrusion detection system, the behavior of the network is first examined to find dynamic patterns and to automatically create a data-driven model that profiles normal behavior; deviations from this profile are then detected as anomalies [ 41 ]. Thus, anomaly-based IDS can be treated as a dynamic approach that follows behavior-oriented detection. The main advantage of anomaly-based IDS is the ability to identify unknown or zero-day attacks [ 42 ]. However, an identified anomaly or abnormal behavior is not always an indicator of intrusion; it may sometimes result from other factors such as policy changes or the offering of a new service.

In addition, a hybrid detection approach [ 43 , 44 ] that takes into account both the misuse and anomaly-based techniques discussed above can be used to detect intrusions. In a hybrid system, the misuse detection system is used for detecting known types of intrusions and the anomaly detection system is used for novel attacks [ 45 ]. Besides these approaches, stateful protocol analysis can also be used to detect intrusions; it identifies deviations of protocol state similarly to the anomaly-based method, but it uses predetermined universal profiles based on accepted definitions of benign activity [ 41 ]. In Table 1 , we have summarized these common approaches, highlighting their pros and cons. Once detection has been completed, an intrusion prevention system (IPS), which is intended to prevent malicious events, can be used to mitigate the risks in different ways, such as a manual process, providing notification, or an automatic process [ 46 ]. Among these approaches, an automatic response system could be more effective as it does not involve a human interface between the detection and response systems.
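
The signature-based, anomaly-based, and hybrid ideas described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a description of any real IDS: the signature strings, the single "requests per minute" feature, the three-standard-deviation threshold, and all names (`SIGNATURES`, `AnomalyDetector`, `hybrid_detect`) are hypothetical.

```python
import statistics

# Hypothetical byte-sequence signatures of known attacks (illustrative only).
SIGNATURES = {
    "sql-injection": b"' OR 1=1",
    "path-traversal": b"../../",
}

def match_signature(payload: bytes):
    """Misuse detection: return the name of the first known signature found."""
    for name, pattern in SIGNATURES.items():
        if pattern in payload:
            return name
    return None

class AnomalyDetector:
    """Profile normal behavior as the mean and standard deviation of one feature."""

    def __init__(self, normal_values, threshold=3.0):
        self.mean = statistics.mean(normal_values)
        self.stdev = statistics.stdev(normal_values)
        self.threshold = threshold

    def is_anomalous(self, value):
        """Flag values more than `threshold` standard deviations from the mean."""
        return abs(value - self.mean) > self.threshold * self.stdev

def hybrid_detect(payload: bytes, requests_per_minute: float, detector):
    """Try known signatures first, then fall back to the anomaly profile."""
    name = match_signature(payload)
    if name is not None:
        return f"known attack: {name}"
    if detector.is_anomalous(requests_per_minute):
        return "anomaly: possible novel attack"
    return "benign"

# Request rates observed during normal operation (made-up numbers).
detector = AnomalyDetector([48, 52, 50, 47, 53, 49, 51])

print(hybrid_detect(b"GET /item?id=1' OR 1=1", 50, detector))
print(hybrid_detect(b"GET /index.html", 400, detector))
print(hybrid_detect(b"GET /index.html", 51, detector))
```

The second call shows both the strength and the weakness of the anomaly-based component: an unusual request rate is flagged without any signature, but the same flag would be raised by a legitimate traffic spike, for example after a new service is launched.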

Data science

We are living in the age of data, advanced analytics, and data science, which are related to data-driven intelligent decision making. Although the process of searching for patterns or discovering hidden and interesting knowledge from data is known as data mining [ 47 ], in this paper we use the broader term “data science” rather than data mining. The reason is that data science, in its most fundamental form, is all about understanding data. It involves studying, processing, and extracting valuable insights from a set of information. In addition to data mining, data analytics is also related to data science. The development of data mining, knowledge discovery, and machine learning, which refers to creating algorithms and programs that learn on their own, together with the original data analysis and descriptive analytics from the statistical perspective, forms the general concept of “data analytics” [ 47 ]. Nowadays, many researchers use the term “data science” to describe the interdisciplinary field of data collection, preprocessing, inferring, or making decisions by analyzing the data. To understand and analyze the actual phenomena with data, various scientific methods, machine learning techniques, processes, and systems are used, which is commonly known as data science. According to Cao et al. [ 47 ] “data science is a new interdisciplinary field that synthesizes and builds on statistics, informatics, computing, communication, management, and sociology to study data and its environments, to transform data to insights and decisions by following a data-to-knowledge-to-wisdom thinking and methodology”. As a high-level statement in the context of cybersecurity, we can conclude that it is the study of security data to provide data-driven solutions for given security problems, also known as “the science of cybersecurity data”.
Figure 2 shows the typical data-to-insight-to-decision transfer at different periods and general analytic stages in data science, in terms of a variety of analytics goals (G) and approaches (A) to achieve the data-to-decision goal [ 47 ].

Fig. 2. Data-to-insight-to-decision analytic stages in data science [ 47 ]

Given its analytic power, data science, including machine learning techniques, can be a viable component of security strategies. By using data science techniques, security analysts can manipulate and analyze security data more effectively and efficiently, uncovering valuable insights from data. Thus, data science methodologies including machine learning techniques can be well utilized in the context of cybersecurity, in terms of problem understanding, gathering security data from diverse sources, preparing data to feed into the model, and data-driven model building and updating, for providing smart security services, which motivates us to define cybersecurity data science and to work in this research area.

Cybersecurity data science

In this section, we briefly discuss cybersecurity data science, including various categories of cyber incident data and their usage in different application areas, and the key terms and areas related to our study.

Understanding cybersecurity data

Data science is largely driven by the availability of data [ 48 ]. Cybersecurity data science is based on datasets, which typically represent a collection of information records consisting of several attributes or features and related facts. It is therefore important to understand the nature of cybersecurity data, which contains various types of cyberattacks and relevant features. The reason is that raw security data collected from relevant cyber sources can be used to analyze patterns of security incidents or malicious behavior and to build a data-driven security model. Several datasets exist in the area of cybersecurity, covering intrusion analysis, malware analysis, and anomaly, fraud, or spam analysis, and they are used for various purposes. In Table 2 , we summarize several such datasets that are accessible on the Internet, including their features and attacks, and highlight their usage with machine learning techniques in different cyber applications. Effectively analyzing and processing these security features, building a target machine learning-based security model according to the requirements, and eventually making data-driven decisions could all contribute to intelligent cybersecurity services, which are discussed briefly in the “ A multi-layered framework for smart cybersecurity services ” section.

Defining cybersecurity data science

Data science is transforming the world’s industries. It is critically important for the future of intelligent cybersecurity systems and services because “security is all about data”. When we seek to detect cyber threats, we analyze security data in the form of files, logs, network packets, or other relevant sources. Traditionally, security professionals did not use data science techniques to make detections based on these data sources. Instead, they used file hashes, custom-written rules such as signatures, or manually defined heuristics [ 21 ]. Although these techniques have their own merits in several cases, they require too much manual work to keep up with the changing cyber threat landscape. Data science, by contrast, can bring a massive shift in technology and its operations: machine learning algorithms can learn or extract insight into security incident patterns from training data for their detection and prevention. For instance, these techniques can be used to detect malware or suspicious trends, or to extract policy rules.

Recently, the entire security industry has been moving towards data science because of its capability to transform raw data into decisions. Several data-driven tasks are involved: (i) data engineering, which focuses on practical applications of data gathering and analysis; (ii) data volume reduction, which filters significant and relevant data for further analysis; (iii) discovery and detection, which focuses on extracting insight, incident patterns, or knowledge from data; (iv) automated models, which focus on building data-driven intelligent security models; (v) targeted security alerts, which focus on generating remarkable security alerts from discovered knowledge while minimizing false alerts; and (vi) resource optimization, which deals with using the available resources to achieve the target goals in a security system. When making data-driven decisions, behavioral analysis can also play a significant role in the domain of cybersecurity [ 81 ].

Thus, the concept of cybersecurity data science incorporates the methods and techniques of data science and machine learning as well as the behavioral analytics of various security incidents. The combination of these technologies has given birth to the term “cybersecurity data science”, which refers to collecting a large amount of security event data from different sources and analyzing it with machine learning technologies to detect security risks or attacks, either through the discovery of useful insights or through the latest data-driven patterns. It is, however, worth remembering that cybersecurity data science is not just a collection of machine learning algorithms; rather, it is a process that can help security professionals or analysts scale and automate their security activities in a smart and timely manner. Therefore, the formal definition can be as follows: “Cybersecurity data science is a research or working area existing at the intersection of cybersecurity, data science, and machine learning or artificial intelligence, which is mainly security data-focused, applies machine learning methods, attempts to quantify cyber-risks or incidents, and promotes inferential techniques to analyze behavioral patterns in security data. It also focuses on generating security response alerts, and eventually seeks for optimizing cybersecurity solutions, to build automated and intelligent cybersecurity systems.”

Table 3 highlights some key terms associated with cybersecurity data science. Overall, the outputs of cybersecurity data science are typically security data products, which can be a data-driven security model, policy rule discovery, risk or attack prediction, potential security service and recommendation, or the corresponding security system depending on the given security problem in the domain of cybersecurity. In the next section, we briefly discuss various machine learning tasks with examples within the scope of our study.

Machine learning tasks in cybersecurity

Machine learning (ML) is typically considered a branch of “Artificial Intelligence” that is closely related to computational statistics, data mining and analytics, and data science, and that focuses on making computers learn from data [ 82 , 83 ]. Machine learning models typically comprise a set of rules, methods, or complex “transfer functions” that can be applied to find interesting data patterns, or to recognize or predict behavior [ 84 ], which could play an important role in the area of cybersecurity. In the following, we discuss different methods that can be used to solve machine learning tasks and how they relate to cybersecurity tasks.

Supervised learning

Supervised learning is performed when specific targets are defined to be reached from a certain set of inputs, i.e., a task-driven approach. In machine learning, the most popular supervised learning techniques are classification and regression methods [ 129 ]. These techniques are widely used to classify or predict outcomes for a particular security problem. For instance, classification techniques can be used in the cybersecurity domain to predict a denial-of-service attack (yes, no) or to identify different classes of network attacks such as scanning and spoofing. ZeroR [ 83 ], OneR [ 130 ], Naive Bayes [ 131 ], Decision Tree [ 132 , 133 ], K-nearest neighbors [ 134 ], support vector machines [ 135 ], adaptive boosting [ 136 ], and logistic regression [ 137 ] are well-known classification techniques. In addition, Sarker et al. have recently proposed the BehavDT [ 133 ] and IntruDtree [ 106 ] classification techniques, which are able to effectively build a data-driven predictive model. Regression techniques, on the other hand, are useful for predicting a continuous or numeric value, e.g., the total number of phishing attacks in a certain period or network packet parameters. Regression analyses can also be used to detect the root causes of cybercrime and other types of fraud [ 138 ]. Linear regression [ 82 ] and support vector regression [ 135 ] are popular regression techniques. The main difference between classification and regression is that the output variable in regression is numerical or continuous, while the predicted output for classification is categorical or discrete. Ensemble learning extends supervised learning by combining several simple models, e.g., Random Forest learning [ 139 ], which generates multiple decision trees to solve a particular security task.
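To make the classification setting concrete, the following is a minimal sketch of a K-nearest neighbors classifier for a yes/no denial-of-service label. It is an illustration only and is not taken from any of the cited works; the two features (packets per second, distinct source ports) and all data values are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training records."""
    # Euclidean distance from the query to every labeled training record
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    # Majority vote over the labels of the k closest records
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy features: (packets per second, distinct source ports)
train_X = [(10, 3), (12, 4), (15, 2), (900, 80), (1100, 95), (950, 70)]
# Hypothetical labels: is the record part of a denial-of-service attack?
train_y = ["no", "no", "no", "yes", "yes", "yes"]

print(knn_predict(train_X, train_y, (1000, 85)))  # yes
print(knn_predict(train_X, train_y, (11, 3)))     # no
```

Because the predicted output is a discrete label, this is a classification task; predicting a numeric quantity from the same features would instead be a regression task.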

Unsupervised learning

In unsupervised learning problems, the main task is to find patterns, structures, or knowledge in unlabeled data, i.e., a data-driven approach [ 140 ]. In the area of cybersecurity, cyber-attacks such as malware stay hidden in various ways, including changing their behavior dynamically and autonomously to avoid detection. Clustering techniques, a type of unsupervised learning, can help to uncover hidden patterns and structures in datasets and so identify indicators of such sophisticated attacks. Similarly, clustering techniques can be useful for identifying anomalies and policy violations and for detecting and eliminating noisy instances in data. K-means [ 141 ] and K-medoids [ 142 ] are popular partitioning clustering algorithms, and single linkage [ 143 ] and complete linkage [ 144 ] are well-known hierarchical clustering algorithms used in various application domains. Moreover, a bottom-up clustering approach proposed by Sarker et al. [ 145 ] can also be used by taking into account the data characteristics.
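As an illustration of partitioning clustering on unlabeled security data, the sketch below implements the basic K-means iteration (Lloyd's algorithm) in plain Python. The per-host features and the choice of initial centroids are hypothetical assumptions made for the example; a practical system would typically use a library implementation with a more careful initialization.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Basic K-means: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda i: math.dist(p, centroids[i])
            )
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean of the cluster
        # (an empty cluster keeps its previous centroid)
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster))
            if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical per-host features: (logins per hour, MB uploaded per hour)
points = [(2, 1), (3, 2), (2, 2), (40, 90), (42, 95), (3, 1)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])

for centroid, cluster in zip(centroids, clusters):
    print(centroid, len(cluster))
```

In this toy data, the small cluster with unusually high activity is the kind of group an analyst might inspect first as a possible anomaly; the clustering itself does not label it as malicious.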

Besides, feature engineering tasks such as optimal feature selection or extraction for a particular security problem can be useful for further analysis [ 106 ]. Recently, Sarker et al. [ 106 ] proposed an approach for selecting security features according to their importance score values. Moreover, principal component analysis, linear discriminant analysis, Pearson correlation analysis, and non-negative matrix factorization are popular dimensionality reduction techniques for such problems [ 82 ]. Association rule learning is another example, where machine learning-based policy rules can prevent cyber-attacks. In an expert system, the rules are usually defined manually by a knowledge engineer working in collaboration with a domain expert [ 37 , 140 , 146 ]. Association rule learning, by contrast, is the discovery of rules or relationships among a set of available security features or attributes in a given dataset [ 147 ]. Correlation analysis can be used to quantify the strength of these relationships [ 138 ]. Many association rule mining algorithms have been proposed in the machine learning and data mining literature, such as logic-based [ 148 ], frequent pattern-based [ 149 , 150 , 151 ], and tree-based [ 152 ] algorithms. Recently, Sarker et al. [ 153 ] proposed an association rule learning approach that considers non-redundant rule generation and can be used to discover a set of useful security policy rules. Moreover, AIS [ 147 ], Apriori [ 149 ], Apriori-TID and Apriori-Hybrid [ 149 ], FP-Tree [ 152 ], RARM [ 154 ], and Eclat [ 155 ] are well-known association rule learning algorithms that are capable of solving such problems by generating a set of policy rules in the domain of cybersecurity.
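The notions of support and confidence that underlie association rule learning can be sketched as follows. This is a brute-force enumeration of single-antecedent rules over a hypothetical set of security events, intended only to show how candidate policy rules are scored; it is not an implementation of Apriori or of any other algorithm cited above, and the thresholds are arbitrary assumptions.

```python
from itertools import combinations


def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def pairwise_rules(transactions, min_support=0.4, min_confidence=0.8):
    """Enumerate rules lhs -> rhs over item pairs that pass both thresholds."""
    items = sorted(set().union(*transactions))
    found = []
    for a, b in combinations(items, 2):
        for lhs, rhs in ((a, b), (b, a)):
            s = support(transactions, {lhs, rhs})
            if s < min_support:
                continue
            # confidence(lhs -> rhs) = support(lhs and rhs) / support(lhs)
            confidence = s / support(transactions, {lhs})
            if confidence >= min_confidence:
                found.append((lhs, rhs, s, confidence))
    return found


# Hypothetical security events, each described by a set of attribute values
events = [
    {"port=22", "failed_login", "blocked"},
    {"port=22", "failed_login", "blocked"},
    {"port=80", "allowed"},
    {"port=22", "failed_login", "blocked"},
    {"port=22", "allowed"},
]

for lhs, rhs, s, c in pairwise_rules(events):
    print(f"IF {lhs} THEN {rhs} (support={s:.2f}, confidence={c:.2f})")
```

Even on this tiny example, several of the returned rules express the same underlying relationship, which hints at the redundancy problem that motivates non-redundant rule generation.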

Neural networks and deep learning

Deep learning is a part of machine learning in the area of artificial intelligence; it is a computational model inspired by the biological neural networks in the human brain [ 82 ]. The Artificial Neural Network (ANN) is frequently used in deep learning, and the most popular neural network algorithm is backpropagation [ 82 ]. It performs learning on a multi-layer feed-forward neural network consisting of an input layer, one or more hidden layers, and an output layer. The main difference between deep learning and classical machine learning is how performance changes as the amount of security data increases. Typically, deep learning algorithms perform well when the data volumes are large, whereas classical machine learning algorithms perform comparatively better on small datasets [ 44 ]. In our earlier work, Sarker et al. [ 129 ], we illustrated the effectiveness of these approaches on contextual datasets. Deep learning approaches mimic the human brain mechanism to interpret large amounts of data or complex data such as images, sounds, and texts [ 44 , 129 ]. In terms of feature extraction for model building, deep learning reduces the effort of designing a feature extractor for each problem compared with classical machine learning techniques. Besides these characteristics, a deep learning algorithm typically takes longer to train than a classical machine learning algorithm, whereas the opposite holds at test time [ 44 ]. Thus, deep learning relies more on high-performance machines with GPUs than classical machine learning algorithms do [ 44 , 156 ]. The most popular deep neural network learning models include the multi-layer perceptron (MLP) [ 157 ], convolutional neural network (CNN) [ 158 ], and recurrent neural network (RNN) or long short-term memory (LSTM) network [ 121 , 158 ]. Recently, researchers have used these deep learning techniques for different purposes in the domain of cybersecurity, such as detecting network intrusions and detecting and classifying malware traffic [ 44 , 159 ].

Other learning techniques

Semi-supervised learning can be described as a hybridization of the supervised and unsupervised techniques discussed above, as it works on both labeled and unlabeled data. In the area of cybersecurity, it could be useful when data must be labeled automatically without human intervention to improve the performance of cybersecurity models. Reinforcement learning is another type of machine learning, in which an agent creates its own learning experiences by interacting directly with the environment, i.e., an environment-driven approach. The environment is typically formulated as a Markov decision process, and the agent takes decisions based on a reward function [ 160 ]. Monte Carlo learning, Q-learning, and Deep Q Networks are the most common reinforcement learning algorithms [ 161 ]. For instance, in a recent work [ 126 ], the authors present an approach for detecting botnet traffic or malicious cyber activities using reinforcement learning combined with a neural network classifier. In another work [ 128 ], the authors discuss the application of deep reinforcement learning to intrusion detection for supervised problems, where they obtained the best results with the Deep Q-Network algorithm. In the context of cybersecurity, genetic algorithms, which use fitness, selection, crossover, and mutation to search for optimal solutions, could also be used to solve a similar class of learning problems [ 119 ].
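The reward-driven update at the heart of tabular Q-learning can be sketched in a few lines. The states, actions, reward value, learning rate, and discount factor below are hypothetical choices for illustration and do not reproduce the setups of the cited works.

```python
def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    """
    best_next = max(q[next_state].values())
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
    return q[state][action]


# Hypothetical environment: two traffic states and two responses
q = {
    "normal": {"allow": 0.0, "block": 0.0},
    "suspicious": {"allow": 0.0, "block": 0.0},
}

# The agent blocks suspicious traffic, receives a reward of 1.0,
# and observes that the traffic state returns to normal
first = q_update(q, "suspicious", "block", reward=1.0, next_state="normal")
second = q_update(q, "suspicious", "block", reward=1.0, next_state="normal")
print(first, second)  # 0.5 0.75
```

Repeated rewarded experience moves the value of blocking in the suspicious state towards the reward, which is how the agent comes to prefer that action without labeled training data.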

Various types of machine learning techniques discussed above can be useful in the domain of cybersecurity, to build an effective security model. In Table 4 , we have summarized several machine learning techniques that are used to build various types of security models for various purposes. Although these models typically represent a learning-based security model, in this paper, we aim to focus on a comprehensive cybersecurity data science model and relevant issues, in order to build a data-driven intelligent security system. In the next section, we highlight several research issues and potential solutions in the area of cybersecurity data science.

Research issues and future directions

Our study opens several research issues and challenges in the area of cybersecurity data science to extract insight from relevant data towards data-driven intelligent decision making for cybersecurity solutions. In the following, we summarize these challenges ranging from data collection to decision making.

Cybersecurity datasets : Source datasets are the primary component for work in the area of cybersecurity data science. Most of the existing datasets are old and might be insufficient for understanding the recent behavioral patterns of various cyber-attacks. Although the data can be transformed into a meaningful level of understanding after several processing tasks, there is still a lack of understanding of the characteristics of recent attacks and their patterns of occurrence. As a result, further processing or machine learning algorithms may yield a low accuracy rate for the target decisions. Establishing a large number of recent datasets for a particular problem domain, such as cyber risk prediction or intrusion detection, is therefore needed and could be one of the major challenges in cybersecurity data science.

Handling quality problems in cybersecurity datasets : Cyber datasets might be noisy, incomplete, insignificant, or imbalanced, or may contain inconsistent instances related to a particular security incident. Such problems in a dataset may affect the quality of the learning process and degrade the performance of machine learning-based models [ 162 ]. To make data-driven intelligent decisions for cybersecurity solutions, such problems need to be dealt with effectively before building the cyber models. Therefore, understanding such problems in cyber data and handling them effectively, using existing or newly proposed algorithms for a particular problem domain such as malware analysis or intrusion detection and prevention, is needed and could be another research issue in cybersecurity data science.

Security policy rule generation : Security policy rules reference security zones and enable a user to allow, restrict, and track traffic on the network based on the corresponding user or user group, and the service or application. During execution, the policy rules, both general and more specific, are compared against the incoming traffic in sequence, and the rule that matches the traffic is applied. The policy rules used in most cybersecurity systems are static and are generated by human expertise or are ontology-based [ 163 , 164 ]. Although association rule learning techniques produce rules from data, they suffer from redundant rule generation [ 153 ], which makes the policy rule-set complex. Therefore, understanding such problems in policy rule generation and handling them effectively, using existing or newly proposed algorithms for a particular problem domain such as access control [ 165 ], is needed and could be another research issue in cybersecurity data science.
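The sequential, first-match evaluation of policy rules described above can be sketched as follows. The rule fields, the example rules, and the default action of denying unmatched traffic are assumptions made for this illustration, not properties of any particular product.

```python
def first_match(rules, traffic, default="deny"):
    """Return the action of the first rule whose conditions all match the traffic."""
    for conditions, action in rules:
        if all(traffic.get(field) == value for field, value in conditions.items()):
            return action
    # Assumption for this sketch: traffic matching no rule is denied
    return default


# Hypothetical static rule set, ordered from more specific to more general
rules = [
    ({"user_group": "admins", "service": "ssh"}, "allow"),
    ({"service": "ssh"}, "deny"),
    ({"service": "https"}, "allow"),
]

print(first_match(rules, {"user_group": "admins", "service": "ssh"}))  # allow
print(first_match(rules, {"user_group": "guests", "service": "ssh"}))  # deny
print(first_match(rules, {"user_group": "guests", "service": "ftp"}))  # deny
```

Because evaluation stops at the first match, rule order matters, and redundant or overlapping rules make such a rule set harder to maintain.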

Hybrid learning method : Most commercial products in the cybersecurity domain contain signature-based intrusion detection techniques [ 41 ]. However, missing features or insufficient profiling can cause these techniques to miss unknown attacks. In that case, anomaly-based detection techniques, or a hybrid technique combining signature-based and anomaly-based detection, can be used to overcome such issues. A hybrid technique combining multiple learning techniques, or a combination of deep learning and machine learning methods, can be used to extract the target insight for a particular problem domain such as intrusion detection, malware analysis, or access control, and to make intelligent decisions for the corresponding cybersecurity solutions.
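One simple way to combine the two detection styles is to run a signature lookup first and fall back to a statistical anomaly check. The sketch below uses a z-score on transferred bytes as the anomaly test; the event fields, the placeholder hash, the baseline values, and the threshold are all hypothetical, and real hybrid systems use considerably richer models.

```python
from statistics import mean, pstdev


def hybrid_detect(event, known_bad_hashes, baseline_bytes, z_threshold=3.0):
    """Signature check first; fall back to a z-score anomaly check."""
    # Stage 1: signature-based detection against known indicators
    if event["payload_hash"] in known_bad_hashes:
        return "signature-match"
    # Stage 2: anomaly-based detection relative to a baseline of normal traffic
    mu = mean(baseline_bytes)
    sigma = pstdev(baseline_bytes)
    if sigma > 0 and abs(event["bytes"] - mu) / sigma > z_threshold:
        return "anomaly"
    return "benign"


# Hypothetical indicators and baseline of normal transfer sizes
known_bad_hashes = {"hash-of-known-malware"}
baseline_bytes = [100, 110, 90, 105, 95]

print(hybrid_detect({"payload_hash": "hash-of-known-malware", "bytes": 100},
                    known_bad_hashes, baseline_bytes))  # signature-match
print(hybrid_detect({"payload_hash": "unseen", "bytes": 500},
                    known_bad_hashes, baseline_bytes))  # anomaly
print(hybrid_detect({"payload_hash": "unseen", "bytes": 102},
                    known_bad_hashes, baseline_bytes))  # benign
```

The second case shows the benefit of the hybrid arrangement: an event with no known signature is still flagged because it deviates strongly from the baseline.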

Protecting the valuable security information : Another consequence of a cyber data attack is the loss of extremely valuable data and information, which could be damaging for an organization. With encryption or highly complex signatures, one can stop others from probing into a dataset. In such cases, cybersecurity data science can be used to build a data-driven impenetrable protocol to protect such security information. To achieve this goal, cyber analysts can develop algorithms that analyze the history of cyberattacks to detect the most frequently targeted chunks of data. Thus, understanding such data protection problems and designing corresponding algorithms to handle them effectively could be another research issue in the area of cybersecurity data science.

Context-awareness in cybersecurity : Existing cybersecurity work mainly originates from relevant cyber data containing several low-level features. When data mining and machine learning techniques are applied to such datasets, a related pattern can be identified that describes the data properly. However, broader contextual information [ 140 , 145 , 166 ], such as temporal and spatial context, relationships among events or connections, and dependencies, can be used to decide whether a suspicious activity exists. For instance, some approaches may consider individual connections as DoS attacks, while security experts might not treat them as malicious by themselves. Thus, a significant limitation of existing cybersecurity work is that it does not use contextual information for predicting risks or attacks. Therefore, context-aware adaptive cybersecurity solutions could be another research issue in cybersecurity data science.

Feature engineering in cybersecurity : The efficiency and effectiveness of a machine learning-based security model has always been a major challenge due to the high volume of network data with a large number of traffic features. The high dimensionality of data has been addressed using several techniques, such as principal component analysis (PCA) [ 167 ] and singular value decomposition (SVD) [ 168 ]. In addition to the low-level features in the datasets, the contextual relationships between suspicious activities might be relevant. Such contextual data can be stored in an ontology or taxonomy for further processing. Thus, how to effectively select the optimal features or extract the significant features, considering both the low-level and the contextual features, for effective cybersecurity solutions could be another research issue in cybersecurity data science.

Remarkable security alert generation and prioritizing : In many cases, the cybersecurity system may not be well defined and may cause a substantial number of false alarms, which are unexpected in an intelligent system. For instance, an IDS deployed in a real-world network generates around nine million alerts per day [ 169 ]. A network-based intrusion detection system typically inspects the incoming traffic for matches with known patterns to detect risks, threats, or vulnerabilities and to generate security alerts. However, responding to every such alert might not be effective, as it consumes large amounts of time and resources and may consequently result in a self-inflicted DoS. To overcome this problem, high-level management is required that correlates the security alerts, considering the current context and their logical relationships, and prioritizes them before reporting them to users. This could be another research issue in cybersecurity data science.
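A very simple form of alert correlation and prioritization is to merge alerts that share a source and signature and then rank the merged groups. The sketch below ranks by severity first and by repetition count second; the alert fields, addresses, severity scale, and ranking criterion are hypothetical assumptions, and context-aware correlation as discussed above would need considerably more information.

```python
from collections import Counter


def prioritize(alerts, top=2):
    """Merge alerts with the same (source, signature) and rank the groups."""
    counts = Counter((a["src"], a["signature"]) for a in alerts)
    severity = {(a["src"], a["signature"]): a["severity"] for a in alerts}
    # Rank by severity first, then by how often the alert was raised
    ranked = sorted(counts, key=lambda k: (severity[k], counts[k]), reverse=True)
    return [(key, counts[key]) for key in ranked[:top]]


# Hypothetical raw alerts from an intrusion detection system
alerts = [
    {"src": "10.0.0.5", "signature": "port-scan", "severity": 2},
    {"src": "10.0.0.5", "signature": "port-scan", "severity": 2},
    {"src": "10.0.0.7", "signature": "ping", "severity": 1},
    {"src": "10.0.0.9", "signature": "sql-injection", "severity": 5},
    {"src": "10.0.0.5", "signature": "port-scan", "severity": 2},
    {"src": "10.0.0.7", "signature": "ping", "severity": 1},
]

for (src, signature), count in prioritize(alerts):
    print(src, signature, count)
```

Six raw alerts collapse into three groups, of which only the two highest-ranked are reported, which illustrates how correlation reduces the volume an analyst has to handle.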

Recency analysis in cybersecurity solutions : Machine learning-based security models typically use a large amount of static data to generate data-driven decisions. Anomaly detection systems rely on constructing such a model of normal behavior and anomalies according to their patterns. However, normal behavior in a large and dynamic security system is not well defined and may change over time, which can be viewed as an incrementally growing dataset. The patterns in incremental datasets might change in several cases, which often results in a substantial number of false alarms, known as false positives. Thus, a recent malicious behavioral pattern is more likely to be interesting and significant than older ones for predicting unknown attacks. Therefore, effectively using the concept of recency analysis [ 170 ] in cybersecurity solutions could be another issue in cybersecurity data science.

The most important work for an intelligent cybersecurity system is to develop an effective framework that supports data-driven decision making. Such a framework needs to incorporate advanced data analysis based on machine learning techniques so that it can minimize these issues and provide automated and intelligent security services. Thus, designing a security framework for cybersecurity data and evaluating it experimentally is a very important direction and a big challenge as well. In the next section, we suggest and discuss a data-driven cybersecurity framework based on machine learning techniques with multiple processing layers.

A multi-layered framework for smart cybersecurity services

As discussed earlier, cybersecurity data science is data-focused, applies machine learning methods, attempts to quantify cyber risks, promotes inferential techniques to analyze behavioral patterns, focuses on generating security response alerts, and eventually seeks to optimize cybersecurity operations. Hence, we briefly discuss a framework with multiple data processing layers that can potentially be used to discover security insights from raw data and to build smart cybersecurity systems, e.g., dynamic policy rule-based access control or an intrusion detection and prevention system. To make data-driven intelligent decisions in the resulting cybersecurity system, it is necessary to understand the security problems and the nature of the corresponding security data, and to analyze that data extensively. For this purpose, our suggested framework not only considers machine learning techniques to build the security model but also takes into account incremental learning and dynamism to keep the model up-to-date, together with the corresponding response generation, which could be more effective and intelligent for providing the expected services. Figure 3 shows an overview of the framework, which involves several processing layers from raw security event data to services. In the following, we briefly discuss the working procedure of the framework.

A generic multi-layered framework based on machine learning techniques for smart cybersecurity services

Security data collecting

Collecting valuable cybersecurity data is a crucial step, which forms a connecting link between security problems in cyberinfrastructure and the corresponding data-driven solution steps in this framework, shown in Fig. 3 . The reason is that cyber data can serve as the source for establishing the ground truth of the security model, which affects model performance. The quality and quantity of cyber data determine the feasibility and effectiveness of solving the security problem according to our goal. Thus, the concern is how to collect valuable data that meets the specific needs of building data-driven security models.

The general step of collecting and managing security data from diverse data sources depends on the particular security problem and project within the enterprise. Data sources can be classified into several broad categories such as network, host, and hybrid [ 171 ]. Within the network infrastructure, the security system can leverage different types of security data, such as IDS logs, firewall logs, network traffic data, packet data, and honeypot data, to provide the target security services. For instance, whether a given IP is malicious could be detected by analyzing data on IP addresses and their cyber activities. In the domain of cybersecurity, the network source mentioned above is considered the primary security event source to analyze. In the host category, data is collected from an organization’s host machines, where the data sources can be operating system logs, database access logs, web server logs, email logs, application logs, etc. Collecting data from both the network and host machines is considered the hybrid category. Overall, in a data collection layer, network activity, database activity, application activity, and user activity are the possible security event sources in the context of cybersecurity data science.

Security data preparing

After the raw security data has been collected from various sources according to the problem domain discussed above, this layer is responsible for preparing the raw data for model building by applying various necessary processes. However, not all of the collected data contributes to the model building process in the domain of cybersecurity [ 172 ]. Therefore, useless data should be removed from the rest of the data captured by the network sniffer. Moreover, data might be noisy, have missing or corrupted values, or have attributes of widely varying types and scales. High-quality data is necessary for achieving higher accuracy in a data-driven model, which is built by learning a function that maps an input to an output based on example input-output pairs. Thus, a procedure for data cleaning and for handling missing or corrupted values might be required. Moreover, security data features or attributes can be of different types, such as continuous, discrete, or symbolic [ 106 ]. Beyond a solid understanding of these types of data and attributes and their permissible operations, the data and attributes need to be preprocessed and converted into the target type. Besides, the raw data can be structured, semi-structured, or unstructured. Thus, normalization, transformation, or collation can be useful to organize the data in a structured manner. In some cases, natural language processing techniques might be useful depending on the data type and characteristics, e.g., textual content. As both the quality and quantity of data determine the feasibility of solving the security problem, effective pre-processing, management, and representation of data can play a significant role in building an effective security model for intelligent services.
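Three of the preparation steps mentioned above (handling missing values, normalizing continuous attributes, and converting symbolic attributes) can be sketched with the standard library alone. The attribute names and values are hypothetical, and mean imputation, min-max scaling, and one-hot encoding are only one possible choice for each step.

```python
from statistics import mean


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def min_max(values):
    """Rescale a continuous attribute to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Convert a symbolic attribute into binary indicator columns."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


# Hypothetical connection records: duration in seconds and protocol
durations = [2.0, None, 6.0, 10.0]
protocols = ["tcp", "udp", "tcp", "icmp"]

print(min_max(impute_mean(durations)))  # [0.0, 0.5, 0.5, 1.0]
print(one_hot(protocols))               # columns: icmp, tcp, udp
```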

Machine learning-based security modeling

This is the core step where insights and knowledge are extracted from data through the application of cybersecurity data science. In this section, we particularly focus on machine learning-based modeling, as machine learning techniques can significantly change the cybersecurity landscape. The security features or attributes and their patterns in data are of high interest to be discovered and analyzed to extract security insights. To achieve this goal, a deeper understanding of the data and machine learning-based analytical models that use a large amount of cybersecurity data can be effective. Thus, various machine learning tasks can be involved in this model building layer according to the solution perspective. The first is security feature engineering, which is mainly responsible for transforming raw security data into informative features that effectively represent the underlying security problem to the data-driven models. Several data-processing tasks may be involved in this module according to the security data characteristics: feature transformation and normalization; feature selection, which takes a subset of the available security features according to their correlations or importance in modeling; and feature generation and extraction, which creates brand new principal components. For instance, the chi-squared test, analysis of variance test, correlation coefficient analysis, feature importance, discriminant and principal component analysis, or singular value decomposition can be used to analyze the significance of the security features when performing security feature engineering tasks [ 82 ].
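As one example of correlation coefficient analysis for feature selection, the sketch below ranks features by the absolute Pearson correlation between each feature and a binary attack label. The feature names and values are hypothetical, and a single correlation score is only a rough filter compared with the fuller set of tests listed above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    # A constant column has no defined correlation; treat it as uninformative
    return cov / (sx * sy) if sx and sy else 0.0


def rank_features(features, labels):
    """Rank features by the absolute correlation of each column with the label."""
    scores = {name: abs(pearson(col, labels)) for name, col in features.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical per-session features and attack labels (1 = attack)
features = {
    "failed_logins": [0, 1, 0, 8, 9, 7],
    "packet_size": [500, 510, 495, 505, 498, 502],
}
labels = [0, 0, 0, 1, 1, 1]

for name, score in rank_features(features, labels):
    print(f"{name}: {score:.3f}")
```

In this toy data the number of failed logins tracks the label closely while packet size does not, so a selection step based on this score would keep the former and could drop the latter.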

Another significant module is security data clustering, which uncovers hidden patterns and structures in huge volumes of security data in order to identify where new threats exist. It typically involves grouping security data with similar characteristics, which can be used to solve several cybersecurity problems such as detecting anomalies and policy violations. The malicious behavior or anomaly detection module is typically responsible for identifying deviations from known behavior, and clustering-based analysis and techniques can also be used for this purpose. In the cybersecurity area, attack classification or prediction is treated as one of the most significant modules. It is responsible for building a prediction model that classifies attacks or threats and predicts future events for a particular security problem. Predicting a denial-of-service attack, or a spam filter separating spam from other messages, are relevant examples. The association learning or policy rule generation module can help build an expert security system comprising several IF-THEN rules that define attacks. Thus, for policy rule generation in a rule-based access control system, association learning can be used, as it discovers the associations or relationships among a set of available security features in a given security dataset. The popular machine learning algorithms in these categories are briefly discussed in the “Machine learning tasks in cybersecurity” section.

The model selection or customization module is responsible for choosing whether to use an existing machine learning model or to customize one. Analyzing data and building models based on traditional machine learning or deep learning methods can achieve acceptable results in certain cases in the domain of cybersecurity. However, in terms of effectiveness, efficiency, and other performance measurements such as time complexity, generalization capacity, and, most importantly, the impact of the algorithm on the detection rate of a system, machine learning models need to be customized for a specific security problem. Moreover, customizing the related techniques and data could improve the performance of the resultant security model and make it better suited to the cybersecurity domain. The modules discussed above can work separately or in combination, depending on the target security problems.
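
To make the association learning or policy rule generation idea more concrete, the sketch below derives simple IF-THEN rules from access-log records using support and confidence. It is a brute-force illustration restricted to single-item rules, not an implementation of Apriori or of any algorithm from the cited literature. The attribute strings (`role=admin`, `time=night`, `action=deny`) and the thresholds are assumptions chosen for the example.

```python
from itertools import combinations


def mine_rules(transactions, min_support=0.5, min_confidence=0.8):
    """Return (antecedent, consequent, support, confidence) for one-item IF-THEN rules."""
    n = len(transactions)
    sets = [frozenset(t) for t in transactions]
    items = sorted(set().union(*sets))

    def support(itemset):
        # Fraction of records that contain every item in the itemset.
        return sum(1 for t in sets if itemset <= t) / n

    rules = []
    for a, b in combinations(items, 2):
        pair_support = support(frozenset((a, b)))
        if pair_support < min_support:
            continue  # the pair is too rare to base a rule on
        for lhs, rhs in ((a, b), (b, a)):
            # confidence(lhs -> rhs) = support(lhs and rhs) / support(lhs)
            confidence = pair_support / support(frozenset((lhs,)))
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules


# Hypothetical access-control log, one set of attribute=value items per request.
transactions = [
    {"role=admin", "time=night", "action=deny"},
    {"role=admin", "time=night", "action=deny"},
    {"role=admin", "time=day", "action=allow"},
    {"role=guest", "time=night", "action=deny"},
]

rules = mine_rules(transactions)
for lhs, rhs, sup, conf in rules:
    print(f"IF {lhs} THEN {rhs} (support={sup:.2f}, confidence={conf:.2f})")
```

On this toy log the only rules that pass both thresholds link night-time requests with denied access, in both directions. A real policy generator would consider multi-item antecedents and would usually restrict the consequent to the decision attribute.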

Incremental learning and dynamism

In our framework, this layer is concerned with finalizing the resultant security model by incorporating additional intelligence as needed. This can be achieved through further processing in several modules. For instance, the post-processing and improvement module in this layer could simplify the extracted knowledge according to particular requirements by incorporating domain-specific knowledge. Because attack classification or prediction models based on machine learning techniques rely strongly on the training data, they can hardly be generalized to other datasets, which could be a significant limitation for some applications. To address such limitations, this module utilizes domain knowledge in the form of a taxonomy or ontology to improve attack correlation in cybersecurity applications.

Another significant module, recency mining and updating security model, is responsible for keeping the security model up to date for better performance by extracting the latest data-driven security patterns. The knowledge extracted in the earlier layer is based on a static initial dataset and reflects the overall patterns in that dataset. However, such knowledge is not guaranteed to perform well in all cases, because incremental security data may bring recent patterns. In many cases, such incremental data contain different patterns that could conflict with existing knowledge. Thus, applying the concept of RecencyMiner [ 170 ] to incremental security data and extracting new patterns can be more effective than relying on the existing old patterns. The reason is that recent security patterns and rules are more likely to be significant than older ones for predicting cyber risks or attacks. Rather than processing the whole security dataset again, recency-based dynamic updating according to the new patterns would be more efficient in terms of both processing and outcome. This could make the resultant cybersecurity model intelligent and dynamic. Finally, the response planning and decision making module is responsible for making decisions based on the extracted insights and taking the necessary actions to protect the system from cyber-attacks and to provide automated and intelligent services. The services might differ depending on the particular requirements of a given security problem.
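
The recency-based updating idea can be illustrated with a small sketch in which older evidence is down-weighted each time a new batch of security data arrives, so the model is updated without reprocessing the full history. This is not RecencyMiner itself. It is a generic exponential-decay scheme chosen for illustration, and the class name, the `decay` parameter, and the example patterns are all hypothetical.

```python
from collections import defaultdict


class RecencyWeightedCounts:
    """Per-pattern label counts in which older batches are exponentially down-weighted."""

    def __init__(self, decay=0.5):
        # Assumed multiplier applied to all existing counts whenever a new batch arrives.
        self.decay = decay
        self.counts = defaultdict(lambda: defaultdict(float))

    def update(self, batch):
        """Fold in only the newest (pattern, label) pairs; old data is not reprocessed."""
        for label_counts in self.counts.values():
            for label in label_counts:
                label_counts[label] *= self.decay
        for pattern, label in batch:
            self.counts[pattern][label] += 1.0

    def predict(self, pattern):
        """Return the label with the largest recency-weighted count, or None if unseen."""
        label_counts = self.counts.get(pattern)
        if not label_counts:
            return None
        return max(label_counts, key=label_counts.get)


model = RecencyWeightedCounts(decay=0.5)
model.update([("port=23", "benign")] * 3)  # initial data: three benign observations
model.update([("port=23", "attack")] * 2)  # recent data: two attack observations

print(model.predict("port=23"))
```

With plain counting the older benign observations (3) would outvote the recent attack observations (2). With the assumed decay of 0.5 the benign weight drops to 1.5, so the recent pattern wins.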

Overall, this framework is a generic description that can potentially be used to discover useful insights from security data, to build smart cybersecurity systems, and to address complex security challenges such as intrusion detection, access control management, anomaly and fraud detection, or denial-of-service attacks in the area of cybersecurity data science.

Although several research efforts have been directed towards cybersecurity solutions, as discussed from different directions in the “Background”, “Cybersecurity data science”, and “Machine learning tasks in cybersecurity” sections, this paper presents a comprehensive view of cybersecurity data science. For this, we have conducted a literature review to understand cybersecurity data, various defense strategies including intrusion detection techniques, and different types of machine learning techniques in cybersecurity tasks. Based on our discussion of existing work, we have identified several research issues that require further attention in the domain of cybersecurity data science. These relate to security datasets, data quality problems, policy rule generation, learning methods, data protection, feature engineering, security alert generation, and recency analysis.

The scope of cybersecurity data science is broad. It covers several data-driven tasks, such as intrusion detection and prevention, access control management, security policy generation, anomaly detection, spam filtering, fraud detection and prevention, and the detection of and defense against various types of malware attacks. Such a task-based categorization could be helpful for security professionals, including researchers and practitioners, who are interested in the domain-specific aspects of security systems [ 171 ]. The output of cybersecurity data science can be used in many application areas, such as Internet of Things (IoT) security [ 173 ], network security [ 174 ], cloud security [ 175 ], mobile and web applications [ 26 ], and other relevant cyber areas. Moreover, intelligent cybersecurity solutions are important for the banking industry, the healthcare sector, and the public sector, where data breaches typically occur [ 36 , 176 ]. Data-driven security solutions could also be effective in AI-based blockchain technology, where AI works with huge volumes of security event data to extract useful insights using machine learning techniques, and the blockchain serves as a trusted platform to store such data [ 177 ].

Although this paper discusses cybersecurity data science with a focus on the path from examining raw security data to data-driven decision making for intelligent security solutions, it is also related to big data analytics in terms of data processing and decision making. Big data deals with datasets that are too large or complex, characterized by high data volume, velocity, and variety. Big data analytics mainly has two parts: data management, which involves data storage, and analytics [ 178 ]. Analytics typically describes the process of analyzing such datasets to discover patterns, unknown correlations, rules, and other useful insights [ 179 ]. Thus, several advanced data analysis techniques, such as AI, data mining, and machine learning, could play an important role in processing big data by converting big problems into small problems [ 180 ]. To do this, strategies such as parallelization, divide-and-conquer, incremental learning, sampling, granular computing, and feature or instance selection can be used to make better decisions, reduce costs, or enable more efficient processing. In such cases, the concept of cybersecurity data science, particularly machine learning-based modeling, could be helpful for process automation and decision making in intelligent security solutions. Moreover, researchers could consider modified algorithms or models for handling big data on parallel computing platforms such as Hadoop and Storm [ 181 ].
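
The divide-and-conquer strategy mentioned above, converting one big problem into several small ones, can be sketched as follows. The example splits a security event log into chunks, counts event types within each chunk independently, and merges the partial results. It uses only the Python standard library and runs sequentially. The log entries, function names, and chunk size are hypothetical, and on a platform such as Hadoop the per-chunk step would instead be distributed across workers.

```python
from collections import Counter


def chunked(records, size):
    """Split the record list into consecutive chunks of at most `size` items."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def count_events(chunk):
    """The 'small problem': count event types within a single chunk."""
    return Counter(event for _source, event in chunk)


def merge_counts(partials):
    """Combine per-chunk counts into one global result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical security event log as (source address, event type) pairs.
log = [
    ("10.0.0.1", "login_fail"),
    ("10.0.0.2", "login_ok"),
    ("10.0.0.1", "login_fail"),
    ("10.0.0.3", "port_scan"),
    ("10.0.0.1", "login_fail"),
]

totals = merge_counts(count_events(chunk) for chunk in chunked(log, 2))
print(totals.most_common())
```

Because each chunk is processed without reference to the others and the merge step is a simple sum, the same structure also supports incremental processing, where new chunks are folded into the existing totals as they arrive.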

Based on the concept of cybersecurity data science discussed in this paper, future work could build a data-driven security model for a particular security problem and carry out an empirical evaluation to measure the effectiveness and efficiency of the model and to assess its usability in real-world application domains.

Motivated by the growing significance of cybersecurity, data science, and machine learning technologies, in this paper we have discussed how cybersecurity data science applies to data-driven intelligent decision making in smart cybersecurity systems and services. We have also discussed how it can impact security data, both in terms of extracting insights into security incidents and in terms of the dataset itself. We aimed to work on cybersecurity data science by discussing the state of the art concerning security incident data and the corresponding security services. We also discussed how machine learning techniques can have an impact in the domain of cybersecurity, and examined the security challenges that remain. In terms of existing research, much of the focus has been on traditional security solutions, with less work available on machine learning-based security systems. For each common technique, we have discussed relevant security research. The purpose of this article is to share an overview of the conceptualization, understanding, modeling, and thinking about cybersecurity data science.

We have further identified and discussed various key issues in security analysis to signpost future research directions in the domain of cybersecurity data science. Based on this knowledge, we have also provided a generic multi-layered framework for a cybersecurity data science model based on machine learning techniques, where the data are gathered from diverse sources and the analytics complement the latest data-driven patterns to provide intelligent security services. The framework consists of several main phases: security data collection, data preparation, machine learning-based security modeling, and incremental learning and dynamism for smart cybersecurity systems and services. We specifically focused on extracting insights from security data, starting from setting a research design, with particular attention to concepts for data-driven intelligent security solutions.

Overall, this paper aimed not only to discuss cybersecurity data science and relevant methods but also to discuss their applicability to data-driven intelligent decision making in cybersecurity systems and services from a machine learning perspective. Our analysis and discussion have several implications for both security researchers and practitioners. For researchers, we have highlighted several issues and directions for future research. Other areas for potential research include empirical evaluation of the suggested data-driven model and comparative analysis with other security systems. For practitioners, the multi-layered machine learning-based model can be used as a reference in designing intelligent cybersecurity systems for organizations. We believe that our study on cybersecurity data science opens a promising path and can be used as a reference guide by both academia and industry for future research and applications in the area of cybersecurity.

Availability of data and materials

Not applicable.

Abbreviations

ML: Machine learning

AI: Artificial intelligence

ICT: Information and communication technology

IoT: Internet of Things

DDoS: Distributed denial of service

IDS: Intrusion detection system

IPS: Intrusion prevention system

HIDS: Host-based intrusion detection system

NIDS: Network intrusion detection system

SIDS: Signature-based intrusion detection system

AIDS: Anomaly-based intrusion detection system

Li S, Da Xu L, Zhao S. The internet of things: a survey. Inform Syst Front. 2015;17(2):243–59.

Sun N, Zhang J, Rimba P, Gao S, Zhang LY, Xiang Y. Data-driven cybersecurity incident prediction: a survey. IEEE Commun Surv Tutor. 2018;21(2):1744–72.

McIntosh T, Jang-Jaccard J, Watters P, Susnjak T. The inadequacy of entropy-based ransomware detection. In: International conference on neural information processing. New York: Springer; 2019. p. 181–189

Alazab M, Venkatraman S, Watters P, Alazab M, et al. Zero-day malware detection based on supervised learning algorithms of api call signatures (2010)

Shaw A. Data breach: from notification to prevention using pci dss. Colum Soc Probs. 2009;43:517.

Gupta BB, Tewari A, Jain AK, Agrawal DP. Fighting against phishing attacks: state of the art and future challenges. Neural Comput Appl. 2017;28(12):3629–54.

Av-test institute, germany, https://www.av-test.org/en/statistics/malware/ . Accessed 20 Oct 2019.

Ibm security report, https://www.ibm.com/security/data-breach . Accessed on 20 Oct 2019.

Fischer EA. Cybersecurity issues and challenges: In brief. Congressional Research Service (2014)

Juniper research. https://www.juniperresearch.com/ . Accessed on 20 Oct 2019.

Papastergiou S, Mouratidis H, Kalogeraki E-M. Cyber security incident handling, warning and response system for the european critical information infrastructures (cybersane). In: International Conference on Engineering Applications of Neural Networks, p. 476–487 (2019). New York: Springer

Aftergood S. Cybersecurity: the cold war online. Nature. 2017;547(7661):30.

Hey AJ, Tansley S, Tolle KM, et al. The fourth paradigm: data-intensive scientific discovery. 2009;1:

Cukier K. Data, data everywhere: A special report on managing information, 2010.

Google trends. In: https://trends.google.com/trends/ , 2019.

Anwar S, Mohamad Zain J, Zolkipli MF, Inayat Z, Khan S, Anthony B, Chang V. From intrusion detection to an intrusion response system: fundamentals, requirements, and future directions. Algorithms. 2017;10(2):39.

Mohammadi S, Mirvaziri H, Ghazizadeh-Ahsaee M, Karimipour H. Cyber intrusion detection by combined feature selection algorithm. J Inform Sec Appl. 2019;44:80–8.

Tapiador JE, Orfila A, Ribagorda A, Ramos B. Key-recovery attacks on kids, a keyed anomaly detection system. IEEE Trans Depend Sec Comput. 2013;12(3):312–25.

Tavallaee M, Stakhanova N, Ghorbani AA. Toward credible evaluation of anomaly-based intrusion-detection methods. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 40(5), 516–524 (2010)

Foroughi F, Luksch P. Data science methodology for cybersecurity projects. arXiv preprint arXiv:1803.04219 , 2018.

Saxe J, Sanders H. Malware data science: Attack detection and attribution, 2018.

Rainie L, Anderson J, Connolly J. Cyber attacks likely to increase. Digital Life in 2025. 2014.

Fischer EA. Creating a national framework for cybersecurity: an analysis of issues and options. LIBRARY OF CONGRESS WASHINGTON DC CONGRESSIONAL RESEARCH SERVICE, 2005.

Craigen D, Diakun-Thibault N, Purse R. Defining cybersecurity. Technology Innovation. Manag Rev. 2014;4(10):13–21.

Council NR. et al. Toward a safer and more secure cyberspace, 2007.

Jang-Jaccard J, Nepal S. A survey of emerging threats in cybersecurity. J Comput Syst Sci. 2014;80(5):973–93.

Mukkamala S, Sung A, Abraham A. Cyber security challenges: Designing efficient intrusion detection systems and antivirus tools. Vemuri, V. Rao, Enhancing Computer Security with Smart Technology.(Auerbach, 2006), 125–163, 2005.

Bilge L, Dumitraş T. Before we knew it: an empirical study of zero-day attacks in the real world. In: Proceedings of the 2012 ACM conference on computer and communications security. ACM; 2012. p. 833–44.

Davi L, Dmitrienko A, Sadeghi A-R, Winandy M. Privilege escalation attacks on android. In: International conference on information security. New York: Springer; 2010. p. 346–60.

Jovičić B, Simić D. Common web application attack types and security using asp .net. ComSIS, 2006.

Warkentin M, Willison R. Behavioral and policy issues in information systems security: the insider threat. Eur J Inform Syst. 2009;18(2):101–5.

Kügler D. “man in the middle” attacks on bluetooth. In: International Conference on Financial Cryptography. New York: Springer; 2003, p. 149–61.

Virvilis N, Gritzalis D. The big four-what we did wrong in advanced persistent threat detection. In: 2013 International Conference on Availability, Reliability and Security. IEEE; 2013. p. 248–54.

Boyd SW, Keromytis AD. Sqlrand: Preventing sql injection attacks. In: International conference on applied cryptography and network security. New York: Springer; 2004. p. 292–302.

Sigler K. Crypto-jacking: how cyber-criminals are exploiting the crypto-currency boom. Comput Fraud Sec. 2018;2018(9):12–4.

2019 data breach investigations report, https://enterprise.verizon.com/resources/reports/dbir/ . Accessed 20 Oct 2019.

Khraisat A, Gondal I, Vamplew P, Kamruzzaman J. Survey of intrusion detection systems: techniques, datasets and challenges. Cybersecurity. 2019;2(1):20.

Johnson L. Computer incident response and forensics team management: conducting a successful incident response, 2013.

Brahmi I, Brahmi H, Yahia SB. A multi-agents intrusion detection system using ontology and clustering techniques. In: IFIP international conference on computer science and its applications. New York: Springer; 2015. p. 381–93.

Qu X, Yang L, Guo K, Ma L, Sun M, Ke M, Li M. A survey on the development of self-organizing maps for unsupervised intrusion detection. In: Mobile networks and applications. 2019;1–22.

Liao H-J, Lin C-HR, Lin Y-C, Tung K-Y. Intrusion detection system: a comprehensive review. J Netw Comput Appl. 2013;36(1):16–24.

Alazab A, Hobbs M, Abawajy J, Alazab M. Using feature selection for intrusion detection system. In: 2012 International symposium on communications and information technologies (ISCIT). IEEE; 2012. p. 296–301.

Viegas E, Santin AO, Franca A, Jasinski R, Pedroni VA, Oliveira LS. Towards an energy-efficient anomaly-based intrusion detection engine for embedded systems. IEEE Trans Comput. 2016;66(1):163–77.

Xin Y, Kong L, Liu Z, Chen Y, Li Y, Zhu H, Gao M, Hou H, Wang C. Machine learning and deep learning methods for cybersecurity. IEEE Access. 2018;6:35365–81.

Dutt I, Borah S, Maitra IK, Bhowmik K, Maity A, Das S. Real-time hybrid intrusion detection system using machine learning techniques. 2018, p. 885–94.

Ragsdale DJ, Carver C, Humphries JW, Pooch UW. Adaptation techniques for intrusion detection and intrusion response systems. In: Smc 2000 conference proceedings. 2000 IEEE international conference on systems, man and cybernetics.’cybernetics evolving to systems, humans, organizations, and their complex interactions’(cat. No. 0). IEEE; 2000. vol. 4, p. 2344–2349.

Cao L. Data science: challenges and directions. Commun ACM. 2017;60(8):59–68.

Rizk A, Elragal A. Data science: developing theoretical contributions in information systems via text analytics. J Big Data. 2020;7(1):1–26.

Lippmann RP, Fried DJ, Graf I, Haines JW, Kendall KR, McClung D, Weber D, Webster SE, Wyschogrod D, Cunningham RK, et al. Evaluating intrusion detection systems: The 1998 darpa off-line intrusion detection evaluation. In: Proceedings DARPA information survivability conference and exposition. DISCEX’00. IEEE; 2000. vol. 2, p. 12–26.

Kdd cup 99. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html . Accessed 20 Oct 2019.

Tavallaee M, Bagheri E, Lu W, Ghorbani AA. A detailed analysis of the kdd cup 99 data set. In: 2009 IEEE symposium on computational intelligence for security and defense applications. IEEE; 2009. p. 1–6.

Caida ddos attack 2007 dataset. http://www.caida.org/data/passive/ddos-20070804-dataset.xml/ . Accessed 20 Oct 2019.

Caida anonymized internet traces 2008 dataset. https://www.caida.org/data/passive/passive-2008-dataset . Accessed 20 Oct 2019.

Isot botnet dataset. https://www.uvic.ca/engineering/ece/isot/datasets/index.php/ . Accessed 20 Oct 2019.

The honeynet project. http://www.honeynet.org/chapters/france/ . Accessed 20 Oct 2019.

Canadian institute of cybersecurity, university of new brunswick, iscx dataset, http://www.unb.ca/cic/datasets/index.html/ . Accessed 20 Oct 2019.

Shiravi A, Shiravi H, Tavallaee M, Ghorbani AA. Toward developing a systematic approach to generate benchmark datasets for intrusion detection. Comput Secur. 2012;31(3):357–74.

The ctu-13 dataset. https://stratosphereips.org/category/datasets-ctu13 . Accessed 20 Oct 2019.

Moustafa N, Slay J. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In: 2015 Military Communications and Information Systems Conference (MilCIS). IEEE; 2015. p. 1–6.

Cse-cic-ids2018 [online]. available: https://www.unb.ca/cic/datasets/ids-2018.html/ . Accessed 20 Oct 2019.

Cic-ddos2019 [online]. available: https://www.unb.ca/cic/datasets/ddos-2019.html/ . Accessed 28 Mar 2019.

Jing X, Yan Z, Jiang X, Pedrycz W. Network traffic fusion and analysis against ddos flooding attacks with a novel reversible sketch. Inform Fusion. 2019;51:100–13.

Xie M, Hu J, Yu X, Chang E. Evaluating host-based anomaly detection systems: application of the frequency-based algorithms to adfa-ld. In: International conference on network and system security. New York: Springer; 2015. p. 542–49.

Lindauer B, Glasser J, Rosen M, Wallnau KC, ExactData L. Generating test data for insider threat detectors. JoWUA. 2014;5(2):80–94.

Glasser J, Lindauer B. Bridging the gap: A pragmatic approach to generating insider threat data. In: 2013 IEEE Security and Privacy Workshops. IEEE; 2013. p. 98–104.

Enronspam. https://labs-repos.iit.demokritos.gr/skel/i-config/downloads/enron-spam/ . Accessed 20 Oct 2019.

Spamassassin. http://www.spamassassin.org/publiccorpus/ . Accessed 20 Oct 2019.

Lingspam. https://labs-repos.iit.demokritos.gr/skel/i-config/downloads/lingspampublic.tar.gz/ . Accessed 20 Oct 2019.

Alexa top sites. https://aws.amazon.com/alexa-top-sites/ . Accessed 20 Oct 2019.

Bambenek consulting—master feeds. available online: http://osint.bambenekconsulting.com/feeds/ . Accessed 20 Oct 2019.

Dgarchive. https://dgarchive.caad.fkie.fraunhofer.de/site/ . Accessed 20 Oct 2019.

Zago M, Pérez MG, Pérez GM. Umudga: A dataset for profiling algorithmically generated domain names in botnet detection. Data in Brief. 2020;105400.

Zhou Y, Jiang X. Dissecting android malware: characterization and evolution. In: 2012 IEEE Symposium on security and privacy. IEEE; 2012. p. 95–109.

Virusshare. http://virusshare.com/ . Accessed 20 Oct 2019.

Virustotal. https://virustotal.com/ . Accessed 20 Oct 2019.

Comodo. https://www.comodo.com/home/internet-security/updates/vdp/database . Accessed 20 Oct 2019.

Contagio. http://contagiodump.blogspot.com/ . Accessed 20 Oct 2019.

Kumar R, Xiaosong Z, Khan RU, Kumar J, Ahad I. Effective and explainable detection of android malware based on machine learning algorithms. In: Proceedings of the 2018 international conference on computing and artificial intelligence. ACM; 2018. p. 35–40.

Microsoft malware classification (big 2015). https://arxiv.org/abs/1802.10135/ . Accessed 20 Oct 2019.

Koroniotis N, Moustafa N, Sitnikova E, Turnbull B. Towards the development of realistic botnet dataset in the internet of things for network forensic analytics: bot-iot dataset. Future Gen Comput Syst. 2019;100:779–96.

McIntosh TR, Jang-Jaccard J, Watters PA. Large scale behavioral analysis of ransomware attacks. In: International conference on neural information processing. New York: Springer; 2018. p. 217–29.

Han J, Pei J, Kamber M. Data mining: concepts and techniques, 2011.

Witten IH, Frank E. Data mining: Practical machine learning tools and techniques, 2005.

Dua S, Du X. Data mining and machine learning in cybersecurity, 2016.

Kotpalliwar MV, Wajgi R. Classification of attacks using support vector machine (svm) on kddcup’99 ids database. In: 2015 Fifth international conference on communication systems and network technologies. IEEE; 2015. p. 987–90.

Pervez MS, Farid DM. Feature selection and intrusion classification in nsl-kdd cup 99 dataset employing svms. In: The 8th international conference on software, knowledge, information management and applications (SKIMA 2014). IEEE; 2014. p. 1–6.

Yan M, Liu Z. A new method of transductive svm-based network intrusion detection. In: International conference on computer and computing technologies in agriculture. New York: Springer; 2010. p. 87–95.

Li Y, Xia J, Zhang S, Yan J, Ai X, Dai K. An efficient intrusion detection system based on support vector machines and gradually feature removal method. Expert Syst Appl. 2012;39(1):424–30.

Raman MG, Somu N, Jagarapu S, Manghnani T, Selvam T, Krithivasan K, Sriram VS. An efficient intrusion detection technique based on support vector machine and improved binary gravitational search algorithm. Artificial Intelligence Review. 2019, p. 1–32.

Kokila R, Selvi ST, Govindarajan K. Ddos detection and analysis in sdn-based environment using support vector machine classifier. In: 2014 Sixth international conference on advanced computing (ICoAC). IEEE; 2014. p. 205–10.

Xie M, Hu J, Slay J. Evaluating host-based anomaly detection systems: Application of the one-class svm algorithm to adfa-ld. In: 2014 11th international conference on fuzzy systems and knowledge discovery (FSKD). IEEE; 2014. p. 978–82.

Saxena H, Richariya V. Intrusion detection in kdd99 dataset using svm-pso and feature reduction with information gain. Int J Comput Appl. 2014;98:6.

Chandrasekhar A, Raghuveer K. Confederation of fcm clustering, ann and svm techniques to implement hybrid nids using corrected kdd cup 99 dataset. In: 2014 international conference on communication and signal processing. IEEE; 2014. p. 672–76.

Shapoorifard H, Shamsinejad P. Intrusion detection using a novel hybrid method incorporating an improved knn. Int J Comput Appl. 2017;173(1):5–9.

Vishwakarma S, Sharma V, Tiwari A. An intrusion detection system using knn-aco algorithm. Int J Comput Appl. 2017;171(10):18–23.

Meng W, Li W, Kwok L-F. Design of intelligent knn-based alarm filter using knowledge-based alert verification in intrusion detection. Secur Commun Netw. 2015;8(18):3883–95.

Dada E. A hybridized svm-knn-pdapso approach to intrusion detection system. In: Proc. Fac. Seminar Ser., 2017, p. 14–21.

Sharifi AM, Amirgholipour SK, Pourebrahimi A. Intrusion detection based on joint of k-means and knn. J Converg Inform Technol. 2015;10(5):42.

Lin W-C, Ke S-W, Tsai C-F. Cann: an intrusion detection system based on combining cluster centers and nearest neighbors. Knowl Based Syst. 2015;78:13–21.

Koc L, Mazzuchi TA, Sarkani S. A network intrusion detection system based on a hidden naïve bayes multiclass classifier. Exp Syst Appl. 2012;39(18):13492–500.

Moon D, Im H, Kim I, Park JH. Dtb-ids: an intrusion detection system based on decision tree using behavior analysis for preventing apt attacks. J Supercomput. 2017;73(7):2881–95.

Ingre B, Yadav A, Soni AK. Decision tree based intrusion detection system for nsl-kdd dataset. In: International conference on information and communication technology for intelligent systems. New York: Springer; 2017. p. 207–18.

Malik AJ, Khan FA. A hybrid technique using binary particle swarm optimization and decision tree pruning for network intrusion detection. Cluster Comput. 2018;21(1):667–80.

Relan NG, Patil DR. Implementation of network intrusion detection system using variant of decision tree algorithm. In: 2015 international conference on nascent technologies in the engineering field (ICNTE). IEEE; 2015. p. 1–5.

Rai K, Devi MS, Guleria A. Decision tree based algorithm for intrusion detection. Int J Adv Netw Appl. 2016;7(4):2828.

Sarker IH, Abushark YB, Alsolami F, Khan AI. Intrudtree: a machine learning based cyber security intrusion detection model. Symmetry. 2020;12(5):754.

Puthran S, Shah K. Intrusion detection using improved decision tree algorithm with binary and quad split. In: International symposium on security in computing and communication. New York: Springer; 2016. p. 427–438.

Balogun AO, Jimoh RG. Anomaly intrusion detection using an hybrid of decision tree and k-nearest neighbor, 2015.

Azad C, Jha VK. Genetic algorithm to solve the problem of small disjunct in the decision tree based intrusion detection system. Int J Comput Netw Inform Secur. 2015;7(8):56.

Jo S, Sung H, Ahn B. A comparative study on the performance of intrusion detection using decision tree and artificial neural network models. J Korea Soc Dig Indus Inform Manag. 2015;11(4):33–45.

Zhan J, Zulkernine M, Haque A. Random-forests-based network intrusion detection systems. IEEE Trans Syst Man Cybern C. 2008;38(5):649–59.

Tajbakhsh A, Rahmati M, Mirzaei A. Intrusion detection using fuzzy association rules. Appl Soft Comput. 2009;9(2):462–9.

Mitchell R, Chen R. Behavior rule specification-based intrusion detection for safety critical medical cyber physical systems. IEEE Trans Depend Secure Comput. 2014;12(1):16–30.

Alazab M, Venkataraman S, Watters P. Towards understanding malware behaviour by the extraction of api calls. In: 2010 second cybercrime and trustworthy computing Workshop. IEEE; 2010. p. 52–59.

Yuan Y, Kaklamanos G, Hogrefe D. A novel semi-supervised adaboost technique for network anomaly detection. In: Proceedings of the 19th ACM international conference on modeling, analysis and simulation of wireless and mobile systems. ACM; 2016. p. 111–14.

Ariu D, Tronci R, Giacinto G. Hmmpayl: an intrusion detection system based on hidden markov models. Comput Secur. 2011;30(4):221–41.

Årnes A, Valeur F, Vigna G, Kemmerer RA. Using hidden markov models to evaluate the risks of intrusions. In: International workshop on recent advances in intrusion detection. New York: Springer; 2006. p. 145–64.

Hansen JV, Lowry PB, Meservy RD, McDonald DM. Genetic programming for prevention of cyberterrorism through dynamic and evolving intrusion detection. Decis Supp Syst. 2007;43(4):1362–74.

Aslahi-Shahri B, Rahmani R, Chizari M, Maralani A, Eslami M, Golkar MJ, Ebrahimi A. A hybrid method consisting of ga and svm for intrusion detection system. Neural Comput Appl. 2016;27(6):1669–76.

Alrawashdeh K, Purdy C. Toward an online anomaly intrusion detection system based on deep learning. In: 2016 15th IEEE international conference on machine learning and applications (ICMLA). IEEE; 2016. p. 195–200.

Yin C, Zhu Y, Fei J, He X. A deep learning approach for intrusion detection using recurrent neural networks. IEEE Access. 2017;5:21954–61.

Kim J, Kim J, Thu HLT, Kim H. Long short term memory recurrent neural network classifier for intrusion detection. In: 2016 international conference on platform technology and service (PlatCon). IEEE; 2016. p. 1–5.

Almiani M, AbuGhazleh A, Al-Rahayfeh A, Atiewi S, Razaque A. Deep recurrent neural network for iot intrusion detection system. Simulation Modelling Practice and Theory. 2019;102031.

Kolosnjaji B, Zarras A, Webster G, Eckert C. Deep learning for classification of malware system call sequences. In: Australasian joint conference on artificial intelligence. New York: Springer; 2016. p. 137–49.

Wang W, Zhu M, Zeng X, Ye X, Sheng Y. Malware traffic classification using convolutional neural network for representation learning. In: 2017 international conference on information networking (ICOIN). IEEE; 2017. p. 712–17.

Alauthman M, Aslam N, Al-kasassbeh M, Khan S, Al-Qerem A, Choo K-KR. An efficient reinforcement learning-based botnet detection approach. J Netw Comput Appl. 2020;150:102479.

Blanco R, Cilla JJ, Briongos S, Malagón P, Moya JM. Applying cost-sensitive classifiers with reinforcement learning to ids. In: International conference on intelligent data engineering and automated learning. New York: Springer; 2018. p. 531–38.

Lopez-Martin M, Carro B, Sanchez-Esguevillas A. Application of deep reinforcement learning to intrusion detection for supervised problems. Exp Syst Appl. 2020;141:112963.

Sarker IH, Kayes A, Watters P. Effectiveness analysis of machine learning classification models for predicting personalized context-aware smartphone usage. J Big Data. 2019;6(1):1–28.

Holte RC. Very simple classification rules perform well on most commonly used datasets. Mach Learn. 1993;11(1):63–90.

John GH, Langley P. Estimating continuous distributions in bayesian classifiers. In: Proceedings of the eleventh conference on uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc.; 1995. p. 338–45.

Quinlan JR. C4.5: Programs for machine learning. Machine Learning, 1993.

Sarker IH, Colman A, Han J, Khan AI, Abushark YB, Salah K. Behavdt: a behavioral decision tree learning to build user-centric context-aware predictive model. Mobile Networks and Applications. 2019, p. 1–11.

Aha DW, Kibler D, Albert MK. Instance-based learning algorithms. Mach Learn. 1991;6(1):37–66.

Keerthi SS, Shevade SK, Bhattacharyya C, Murthy KRK. Improvements to platt’s smo algorithm for svm classifier design. Neural Comput. 2001;13(3):637–49.

Freund Y, Schapire RE, et al. Experiments with a new boosting algorithm. In: ICML, vol. 96. Citeseer; 1996. p. 148–156.

Le Cessie S, Van Houwelingen JC. Ridge estimators in logistic regression. J Royal Stat Soc C. 1992;41(1):191–201.

Watters PA, McCombie S, Layton R, Pieprzyk J. Characterising and predicting cyber attacks using the cyber attacker model profile (camp). J Money Launder Control. 2012.

Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.

Sarker IH. Context-aware rule learning from smartphone data: survey, challenges and future directions. J Big Data. 2019;6(1):95.

MacQueen J. Some methods for classification and analysis of multivariate observations. In: Fifth Berkeley symposium on mathematical statistics and probability, vol. 1, 1967.

Rokach L. A survey of clustering algorithms. In: Data Mining and Knowledge Discovery Handbook. New York: Springer; 2010. p. 269–98.

Sneath PH. The application of computers to taxonomy. J Gen Microbiol. 1957;17:1.

Sorensen T. method of establishing groups of equal amplitude in plant sociology based on similarity of species. Biol Skr. 1948;5.

Sarker IH, Colman A, Kabir MA, Han J. Individualized time-series segmentation for mining mobile phone user behavior. Comput J. 2018;61(3):349–68.

Kim G, Lee S, Kim S. A novel hybrid intrusion detection method integrating anomaly detection with misuse detection. Exp Syst Appl. 2014;41(4):1690–700.

MathSciNet Google Scholar

Agrawal R, Imieliński T, Swami A. Mining association rules between sets of items in large databases. In: ACM SIGMOD Record. ACM; 1993. vol. 22, p. 207–16.

Flach PA, Lachiche N. Confirmation-guided discovery of first-order rules with tertius. Mach Learn. 2001;42(1–2):61–95.

Agrawal R, Srikant R, et al: Fast algorithms for mining association rules. In: Proc. 20th Int. Conf. Very Large Data Bases, VLDB, 1994, vol. 1215, p. 487–99.

Houtsma M, Swami A. Set-oriented mining for association rules in relational databases. In: Proceedings of the eleventh international conference on data engineering. IEEE; 1995. p. 25–33.

Ma BLWHY. Integrating classification and association rule mining. In: Proceedings of the fourth international conference on knowledge discovery and data mining, 1998.

Han J, Pei J, Yin Y. Mining frequent patterns without candidate generation. In: ACM Sigmod Record. ACM; 2000. vol. 29, p. 1–12.

Sarker IH, Salim FD. Mining user behavioral rules from smartphone data through association analysis. In: Proceedings of the 22nd Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Melbourne, Australia. New York: Springer; 2018. p. 450–61.

Das A, Ng W-K, Woon Y-K. Rapid association rule mining. In: Proceedings of the tenth international conference on information and knowledge management. ACM; 2001. p. 474–81.

Zaki MJ. Scalable algorithms for association mining. IEEE Trans Knowl Data Eng. 2000;12(3):372–90.

Coelho IM, Coelho VN, Luz EJS, Ochi LS, Guimarães FG, Rios E. A gpu deep learning metaheuristic based model for time series forecasting. Appl Energy. 2017;201:412–8.

Van Efferen L, Ali-Eldin AM. A multi-layer perceptron approach for flow-based anomaly detection. In: 2017 International symposium on networks, computers and communications (ISNCC). IEEE; 2017. p. 1–6.

Liu H, Lang B, Liu M, Yan H. Cnn and rnn based payload classification methods for attack detection. Knowl Based Syst. 2019;163:332–41.

Berman DS, Buczak AL, Chavis JS, Corbett CL. A survey of deep learning methods for cyber security. Information. 2019;10(4):122.

Bellman R. A markovian decision process. J Math Mech. 1957;1:679–84.

Kaelbling LP, Littman ML, Moore AW. Reinforcement learning: a survey. J Artif Intell Res. 1996;4:237–85.

Sarker IH. A machine learning based robust prediction model for real-life mobile phone data. Internet of Things. 2019;5:180–93.

Kayes ASM, Han J, Colman A. OntCAAC: an ontology-based approach to context-aware access control for software services. Comput J. 2015;58(11):3000–34.

Kayes ASM, Rahayu W, Dillon T. An ontology-based approach to dynamic contextual role for pervasive access control. In: AINA 2018. IEEE Computer Society, 2018.

Colombo P, Ferrari E. Access control technologies for big data management systems: literature review and future trends. Cybersecurity. 2019;2(1):1–13.

Aleroud A, Karabatis G. Contextual information fusion for intrusion detection: a survey and taxonomy. Knowl Inform Syst. 2017;52(3):563–619.

Sarker IH, Abushark YB, Khan AI. Contextpca: Predicting context-aware smartphone apps usage based on machine learning techniques. Symmetry. 2020;12(4):499.

Madsen RE, Hansen LK, Winther O. Singular value decomposition and principal component analysis. Neural Netw. 2004;1:1–5.

Qiao L-B, Zhang B-F, Lai Z-Q, Su J-S. Mining of attack models in ids alerts from network backbone by a two-stage clustering method. In: 2012 IEEE 26th international parallel and distributed processing symposium workshops & Phd Forum. IEEE; 2012. p. 1263–9.

Sarker IH, Colman A, Han J. Recencyminer: mining recency-based personalized behavior from contextual smartphone data. J Big Data. 2019;6(1):49.

Ullah F, Babar MA. Architectural tactics for big data cybersecurity analytics systems: a review. J Syst Softw. 2019;151:81–118.

Zhao S, Leftwich K, Owens M, Magrone F, Schonemann J, Anderson B, Medhi D. I-can-mama: Integrated campus network monitoring and management. In: 2014 IEEE network operations and management symposium (NOMS). IEEE; 2014. p. 1–7.

Abomhara M, et al. Cyber security and the internet of things: vulnerabilities, threats, intruders and attacks. J Cyber Secur Mob. 2015;4(1):65–88.

Helali RGM. Data mining based network intrusion detection system: A survey. In: Novel algorithms and techniques in telecommunications and networking. New York: Springer; 2010. p. 501–505.

Ryoo J, Rizvi S, Aiken W, Kissell J. Cloud security auditing: challenges and emerging approaches. IEEE Secur Priv. 2013;12(6):68–74.

Densham B. Three cyber-security strategies to mitigate the impact of a data breach. Netw Secur. 2015;2015(1):5–8.

Salah K, Rehman MHU, Nizamuddin N, Al-Fuqaha A. Blockchain for ai: review and open research challenges. IEEE Access. 2019;7:10127–49.

Gandomi A, Haider M. Beyond the hype: big data concepts, methods, and analytics. Int J Inform Manag. 2015;35(2):137–44.

Golchha N. Big data-the information revolution. Int J Adv Res. 2015;1(12):791–4.

Hariri RH, Fredericks EM, Bowers KM. Uncertainty in big data analytics: survey, opportunities, and challenges. J Big Data. 2019;6(1):44.

Tsai C-W, Lai C-F, Chao H-C, Vasilakos AV. Big data analytics: a survey. J Big data. 2015;2(1):21.

Download references

Acknowledgements

The authors would like to thank all the reviewers for their rigorous review and comments in several revision rounds. The reviews are detailed and helpful to improve and finalize the manuscript. The authors are highly grateful to them.

Author information

Authors and affiliations.

Swinburne University of Technology, Melbourne, VIC, 3122, Australia

Iqbal H. Sarker

Chittagong University of Engineering and Technology, Chittagong, 4349, Bangladesh

La Trobe University, Melbourne, VIC, 3086, Australia

A. S. M. Kayes, Paul Watters & Alex Ng

University of Nevada, Reno, USA

Shahriar Badsha

Macquarie University, Sydney, NSW, 2109, Australia

Hamed Alqahtani

Contributions

This article not only discusses cybersecurity data science and the relevant methods, but also examines their applicability to data-driven intelligent decision making in cybersecurity systems and services. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Iqbal H. Sarker.

Ethics declarations

Competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article.

Sarker, I.H., Kayes, A.S.M., Badsha, S. et al. Cybersecurity data science: an overview from machine learning perspective. J Big Data 7, 41 (2020). https://doi.org/10.1186/s40537-020-00318-5

Received: 26 October 2019

Accepted: 21 June 2020

Published: 01 July 2020

DOI: https://doi.org/10.1186/s40537-020-00318-5

Decision making
Cyber-attack
Security modeling
Intrusion detection
Cyber threat intelligence

Data Science

Data Science (M.Sc.)

The Master’s degree program in data science is a great way to become a true expert in the world of data analysis and processing. In this degree program, you will learn the latest technologies and methods to model and analyze large amounts of data and furthermore gain insights that are critical in many use cases.

In your studies, you will gain a comprehensive understanding of data analytics, artificial intelligence, machine learning and statistics. You will learn how to create complex data models and apply them to different use cases.

What is the degree program about?

Data science has become a revolutionary technology that everyone seems to talk about. It is becoming a key concept for large private businesses, public institutions and research. While it is not easy to define it in a few words, data science deals with the methods and tools needed to analyze data and draw actionable conclusions from the results gained in the process. These methods and tools, which cover big data and their analysis, data modeling, machine learning, and simulation methods, are located mainly at the intersection of the subjects mathematics and computer science. Consequently, this new Master’s program at Friedrich-Alexander Universität is taught jointly by lecturers from these fields.

This program uses dynamic learning methodologies to ensure our students stand out in today’s competitive job market. Students will enjoy a wide variety of long-lasting benefits:

Hands-on teaching methodology.
A world-class institution.
Individual, interest-based curriculum.

Design and structure

The M.Sc. Data Science degree programme offers the following specialization areas:

Data-based optimization
Mathematical theory / Fundamentals of data science (taught in German)
Databases and knowledge representation
Machine learning / Artificial intelligence
Simulation and numerics
Mathematical statistical data analysis

Each student selects one specialization area as the major field of study and completes modules totalling 30 ECTS in it. The remaining specialization areas together form the minor field of study, in which modules totalling 20 ECTS must be completed.

In addition, three core modules totalling 15 ECTS are mandatory for all students of this degree programme. These are complemented by technical qualification modules totalling 5 ECTS.

Every student must also complete modules totalling 15 ECTS from the following application subjects:

Artificial intelligence in biomedical imaging
Digital humanities
Geosciences
International information systems
Medical data science
Material Science

The degree programme concludes with a Master’s seminar (5 ECTS), which leads into the Master’s thesis (30 ECTS) in the field of Data Science.
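The credit figures listed in this section can be added up to check the overall workload. The sketch below simply sums the components named above; the total of 120 ECTS is derived from those figures and is not stated explicitly on this page, and the dictionary and function names are illustrative only.

```python
# ECTS credits per curriculum component, copied from the programme description.
# The component labels are shorthand chosen for this example.
ECTS_BY_COMPONENT = {
    "major field of study": 30,
    "minor field of study": 20,
    "core modules": 15,
    "technical qualification modules": 5,
    "application subject": 15,
    "Master's seminar": 5,
    "Master's thesis": 30,
}


def total_ects(components: dict) -> int:
    """Return the sum of all ECTS credits in the given component table."""
    return sum(components.values())


def share_of_total(components: dict, name: str) -> float:
    """Return the fraction of the total workload taken up by one component."""
    return components[name] / total_ects(components)


if __name__ == "__main__":
    for name, credits in ECTS_BY_COMPONENT.items():
        print(f"{name:<35}{credits:>3} ECTS")
    print(f"{'total':<35}{total_ects(ECTS_BY_COMPONENT):>3} ECTS")
```

Under these figures the thesis alone accounts for a quarter of the programme, which is worth keeping in mind when planning the final semester.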

Fields of study and specializations

At the beginning of the Master’s degree program, one major field of study is selected from the following subject areas as part of an individual study agreement:

Mathematical theory / Fundamentals of data science

The other subject areas together form the minor field of study. The courses are mainly taught in English. Every student chooses a mentor at the beginning of the course of study. The mentor advises the student on how to design the study plan in accordance with the student’s individual interests.

Which qualities and skills do I need?

If you still have doubts about choosing the “Data Science” degree program, read through the following statements and consider whether they apply to you.

As a digital native, you hold the topic of “digitalization” close to your heart and are interested in current, data-driven technology.
You have a broad range of interests and feel motivated by many different challenges.
Mathematics does not scare you, but gives you pleasure. You like to work precisely, formalize ideas, and produce resilient results.
The ability to understand complex relationships and break them down to the essentials is one of your core competencies, along with an unquenchable thirst for knowledge.
You have a strong interest in understanding how human behavior can be captured and even predicted in mathematical models.
You have always wanted to learn programming.
You are interested in how the underlying mathematical and computer science processes work.

If these points fit you, you will definitely make the right choice with the degree program “Data Science”.

Why should I study at FAU?

FAU Erlangen-Nuremberg offers unique conditions for the degree program “Data Science”. Due to the strong content-related networking of the departments of mathematics and computer science and the spatial distance of just two minutes on foot, there is a wide range of informatics and mathematics topics available, taught centrally in the degree program. Due to the great variety of subjects at FAU, you can choose your application subject from many different subject areas. This helps you to find your own individual specialization in your studies, and focus on subjects which you are particularly interested in and enjoy. In addition, the industrial environment of the Nuremberg metropolitan region creates ideal conditions for sustainable and application-oriented studies. And perhaps you will already get to know your future employer during your studies, such as Siemens, Schaeffler or adidas.

Which career prospects are open to me?

With a Master’s degree as a data scientist, many exciting fields of work open up to you in which you can profitably apply your knowledge. You work directly at the interface between man and machine. Here are some examples of industries with potential employers:

Technology industry (e.g. Google, Facebook, Microsoft, IBM, SAP, Siemens, etc.)
Consulting industry (e.g. McKinsey, Ernst & Young, Deloitte, etc.)
Biomedical research companies (e.g. AstraZeneca, Roche, Novartis, Bayer, etc.)
Logistics industry (Deutsche Post, UPS, DB Mobility Logistics, etc.)
Energy industry (E.ON, RWE, EDF, etc.)
Finance and insurance industry (Deutsche Bank, Allianz, Munich Re, etc.)

Due to the high demand for graduates in the field of data science – there is an estimated shortage of over 100,000 experts for data science in Germany alone – graduates can expect a relatively high starting salary when starting their career.

Alternatively, you can further deepen your understanding of data modeling and analysis by choosing to continue with a doctoral degree and thus even advance the current state of research, which will decisively shape the handling of the resource “data” for the coming decades.

Special features

German skills on a B1 level are highly recommended.

Admission requirements, application, and enrollment

A completed Bachelor’s degree in Mathematics, Industrial Mathematics, Mathematical Economy, Computer Science, Data Science, or Physics from FAU or another equivalent German or international degree that is not significantly different with regard to the competence profile taught in the respective degree program. Please note that your competence profile cannot be evaluated in advance, but only by the admission committee after completing the application process (described below).
A Grade Point Average (GPA) of 2.5 or better with respect to the German grading system. Candidates with an admissible degree (described above) and a GPA between 2.6 and 2.8 are invited for a short online interview in which their knowledge in calculus, linear algebra, algorithms and data structures is evaluated.
English proficiency at level B2 CEFR (vantage or upper intermediate) or six years of English classes at a German secondary school (Gymnasium). Applicants who have completed their university entrance qualifications or their first degree in English are not required to provide proof of proficiency in English.
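The grade thresholds above can be read as a simple decision rule. The following is a minimal sketch under stated assumptions: it covers only the GPA criterion on the German scale (where lower values are better), assumes the degree and language requirements are already met, and does not model the admission committee's evaluation. The function name and return labels are hypothetical, and grades the text does not address are reported as such rather than guessed.

```python
def gpa_route(gpa: float) -> str:
    """Classify a German-scale GPA (1.0 is best, lower is better).

    Only the two cases spelled out in the admission text are handled;
    every other value is reported as not covered.
    """
    if gpa <= 2.5:
        # "A GPA of 2.5 or better" on the German grading system.
        return "meets GPA requirement"
    if 2.6 <= gpa <= 2.8:
        # Candidates in this band are invited to a short online interview.
        return "online interview"
    # The text does not say what happens for other grades.
    return "not covered by the stated rules"


if __name__ == "__main__":
    for grade in (1.7, 2.5, 2.6, 2.8, 3.0):
        print(grade, "->", gpa_route(grade))
```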

Language skills

As the Master’s programme is taught in English, a certificate of upper-intermediate English (level B2) is mandatory. Even if your undergraduate degree was taught in English, we recommend submitting a language certificate.

For everyday life and for internships and working student jobs, we recommend solid knowledge (B1) of German, but a certificate is not necessary for the application.

Applications have to be submitted via the campus management portal campo.fau.de.

Do you need help or more information?

Our Student Advice and Career Service (IBZ) is the central point of contact for all questions about studying and starting a degree programme. Our Student Service Centres and subject advisors support you in planning your studies.

COMMENTS

How to write a great data science thesis
They will stress the importance of structure, substance and style. They will urge you to write down your methodology and results first, then progress to the literature review, introduction and conclusions and to write the summary or abstract last. To write clearly and directly with the reader's expectations always in mind.
Instructions for MSc Thesis
For a Data Science thesis, this part typically describes the method for the analysis. Chapter 5: Results. This chapter describes the results obtained when the methods of Chapter 4 are used on data. For a Computer Science thesis, this part typically describes the performance of the developed algorithm(s) on various synthetic and real datasets.
PDF Master Thesis: Data Science and Marketing Analytics
1.3 Structure of the thesis The remainder of the thesis is structured as follows. Chapter 2 gives an overview of the literature focused on attribution modeling in which topics such as attribution modeling, rule-based models, data-driven models, and explainable machine learning are discussed. Chapter 3 briefly discusses
Thesis Option
Data Science master's students can choose to satisfy the research experience requirement by selecting the thesis option. Students will spend the majority of their second year working on a substantial data science project that culminates in the submission and oral defense of a master's thesis. While all thesis projects must be related to data science, students are given leeway in finding a ...
Data Science Masters Theses // Arch : Northwestern University
Data Science Masters Theses. The Master of Science in Data Science program requires the successful completion of 12 courses to obtain a degree. These requirements cover six core courses, a leadership or project management course, two required courses corresponding to a declared specialization, two electives, and a capstone project or thesis.
Thesis guide • University of Passau
Structure+Hints for Seminar, Bachelor, Master and PhD Talks; Latex Template for MA/BA Thesis; Scientific Writing; Hints on Scientific Presentations - although focused on Theoretical Computer Science, most parts are also relevant for Computer Science in general (where proofs are not given in a formal way but by implementation and/or empirical ...
MSc in Data Science, Project Guide
RTDS+ (120-point thesis option) Contact Introduction. The project is an essential component of the Masters course. It is a substantial piece of full-time independent research in some area of data science. You will carry out your project under the individual supervision of a member of CDT staff.
A guide to backward paper writing for the data sciences
The high-level steps involved in the process of backward data science manuscript preparation. The square boxes at the top represent the important pre-writing steps in which you clarify the scientific and professional goals motivating your work. The rounded box represents the process of initial writing and revision.
How to structure a thesis
A typical thesis structure. 1. Abstract. The abstract is the overview of your thesis and generally very short. This section should highlight the main contents of your thesis "at a glance" so that someone who is curious about your work can get the gist quickly. Take a look at our guide on how to write an abstract for more info.
PDF Thesis topics for the master thesis Data Science and Business Analytics
Thesis topics for the master thesis Data Science and Business Analytics Topic 1: Logistic regression for modern data structures Promotor Gerda Claeskens Description Logistic regression is widely used for binary classification. In the classical setting with a fixed number of predictive variables p and a large sample size n, the likelihood ratio test
Thesis • University of Passau
Requirements for the Written Thesis: Thesis can be written in English or German. The structure should follow scientific rules. See our guide therefore.; A master thesis should range between 60 and 80 pages (excluding appendix, toc, and lists of figures/tables, references) depending on the complexity of the subject.
Thesis Projects and Research in DS
Thesis Projects and Research in DS. The Master's thesis is a mandatory course of the Master's program in Data Science. The thesis is supervised by a professor of the data science faculty list. Research in Data Science is a core elective for students in Data Science under the supervision of a data sci ...
PDF Thesis Guidelines
example assessment criteria Applied Data Science master. 6.2. Assessment Form for a Research Project - Graduate School of Natural Sciences Use of this form is mandatory for all large research projects, notably final thesis work ("afstuderen"). It must be filled out by the project supervisor and sent to the student desk (OSZ), Minnaert ...
Data Science, M.S.
Data Science, M.S. The plan in Data Science leads to the Master of Science degree. This plan is designed to equip students with the capability of integrating a wide spectrum of interdisciplinary knowledge and skills to uncover and utilize data to produce, apply and communicate value-adding intelligence for organizations and the society, in ...
10 Best Research and Thesis Topic Ideas for Data Science in 2022
In this article, we have listed 10 such research and thesis topic ideas to take up as data science projects in 2022. Handling practical video analytics in a distributed cloud: With increased dependency on the internet, sharing videos has become a mode of data and information exchange. The role of the implementation of the Internet of Things ...
Thesis
The structure of a thesis may vary slightly depending on the specific requirements of the institution, department, or field of study, but generally, it follows a specific format. ... interviews, experiments, or analyzing existing data. Write the Thesis: Once you have analyzed the data, you need to write the thesis. The thesis should follow a ...
Structuring your thesis
Structuring your thesis. The best structure for your HDR thesis will depend on your discipline and the research you aim to communicate. Before you begin writing your thesis, make sure you've read our advice on thesis preparation for information on the requirements you'll need to meet. Once you've done this, you can begin to think about how to ...
PDF Senior Thesis Guide
A: A "thesis" is a proposition or assertion that is supported by logical arguments and factual evidence, or data. A Senior Thesis should represent an analysis of some phenomenon, typically supported by original data. What makes your essay a "thesis" is that you go beyond narrative and description to include original data, analysis, and argument.
Cybersecurity data science: an overview from machine learning
In a computing context, cybersecurity is undergoing massive shifts in technology and its operations in recent days, and data science is driving the change. Extracting security incident patterns or insights from cybersecurity data and building corresponding data-driven model, is the key to make a security system automated and intelligent. To understand and analyze the actual phenomena with data ...