Pitchgrade

105 Data Mining Essay Topic Ideas & Examples

Data mining is a powerful tool that helps businesses and organizations uncover hidden patterns, trends, and insights from large datasets. It involves the process of extracting valuable information from raw data, which can then be used for various purposes such as improving decision-making, predicting future outcomes, and understanding customer behavior. If you are a student or a professional looking to write an essay on data mining, here are 105 topic ideas and examples to help you get started.

  • The importance of data mining in today's business world
  • Ethical considerations in data mining
  • The impact of data mining on privacy
  • How data mining is used in healthcare to improve patient outcomes
  • Predictive analytics: Using data mining to forecast future trends
  • Data mining techniques for fraud detection in financial institutions
  • The role of data mining in customer relationship management
  • The use of data mining in social media marketing
  • Data mining and its application in personalized advertising
  • The benefits of data mining in supply chain management
  • Text mining: Analyzing unstructured data to extract valuable insights
  • The challenges of big data mining
  • Data mining in e-commerce: Enhancing customer experience
  • The role of data mining in improving cybersecurity
  • Data mining and its impact on decision-making in organizations
  • The use of data mining in predicting stock market trends
  • Data mining and its role in recommendation systems
  • The benefits of data mining in the education sector
  • Data mining techniques for sentiment analysis
  • The ethical implications of data mining in government surveillance
  • Data mining in the gaming industry: Enhancing player experience
  • The role of data mining in personalized medicine
  • Data mining techniques for credit scoring and risk assessment
  • The use of data mining in sports analytics
  • Data mining and its impact on urban planning
  • Data mining and its role in weather forecasting
  • The challenges of data mining in social network analysis
  • Data mining techniques for detecting plagiarism in academic papers
  • Data mining and its application in predicting natural disasters
  • The role of data mining in improving transportation systems
  • Data mining and its impact on online dating platforms
  • Data mining for predicting customer churn in telecommunications industry
  • The use of data mining in optimizing energy consumption
  • Data mining techniques for detecting credit card fraud
  • Data mining and its role in personalized news recommendation
  • The benefits of data mining in human resources management
  • Data mining in healthcare for disease diagnosis and treatment
  • Data mining and its impact on online advertising
  • Data mining techniques for identifying patterns in gene expression data
  • The role of data mining in improving online learning platforms
  • Data mining and its application in criminal investigations
  • The use of data mining in optimizing manufacturing processes
  • Data mining techniques for predicting customer lifetime value
  • The benefits of data mining in predicting traffic congestion
  • Data mining and its role in predicting customer preferences
  • Data mining in environmental analysis and conservation efforts
  • Data mining and its impact on personalized financial planning
  • The challenges of data mining in healthcare data integration
  • Data mining techniques for analyzing social media sentiment
  • The role of data mining in improving public safety
  • Data mining and its application in fraud detection in insurance industry
  • The use of data mining in optimizing online search engines
  • Data mining techniques for predicting student performance in education
  • Data mining and its impact on improving online user experience
  • Data mining and its role in predicting customer satisfaction
  • The benefits of data mining in optimizing logistics and supply chain
  • Data mining in crime analysis and prevention
  • Data mining and its impact on personalization in online shopping
  • Data mining techniques for analyzing customer feedback and reviews
  • The role of data mining in improving healthcare resource allocation
  • Data mining and its application in predicting customer lifetime loyalty
  • The use of data mining in optimizing inventory management
  • Data mining techniques for detecting fraudulent insurance claims
  • Data mining and its role in predicting disease outbreaks
  • Data mining in sentiment analysis of political discourse
  • Data mining and its impact on improving online voting systems
  • The challenges of data mining in analyzing geospatial data
  • Data mining techniques for optimizing pricing strategies in retail
  • The benefits of data mining in predicting customer churn in telecom industry
  • Data mining and its role in improving road safety
  • Data mining and its application in predicting customer behavior
  • The use of data mining in optimizing energy distribution networks
  • Data mining techniques for detecting insider trading in financial markets
  • Data mining and its impact on personalized travel recommendations
  • Data mining and its role in predicting customer loyalty
  • The benefits of data mining in optimizing warehouse operations
  • Data mining in fraud detection and prevention in online transactions
  • Data mining and its impact on personalized healthcare recommendations
  • Data mining techniques for analyzing customer segmentation
  • The role of data mining in improving disaster response and recovery
  • Data mining and its application in predicting customer lifetime value
  • The use of data mining in optimizing fleet management
  • Data mining techniques for detecting money laundering activities
  • Data mining and its role in predicting customer preferences in online advertising
  • The benefits of data mining in optimizing service quality in hospitality industry
  • Data mining in predicting student dropout and improving retention
  • Data mining and its impact on personalized music recommendations
  • Data mining techniques for analyzing patterns in web usage data
  • The role of data mining in improving urban mobility and transportation systems
  • Data mining and its application in predicting customer satisfaction in retail
  • The use of data mining in optimizing healthcare resource allocation
  • Data mining techniques for detecting online identity theft
  • Data mining and its role in predicting customer lifetime loyalty in e-commerce
  • The benefits of data mining in optimizing delivery routes
  • Data mining in detecting patterns of online extremist behavior
  • Data mining and its impact on enhancing personalized learning experiences
  • Data mining techniques for analyzing customer churn in subscription-based services
  • The role of data mining in improving disaster risk reduction strategies
  • Data mining and its application in predicting customer behavior in online gaming
  • The use of data mining in optimizing maintenance schedules for industrial equipment
  • Data mining techniques for detecting healthcare fraud and abuse
  • Data mining and its role in predicting customer preferences in online travel booking
  • The benefits of data mining in optimizing waste management processes
  • Data mining in detecting patterns of cyberbullying behavior
  • Data mining and its impact on enhancing personalized financial advice

These topic ideas provide a wide range of options for your data mining essay. Whether you are interested in business applications, healthcare, social media, or any other field, there is a topic that suits your interests. Remember to choose a topic that you are passionate about and conduct thorough research to provide a well-informed and insightful essay on data mining.

© 2024 Pitchgrade

82 Data Mining Essay Topic Ideas & Examples

  • Ethical Implications of Data Mining by Government Institutions
    Critics of personal data mining insist that it infringes on the rights of an individual and results in the loss of sensitive information.
  • Levi’s Company’s Data Mining & Customer Analytics
    Levi’s, the renowned name in jeans, is feeling the heat of competition from a number of other brands, which have come upon the scene well after Levi’s but today appear to be approaching Levi’s market […]
  • Data Mining and Its Major Advantages
    Thus, it is possible to conclude that data mining is a convenient and effective way of processing information, which has many advantages.
  • Data Mining Role in Companies
    The increasing adoption of data mining in various sectors illustrates the potential of the technology regarding the analysis of data by entities that seek information crucial to their operations.
  • Disadvantages of Using Web 2.0 for Data Mining Applications
    This data can be confusing to the readers and may not be reliable. Lastly, with the use of Web 2.
  • The Data Mining Method in Healthcare and Education
    Thus, I would use data mining in both cases; however, before that, I would discover a way to improve the algorithms used for it.
  • Data Mining Tools and Data Mining Myths
    The first problem is correlated with keeping the identity of the person involved in data mining secret. One of the major myths regarding data mining is that it can replace domain knowledge.
  • Hybrid Data Mining Approach in Healthcare
    One of the healthcare projects that will call for the use of data mining is treatment evaluation. In this case, it is essential to realize that the main aim of health data mining is to […]
  • Terrorism and Data Mining Algorithms
    However, this is a necessary evil as the nation’s security has to be prioritized since these attacks lead to harm to a larger population compared to the infringements.
  • Transforming Coded and Text Data Before Data Mining
    However, to complete data mining, it is necessary to transform the data according to the techniques that are to be used in the process.
  • Data Mining and Machine Learning Algorithms
    The shortest distance of string between two instances defines the distance of measure. However, this is also not very clear as to which transformations are summed, and thus it aims to a probability with the […]
  • Summary of C4.5 Algorithm: Data Mining
    C4.5 algorithm: Each record from the set of data should be associated with one of the offered classes, which means that one of the attributes should be considered the class label.
  • Data Mining in Social Networks: Linkedin.com
    One of the ways to achieve the aim is to understand how users view data mining of their data on LinkedIn.
  • Ethnography and Data Mining in Anthropology
    The study of cultures is of great importance under normal circumstances to enhance the understanding of the same. Data mining is the success secret of ethnography.
  • Issues With Data Mining
    It is necessary to note that the usage of data mining helps the FBI to have access to the necessary information for terrorism and crime tracking.
  • Large Volume Data Handling: An Efficient Data Mining Solution
    Data mining is the process of sorting huge amounts of data and finding the relevant data. Data mining is widely used for the maintenance of data, which helps an organization a lot in […]
  • Cryptocurrency Exchange Market Prediction and Analysis Using Data Mining and Artificial Intelligence
    This paper aims to review the application of A.I. in the context of blockchain finance by examining scholarly articles to determine whether the A.I. algorithm can be used to analyze this financial market.
  • “Data Mining and Customer Relationship Marketing in the Banking Industry” by Chye & Gerry
    First of all, the article generally elaborates on the notion of customer relationship management, which is defined as “the process of predicting customer behavior and selecting actions to influence that behavior to benefit the company”.
  • Data Mining Techniques and Applications
    The use of data mining to detect disturbances in the ecosystem can help to avert problems that are destructive to the environment and to society.
  • Ethical Data Mining in the UAE Traffic Department
    The research question identified in assignment two is whether the implementation of business intelligence into the working process will beneficially influence the work of the Traffic Department […]
  • Canadian University Dubai and Data Mining
    The aim of mining data in the education environment is to enhance the quality of education for the mass through proactive and knowledge-based decision-making approaches.
  • Data Mining and Customer Relationship Management
    As such, CRM not only entails the integration of marketing, sales, customer service, and supply chain capabilities of the firm to attain elevated efficiencies and effectiveness in conveying customer value, but it obliges the organization […]
  • E-Commerce: Mining Data for Better Business Intelligence
    The method allowed the use of Intel and an example to build the study and the literature on data mining for business intelligence to analyze the findings.
  • Data Warehouse and Data Mining in Business
    The circumstances leading to the establishment and development of the concept of data warehousing were attributed to the fact that failure to have a data warehouse led to the need of putting in place large […]
  • Data Mining: Concepts and Methods
    Speed of the data mining process is important as it has a role to play in the relevance of the data mined. The accuracy of data is also another factor that can be used to measure […]
  • Data Mining Technologies
    According to Han & Kamber, data mining is the process of discovering correlations, patterns, trends or relationships by searching through a large amount of data that in most circumstances is stored in repositories, business databases […]
  • Data Mining: A Critical Discussion
    In recent times, the relatively new discipline of data mining has been a subject of widely published debate in mainstream forums and academic discourses, not only due to the fact that it forms a critical […]
  • Commercial Uses of Data Mining
    The data mining process entails the use of a large relational database to identify the correlations that exist in given data. The principal role of the applications is to sift the data to identify correlations.
  • A Discussion on the Acceptability of Data Mining
    Today, more than ever before, individuals, organizations and governments have access to seemingly endless amounts of data that has been stored electronically on the World Wide Web and the Internet, and thus it makes much […]
  • Applying Data Mining Technology for Insurance Rate Making: Automobile Insurance Example
  • Applebee’s, Travelocity and Others: Data Mining for Business Decisions
  • Applying Data Mining Procedures to a Customer Relationship
  • Business Intelligence as Competitive Tool of Data Mining
  • Overview of Accounting Information System Data Mining
  • Applying Data Mining Technique to Disassembly Sequence Planning
  • Approach for Image Data Mining Cultural Studies
  • Apriori Algorithm for the Data Mining of Global Cyberspace Security Issues
  • Database Data Mining: The Silent Invasion of Privacy
  • Data Management: Data Warehousing and Data Mining
  • Constructive Data Mining: Modeling Consumers’ Expenditure in Venezuela
  • Data Mining and Its Impact on Healthcare
  • Innovations and Perspectives in Data Mining and Knowledge Discovery
  • Data Mining and Machine Learning Methods for Cyber Security Intrusion Detection
  • Linking Data Mining and Anomaly Detection Techniques
  • Data Mining and Pattern Recognition Models for Identifying Inherited Diseases
  • Credit Card Fraud Detection Through Data Mining
  • Data Mining Approach for Direct Marketing of Banking Products
  • Constructive Data Mining: Modeling Argentine Broad Money Demand
  • Data Mining-Based Dispatching System for Solving the Pickup and Delivery Problem
  • Commercially Available Data Mining Tools Used in the Economic Environment
  • Data Mining Climate Variability as an Indicator of U.S. Natural Gas
  • Analysis of Data Mining in the Pharmaceutical Industry
  • Data Mining-Driven Analysis and Decomposition in Agent Supply Chain Management Networks
  • Credit Evaluation Model for Banks Using Data Mining
  • Data Mining for Business Intelligence: Multiple Linear Regression
  • Cluster Analysis for Diabetic Retinopathy Prediction Using Data Mining Techniques
  • Data Mining for Fraud Detection Using Invoicing Data
  • Jaeger Uses Data Mining to Reduce Losses From Crime and Waste
  • Data Mining for Industrial Engineering and Management
  • Business Intelligence and Data Mining – Decision Trees
  • Data Mining for Traffic Prediction and Intelligent Traffic Management System
  • Building Data Mining Applications for CRM
  • Data Mining Optimization Algorithms Based on the Swarm Intelligence
  • Big Data Mining: Challenges, Technologies, Tools, and Applications
  • Data Mining Solutions for the Business Environment
  • Overview of Big Data Mining and Business Intelligence Trends
  • Data Mining Techniques for Customer Relationship Management
  • Classification-Based Data Mining Approach for Quality Control in Wine Production
  • Data Mining With Local Model Specification Uncertainty
  • Employing Data Mining Techniques in Testing the Effectiveness of Modernization Theory
  • Enhancing Information Management Through Data Mining Analytics
  • Evaluating Feature Selection Methods for Learning in Data Mining Applications
  • Extracting Formations From Long Financial Time Series Using Data Mining
  • Financial and Banking Markets and Data Mining Techniques
  • Fraudulent Financial Statements and Detection Through Techniques of Data Mining
  • Harmful Impact Internet and Data Mining Have on Society
  • Informatics, Data Mining, Econometrics, and Financial Economics: A Connection
  • Integrating Data Mining Techniques Into Telemedicine Systems
  • Investigating Tobacco Usage Habits Using Data Mining Approach

IvyPanda. (2024, March 2). 82 Data Mining Essay Topic Ideas & Examples. https://ivypanda.com/essays/topic/data-mining-essay-topics/

StudyCorgi. "Data Mining Essay Topics." August 12, 2024. https://studycorgi.com/ideas/data-mining-essay-topics/.

These essay examples and topics on Data Mining were carefully selected by the StudyCorgi editorial team. They meet our highest standards in terms of grammar, punctuation, style, and fact accuracy. Please ensure you properly reference the materials if you’re using them to write your assignment.

This essay topic collection was updated on September 5, 2024.

Data Mining: Recently Published Documents

Distance Based Pattern Driven Mining for Outlier Detection in High Dimensional Big Dataset

Detecting outliers, or anomalies, is one of the vital issues in pattern-driven data mining. Outlier detection identifies inconsistent behavior in individual objects. It is an important area of data mining with applications such as detecting credit card fraud, discovering hacking, and uncovering criminal activity. Tools are needed to uncover the critical information hidden in extensive data. This paper investigates a novel method for detecting cluster outliers in a multidimensional dataset, capable of identifying both clusters and outliers in datasets containing noise. The proposed method can detect the groups and outliers left by the clustering process, like instant irregular sets of clusters (C) and outliers (O), to boost the results. The results obtained after applying the algorithm to the dataset improved on several parameters. For the comparative analysis, the average accuracy and the average recall are computed. The average accuracy is 74.05% for the existing COID algorithm and 77.21% for the proposed algorithm. The average recall is 81.19% for the existing algorithm and 89.51% for the proposed one, which shows that the proposed method performs better than the existing COID algorithm.
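To make the idea of distance-based outlier detection and the recall metric concrete, here is a minimal sketch. It is not the paper's method or the COID algorithm: it assumes a simple k-nearest-neighbour distance score, and the sample points, `k`, and `threshold` are hypothetical values chosen only for illustration.

```python
import math

def knn_outlier_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

def flag_outliers(points, k=2, threshold=3.0):
    """Return indices of points whose k-NN distance exceeds the threshold.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    scores = knn_outlier_scores(points, k)
    return [i for i, s in enumerate(scores) if s > threshold]

def recall(true_outliers, flagged):
    """Fraction of the true outliers that were flagged."""
    true_outliers = set(true_outliers)
    if not true_outliers:
        return 0.0
    return len(true_outliers & set(flagged)) / len(true_outliers)

# Hypothetical data: a tight cluster of four points plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
flagged = flag_outliers(points)
print(flagged, recall({4}, flagged))
```

A point far from every cluster has a large distance to its k-th neighbour, so it receives a high score; recall then measures how many of the known outliers the threshold actually catches.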

Implementation of Data Mining Technology in Bonded Warehouse Inbound and Outbound Goods Trade

For taxed goods, the actual freight is generally determined by multiplying the allocated freight per kilogram by the actual outgoing weight, based on the outgoing order number on the outgoing bill. Because conventional logistics cannot keep up with the rapid response that e-commerce orders demand of logistics, this work discusses the implementation of data mining technology in bonded warehouse inbound and outbound goods trade. Specifically, a bonded warehouse decision-making system was developed, comprising a data warehouse, a conceptual model, an online analytical processing system, a human-computer interaction module, and a web data sharing platform. The statistical query module can be used to perform statistics and queries on warehousing operations. After the whole warehousing business process was optimized, it takes only 19.1 hours to obtain the actual freight, nearly one third less than before optimization. This study could create a better environment for the development of China's processing trade.
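The freight rule in this abstract is simple arithmetic, sketched below. The rate and weight are hypothetical inputs; the pre-optimization time is only an estimate that assumes the saving was exactly one third, whereas the abstract says "nearly one third".

```python
def actual_freight(rate_per_kg, outgoing_weight_kg):
    """Actual freight = allocated freight per kilogram x actual outgoing weight."""
    return rate_per_kg * outgoing_weight_kg

def time_before_optimization(time_after_hours, fraction_saved):
    """Back out the earlier processing time from the time after and the fraction saved."""
    return time_after_hours / (1 - fraction_saved)

# Hypothetical shipment: 2.5 currency units per kg, 120 kg shipped.
print(actual_freight(2.5, 120))

# 19.1 hours after optimization, assuming exactly one third was saved.
print(round(time_before_optimization(19.1, 1 / 3), 2))
```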

Multi-objective economic load dispatch method based on data mining technology for large coal-fired power plants

User activity classification and domain-wise ranking through social interactions.

Twitter has gained significant prevalence among users across numerous domains, in the majority of countries, and among different age groups. It serves as a real-time micro-blogging service for communication and opinion sharing. Twitter shares its data for research and study purposes by exposing open APIs, which makes it the most suitable source of data for social media analytics. Applying data mining and machine learning techniques to tweets is attracting growing interest. The most prominent problem in social media analytics is automatically identifying and ranking influencers. This research aims to detect users' topics of interest in social media and rank users based on specific topics, domains, etc. A few hybrid parameters are also distinguished in this research, based on the post's content, the post's metadata, the user's profile, and the user's network features, to capture different aspects of being influential; these are used in the ranking algorithm. The results show that the proposed approach is effective in both the classification and the ranking of individuals in a cluster.

A data mining analysis of COVID-19 cases in states of United States of America

Epidemic diseases can be extremely dangerous because of their harmful effects. They may negatively affect economies, businesses, the environment, people, and the workforce. In this paper, some of the factors related to the COVID-19 pandemic are examined using data mining methodologies and approaches. The analysis uncovered a number of rules and insights, and the performance of the data mining algorithms was evaluated. According to the results, the JRip algorithm had the highest correct classification rate and the lowest root mean squared error (RMSE). Considering classification rate and RMSE, JRip can be considered an effective method for understanding factors related to coronavirus-caused deaths.
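The two evaluation measures named here, correct classification rate and RMSE, can be computed as in the sketch below. The labels and predicted probabilities are made-up illustrative values, not data from the paper, and the functions are generic definitions rather than the paper's evaluation pipeline.

```python
import math

def classification_rate(y_true, y_pred):
    """Share of instances whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def rmse(y_true, y_score):
    """Root mean squared error between true values and predicted scores."""
    squared = [(t - s) ** 2 for t, s in zip(y_true, y_score)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical binary labels (1 = high death count, 0 = low).
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
y_score = [0.5, 0.5, 0.5, 0.5]

print(classification_rate(y_true, y_pred))
print(rmse(y_true, y_score))
```

A higher classification rate and a lower RMSE both indicate a better fit, which is the basis on which the abstract ranks JRip above the other algorithms.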

Exploring distributed energy generation for sustainable development: A data mining approach

A comprehensive guideline for bengali sentiment annotation.

Sentiment Analysis (SA) is a Natural Language Processing (NLP) and Information Extraction (IE) task that primarily aims to determine whether the feelings a writer expresses are positive or negative by analyzing a large number of documents. SA is also widely studied in data mining, web mining, text mining, and information retrieval. The fundamental task in sentiment analysis is to classify the polarity of a given piece of content as Positive, Negative, or Neutral. Although extensive research has been conducted in this area of computational linguistics, most of it has been carried out in the context of the English language. However, Bengali sentiment expression has varying degrees of sentiment labels, which can plausibly differ from those of English. Therefore, sentiment assessment for the Bengali language needs to be developed and executed properly. In sentiment analysis, the predictive potential of an automatic model depends entirely on the quality of dataset annotation. Bengali sentiment annotation is a challenging task because of the language's diverse structures (syntax) and its different degrees of innate sentiment (i.e., weakly and strongly positive/negative sentiments). Thus, in this article, we propose a novel and precise guideline for researchers, linguistic experts, and referees to annotate Bengali sentences accurately, with a view to building effective datasets for automatic sentiment prediction.

Capturing Dynamics of Information Diffusion in SNS: A Survey of Methodology and Techniques

Studying information diffusion in SNS (Social Networks Service) has remarkable significance in both academia and industry. Theoretically, it boosts the development of other subjects such as statistics, sociology, and data mining. Practically, diffusion modeling provides fundamental support for many downstream applications (e.g., public opinion monitoring, rumor source identification, and viral marketing). Tremendous efforts have been devoted to this area to understand and quantify information diffusion dynamics. This survey investigates and summarizes the emerging distinguished works in diffusion modeling. We first put forward a unified information diffusion concept in terms of three components: information, user decision, and social vectors, followed by a detailed introduction of the methodologies for diffusion modeling. And then, a new taxonomy adopting hybrid philosophy (i.e., granularity and techniques) is proposed, and we made a series of comparative studies on elementary diffusion models under our taxonomy from the aspects of assumptions, methods, and pros and cons. We further summarized representative diffusion modeling in special scenarios and significant downstream tasks based on these elementary models. Finally, open issues in this field following the methodology of diffusion modeling are discussed.

The Influence of E-book Teaching on the Motivation and Effectiveness of Learning Law by Using Data Mining Analysis

This paper studies the motivation of learning law, compares the teaching effectiveness of two different teaching methods, e-book teaching and traditional teaching, and analyses the influence of e-book teaching on the effectiveness of law by using big data analysis. From the perspective of law student psychology, e-book teaching can attract students' attention, stimulate students' interest in learning, deepen knowledge impression while learning, expand knowledge, and ultimately improve the performance of practical assessment. With a small sample size, there may be some deficiencies in the research results' representativeness. To stimulate the learning motivation of law as well as some other theoretical disciplines in colleges and universities has particular referential significance and provides ideas for the reform of teaching mode at colleges and universities. This paper uses a decision tree algorithm in data mining for the analysis and finds out the influencing factors of law students' learning motivation and effectiveness in the learning process from students' perspective.
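The abstract does not say which decision tree algorithm was used. As one illustration of how such a tree picks influencing factors, the sketch below computes ID3-style information gain; the choice of ID3, the feature names, and the survey rows are all hypothetical assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, feature, target):
    """Entropy reduction in `target` obtained by splitting `rows` on `feature`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r[target] for r in rows if r[feature] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical survey rows; not data from the study.
rows = [
    {"uses_ebook": True, "studies_daily": True, "motivated": True},
    {"uses_ebook": True, "studies_daily": False, "motivated": True},
    {"uses_ebook": False, "studies_daily": True, "motivated": False},
    {"uses_ebook": False, "studies_daily": False, "motivated": False},
]

features = ["uses_ebook", "studies_daily"]
best = max(features, key=lambda f: information_gain(rows, f, "motivated"))
print(best)
```

The feature with the largest gain becomes the root split; in this toy data the e-book flag separates motivated from unmotivated students perfectly, so it is chosen first.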

Intelligent Data Mining based Method for Efficient English Teaching and Cultural Analysis

The emergence of online education has greatly helped improve the quality of traditional English teaching. However, it only moves the teaching process from offline to online and does not really change the essence of traditional English teaching. In this work, we study an intelligent English teaching method to further improve the quality of English teaching. Specifically, a random forest is first used to analyze and extract the grammatical and syntactic features of English text. Then, a decision-tree-based method is proposed to predict grammar or syntax issues in the English text. The evaluation results indicate that the proposed method can effectively improve the accuracy of English grammar or syntax recognition.


Data Mining

  • Data Science
  • Data and Artificial Intelligence

Student theses


3d face reconstruction using deep learning.

Student thesis : Master

Achieving Long Term Fairness through Curiosity Driven Reinforcement Learning: How intrinsic motivation influences fairness in algorithmic decision making

Activity recognition using deep learning in videos under clinical setting, a data cleaning assistant.

Student thesis : Bachelor

A Data Cleaning Assistant for Machine Learning

  • A deep learning approach for clustering a multi-class dataset
  • Aerial imagery pixel-level segmentation
  • A framework for understanding business process remaining time predictions
  • A hybrid model for pedestrian motion prediction
  • Algorithms for center-based trajectory clustering
  • Allocation decision-making in service supply chain with deep reinforcement learning
  • Analyzing policy gradient approaches towards rapid policy transfer
  • An empirical study on dynamic curriculum learning in information retrieval
  • An explainable approach to multi-contextual fake news detection
  • An exploration and evaluation of concept based interpretability methods as a measure of representation quality in neural networks
  • Anomaly detection in image data sets using disentangled representations
  • Anomaly detection in polysomnography signals using AI
  • Anomaly detection in text data using deep generative models
  • Anomaly detection on dynamic graph
  • Anomaly detection on finite multivariate time series from semi-automated screwing applications
  • Anomaly detection on multivariate time series using GANs
  • Anomaly detection on vibration data
  • Application of P&ID symbol detection and classification for generation of material take-off documents (MTOs)
  • Applications of deep generative models to tokamak nuclear fusion
  • A similarity based meta-learning approach to building pipeline portfolios for automated machine learning
  • Aspect-based few-shot learning
  • Aspect-based few-shot learning
  • Assessing bias and fairness in machine learning through a causal lens
  • Assessing fairness in anomaly detection: a framework for developing a context-aware fairness tool to assess rule-based models
  • A study of an open-ended strategy for learning complex locomotion skills
  • A systematic determination of metrics for classification tasks in OpenML
  • A universally applicable EMM framework
  • Automated machine learning with gradient boosting and meta-learning
  • Automated object recognition of solar panels in aerial photographs: a case study in the Liander service area
  • Automatic data cleaning
  • Automatic scoring of short open-ended questions
  • Automatic synthesis of machine learning pipelines consisting of pre-trained models for multimodal data
  • Automating string encoding in AutoML
  • Autoregressive neural networks to model electroencephalograpy signals
  • Balancing efficiency and fairness on ride-hailing platforms via reinforcement learning
  • Benchmarking audio deepfake detection
  • Better clustering evaluation for the OpenML evaluation engine
  • Bi-level pipeline optimization for scalable AutoML
  • Block-sparse evolutionary training using weight momentum evolution: training methods for hardware efficient sparse neural networks
  • Boolean matrix factorization and completion
  • Bootstrap hypothesis tests for evaluating subgroup descriptions in exceptional model mining
  • Bottom-up search: a distance-based search strategy for supervised local pattern mining on multi-dimensional target spaces
  • Bridging the domain-gap in computer vision tasks
  • Can time series forecasting be automated: a benchmark and analysis


Research Topics & Ideas: Data Science


PS – This is just the start…

We know it’s exciting to run through a list of research topics, but please keep in mind that this list is just a starting point. The topic ideas provided here are intentionally broad and generic, so you will need to develop them further. Nevertheless, they should inspire some ideas for your project.


Data Science-Related Research Topics

  • Developing machine learning models for real-time fraud detection in online transactions.
  • The use of big data analytics in predicting and managing urban traffic flow.
  • Investigating the effectiveness of data mining techniques in identifying early signs of mental health issues from social media usage.
  • The application of predictive analytics in personalizing cancer treatment plans.
  • Analyzing consumer behavior through big data to enhance retail marketing strategies.
  • The role of data science in optimizing renewable energy generation from wind farms.
  • Developing natural language processing algorithms for real-time news aggregation and summarization.
  • The application of big data in monitoring and predicting epidemic outbreaks.
  • Investigating the use of machine learning in automating credit scoring for microfinance.
  • The role of data analytics in improving patient care in telemedicine.
  • Developing AI-driven models for predictive maintenance in the manufacturing industry.
  • The use of big data analytics in enhancing cybersecurity threat intelligence.
  • Investigating the impact of sentiment analysis on brand reputation management.
  • The application of data science in optimizing logistics and supply chain operations.
  • Developing deep learning techniques for image recognition in medical diagnostics.
  • The role of big data in analyzing climate change impacts on agricultural productivity.
  • Investigating the use of data analytics in optimizing energy consumption in smart buildings.
  • The application of machine learning in detecting plagiarism in academic works.
  • Analyzing social media data for trends in political opinion and electoral predictions.
  • The role of big data in enhancing sports performance analytics.
  • Developing data-driven strategies for effective water resource management.
  • The use of big data in improving customer experience in the banking sector.
  • Investigating the application of data science in fraud detection in insurance claims.
  • The role of predictive analytics in financial market risk assessment.
  • Developing AI models for early detection of network vulnerabilities.


Data Science Research Ideas (Continued)

  • The application of big data in public transportation systems for route optimization.
  • Investigating the impact of big data analytics on e-commerce recommendation systems.
  • The use of data mining techniques in understanding consumer preferences in the entertainment industry.
  • Developing predictive models for real estate pricing and market trends.
  • The role of big data in tracking and managing environmental pollution.
  • Investigating the use of data analytics in improving airline operational efficiency.
  • The application of machine learning in optimizing pharmaceutical drug discovery.
  • Analyzing online customer reviews to inform product development in the tech industry.
  • The role of data science in crime prediction and prevention strategies.
  • Developing models for analyzing financial time series data for investment strategies.
  • The use of big data in assessing the impact of educational policies on student performance.
  • Investigating the effectiveness of data visualization techniques in business reporting.
  • The application of data analytics in human resource management and talent acquisition.
  • Developing algorithms for anomaly detection in network traffic data.
  • The role of machine learning in enhancing personalized online learning experiences.
  • Investigating the use of big data in urban planning and smart city development.
  • The application of predictive analytics in weather forecasting and disaster management.
  • Analyzing consumer data to drive innovations in the automotive industry.
  • The role of data science in optimizing content delivery networks for streaming services.
  • Developing machine learning models for automated text classification in legal documents.
  • The use of big data in tracking global supply chain disruptions.
  • Investigating the application of data analytics in personalized nutrition and fitness.
  • The role of big data in enhancing the accuracy of geological surveying for natural resource exploration.
  • Developing predictive models for customer churn in the telecommunications industry.
  • The application of data science in optimizing advertisement placement and reach.


Recent Data Science-Related Studies

Below, we’ve included a selection of recent studies to help refine your thinking. These are actual studies, so they can provide some useful insight as to what a research topic looks like in practice.

  • Data Science in Healthcare: COVID-19 and Beyond (Hulsen, 2022)
  • Auto-ML Web-application for Automated Machine Learning Algorithm Training and evaluation (Mukherjee & Rao, 2022)
  • Survey on Statistics and ML in Data Science and Effect in Businesses (Reddy et al., 2022)
  • Visualization in Data Science VDS @ KDD 2022 (Plant et al., 2022)
  • An Essay on How Data Science Can Strengthen Business (Santos, 2023)
  • A Deep study of Data science related problems, application and machine learning algorithms utilized in Data science (Ranjani et al., 2022)
  • You Teach WHAT in Your Data Science Course?!? (Posner & Kerby-Helm, 2022)
  • Statistical Analysis for the Traffic Police Activity: Nashville, Tennessee, USA (Tufail & Gul, 2022)
  • Data Management and Visual Information Processing in Financial Organization using Machine Learning (Balamurugan et al., 2022)
  • A Proposal of an Interactive Web Application Tool QuickViz: To Automate Exploratory Data Analysis (Pitroda, 2022)
  • Applications of Data Science in Respective Engineering Domains (Rasool & Chaudhary, 2022)
  • Jupyter Notebooks for Introducing Data Science to Novice Users (Fruchart et al., 2022)
  • Towards a Systematic Review of Data Science Programs: Themes, Courses, and Ethics (Nellore & Zimmer, 2022)
  • Application of data science and bioinformatics in healthcare technologies (Veeranki & Varshney, 2022)
  • TAPS Responsibility Matrix: A tool for responsible data science by design (Urovi et al., 2023)
  • Data Detectives: A Data Science Program for Middle Grade Learners (Thompson & Irgens, 2022)
  • Machine Learning for Non-Majors: A White Box Approach (Mike & Hazzan, 2022)
  • Components of Data Science and Its Applications (Paul et al., 2022)
  • Analysis on the Application of Data Science in Business Analytics (Wang, 2022)


  • PeerJ Comput Sci


Adaptations of data mining methodologies: a systematic literature review

Associated data.

The following information was supplied regarding data availability:

The SLR protocol (also shared via an online repository) and the corpus with definitions and mappings are provided as a Supplemental File.

The use of end-to-end data mining methodologies such as CRISP-DM, KDD process, and SEMMA has grown substantially over the past decade. However, little is known as to how these methodologies are used in practice. In particular, the question of whether data mining methodologies are used ‘as-is’ or adapted for specific purposes, has not been thoroughly investigated. This article addresses this gap via a systematic literature review focused on the context in which data mining methodologies are used and the adaptations they undergo. The literature review covers 207 peer-reviewed and ‘grey’ publications. We find that data mining methodologies are primarily applied ‘as-is’. At the same time, we also identify various adaptations of data mining methodologies and we note that their number is growing rapidly. The dominant adaptations pattern is related to methodology adjustments at a granular level (modifications) followed by extensions of existing methodologies with additional elements. Further, we identify two recurrent purposes for adaptation: (1) adaptations to handle Big Data technologies, tools and environments (technological adaptations); and (2) adaptations for context-awareness and for integrating data mining solutions into business processes and IT systems (organizational adaptations). The study suggests that standard data mining methodologies do not pay sufficient attention to deployment issues, which play a prominent role when turning data mining models into software products that are integrated into the IT architectures and business processes of organizations. We conclude that refinements of existing methodologies aimed at combining data, technological, and organizational aspects, could help to mitigate these gaps.

Introduction

The availability of Big Data has stimulated widespread adoption of data mining and data analytics in research and in business settings (Columbus, 2017). Over the years, a number of data mining methodologies have been proposed, and these are used extensively in practice and in research. However, little is known about which data mining methodologies are applied and how, and the question has been neither widely researched nor discussed. Further, there is no consolidated view on what constitutes quality of the methodological process in data mining and data analytics, how data mining and data analytics are applied in organizational settings, and how application practices relate to each other. This motivates the need for a comprehensive survey of the field.

There have been surveys, quasi-surveys, and summaries in related fields. Notably, there have been two systematic literature reviews. The Systematic Literature Review (hereinafter, SLR) is the most suitable and widely used research method for identifying, evaluating, and interpreting research on a particular research question, topic, or phenomenon (Kitchenham, Budgen & Brereton, 2015). These reviews concerned Big Data Analytics, but not general-purpose data mining methodologies. Adrian et al. (2004) conducted an SLR on the implementation of Big Data Analytics (BDA), specifically the capability components necessary for BDA value discovery and realization. The authors identified BDA implementation studies, determined their main focus areas, and discussed BDA applications and capability components in detail. Saltz & Shamshurin (2016) published an SLR on Big Data team process methodologies. They identified a lack of standards for how Big Data projects are executed, and highlighted growing research in this area and the potential benefits of such a process standard. Additionally, they synthesized a list of the 33 most important success factors for executing Big Data activities. Finally, some studies have surveyed data mining techniques and applications across domains, yet they focus on data mining process artifacts and outcomes (Madni, Anwar & Shah, 2017; Liao, Chu & Hsiao, 2012) rather than on end-to-end process methodology.

A number of surveys have been conducted in domain-specific settings such as hospitality, accounting, education, manufacturing, and banking. Mariani et al. (2018) conducted an SLR of Business Intelligence (BI) and Big Data in the hospitality and tourism context. Amani & Fadlalla (2017) explored the application of data mining methods in accounting, while Romero & Ventura (2013) investigated educational data mining. Similarly, Hassani, Huang & Silva (2018) addressed data mining application case studies in banking and explored them along three dimensions: topics, applied techniques, and software. All of these studies were performed by means of systematic literature reviews. Lastly, Bi & Cochran (2014) undertook a standard literature review of Big Data Analytics and its applications in manufacturing.

Apart from domain-specific studies, there have been very few general-purpose surveys that give a comprehensive overview of existing data mining methodologies and classify and contextualize them. A valuable synthesis was presented by Kurgan & Musilek (2006) as a comparative study of the state of the art in data mining methodologies. The study was not an SLR and focused on a comprehensive comparison of the phases, processes, and activities of data mining methodologies; the application aspect was summarized only briefly, as application statistics by industry and citations. Three more comparative, non-SLR studies were undertaken by Marban, Mariscal & Segovia (2009), Mariscal, Marbán & Fernández (2010), and, most recently and most closely related, Martínez-Plumed et al. (2017). They followed the same pattern, systematizing existing data mining frameworks through comparative analysis. There, the purpose and context of the consolidation was even more practical: to support the derivation and proposal of a new artifact, that is, a novel data mining methodology. The majority of these general surveys are more than a decade old and have natural limitations because they are (1) non-SLR studies and (2) restricted to comparing methodologies in terms of phases, activities, and other elements.

The key characteristic common to all of these studies is that data mining methodologies are treated as normative and standardized (‘one-size-fits-all’) processes. A complementary perspective, not considered in the above studies, is that data mining methodologies are not normative standardized processes but frameworks that need to be specialized to different industry domains, organizational contexts, and business objectives. In the last few years, a number of extensions and adaptations of data mining methodologies have emerged, which suggests that existing methodologies are not sufficient to cover the needs of all application domains. In particular, extensions of data mining methodologies have been proposed in the medical domain (Niaksu, 2015), the educational domain (Tavares, Vieira & Pedro, 2017), the industrial engineering domain (Huber et al., 2019; Solarte, 2002), and software engineering (Marbán et al., 2007, 2009). However, little attention has been given to studying how data mining methodologies are applied and used in industry settings; so far, only non-scientific practitioners’ surveys provide such evidence.

Given this research gap, the central objective of this article is to investigate how data mining methodologies are applied by researchers and practitioners, both in their generic (standardized) form and in specialized settings. This is achieved by investigating if data mining methodologies are applied ‘as-is’ or adapted, and for what purposes such adaptations are implemented.

Guided by the Systematic Literature Review method, we first identified a corpus of primary studies covering both peer-reviewed and ‘grey’ literature from 1997 to 2018. An analysis of these studies led us to a taxonomy of uses of data mining methodologies, focusing on the distinction between ‘as is’ usage and various types of methodology adaptations. By analyzing different types of methodology adaptations, this article identifies potential gaps in standard data mining methodologies at both the technological and the organizational levels.

The rest of the article is organized as follows. The Background section provides an overview of key concepts of data mining and associated methodologies. Next, Research Design describes the research methodology. The Findings and Discussion section presents the study results and their associated interpretation. Finally, threats to validity are addressed in Threats to Validity while the Conclusion summarizes the findings and outlines directions for future work.

Background

This section introduces the main data mining concepts and provides an overview of existing data mining methodologies and their evolution.

Data mining is defined as a set of rules, processes, and algorithms designed to generate actionable insights, extract patterns, and identify relationships from large datasets (Morabito, 2016). Data mining incorporates automated data extraction, processing, and modeling by means of a range of methods and techniques. In contrast, data analytics refers to techniques used to analyze and acquire intelligence from data (including ‘big data’) (Gandomi & Haider, 2015) and is positioned as a broader field, encompassing a wider spectrum of methods that includes both statistical and data mining methods (Chen, Chiang & Storey, 2012). A number of algorithms have been developed in the statistics, machine learning, and artificial intelligence domains to support and enable data mining. Statistical approaches came first, but they inherently come with limitations, the best known being rigid assumptions about data distributions. Machine learning techniques gained popularity because they impose fewer restrictions while deriving understandable patterns from data (Bose & Mahapatra, 2001).

Data mining projects commonly follow a structured process or methodology, as exemplified by Mariscal, Marbán & Fernández (2010) and Marban, Mariscal & Segovia (2009). A data mining methodology specifies tasks, inputs, and outputs, and provides guidelines and instructions on how the tasks are to be executed (Mariscal, Marbán & Fernández, 2010). Thus, a data mining methodology provides a set of guidelines for executing a set of tasks to achieve the objectives of a data mining project (Mariscal, Marbán & Fernández, 2010).

The foundations of structured data mining methodologies were first proposed by Fayyad, Piatetsky-Shapiro & Smyth (1996a, 1996b, 1996c) and were initially related to Knowledge Discovery in Databases (KDD). KDD presents a conceptual process model of computational theories and tools that support the extraction of information (knowledge) from data (Fayyad, Piatetsky-Shapiro & Smyth, 1996a). In KDD, the overall approach to knowledge discovery includes data mining as a specific step. As such, KDD, with its nine main steps (exhibited in Fig. 1), has the advantage of considering data storage and access, algorithm scaling, interpretation and visualization of results, and human-computer interaction (Fayyad, Piatetsky-Shapiro & Smyth, 1996a, 1996c). The introduction of KDD also formalized a clearer distinction between data mining and data analytics, as formulated, for example, in Tsai et al. (2015): “…by the data analytics, we mean the whole KDD process, while by the data analysis, we mean the part of data analytics that is aimed at finding the hidden information in the data, such as data mining”.

[Figure 1: The nine main steps of the KDD process.]

The main steps of KDD are as follows:

  • Step 1: Learning application domain: The first step is to develop an understanding of the application domain and relevant prior knowledge, and then to identify the goal of the KDD process from the customer’s viewpoint.
  • Step 2: Dataset creation: The second step involves selecting a dataset, focusing on a subset of variables or data samples on which discovery is to be performed.
  • Step 3: Data cleaning and processing: In the third step, basic operations to remove noise or outliers are performed. This step also covers collecting the information needed to model or account for noise, deciding on strategies for handling missing data fields, and accounting for data types, schema, and the mapping of missing and unknown values.
  • Step 4: Data reduction and projection: Here, useful features are found to represent the data, depending on the goal of the task, and transformation methods are applied to find an optimal feature set for the data.
  • Step 5: Choosing the function of data mining: In the fifth step, the target outcome (e.g., summarization, classification, regression, clustering) is defined.
  • Step 6: Choosing data mining algorithm: The sixth step concerns selecting the method(s) to search for patterns in the data, deciding which models and parameters are appropriate, and matching a particular data mining method with the overall criteria of the KDD process.
  • Step 7: Data mining: In the seventh step, the data is mined, that is, searched for patterns of interest in a particular representational form or a set of such representations, such as classification rules or trees, regression, and clustering.
  • Step 8: Interpretation: In this step, redundant and irrelevant patterns are filtered out, and relevant patterns are interpreted and visualized in such a way as to make the result understandable to users.
  • Step 9: Using discovered knowledge: In the last step, the results are incorporated into the performance system, documented and reported to stakeholders, and used as a basis for decisions.
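The steps above can be sketched as a small pipeline. This is only an illustrative toy under stated assumptions, not an implementation from the KDD literature: the records in `RAW`, the assumed goal (separating "high" from "low" spenders), the `max_spend` cutoff, and the one-threshold rule used as the mining algorithm are all hypothetical choices made to keep the example short.

```python
from statistics import mean

# Hypothetical raw records. Step 1 (learning the application domain) is
# represented only by the assumed goal: separate "high" from "low" spenders.
RAW = [
    {"id": 1, "visits": 2, "spend": 20.0, "label": "low"},
    {"id": 2, "visits": 3, "spend": 25.0, "label": "low"},
    {"id": 3, "visits": 9, "spend": 90.0, "label": "high"},
    {"id": 4, "visits": 8, "spend": None, "label": "high"},   # missing value
    {"id": 5, "visits": 10, "spend": 110.0, "label": "high"},
    {"id": 6, "visits": 1, "spend": 15.0, "label": "low"},
    {"id": 7, "visits": 4, "spend": 9000.0, "label": "low"},  # outlier
]


def create_dataset(raw):
    """Step 2: keep only the variables the analysis will use."""
    return [{k: r[k] for k in ("visits", "spend", "label")} for r in raw]


def clean(rows, max_spend=1000.0):
    """Step 3: drop rows with a missing spend or an implausibly large one."""
    return [r for r in rows if r["spend"] is not None and r["spend"] <= max_spend]


def reduce_and_project(rows):
    """Step 4: project each row onto the single feature used for mining."""
    return [(r["spend"], r["label"]) for r in rows]


def mine(points):
    """Steps 5-7: task = classification, algorithm = one-threshold rule.

    The threshold is the midpoint between the mean spend of each class.
    """
    low = mean(x for x, label in points if label == "low")
    high = mean(x for x, label in points if label == "high")
    return (low + high) / 2


def interpret(threshold):
    """Step 8: express the discovered pattern in a form users can read."""
    return f"spend > {threshold:.1f} => high, otherwise low"


def use(threshold, spend):
    """Step 9: apply the discovered rule to a new case."""
    return "high" if spend > threshold else "low"


if __name__ == "__main__":
    threshold = mine(reduce_and_project(clean(create_dataset(RAW))))
    print(interpret(threshold))
    print(use(threshold, 75.0))
```

Each function maps to one or more KDD steps, which makes the sequential hand-off between steps explicit; a real project would replace every stage with something far richer.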

The KDD process became dominant in industrial and academic domains (Kurgan & Musilek, 2006; Marban, Mariscal & Segovia, 2009). Also, as the timeline-based evolution of data mining methodologies and process models shows (Fig. 2 below), the original KDD data mining model served as the basis for other methodologies and process models, which addressed various gaps and deficiencies of the original KDD process. These approaches extended the initial KDD framework, yet the degree of extension varied, ranging from process restructuring to a complete change in focus. For example, Brachman & Anand (1996) and later Gertosio & Dussauchoy (2004) (in the form of a case study) introduced practical adjustments based on the iterative and interactive nature of the process. In their view, the complete KDD process was enhanced with supplementary tasks, and the focus shifted to the user’s point of view (a human-centered approach), highlighting the decisions that the user needs to make in the course of the data mining process. In contrast, Cabena et al. (1997) proposed a different number of steps, emphasizing and detailing data processing and discovery tasks. Similarly, in a series of works, Anand & Büchner (1998), Anand et al. (1998), and Buchner et al. (1999) presented additional data mining process steps, concentrating on adapting the data mining process to practical settings. They focused on cross-sales (the entire life cycle of an online customer), later incorporating an internet data discovery process (web-based mining). Further, the Two Crows data mining process model is a consultancy-originated framework that defines the steps differently but remains close to the original KDD. Finally, SEMMA (Sample, Explore, Modify, Model and Assess), based on KDD, was developed by the SAS Institute in 2005 (SAS Institute Inc., 2017). It is defined as a logical organization of the functional toolset of SAS Enterprise Miner for carrying out the core tasks of data mining. Compared to KDD, it is a vendor-specific process model, which limits its application in different environments. It also skips two steps of the original KDD process (‘Learning Application Domain’ and ‘Using of Discovered Knowledge’) that are regarded as essential for the success of a data mining project (Mariscal, Marbán & Fernández, 2010). In terms of adoption, the new KDD-based proposals received limited attention across academia and industry (Kurgan & Musilek, 2006; Marban, Mariscal & Segovia, 2009). Subsequently, most of these methodologies converged into the CRISP-DM methodology.

[Figure 2: Timeline-based evolution of data mining methodologies and process models.]

Additionally, only two non-KDD-based approaches have been proposed alongside the extensions to KDD. The first is the 5A’s approach presented by De Pisón Ascacbar (2003) and used by the SPSS vendor. The key contribution of this approach is the addition of an ‘Automate’ step, while its disadvantage is the omission of a ‘Data Understanding’ step. The second approach is 6-Sigma, an industry-originated method for improving quality and customer satisfaction (Pyzdek & Keller, 2003). It has been successfully applied to data mining projects in conjunction with the DMAIC performance improvement model (Define, Measure, Analyze, Improve, Control).

In 2000, in response to common issues and needs (Marban, Mariscal & Segovia, 2009), an industry-driven methodology called the Cross-Industry Standard Process for Data Mining (CRISP-DM) was introduced as an alternative to KDD. It also consolidated the original KDD model and its various extensions. While CRISP-DM builds upon KDD, it consists of six phases that are executed in iterations (Marban, Mariscal & Segovia, 2009). This iterative execution is the most distinguishing feature of CRISP-DM compared to the initial KDD, which assumes a sequential execution of its steps. CRISP-DM, much like KDD, aims to provide practitioners with guidelines for performing data mining on large datasets. However, CRISP-DM, with its six main steps and a total of 24 tasks and outputs, is more refined than KDD. The main steps of CRISP-DM, as depicted in Fig. 3 below, are as follows:

  • Phase 1: Business understanding: The focus of the first step is to gain an understanding of the project objectives and requirements from a business perspective followed by converting these into data mining problem definitions. Presentation of a preliminary plan to achieve the objectives are also included in this first step.
  • Phase 2: Data understanding: This step begins with an initial data collection and proceeds with activities in order to get familiar with the data, identify data quality issues, discover first insights into the data, and potentially detect and form hypotheses.
  • Phase 3: Data preparation: The third step covers activities required to construct the final dataset from the initial raw data. Data preparation tasks are performed repeatedly.
  • Phase 4: Modeling phase: In this step, various modeling techniques are selected and applied followed by calibrating their parameters. Typically, several techniques are used for the same data mining problem.
  • Phase 5: Evaluation of the model(s): The fifth step begins with the quality perspective and then, before proceeding to final model deployment, ascertains that the model(s) achieves the business objectives. At the end of this phase, a decision should be reached on how to use data mining results.
  • Phase 6: Deployment phase: In the final step, the models are deployed to enable end-customers to use the data as a basis for decisions or as support in the business process. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized, presented and distributed in a way that the end-user can use it. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process.
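
The iterative character of the six phases can be illustrated with a minimal Python sketch. The text states only that the phases are executed in iterations; the specific rule used here (return to Phase 1 whenever the evaluation does not confirm the business objectives) and the names `run_crisp_dm` and `meets_business_objectives` are assumptions made for illustration, not part of the CRISP-DM specification.

```python
from enum import Enum


class Phase(Enum):
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


def run_crisp_dm(meets_business_objectives, max_iterations=3):
    """Return the (iteration, phase) trace of a simplified CRISP-DM run.

    meets_business_objectives is a hypothetical callable that receives the
    iteration number and returns True once evaluation succeeds.
    """
    trace = []
    pre_deployment = list(Phase)[:5]  # Phases 1-5, in definition order
    for iteration in range(1, max_iterations + 1):
        for phase in pre_deployment:
            trace.append((iteration, phase))
        # Assumed rule: deploy only after a successful evaluation,
        # otherwise start another pass from business understanding.
        if meets_business_objectives(iteration):
            trace.append((iteration, Phase.DEPLOYMENT))
            break
    return trace


# Example: the evaluation succeeds on the second pass.
for iteration, phase in run_crisp_dm(lambda i: i == 2):
    print(iteration, phase.name)
```

A strictly sequential process in the KDD sense would correspond to a single pass through this loop.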


The development of CRISP-DM was led by an industry consortium. It is designed to be domain-agnostic (Mariscal, Marbán & Fernández, 2010) and, as such, is now widely used by industry and research communities (Marban, Mariscal & Segovia, 2009). These distinctive characteristics have led CRISP-DM to be considered the ‘de-facto’ standard of data mining methodology and a reference framework against which other methodologies are benchmarked (Mariscal, Marbán & Fernández, 2010).

Similarly to KDD, a number of refinements and extensions of the CRISP-DM methodology have been proposed in two main directions: extensions of the process model itself, and adaptations or mergers with process models and methodologies from other domains. The first direction is exemplified by Cios & Kurgan (2005), who proposed an integrated Data Mining & Knowledge Discovery (DMKD) process model. It contains several explicit feedback mechanisms, modifies the last step to incorporate the application of discovered knowledge and insights, and relies on technologies for results deployment. In the same vein, Moyle & Jorge (2001) and Blockeel & Moyle (2002) proposed the Rapid Collaborative Data Mining System (RAMSYS) framework, which is both a data mining methodology and a system for remote collaborative data mining projects. RAMSYS attempted to combine a problem-solving methodology, knowledge sharing, and ease of communication. It was intended to allow remotely placed data miners to work collaboratively in a disciplined manner as regards information flow while allowing the free flow of ideas for problem solving (Moyle & Jorge, 2001). CRISP-DM modifications and integrations with other specific domains were proposed in Industrial Engineering (Data Mining for Industrial Engineering by Solarte (2002)) and in Software Engineering by Marbán et al. (2007, 2009). Both approaches enhanced CRISP-DM and contributed additional phases, activities and tasks typical of engineering processes, addressing on-going support (Solarte, 2002) as well as project management, organizational and quality assurance tasks (Marbán et al., 2009).

Finally, a limited number of attempts to create independent or semi-dependent data mining frameworks were undertaken after the creation of CRISP-DM. These efforts were driven by industry players and comprised the KDD Roadmap by Debuse et al. (2001) for a proprietary predictive toolkit (Lanner Group), and a recent effort by IBM, the Analytics Solutions Unified Method for Data Mining (ASUM-DM), in 2015 ( IBM Corporation, 2016 : https://developer.ibm.com/technologies/artificial-intelligence/articles/architectural-thinking-in-the-wild-west-of-data-science/ ). Both frameworks contributed additional elements, for example, the resourcing task in KDD Roadmap, or the hybrid approach assumed in ASUM, which combines agile and traditional implementation principles.

Table 1 below summarizes the reviewed data mining process models and methodologies by their origin, basis and key concepts.

Name | Origin | Basis | Key concept | Year
Human-Centered | Academy | KDD | Iterative process and interactivity (user’s point of view and needed decisions) | 1996, 2004
Cabena et al. | Academy | KDD | Focus on data processing and discovery tasks | 1997
Anand and Buchner | Academy | KDD | Supplementary steps and integration of web-mining | 1998, 1999
Two Crows | Industry | KDD | Modified definitions of steps | 1998
SEMMA | Industry | KDD | Tool-specific (SAS Institute), elimination of some steps | 2005
5 A’s | Industry | Independent | Supplementary steps | 2003
6 Sigmas | Industry | Independent | Six Sigma quality improvement paradigm in conjunction with DMAIC performance improvement model | 2003
CRISP-DM | Joint industry and academy | KDD | Iterative execution of steps, significant refinements to tasks and outputs | 2000
Cios et al. | Academy | CRISP-DM | Integration of data mining and knowledge discovery, feedback mechanisms, usage of received insights supported by technologies | 2005
RAMSYS | Academy | CRISP-DM | Integration of collaborative work aspects | 2001–2002
DMIE | Academy | CRISP-DM | Integration and adaptation to Industrial Engineering domain | 2001
Marban | Academy | CRISP-DM | Integration and adaptation to Software Engineering domain | 2007
KDD roadmap | Joint industry and academy | Independent | Tool-specific, resourcing task | 2001
ASUM | Industry | CRISP-DM | Tool-specific, combination of traditional CRISP-DM and agile implementation approach | 2015

Research Design

The main research objective of this article is to study how data mining methodologies are applied by researchers and practitioners. To this end, we use a systematic literature review (SLR) as the scientific method for two reasons. Firstly, a systematic review is based on a trustworthy, rigorous, and auditable methodology. Secondly, SLR supports structured synthesis of existing evidence and identification of research gaps, and provides a framework to position new research activities (Kitchenham, Budgen & Brereton, 2015). For our SLR, we followed the guidelines proposed by Kitchenham, Budgen & Brereton (2015). All SLR details have been documented in a separate, peer-reviewed SLR protocol (available at https://figshare.com/articles/Systematic-Literature-Review-Protocol/10315961 ).

Research questions

As suggested by Kitchenham, Budgen & Brereton (2015), we formulated the research questions and motivate them as follows. In the preliminary phase of the research, we discovered a very limited number of studies investigating the application practices of data mining methodologies as such. Further, we discovered a number of surveys conducted in domain-specific settings and very few general-purpose surveys, but none of them considered application practices either. As a contrasting trend, the recent emergence of a limited number of adaptation studies has clearly pinpointed the research gap in the area of application practices. Given this research gap, an in-depth investigation of this phenomenon led us to ask: “How data mining methodologies are applied (‘as-is’ vs adapted) (RQ1)?” Further, as we intended to investigate in depth the universe of adaptation scenarios, this naturally led us to RQ2: “How have existing data mining methodologies been adapted?” Finally, if adaptations are made, we wish to explore the associated reasons and purposes, which in turn led us to RQ3: “For what purposes are data mining methodologies adapted?”

Thus, for this review, there are three research questions defined:

  • Research Question 1: How data mining methodologies are applied (‘as-is’ versus adapted)? This question aims to identify data mining methodologies application and usage patterns and trends.
  • Research Question 2: How have existing data mining methodologies been adapted? This question aims to identify and classify data mining methodologies adaptation patterns and scenarios.
  • Research Question 3: For what purposes have existing data mining methodologies been adapted? This question aims to identify, explain, classify and produce insights on the reasons for, and the benefits achieved by, adaptations of existing data mining methodologies. Specifically, what gaps do these adaptations seek to fill, and what have been their benefits? Such systematic evidence and insights will be valuable input to a potentially new, refined data mining methodology. The insights will be of interest to practitioners and researchers.

Data collection strategy

Our data collection and search strategy followed the guidelines proposed by Kitchenham, Budgen & Brereton (2015) . It defined the scope of the search, selection of literature and electronic databases, search terms and strings as well as screening procedures.

Primary search

The primary search aimed to identify an initial set of papers. To this end, the search strings were derived from the research objective and research questions. The term ‘data mining’ was the key term, but we also included ‘data analytics’ to be consistent with observed research practices. The terms ‘methodology’ and ‘framework’ were also included. Thus, the following search strings were developed and validated in accordance with the guidelines suggested by Kitchenham, Budgen & Brereton (2015) :

(‘data mining methodology’) OR (‘data mining framework’) OR (‘data analytics methodology’) OR (‘data analytics framework’)
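
As a minimal sketch, the search string above can be generated by pairing each subject term with each approach term. The helper name `build_search_string` is hypothetical, straight quotes replace the typographic ones, and database-specific field codes (which the text does not report) are left out.

```python
from itertools import product

# Key terms named in the text.
SUBJECT_TERMS = ["data mining", "data analytics"]
APPROACH_TERMS = ["methodology", "framework"]


def build_search_string(subjects, approaches):
    """Join every subject/approach pair into one OR-connected query."""
    phrases = [f"('{s} {a}')" for s, a in product(subjects, approaches)]
    return " OR ".join(phrases)


query = build_search_string(SUBJECT_TERMS, APPROACH_TERMS)
print(query)
```

Keeping the two term lists separate makes it straightforward to see how adding a further term would enlarge the query: each additional subject or approach term adds one phrase per term in the other list.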

The search strings were applied to the indexed scientific databases Scopus and Web of Science (for ‘peer-reviewed’, academic literature) and to the non-indexed Google Scholar (for non-peer-reviewed, so-called ‘grey’ literature). The decision to cover ‘grey’ literature in this research was motivated as follows. As proposed in a number of publications in the information systems and software engineering domains (Garousi, Felderer & Mäntylä, 2019; Neto et al., 2019), SLR as a stand-alone method may not provide sufficient insight into the ‘state of practice’. It was also identified (Garousi, Felderer & Mäntylä, 2016) that ‘grey’ literature can give substantial benefits in certain areas of software engineering, in particular when the topic of research is related to industrial and practical settings. Taking into consideration the research objective, which is to investigate the application practices of data mining methodologies, we opted to include elements of Multivocal Literature Review (MLR) 1 in our study. Also, Kitchenham, Budgen & Brereton (2015) recommend including ‘grey’ literature to minimize publication bias, as positive results and research outcomes are more likely to be published than negative ones. Following MLR practices, we also designed inclusion criteria for the types of ‘grey’ literature, reported below.

The selection of databases is motivated as follows. For peer-reviewed literature sources, we sought to avoid potential omission bias. The latter is discussed in IS research (Levy & Ellis, 2006) and arises when research is concentrated in a limited set of disciplinary data sources. Thus, a broad selection of data sources was evaluated, including multidisciplinary (Scopus, Web of Science, Wiley Online Library) and domain-oriented (ACM Digital Library, IEEE Xplore Digital Library) scientific electronic databases. Multidisciplinary databases were selected due to their wider domain coverage, and it was validated and confirmed that they include publications originating from domain-oriented databases such as ACM and IEEE. Among the multidisciplinary databases, Scopus was selected due to its widest possible coverage (it is the world’s largest database, covering approximately 80% of all international peer-reviewed journals), while Web of Science was selected due to its longer temporal range. Thus, the two databases complement each other. The selected non-indexed source for ‘grey’ literature is Google Scholar, as it is a comprehensive source of both academic and ‘grey’ literature publications and is referred to as such extensively (Garousi, Felderer & Mäntylä, 2019; Neto et al., 2019).

Further, Garousi, Felderer & Mäntylä (2019) presented a three-tier categorization framework for types of ‘grey’ literature. In our study, we restricted ourselves to 1st tier ‘grey’ literature publications from a limited number of ‘grey’ literature producers. In particular, from the list of producers (Neto et al., 2019) we adopted and focused on government departments and agencies, non-profit economic and trade organizations (‘think-tanks’) and professional associations, academic and research institutions, and businesses and corporations (consultancy companies and established private companies). The selected 1st tier ‘grey’ literature items include: (1) government, academic, and private sector consultancy reports 2 , (2) theses (not lower than Master level) and PhD dissertations, (3) research reports, (4) working papers, (5) conference proceedings and preprints. By restricting inclusion to 1st tier ‘grey’ literature, we mitigate the quality assessment challenge that is especially relevant to, and reported for, this type of literature (Garousi, Felderer & Mäntylä, 2019; Neto et al., 2019).

Scope and domains inclusion

As recommended by Kitchenham, Budgen & Brereton (2015) it is necessary to initially define research scope. To clarify the scope, we defined what is not included and is out of scope of this research. The following aspects are not included in the scope of our study:

  • Context of technology and infrastructure for data mining/data analytics tasks and projects.
  • Granular methods application in data mining process itself or their application for data mining tasks, for example, constructing business queries or applying regression or neural networks modeling techniques to solve classification problems. Studies with granular methods are included in primary texts corpus as long as method application is part of overall methodological approach.
  • Technological aspects in data mining for example, data engineering, dataflows and workflows.
  • Traditional statistical methods not associated with data mining directly including statistical control methods.

Similarly to Budgen et al. (2006) and Levy & Ellis (2006), initial piloting revealed that the search engines retrieved literature from all major scientific domains, including ones outside the authors’ area of expertise (e.g., medicine). Even though such studies could be retrieved, it would be impossible for us to analyze and correctly interpret literature published outside our area of expertise. The search strategy was therefore adjusted by retaining the domains closely associated with Information Systems and Software Engineering research. Thus, for the Scopus database the final set of included domains was limited to nine and included Computer Science, Engineering, Mathematics, Business, Management and Accounting, Decision Science, Economics, Econometrics and Finance, and Multidisciplinary as well as Undefined studies. Excluded domains covered 11.5% or 106 out of 925 publications; it was confirmed in the validation process that they primarily focused on specific case studies in fundamental sciences and medicine 3 . The included domains from the Scopus database were mapped to Web of Science to ensure a consistent approach across databases, and the correctness of the mapping was validated.

Screening criteria and procedures

Based on SLR practices (as in Kitchenham, Budgen & Brereton (2015) and Brereton et al. (2007)) and the defined SLR scope, we designed multi-step screening procedures (quality and relevancy) with an associated set of Screening Criteria and a Scoring System. The purpose of relevancy screening is to find relevant primary studies in an unbiased way (Vanwersch et al., 2011). Quality screening, on the other hand, aims to assess the relevant primary studies in terms of quality in an unbiased way.

Screening Criteria consisted of two subsets— Exclusion Criteria applied for initial filtering and Relevance Criteria , also known as Inclusion Criteria .

Exclusion Criteria were initial threshold quality controls aiming at eliminating studies with limited or no scientific contribution. The exclusion criteria also address issues of understandability, accessibility and availability. The Exclusion Criteria were as follows:

  • Quality 1: The publication item is not in English (understandability).
  • Quality 2: The publication item is a duplicate, which occurs when:
      • the same document is retrieved from two or all three databases; or
      • different versions of the same publication are retrieved (i.e., the same study published in different sources). Based on best practices, the decision rule is that the most recent paper is retained, as well as the one with the highest score (Kofod-Petersen, 2014); or
      • a publication is published both as a conference proceeding and as a journal article with the same name and same authors, or as an extended version of a conference paper, in which case the latter is selected.
  • Quality 3: Length of the publication is less than 6 pages. Short papers do not have the space to expand and discuss the presented ideas in sufficient depth for us to examine.
  • Quality 4: The paper is not accessible in full length online through the university subscription of databases and via Google Scholar. Lack of full-text availability prevents us from assessing and analyzing the text.

The initially retrieved list of papers was filtered based on Exclusion Criteria . Only papers that passed all criteria were retained in the final studies corpus. Mapping of criteria towards screening steps is exhibited in Fig. 4 .
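
The filtering logic can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure actually used: the `Publication` record and its fields are hypothetical, and duplicate detection is simplified to keeping the first record with a given normalized title and author tuple, whereas the rule described above prefers the most recent or journal version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Publication:
    # Hypothetical record layout for one retrieved item.
    title: str
    authors: tuple
    language: str
    pages: int
    full_text_available: bool


def apply_exclusion_criteria(publications, min_pages=6):
    """Keep only the records that pass every exclusion check."""
    seen = set()
    retained = []
    for pub in publications:
        if pub.language != "English":        # Quality 1: understandability
            continue
        key = (pub.title.strip().lower(), pub.authors)
        if key in seen:                      # duplicate check (simplified)
            continue
        if pub.pages < min_pages:            # Quality 3: minimum length
            continue
        if not pub.full_text_available:      # Quality 4: availability
            continue
        seen.add(key)
        retained.append(pub)
    return retained


sample = [
    Publication("A Framework", ("Lee",), "English", 12, True),
    Publication("a framework ", ("Lee",), "English", 12, True),
    Publication("Short Note", ("Kim",), "English", 4, True),
]
print([p.title for p in apply_exclusion_criteria(sample)])
```

A record is dropped as soon as it fails any single check, which mirrors the requirement that only papers passing all criteria are retained.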


Relevance Criteria were designed to identify relevant publications and are presented in Table 2 below while mapping to respective process steps is presented in Fig. 4 . These criteria were applied iteratively.

Relevance criteria | Criteria definition | Criteria justification
Relevance 1 | Is the study about data mining or data analytics approach and is within designated list of domains? | Exclude studies conducted outside the designated domain list. Exclude studies not directly describing and/or discussing data mining and data analytics
Relevance 2 | Is the study introducing/describing data mining or data analytics methodology/framework or modifying existing approaches? | Exclude texts considering only specific, granular data mining and data analytics techniques, methods or traditional statistical methods. Exclude publications focusing on specific, granular data mining and data analytics process/sub-process aspects. Exclude texts where description and discussion of data mining methodologies or frameworks is manifestly missing

As a final SLR step, a quality assessment of the full texts was performed with the constructed Scoring Metrics (in line with Kitchenham & Charters (2007)). These are presented in Table 3 below.

Score | Criteria definition
3 | Data mining methodology or framework is presented in full. All steps described and explained, tests performed, results compared and evaluated. There is clear proposal on usage, application, deployment of solution in organization’s business process(es) and IT/IS system, and/or prototype or full solution implementation is discussed. Success factors described and presented
2 | Data mining methodology or framework is presented, some process steps are missing, but they do not impact the holistic view and understanding of the performed work. Data mining process is clearly presented and described, tests performed, results compared and evaluated. There is proposal on usage, application, deployment of solution in organization’s business process(es) and IT/IS system(s)
1 | Data mining methodology or framework is not presented in full, some key phases and process steps are missing. Publication focuses on one or some aspects (e.g., method, technique)
0 | Data mining methodology or framework not presented as holistic approach, but on fragmented basis, study limited to some aspects (e.g., method or technique discussion, etc.)

Data extraction and screening process

The data extraction and screening process is presented in Fig. 4. In Step 1, initial publication lists were retrieved from the pre-defined databases: Scopus, Web of Science and Google Scholar. The lists were merged and duplicates eliminated in Step 2. Afterwards, texts shorter than 6 pages were excluded (Step 3). Steps 1–3 were guided by the Exclusion Criteria. In the next stage (Step 4), publications were screened by title based on the pre-defined Relevance Criteria. Those that passed were evaluated for availability (Step 5). If a study was available, it was evaluated again by the same pre-defined Relevance Criteria applied to the abstract, conclusion and, if necessary, introduction (Step 6). Those that passed this threshold formed the primary publications corpus, which was extracted from the databases in full. These primary texts were evaluated again based on the full text (Step 7), applying the Relevance Criteria first and then the Scoring Metrics.

Results and quantitative analysis

In Step 1, 1,715 publications were extracted from the relevant databases with the following composition: Scopus (819), Web of Science (489), Google Scholar (407). In terms of scientific publication domains, Computer Science (42.4%), Engineering (20.6%) and Mathematics (11.1%) accounted for approximately 74% of Scopus-originated texts. The same applies to the Web of Science harvest. Application of the Exclusion Criteria produced the following results. In Step 2, after eliminating duplicates, 1,186 texts were passed for minimum length evaluation, and 767 reached assessment by the Relevance Criteria.

As mentioned, the Relevance Criteria were applied iteratively (Steps 4–6) and in conjunction with availability assessment. As a result, only 298 texts were retained for full evaluation, with 241 originating from scientific databases while 57 were ‘grey’. These studies formed the primary texts corpus, which was extracted, read in full and evaluated by the Relevance Criteria combined with the Scoring Metrics. The decision rule was set as follows. Studies that scored “1” or “0” were rejected, while texts evaluated as “3” or “2” were admitted to the final primary studies corpus. As an outcome of this SLR-based, broad, cross-domain collection and screening of publications, we identified 207 relevant publications from peer-reviewed (156 texts) and ‘grey’ literature (51 texts). Figure 5 below exhibits the yearly number of published studies, broken down by ‘peer-reviewed’ and ‘grey’ literature, starting from 1997.
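
The decision rule and the reported screening funnel can be restated as a short Python sketch. All counts are taken from the text; the function name `admit` and the step labels are illustrative choices.

```python
def admit(score):
    """Decision rule from the text: scores 3 and 2 pass, 1 and 0 are rejected."""
    if score not in (0, 1, 2, 3):
        raise ValueError(f"score must be 0-3, got {score}")
    return score >= 2


# Counts reported in the text.
retrieved = {"Scopus": 819, "Web of Science": 489, "Google Scholar": 407}
full_evaluation = {"scientific databases": 241, "grey": 57}
final_corpus = {"peer-reviewed": 156, "grey": 51}

funnel = [
    ("Step 1: retrieved", sum(retrieved.values())),
    ("Step 2: after duplicate removal", 1186),
    ("Step 3: after minimum-length check", 767),
    ("Steps 4-6: retained for full evaluation", sum(full_evaluation.values())),
    ("Step 7: final corpus", sum(final_corpus.values())),
]
for label, count in funnel:
    print(f"{label}: {count}")
```

Summing the per-source counts reproduces the reported totals of 1,715 retrieved, 298 fully evaluated and 207 admitted publications.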


In terms of composition, ‘peer-reviewed’ studies corpus is well-balanced with 72 journal articles and 82 conference papers while book chapters account for 4 instances only. In contrast, in ‘grey’ literature subset, articles in moderated and non-peer reviewed journals are dominant ( n = 34) compared to overall number of conference papers ( n = 13), followed by small number of technical reports and pre-prints ( n = 4).

Temporal analysis of the texts corpus (as per Fig. 5) resulted in two observations. Firstly, we note that stable and significant research interest (in terms of numbers) in the application of data mining methodologies started around a decade ago, in 2007. Research efforts made prior to 2007 were relatively limited, with the number of publications below 10. Secondly, we note that research on data mining methodologies has grown substantially since 2007, an observation supported by the constructed 3-year and 10-year mean trendlines. In particular, the number of publications has roughly tripled over the past decade, hitting an all-time high with 24 texts released in 2017.

Further, there are two distinct spike sub-periods, in the years 2007–2009 and 2014–2017, each followed by a stable pattern with an overall higher number of publications released annually. This observation is in line with the trend of increased penetration of methodologies, tools, cross-industry applications and academic research in data mining.

Findings and Discussion

In this section, we address the research questions of the paper. Initially, as part of RQ1, we present an overview of data mining methodologies ‘as-is’ and adaptation trends. In addressing RQ2, we further classify the adaptations identified. Then, in the RQ3 subsection, each category identified under RQ2 is analyzed with particular focus on the goals of the adaptations.

RQ1: How data mining methodologies are applied (‘as-is’ vs. adapted)?

The first research question examines the extent to which data mining methodologies are used ‘as-is’ versus adapted. Our review based on 207 publications identified two distinct paradigms on how data mining methodologies are applied. The first is ‘as-is’ where the data mining methodologies are applied as stipulated. The second is with ‘adaptations’; that is, methodologies are modified by introducing various changes to the standard process model when applied.

We have aggregated the research by decades to differentiate the application pattern between two time periods: 1997–2007, with limited data mining application, vs 2008–2018, with more intensive application. The given cut has been guided not only by the extracted publications corpus but also by earlier surveys. In particular, in the pre-2007 research there were ten new methodologies proposed, but since then only two new methodologies have been proposed. Thus, a distinct trend is observed over the last decade: a large number of extensions and adaptations have been proposed vs entirely new methodologies.

We note that during the first decade of our time scope (1997–2007), the ratio of data mining methodologies applied ‘as-is’ was 40% (as presented in Fig. 6A). However, the same ratio for the following decade is 32% (Fig. 6B). Thus, in terms of relative shares, we note a clear decrease in using data mining methodologies ‘as-is’ in favor of adapting them to cater to specific needs. The trend is even more pronounced when comparing absolute numbers: adaptations more than tripled (from 30 to 106) while the ‘as-is’ scenario has increased modestly (from 20 to 51). Given this finding, we continue with analyzing how data mining methodologies have been adapted under RQ2.
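
The shares quoted above follow directly from the reported counts, as the following Python sketch shows. The dictionary layout and the function name `as_is_share` are illustrative; the numbers are those given in the text.

```python
# Publication counts per decade, as reported in the text.
counts = {
    "1997-2007": {"as-is": 20, "adapted": 30},
    "2008-2018": {"as-is": 51, "adapted": 106},
}


def as_is_share(period):
    """Fraction of studies in a period that applied a methodology 'as-is'."""
    c = counts[period]
    return c["as-is"] / (c["as-is"] + c["adapted"])


for period in counts:
    print(f"{period}: {as_is_share(period):.1%} applied 'as-is'")

# Growth factor of adaptations between the two decades.
print(counts["2008-2018"]["adapted"] / counts["1997-2007"]["adapted"])
```

The four counts also sum to the 207 publications of the final corpus.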


RQ2: How have existing data mining methodologies been adapted?

We identified that data mining methodologies have been adapted to cater to specific needs. In order to categorize adaptations scenarios, we applied a two-level dichotomy, specifically, by applying the following decision tree:

  • Level 1 Decision: Has the methodology been combined with another methodology? If yes, the resulting methodology was classified in the ‘integration’ category. Otherwise, we posed the next question.
  • Level 2 Decision: Are any new elements (phases, tasks, deliverables) added to the methodology? If yes, we designate the resulting methodology as an ‘extension’ of the original one. Otherwise, we classify the resulting methodology as a modification of the original one.
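
The two-level decision tree can be written as a small Python function. This is a sketch: the function and parameter names are hypothetical, and the two boolean inputs stand for the reviewer's answers to the Level 1 and Level 2 questions.

```python
def classify_adaptation(combined_with_other_methodology, adds_new_elements):
    """Classify an adapted methodology using the two-level decision tree."""
    # Level 1: has the methodology been combined with another methodology?
    if combined_with_other_methodology:
        return "Integration"
    # Level 2: are new elements (phases, tasks, deliverables) added?
    if adds_new_elements:
        return "Extension"
    return "Modification"


print(classify_adaptation(True, False))   # Integration
print(classify_adaptation(False, True))   # Extension
print(classify_adaptation(False, False))  # Modification
```

Because Level 1 is evaluated first, a study that both combines methodologies and adds new elements is classified as ‘Integration’.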

Thus, three distinct types of adaptation scenarios can be distinguished:

  • Scenario ‘Modification’: introduces specialized sub-tasks and deliverables in order to address specific use cases or business problems. Modifications typically concentrate on granular adjustments to the methodology at the level of sub-phases, tasks or deliverables within the stages of existing reference frameworks (e.g., CRISP-DM or KDD). For example, Chernov et al. (2014), in a study of the mobile network domain, proposed an automated decision-making enhancement in the deployment phase. In addition, the evaluation phase was modified by using both conventional and own-developed performance metrics. Further, in a study performed within the financial services domain, Yang et al. (2016) present feature transformation and feature selection as sub-phases, thereby enhancing the data mining modeling stage.
  • Scenario ‘Extension’: primarily proposes significant extensions to reference data mining methodologies. Such extensions result in either integrated data mining solutions, data mining frameworks serving as a component or tool for automated IS systems, or their transformations to fit specialized environments. The main purposes of extensions are to integrate fully-scaled data mining solutions into IS/IT systems and business processes and provide broader context with useful architectures, algorithms, etc. Adaptations, where extensions have been made, elicit and explicitly present various artifacts in the form of system and model architectures, process views, workflows, and implementation aspects. A number of soft goals are also achieved, providing holistic perspective on data mining process, and contextualizing with organizational needs. Also, there are extensions in this scenario where data mining process methodologies are substantially changed and extended in all key phases to enable execution of data mining life-cycle with the new (Big) Data technologies, tools and in new prototyping and deployment environments (e.g., Hadoop platforms or real-time customer interfaces). For example, Kisilevich, Keim & Rokach (2013) presented extensions to traditional CRISP-DM data mining outcomes with fully fledged Decision Support System (DSS) for hotel brokerage business. Authors ( Kisilevich, Keim & Rokach, 2013 ) have introduced spatial/non-spatial data management (extending data preparation), analytical and spatial modeling capabilities (extending modeling phase), provided spatial display and reporting capabilities (enhancing deployment phase). In the same work domain knowledge was introduced in all phases of data mining process, and usability and ease of use were also addressed.
  • Scenario ‘Integration’: combines a reference methodology, for example, CRISP-DM, with: (1) data mining methodologies originating from other domains (e.g., software engineering development methodologies), (2) organizational frameworks (Balanced Scorecard, Analytics Canvas, etc.), or (3) adjustments to accommodate Big Data technologies and tools. Also, adaptations in the form of ‘Integration’ typically introduce various types of ontologies and ontology-based tools, domain knowledge, software engineering, and BI-driven framework elements. Fundamental adjustments of the data mining process to new types of data and IS architectures (e.g., real-time data, multi-layer IS) are also presented. The key gaps addressed by such adjustments are the prescriptive nature and low degree of formalization of CRISP-DM, its obsolete nature with respect to tools, and its lack of integration with other organizational frameworks. For example, Brisson & Collard (2008) developed the KEOPS data mining methodology (CRISP-DM based), centered on domain knowledge integration. An ontology-driven information system was proposed, with integration and enhancements in all steps of the data mining process. Further, integrated expert knowledge used in all data mining phases was shown to produce value in the data mining process.

To examine how the application scenario of each data mining methodology usage has developed over time, we mapped peer-reviewed texts and ‘grey’ literature to the respective adaptation scenarios, aggregated by decades (as presented in Fig. 7 for peer-reviewed and Fig. 8 for ‘grey’ literature).


For peer-reviewed research, such temporal analysis resulted in three observations. Firstly, research efforts in each adaptation scenario have been growing and the number of publications more than quadrupled (128 vs. 28). Secondly, as noted above, the relative proportion of ‘as-is’ studies is diluted (from 39% to 33%) and primarily replaced by the ‘Extension’ paradigm (from 25% to 30%). In contrast, in relative terms the gains of the ‘Modification’ and ‘Integration’ paradigms are modest. Further, this finding is reinforced by another observation: the most notable gap, in terms of a modest number of publications, remains in the ‘Integration’ category where, excluding the 2008–2009 spike, research efforts are limited and the number of texts is just 13. This is in stark contrast with the prolific research in the ‘Extension’ category, though it is concentrated in recent years. We can hypothesize that existing reference methodologies do not accommodate and support the increasing complexity of data mining projects and IS/IT infrastructure, as well as the specifics of certain domains, and as such need to be adapted.

In ‘grey’ literature, in contrast to peer-reviewed research, growth in the number of publications is less pronounced: 29 vs. 22 publications, or 32%, comparing across the two decades (as per Fig. 8). The growth is solely driven by the application of ‘Integration’ scenarios (13 vs. 4 publications), while both ‘as-is’ and the other adaptation scenarios are stagnating or in decline.
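The growth figures quoted for the two corpora can be re-derived with a few lines of arithmetic. The sketch below is only an illustration: the function name and dictionary keys are hypothetical, and the only inputs are the publication counts stated in the text (28 and 128 for peer-reviewed research, 22 and 29 for ‘grey’ literature).

```python
def relative_growth(earlier: int, later: int) -> float:
    """Growth from `earlier` to `later`, expressed as a fraction of `earlier`."""
    if earlier <= 0:
        raise ValueError("earlier count must be positive")
    return (later - earlier) / earlier


# Publication counts per decade as quoted in the text.
# The key names are hypothetical labels chosen for this sketch.
peer_reviewed = {"earlier_decade": 28, "later_decade": 128}
grey = {"earlier_decade": 22, "later_decade": 29}

# Ratio of later to earlier counts for peer-reviewed research
# (the text describes this as "more than quadrupled").
peer_ratio = peer_reviewed["later_decade"] / peer_reviewed["earlier_decade"]

# Relative growth of 'grey' literature (the text quotes 32%).
grey_growth = relative_growth(grey["earlier_decade"], grey["later_decade"])

print(f"peer-reviewed ratio: {peer_ratio:.2f}x")
print(f"grey literature growth: {grey_growth:.0%}")
```

Under these inputs the peer-reviewed ratio is above 4 and the ‘grey’ growth rounds to 32%, consistent with the figures reported in the text.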

RQ3: For what purposes have existing data mining methodologies been adapted?

We address the third research question by analyzing what gaps the data mining methodology adaptations seek to fill and the benefits of such adaptations. We identified three adaptation scenarios, namely ‘Modification’, ‘Extension’, and ‘Integration’. Here, we analyze each of them.

Modification

Modifications of data mining methodologies are present in 30 peer-reviewed and 4 ‘grey’ literature studies. The analysis shows that modifications overwhelmingly consist of specific case studies. However, the major differentiating point compared to ‘as-is’ case studies is the clear presence of specific adjustments to standard data mining process methodologies. Yet, the proposed modifications and their purposes do not go beyond the phases of traditional data mining methodologies. They are granular, specialized and executed at the level of tasks, sub-tasks, and deliverables. With modifications, authors describe potential business applications and deployment scenarios at a conceptual level, but typically do not report or present real implementations in IS/IT systems and business processes.

Further, this research subcategory can best be classified by the domains in which the case studies were performed and the modification scenarios executed. We have identified four distinct domain-driven applications, presented in Fig. 9.


IT, IS domain

The largest number of publications (14, or approximately 40%) was performed on IT and IS security, software development, and specific data mining and processing topics. Authors address the intrusion detection problem in Hossain, Bridges & Vaughn (2003), Fan, Ye & Chen (2016), and Lee, Stolfo & Mok (1999); specialized algorithms for processing a variety of data types in Yang & Shi (2010), Chen et al. (2001), Yi, Teng & Xu (2016), and Pouyanfar & Chen (2016); and effective and efficient management of computer and mobile networks in Guan & Fu (2010), Ertek, Chi & Zhang (2017), Zaki & Sobh (2005), Chernov, Petrov & Ristaniemi (2015), and Chernov et al. (2014).

Manufacturing and engineering

The next most popular research area is manufacturing/engineering, with 10 case studies. The central topic here is high-technology manufacturing, for example, the semiconductor-related study of Chien, Diaz & Lan (2014), and various complex prognostics case studies in the rail and aerospace domains (Létourneau et al., 2005; Zaluski et al., 2011) concentrated on failure prediction. These are complemented by studies on equipment fault and failure prediction and maintenance (Kumar, Shankar & Thakur, 2018; Kang et al., 2017; Wang, 2017) as well as a monitoring system (García et al., 2017).

Sales and services, incl. financial industry

The third category is represented by seven business application papers concerning customer service, targeting and advertising (Karimi-Majd & Mahootchi, 2015; Reutterer et al., 2017; Wang, 2017), financial services credit risk assessment (Smith, Willis & Brooks, 2000), supply chain management (Nohuddin et al., 2018), property management (Yu, Fung & Haghighat, 2013), and similar topics.

As a consequence of specialization, these studies concentrate on developing ‘state-of-the-art’ solutions to the respective domain-specific problems.

Extension

The ‘Extension’ scenario was identified in 46 peer-reviewed and 12 ‘grey’ publications. We noted that ‘Extensions’ to existing data mining methodologies were executed with four major purposes:

  • Purpose 1: To implement a fully scaled, integrated data mining solution and a regular, repeatable knowledge discovery process, addressing model and algorithm deployment and implementation design (including architecture, workflows and corresponding IS integration). A complementary goal is to tackle changes to business processes in order to incorporate data mining into organizational activities.
  • Purpose 2: To implement complex, specifically designed systems and integrated business applications with a data mining model/solution as a component or tool. Typically, this adaptation is also oriented towards Big Data specifics, and is complemented by proposed artifacts such as Big Data architectures, system models, workflows, and data flows.
  • Purpose 3: To implement data mining as part of integrated/combined specialized infrastructure, data environments and types (e.g., IoT, cloud, mobile networks).
  • Purpose 4: To incorporate context-awareness aspects.

The specific list of studies mapped to each of the given purposes is presented in the Appendix (Table A1). The main purposes of adaptations, associated gaps and/or benefits, along with observations and artifacts, are documented in Fig. 10 below.


Main adaptation purpose    Publications
(1) To implement fully scaled, integrated data mining solution
(2) To implement complex systems and integrated business applications with data mining model/solution as component or tool
(3) To implement data mining as part of integrated/combined specialized infrastructure, data environments and types (e.g., IoT, cloud, mobile networks)
(4) To incorporate context-awareness aspects

In the ‘Extension’ category, studies executed with Purpose 1 propose fully scaled, integrated data mining solutions consisting of specific data mining models and the associated frameworks and processes. The distinctive trait of this research subclass is that it ensures repeatability and reproducibility of the delivered data mining solution in different organizational and industry settings. Both the results of the data mining use case and the deployment and integration into IS/IT systems and associated business process(es) are presented explicitly. Thus, this ‘Extension’ subclass is geared towards specific solution design, tackling a concrete business or industrial problem or addressing specific research gaps, and thereby resembles a comprehensive case study.

This direction is well exemplified by the expert finder system for research social network services proposed by Sun et al. (2015), the data mining solution for functional test content optimization by Wang (2015), and the time-series mining framework for estimating unobservable time series by Hu et al. (2010). Similarly, Du et al. (2017) tackle detection of anomalies in online logs, automated association rule mining is addressed by Çinicioğlu et al. (2011), software effort estimation by Deng, Purvis & Purvis (2011), and visual discovery of network patterns by Simoff & Galloway (2008). A number of studies address solutions in IS security (Shin & Jeong, 2005), manufacturing (Güder et al., 2014; Chee, Baharudin & Karkonasasi, 2016), materials engineering (Doreswamy, 2008), and business domains (Xu & Qiu, 2008; Ding & Daniel, 2007).

In contrast, ‘Extension’ studies executed for Purpose 2 concentrate on the design of complex, multi-component information systems and architectures. These are holistic, complex systems and integrated business applications with a data mining framework serving as a component or tool. Moreover, the data mining methodology in these studies is extended with systems integration phases.

For example, Mobasher (2007) presents a data mining application in a Web personalization system and the associated process; here, the data mining cycle is extended in all phases with the ultimate goal of leveraging multiple data sources and using discovered models and corresponding algorithms in an automatic personalization system. The study comprehensively addresses data processing, algorithm and design adjustments, and their integration into an automated system. Similarly, Haruechaiyasak, Shyu & Chen (2004) tackle the improvement of a Webpage recommender system by presenting an extended data mining methodology that includes the design and implementation of a data mining model. A holistic view on web mining, with support for all data sources, integration of data warehousing and data mining techniques, and multiple problem-oriented analytical outcomes with rich business application scenarios (personalization, adaptation, profiling, and recommendations) in the e-commerce domain, was proposed and discussed by Büchner & Mulvenna (1998). Further, Singh et al. (2014) tackled a scalable implementation of a Network Threat Intrusion Detection System. In this study, the data mining methodology and resulting model are extended, scaled and deployed as a module of a quasi-real-time system for capturing Peer-to-Peer Botnet attacks. A similar complex solution was presented in a series of publications by Lee et al. (2000, 2001), who designed a real-time data mining-based Intrusion Detection System (IDS). These works are complemented by the comprehensive study of Barbará et al. (2001), who constructed an experimental testbed for intrusion detection with data mining methods. A detection model combining data fusion and mining, with respective components for Botnet identification, was also developed by Kiayias et al. (2009). A similar approach is presented in Alazab et al. (2011), who proposed and implemented a zero-day malware detection system with an associated machine-learning based framework. Finally, Ahmed, Rafique & Abulaish (2011) presented a multi-layer framework for fuzzy attack in 3G cellular IP networks.

A number of authors have considered data mining methodologies in the context of Decision Support Systems and other systems that generate information for decision-making, across a variety of domains. For example, Kisilevich, Keim & Rokach (2013) significantly extended the data mining methodology by designing and presenting an integrated Decision Support System (DSS) with six components, acting as a supporting tool for the hotel brokerage business to increase deal profitability. A similar approach is taken by Capozzoli et al. (2017), focusing on improving the energy management of properties by providing occupancy pattern information and a reconfiguration framework. Kabir (2016) presented a data mining information service providing improved sales forecasting that supported a solution to the under/over-stocking problem, while Lau, Zhang & Xu (2018) addressed sales forecasting with sentiment analysis on Big Data. Kamrani, Rong & Gonzalez (2001) proposed a GA-based Intelligent Diagnosis system for fault diagnostics in the manufacturing domain. The latter was tackled further in Shahbaz et al. (2010) with a complex, integrated data mining system for diagnosing and solving manufacturing problems in real time.

Lenz, Wuest & Westkämper (2018) propose a framework for capturing data analytics objectives and creating holistic, cross-departmental data mining systems in the manufacturing domain. This work is representative of a cohort of studies that aim at extending data mining methodologies in order to support the design and implementation of enterprise-wide data mining systems. In this same research cohort, we classify Luna, Castro & Romero (2017) , which presents a data mining toolset integrated into the Moodle learning management system, with the aim of supporting university-wide learning analytics.

One study addresses the concept of multi-agent based data mining. Khan, Mohamudally & Babajee (2013) developed a unified theoretical framework for data mining by formulating a unified data mining theory. The framework is tested by means of agent programming, proposing integration into a multi-agent system, which is useful due to its scalability, robustness and simplicity.

The subcategory of ‘Extension’ research executed with Purpose 3 is devoted to data mining methodologies and solutions in specialized IT/IS, data and process environments which emerged recently as a consequence of the development of Big Data technologies and tools. Exemplary studies include research on IoT-associated environments, for example, the Smart City application in IoT presented by Strohbach et al. (2015). In the same domain, Bashir & Gill (2016) addressed IoT-enabled smart buildings with the additional challenge of large amounts of high-speed real-time data and the requirements of real-time analytics. The authors proposed an integrated IoT Big Data Analytics framework. This research is complemented by the interdisciplinary study of Zhong et al. (2017), where IoT and wireless technologies are used to create an RFID-enabled environment producing analysis of KPIs to improve logistics.

A significant number of studies address various mobile environments, sometimes complemented by cloud-based environments, or cloud-based environments on their own. Gomes, Phua & Krishnaswamy (2013) addressed mobile data mining executed on the mobile device itself; the framework proposes an innovative approach addressing extensions of all aspects of data mining, including contextual data, end-user privacy preservation, data management and scalability. Yuan, Herbert & Emamian (2014) and Yuan & Herbert (2014) introduced a cloud-based mobile data analytics framework with an application case study for a smart home based monitoring system. Cuzzocrea, Psaila & Toccu (2016) presented the innovative FollowMe suite, which implements a data mining framework for mobile social media analytics with several tools and their respective architecture and functionalities. An interesting paper was presented by Torres et al. (2017), who addressed a data mining methodology and its implementation for congestion prediction in mobile LTE networks, also tackling feedback reaction with a network reconfiguration trigger.

Further, Biliri et al. (2014) presented a cloud-based Future Internet Enabler, an automated social data analytics solution which also addresses the Social Network Interoperability aspect, supporting enterprises in interconnecting and utilizing social networks for collaboration. Real-time streamed social media data and the resulting data mining methodology and application were extensively discussed by Zhang, Lau & Li (2014). The authors proposed the design of a comprehensive ABIGDAD framework with seven main components implementing data mining based deceptive review identification. An interdisciplinary study tackling both these topics was developed by Puthal et al. (2016), who proposed an integrated framework and architecture for a disaster management system based on streamed data in a cloud environment ensuring end-to-end security. Additionally, key extensions to the data mining framework have been proposed, merging a variety of data sources and types, security verification and data flow access controls. Finally, cloud-based manufacturing was addressed in the context of fault diagnostics by Kumar et al. (2016).

Also, Mahmood et al. (2013) tackled Wireless Sensor Networks and the extensions required for the associated data mining framework. Interesting work was carried out by Nestorov & Jukic (2003), addressing the rare topic of integrating data mining solutions within traditional data warehouses and active mining of the data repositories themselves.

Supported by a new generation of visualization technologies (including Virtual Reality environments), Wijayasekara, Linda & Manic (2011) proposed and implemented CAVE-SOM (a 3D visual data mining framework), which offers interactive, immersive visual data mining with multiple visualization modes supported by a plethora of methods. An earlier version of a visual data mining framework was developed and presented by Ganesh et al. (1996) as early as 1996.

Large-scale social media data is tackled by Lemieux (2016) with a comprehensive framework accompanied by a set of data mining tools and an interface. Real-time data analytics was addressed by Shrivastava & Pal (2017) in the domain of enterprise service ecosystems. Image data was addressed in Huang et al. (2002) by proposing a multimedia data mining framework and its implementation with user relevance feedback integration and instance learning. Further, the explosion in data diversity and the associated need to extend standard data mining are addressed by Singh et al. (2016) in a study devoted to object detection in video surveillance systems supporting real-time video analysis.

Finally, there is also a limited number of studies which address context awareness (Purpose 4) and extend the data mining methodology with context elements and adjustments. In comparison with ‘Integration’ category research, the studies here are at a lower abstraction level, capturing and presenting lists of adjustments. Singh, Vajirkar & Lee (2003) generate a taxonomy of context factors, develop an extended data mining framework and propose a deployment including a detailed IS architecture. The context-awareness aspect is also addressed in the papers reviewed above, for example, Lenz, Wuest & Westkämper (2018), Kisilevich, Keim & Rokach (2013), Sun et al. (2015), and other studies.

Integration

The ‘Integration’ scenario was identified in 27 peer-reviewed and 17 ‘grey’ studies. Our analysis revealed that this adaptation scenario, which operates at a higher abstraction level, is typically executed with five key purposes:

  • Purpose 1: to integrate/combine with various ontologies existing in the organization.
  • Purpose 2: to introduce context-awareness and incorporate domain knowledge.
  • Purpose 3: to integrate/combine with frameworks, process methodologies and concepts from other research or industry domains.
  • Purpose 4: to integrate/combine with other well-known organizational governance frameworks, process methodologies and concepts.
  • Purpose 5: to accommodate and/or leverage upon newly available Big Data technologies, tools and methods.

The specific list of studies mapped to each of the given purposes is presented in the Appendix (Table A2). The main purposes of adaptations, associated gaps and/or benefits, along with observations and artifacts, are documented in Fig. 11 below.


Main adaptation purpose    Publications
(1) To integrate/combine with various ontologies existing in organization
(2) To introduce context-awareness and incorporate domain knowledge
(3) To integrate/combine with other research/industry domains frameworks, process methodologies, and concepts
(4) To integrate/combine with other organizational governance frameworks, process methodologies, concepts
(5) To accommodate or leverage upon newly available Big Data technologies, tools and methods

As mentioned, a number of studies concentrate on proposing ontology-based integrated data mining frameworks accompanied by various types of ontologies (Purpose 1). For example, Sharma & Osei-Bryson (2008) focus on an ontology-based organizational view with Actors, Goals and Objectives, which supports execution of the Business Understanding Phase. Brisson & Collard (2008) propose the KEOPS framework, which is CRISP-DM compliant and integrates a knowledge base and an ontology with the purpose of building an ontology-driven information system (OIS) for the business and data understanding phases, while the knowledge base is used for the post-processing step of model interpretation. Park et al. (2017) propose and design a comprehensive ontology-based data analytics tool, IRIS, with the purpose of aligning analytics and business. IRIS is based on the concept of connecting the dots, analytics methods, or transforming insights into business value, and supports a standardized process for applying ontology to match business problems and solutions.

Further, Ying et al. (2014) propose a domain-specific data mining framework oriented to the business problem of customer demand discovery. They construct an ontology for customer demand and for the customer demand discovery task, which allows structured knowledge extraction in the form of knowledge patterns and rules. Here, the purpose is to facilitate business value realization and support actionability of the extracted knowledge via marketing strategies and tactics. In the same vein, Cannataro & Comito (2003) presented an ontology for the Data Mining domain whose main goal is to simplify the development of distributed knowledge discovery applications. The authors offered domain experts a reference model for different kinds of data mining tasks, methodologies, and software capable of solving the given business problem and finding the most appropriate solution.

Apart from ontologies, Sharma & Osei-Bryson (2009), in another study, propose an IS-inspired data mining methodology, driven by an Input-Output model, which supports formal implementation of the Business Understanding Phase. This research exemplifies studies executed with Purpose 2. The goal of the paper is to tackle the prescriptive nature of CRISP-DM and address how the entire process can be implemented. The study by Cao, Schurmann & Zhang (2005) is also exemplary in terms of aggregating and introducing several fundamental concepts into the traditional CRISP-DM data mining cycle: context awareness, in-depth pattern mining, human–machine cooperative knowledge discovery (in essence, following the human-centricity paradigm in data mining), and a loop-closed iterative refinement process (similar to Agile-based methodologies in software development). There are also several concepts, such as data, domain, interestingness and rules, which are proposed to tackle a number of fundamental constraints identified in CRISP-DM. They have been discussed and further extended by Cao & Zhang (2007, 2008) and Cao (2010) into an integrated domain-driven data mining concept, resulting in the fully fledged D3M (domain-driven) data mining framework. Interestingly, the same concepts are investigated and presented individually by other authors; for example, a context-aware data mining methodology is tackled by Xiang (2009a, 2009b) in the context of the financial sector. Pournaras et al. (2016) addressed the crucial topic of privacy preservation in the context of achieving an effective data analytics methodology. The authors introduced metrics and a self-regulatory (reconfigurable) information sharing mechanism providing customers with controls for information disclosure.

A number of studies have proposed CRISP-DM adjustments based on existing frameworks, process models or concepts originating in other domains (Purpose 3), for example, software engineering ( Marbán et al., 2007 , 2009 ; Marban, Mariscal & Segovia, 2009 ) and industrial engineering ( Solarte, 2002 ; Zhao et al., 2005 ).

Meanwhile, Mariscal, Marbán & Fernández (2010) proposed a new refined data mining process based on a global comparative analysis of existing frameworks while Angelov (2014) outlined a data analytics framework based on statistical concepts. Following a similar approach, some researchers suggest explicit integration with other areas and organizational functions, for example, BI-driven Data Mining by Hang & Fong (2009) . Similarly, Chen, Kazman & Haziyev (2016) developed an architecture-centric agile Big Data analytics methodology, and an architecture-centric agile analytics and DevOps model. Alternatively, several authors tackled data mining methodology adaptations in other domains, for example, educational data mining by Tavares, Vieira & Pedro (2017) , decision support in learning management systems ( Murnion & Helfert, 2011 ), and in accounting systems ( Amani & Fadlalla, 2017 ).

Other studies are concerned with actionability of data mining and closer integration with business processes and organizational management frameworks (Purpose 4). In particular, there is a recurrent focus on embedding data mining solutions into knowledge-based decision making processes in organizations, and supporting fast and effective knowledge discovery ( Bohanec, Robnik-Sikonja & Borstnar, 2017 ).

Examples of adaptations made for this purpose include: (1) integration of CRISP-DM with the Balanced Scorecard framework used for strategic performance management in organizations (Yun, Weihua & Yang, 2014); (2) integration with a strategic decision-making framework for revenue management (Segarra et al., 2016); (3) integration with a strategic analytics methodology (Van Rooyen & Simoff, 2008); and (4) integration with a so-called ‘Analytics Canvas’ for management of portfolios of data analytics projects (Kühn et al., 2018). Finally, Ahangama & Poo (2015) explored methodological attributes important for adoption of data mining methodology by novice users. This latter study uncovered factors that could support the reduction of resistance to the use of data mining methodologies. Conversely, Lawler & Joseph (2017) comprehensively evaluated factors that may increase the benefits of Big Data Analytics projects in an organization.

Lastly, a number of studies have proposed adaptations of data mining frameworks (e.g., CRISP-DM) to cater for new technological architectures, new types of datasets and applications (Purpose 5). For example, Lu et al. (2017) proposed a data mining system based on a Service-Oriented Architecture (SOA), Zaghloul, Ali-Eldin & Salem (2013) developed a concept of self-service data analytics, Osman, Elragal & Bergvall-Kåreborn (2017) blended CRISP-DM into a Big Data Analytics framework for Smart Cities, and Niesen et al. (2016) proposed a data-driven risk management framework for Industry 4.0 applications.

Our analysis of RQ3, regarding the purposes of existing data mining methodologies adaptations, revealed the following key findings. Firstly, adaptations of type ‘Modification’ are predominantly targeted at addressing problems that are specific to a given case study. The majority of modifications were made within the domain of IS security, followed by case studies in the domains of manufacturing and financial services. Secondly, this is in clear contrast with adaptations of type ‘Extension’, which are primarily aimed at customizing the methodology to take into account specialized development environments and deployment infrastructures, and to incorporate context-awareness aspects. Thirdly, a recurrent purpose of adaptations of type ‘Integration’ is to combine a data mining methodology with either existing ontologies in an organization or with other domain frameworks, methodologies, and concepts. ‘Integration’ is also used to instill context-awareness and domain knowledge into a data mining methodology, or to adapt it to specialized methods and tools, such as Big Data. The distinctive outcome and value (gaps filled in) of ‘Integrations’ stems from improved knowledge discovery, better actionability of results, improved combination with key organizational processes and domain-specific methodologies, and improved usage of Big Data technologies.

We discovered that the adaptations of existing data mining methodologies found in the literature can be classified into three categories: modification, extension, or integration.

We also noted that adaptations are executed to address deficiencies and the lack of important elements or aspects in the reference methodology (chiefly CRISP-DM). Furthermore, adaptations are also made to improve certain phases, deliverables or process outcomes.

In short, adaptations are made to:

  • improve key phases of reference data mining methodologies; for example, in the case of CRISP-DM these are primarily the business understanding and deployment phases.
  • support knowledge discovery and actionability.
  • introduce context-awareness and a higher degree of formalization.
  • integrate data mining solutions more closely with key organizational processes and frameworks.
  • significantly update CRISP-DM with respect to Big Data technologies, tools, environments and infrastructure.
  • incorporate a broader, explicit context of architectures, algorithms and toolsets as integral deliverables or supporting tools to execute the data mining process.
  • expand and accommodate a broader, unified perspective for incorporating and implementing data mining solutions in the organization, IT infrastructure and business processes.

Threats to Validity

Systematic literature reviews have inherent limitations that must be acknowledged. These threats to validity include subjective bias (internal validity) and incompleteness of search results (external validity).

The internal validity threat stems from the subjective screening and rating of studies, particularly when assessing the studies with respect to relevance and quality criteria. We have mitigated these effects by documenting the survey protocol (SLR Protocol), strictly adhering to the inclusion criteria, and performing significant validation procedures, as documented in the Protocol.

The external validity threat relates to the extent to which the findings of the SLR reflect the actual state of the art in the field of data mining methodologies, given that the SLR only considers published studies that can be retrieved using specific search strings and databases. We have addressed this threat to validity by conducting trial searches to validate our search strings in terms of their ability to identify relevant papers that we knew about beforehand. Also, the fact that the searches led to 1,700 hits overall suggests that a significant portion of the relevant literature has been covered.

In this study, we have examined the use of data mining methodologies by means of a systematic literature review covering both peer-reviewed and ‘grey’ literature. We have found that the use of data mining methodologies, as reported in the literature, has grown substantially since 2007 (four-fold increase relative to the previous decade). Also, we have observed that data mining methodologies were predominantly applied ‘as-is’ from 1997 to 2007. This trend was reversed from 2008 onward, when the use of adapted data mining methodologies gradually started to replace ‘as-is’ usage.

The most frequent adaptations have been in the ‘Extension’ category. This category refers to adaptations that imply significant changes to key phases of the reference methodology (chiefly CRISP-DM). These adaptations particularly target the business understanding, deployment and implementation phases of CRISP-DM (or other methodologies). Moreover, we have found that the most frequent purposes of adaptions are: (1) adaptations to handle Big Data technologies, tools and environments (technological adaptations); and (2) adaptations for context-awareness and for integrating data mining solutions into business processes and IT systems (organizational adaptations). A key finding is that standard data mining methodologies do not pay sufficient attention to deployment aspects required to scale and transform data mining models into software products integrated into large IT/IS systems and business processes.

Apart from the adaptations in the ‘Extension’ category, we have also identified an increasing number of studies focusing on the ‘Integration’ of data mining methodologies with other domain-specific and organizational methodologies, frameworks, and concepts. These adaptions are aimed at embedding the data mining methodology into broader organizational aspects.

Overall, the findings of the study highlight the need to develop refinements of existing data mining methodologies that would allow them to seamlessly interact with IT development platforms and processes (technological adaptation) and with organizational management frameworks (organizational adaptation). In other words, there is a need to frame existing data mining methodologies as being part of a broader ecosystem of methodologies, as opposed to the traditional view where data mining methodologies are defined in isolation from broader IT systems engineering and organizational management methodologies.

Supplemental Information

Supplemental information 1.

Unfortunately, we were not able to upload any graphs (original PNG files). We constructed the graph files based on the examples in the PeerJ template on Overleaf, but were not able to understand why they did not fit; redoing them in new formats would change the flow of all text and the generated PDF file. We submit the graphs in an archived file as part of the supplementary material. We will do our best to redo the graphs based on your instructions.

Supplemental Information 2

The file starts with a Definitions page, which lists and explains all column definitions as well as the SLR scoring metrics. The second page contains the "Peer reviewed" texts, and the next one contains the "grey" literature corpus.

Funding Statement

The authors received no funding for this work.

Additional Information and Declarations

The authors declare that they have no competing interests.

Veronika Plotnikova conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.

Marlon Dumas conceived and designed the experiments, authored or reviewed drafts of the paper, and approved the final draft.

Fredrik Milani conceived and designed the experiments, authored or reviewed drafts of the paper, and approved the final draft.

Primary Sources

PHD PRIME

Data Mining Thesis Ideas

Data mining is an efficient technique that offers a wide range of opportunities for thesis work. Our Data Mining Research Topics provide innovative ideas to help you excel in your future research endeavors. With over 100 world-class professionals contributing their creative insights, we are dedicated to enhancing the quality of your research projects. For original and plagiarism-free writing, stay in touch with phdprime.com. Below, we list several interesting data mining thesis plans, including possible research gaps and suggested methodologies that can help you address those gaps effectively:

  • Explainable AI in Data Mining for Healthcare

Potential Research Gap: In complex domains such as healthcare, the adoption of data mining models is often hindered by their limited explainability, even though many models can predict outcomes accurately. Models are needed that not only predict accurately but also offer explanations that healthcare experts can interpret and trust.

Thesis Plan: Create explainable data mining models for the healthcare sector that balance accuracy with interpretability and can explain their predictions to healthcare experts in an understandable way.

Research Queries:

  • How can we create explainable models while preserving accuracy?
  • What approaches can make complex models (for instance, neural networks) more interpretable?

Methodology:

  • Investigate current interpretability approaches such as attention mechanisms, SHAP, and LIME.
  • Create and examine novel models by combining these approaches with conventional data mining techniques.
  • Validate the models on real healthcare datasets, concentrating on both understandability and predictive performance.

Possible Datasets:

  • UCI Machine Learning Repository (for example, the Diabetes Dataset)
  • MIMIC-III Clinical Database

Anticipated Results:

  • This study could recommend data mining models that provide both high accuracy and clarity.
  • It could offer insights into improving decision-making in healthcare through explainable models.

  • Federated Learning for Privacy-Preserving Data Mining

Potential Research Gap: Growing data privacy concerns and regulations such as GDPR underline the need for privacy-preserving data mining approaches. Federated learning, in which data mining is performed on decentralized data without sharing the raw data, remains insufficiently examined in several fields.

Thesis Plan: Explore the use of federated learning for privacy-preserving data mining in domains such as education, healthcare, and finance.

Research Queries:

  • How can federated learning be applied efficiently to balance performance and privacy?
  • What are the significant issues and possible solutions when implementing federated learning in different fields?

Methodology:

  • Analyze the latest privacy-preserving approaches and federated learning architectures.
  • Apply federated learning models in various fields, using real and synthetic datasets.
  • Assess the privacy impact and performance of the models.

Possible Datasets:

  • Education: UCI Student Performance Dataset
  • Healthcare: MIMIC-III Clinical Database
  • Finance: Lending Club Loan Data

Anticipated Results:

  • This project could propose efficient federated learning models that preserve high performance without compromising data privacy.
  • It could identify domain-specific issues in federated learning and potential solutions.

  • Anomaly Detection in IoT Networks Using Data Mining

Potential Research Gap: With the growth in IoT devices, efficient anomaly detection methods are needed that can handle the specific challenges of IoT networks, such as real-time processing requirements and the wide variety and range of data types.

Thesis Plan: Create and assess innovative anomaly detection approaches suited to IoT networks, especially for detecting operational problems and security threats.

Research Queries:

  • What are the most robust approaches for identifying anomalies in large-scale IoT networks?
  • How can the real-time processing capabilities of anomaly detection techniques be improved?

Methodology:

  • Investigate existing anomaly detection approaches, including unsupervised learning, supervised learning, and hybrid techniques.
  • Address the particular challenges of IoT platforms by improving current methods or creating novel ones.
  • Test and validate the techniques on large-scale IoT datasets.

Possible Datasets:

  • CICIDS 2017 Dataset
  • KDD Cup 1999 Data

Anticipated Results:

  • This project could recommend improved methods for real-time anomaly detection in IoT networks.
  • It could provide insights into the anomalies typically found in IoT data and suggest efficient techniques for identifying them.
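
As a simple baseline for the real-time requirement mentioned above, the following sketch flags a sensor reading as anomalous when its z-score against the readings seen so far exceeds a threshold. It keeps only a running mean and variance (Welford's online update), so each reading is processed in constant time and memory. The class name StreamingZScoreDetector, the threshold, and the warm-up length are illustrative assumptions, and a real IoT study would likely compare this baseline with the learning-based methods listed in the methodology.

```java
/** Illustrative streaming z-score detector; names and parameters are hypothetical. */
final class StreamingZScoreDetector {
    private final double threshold; // z-score above which a reading is flagged
    private final int warmup;       // readings to observe before flagging anything
    private long count;
    private double mean;
    private double m2;              // running sum of squared deviations from the mean

    StreamingZScoreDetector(double threshold, int warmup) {
        if (warmup < 2) {
            throw new IllegalArgumentException("warmup must be at least 2");
        }
        this.threshold = threshold;
        this.warmup = warmup;
    }

    /** Scores the reading against past readings, then folds it into the statistics. */
    boolean isAnomaly(double value) {
        boolean anomaly = false;
        if (count >= warmup) {
            double std = Math.sqrt(m2 / (count - 1)); // sample standard deviation
            anomaly = std > 0 && Math.abs(value - mean) / std > threshold;
        }
        // Welford's online update of mean and squared deviations.
        count++;
        double delta = value - mean;
        mean += delta / count;
        m2 += delta * (value - mean);
        return anomaly;
    }

    public static void main(String[] args) {
        StreamingZScoreDetector detector = new StreamingZScoreDetector(3.0, 5);
        double[] readings = {10, 11, 9, 10, 11, 9, 10, 10, 50};
        for (double r : readings) {
            System.out.println(r + " anomalous? " + detector.isAnomaly(r));
        }
    }
}
```

Note that this sketch also folds flagged readings into the statistics, which makes it tolerant of gradual drift but lets a large outlier widen the variance; excluding flagged readings is an equally defensible choice.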

  • Temporal Data Mining for Predicting Customer Behavior

Potential Research Gap: Temporal features of customer activity can improve predictive accuracy and offer deeper insights, but many data mining approaches do not consider these features sufficiently and concentrate only on static data.

Thesis Plan: Explore temporal data mining approaches for forecasting customer activity and patterns over time, concentrating on areas such as healthcare, finance, and retail.

Research Queries:

  • How can temporal features of customer activity be captured and used efficiently for forecasting?
  • What are the best approaches for combining temporal data with predictive models?

Methodology:

  • Analyze recent temporal data mining methods, including temporal clustering, sequential pattern mining, and time-series analysis.
  • Create models that combine temporal data with customer behavior analysis.
  • Test the models on real-world datasets and compare their performance with conventional techniques.

Possible Datasets:

  • Finance: Historical Stock Price Data
  • Retail: Online Retail Dataset

Anticipated Results:

  • This project could recommend models that offer relevant and accurate forecasts based on temporal customer activity.
  • It could offer improved insights into how temporal data influences customer behavior analysis.
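
One common way to expose temporal structure to a conventional predictive model is to turn a time series into sliding windows of lagged values. The sketch below shows that transformation only; the class name LagWindowSketch, the method windows, and the weekly-purchases example are hypothetical, and the choice of window width is left as a modelling decision.

```java
import java.util.Arrays;

/** Illustrative sliding-window (lag feature) construction; names are hypothetical. */
final class LagWindowSketch {

    /**
     * Returns one row per time step t >= width. Each row holds the previous
     * `width` observations followed by the observation at t, which serves as
     * the prediction target for a supervised model.
     */
    static double[][] windows(double[] series, int width) {
        if (width < 1 || series.length <= width) {
            throw new IllegalArgumentException("series must be longer than the window width");
        }
        int rows = series.length - width;
        double[][] out = new double[rows][width + 1];
        for (int r = 0; r < rows; r++) {
            System.arraycopy(series, r, out[r], 0, width + 1);
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical weekly purchase counts for one customer.
        double[] weeklyPurchases = {1, 2, 3, 4, 5};
        for (double[] row : windows(weeklyPurchases, 2)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```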

  • Multi-Modal Data Mining for Comprehensive Insights

Potential Research Gap: Conventional data mining approaches generally consider a single type of data (for instance, image, text, or numerical data). Combining and examining multi-modal data is crucial for comprehensive insights, and robust methods are required to do this efficiently.

Thesis Plan: Create techniques that combine and examine multi-modal data (for example, numerical data, images, and text) to improve decision-making in domains such as smart cities, social media, and healthcare.

Research Queries:

  • How can multi-modal data be combined and examined efficiently?
  • What are the advantages and potential issues of using multi-modal data mining approaches?

Methodology:

  • Explore approaches for multi-modal data combination and analysis, including deep learning methods such as multi-modal neural networks.
  • Optimize existing approaches or create novel ones to improve multi-modal data combination and analysis.
  • Test and validate the approaches on datasets that include different kinds of data.

Possible Datasets:

  • Smart Cities: Sensor data from different sources.
  • Social Media: Twitter data (which integrates images and text).
  • Healthcare: MIMIC-III Clinical Database (which integrates numerical data, images, and text).

Anticipated Results:

  • Efficient approaches for combining and examining multi-modal data could be suggested.
  • It could provide insights into improving decision-making in different fields through multi-modal data.

  • Mining Educational Data for Early Student Dropout Prediction

Potential Research Gap: Early detection of students who are likely to drop out can substantially improve intervention policies and academic results. Efficient data mining approaches are needed to forecast the likelihood of dropout from academic data.

Thesis Plan: Build predictive models for early detection of students who are at risk of dropping out, using data from different academic sources.

Research Queries:

  • What are the major factors that influence student dropout, and how can they be detected efficiently?
  • How can relevant and accurate predictive models for student dropout be created?

Methodology:

  • Analyze current approaches to academic data mining and dropout forecasting.
  • Create models that combine different data sources such as attendance, demographic details, and educational performance.
  • Validate the models on educational datasets and assess how well they forecast the likelihood of dropout.

Possible Datasets:

  • Openly accessible datasets from universities.
  • UCI Student Performance Dataset.

Anticipated Results:

  • This project could recommend predictive models that accurately detect students at risk of dropping out.
  • It could offer insights into possible intervention policies and the major factors that influence student dropout.

  • Enhancing Data Mining for Big Data with Distributed Processing

Potential Research Gap: Big data challenges conventional data mining approaches because of its variety, velocity, and volume. Innovative distributed processing approaches are important for managing big data effectively.

Thesis Plan: Explore and create data mining approaches that manage big data effectively by using distributed processing architectures.

Research Queries:

  • How can distributed processing architectures be used to improve the scalability and effectiveness of data mining approaches?
  • What are the major issues and potential solutions when applying data mining to big data?

Methodology:

  • Examine existing distributed processing frameworks such as Apache Spark and Apache Hadoop.
  • Create data mining methods suited to distributed processing.
  • Test and validate the methods on large-scale datasets.

Possible Datasets:

  • A wide range of enterprise datasets
  • Google Cloud Public Datasets and openly available datasets from sources such as Kaggle.

Anticipated Results:

  • This project could suggest scalable data mining approaches for managing big data.
  • It could improve the efficiency and robustness of data mining operations on big data platforms.

  • Ethical and Fair Data Mining in Predictive Analytics

Potential Research Gap: Concerns about the ethical impact of data mining are growing, particularly with regard to bias and fairness in predictive analytics. Effective methods are required to ensure that data mining models and operations are fair and appropriate.

Thesis Plan: Create efficient approaches that minimize bias and ensure fairness in predictive analytics models and data mining operations.

Research Queries:

  • How can fairness and bias be assessed and reduced in data mining operations?
  • What are efficient approaches for building ethical data mining models?

Methodology:

  • Analyze existing approaches for assessing and reducing bias in predictive analytics and data mining.
  • Improve existing techniques or create novel ones to minimize bias and ensure fairness in predictive models.
  • Validate the approaches on datasets that are known to contain biases.

Possible Datasets:

  • COMPAS Recidivism Data
  • UCI Adult Income Dataset

Anticipated Results:

  • This study could create approaches that ensure data mining operations are fair and ethical.
  • It could offer insights into how bias in predictive analytics can be detected and reduced efficiently.
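
To illustrate what assessing bias can mean in practice, the sketch below computes the demographic parity difference, one of several common group-fairness measures: the gap between the positive-prediction rates of two groups. The class and method names are hypothetical, the group membership is simplified to a boolean flag, and this single metric is not a complete fairness audit.

```java
/** Illustrative demographic-parity check; names and inputs are hypothetical. */
final class DemographicParitySketch {

    /** Share of positive predictions among the rows whose group flag equals `group`. */
    static double positiveRate(boolean[] predictedPositive, boolean[] inProtectedGroup, boolean group) {
        int members = 0;
        int positives = 0;
        for (int i = 0; i < predictedPositive.length; i++) {
            if (inProtectedGroup[i] == group) {
                members++;
                if (predictedPositive[i]) {
                    positives++;
                }
            }
        }
        if (members == 0) {
            throw new IllegalArgumentException("group has no members");
        }
        return (double) positives / members;
    }

    /** Protected-group rate minus other-group rate; 0 means equal rates. */
    static double parityDifference(boolean[] predictedPositive, boolean[] inProtectedGroup) {
        return positiveRate(predictedPositive, inProtectedGroup, true)
                - positiveRate(predictedPositive, inProtectedGroup, false);
    }

    public static void main(String[] args) {
        // Hypothetical model outputs for eight individuals.
        boolean[] predicted = {true, false, false, false, true, true, true, false};
        boolean[] protectedGroup = {true, true, true, true, false, false, false, false};
        System.out.println("Parity difference: " + parityDifference(predicted, protectedGroup));
    }
}
```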

  • Temporal Pattern Mining for Financial Market Analysis

Potential Research Gap: Conventional data mining approaches mostly ignore the temporal aspects and features of financial data. Robust techniques that can extract temporal patterns efficiently are essential for gaining insights into financial markets.

Thesis Plan: Create techniques that extract temporal features from financial data in order to examine market patterns and forecast stock prices.

Research Queries:

  • How can temporal features in financial data be captured and examined efficiently?
  • What are the best approaches for extracting temporal patterns in financial markets?

Methodology:

  • Explore the latest approaches to financial market assessment and temporal feature mining.
  • Optimize existing techniques or build novel ones to extract temporal features from financial data.
  • Validate the techniques on historical stock price data and compare their performance with conventional techniques.

Possible Datasets:

  • Cryptocurrency transaction data
  • Historical stock price data from Google Finance or Yahoo Finance.

Anticipated Results:

  • This project could suggest efficient techniques for extracting temporal features from financial data.
  • It could improve the accuracy of stock price forecasting and offer insights into market patterns.

  • Data Mining for Cybersecurity Threat Detection

Potential Research Gap: Because of the high complexity of cyber threats, innovative data mining approaches are required to identify and mitigate potential cybersecurity threats efficiently.

Thesis Plan: Create data mining approaches for identifying cybersecurity threats in system logs and network traffic.

Research Queries:

  • What are the most robust approaches for identifying cybersecurity threats with the aid of data mining?
  • How can data mining models be optimized to handle emerging cyber threats?

Methodology:

  • Analyze existing data mining approaches for identifying cybersecurity threats.
  • Improve existing techniques or create novel ones to identify threats in real time from system logs and network traffic.
  • Validate the techniques on cybersecurity datasets and evaluate how well they identify different kinds of threats.

What data mining project can I do with Java that would be easy for an intermediate developer and how do I go about it?

Several data mining projects can be developed with Java. For an intermediate developer, we suggest an attainable and realistic project plan, along with step-by-step instructions, tools, and code examples:

Project: Customer Segmentation with Clustering

Goal: The main aim of this project is to examine consumer data and divide customers into distinct clusters based on their purchasing activity. This segmentation can help businesses tailor their marketing strategies more effectively.

Reason for Appropriateness:

  • The project involves interpreting customer activity and working with real-world data.
  • It uses core data mining concepts such as clustering.
  • It is a straightforward approach that yields useful insights.

Project Classification:

  • Project Arrangement and Data Gathering
  • Data preprocessing
  • Clustering Application
  • Assessment and Analysis
  • Visualization and reporting

Procedural Instruction:

  • Project Arrangement and Data Gathering

Tools:

  • Java Development Kit (JDK)
  • Weka (a Java-based data mining toolkit)
  • Eclipse or IntelliJ IDEA (IDE)
  • MySQL (useful for data storage)

Procedures:

  • Install and Set Up Java and IDE: Make sure that the JDK and the selected IDE are installed correctly.
  • Set Up Weka: Download Weka and add it to the Java project. Weka provides a wide range of libraries for machine learning and data mining.
  • Gather Data: Use a freely available dataset from UCI, such as the Online Retail Dataset, which contains consumer purchasing activity.

Code Snippet for Weka Incorporation:

// Import Weka libraries
import weka.core.Instances;
import weka.clusterers.SimpleKMeans;
import weka.core.converters.ConverterUtils.DataSource;

// Load dataset (these calls throw Exception, so place them in a method that declares or handles it)
DataSource source = new DataSource("path/to/your/dataset.arff");
Instances data = source.getDataSet();

  • Data Preprocessing

  • Load Data: Import the dataset into the Java application.
  • Clean Data: Normalize the data, eliminate duplicates, and handle missing values.

Code Snippet for Data Preprocessing:

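// Remove instances with missing values in any attribute
// (deleteWithMissingClass() would fail here because no class attribute is set for clustering)
for (int i = 0; i < data.numAttributes(); i++) {
    data.deleteWithMissing(i);
}

// Normalize numeric attributes to the [0, 1] range
weka.filters.unsupervised.attribute.Normalize normalize = new weka.filters.unsupervised.attribute.Normalize();
normalize.setInputFormat(data);
Instances normalizedData = weka.filters.Filter.useFilter(data, normalize);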

  • Clustering Application

  • Select a Clustering Algorithm: K-Means is an efficient and straightforward method for clustering.
  • Apply Clustering: Use Weka’s SimpleKMeans class to cluster the preprocessed data.

Code Snippet for Clustering:

// Set up K-Means clustering
SimpleKMeans kmeans = new SimpleKMeans();
kmeans.setNumClusters(5); // Choose the number of clusters
kmeans.setSeed(10);       // Fixed seed so that runs are repeatable
kmeans.buildClusterer(normalizedData);

// Output cluster results
for (int i = 0; i < normalizedData.numInstances(); i++) {
    int cluster = kmeans.clusterInstance(normalizedData.instance(i));
    System.out.println("Instance " + i + " belongs to cluster " + cluster);
}

  • Assessment and Analysis

  • Assess Cluster Quality: Use metrics such as the Within-Cluster Sum of Squares (WCSS) or the Silhouette Score.
  • Examine Clusters: Analyze the features of each cluster to interpret the customer segments.

Code Snippet for Assessment:

// Evaluate cluster quality using WCSS
double wcss = kmeans.getSquaredError();
System.out.println("Within-Cluster Sum of Squares: " + wcss);
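
The Silhouette Score mentioned above is not covered by the WCSS snippet, so here is a minimal plain-Java sketch of it. It assumes the data has already been exported to a numeric array and that each row has a cluster label from 0 to k-1 (for example, the values returned by clusterInstance); the class SilhouetteSketch and its methods are hypothetical helpers, not part of Weka. The computation is quadratic in the number of rows, so it suits small to medium datasets.

```java
/** Illustrative mean silhouette score with Euclidean distance; names are hypothetical. */
final class SilhouetteSketch {

    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Mean of s(i) = (b - a) / max(a, b) over all points; values near 1 indicate well-separated clusters. */
    static double meanSilhouette(double[][] points, int[] labels, int k) {
        int n = points.length;
        int[] sizes = new int[k];
        for (int label : labels) {
            sizes[label]++;
        }
        double total = 0.0;
        for (int i = 0; i < n; i++) {
            int own = labels[i];
            if (sizes[own] <= 1) {
                continue; // singleton clusters score 0 by convention
            }
            double[] distSum = new double[k];
            for (int j = 0; j < n; j++) {
                if (j != i) {
                    distSum[labels[j]] += euclidean(points[i], points[j]);
                }
            }
            double a = distSum[own] / (sizes[own] - 1); // mean distance within own cluster
            double b = Double.POSITIVE_INFINITY;        // mean distance to nearest other cluster
            for (int c = 0; c < k; c++) {
                if (c != own && sizes[c] > 0) {
                    b = Math.min(b, distSum[c] / sizes[c]);
                }
            }
            if (Double.isInfinite(b)) {
                continue; // only one non-empty cluster: score 0
            }
            double larger = Math.max(a, b);
            if (larger > 0) {
                total += (b - a) / larger;
            }
        }
        return total / n;
    }

    public static void main(String[] args) {
        double[][] points = { {0, 0}, {0, 1}, {10, 0}, {10, 1} };
        int[] labels = {0, 0, 1, 1};
        System.out.println("Mean silhouette: " + meanSilhouette(points, labels, 2));
    }
}
```

Comparing this score across several values passed to setNumClusters is one way to choose the number of clusters, alongside looking at how WCSS falls as k grows.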

  • Visualization and Reporting

  • Visualize Clusters: Use Weka’s visualization tools or Java libraries such as JFreeChart.
  • Create Reports: Present the significant findings with visualizations and summaries.

Code Snippet for Visualization:

// Example using JFreeChart to create a simple bar chart
import org.jfree.chart.ChartFactory;
import org.jfree.chart.ChartPanel;
import org.jfree.chart.JFreeChart;
import org.jfree.data.category.DefaultCategoryDataset;
import javax.swing.JFrame;

public class ClusterChart extends JFrame {

    public ClusterChart() {
        DefaultCategoryDataset dataset = new DefaultCategoryDataset();
        // Add cluster data (placeholder values)
        dataset.addValue(1.0, "Cluster 1", "Segment 1");
        dataset.addValue(4.0, "Cluster 2", "Segment 2");

        JFreeChart barChart = ChartFactory.createBarChart(
                "Customer Segmentation",
                "Cluster",
                "Number of Customers",
                dataset);

        ChartPanel chartPanel = new ChartPanel(barChart);
        chartPanel.setPreferredSize(new java.awt.Dimension(800, 600));
        setContentPane(chartPanel);
    }

    public static void main(String[] args) {
        ClusterChart chart = new ClusterChart();
        chart.pack();
        chart.setVisible(true);
    }
}

Data Mining Thesis Topics

We have listed a few data mining thesis plans, covering possible research gaps and relevant methodologies. We have also recommended a realistic and achievable data mining project plan, with step-by-step procedures to help you complete it efficiently.

  • Matrix Decomposition Methods for the Improvement of Data Mining in Telecommunications
  • Associational approach of text data mining and its implications
  • The relationship between artificial intelligence and data mining: application to future military information systems
  • The Anatomy of Weka4WS: A WSRF-enabled Toolkit for Distributed Data Mining on Grid
  • Handling Structured Data Using Data Mining Clustering Techniques
  • A Self-Adaptive Hybrid Genetic Algorithm for Data Mining Applications
  • Severe Hail Prediction within a Spatiotemporal Relational Data Mining Framework
  • Analysis of Data Mining Techniques for Constructing a Predictive Model for Academic Performance
  • Short-term PV Power Prediction Based on Data Mining and Multi-kernel SVM
  • Using Data Mining Techniques To Enhance The Student Performance. A semantic review.
  • Data mining and modeling in scientific databases
  • Multi-robot Cooperative Pursuit Based on Association Rule Data Mining
  • Data Mining Based Partitioning of Dynamic Voltage Control Areas and Contingency Clustering
  • Using procedure reasoning system for knowledge discovery in data mining
  • Applying Data Mining to Pseudo-Relevance Feedback for High Performance Text Retrieval
  • Research of RFID Data mining based on supply chain management
  • Data mining approaches to software fault diagnosis
  • Extraction Rules-Based Relational Data Mining for Power Project Management Ontology
  • An object tracking scheme for wireless sensor networks using data mining mechanism
  • An ontology based semantic heterogeneity measurement framework for optimization in distributed data mining
  • Data mining for customers’ positive reaction to advertising in social media
  • Developing the System of Web-Data Mining from Chemical Database Based on Internet
  • Spatial data mining: clustering of hot spots and pattern recognition
  • Ultrasound Imaging Optimization by Using Data Mining Techniques
  • A Data Mining Based Pervasive User Requests Prediction Method in e-Learning Systems
  • Prediction of Depression in Social Network Sites Using Data Mining
  • Applying Data Mining and Mathematical Morphology to Borehole Data Coming from Exploration and Mining Industry
  • Design of TCM Research Demand System Based on Data Mining Technology
  • Predicting higher secondary results by data mining algorithms with VBR: A feature reduction method
  • Tour Route Planning Algorithm Based on Precise Interested Tourist Sight Data Mining



Master thesis topics [closed]

I am looking for a thesis topic to complete my master's degree. I am interested in predictive analytics in marketing, HR, management, or finance, using data mining applications.

I have found a very interesting subject: "Predicting customer churn using decision tree" or "Predicting employee turnover using decision tree". I looked around very hard but unfortunately couldn't find any relevant dataset to download (Telecommunication Customer churn Dataset).

I would like to work on a similar subject using the decision tree technique.

Please suggest some topics or projects that would make for a good master's thesis subject.

  • data-mining
  • predictive-modeling
  • decision-trees


2 Answers

This is the approach I took:

  • Find journals related to your field of studies
  • Skim through the proceedings, see if there are titles that catch your interest
  • Read the papers (carefully or globally) that seemed interesting
  • Carefully consider the approaches and whatever future suggestions they present in their papers
  • Think critically: What would you change? What do you want to find out? Don't limit yourself to data but rather orient from the perspective of research. Solutions for data might only become apparent when you know exactly what you want to examine.

I think this has advantages because these papers outline details regarding data as well -- perhaps you can use the same.

Present some papers and your idea to your prospective supervisor and he/she will make some suggestions. Researchers generally have a lot of knowledge about the possibilities and might even be curious about some things themselves.

Good luck! And enjoy.

lennyklb

First, talk to your thesis advisor before committing to a project. They know better than I do.

Secondly, just analyzing a new dataset using standard techniques doesn't make for a good masters thesis. Your project is expected to use some sort of novel approach.

With that said, I'd suggest that you start by reading up on existing decision tree techniques, learning why they work and what their flaws are, and try to find ways to overcome the flaws. Then, once you have your improvement, it should be relatively easy to find a dataset to apply it to.

Timothy Nodine


Trending Data Mining Thesis Topics

Data mining is the practice of analyzing large amounts of data to uncover business insights that can help firms fix issues, reduce risks, and embrace new possibilities. This article provides a complete picture of data mining thesis topics, where you can get all the information you need about data mining research.

How to Implement Data Mining Thesis Topics

How does data mining work?

  • A standard data mining project begins with asking the appropriate business question; the appropriate data is then collected to tackle it and prepared for analysis.
  • What happens in the earlier stages determines how successful the later stages are.
  • Data miners should ensure the quality of the data they use as input, because poor data quality leads to poor outcomes.
  • Establishing a detailed understanding of the project factors, such as the present business scenario, the project’s main business goal, and the performance objectives.
  • Identifying the data required to address the problem and collecting it from all relevant sources.
  • Addressing any errors, such as incomplete or duplicate data, and converting the data into a format suitable for answering the research questions.
  • Applying algorithms to find patterns in the data.
  • Determining whether and how a model’s output will contribute to achieving a business objective.
  • Using an iterative process to identify the best method and obtain the optimum outcome.
  • Preparing the project’s findings for use in real-time decision-making.

  The techniques and actions listed above are repeated until the best outcomes are achieved. Our engineers and developers have extensive knowledge of the tools, techniques, and approaches used in the processes described above. We guarantee that we will provide the best research advice on data mining thesis topics and complete your project on schedule. What are the important data mining tasks?

Data Mining Tasks 

  • Data mining is applied in many ways, including the description, analysis, and summarization of data, and clarifying conceptual understanding through data description
  • Prediction, classification, dependency analysis, segmentation, and case-based reasoning are also important data mining tasks
  • Regression – prediction of numerical values (stock prices, temperatures, and total sales)
  • Data warehousing – consolidated data storage that supports business decision making and large-scale data mining
  • Classification – predicting the target class of each record and assigning records to categories
  • Association rule learning – market-basket-style analysis that establishes relationships between the variables in a data set
  • Machine learning – statistical, probability-based decision making in which behavior is learned from data rather than explicitly programmed
  • Data analytics – evaluation of digital data for business purposes
  • Clustering – partitioning a data set into clusters and subclusters to analyze the natural structure of the data
  • Artificial intelligence – human-like analysis of data for reasoning, problem solving, learning, and planning
  • Data preparation and cleansing – identifying and removing errors while converting raw data into a processed form
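
To make one of the tasks above concrete, here is a small sketch of association rule learning. It computes the two standard measures for a rule "A implies B": support (the share of transactions containing both A and B) and confidence (the share of transactions containing A that also contain B). The function names and the shopping baskets are illustrative assumptions, not data from any real study.

```python
def count_containing(transactions, itemset):
    """Count the transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for transaction in transactions if itemset <= set(transaction))


def support(transactions, itemset):
    """Fraction of all transactions that contain the itemset."""
    return count_containing(transactions, itemset) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions with the antecedent that also have the consequent."""
    with_antecedent = count_containing(transactions, antecedent)
    if with_antecedent == 0:
        return 0.0
    with_both = count_containing(transactions, set(antecedent) | set(consequent))
    return with_both / with_antecedent


# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]
print("support(bread, butter):", support(baskets, {"bread", "butter"}))
print("confidence(bread -> butter):", confidence(baskets, {"bread"}, {"butter"}))
```

Full association rule miners such as Apriori search over many candidate itemsets; this sketch only scores a single rule that the caller supplies.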

You can visit our website for a more in-depth look at all of these operations. We supply you with the needed data, as well as any additional data you may need for your data mining thesis topics. We supply non-plagiarized data mining thesis assistance for any fresh idea of your choice. Let us now discuss the stages in data mining that are to be included in your thesis topics.

How to work on a data mining thesis topic? 

 The following are the important stages or phases in developing data mining thesis topics.

  • First, identify the present demand and the question to be addressed
  • Next, define the problem precisely
  • Third, collect the data
  • Then analyze alternative solutions and designs
  • Design the proposed methodology
  • Finally, implement the system

Usually, our experts help in writing code and implementing it successfully without hassle. By consistently following the above steps, you can develop one of the best current data mining thesis topics. Furthermore, it is technically important for you to have a good understanding of the tasks and techniques involved in data mining, which are listed below.

  • Data visualization
  • Neural networks
  • Statistical modeling
  • Genetic algorithms and neural networks
  • Decision trees and induction
  • Discriminant analysis
  • Induction techniques
  • Association rules and data visualization
  • Bayesian networks
  • Correlation
  • Regression analysis
  • Regression analysis and regression trees
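
Two of the techniques in this list, correlation and regression analysis, are simple enough to sketch directly. The example below computes the Pearson correlation coefficient and an ordinary least-squares line for one predictor. The function names and the sample numbers are assumptions chosen for illustration only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical measurements: y rises roughly two units per unit of x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = fit_line(xs, ys)
print("correlation:", round(pearson(xs, ys), 3))
print("fitted line: y = %.2f * x + %.2f" % (slope, intercept))
```

Both functions assume that the inputs vary (a constant sequence would cause a division by zero), which a production version would need to check.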

If you want to select the best tool for your data mining project, evaluating its consistency and efficiency comes first. For this, you need enough technical data from projects executed in real time, for which you can contact us directly. Since we have successfully delivered a large number of data mining thesis topics, we can help you find better solutions to all your research issues. What are the points to remember about data mining strategy?

  • Data mining strategies must be chosen before tools, to avoid using strategies that do not align with the study's true purposes.
  • The typical data mining strategy is to evaluate a variety of methods and select the one that best fits the situation.
  • There are, however, some principles that can guide the choice of effective strategies for data mining projects.
  • Decision trees, for example, are a common choice because they are easy to handle and comprehend.
  • They can work with both categorical and numerical data.
  • They are largely unaffected by extreme values, and they can function with incomplete information.
  • They can expose interactions between variables and nonlinear relationships.
  • They can handle noise in records.
  • They can process huge amounts of data.
  • Decision trees, on the other hand, have significant drawbacks.
  • Many rules are often necessary for dependent variables or multiple regressions, and small changes in the data can result in very different tree structures.
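
The discussion of decision trees above can be illustrated with the core operation of tree induction: choosing the threshold on a numeric feature that best separates the classes. The sketch below uses Gini impurity as the split criterion, which is one common choice among several; the function names and the tiny data set are assumptions for the example.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best binary split.

    The threshold is None if no split improves on the unsplit node.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between two equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical data: customer age versus whether a purchase was made.
ages = [22, 25, 30, 35, 40, 45]
bought = ["no", "no", "no", "yes", "yes", "yes"]
print(best_split(ages, bought))

# Changing a single label moves the chosen threshold, a small-scale
# version of the instability described in the last bullet above.
bought_changed = ["no", "no", "yes", "yes", "yes", "yes"]
print(best_split(ages, bought_changed))
```

A full decision tree applies this search recursively to each resulting subset and across all features, which is why a small change near the root can reshape the whole tree.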

All such pros and cons of various data mining aspects are discussed on our website. We will provide you with high-quality research and thesis writing assistance. You can see proof of our skill, and of the unique approaches we have developed in the field, by looking at the sample theses on our website. We also offer an internal review to help you feel more confident. Let us now discuss the recent data mining methodologies.

Current methods in Data Mining

  • Prediction of data (time series data mining)
  • Discriminant and cluster analysis
  • Logistic regression and segmentation
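
As a small illustration of the first item, prediction on time series data, the sketch below produces a one-step-ahead forecast with simple exponential smoothing. This is only one of many forecasting methods and is chosen here for brevity; the function name, the smoothing factor, and the sales figures are assumptions for the example.

```python
def exponential_smoothing_forecast(series, alpha):
    """One-step-ahead forecast using simple exponential smoothing.

    alpha in (0, 1] controls how strongly recent values dominate:
    values near 1 track the latest observation, values near 0 smooth heavily.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in the interval (0, 1]")
    if not series:
        raise ValueError("series must contain at least one value")
    level = series[0]
    for value in series[1:]:
        # Blend the new observation with the running level.
        level = alpha * value + (1 - alpha) * level
    return level


# Hypothetical monthly sales figures.
sales = [120, 132, 128, 141, 150, 147]
print("next-month forecast:", round(exponential_smoothing_forecast(sales, 0.5), 2))
```

Simple exponential smoothing assumes no trend or seasonality, so for series with either, a thesis project would normally compare it against richer models.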

Our technical specialists usually provide accurate data, a thorough and detailed explanation, and technical notes for all of these processes and algorithms. As a result, you can get all of your questions answered in one place. Our technical team is also well-versed in current trends, allowing us to provide realistic explanations for all new developments. We will now talk about the latest data mining trends.

Latest Trending Data Mining Thesis Topics

  • Visual data mining and data mining software engineering
  • Interaction and scalability in data mining
  • Exploring applications of data mining
  • Biological and visual data mining
  • Cloud computing and big data integration
  • Data security and protecting privacy in data mining
  • Novel methodologies in complex data mining
  • Data mining in multiple databases and rationalities
  • Query language standardization in data mining
  • Integration of MapReduce, Amazon EC2, S3, Apache Spark, and Hadoop into data mining
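
The last topic mentions MapReduce. The programming model itself can be sketched on a single machine in plain Python: a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. The code below is only a conceptual illustration with hypothetical function names; it does not use Hadoop, Spark, or any other framework's API.

```python
from collections import defaultdict


def map_phase(transaction):
    """Emit an (item, 1) pair for every item in one transaction."""
    for item in transaction:
        yield item, 1


def shuffle(pairs):
    """Group the emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Aggregate all values for one key into a single count."""
    return key, sum(values)


def item_counts(transactions):
    """Count item frequencies with a map, shuffle, reduce pipeline."""
    pairs = (pair for transaction in transactions for pair in map_phase(transaction))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical transactions; item frequency is the first step of many
# frequent-pattern mining algorithms.
transactions = [["bread", "milk"], ["bread", "butter"], ["milk"]]
print(item_counts(transactions))
```

In a distributed setting the map and reduce functions run on many machines in parallel, and the shuffle step moves data across the network, which is where most of the engineering effort lies.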

These are the recent trends in data mining. We recommend that you choose the topic that interests you the most. Having an appropriate content structure or template is essential while writing a thesis. With this in mind, we design the plan in a chronological order relevant to the study assessment. The incorporation of citations is one of the most important aspects of the thesis, so we focus not only on writing but also on citing essential sources in the text. Students frequently struggle to prepare an appropriate proposal when commencing their thesis. We have years of experience in providing research and data mining thesis writing services to the scientific community, and these services are promptly delivered and widely acknowledged. We will now talk about future research directions in various data mining thesis topics.

Future Research Directions of Data Mining

  • The future of data mining and data science looks promising, as the volume of data continues to grow.
  • A widely cited industry forecast expected the total amount of data in the digital universe to grow from 4.4 zettabytes in 2013 to 44 zettabytes by 2020.
  • Another widely cited estimate put the amount of new data generated at 1.7 megabytes per second for every human being on the planet.
  • Mining algorithms have been transformed as technology has advanced, and so have the tools for obtaining useful insights from data.
  • Once, only organizations like NASA could use their powerful computers to examine data, because the cost of producing and processing data was simply too high.
  • Organizations now use cloud-based data warehouses to carry out all kinds of work with machine learning, artificial intelligence, and deep learning.

The Internet of Things and wearable electronics, for instance, have turned connected devices into data-generating engines that offer vast insight into people and organizations, provided firms can gather, store, and analyze the data quickly enough. What are the aspects to remember when choosing the best data mining thesis topics?

  • An excellent thesis topic is a broad concept that has to be developed, verified, or refuted.
  • Your thesis topic must capture your own curiosity and engage both your supervisor and other academics.
  • Your thesis topic must be relevant to your studies and should be able to withstand examination.

Our engineers and experts can provide you with any type of research assistance on any of these data mining development tools. We satisfy the criteria of your university by ensuring several revisions, appropriate formatting and editing of your thesis, a comprehensive grammar check, and so on. As a result, you can contact us with confidence for complete assistance with your data mining thesis. What are the important data mining thesis topics?

Trending Data Mining Research Thesis Topics

Research Topics in Data Mining

  • Handling cost-sensitive, unbalanced, non-static data
  • Issues related to data mining and their solutions
  • Network settings in data mining and ensuring privacy, security, and integrity of data
  • Environmental and biological issues in data mining
  • Complex data mining and sequential data mining (time series data)
  • Data mining at higher dimensions
  • Multi-agent data mining and distributed data mining
  • High-speed data mining
  • Development of unified data mining theory

We currently provide full support for all parts of research study, development, and investigation, including project planning, technical advice, legitimate scientific data, thesis writing, paper publication, assignments, internal review, and many other services. As a result, you can contact us for any kind of help with your data mining thesis topics.

Why Work With Us?

9 big reasons to select us:

Senior research member: Our Editor-in-Chief owns the website and oversees and delivers every aspect of PhD Direction's service to scholars and students, keeping watch over the management of all our clients.

Research experience: Our world-class certified experts have 18+ years of experience in research and development programs (industrial research) and have guided as many scholars as possible in developing strong PhD research projects.

Journal member: We are associated with 200+ reputed SCI- and SCOPUS-indexed journals (SJR ranking) so that research work can be published in standard journals (your first-choice journal).

Book publisher: PhDdirection.com is the world's largest book publishing platform, working mainly in subject-wise categories to assist scholars and students with writing their books and placing them in university libraries.

Research ethics: Our researchers uphold research ethics such as confidentiality and privacy, novelty (valuable research), plagiarism-free work, and timely delivery. Our customers are free to examine their current research activities.

Business ethics: Our organization prioritizes customer satisfaction, online and offline support, and professional delivery of work, since these are the factors that drive our business.

Valid references: Solid work is delivered by our young, qualified global research team. References are the key to evaluating work more easily, because we carefully assess scholars' findings.

Explanations: Detailed videos, readme files, and screenshots are provided for all research projects. We provide TeamViewer support and other online channels for project explanation.

Paper publication: Publication in worthy journals such as IEEE, ACM, Springer, IET, and Elsevier is our main focus. We substantially reduce scholars' burden on the publication side and carry scholars from initial submission to final acceptance.

COMMENTS

  1. 105 Data Mining Essay Topic Ideas & Examples

    If you are a student or a professional looking to write an essay on data mining, here are 105 topic ideas and examples to help you get started. The importance of data mining in today's business world. Ethical considerations in data mining. The impact of data mining on privacy.

  2. 82 Data Mining Essay Topic Ideas & Examples

    Commercial Uses of Data Mining. Data mining process entails the use of large relational database to identify the correlation that exists in a given data. The principal role of the applications is to sift the data to identify correlations. A Discussion on the Acceptability of Data Mining.

  3. Latest Research and Thesis topics in Data Mining

    Topics to study in data mining. Data mining is a relatively new thing and many are not aware of this technology. This can also be a good topic for M.Tech thesis and for presentations. Following are the topics under data mining to study: Fraud Detection. Crime Rate Prediction.

  4. 60 Data Mining Essay Topics & Research Titles at StudyCorgi

    Application of Data Mining Techniques to Healthcare Data. Data Mining as a Hybrid Methodology for Complex and Dynamic Research. Data Visualization and Data Analytics in the Process of Data Mining. An Analysis of Customer Retention and Insurance Claim Patterns Using Data Mining. Data Mining Techniques Against Cyber Threats.

  5. data mining Latest Research Papers

    The accurate average value is 74.05% of the existing COID algorithm, and our proposed algorithm has 77.21%. The average recall value is 81.19% and 89.51% of the existing and proposed algorithm, which shows that the proposed work efficiency is better than the existing COID algorithm. Download Full-text.

  6. Data Mining Dissertation Topics

    Data Mining Dissertation Topics. The term "data mining" refers to an intelligent data lookup capacity that uses statistics-based algorithms and methodologies to find trends, patterns, links, and correlations within the collected data and records. Audio, Pictorial, Video, textual, online, and social media-based mining are only a few examples ...

  7. PDF The application of data mining methods

    This thesis first introduces the basic concepts of data mining, such as the definition of data mining, its basic function, common methods and basic process, and two common data mining methods, classification and clustering. Then a data mining application in network is discussed in detail, followed by a brief introduction on data mining ...

  8. PDF Data Mining Thesis Topics in Finland

    information regarding degree topics stays up to date. 2.2 Data Mining Overview Data mining is the process of discovering patterns and relationships in large volumes of data by using methods from the areas of computer science, statistics and artificial intelligence [12]. Data mining is a general term and it can often be confusing. Moreover,

  9. Data Mining

    Achieving Long Term Fairness through Curiosity Driven Reinforcement Learning: How intrinsic motivation influences fairness in algorithmic decision making. van der Wee, W. J. (Author), Pechenizkiy, M. (Supervisor 1), Gajane, P. (Supervisor 2) & Kapodistria, S. (Supervisor 2), 28 Aug 2023. Student thesis: Master.

  10. Research Topics & Ideas: Data Science

    Data Science-Related Research Topics. Developing machine learning models for real-time fraud detection in online transactions. The use of big data analytics in predicting and managing urban traffic flow. Investigating the effectiveness of data mining techniques in identifying early signs of mental health issues from social media usage.

  11. Data Mining Thesis Topics

    Data Mining Thesis Topics & Ideas are suggested with an extensive summary along with the research methodology, and also efficient master thesis topics for implementing text mining information recovery on scientific literature collection are provided by us in an extensive manner. The below specified information will be beneficial and assistive ...

  12. (PDF) Trends in data mining research: A two-decade review using topic

    The last two topics, Text Mining and Data Streams have attracted steady interest from researchers. The results presented here shed light on the structure and trends of data mining over the past ...

  13. MASTER'S THESIS

    objective in this thesis is to get a deeper knowledge of the data mining area and evaluate methods to automatically classify the data in structured personal registries. In this report I'll describe different methods and algorithms used in data mining, apply a few selected popular algorithms on two data sets and then discuss the result.

  14. Adaptations of data mining methodologies: a systematic literature

    This article addresses this gap via a systematic literature review focused on the context in which data mining methodologies are used and the adaptations they undergo. The literature review covers 207 peer-reviewed and 'grey' publications. We find that data mining methodologies are primarily applied 'as-is'.

  15. Dissertations / Theses on the topic 'Data mining'

    Consult the top 50 dissertations / theses for your research on the topic 'Data mining.'. Next to every source in the list of references, there is an 'Add to bibliography' button. Press on it, and we will generate automatically the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

  16. (PDF) APPLYING DATA MINING TECHNIQUES OVER BIG DATA

    Data mining is concerned with knowledge discovery and finding patterns in datasets through a process of applying the model to the data [13]. The model, the heart of the data mining process, is ...

  17. Data Mining Thesis Topics

    Data mining is examined as an efficient technique that provides a wide range of opportunities to conduct thesis works. Our Data Mining Research Topics provide innovative ideas to help you excel in your future research endeavors. With over 100 world-class professionals contributing their creative insights, we are dedicated to enhancing the ...

  18. Dissertations / Theses on the topic 'Data mining

    Consult the top 50 dissertations / theses for your research on the topic 'Data mining - Research.'. Next to every source in the list of references, there is an 'Add to bibliography' button. Press on it, and we will generate automatically the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago ...

  19. data mining

    First, talk to your thesis advisor before committing to a project. They know better than I do. Secondly, just analyzing a new dataset using standard techniques doesn't make for a good masters thesis. Your project is expected to use some sort of novel approach.

  20. Trending Data Mining Thesis Topics

    Integration of MapReduce, Amazon EC2, S3, Apache Spark, and Hadoop into data mining. These are the recent trends in data mining. We insist that you choose one of the topics that interest you the most. Having an appropriate content structure or template is essential while writing a thesis.

  21. Educational Data Mining Clustering Approach: Case Study of

    This study aims to investigate the potential of educational data mining (EDM) in addressing the issue of delayed completion in undergraduate student thesis courses. Delayed completion of these courses is a common issue that affects both students and higher education institutions. This study employed clustering analysis to create clusters of thesis topics. The research model was constructed ...