Edited by
This edition first published 2020
© 2020 John Wiley & Sons Ltd
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
The right of Deepak Gupta, Siddhartha Bhattacharyya, Ashish Khanna, and Kalpna Sagar to be identified as the authors of the editorial material in this work has been asserted in accordance with law.
Registered Offices
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
Editorial Office
The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks does not warrant the accuracy of the text or exercises in this book. This work's use or discussion of MATLAB® software or related products does not constitute endorsement or sponsorship by The MathWorks of a particular pedagogical approach or particular use of the MATLAB® software.
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Library of Congress Cataloging-in-Publication Data
Names: Gupta, Deepak, editor.
Title: Intelligent data analysis : from data gathering to data comprehension / edited by Dr. Deepak Gupta, Dr. Siddhartha Bhattacharyya, Dr. Ashish Khanna, Ms. Kalpna Sagar.
Description: Hoboken, NJ, USA : Wiley, 2020. | Series: The Wiley series in intelligent signal and data processing | Includes bibliographical references and index.
Identifiers: LCCN 2019056735 (print) | LCCN 2019056736 (ebook) | ISBN 9781119544456 (hardback) | ISBN 9781119544449 (adobe pdf) | ISBN 9781119544463 (epub)
Subjects: LCSH: Data mining. | Computational intelligence.
Classification: LCC QA76.9.D343 I57435 2020 (print) | LCC QA76.9.D343 (ebook) | DDC 006.3/12–dc23
LC record available at https://lccn.loc.gov/2019056735
LC ebook record available at https://lccn.loc.gov/2019056736
Cover Design: Wiley
Cover Image: © gremlin/Getty Images
Deepak Gupta would like to dedicate this book to his father, Sh. R.K. Gupta, his mother, Smt. Geeta Gupta, his mentors for their constant encouragement, and his family members, including his wife, brothers, sisters, kids and the students.
Siddhartha Bhattacharyya would like to dedicate this book to his parents, the late Ajit Kumar Bhattacharyya and the late Hashi Bhattacharyya, his beloved wife, Rashni, and his research scholars, Sourav, Sandip, Hrishikesh, Pankaj, Debanjan, Alokananda, Koyel, and Tulika.
Ashish Khanna would like to dedicate this book to his parents, the late R.C. Khanna and Smt. Surekha Khanna, for their constant encouragement and support, and to his wife, Sheenu, and children, Master Bhavya and Master Sanyukt.
Kalpna Sagar would like to dedicate this book to her father, Mr. Lekh Ram Sagar, and her mother, Smt. Gomti Sagar, the strongest persons of her life.
The Intelligent Signal and Data Processing (ISDP) book series is aimed at fostering the field of signal and data processing, which encompasses the theory and practice of algorithms and hardware that convert signals produced by artificial or natural means into a form useful for a specific purpose. The signals might be speech, audio, images, video, sensor data, telemetry, electrocardiograms, or seismic data, among others. The possible application areas include transmission, display, storage, interpretation, classification, segmentation, or diagnosis. The primary objective of the ISDP book series is to evolve future-generation scalable intelligent systems for faithful analysis of signals and data. ISDP is mainly intended to enrich the scholarly discourse on intelligent signal and image processing in different incarnations. ISDP will benefit a wide range of learners, including students, researchers, and practitioners. The student community can use the volumes in the series as reference texts to advance their knowledge base. In addition, the monographs will come in handy to aspiring researchers because of the valuable contributions they make to this field. Moreover, both faculty members and data practitioners are likely to grasp the depth of the relevant knowledge base from these volumes.
The series coverage will contain, not exclusively, the following:
Intelligent data analysis (IDA), knowledge discovery, and decision support have recently become more challenging research fields and have gained much attention among a large number of researchers and practitioners. In our view, the awareness of these challenging research fields and emerging technologies among the research community will increase the applications in biomedical science. This book aims to present the various approaches, techniques, and methods that are available for IDA, and to present case studies of their application.
This volume comprises 18 chapters focusing on the latest advances in IDA tools and techniques.
Machine learning models are broadly categorized into two types: white box and black box. Due to the difficulty in interpreting their inner workings, some machine learning models are considered black box models. Chapter 1 focuses on the different machine learning models, along with their advantages and limitations as far as the analysis of data is concerned.
With the advancement of technology, the amount of data generated has become very large. The data generated contains useful information that needs to be gathered by data analytics tools in order to make better decisions. In Chapter 2, the definition of data and its classifications based on different factors are given. The reader will learn what data is and how it is broken down into categories. After describing what data is, the chapter focuses on defining and explaining big data and the various challenges faced when dealing with it. The authors also describe various types of analytics that can be performed on large data and six data analytics tools (Microsoft Excel, Apache Spark, OpenRefine, R, Hadoop, and Tableau).
In recent years, the widespread use of computers and the internet has led to the generation of data on an unprecedented scale. To make effective use of this data, it must be collected and analyzed so that inferences can be made to improve various products and services. Statistics deals with the collection, organization, and analysis of data. In Chapter 3, the organization and description of data is studied under descriptive statistics, while the analysis of data and how to make predictions based on it is dealt with under inferential statistics.
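The distinction between describing a sample and inferring something about the population behind it can be sketched in a few lines. The measurements below are invented for illustration and do not come from Chapter 3; the interval uses a normal approximation, which is only rough for a sample this small (a t-distribution would usually be preferred).

```python
import math
import statistics

# Hypothetical sample of eight measurements (made up for this sketch).
sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]

# Descriptive statistics: summarize the sample itself.
mean = statistics.mean(sample)
stdev = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)

# Inferential statistics: estimate a range for the unknown population mean.
# Assumption: normal approximation with a two-sided 95% confidence level.
z = statistics.NormalDist().inv_cdf(0.975)
half_width = z * stdev / math.sqrt(len(sample))
ci = (mean - half_width, mean + half_width)

print(f"sample mean = {mean:.3f}, sample stdev = {stdev:.3f}")
print(f"approximate 95% interval for the population mean: {ci[0]:.3f} to {ci[1]:.3f}")
```

The first two quantities only describe the eight observed values, whereas the interval is a statement about data that was not observed, which is the step that makes the analysis inferential.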
After the previous chapters have given an idea of various aspects of IDA, Chapter 4 presents an overview of data mining. It also discusses the process of knowledge discovery in data along with a detailed analysis of various mining methods, including classification, clustering, and decision trees. The chapter concludes with a view of data visualization and probability concepts for IDA.
In Chapter 5, the authors demonstrate one of the most crucial and challenging areas in computer vision and the IDA field based on manipulating the convergence. This subject is divided into a deep learning paradigm for object segmentation in computer vision and a visualization paradigm for efficient incremental interpretation in manipulating the datasets for supervised and unsupervised learning, and online or offline training in reinforcement learning. This topic has recently had a large impact in robotics and autonomous systems, food detection, recommendation systems, and medical applications.
Dental caries is a painful bacterial disease of teeth caused mainly by Streptococcus mutans, acid, and carbohydrates, and it destroys the enamel, or the dentine, layer of the tooth. As per the World Health Organization report, worldwide, 60–90% of school children and almost 100% of adults have dental caries. Dental caries and periodontal disease left untreated for long periods cause tooth loss. There is no single method to detect caries in its earliest stages. Determining the size of carious lesions and detecting caries early are very challenging tasks for dental practitioners. The methods related to dental caries detection are the radiograph, quantitative light-induced fluorescence (QLF), ECM, FOTI, and DIFOTI, among others. In a radiograph-based technique, dentists analyze the image data. In Chapter 6, the authors present a method to detect caries by analyzing the secondary emission data.
With the growth of data in the education field in recent years, there is a need for intelligent data analytics so that academic data can be used effectively to improve learning. Educational data mining and learning analytics are the fields of IDA that play important roles in the intelligent analysis of educational data. One of the real challenges faced by students and institutions alike is the quality of education. An equally important factor related to the quality of education is the performance of students in the higher education system. The decisions that students make while selecting their area of specialization are of grave concern here. In the absence of support systems, students and their teachers/mentors fall short when making the right decisions for the furthering of their chosen career paths. Therefore, in Chapter 7, the authors attempt to address the issue by proposing a system that can guide students to choose and focus on the right course(s) based on their personal preferences. For this purpose, a system has been envisaged that blends data mining and classification with big data. A methodology using the MapReduce framework and association rule mining is proposed in order to derive the right blend of courses for students to pursue to enhance their career prospects.
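As a rough illustration of the association rule mining idea, the sketch below computes support and confidence for pairs of courses over a small, invented set of student enrollments. The course names, the data, the thresholds, and the helper `pair_rules` are all hypothetical and are not taken from Chapter 7; the chapter's MapReduce-based pipeline is not reproduced here.

```python
from collections import Counter
from itertools import combinations

def pair_rules(transactions, min_support=0.4, min_confidence=0.6):
    """Return rules (antecedent, consequent, support, confidence) for item pairs."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of transactions containing both items
        if support < min_support:
            continue
        # Evaluate the rule in both directions: a -> b and b -> a.
        for antecedent, consequent in ((a, b), (b, a)):
            confidence = count / item_counts[antecedent]
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, support, confidence))
    return sorted(rules)

# Hypothetical course enrollments, one set per student.
enrollments = [
    {"statistics", "machine learning"},
    {"statistics", "machine learning", "databases"},
    {"databases", "networks"},
    {"statistics", "databases"},
    {"machine learning", "statistics"},
]

rules = pair_rules(enrollments)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}: support={support:.2f}, confidence={confidence:.2f}")
```

A rule such as "machine learning -> statistics" with high confidence would suggest that students who take the first course usually also take the second, which is the kind of signal a course recommendation system could build on.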
Atmospheric air pollution is creating significant health problems that affect millions of people around the world. Chapter 8 analyzes the hypothesis about whether or not global green space variation is changing the global air quality. The authors perform a big data analysis with a data set that contains more than 1M (1 048 000) green space and air quality data points, covering 190 countries during the years 1990 to 2015. Air quality is measured by considering the particulate matter (PM) value. The analysis is carried out using multivariate graphs and a k-means clustering algorithm. The relative geographical changes of the tree areas, as well as the level of the air quality, were identified, and the results were encouraging.
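For readers unfamiliar with k-means clustering, the following is a minimal sketch of the standard Lloyd iteration on two-dimensional points. The (green space, PM) pairs and the choice of starting centroids are invented for illustration; they are not the Chapter 8 data, and the function `k_means` is a hypothetical helper rather than the authors' implementation.

```python
import math

def k_means(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, clusters

# Hypothetical (green space %, PM value) pairs, made up for this sketch.
points = [(10.0, 40.0), (12.0, 38.0), (11.0, 42.0),
          (45.0, 12.0), (50.0, 10.0), (47.0, 14.0)]

# Assumption: start from two of the data points as initial centroids.
centroids, clusters = k_means(points, [points[0], points[3]])
for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

In practice the result depends on the initial centroids and on how the features are scaled, so real analyses typically normalize the inputs and run several initializations.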
Space technology and geotechnology, such as geographic information systems, play a vital role in the day-to-day activities of a society. In the initial days, data collection was very rudimentary and primitive. The quality of the data collected was subject to verification and the accuracy of the data was also questionable. With the advent of newer technology, these problems have been overcome. With modern sophisticated systems, space science has changed drastically. Implementing cutting-edge spaceborne sensors has made it possible to capture real-time data from space. Chapter 9 focuses on these aspects in detail.
Transportation plays an important role in our overall economy, conveying products and people through progressively complex, interconnected, and multidimensional transportation frameworks. However, the complexities of present-day transportation cannot be managed by previous systems. The utilization of IDA frameworks and strategies, with compelling information gathering and data dispersion frameworks, provides the opportunities required for building the future intelligent transportation systems (ITSs). In Chapter 10, the authors exhibit the application of IDA in IoT-based ITS.
Chapter 11 aims to observe emerging patterns and trends by using big data analysis to enhance predictions of motor vehicle collisions using a data set consisting of 17 attributes and 998 193 collisions in New York City. The data is extracted from the New York City Police Department (NYPD). The data set is then tested with three classification algorithms: k-nearest neighbor, random forest, and naive Bayes. The outputs are captured using a k-fold cross-validation method. These outputs are used to identify and compare classifier accuracy, random forest node accuracy, and processing time. Further, an analysis of raw data is performed describing the four different vehicle groups in order to detect significance within the recorded period. Finally, extreme cases of collision severity are identified using outlier analysis. The analysis demonstrates that, of the three classifiers, random forest gives the best results.
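The general shape of comparing classifiers with k-fold cross-validation can be sketched as follows. This is only an illustration under stated assumptions: the points and the "minor"/"severe" labels are invented, a 1-nearest-neighbor classifier and a majority-class baseline stand in for the chapter's three algorithms, and the folds are contiguous and unshuffled for simplicity. None of it reproduces the NYPD data or the authors' setup.

```python
import math
from collections import Counter

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def nearest_neighbor_predict(train_X, train_y, x):
    """1-nearest-neighbor: return the label of the closest training point."""
    best = min(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return train_y[best]

def majority_predict(train_X, train_y, x):
    """Baseline: always predict the most frequent training label."""
    return Counter(train_y).most_common(1)[0][0]

def cross_validate(X, y, predict, k=5):
    """Mean accuracy over k folds; each fold is held out once for testing."""
    accuracies = []
    for fold in k_fold_indices(len(X), k):
        held_out = set(fold)
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(X)) if i not in held_out]
        correct = sum(predict(train_X, train_y, X[i]) == y[i] for i in fold)
        accuracies.append(correct / len(fold))
    return sum(accuracies) / k

# Hypothetical two-feature records with made-up severity labels.
X = [(0.0, 0.0), (10.0, 10.0), (1.0, 0.5), (9.0, 10.5), (0.5, 1.0),
     (10.5, 9.0), (1.5, 1.5), (9.5, 9.5), (0.2, 0.8), (10.2, 9.8)]
y = ["minor", "severe"] * 5

knn_accuracy = cross_validate(X, y, nearest_neighbor_predict, k=5)
baseline_accuracy = cross_validate(X, y, majority_predict, k=5)
print(f"1-NN accuracy: {knn_accuracy:.2f}, baseline accuracy: {baseline_accuracy:.2f}")
```

Because every record is used for testing exactly once, the averaged accuracy gives a fairer comparison between classifiers than a single train/test split; real data would normally be shuffled before the folds are formed.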
Neurological disorders are diseases that are related to the brain, nervous system, and spinal cord of the human body. These disorders may affect the walking, speaking, learning, and moving capacity of human beings. Some of the major human neurological disorders are stroke, brain tumors, epilepsy, meningitis, Alzheimer's, etc. Additionally, remarkable growth has been observed in the areas of disease diagnosis and health informatics. Critical human disorders related to the lung, kidney, skin, and brain have been successfully diagnosed using different data mining and machine learning techniques. In Chapter 12, several neurological and psychological disorders are discussed. The role of different computing techniques in designing different biomedical applications is presented. In addition, the challenges and promising areas of innovation in designing a smart and intelligent neurological disorder diagnostic system using big data, the internet of things, and emerging computing techniques are also highlighted.
Bug reports are one of the crucial software artifacts in open-source software. Issue tracking systems maintain enormous numbers of bug reports with several attributes, such as long descriptions of bugs, threaded discussion comments, and bug meta-data, which includes BugID, priority, status, resolution, time, and others. In Chapter 13, bug reports of 20 open-source projects of the Apache Software Foundation are extracted using a tool named the Bug Report Collection System for trend analysis. As per the quantitative analysis of data, about 20% of open bugs are critical in nature, which directly impacts the functioning of the system. The presence of a large number of bugs of this kind can put systems in a vulnerable position and reduce their risk aversion capability. Thus, it is essential to resolve these issues with high priority. The test lead can assign these issues to the most contributing developers of a project for quick closure of open critical bugs. The comments are mined, which helps identify the developers resolving the majority of bugs, which is beneficial for test leads of distinct projects. From the collated data, the areas more prone to system failures, such as input/output type errors and logical code errors, are determined.
Sentiments are the standard way by which people express their feelings. Sentiments are broadly classified as positive and negative. A problem occurs when users express themselves with words that differ from their actual feelings. This phenomenon is generally known as sarcasm, where people say something opposite to their actual sentiments. Sarcasm detection is of great importance for the correct analysis of sentiments. Chapter 14 attempts to give an algorithm for successful detection of hyperbolic sarcasm and general sarcasm in a data set of sarcastic posts that are collected from pages dedicated to sarcasm on social media sites such as Facebook, Pinterest, and Instagram. This chapter also shows the initial results of the algorithm and its evaluation.
Predictive analytics refers to forecasting future probabilities by extracting information from existing data sets and determining patterns from predicted outcomes. Predictive analytics also includes what-if scenarios and risk assessment. In Chapter 15, an effort has been made to use principles of predictive modeling to analyze an authentic social network data set, and the results have been encouraging. The post-analysis of the results has focused on exhibiting contact details, mobility pattern, and a number of degree of connections/minutes leading to identification of the linkage/bonding between the nodes in the social network.
Modern medicine has been confronted by the major challenge of realizing the promise and capacity of the tremendous expansion in medical data sets of all kinds. Medical databases accumulate a huge bulk of knowledge and data, which calls for specialized tools to store and analyze the data and, as a result, to use the saved knowledge and data effectively. In the process of IDA, information is extracted from data by using a domain's background knowledge. The chapter deals with various matters regarding the use, definition, and impact of these processes, which are tested for their optimization in application domains of medicine. The primary focus of Chapter 16 is on the methods and tools of IDA, with the aim of narrowing the growing gap between data gathering and data comprehension.
Snoozing, or sleeping, is a physical phenomenon of human life. When a person's snooze is disturbed, it causes many problems, such as mental disease, heart disease, etc. Total snooze is characterized by two stages, viz., rapid eye movement and nonrapid eye movement. Bruxism is a type of snooze disorder. The traditional method of prognosis takes time and the result is in analog form. Chapter 17 proposes a method for easy prognosis of snooze bruxism.
Neurodegenerative diseases like Alzheimer's and Parkinson's impair the cognitive and motor abilities of the patient, along with causing memory loss and confusion. Because handwriting depends on proper functioning of the brain and motor control, it is affected too. Alteration in handwriting is one of the first signs of Alzheimer's disease. The handwriting gets shaky due to loss of muscle control, confusion, and forgetfulness. The symptoms get progressively worse. The handwriting becomes illegible and phonological spelling mistakes become inevitable. In Chapter 18, the authors use a feature extraction technique whose output serves as a parameter for diagnosis. A variational autoencoder (VAE), a deep unsupervised learning technique, is applied to compress the input data and then reconstruct it, keeping the target output the same as the input.
This edited volume on IDA gathers researchers, scientists, and practitioners interested in computational data analysis methods, aimed at narrowing the gap between the extensive amounts of data stored in medical databases and the interpretation, understanding, and effective use of the stored data. The expected readers of this book are researchers, scientists, and practitioners interested in IDA, knowledge discovery, and decision support in databases, particularly those who are interested in using these technologies. This publication provides useful references for educational institutions, industry, academic researchers, professionals, developers, and practitioners to apply, evaluate, and reproduce the contributions to this book.
May 07, 2019
New Delhi, India
Deepak Gupta
Bengaluru, India
Siddhartha Bhattacharyya
New Delhi, India
Ashish Khanna
Uttar Pradesh, India
Kalpna Sagar