Cover
List of Contributors
Series Preface
Preface
1 Intelligent Data Analysis: Black Box Versus White Box Modeling
1. 1.1 Introduction
2. 1.2 Interpretation of White Box Models
3. 1.3 Interpretation of Black Box Models
4. 1.4 Issues and Further Challenges
5. 1.5 Summary
6. References
2 Data: Its Nature and Modern Data Analytical Tools
1. 2.1 Introduction
2. 2.2 Data Types and Various File Formats
3. 2.3 Overview of Big Data
4. 2.4 Data Analytics Phases
5. 2.5 Data Analytical Tools
6. 2.6 Database Management System for Big Data Analytics
7. 2.7 Challenges in Big Data Analytics
8. 2.8 Conclusion
9. References
3 Statistical Methods for Intelligent Data Analysis: Introduction and Various Concepts
1. 3.1 Introduction
2. 3.2 Probability
3. 3.3 Descriptive Statistics
4. 3.4 Inferential Statistics
5. 3.5 Statistical Methods
6. 3.6 Errors
7. 3.7 Conclusion
8. References
4 Intelligent Data Analysis with Data Mining: Theory and Applications
1. 4.1 Introduction to Data Mining
2. 4.2 Data and Knowledge
3. 4.3 Discovering Knowledge in Data Mining
4. 4.4 Data Analysis and Data Mining
5. 4.5 Data Mining: Issues
6. 4.6 Data Mining: Systems and Query Language
7. 4.7 Data Mining Methods
8. 4.8 Data Exploration
9. 4.9 Data Visualization
10. 4.10 Probability Concepts for Intelligent Data Analysis (IDA)
11. Reference
5 Intelligent Data Analysis: Deep Learning and Visualization
1. 5.1 Introduction
2. 5.2 Deep Learning and Visualization
3. 5.3 Data Processing and Visualization
4. 5.4 Experiments and Results
5. 5.5 Conclusion
6. References
6 A Systematic Review on the Evolution of Dental Caries Detection Methods and Its Significance in Data Analysis Perspective
1. 6.1 Introduction
2. 6.2 Different Caries Lesion Detection Methods and Data Characterization
3. 6.3 Technical Challenges with the Existing Methods
4. 6.4 Result Analysis
5. 6.5 Conclusion
6. Acknowledgment
7. References
7 Intelligent Data Analysis Using Hadoop Cluster – Inspired MapReduce Framework and Association Rule Mining on Educational Domain
1. 7.1 Introduction
2. 7.2 Learning Analytics in Education
3. 7.3 Motivation
4. 7.4 Literature Review
5. 7.5 Intelligent Data Analytical Tools
6. 7.6 Intelligent Data Analytics Using MapReduce Framework in an Educational Domain
7. 7.7 Results
8. 7.8 Conclusion and Future Scope
9. References
8 Influence of Green Space on Global Air Quality Monitoring: Data Analysis Using K-Means Clustering Algorithm
1. 8.1 Introduction
2. 8.2 Material and Methods
3. 8.3 Results
4. 8.4 Quantitative Analysis
5. 8.5 Discussion
6. 8.6 Conclusion
7. References
9 IDA with Space Technology and Geographic Information System
1. 9.1 Introduction
2. 9.2 Geospatial Techniques
3. 9.3 Comparative Analysis
4. 9.4 Conclusion
5. References
10 Application of Intelligent Data Analysis in Intelligent Transportation System Using IoT
1. 10.1 Introduction to Intelligent Transportation System (ITS)
2. 10.2 Issues and Challenges of Intelligent Transportation System (ITS)
3. 10.3 Intelligent Data Analysis Makes an IoT-Based Transportation System Intelligent
4. 10.4 Intelligent Data Analysis for Security in Intelligent Transportation System
5. 10.5 Tools to Support IDA in an Intelligent Transportation System
6. References
11 Applying Big Data Analytics on Motor Vehicle Collision Predictions in New York City
1. 11.1 Introduction
2. 11.2 Materials and Methods
3. 11.3 Classification Algorithms and K-Fold Validation Using Data Set Obtained from NYPD (2012–2017)
4. 11.4 Results
5. 11.5 Discussion
6. 11.6 Conclusion
7. References
12 A Smart and Promising Neurological Disorder Diagnostic System: An Amalgamation of Big Data, IoT, and Emerging Computing Techniques
1. 12.1 Introduction
2. 12.2 Statistics of Neurological Disorders
3. 12.3 Emerging Computing Techniques
4. 12.4 Related Works and Publication Trends of Articles
5. 12.5 The Need for Neurological Disorders Diagnostic System
6. 12.6 Conclusion
7. References
13 Comments-Based Analysis of a Bug Report Collection System and Its Applications
1. 13.1 Introduction
2. 13.2 Background
3. 13.3 Related Work
4. 13.4 Data Collection Process
5. 13.5 Analysis of Bug Reports
6. 13.6 Threats to Validity
7. 13.7 Conclusion
8. References
9. Notes
14 Sarcasm Detection Algorithms Based on Sentiment Strength
1. 14.1 Introduction
2. 14.2 Literature Survey
3. 14.3 Experiment
4. 14.4 Results and Evaluation
5. 14.5 Conclusion
6. References
7. Notes
15 SNAP: Social Network Analysis Using Predictive Modeling
1. 15.1 Introduction
2. 15.2 Literature Survey
3. 15.3 Comparative Study
4. 15.4 Simulation and Analysis
5. 15.5 Conclusion and Future Work
6. References
16 Intelligent Data Analysis for Medical Applications
1. 16.1 Introduction
2. 16.2 IDA Needs in Medical Applications
3. 16.3 IDA Methods Classifications
4. 16.4 Intelligent Decision Support System in Medical Applications
5. 16.5 Conclusion
6. References
17 Bruxism Detection Using Single-Channel C4-A1 on Human Sleep S2 Stage Recording
1. 17.1 Introduction
2. 17.2 History of Sleep Disorder
3. 17.3 Electroencephalogram Signal
4. 17.4 EEG Data Measurement Technique
5. 17.5 Literature Review
6. 17.6 Subjects and Methodology
7. 17.7 Data Analysis of the Bruxism and Normal Data Using EEG Signal
8. 17.8 Result
9. 17.9 Conclusions
10. Acknowledgments
11. References
18 Handwriting Analysis for Early Detection of Alzheimer's Disease
1. 18.1 Introduction and Background
2. 18.2 Proposed Work and Methodology
3. 18.3 Results and Discussions
4. 18.4 Conclusion
5. References
Index
End User License Agreement

List of Tables

Chapter 2
1. Table 2.1 Schema of an employee table in a broker company.
2. Table 2.2 Data storage measurements.
3. Table 2.3 Comparison of different data analytic tools.
4. Table 2.4 Comparison between SQL and NoSQL.
Chapter 4
1. Table 4.1 Dissimilarities between data and knowledge.
Chapter 6
1. Table 6.1 Global DMFT trends for 12-year-old children [81–83].
2. Table 6.2 Code meaning.
Chapter 7
1. Table 7.1 Educational data set [37].
2. Table 7.2 Synthesized educational data set [38].
3. Table 7.3 The data set for course selection.
4. Table 7.4 Output of Map reduce task.
5. Table 7.5 Best rules found by Apriori.
Chapter 8
1. Table 8.1 Air quality categories (annual mean ambient defined by WHO).
2. Table 8.2 Categorization of the difference of green space area percentage dur...
3. Table 8.3 Analysis of variance (ANOVA) statistics table.
Chapter 9
1. Table 9.1 NoSQL database types.
2. Table 9.2 NoSQL database types.
3. Table 9.3 NoSQL database types.
Chapter 10
1. Table 10.1 Objects, statistical techniques, and graphs supported by R program...
Chapter 11
1. Table 11.1 Illustration of data set attributes.
2. Table 11.2 Categorized vehicle groups.
3. Table 11.3 Description of classification algorithms and functionalities.
4. Table 11.4 Comparison of classifier results.
5. Table 11.5 Analyzed p-value test results.
Chapter 12
1. Table 12.1 Difference between neurological and psychological disorders.
2. Table 12.2 Publications details along with citations used in the study.
Chapter 13
1. Table 13.1 Comparison of previous studies of data extraction.
2. Table 13.2 Categories of error and its significant keywords.
3. Table 13.3 Frequent words for severe and nonsevere bugs.
Chapter 14
1. Table 14.1 Examples for hyperbolic sarcasm.
2. Table 14.2 Examples for general sarcasm, positive sentences, and negative sen...
3. Table 14.3 Shows the patterns used by extended Algorithm 14.2 to detect the p...
4. Table 14.4 Shows example cases for Table 14.3.
5. Table 14.5 True positive and True negative values of the classification resul...
6. Table 14.6 Evaluation results of the classification done by the extended algo...
Chapter 15
1. Table 15.1 Comparison table of literature work.
Chapter 17
1. Table 17.1 The comparative analysis between bruxism and a normal human for th...
2. Table 17.2 The comparative analysis between bruxism and normal human for the ...
3. Table 17.3 The comparative analysis between bruxism and normal human for the ...

List of Illustrations

Chapter 1
1. Figure 1.1 Data analysis process.
2. Figure 1.2 Linear regression.
3. Figure 1.3 Decision tree.
4. Figure 1.4 Distribution of points in case of high and low information gain....
5. Figure 1.5 Partial dependence plots from a gradient boosting regressor train...
6. Figure 1.6 Partial dependence plot from a gradient boosting regressor traine...
7. Figure 1.7 Relationship between X₂ and Y [24].
8. Figure 1.8 ICE plot between feature X₂ and Y [24].
9. Figure 1.9 Calculation of PDP and M-plot [25].
10. Figure 1.10 Calculation of ALE plot [25].
11. Figure 1.11 Correlation does not imply causation [29].
Chapter 2
1. Figure 2.1 Various stages of data.
2. Figure 2.2 Classifications of digital data.
3. Figure 2.3 CSV file opened in Microsoft Excel.
4. Figure 2.4 Plain text file opened in Notepad.
5. Figure 2.8 Characteristics of big data.
6. Figure 2.9 Different types of big data analytics.
7. Figure 2.10 Various phases of data analytics.
8. Figure 2.11 Features of Apache Spark.
9. Figure 2.12 Components of Hadoop.
Chapter 4
1. Figure 4.1 From data to knowledge.
2. Figure 4.2 Variety of data in data mining.
3. Figure 4.3 Knowledge tree for intelligent data mining.
4. Figure 4.4 Knowledge discovery process.
5. Figure 4.5 Relationship between data analysis and data mining.
6. Figure 4.6 Issues in data mining.
7. Figure 4.7 Various systems in data mining.
8. Figure 4.8 Diagrammatic concept of classification.
9. Figure 4.9 Diagrammatic concept of clustering.
10. Figure 4.10 Diagrammatic concept of classification.
11. Figure 4.11 Specimen for decision tree induction.
12. Figure 4.12 Sample representation for stacked column chart.
13. Figure 4.13 Different relationships shown by scatter plots for bivariate ana...
14. Figure 4.14 Different techniques used for data visualization.
15. Figure 4.15 Different sample visualizations used for different cases.
16. Figure 4.16 Different probability distribution functions classification and ...
Chapter 5
1. Figure 5.1 Left: overview of neural network and deep learning; Right: branch...
2. Figure 5.2 (a) Overview of visualization: score function, data loss, and reg...
3. Figure 5.3 Linear model and sample data visualization: left: a simple linear...
4. Figure 5.4 Gradient descent is the excellent to visualization in deep learni...
5. Figure 5.5 left: Design the model with simplify blocks regarding dog detecti...
6. Figure 5.6 The loss of entropy.
7. Figure 5.7 (a) Matrix multiplication for deep learning using linear model. (...
8. Figure 5.8 Optimizer [16]: Adam works and others shows.
9. Figure 5.9 Left: example of block box most uses to visualize the complex net...
10. Figure 5.10 Overview of reinforcement learning model [9]: an agent is visual...
11. Figure 5.11 Deep reinforcement learning.
12. Figure 5.12 Reinforcement learning and visualization.
13. Figure 5.13 Inception v3 module: it was the powerful for visualizing the dee...
14. Figure 5.14 GoogLeNet architecture [12].
15. Figure 5.15 x: input, z: logit, : softmax, y: 1-hot labels;
16. Figure 5.16 Example of interpretation of histogram distribution [Morvan].
17. Figure 5.17 Illustrated the multiple layers features in representation [medi...
18. Figure 5.18 Relationship visualizations: two variables using the scatter dia...
19. Figure 5.19 Comparison method: overview of charts is represented the most co...
20. Figure 5.20 Composition methodology: overview of charts is represented most ...
21. Figure 5.21 Example of visualization applied MNIST data set by using deep le...
22. Figure 5.22 MNIST visualization.
23. Figure 5.23 Example of visualization using MNIST in 3D.
24. Figure 5.24 L1 and L2 regularization.
25. Figure 5.25 Dropout processing and visualization: sampling dropout loss base...
26. Figure 5.26 Mask-RCNN for object detection and segmentation [21].
27. Figure 5.27 Mask-RCNN result progress: training with Mask-RCNN according to ...
28. Figure 5.28 Deep learning and object visualization based on sampling during ...
29. Figure 5.29 Deep learning and object visualization.
30. Figure 5.30 Human detection using Mask RCNN: noised data during the human de...
31. Figure 5.31 Showing the activation function of layers based on food recognit...
32. Figure 5.32 Interpretation of histogram distribution using Mask-RCNN.
33. Figure 5.33 Overfitting representation based on experience from Mask-RCNN [2...
34. Figure 5.34 Weights histogram based on distributed parameters of training se...
35. Figure 5.35 Correlations.
36. Figure 5.36 Visualization of food recognition.
37. Figure 5.37 Visualization for deep matrix factorization model [18].
38. Figure 5.38 Visualization and loss function in deep learning for recommendat...
39. Figure 5.39 Data visualization in MovieLens 1 M of recommendation system bas...
40. Figure 5.40 Line in charts, and modeling and visualization for reinforcement...
Chapter 6
1. Figure 6.1 Dental caries at its different phases.
2. Figure 6.2 Worldwide dental caries severity regions.
3. Figure 6.3 The affected risk of dental caries on smoking.
4. Figure 6.4 Worldwide dental caries affected Level that according to DMFT amo...
5. Figure 6.5 Classification of caries detection method.
6. Figure 6.6 Internal diagram of point detection method.
7. Figure 6.7 Teeth data features along with its distribution.
8. Figure 6.8 Discoloration of enamel under FOTI machine.
9. Figure 6.9 (a) (35–40) mm teeth image, (b) QLF teeth image.
10. Figure 6.10 (a) FOTI device, (b) diagnodent device, (c) QLF machine, (d) car...
11. Figure 6.11 Caries affected lesion, 3D view of the same lesion and it spread...
12. Figure 6.12 Performance of traditional caries detection methods after Bader ...
13. Figure 6.13 Performance of traditional method for Proximal Surfaces after Ba...
Chapter 7
1. Figure 7.1 Artificial intelligence and its subsets using intelligent data an...
2. Figure 7.2 Learner support provided by learning analytics.
3. Figure 7.3 Learning through web and mobile computing.
4. Figure 7.4 Sample techniques for the analytics engine [6].
5. Figure 7.5 Data mining using WEKA tool [36].
6. Figure 7.6 Decision tree generated for the data set [37].
7. Figure 7.7 Distribution table for the data set [37].
8. Figure 7.8 (a). Visualization of student attributes (K = 2) [38]. (b). Visua...
9. Figure 7.9 Working principle of MapReduce framework.
10. Figure 7.10 Output obtained from MapReduce programming framework.
Chapter 8
1. Figure 8.1 The flow of data processing procedure.
2. Figure 8.2 (a) Air quality with land areas in 2014 (using 1 048 576 instance...
3. Figure 8.3 (a) Tree area in 1990. (b) Tree area in 2014. (c) Difference of t...
4. Figure 8.4 Variance of each attribute with coordinates.
5. Figure 8.5 Variance of each attribute.
6. Figure 8.6 Count values of cases in each cluster.
7. Figure 8.7 Tree area percentage/relation of raw data (difference) and ranges...
8. Figure 8.8 Air quality with green space percentage.
9. Figure 8.9 Air quality with green space percentage analysis.
Chapter 9
1. Figure 9.1 Data collection from various sources from the space.
2. Figure 9.2 GIS evolution and future trends.
3. Figure 9.3 Remote sensing big data architecture.
4. Figure 9.4 The machine learning process.
5. Figure 9.5 Big data in remote sensing.
6. Figure 9.6 Big data in remote sensing.
7. Figure 9.7 Geospatial techniques.
8. Figure 9.8 A roadmap for geospatial big data management.
9. Figure 9.9 A roadmap knowledge discovery and service.
10. Figure 9.10 Conceptual diagram of the proposed fogGIS framework for power-ef...
Chapter 10
1. Figure 10.1 Overview of intelligent transportation system.
2. Figure 10.2 Services of intelligent transportation system (ITS).
3. Figure 10.3 Challenges and opportunities in the implementation of ITS.
4. Figure 10.4 Process of intelligent data analysis.
5. Figure 10.5 Three-dimensional model for security in ITS.
6. Figure 10.6 Data types of Python.
Chapter 11
1. Figure 11.1 Overall methodology of data analysis process.
2. Figure 11.2 Accuracy comparison of RF and kNN.
3. Figure 11.3 Random forest node processing time.
4. Figure 11.4 Random forest node accuracy.
5. Figure 11.5 Heat map of large vehicle collisions.
6. Figure 11.6 Heat map of very-small vehicle collisions.
7. Figure 11.7 Comparison of number of collisions, persons injured, and persons...
8. Figure 11.8 Number of persons injured based on vehicle groups.
9. Figure 11.9 Number of persons killed based on vehicle groups.
10. Figure 11.10 Number of persons injured based on borough.
11. Figure 11.11 Number of persons killed based on borough.
12. Figure 11.12 Number of persons injured in medium vehicles over N-68802 colli...
13. Figure 11.13 Number of persons killed in medium vehicles over N-68802 collis...
14. Figure 11.14 Number of persons injured in large vehicles over N-27508 collis...
15. Figure 11.15 Number of persons killed in large vehicles over N-27508 collisi...
16. Figure 11.16 Number of persons injured in small vehicles over N-892174 colli...
17. Figure 11.17 Number of persons killed in small vehicles over N-892174 collis...
18. Figure 11.18 Number of persons injured in very small vehicles over N-9705 co...
19. Figure 11.19 Number of persons killed in very small vehicles over N-9705 col...
Chapter 12
1. Figure 12.1 Types of neurological disorders.
2. Figure 12.2 Prevalence and death rate due to neurological disorders in the y...
3. Figure 12.3 Prevalence of neurological disorders in different countries [15]...
4. Figure 12.4 IoT and big data.
5. Figure 12.5 Soft computing techniques.
6. Figure 12.6 The process to generate an optimal solution [76, 77].
7. Figure 12.7 Machine learning applications.
8. Figure 12.8 The accuracy achieved by different studies for neurological diso...
9. Figure 12.9 Sensitivity achieved by different studies for neurological disord...
10. Figure 12.10 Specificity achieved by different studies for neurological disor...
11. Figure 12.11 Publication trend from 2008 to 2018 for neurological disorder d...
12. Figure 12.12 Neurological disorder diagnostic framework.
Chapter 13
1. Figure 13.1 Statistics of bug reports of 20 projects of the Apache Software ...
2. Figure 13.2 (a) Number of bug reports based on resolution. (b) Number of bug...
3. Figure 13.3 Example of bug report of Accumulo project.
4. Figure 13.4 Data extraction process.
5. Figure 13.5 Number of open bugs of distinct severity level.
6. Figure 13.6 Percentage of open bugs as per severity level.
7. Figure 13.7 (a)–(d) Most contributing developer for 20 projects of Apache So...
8. Figure 13.8 Code for finding correlated words (a) Association graph for logic...
9. Figure 13.9 (a)–(d) Association graphs of various errors for Kafka project (...
10. Figure 13.10 (a)–(b) Frequency and association plots for severe bugs.
11. Figure 13.11 K-means cluster group similar words.
12. Figure 13.12 Dendrogram of most similar words.
Chapter 14
1. Figure 14.1 Sentiment strengths and their elaboration as given by SentiStre...
2. Figure 14.2 Chart showing classification results of all four sentiments.
3. Figure 14.3 Chart showing evaluation results.
Chapter 16
1. Figure 16.1 Conventional decision support system.
2. Figure 16.2 Intelligent system for decision support/expert analysis in layou...
Chapter 17
1. Figure 17.1 Differences of bruxism patient teeth and normal human teeth.
2. Figure 17.2 Flow chart of the proposed work.
3. Figure 17.3 Low pass filter.
4. Figure 17.4 The loading of the bruxism data for the EEG signal and the total...
5. Figure 17.5 Loading of the normal data for the EEG signal in the S2 snooze s...
6. Figure 17.6 Extracted single-channels C4-A1 of the bruxism for the S2 sleep ...
7. Figure 17.7 Extracted single-channels C4-A1 of the normal for the S2 sleep s...
8. Figure 17.8 Filtered C4-A1 channel of S2 sleep stage for bruxism, we used a ...
9. Figure 17.9 Filtered C4-A1 channel of S2 sleep stage for the normal, we used...
10. Figure 17.10 Sampled C4-A1 channel of S2 sleep stage for the bruxism using t...
11. Figure 17.11 Sampled C4-A1 channel of S2 sleep stage for the normal using Ha...
12. Figure 17.12 It has represented the estimation of the power spectral density...
13. Figure 17.13 It has represented the estimation of the power spectral density...
14. Figure 17.14 Graphical representation for the normalized value of the single...
Chapter 18
1. Figure 18.1 This is simplest form of representation of the Encoder architect...
2. Figure 18.2 (a) The encoder compresses data into latent space (y). (b) The d...
3. Figure 18.3 Image reconstruction process.
4. Figure 18.4 Line segment from handwritten sample from patients suffering fro...
5. Figure 18.5 (a and b) Word segmentation samples produced from the segmented ...
6. Figure 18.6 Sample of character segmentation obtained using segmented words....
7. Figure 18.7 Segmented characters reconstructed using VAE.
8. Figure 18.8 Clusters of reconstructed images using VAE. (a) Cluster for “e,”...
9. Figure 18.9 Ambiguous “l” and “e.”
10. Figure 18.10 Ambiguous “c” and “e.”
11. Figure 18.11 Unclear or disconnected writing with spelling errors.

This edition first published 2020

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Deepak Gupta, Siddhartha Bhattacharyya, Ashish Khanna, and Kalpna Sagar to be identified as the authors of the editorial material in this work has been asserted in accordance with law.

Registered Offices

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

Editorial Office

The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty

MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks does not warrant the accuracy of the text or exercises in this book. This work's use or discussion of MATLAB® software or related products does not constitute endorsement or sponsorship by The MathWorks of a particular pedagogical approach or particular use of the MATLAB® software.

While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging-in-Publication Data

Names: Gupta, Deepak, editor.

Title: Intelligent data analysis : from data gathering to data comprehension / edited by Dr. Deepak Gupta, Dr. Siddhartha Bhattacharyya, Dr. Ashish Khanna, Ms. Kalpna Sagar.

Description: Hoboken, NJ, USA : Wiley, 2020. | Series: The Wiley series in intelligent signal and data processing | Includes bibliographical references and index.

Identifiers: LCCN 2019056735 (print) | LCCN 2019056736 (ebook) | ISBN 9781119544456 (hardback) | ISBN 9781119544449 (adobe pdf) | ISBN 9781119544463 (epub)

Subjects: LCSH: Data mining. | Computational intelligence.

Classification: LCC QA76.9.D343 I57435 2020 (print) | LCC QA76.9.D343 (ebook) | DDC 006.3/12–dc23

LC record available at https://lccn.loc.gov/2019056735

LC ebook record available at https://lccn.loc.gov/2019056736

Cover Design: Wiley

Cover Image: © gremlin/Getty Images

Preface

Intelligent data analysis (IDA), knowledge discovery, and decision support have recently become more challenging research fields and have gained much attention among a large number of researchers and practitioners. In our view, the awareness of these challenging research fields and emerging technologies among the research community will increase the applications in biomedical science. This book aims to present the various approaches, techniques, and methods that are available for IDA, and to present case studies of their application.

This volume comprises 18 chapters focusing on the latest advances in IDA tools and techniques.

Machine learning models are broadly categorized into two types: white box and black box. Due to the difficulty in interpreting their inner workings, some machine learning models are considered black box models. Chapter 1 focuses on the different machine learning models, along with their advantages and limitations as far as the analysis of data is concerned.

With the advancement of technology, the amount of data generated is very large. This data contains useful information that data analytics tools must extract in order to support better decisions. Chapter 2 defines data and classifies it according to different factors. The reader will learn what data is and how it is broken down. After describing what data is, the chapter defines and explains big data and the various challenges faced when dealing with it. The authors also describe the various types of analytics that can be performed on large data sets and six data analytics tools (Microsoft Excel, Apache Spark, OpenRefine, R, Hadoop, and Tableau).

In recent years, the widespread use of computers and the internet has led to the generation of data on an unprecedented scale. To make effective use of this data, it must be collected and analyzed so that inferences can be made to improve various products and services. Statistics deals with the collection, organization, and analysis of data. In Chapter 3, the organization and description of data is studied under descriptive statistics, while the analysis of data and how to make predictions based on it is dealt with under inferential statistics.

Building on the aspects of IDA introduced in the previous chapters, Chapter 4 gives an overview of data mining. It also discusses the process of knowledge discovery in data along with a detailed analysis of various mining methods, including classification, clustering, and decision trees. The chapter concludes with a view of data visualization and probability concepts for IDA.

In Chapter 5, the authors demonstrate one of the most crucial and challenging areas in computer vision and the IDA field based on manipulating the convergence. This subject is divided into a deep learning paradigm for object segmentation in computer vision and a visualization paradigm for efficient incremental interpretation when manipulating the datasets for supervised and unsupervised learning, and online or offline training in reinforcement learning. This topic has recently had a large impact on robotics and autonomous systems, food detection, recommendation systems, and medical applications.

Dental caries is a painful bacterial disease of the teeth caused mainly by Streptococcus mutans, acid, and carbohydrates, and it destroys the enamel, or the dentine, layer of the tooth. As per the World Health Organization report, worldwide, 60–90% of school children and almost 100% of adults have dental caries. Dental caries and periodontal disease left untreated for long periods cause tooth loss. There is no single method to detect caries in its earliest stages. Determining the size of carious lesions and detecting caries early are very challenging tasks for dental practitioners. The methods related to dental caries detection include radiographs, quantitative light-induced fluorescence (QLF), ECM, FOTI, DIFOTI, etc. In a radiograph-based technique, dentists analyze the image data. In Chapter 6, the authors present a method to detect caries by analyzing the secondary emission data.

With the growth of data in the education field in recent years, there is a need for intelligent data analytics so that academic data can be used effectively to improve learning. Educational data mining and learning analytics are the fields of IDA that play important roles in the intelligent analysis of educational data. One of the real challenges faced by students and institutions alike is the quality of education. An equally important factor related to the quality of education is the performance of students in the higher education system. The decisions that students make while selecting their area of specialization are of grave concern here. In the absence of support systems, students and their teachers/mentors fall short when making the right decisions for furthering their chosen career paths. Therefore, in Chapter 7, the authors attempt to address the issue by proposing a system that can guide students to choose and focus on the right course(s) based on their personal preferences. For this purpose, a system has been envisaged that blends data mining and classification with big data. A methodology using the MapReduce framework and association rule mining is proposed in order to derive the right blend of courses for students to pursue to enhance their career prospects.

Atmospheric air pollution is creating significant health problems that affect millions of people around the world. Chapter 8 examines the hypothesis that global green space variation is changing global air quality. The authors perform a big data analysis with a data set that contains more than 1M (1 048 000) green space and air quality data points, covering 190 countries during the years 1990 to 2015. Air quality is measured using the particulate matter (PM) value. The analysis is carried out using multivariate graphs and a k-means clustering algorithm. The relative geographical changes of the tree areas, as well as the level of the air quality, were identified, and the results indicated encouraging news.

Space technology and geotechnology, such as geographic information systems, play a vital role in the day-to-day activities of a society. In the initial days, data collection was very rudimentary and primitive. The quality of the collected data needed verification, and its accuracy was questionable. With the advent of newer technology, these problems have been overcome. Modern sophisticated systems have changed space science drastically. Cutting-edge spaceborne sensors have made it possible to capture real-time data from space. Chapter 9 focuses on these aspects in detail.

Transportation plays an important role in our overall economy, conveying products and people through increasingly complex, interconnected, and multidimensional transportation systems. However, the complexities of present-day transportation cannot be managed by previous systems. The use of IDA frameworks and strategies, together with effective information gathering and data distribution frameworks, provides the opportunities required to build the intelligent transportation systems (ITSs) of the future. In Chapter 10, the authors present the application of IDA in IoT-based ITS.

Chapter 11 aims to observe emerging patterns and trends by using big data analysis to enhance predictions of motor vehicle collisions, using a data set consisting of 17 attributes and 998 193 collisions in New York City. The data is obtained from the New York City Police Department (NYPD). The data set is then tested with three classification algorithms: k-nearest neighbor, random forest, and naive Bayes. The outputs are captured using the k-fold cross-validation method. These outputs are used to identify and compare classifier accuracy, as well as random forest node accuracy and processing time. Further, the raw data is analyzed across four different vehicle groups in order to detect significance within the recorded period. Finally, extreme cases of collision severity are identified using outlier analysis. The analysis demonstrates that, of the three classifiers, random forest gives the best results.

Neurological disorders are diseases of the brain, the nervous system, and the spinal cord of the human body. These disorders may affect a person's ability to walk, speak, learn, and move. Some of the major human neurological disorders are stroke, brain tumors, epilepsy, meningitis, and Alzheimer's. Additionally, remarkable growth has been observed in the areas of disease diagnosis and health informatics. Critical human disorders related to the lung, kidney, skin, and brain have been successfully diagnosed using different data mining and machine learning techniques. In Chapter 12, several neurological and psychological disorders are discussed. The role of different computing techniques in designing different biomedical applications is presented. In addition, the challenges and promising areas of innovation in designing a smart and intelligent neurological disorder diagnostic system using big data, the internet of things, and emerging computing techniques are also highlighted.

Bug reports are among the crucial software artifacts in open-source software. Issue tracking systems maintain enormous numbers of bug reports with several attributes, such as long descriptions of bugs, threaded discussion comments, and bug metadata, which includes BugID, priority, status, resolution, time, and others. In Chapter 13, bug reports of 20 open-source projects of the Apache Software Foundation are extracted for trend analysis using a tool named the Bug Report Collection System. According to the quantitative analysis of the data, about 20% of open bugs are critical in nature and directly impact the functioning of the system. The presence of a large number of bugs of this kind can leave systems vulnerable and reduce their risk aversion capability. Thus, it is essential to resolve these issues with high priority. The test lead can assign these issues to the most active contributing developers of a project for quick closure of open critical bugs. The comments are mined to identify the developers who resolve the majority of bugs, which is beneficial for the test leads of distinct projects. From the collated data, the areas more prone to system failures are determined, such as input/output type errors and logical code errors.

Sentiments are the standard way by which people express their feelings. Sentiments are broadly classified as positive and negative. A problem arises when users express themselves with words that differ from their actual feelings. This phenomenon is generally known as sarcasm, where people say the opposite of their actual sentiments. Sarcasm detection is of great importance for the correct analysis of sentiments. Chapter 14 attempts to give an algorithm for the successful detection of hyperbolic sarcasm and general sarcasm in a data set of sarcastic posts collected from pages dedicated to sarcasm on social media sites such as Facebook, Pinterest, and Instagram. The chapter also shows the initial results of the algorithm and its evaluation.

Predictive analytics refers to forecasting future probabilities by extracting information from existing data sets and determining patterns in order to predict outcomes. Predictive analytics also includes what-if scenarios and risk assessment. In Chapter 15, an effort has been made to use the principles of predictive modeling to analyze an authentic social network data set, and the results have been encouraging. The post-analysis of the results has focused on exhibiting contact details, mobility patterns, and the degree of connections per minute, leading to identification of the linkage/bonding between the nodes in the social network.

Modern medicine faces the major challenge of realizing the promise and capacity of the tremendous expansion in medical data sets of all kinds. Medical databases accumulate a huge bulk of knowledge and data, which calls for specialized tools to store and analyze the data and, as a result, to use the saved knowledge and data effectively. In the process of IDA, information is extracted from data by using a domain's background knowledge. The issues addressed concern the use, definition, and impact of these processes, which are tested for their optimization in medical application domains. The primary focus of Chapter 16 is on the methods and tools of IDA, with the aim of narrowing the growing gap between data gathering and data comprehension.

Snoozing, or sleeping, is a physical phenomenon of human life. When human snooze is disturbed, it causes many problems, such as mental disease and heart disease. Total snooze is characterized by two stages, viz., rapid eye movement and non-rapid eye movement. Bruxism is a type of snooze disorder. The traditional method of prognosis takes time, and the result is in analog form. Chapter 17 proposes a method for easy prognosis of snooze bruxism.

Neurodegenerative diseases like Alzheimer's and Parkinson's impair the cognitive and motor abilities of the patient and cause memory loss and confusion. Because handwriting depends on proper functioning of the brain and motor control, it is affected as well. Alteration in handwriting is one of the first signs of Alzheimer's disease. The handwriting becomes shaky due to loss of muscle control, confusion, and forgetfulness. The symptoms get progressively worse: the handwriting becomes illegible and phonological spelling mistakes become inevitable. In Chapter 18, the authors use a feature extraction technique whose output serves as a parameter for diagnosis. A variational autoencoder (VAE), a deep unsupervised learning technique, is applied to compress the input data and then reconstruct it, with the target output kept the same as the input.

This edited volume on IDA gathers researchers, scientists, and practitioners interested in computational data analysis methods aimed at narrowing the gap between the extensive amounts of data stored in medical databases and the interpretable, understandable, and effective use of the stored data. The expected readers of this book are researchers, scientists, and practitioners interested in IDA, knowledge discovery, and decision support in databases, particularly those who are interested in using these technologies. This publication provides useful references for educational institutions, industry, academic researchers, professionals, developers, and practitioners to apply, evaluate, and reproduce the contributions to this book.

May 07, 2019

New Delhi, India

Deepak Gupta

Bengaluru, India

Siddhartha Bhattacharyya

New Delhi, India

Ashish Khanna

Uttar Pradesh, India

Kalpna Sagar

Intelligent Data Analysis

From Data Gathering to Data Comprehension

List of Contributors

Series Preface

Dr. Siddhartha Bhattacharyya, CHRIST (Deemed to be University), Bengaluru, India (Series Editor)

Preface