Intelligent Data Analysis

From Data Gathering to Data Comprehension

 

Edited by

Deepak Gupta

Maharaja Agrasen Institute of Technology

Delhi, India

Siddhartha Bhattacharyya

CHRIST (Deemed to be University)

Bengaluru, India

Ashish Khanna

Maharaja Agrasen Institute of Technology

Delhi, India

Kalpna Sagar

KIET Group of Institutions

Uttar Pradesh, India

 

 

 

Wiley

Deepak Gupta would like to dedicate this book to his father, Sh. R.K. Gupta, his mother, Smt. Geeta Gupta, his mentors for their constant encouragement, his family members, including his wife, brothers, sisters, and kids, and his students.

Siddhartha Bhattacharyya would like to dedicate this book to his parents, the late Ajit Kumar Bhattacharyya and the late Hashi Bhattacharyya, his beloved wife, Rashni, and his research scholars, Sourav, Sandip, Hrishikesh, Pankaj, Debanjan, Alokananda, Koyel, and Tulika.

Ashish Khanna would like to dedicate this book to his parents, the late R.C. Khanna and Smt. Surekha Khanna, for their constant encouragement and support, and to his wife, Sheenu, and children, Master Bhavya and Master Sanyukt.

Kalpna Sagar would like to dedicate this book to her father, Mr. Lekh Ram Sagar, and her mother, Smt. Gomti Sagar, the strongest people in her life.

List of Contributors

  • Ambarish G. Mohapatra, Silicon Institute of Technology, Bhubaneswar, India
  • Anirban Mukherjee, RCC Institute of Information Technology, West Bengal, India
  • Aniruddha Sadhukhan, RCC Institute of Information Technology, West Bengal, India
  • Anisha Roy, RCC Institute of Information Technology, West Bengal, India
  • Arvinder Kaur, Guru Gobind Singh Indraprastha University, India
  • Ayush Ahuja, Jaypee Institute of Information Technology Noida, India
  • Biswajit Modak, Nabadwip State General Hospital, Nabadwip, India
  • R.S. Bhatia, National Institute of Technology, Kurukshetra, India
  • Bright Keswani, Suresh Gyan Vihar University, Jaipur, India
  • Dakun Lai, University of Electronic Science and Technology of China, Chengdu, China
  • Deepak Kumar Sharma, Netaji Subhas University of Technology, New Delhi, India
  • Dhanushka Abeyratne, Yellowfin (HQ), The University of Melbourne, Australia
  • Faijan Akhtar, Jamia Hamdard, New Delhi, India
  • Gihan S. Pathirana, Charles Sturt University, Melbourne, Australia
  • Huy V. Pham, Ton Duc Thang University, Vietnam
  • Malka N. Halgamuge, The University of Melbourne, Australia
  • Manashi De, Techno India, West Bengal, India
  • Manik Sharma, DAV University, Jalandhar, India
  • Manu Agarwal, Jaypee Institute of Information Technology Noida, India
  • Manu Sood, University Shimla, India
  • Md Belal Bin Heyat, University of Electronic Science and Technology of China, Chengdu, China
  • Mohd Ammar Bin Hayat, Medical University, India
  • Moolchand Sharma, Maharaja Agrasen Institute of Technology (MAIT), Delhi, India
  • Nabendu Chaki, University of Calcutta, Kolkata, India
  • Nisheeth Joshi, Banasthali Vidyapith, Rajasthan, India
  • Om Prakash Rishi, University of Kota, India
  • Poonam Keswani, Akashdeep PG College, Jaipur, India
  • Prableen Kaur, DAV University, Jalandhar, India
  • Pragya Katyayan, Banasthali Vidyapith, Rajasthan, India
  • Pratiyush Guleria, University Shimla, India
  • Prerna Sharma, Maharaja Agrasen Institute of Technology (MAIT), Delhi, India
  • Rachna Jain, Bharati Vidyapeeth's College of Engineering, New Delhi, India
  • Rahul Johari, GGSIP University, New Delhi, India
  • Rajib Saha, RCC Institute of Information Technology, West Bengal, India
  • Rakesh Roshan, Institute of Management Studies, Ghaziabad, India
  • Ramneek Singhal, Bharati Vidyapeeth's College of Engineering, New Delhi, India
  • Ravinder Ahuja, Jaypee Institute of Information Technology Noida, India
  • Samarth Chugh, Netaji Subhas University of Technology, New Delhi, India
  • Samridhi Seth, GGSIP University, New Delhi, India
  • Sarthak Gupta, Netaji Subhas University of Technology, New Delhi, India
  • Shadab Azad, Chaudhary Charan Singh University Meerut, India
  • Shafan Azad, Dr. A.P.J. Abdul Kalam Technical University, Uttar Pradesh, India
  • Shajan Azad, Hayat Institute of Nursing, Lucknow, India
  • Shikhar Asthana, Jaypee Institute of Information Technology Noida, India
  • Shivam Bachhety, Bharati Vidyapeeth's College of Engineering, New Delhi, India
  • Shubham Kumaram, Netaji Subhas University of Technology, New Delhi, India
  • Shubhra Goyal, Guru Gobind Singh Indraprastha University, India
  • Siddhant Bagga, Netaji Subhas University of Technology, New Delhi, India
  • Soma Datta, University of Calcutta, Kolkata, India
  • Tarini Ch. Mishra, Silicon Institute of Technology, Bhubaneswar, India
  • Than D. Le, University of Bordeaux, France
  • Vikas Chaudhary, KIET, Ghaziabad, India

Series Preface

Dr. Siddhartha Bhattacharyya, CHRIST (Deemed to be University), Bengaluru, India (Series Editor)

The Intelligent Signal and Data Processing (ISDP) book series is aimed at fostering the field of signal and data processing, which encompasses the theory and practice of algorithms and hardware that convert signals produced by artificial or natural means into a form useful for a specific purpose. The signals might be speech, audio, images, video, sensor data, telemetry, electrocardiograms, or seismic data, among others. The possible application areas include transmission, display, storage, interpretation, classification, segmentation, or diagnosis. The primary objective of the ISDP book series is to evolve future-generation scalable intelligent systems for faithful analysis of signals and data. ISDP is mainly intended to enrich the scholarly discourse on intelligent signal and image processing in different incarnations. ISDP will benefit a wide range of learners, including students, researchers, and practitioners. The student community can use the volumes in the series as reference texts to advance their knowledge base. In addition, the monographs will come in handy to aspiring researchers because of the valuable contributions they make to this field. Moreover, both faculty members and data practitioners are likely to grasp the depth of the relevant knowledge base from these volumes.

The series will cover, but is not limited to, the following:

  1. Intelligent signal processing
    1. Adaptive filtering
    2. Learning algorithms for neural networks
    3. Hybrid soft-computing techniques
    4. Spectrum estimation and modeling
  2. Image processing
    1. Image thresholding
    2. Image restoration
    3. Image compression
    4. Image segmentation
    5. Image quality evaluation
    6. Computer vision and medical imaging
    7. Image mining
    8. Pattern recognition
    9. Remote sensing imagery
    10. Underwater image analysis
    11. Gesture analysis
    12. Human mind analysis
    13. Multidimensional image analysis
  3. Speech processing
    1. Modeling
    2. Compression
    3. Speech recognition and analysis
  4. Video processing
    1. Video compression
    2. Analysis and processing
    3. 3D video compression
    4. Target tracking
    5. Video surveillance
    6. Automated and distributed crowd analytics
    7. Stereo-to-autostereoscopic 3D video conversion
    8. Virtual and augmented reality
  5. Data analysis
    1. Intelligent data acquisition
    2. Data mining
    3. Exploratory data analysis
    4. Modeling and algorithms
    5. Big data analytics
    6. Business intelligence
    7. Smart cities and smart buildings
    8. Multiway data analysis
    9. Predictive analytics
    10. Intelligent systems

Preface

Intelligent data analysis (IDA), knowledge discovery, and decision support have recently become more challenging research fields and have gained much attention from a large number of researchers and practitioners. In our view, greater awareness of these research fields and emerging technologies among the research community will increase their applications in biomedical science. This book aims to present the various approaches, techniques, and methods that are available for IDA, and to present case studies of their application.

This volume comprises 18 chapters focusing on the latest advances in IDA tools and techniques.

Machine learning models are broadly categorized into two types: white box and black box. Due to the difficulty in interpreting their inner workings, some machine learning models are considered black box models. Chapter 1 focuses on the different machine learning models, along with their advantages and limitations as far as the analysis of data is concerned.

With the advancement of technology, the amount of data generated is very large. This data contains useful information that needs to be extracted by data analytics tools in order to make better decisions. Chapter 2 defines data and classifies it based on different factors, so that the reader learns what data is and how it can be broken down. The chapter then defines and explains big data and the various challenges faced when dealing with it. The authors also describe the various types of analytics that can be performed on large data and six data analytics tools (Microsoft Excel, Apache Spark, OpenRefine, R, Hadoop, and Tableau).

In recent years, the widespread use of computers and the internet has led to the generation of data on an unprecedented scale. To make effective use of this data, it must be collected and analyzed so that inferences can be made to improve various products and services. Statistics deals with the collection, organization, and analysis of data. Chapter 3 studies the organization and description of data under descriptive statistics, while the analysis of data and how to make predictions based on it are dealt with under inferential statistics.

Building on the aspects of IDA introduced in the previous chapters, Chapter 4 gives an overview of data mining. It also discusses the process of knowledge discovery in data, along with a detailed analysis of various mining methods, including classification, clustering, and decision trees. The chapter concludes with a view of data visualization and probability concepts for IDA.

In Chapter 5, the authors demonstrate one of the most crucial and challenging areas in computer vision and the IDA field, based on manipulating the convergence. The subject is divided into a deep learning paradigm for object segmentation in computer vision and a visualization paradigm for efficient incremental interpretation when manipulating data sets for supervised and unsupervised learning, and for online or offline training in reinforcement learning. This topic has recently had a large impact in robotics and autonomous systems, food detection, recommendation systems, and medical applications.

Dental caries is a painful bacterial disease of the teeth caused mainly by Streptococcus mutans, acid, and carbohydrates, and it destroys the enamel, or the dentine, layer of the tooth. As per the World Health Organization report, worldwide, 60–90% of school children and almost 100% of adults have dental caries. Dental caries and periodontal disease left untreated for long periods cause tooth loss. There is no single method to detect caries in its earliest stages. Determining the size of carious lesions and detecting caries early are very challenging tasks for dental practitioners. The methods related to dental caries detection are the radiograph, quantitative light-induced fluorescence (QLF), ECM, FOTI, DIFOTI, etc. In a radiograph-based technique, dentists analyze the image data. In Chapter 6, the authors present a method to detect caries by analyzing the secondary emission data.

With the growth of data in the education field in recent years, there is a need for intelligent data analytics so that academic data can be used effectively to improve learning. Educational data mining and learning analytics are the fields of IDA that play important roles in the intelligent analysis of educational data. One of the real challenges faced by students and institutions alike is the quality of education. An equally important factor related to the quality of education is the performance of students in the higher education system. The decisions that students make while selecting their area of specialization are of grave concern here. In the absence of support systems, students and their teachers/mentors fall short when making the right decisions for furthering their chosen career paths. Therefore, in Chapter 7, the authors attempt to address the issue by proposing a system that can guide students to choose and focus on the right course(s) based on their personal preferences. For this purpose, a system has been envisaged by blending data mining and classification with big data. A methodology using the MapReduce framework and association rule mining is proposed in order to derive the right blend of courses for students to pursue to enhance their career prospects.

Atmospheric air pollution is creating significant health problems that affect millions of people around the world. Chapter 8 examines the hypothesis that global green space variation is changing global air quality. The authors perform a big data analysis with a data set that contains more than 1M (1 048 000) green space and air quality data points, covering 190 countries during the years 1990 to 2015. Air quality is measured by considering the particulate matter (PM) value. The analysis is carried out using multivariate graphs and a k-means clustering algorithm. The relative geographical changes of the tree areas, as well as the level of the air quality, were identified, and the results were encouraging.

Space technology and geotechnology, such as geographic information systems, play a vital role in the day-to-day activities of a society. In the initial days, data collection was very rudimentary and primitive. The quality of the data collected was subject to verification, and the accuracy of the data was also questionable. With the advent of newer technology, these problems have been overcome. Modern sophisticated systems have changed space science drastically. Cutting-edge spaceborne sensors have made it possible to capture real-time data from space. Chapter 9 focuses on these aspects in detail.

Transportation plays an important role in our overall economy, conveying products and people through increasingly complex, interconnected, and multidimensional transportation systems. However, the complexities of present-day transportation cannot be managed by previous systems. The use of IDA frameworks and strategies, together with effective information gathering and data distribution systems, provides the opportunities needed to build future intelligent transportation systems (ITSs). In Chapter 10, the authors present the application of IDA in IoT-based ITS.

Chapter 11 aims to observe emerging patterns and trends by using big data analysis to enhance predictions of motor vehicle collisions, using a data set consisting of 17 attributes and 998 193 collisions in New York City. The data is extracted from the New York City Police Department (NYPD). The data set is then tested with three classification algorithms: k-nearest neighbor, random forest, and naive Bayes. The outputs are captured using the k-fold cross-validation method. These outputs are used to identify and compare classifier accuracy, as well as random forest node accuracy and processing time. Further, an analysis of raw data is performed describing the four different vehicle groups in order to detect significance within the recorded period. Finally, extreme cases of collision severity are identified using outlier analysis. The analysis demonstrates that, out of the three classifiers, random forest gives the best results.

Neurological disorders are diseases related to the brain, nervous system, and spinal cord of the human body. These disorders may affect the walking, speaking, learning, and moving capacity of human beings. Some of the major human neurological disorders are stroke, brain tumors, epilepsy, meningitis, Alzheimer's, etc. Additionally, remarkable growth has been observed in the areas of disease diagnosis and health informatics. Critical human disorders related to the lung, kidney, skin, and brain have been successfully diagnosed using different data mining and machine learning techniques. In Chapter 12, several neurological and psychological disorders are discussed. The role of different computing techniques in designing different biomedical applications is presented. In addition, the challenges and promising areas of innovation in designing a smart and intelligent neurological disorder diagnostic system using big data, the internet of things, and emerging computing techniques are also highlighted.

Bug reports are one of the crucial software artifacts in open-source software. Issue tracking systems maintain an enormous number of bug reports with several attributes, such as long descriptions of bugs, threaded discussion comments, and bug meta-data, which includes BugID, priority, status, resolution, time, and others. In Chapter 13, bug reports of 20 open-source projects of the Apache Software Foundation are extracted using a tool named the Bug Report Collection System for trend analysis. As per the quantitative analysis of the data, about 20% of open bugs are critical in nature, which directly impacts the functioning of the system. The presence of a large number of bugs of this kind can leave systems vulnerable and reduce their risk aversion capability. Thus, it is essential to resolve these issues with high priority. The test lead can assign these issues to the most contributing developers of a project for quick closure of open critical bugs. The comments are mined, which helps identify the developers resolving the majority of bugs; this is beneficial for test leads of distinct projects. As per the collated data, the areas more prone to system failures are determined, such as input/output type errors and logical code errors.

Sentiments are the standard way by which people express their feelings. Sentiments are broadly classified as positive and negative. A problem occurs when users express themselves with words that differ from their actual feelings. This phenomenon is generally known as sarcasm, where people say the opposite of their actual sentiments. Sarcasm detection is of great importance for the correct analysis of sentiments. Chapter 14 attempts to give an algorithm for the successful detection of hyperbolic sarcasm and general sarcasm in a data set of sarcastic posts collected from pages dedicated to sarcasm on social media sites such as Facebook, Pinterest, and Instagram. This chapter also shows the initial results of the algorithm and its evaluation.

Predictive analytics refers to forecasting future probabilities by extracting information from existing data sets and determining patterns from predicted outcomes. Predictive analytics also includes what-if scenarios and risk assessment. In Chapter 15, an effort has been made to use principles of predictive modeling to analyze an authentic social network data set, and the results have been encouraging. The post-analysis of the results has focused on exhibiting contact details, mobility pattern, and a number of degree of connections/minutes, leading to identification of the linkage/bonding between the nodes in the social network.

Modern medicine faces a major challenge in realizing the promise and capacity of the tremendous expansion in medical data sets of all kinds. Medical databases accumulate a huge bulk of knowledge and data, which calls for specialized tools to store and analyze the data and, as a result, to make effective use of the saved knowledge and data. In the process of IDA, information is extracted from data by using a domain's background knowledge. Various matters regarding the use, definition, and impact of these processes are dealt with, and the processes are tested for their optimization in application domains of medicine. The primary focus of Chapter 16 is on the methods and tools of IDA, with the aim of narrowing the growing gap between data gathering and data comprehension.

Snoozing, or sleeping, is a physical phenomenon of human life. When sleep is disturbed, it leads to many problems, such as mental disease and heart disease. Sleep is characterized by two stages, viz., rapid eye movement and nonrapid eye movement. Bruxism is a type of sleep disorder. The traditional method of prognosis takes time, and the result is in analog form. Chapter 17 proposes a method for easy prognosis of sleep bruxism.

Neurodegenerative diseases like Alzheimer's and Parkinson's impair the cognitive and motor abilities of the patient and cause memory loss and confusion. Because handwriting depends on the proper functioning of the brain and motor control, it is affected as well. Alteration in handwriting is one of the first signs of Alzheimer's disease. The handwriting gets shaky due to loss of muscle control, confusion, and forgetfulness. The symptoms get progressively worse: the handwriting becomes illegible, and phonological spelling mistakes become inevitable. In Chapter 18, the authors use a feature extraction technique whose output serves as a parameter for diagnosis. A variational autoencoder (VAE), a deep unsupervised learning technique, has been applied; it compresses the input data and then reconstructs it, keeping the target output the same as the input.

This edited volume on IDA gathers researchers, scientists, and practitioners interested in computational data analysis methods, aimed at narrowing the gap between the extensive amounts of data stored in medical databases and the interpretation, understanding, and effective use of the stored data. The expected readers of this book are researchers, scientists, and practitioners interested in IDA, knowledge discovery, and decision support in databases, particularly those who are interested in using these technologies. This publication provides useful references for educational institutions, industry, academic researchers, professionals, developers, and practitioners to apply, evaluate, and reproduce the contributions to this book.

May 07, 2019

New Delhi, India

Deepak Gupta

Bengaluru, India

Siddhartha Bhattacharyya

New Delhi, India

Ashish Khanna

Uttar Pradesh, India

Kalpna Sagar