Tariq Samad, Editor in Chief
George W. Arnold
Giancarlo Fortino
Dmitry Goldgof
Ekram Hossain
Xiaoou Li
Vladimir Lumelsky
Pui-In Mak
Jeffrey Nanzer
Ray Perez
Linda Shafer
Zidong Wang
MengChu Zhou
Kenneth Moore, Director of IEEE Book and Information Services (BIS)
Technical Reviewers
Guoyin Wang, College of Computer Science and Technology, Chongqing University of Posts and Telecommunications
Yan Pei, University of Aizu, Japan
Suichen Gu, Google, Inc.
Copyright © 2016 by the IEEE Computer Society, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data is available.
ISBN: 978-1-119-07628-5
This book is dedicated to my wife, Chao Deng
and my daughter, Fei Tan
The most serious threats to the security of computers and networked systems are computer viruses and unknown intrusions. The rapid development of evasion techniques in viruses invalidates the well-known signature-based virus detection techniques, so a number of novel virus detection approaches have been proposed to cope with this vital security issue. Because of the natural similarities between the biological immune system (BIS) and computer security systems, the artificial immune system (AIS) has developed into a new field in anti-virus research. The various principles and mechanisms of the BIS provide unique opportunities to build novel computer virus detection models that are robust and adaptive in detecting both known and unknown viruses.
The biological immune system is a hierarchical natural system featuring a high degree of distribution, parallelism, and the ability to process complex information, among other useful features. It is also a dynamically adjusting system characterized by the abilities of learning, memory, recognition, and cognition, so that it recognizes and removes antigens effectively in order to protect the organism. The BIS makes full use of various intelligent means to react to an antigen's intrusion, producing accurate immune responses through its intrinsic and adaptive immune abilities. Through mutation, evolution, and learning to adapt to new environments, along with memory mechanisms, the BIS can react much more strongly and quickly to foreign antigens and the like. The BIS consists of intrinsic (i.e., non-specific) and adaptive (i.e., specific) immune responses that cooperate to defend against foreign antigens.
An artificial immune system is an adaptive system inspired by theoretical immunology and observed immune functions, principles, and models, which are applied to problem solving. In other words, the AIS is a computational system, made up of computational intelligence paradigms, that is inspired by the BIS, which is sometimes also referred to as the second brain. The AIS is a dynamic, adaptive, robust, and distributed learning system. Because it is fault tolerant and noise resistant, it is very suitable for applications in time-varying, unknown environments. The AIS has been applied to many complex problem fields, such as optimization, pattern recognition, fault and anomaly diagnosis, network intrusion detection, and virus detection, as well as many others.
Generally speaking, AIS algorithms can be roughly classified into two major categories: population-based and network-based algorithms. Network-based algorithms make use of the concepts of immune network theory, while population-based algorithms use theories and models such as the negative selection principle, the clonal principle, danger theory, and others. Over the past decades, a large number of immune theories and models have been proposed, such as self and nonself models, the clonal selection algorithm, immune networks, dendritic cell algorithms, danger theory, and so on. By mimicking the mechanisms and functions of the BIS, the AIS has developed and is now widely used in anomaly detection, fault detection, pattern recognition, optimization, learning, and so on. Like its biological counterpart, the AIS is characterized by noise tolerance, unsupervised learning, self-organization, memory, recognition, and so on.
In particular, anomaly detection techniques decide whether an unknown test sample is produced by the underlying probability distribution that corresponds to the training set of normal examples. The pioneering work of Forrest and associates led to a great deal of research on, and many proposals for, immune-inspired anomaly detection systems. For example, in the self and nonself model, the central challenge of anomaly detection is determining the difference between normal and potentially harmful activity. Usually, only the self (normal) class is available for training the system; the nonself (anomaly) class is not. Thus, the essence of the anomaly detection task is that the training set contains instances only from the self class, while the test set contains instances of both the self and nonself classes. Specifically, computer security and virus detection can be regarded as typical examples of anomaly detection in artificial immune systems, whose task is protecting computers from viruses, unauthorized users, and so on. In computer security, the AIS has a very strong anomaly detection capability for defending against unknown viruses and intrusions. Adaptability is also a very important feature, allowing the AIS to learn unknown viruses and intrusions and to react quickly to those it has already learned. Other features of the AIS, such as distributability, autonomy, diversity, and disposability, are also required for its flexibility and stability.
The features of the BIS are thus just what a computer security system needs, and the functions of the BIS and of a computer security system are similar to each other to some extent. The biological immune principles therefore provide effective solutions to computer security issues. The research and development of AIS-based computer security detection are receiving increasing attention. The application of immune principles and mechanisms can better protect computers and greatly improve the network environment.
In recent years, computer and networking technologies have developed rapidly and have been used more and more widely in our daily lives. At the same time, computer security issues arise frequently. The large variety of malware, especially new variants and unknown ones, poses a constant and serious threat to computers. What is worse, malware is becoming more complicated and sophisticated, spreading faster and causing greater damage. Meanwhile, a huge volume of spam not only occupies storage and network bandwidth, but also wastes the time users spend handling it, resulting in a great loss of productivity. Although many classic solutions have been proposed, they still have many limitations in dealing with real-world computer security issues.
A computer virus is a program or a piece of code that can infect other programs by modifying them to include an evolved copy of itself. Broadly, one can regard a computer virus as malicious code designed to harm or secretly access a computer system without the owner's informed consent, including viruses, worms, backdoors, Trojans, harmful apps, hacker code, and so on. All programs that are not authorized by users and that perform harmful operations in the background are referred to as viruses; they are characterized by several salient features, including infectivity, destruction, concealment, latency, triggering, and so on.
Computer viruses have evolved with computer technologies and systems. Generally speaking, the development of viruses has gone through several phases, including the DOS boot phase, the DOS executable phase, the virus generator phase, the macro virus phase, and the merging of virus techniques with hacker techniques. As computer viruses have developed and proliferated, they have become the most urgent threat to the security of computers and the Internet.
The battle between viruses and anti-virus techniques is an endless war. Computer viruses disguise themselves by means of various kinds of evasion techniques, including metamorphic and polymorphic techniques and packing and encryption techniques, to name a few. To confront this critical situation, anti-virus techniques have to unpack suspicious programs, decrypt them, and try to be robust against these evasion techniques. Viruses, in turn, evolve to resist unpacking and decryption and to confuse the anti-virus techniques. The fight between viruses and anti-virus techniques is very serious and will go on indefinitely.
Nowadays, a variety of novel virus techniques are continuously emerging and are often one step ahead of the anti-virus techniques. A good anti-virus technique should increase the difficulty of virus intrusion, decrease the losses caused by viruses, and react to an outbreak of viruses as quickly as possible.
Many host-based anti-virus solutions have been proposed by researchers and companies. They can be roughly classified into three categories: static techniques, dynamic techniques, and heuristics.
Static techniques usually work on the bit strings, assembly code, and application programming interface (API) calls of a program without running the program. One of the most famous static techniques is signature-based virus detection, in which a signature is usually a bit string extracted from a virus sample that can identify the virus uniquely.
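As a rough illustration of the signature-based idea, the following minimal Python sketch searches a file's bytes for known signature strings. The signature names and byte patterns (`Demo.A`, `Demo.B`) are invented for this example and do not correspond to any real virus; real scanners use far more elaborate databases and matching engines.

```python
from typing import Dict, List

# Hypothetical signature database: each name maps to a byte string that is
# assumed to identify one sample uniquely. These patterns are made up.
SIGNATURES: Dict[str, bytes] = {
    "Demo.A": bytes.fromhex("deadbeef0102"),
    "Demo.B": b"\x90\x90\xeb\xfe",
}


def scan(data: bytes, signatures: Dict[str, bytes] = SIGNATURES) -> List[str]:
    """Return the names of all signatures that occur anywhere in `data`."""
    return sorted(name for name, sig in signatures.items() if sig in data)


# A clean buffer, an "infected" one, and a variant with a single byte changed.
clean = b"MZ" + bytes(64)
infected = clean + bytes.fromhex("deadbeef0102") + b"tail"
variant = clean + bytes.fromhex("deadbeef0103") + b"tail"

print(scan(clean), scan(infected), scan(variant))
```

The variant differs from the infected buffer by one byte and is no longer matched, which hints at why evasion techniques that alter a virus's byte pattern defeat exact signature matching.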
Dynamic techniques monitor the execution of every program in real time and observe the program's behavior. They usually use the operating system's API call sequences, system calls, and other behavioral characteristics to identify the purpose of a program.
Heuristic approaches make full use of heuristic knowledge and information in the program and its environment, applying intelligent computing techniques such as machine learning, data mining, evolutionary computing, and AIS to detect viruses. They can not only fight known viruses efficiently, but also detect new variants and unseen viruses.
Because classic computer virus detection approaches cannot efficiently detect new variants of viruses and unseen viruses, it is urgent to study novel virus detection approaches in depth. In this respect, immune principle-based computer virus detection approaches have become a preferred choice in the anti-virus research community because they are characterized by a strong detection capability for new variants of viruses and unseen viruses. The immune-based computer virus detection approaches are able to detect new variants and unseen viruses at low false positive rates with limited overhead. These approaches have developed into a new field of computer virus detection and are attracting more and more researchers and practitioners.
The computer virus has been compared to a biological virus because of their similarities, such as parasitism, propagation, infection, the ability to hide, and destruction. The BIS successfully protects the body from antigens from the very beginning of life, resolving the problem of defeating unseen antigens. The computer security system has functions similar to those of the BIS. Furthermore, the features of the AIS, such as being dynamic, adaptive, and robust, are also needed in a computer anti-virus system. Applying immune principles to virus detection enables us to recognize new variants and unseen viruses by using existing knowledge, and the resulting immune principle-based detection approaches inherit these desirable features. The AIS is considered able to make up for the shortcomings of the signature-based virus detection techniques. The immune-based computer virus detection approaches have paved a new way for anti-virus research in the past decades.
Although a number of virus detection models based on immune principles have achieved great success, in particular in detecting new variants and unseen viruses in unknown environments, AIS-based virus detection still has a few drawbacks, such as a lack of rigorous theoretical analysis and an overly simple simulation of the BIS by the AIS. Therefore, there is still a long way to go before the immune-based virus detection approaches can be applied to real-world computer security systems.
The objective of this book is to present the major theories and models we have proposed in recent years, as well as their applications in malware detection, for academics, researchers, and engineering practitioners who are involved or interested in the study, use, design, and development of artificial immune systems (AIS) and AIS-based solutions to computer security issues. Furthermore, this book provides a single record of our achievements to date in computer security based on immune principles.
This book is designed for a professional audience wishing to learn about the state of the art of artificial immune systems and AIS-based malware detection approaches. More specifically, the book offers a theoretical perspective and practical solutions to researchers, practitioners, and graduate students who are working in the area of artificial immune system-based computer security.
The book is organized to progress from simple to complex. To understand its contents comprehensively, readers should have some background in computer architecture and software, computer viruses, artificial intelligence, computational intelligence, pattern recognition, and machine learning.
I hope this book will help shape research on AIS-based malware detection and will give interested readers the state-of-the-art AIS-based malware detection methods and algorithms. Readers may find many of the algorithms in the book helpful for their projects, and some algorithms can also serve as a starting point for researchers to work with.
In addition, many newly proposed malware detection methods are presented in a didactic manner with detailed materials, and their performance is illustrated by a number of experiments and comparisons with state-of-the-art malware detection techniques. Furthermore, a collection of references, resources, and source code is freely available at http://www.cil.pku.edu.cn/research/anti-malware/index.html, http://www.cil.pku.edu.cn/resources/ and http://www.cil.pku.edu.cn/publications/.
This monograph is organized into 11 chapters, which will be briefly described next.
In Chapter 1, AIS is presented after a brief introduction of BIS. Several typical AIS algorithms are presented in detail, followed by features and applications of AIS.
In Chapter 2, malware and its detection methods are introduced in detail. As malware has become a challenge to the security of computer systems, a number of detection approaches have been proposed to cope with the situation. These approaches are classified into three categories: static techniques, dynamic techniques, and heuristics. The classic malware detection approaches and immune-based malware detection approaches are briefly introduced after the background knowledge of malware is given. The immune-based malware detection approaches have paved a new way for anti-malware research.
Because the detection of unknown malware is one of the most important tasks in Computer Immune System (CIS) studies, an NN-based malware detection algorithm is proposed in Chapter 3, using nonself detection techniques, the diversity of antibodies (Ab), and neural networks (NN). A number of experiments are conducted to illustrate that this algorithm has a high detection rate with a very low false-positive rate.
In Chapter 4, using the negative selection principle of the BIS, a novel detector generation algorithm, the multiple-point bit mutation method, is proposed. It uses random multiple-point mutation to search for nonself detectors over a wide range of the whole detector space, so that a required detector set can be obtained in a reasonable computational time.
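To make the idea concrete, here is a minimal Python sketch of negative selection combined with multiple-point bit mutation. It is not the algorithm of Chapter 4: the Hamming-distance matching rule, the random-walk search that starts from a random string, and all names and parameters (`generate_detectors`, `points`, `threshold`, `max_tries`) are assumptions made for illustration only.

```python
import random
from typing import List, Set


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def matches(detector: int, sample: int, threshold: int) -> bool:
    """Assumed matching rule: Hamming distance of at most `threshold`."""
    return hamming(detector, sample) <= threshold


def mutate(bits: int, length: int, points: int, rng: random.Random) -> int:
    """Flip `points` distinct, randomly chosen bit positions."""
    for pos in rng.sample(range(length), points):
        bits ^= 1 << pos
    return bits


def generate_detectors(self_set: Set[int], length: int, n_detectors: int,
                       points: int, threshold: int, rng: random.Random,
                       max_tries: int = 100_000) -> List[int]:
    """Collect up to n_detectors bit strings that match no self string."""
    detectors: Set[int] = set()
    candidate = rng.getrandbits(length)  # assumed starting point
    for _ in range(max_tries):
        if len(detectors) >= n_detectors:
            break
        # Multiple-point mutation moves the candidate through detector space.
        candidate = mutate(candidate, length, points, rng)
        # Negative selection: keep only candidates that match no self string.
        if not any(matches(candidate, s, threshold) for s in self_set):
            detectors.add(candidate)
    return sorted(detectors)


rng = random.Random(0)
self_set = {0b0000000000000000, 0b1111111100000000}
detectors = generate_detectors(self_set, length=16, n_detectors=10,
                               points=3, threshold=2, rng=rng)
print(len(detectors))
```

Flipping several bits at once lets each step jump farther through the detector space than a single-bit mutation would, which is one plausible reading of searching "over a wide range".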
A virus detection system (VDS) based on AIS is proposed in Chapter 5. The VDS first generates the detector set from virus files in the dataset; negative selection and clonal selection are then applied to the detector set to eliminate autoimmune detectors and increase the diversity of the detector set in the nonself space. Two novel hybrid distances, called hamming-max and shift r-bit continuous distance, are proposed to calculate the affinity vectors of each file using the detector set. The detection rates of three classifiers, k-nearest neighbor (KNN), RBF networks, and SVM, are compared for detector lengths of 32 and 64 bits. The experimental results show that the proposed VDS has a strong detection ability and good generalization performance.
As viruses become more complex, existing anti-virus methods are inefficient at detecting various forms of viruses, especially new variants and unknown viruses. Inspired by the immune system, a hierarchical artificial immune system (AIS) model, which is based on matching in three layers, is proposed in Chapter 6 to detect a variety of forms of viruses. Experimental results demonstrate that the proposed model can recognize obfuscated viruses, including new variants of viruses and unknown viruses, efficiently, with an average recognition rate of 94 percent.
In Chapter 7, a malware detection model based on the negative selection algorithm with a penalty factor is proposed to overcome the drawbacks of traditional negative selection algorithms (NSA) in defining the harmfulness of self and nonself. Unlike danger theory, the proposed model is able to detect malware through dangerous signatures extracted from programs. Instead of deleting nonself that matches self, the NSA with penalty factor (NSAPF) penalizes the nonself using penalty factor C and keeps these items in a library. In this way, the effectiveness of the proposed model is improved by the dangerous signatures that would have been discarded in the traditional NSA.
A danger feature-based negative selection algorithm (DFNSA) is presented in Chapter 8. It divides the danger feature space into four parts and preserves the information of danger features to the utmost extent in order to measure the danger of a sample efficiently. Comprehensive experimental results suggest that the DFNSA is able to preserve as much information of danger features as possible, and that the DFNSA malware detection model is effective at detecting unseen malware by measuring the danger of a sample precisely.
In Chapter 9, immune concentration is used to detect malware. The local concentration-based malware detection method concatenates a certain number of two-element local concentration vectors into a feature vector. To achieve better detection performance, particle swarm optimization (PSO) is used to optimize the parameters of the local concentration. Then the hybrid concentration-based feature extraction (HCFE) approach is presented, which extracts the hybrid concentration (HC) of malware at both global and local resolutions.
In Chapter 10, inspired by the immune cooperation (IC) mechanism in the BIS, an IC mechanism-based learning (ICL) framework is proposed. In this framework, a sample is first expressed as an antigen-specific feature vector and an antigen-nonspecific feature vector, simulating the antigenic determinant and the danger features in the BIS, respectively. The antigen-specific and antigen-nonspecific classifiers score the two vectors and output real-valued Signal 1 and Signal 2, respectively. Combining the two signals, the cooperation classifier classifies the sample and resolves the signal conflict problem at the same time. The ICL framework simulates the BIS from the viewpoint of immune signals and takes full advantage of the cooperation effect of the immune signals, which improves its performance dramatically.
Chapter 11 presents a new statistic named class-wise information gain (CIG). Unlike information gain (IG), which only selects global features for a classification problem, the CIG is able to select the features with the highest information content for a specific class in a problem. On the basis of the CIG, a novel CIG-based malware detection method is proposed to efficiently detect malware loaders and infected executables in the wild.
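For readers unfamiliar with the baseline, the sketch below computes the standard information gain of a binary feature, IG = H(Y) - sum over feature values x of P(x) H(Y | X = x). This is the global statistic that the CIG is contrasted with, not the CIG itself, whose definition is left to Chapter 11. The function names and the toy labels are assumptions made for this example.

```python
import math
from collections import Counter
from typing import Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature: Sequence[bool], labels: Sequence[str]) -> float:
    """Reduction in label entropy obtained by splitting on a binary feature."""
    total = len(labels)
    gain = entropy(labels)
    for value in (True, False):
        subset = [y for x, y in zip(feature, labels) if x == value]
        if subset:
            gain -= (len(subset) / total) * entropy(subset)
    return gain


# Toy data: four "malware" and four "benign" samples.
labels = ["malware"] * 4 + ["benign"] * 4
perfect = [True] * 4 + [False] * 4   # present exactly in the malware samples
useless = [True, False] * 4          # present in half of each class

print(information_gain(perfect, labels), information_gain(useless, labels))
```

A feature that separates the two classes perfectly removes all label uncertainty, while one distributed identically in both classes removes none; IG ranks features by this single global score across all classes.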
Finally, a keyword index completes the book.
Owing to my limited knowledge and ability, a few errors, typos, and inadequacies are bound to occur. Readers' critical comments and valuable suggestions are warmly welcomed at ytan@pku.edu.cn.
Ying Tan
Beijing, China
I would like to thank my colleagues and students who provided strong assistance in research on such an amazing issue. I am grateful to my students who took part in the research work for this book under my guidance at the Computational Intelligence Laboratory at Peking University (CIL@PKU), the Institute of Intelligent Information System (IIIS) at the Electronic Engineering Institute (EEI), and the University of Science and Technology of China (USTC).
Almost the entire content of this book is excerpted from the research works and academic papers published by myself and the PhD and master's students I have supervised, including Dr. Zhenghe GUO, Dr. Yuanchun ZHU, Dr. Wei WANG, Dr. Pengtao ZHANG, Mr. Rui CHAO, and my current graduate student Mr. Weiwei HU. I would like to extend my special thanks to all of them here. Without their hard work and unremitting efforts, it would have been impossible for me to make this book a reality.
The author owes his gratitude to all colleagues and students who have collaborated directly or indirectly with research on this issue and in writing this book.