Details

Foundations of Data Intensive Applications

Large Scale Data Analytics under the Hood
1st edition

By: Supun Kamburugamuve, Saliya Ekanayake

35,99 €

Publisher: Wiley
Format: PDF
Published: 05.08.2021
ISBN/EAN: 9781119713036
Language: English
Number of pages: 416

DRM-protected eBook. To read it you need, for example, Adobe Digital Editions and an Adobe ID.

Description

<p><b>PEEK “UNDER THE HOOD” OF BIG DATA ANALYTICS</b></p> <p>The world of big data analytics grows ever more complex. And while many people can work superficially with specific frameworks, far fewer understand the fundamental principles of large-scale, distributed data processing systems and how they operate. In <i>Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood</i>, renowned big-data experts and computer scientists Drs. Supun Kamburugamuve and Saliya Ekanayake deliver a practical guide to applying the principles of big data to software development for optimal performance.</p> <p>The authors discuss foundational components of large-scale data systems and walk readers through the major software design decisions that define performance, application type, and usability. You'll learn how to recognize performance and distributed-operation problems in your applications, diagnose them, and effectively eliminate them by relying on the bedrock big data principles explained within.</p> <p>Moving beyond individual frameworks and APIs for data processing, this book unlocks the theoretical ideas that operate under the hood of every big data processing system.</p> <p>Ideal for data scientists, data architects, dev-ops engineers, and developers, <i>Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood</i> shows readers how to:</p> <ul><li><b>Identify the foundations of large-scale, distributed data processing systems</b></li> <li><b>Make major software design decisions that optimize performance</b></li> <li><b>Diagnose performance problems and distributed operation issues</b></li> <li><b>Understand state-of-the-art research in big data</b></li> <li><b>Explain and use the major big data frameworks and understand what underpins them</b></li> <li><b>Use big data analytics in the real world to solve practical problems</b></li></ul>
<p>Introduction xxvii</p> <p><b>Chapter 1 Data Intensive Applications 1</b></p> <p>Anatomy of a Data-Intensive Application 1</p> <p>A Histogram Example 2</p> <p>Program 2</p> <p>Process Management 3</p> <p>Communication 4</p> <p>Execution 5</p> <p>Data Structures 6</p> <p>Putting It Together 6</p> <p>Application 6</p> <p>Resource Management 6</p> <p>Messaging 7</p> <p>Data Structures 7</p> <p>Tasks and Execution 8</p> <p>Fault Tolerance 8</p> <p>Remote Execution 8</p> <p>Parallel Applications 9</p> <p>Serial Applications 9</p> <p>Lloyd’s K-Means Algorithm 9</p> <p>Parallelizing Algorithms 11</p> <p>Decomposition 11</p> <p>Task Assignment 12</p> <p>Orchestration 12</p> <p>Mapping 13</p> <p>K-Means Algorithm 13</p> <p>Parallel and Distributed Computing 15</p> <p>Memory Abstractions 16</p> <p>Shared Memory 16</p> <p>Distributed Memory 18</p> <p>Hybrid (Shared + Distributed) Memory 20</p> <p>Partitioned Global Address Space Memory 21</p> <p>Application Classes and Frameworks 22</p> <p>Parallel Interaction Patterns 22</p> <p>Pleasingly Parallel 23</p> <p>Dataflow 23</p> <p>Iterative 23</p> <p>Irregular 23</p> <p>Data Abstractions 24</p> <p>Data-Intensive Frameworks 24</p> <p>Components 24</p> <p>Workflows 25</p> <p>An Example 25</p> <p>What Makes It Difficult? 26</p> <p>Developing Applications 27</p> <p>Concurrency 27</p> <p>Data Partitioning 28</p> <p>Debugging 28</p> <p>Diverse Environments 28</p> <p>Computer Networks 29</p> <p>Synchronization 29</p> <p>Thread Synchronization 29</p> <p>Data Synchronization 30</p> <p>Ordering of Events 31</p> <p>Faults 31</p> <p>Consensus 31</p> <p>Summary 32</p> <p>References 32</p> <p><b>Chapter 2 Data and Storage 35</b></p> <p>Storage Systems 35</p> <p>Storage for Distributed Systems 36</p> <p>Direct-Attached Storage 37</p> <p>Storage Area Network 37</p> <p>Network-Attached Storage 38</p> <p>DAS or SAN or NAS? 
38</p> <p>Storage Abstractions 39</p> <p>Block Storage 39</p> <p>File Systems 40</p> <p>Object Storage 41</p> <p>Data Formats 41</p> <p>XML 42</p> <p>JSON 43</p> <p>CSV 44</p> <p>Apache Parquet 45</p> <p>Apache Avro 47</p> <p>Avro Data Definitions (Schema) 48</p> <p>Code Generation 49</p> <p>Without Code Generation 49</p> <p>Avro File 49</p> <p>Schema Evolution 49</p> <p>Protocol Buffers, Flat Buffers, and Thrift 50</p> <p>Data Replication 51</p> <p>Synchronous and Asynchronous Replication 52</p> <p>Single-Leader and Multileader Replication 52</p> <p>Data Locality 53</p> <p>Disadvantages of Replication 54</p> <p>Data Partitioning 54</p> <p>Vertical Partitioning 55</p> <p>Horizontal Partitioning (Sharding) 55</p> <p>Hybrid Partitioning 56</p> <p>Considerations for Partitioning 57</p> <p>NoSQL Databases 58</p> <p>Data Models 58</p> <p>Key-Value Databases 58</p> <p>Document Databases 59</p> <p>Wide Column Databases 59</p> <p>Graph Databases 59</p> <p>CAP Theorem 60</p> <p>Message Queuing 61</p> <p>Message Processing Guarantees 63</p> <p>Durability of Messages 64</p> <p>Acknowledgments 64</p> <p>Storage First Brokers and Transient Brokers 65</p> <p>Summary 66</p> <p>References 66</p> <p><b>Chapter 3 Computing Resources 69</b></p> <p>A Demonstration 71</p> <p>Computer Clusters 72</p> <p>Anatomy of a Computer Cluster 73</p> <p>Data Analytics in Clusters 74</p> <p>Dedicated Clusters 76</p> <p>Classic Parallel Systems 76</p> <p>Big Data Systems 77</p> <p>Shared Clusters 79</p> <p>OpenMPI on a Slurm Cluster 79</p> <p>Spark on a Yarn Cluster 80</p> <p>Distributed Application Life Cycle 80</p> <p>Life Cycle Steps 80</p> <p>Step 1: Preparation of the Job Package 81</p> <p>Step 2: Resource Acquisition 81</p> <p>Step 3: Distributing the Application (Job) Artifacts 81</p> <p>Step 4: Bootstrapping the Distributed Environment 82</p> <p>Step 5: Monitoring 82</p> <p>Step 6: Termination 83</p> <p>Computing Resources 83</p> <p>Data Centers 83</p> <p>Physical Machines 85</p> <p>Network 
85</p> <p>Virtual Machines 87</p> <p>Containers 87</p> <p>Processor, Random Access Memory, and Cache 88</p> <p>Cache 89</p> <p>Multiple Processors in a Computer 90</p> <p>Nonuniform Memory Access 90</p> <p>Uniform Memory Access 91</p> <p>Hard Disk 92</p> <p>GPUs 92</p> <p>Mapping Resources to Applications 92</p> <p>Cluster Resource Managers 93</p> <p>Kubernetes 94</p> <p>Kubernetes Architecture 94</p> <p>Kubernetes Application Concepts 96</p> <p>Data-Intensive Applications on Kubernetes 96</p> <p>Slurm 98</p> <p>Yarn 99</p> <p>Job Scheduling 99</p> <p>Scheduling Policy 101</p> <p>Objective Functions 101</p> <p>Throughput and Latency 101</p> <p>Priorities 102</p> <p>Lowering Distance Among the Processes 102</p> <p>Data Locality 102</p> <p>Completion Deadline 102</p> <p>Algorithms 103</p> <p>First in First Out 103</p> <p>Gang Scheduling 103</p> <p>List Scheduling 103</p> <p>Backfill Scheduling 104</p> <p>Summary 104</p> <p>References 104</p> <p><b>Chapter 4 Data Structures 107</b></p> <p>Virtual Memory 108</p> <p>Paging and TLB 109</p> <p>Cache 111</p> <p>The Need for Data Structures 112</p> <p>Cache and Memory Layout 112</p> <p>Memory Fragmentation 114</p> <p>Data Transfer 115</p> <p>Data Transfer Between Frameworks 115</p> <p>Cross-Language Data Transfer 115</p> <p>Object and Text Data 116</p> <p>Serialization 116</p> <p>Vectors and Matrices 117</p> <p>1D Vectors 118</p> <p>Matrices 118</p> <p>Row-Major and Column-Major Formats 119</p> <p><i>N</i>-Dimensional Arrays/Tensors 122</p> <p>NumPy 123</p> <p>Memory Representation 125</p> <p>K-means with NumPy 126</p> <p>Sparse Matrices 127</p> <p>Table 128</p> <p>Table Formats 129</p> <p>Column Data Format 129</p> <p>Row Data Format 130</p> <p>Apache Arrow 130</p> <p>Arrow Data Format 131</p> <p>Primitive Types 131</p> <p>Variable-Length Data 132</p> <p>Arrow Serialization 133</p> <p>Arrow Example 133</p> <p>Pandas DataFrame 134</p> <p>Column vs. 
Row Tables 136</p> <p>Summary 136</p> <p>References 136</p> <p><b>Chapter 5 Programming Models 139</b></p> <p>Introduction 139</p> <p>Parallel Programming Models 140</p> <p>Parallel Process Interaction 140</p> <p>Problem Decomposition 140</p> <p>Data Structures 140</p> <p>Data Structures and Operations 141</p> <p>Data Types 141</p> <p>Local Operations 143</p> <p>Distributed Operations 143</p> <p>Array 144</p> <p>Tensor 145</p> <p>Indexing 145</p> <p>Slicing 146</p> <p>Broadcasting 146</p> <p>Table 146</p> <p>Graph Data 148</p> <p>Message Passing Model 150</p> <p>Model 151</p> <p>Message Passing Frameworks 151</p> <p>Message Passing Interface 151</p> <p>Bulk Synchronous Parallel 153</p> <p>K-Means 154</p> <p>Distributed Data Model 157</p> <p>Eager Model 157</p> <p>Dataflow Model 158</p> <p>Data Frames, Datasets, and Tables 159</p> <p>Input and Output 160</p> <p>Task Graphs (Dataflow Graphs) 160</p> <p>Model 161</p> <p>User Program to Task Graph 161</p> <p>Tasks and Functions 162</p> <p>Source Task 162</p> <p>Compute Task 163</p> <p>Implicit vs. 
Explicit Parallel Models 163</p> <p>Remote Execution 163</p> <p>Components 164</p> <p>Batch Dataflow 165</p> <p>Data Abstractions 165</p> <p>Table Abstraction 165</p> <p>Matrix/Tensors 165</p> <p>Functions 166</p> <p>Source 166</p> <p>Compute 167</p> <p>Sink 168</p> <p>An Example 168</p> <p>Caching State 169</p> <p>Evaluation Strategy 170</p> <p>Lazy Evaluation 171</p> <p>Eager Evaluation 171</p> <p>Iterative Computations 172</p> <p>DOALL Parallel 172</p> <p>DOACROSS Parallel 172</p> <p>Pipeline Parallel 173</p> <p>Task Graph Models for Iterative Computations 173</p> <p>K-Means Algorithm 174</p> <p>Streaming Dataflow 176</p> <p>Data Abstractions 177</p> <p>Streams 177</p> <p>Distributed Operations 178</p> <p>Streaming Functions 178</p> <p>Sources 178</p> <p>Compute 179</p> <p>Sink 179</p> <p>An Example 179</p> <p>Windowing 180</p> <p>Windowing Strategies 181</p> <p>Operations on Windows 182</p> <p>Handling Late Events 182</p> <p>SQL 182</p> <p>Queries 183</p> <p>Summary 184</p> <p>References 184</p> <p><b>Chapter 6 Messaging 187</b></p> <p>Network Services 188</p> <p>TCP/IP 188</p> <p>RDMA 189</p> <p>Messaging for Data Analytics 189</p> <p>Anatomy of a Message 190</p> <p>Data Packing 190</p> <p>Protocol 191</p> <p>Message Types 192</p> <p>Control Messages 192</p> <p>External Data Sources 192</p> <p>Data Transfer Messages 192</p> <p>Distributed Operations 194</p> <p>How Are They Used? 
194</p> <p>Task Graph 194</p> <p>Parallel Processes 195</p> <p>Anatomy of a Distributed Operation 198</p> <p>Data Abstractions 198</p> <p>Distributed Operation API 198</p> <p>Streaming and Batch Operations 199</p> <p>Streaming Operations 199</p> <p>Batch Operations 199</p> <p>Distributed Operations on Arrays 200</p> <p>Broadcast 200</p> <p>Reduce and AllReduce 201</p> <p>Gather and AllGather 202</p> <p>Scatter 203</p> <p>AllToAll 204</p> <p>Optimized Operations 204</p> <p>Broadcast 205</p> <p>Reduce 206</p> <p>AllReduce 206</p> <p>Gather and AllGather Collective Algorithms 208</p> <p>Scatter and AllToAll Collective Algorithms 208</p> <p>Distributed Operations on Tables 209</p> <p>Shuffle 209</p> <p>Partitioning Data 211</p> <p>Handling Large Data 212</p> <p>Fetch-Based Algorithm (Asynchronous Algorithm) 213</p> <p>Distributed Synchronization Algorithm 214</p> <p>GroupBy 214</p> <p>Aggregate 215</p> <p>Join 216</p> <p>Join Algorithms 219</p> <p>Distributed Joins 221</p> <p>Performance of Joins 223</p> <p>More Operations 223</p> <p>Advanced Topics 224</p> <p>Data Packing 224</p> <p>Memory Considerations 224</p> <p>Message Coalescing 224</p> <p>Compression 225</p> <p>Stragglers 225</p> <p>Nonblocking vs. 
Blocking Operations 225</p> <p>Blocking Operations 226</p> <p>Nonblocking Operations 226</p> <p>Summary 227</p> <p>References 227</p> <p><b>Chapter 7 Parallel Tasks 229</b></p> <p>CPUs 229</p> <p>Cache 229</p> <p>False Sharing 230</p> <p>Vectorization 231</p> <p>Threads and Processes 234</p> <p>Concurrency and Parallelism 234</p> <p>Context Switches and Scheduling 234</p> <p>Mutual Exclusion 235</p> <p>User-Level Threads 236</p> <p>Process Affinity 236</p> <p>NUMA-Aware Programming 237</p> <p>Accelerators 237</p> <p>Task Execution 238</p> <p>Scheduling 240</p> <p>Static Scheduling 240</p> <p>Dynamic Scheduling 240</p> <p>Loosely Synchronous and Asynchronous Execution 241</p> <p>Loosely Synchronous Parallel System 242</p> <p>Asynchronous Parallel System (Fully Distributed) 243</p> <p>Actor Model 244</p> <p>Actor 244</p> <p>Asynchronous Messages 244</p> <p>Actor Frameworks 245</p> <p>Execution Models 245</p> <p>Process Model 246</p> <p>Thread Model 246</p> <p>Remote Execution 246</p> <p>Tasks for Data Analytics 248</p> <p>SPMD and MPMD Execution 248</p> <p>Batch Tasks 249</p> <p>Data Partitions 249</p> <p>Operations 251</p> <p>Task Graph Scheduling 253</p> <p>Threads, CPU Cores, and Partitions 254</p> <p>Data Locality 255</p> <p>Execution 257</p> <p>Streaming Execution 257</p> <p>State 257</p> <p>Immutable Data 258</p> <p>State in Driver 258</p> <p>Distributed State 259</p> <p>Streaming Tasks 259</p> <p>Streams and Data Partitioning 260</p> <p>Partitions 260</p> <p>Operations 261</p> <p>Scheduling 262</p> <p>Uniform Resources 263</p> <p>Resource-Aware Scheduling 264</p> <p>Execution 264</p> <p>Dynamic Scaling 264</p> <p>Back Pressure (Flow Control) 265</p> <p>Rate-Based Flow Control 266</p> <p>Credit-Based Flow Control 266</p> <p>State 267</p> <p>Summary 268</p> <p>References 268</p> <p><b>Chapter 8 Case Studies 271</b></p> <p>Apache Hadoop 271</p> <p>Programming Model 272</p> <p>Architecture 274</p> <p>Cluster Resource Management 275</p> <p>Apache Spark 275</p> 
<p>Programming Model 275</p> <p>RDD API 276</p> <p>SQL, DataFrames, and DataSets 277</p> <p>Architecture 278</p> <p>Resource Managers 278</p> <p>Task Schedulers 279</p> <p>Executors 279</p> <p>Communication Operations 280</p> <p>Apache Spark Streaming 280</p> <p>Apache Storm 282</p> <p>Programming Model 282</p> <p>Architecture 284</p> <p>Cluster Resource Managers 285</p> <p>Communication Operations 286</p> <p>Kafka Streams 286</p> <p>Programming Model 286</p> <p>Architecture 287</p> <p>PyTorch 288</p> <p>Programming Model 288</p> <p>Execution 292</p> <p>Cylon 295</p> <p>Programming Model 296</p> <p>Architecture 296</p> <p>Execution 297</p> <p>Communication Operations 298</p> <p>Rapids cuDF 298</p> <p>Programming Model 298</p> <p>Architecture 299</p> <p>Summary 300</p> <p>References 300</p> <p><b>Chapter 9 Fault Tolerance 303</b></p> <p>Dependable Systems and Failures 303</p> <p>Fault Tolerance is Not Free 304</p> <p>Dependable Systems 305</p> <p>Failures 306</p> <p>Process Failures 306</p> <p>Network Failures 307</p> <p>Node Failures 307</p> <p>Byzantine Faults 307</p> <p>Failure Models 308</p> <p>Failure Detection 308</p> <p>Recovering from Faults 309</p> <p>Recovery Methods 310</p> <p>Stateless Programs 310</p> <p>Batch Systems 311</p> <p>Streaming Systems 311</p> <p>Processing Guarantees 311</p> <p>Role of Cluster Resource Managers 312</p> <p>Checkpointing 313</p> <p>State 313</p> <p>Consistent Global State 313</p> <p>Uncoordinated Checkpointing 314</p> <p>Coordinated Checkpointing 315</p> <p>Chandy-Lamport Algorithm 315</p> <p>Batch Systems 316</p> <p>When to Checkpoint? 
317</p> <p>Snapshot Data 318</p> <p>Streaming Systems 319</p> <p>Case Study: Apache Storm 319</p> <p>Message Tracking 320</p> <p>Failure Recovery 321</p> <p>Case Study: Apache Flink 321</p> <p>Checkpointing 322</p> <p>Failure Recovery 324</p> <p>Batch Systems 324</p> <p>Iterative Programs 324</p> <p>Case Study: Apache Spark 325</p> <p>RDD Recomputing 326</p> <p>Checkpointing 326</p> <p>Recovery from Failures 327</p> <p>Summary 327</p> <p>References 327</p> <p><b>Chapter 10 Performance and Productivity 329</b></p> <p>Performance Metrics 329</p> <p>System Performance Metrics 330</p> <p>Parallel Performance Metrics 330</p> <p>Speedup 330</p> <p>Strong Scaling 331</p> <p>Weak Scaling 332</p> <p>Parallel Efficiency 332</p> <p>Amdahl’s Law 333</p> <p>Gustafson’s Law 334</p> <p>Throughput 334</p> <p>Latency 335</p> <p>Benchmarks 336</p> <p>LINPACK Benchmark 336</p> <p>NAS Parallel Benchmark 336</p> <p>BigDataBench 336</p> <p>TPC Benchmarks 337</p> <p>HiBench 337</p> <p>Performance Factors 337</p> <p>Memory 337</p> <p>Execution 338</p> <p>Distributed Operators 338</p> <p>Disk I/O 339</p> <p>Garbage Collection 339</p> <p>Finding Issues 342</p> <p>Serial Programs 342</p> <p>Profiling 342</p> <p>Scaling 343</p> <p>Strong Scaling 343</p> <p>Weak Scaling 344</p> <p>Debugging Distributed Applications 344</p> <p>Programming Languages 345</p> <p>C/C++ 346</p> <p>Java 346</p> <p>Memory Management 347</p> <p>Data Structures 348</p> <p>Interfacing with Python 348</p> <p>Python 350</p> <p>C/C++ Code integration 350</p> <p>Productivity 351</p> <p>Choice of Frameworks 351</p> <p>Operating Environment 353</p> <p>CPUs and GPUs 353</p> <p>Public Clouds 355</p> <p>Future of Data-Intensive Applications 358</p> <p>Summary 358</p> <p>References 359</p> <p>Index 361</p>
<p><b>SUPUN KAMBURUGAMUVE, PhD,</b> is a computer scientist researching and designing large-scale data analytics tools. He received his doctorate in Computer Science from Indiana University, Bloomington, and architected the data processing systems Twister2 and Cylon.</p> <p><b>SALIYA EKANAYAKE, PhD,</b> is a Senior Software Engineer at Microsoft working at the intersection of scaling deep learning systems and parallel computing. He is also a research affiliate at Berkeley Lab. He received his doctorate in Computer Science from Indiana University, Bloomington.</p>

You might also be interested in these products:

MDX Solutions
By: George Spofford, Sivakumar Harinath, Christopher Webb, Dylan Hai Huang, Francesco Civardi
PDF ebook
53,99 €
Concept Data Analysis
By: Claudio Carpineto, Giovanni Romano
PDF ebook
107,99 €
Handbook of Virtual Humans
By: Nadia Magnenat-Thalmann, Daniel Thalmann
PDF ebook
150,99 €