Details

Designing Big Data Platforms

How to Use, Deploy, and Maintain Big Data Systems
1. Aufl.

von: Yusuf Aytas
114,99 €
Verlag:	Wiley
Format:	PDF
Veröffentl.:	29.06.2021
ISBN/EAN:	9781119690948
Sprache:	englisch
Anzahl Seiten:	336

In den Warenkorb

Als Gutschein

DRM-geschütztes eBook, Sie benötigen z.B. Adobe Digital Editions und eine Adobe ID zum Lesen.

Beschreibungen

Titelbeschreibung

DESIGNING BIG DATA PLATFORMS Provides expert guidance and valuable insights on getting the most out of Big Data systemsAn array of tools are currently available for managing and processing data—some are ready-to-go solutions that can be immediately deployed, while others require complex and time-intensive setups. With such a vast range of options, choosing the right tool to build a solution can be complicated, as can determining which tools work well with each other. Designing Big Data Platforms provides clear and authoritative guidance on the critical decisions necessary for successfully deploying, operating, and maintaining Big Data systems.This highly practical guide helps readers understand how to process large amounts of data with well-known Linux tools and database solutions, use effective techniques to collect and manage data from multiple sources, transform data into meaningful business insights, and much more. Author Yusuf Aytas, a software engineer with a vast amount of big data experience, discusses the design of the ideal Big Data platform: one that meets the needs of data analysts, data engineers, data scientists, software engineers, and a spectrum of other stakeholders across an organization. Detailed yet accessible chapters cover key topics such as stream data processing, data analytics, data science, data discovery, and data security. This real-world manual for Big Data technologies:<ul><li>Provides up-to-date coverage of the tools currently used in Big Data processing and management</li><li>Offers step-by-step guidance on building a data pipeline, from basic scripting to distributed systems</li><li>Highlights and explains how data is processed at scale</li><li>Includes an introduction to the foundation of a modern data platform</li></ul>Designing Big Data Platforms: How to Use, Deploy, and Maintain Big Data Systems is a must-have for all professionals working with Big Data, as well researchers and students in computer science and related fields.

Inhaltsverzeichnis

List of Contributors xvii Preface xix Acknowledgments xxi Acronyms xxiii Introduction xxv 1 An Introduction: What’s a Modern Big Data Platform 1 1.1 Defining Modern Big Data Platform 1 1.2 Fundamentals of a Modern Big Data Platform 2 1.2.1 Expectations from Data 2 1.2.1.1 Ease of Access 2 1.2.1.2 Security 2 1.2.1.3 Quality 3 1.2.1.4 Extensibility 3 1.2.2 Expectations from Platform 3 1.2.2.1 Storage Layer 4 1.2.2.2 Resource Management 4 1.2.2.3 ETL 5 1.2.2.4 Discovery 6 1.2.2.5 Reporting 7 1.2.2.6 Monitoring 7 1.2.2.7 Testing 8 1.2.2.8 Lifecycle Management 9 2 A Bird’s Eye View on Big Data 11 2.1 A Bit of History 11 2.1.1 Early Uses of Big Data Term 11 2.1.2 A New Era 12 2.1.2.1 Word Count Problem 12 2.1.2.2 Execution Steps 13 2.1.3 An Open-Source Alternative 15 2.1.3.1 Hadoop Distributed File System 15 2.1.3.2 HadoopMapReduce 17 2.2 What Makes Big Data 20 2.2.1 Volume 20 2.2.2 Velocity 21 2.2.3 Variety 21 2.2.4 Complexity 21 2.3 Components of Big Data Architecture 22 2.3.1 Ingestion 22 2.3.2 Storage 23 2.3.3 Computation 23 2.3.4 Presentation 24 2.4 Making Use of Big Data 24 2.4.1 Querying 24 2.4.2 Reporting 25 2.4.3 Alerting 25 2.4.4 Searching 25 2.4.5 Exploring 25 2.4.6 Mining 25 2.4.7 Modeling 26 3 A Minimal Data Processing and Management System 27 3.1 Problem Definition 27 3.1.1 Online Book Store 27 3.1.2 User Flow Optimization 28 3.2 Processing Large Data with Linux Commands 28 3.2.1 Understand the Data 28 3.2.2 Sample the Data 28 3.2.3 Building the Shell Command 29 3.2.4 Executing the Shell Command 30 3.2.5 Analyzing the Results 31 3.2.6 Reporting the Findings 32 3.2.7 Automating the Process 33 3.2.8 A Brief Review 33 3.3 Processing Large Data with PostgreSQL 34 3.3.1 Data Modeling 34 3.3.2 Copying Data 35 3.3.3 Sharding in PostgreSQL 37 3.3.3.1 Setting up Foreign Data Wrapper 37 3.3.3.2 Sharding Data over Multiple Nodes 38 3.4 Cost of Big Data 39 4 Big Data Storage 41 4.1 Big Data Storage Patterns 41 4.1.1 Data Lakes 41 4.1.2 Data Warehouses 42 4.1.3 Data Marts 43 4.1.4 Comparison of Storage Patterns 43 4.2 On-Premise Storage Solutions 44 4.2.1 Choosing Hardware 44 4.2.1.1 DataNodes 44 4.2.1.2 NameNodes 45 4.2.1.3 Resource Managers 45 4.2.1.4 Network Equipment 45 4.2.2 Capacity Planning 46 4.2.2.1 Overall Cluster 46 4.2.2.2 Resource Sharing 47 4.2.2.3 Doing the Math 47 4.2.3 Deploying Hadoop Cluster 48 4.2.3.1 Networking 48 4.2.3.2 Operating System 48 4.2.3.3 Management Tools 49 4.2.3.4 Hadoop Ecosystem 49 4.2.3.5 A Humble Deployment 49 4.3 Cloud Storage Solutions 53 4.3.1 Object Storage 54 4.3.2 Data Warehouses 55 4.3.2.1 Columnar Storage 55 4.3.2.2 Provisioned Data Warehouses 56 4.3.2.3 Serverless Data Warehouses 56 4.3.2.4 Virtual Data Warehouses 57 4.3.3 Archiving 58 4.4 Hybrid Storage Solutions 59 4.4.1 Making Use of Object Store 59 4.4.1.1 Additional Capacity 59 4.4.1.2 Batch Processing 59 4.4.1.3 Hot Backup 60 4.4.2 Making Use of Data Warehouse 60 4.4.2.1 Primary Data Warehouse 60 4.4.2.2 Shared Data Mart 61 4.4.3 Making Use of Archiving 61 5 Offline Big Data Processing 63 5.1 Defining Offline Data Processing 63 5.2 MapReduce Technologies 65 5.2.1 Apache Pig 65 5.2.1.1 Pig Latin Overview 66 5.2.1.2 Compilation To MapReduce 66 5.2.2 Apache Hive 67 5.2.2.1 Hive Database 68 5.2.2.2 Hive Architecture 69 5.3 Apache Spark 70 5.3.1 What’s Spark 71 5.3.2 Spark Constructs and Components 71 5.3.2.1 Resilient Distributed Datasets 71 5.3.2.2 Distributed Shared Variables 73 5.3.2.3 Datasets and DataFrames 74 5.3.2.4 Spark Libraries and Connectors 75 5.3.3 Execution Plan 76 5.3.3.1 The Logical Plan 77 5.3.3.2 The Physical Plan 77 5.3.4 Spark Architecture 77 5.3.4.1 Inside of Spark Application 78 5.3.4.2 Outside of Spark Application 79 5.4 Apache Flink 81 5.5 Presto 83 5.5.1 Presto Architecture 83 5.5.2 Presto System Design 84 5.5.2.1 Execution Plan 84 5.5.2.2 Scheduling 86 5.5.2.3 Resource Management 86 5.5.2.4 Fault Tolerance 87 6 Stream Big Data Processing 89 6.1 The Need for Stream Processing 89 6.2 Defining Stream Data Processing 90 6.3 Streams via Message Brokers 92 6.3.1 Apache Kafka 92 6.3.1.1 Apache Samza 93 6.3.1.2 Kafka Streams 98 6.3.2 Apache Pulsar 100 6.3.2.1 Pulsar Functions 102 6.3.3 AMQP Based Brokers 105 6.4 Streams via Stream Engines 106 6.4.1 Apache Flink 106 6.4.1.1 Flink Architecture 107 6.4.1.2 System Design 109 6.4.2 Apache Storm 111 6.4.2.1 Storm Architecture 114 6.4.2.2 System Design 115 6.4.3 Apache Heron 116 6.4.3.1 Storm Limitations 116 6.4.3.2 Heron Architecture 117 6.4.4 Spark Streaming 118 6.4.4.1 Discretized Streams 119 6.4.4.2 Fault-tolerance 120 7 Data Analytics 121 7.1 Log Collection 121 7.1.1 Apache Flume 122 7.1.2 Fluentd 122 7.1.2.1 Data Pipeline 123 7.1.2.2 Fluent Bit 124 7.1.2.3 Fluentd Deployment 124 7.2 Transferring Big Data Sets 125 7.2.1 Reloading 126 7.2.2 Partition Loading 126 7.2.3 Streaming 127 7.2.4 Timestamping 127 7.2.5 Tools 128 7.2.5.1 Sqoop 128 7.2.5.2 Embulk 128 7.2.5.3 Spark 129 7.2.5.4 Apache Gobblin 130 7.3 Aggregating Big Data Sets 132 7.3.1 Data Cleansing 132 7.3.2 Data Transformation 134 7.3.2.1 Transformation Functions 134 7.3.2.2 Transformation Stages 135 7.3.3 Data Retention 135 7.3.4 Data Reconciliation 136 7.4 Data Pipeline Scheduler 136 7.4.1 Jenkins 137 7.4.2 Azkaban 138 7.4.2.1 Projects 139 7.4.2.2 Execution Modes 139 7.4.3 Airflow 139 7.4.3.1 Task Execution 140 7.4.3.2 Scheduling 141 7.4.3.3 Executor 141 7.4.3.4 Security and Monitoring 142 7.4.4 Cloud 143 7.5 Patterns and Practices 143 7.5.1 Patterns 143 7.5.1.1 Data Centralization 143 7.5.1.2 Singe Source of Truth 144 7.5.1.3 Domain Driven Data Sets 145 7.5.2 Anti-Patterns 146 7.5.2.1 Data Monolith 146 7.5.2.2 Data Swamp 147 7.5.2.3 Technology Pollution 147 7.5.3 Best Practices 148 7.5.3.1 Business-Driven Approach 148 7.5.3.2 Cost of Maintenance 148 7.5.3.3 Avoiding Modeling Mistakes 149 7.5.3.4 Choosing Right Tool for The Job 150 7.5.4 Detecting Anomalies 150 7.5.4.1 Manual Anomaly Detection 151 7.5.4.2 Automated Anomaly Detection 151 7.6 Exploring Data Visually 152 7.6.1 Metabase 152 7.6.2 Apache Superset 153 8 Data Science 155 8.1 Data Science Applications 155 8.1.1 Recommendation 156 8.1.2 Predictive Analytics 156 8.1.3 Pattern Discovery 157 8.2 Data Science Life Cycle 158 8.2.1 Business Objective 158 8.2.2 Data Understanding 159 8.2.3 Data Ingestion 159 8.2.4 Data Preparation 160 8.2.5 Data Exploration 160 8.2.6 Feature Engineering 161 8.2.7 Modeling 161 8.2.8 Model Evaluation 162 8.2.9 Model Deployment 163 8.2.10 Operationalizing 163 8.3 Data Science Toolbox 164 8.3.1 R 164 8.3.2 Python 165 8.3.3 SQL 167 8.3.4 TensorFlow 167 8.3.4.1 Execution Model 169 8.3.4.2 Architecture 170 8.3.5 Spark MLlib 171 8.4 Productionalizing Data Science 173 8.4.1 Apache PredictionIO 173 8.4.1.1 Architecture Overview 174 8.4.1.2 Machine Learning Templates 174 8.4.2 Seldon 175 8.4.3 MLflow 175 8.4.3.1 MLflow Tracking 175 8.4.3.2 MLflow Projects 176 8.4.3.3 MLflow Models 176 8.4.3.4 MLflow Model Registry 177 8.4.4 Kubeflow 177 8.4.4.1 Kubeflow Pipelines 177 8.4.4.2 Kubeflow Metadata 178 8.4.4.3 Kubeflow Katib 178 9 Data Discovery 179 9.1 Need for Data Discovery 179 9.1.1 Single Source of Metadata 180 9.1.2 Searching 181 9.1.3 Data Lineage 182 9.1.4 Data Ownership 182 9.1.5 Data Metrics 183 9.1.6 Data Grouping 183 9.1.7 Data Clustering 184 9.1.8 Data Classification 184 9.1.9 Data Glossary 185 9.1.10 Data Update Notification 185 9.1.11 Data Presentation 186 9.2 Data Governance 186 9.2.1 Data Governance Overview 186 9.2.1.1 Data Quality 186 9.2.1.2 Metadata 187 9.2.1.3 Data Access 187 9.2.1.4 Data Life Cycle 188 9.2.2 Big Data Governance 188 9.2.2.1 Data Architecture 188 9.2.2.2 Data Source Integration 189 9.3 Data Discovery Tools 189 9.3.1 Metacat 189 9.3.1.1 Data Abstraction and Interoperability 190 9.3.1.2 Metadata Enrichment 190 9.3.1.3 Searching and Indexing 190 9.3.1.4 Update Notifications 191 9.3.2 Amundsen 191 9.3.2.1 Discovery Capabilities 191 9.3.2.2 Integration Points 192 9.3.2.3 Architecture Overview 192 9.3.3 Apache Atlas 193 9.3.3.1 Searching 194 9.3.3.2 Glossary 194 9.3.3.3 Type System 194 9.3.3.4 Lineage 195 9.3.3.5 Notifications 196 9.3.3.6 Bridges and Hooks 196 9.3.3.7 Architecture Overview 196 10 Data Security 199 10.1 Infrastructure Security 199 10.1.1 Computing 200 10.1.1.1 Auditing 200 10.1.1.2 Operating System 200 10.1.1.3 Network 200 10.1.2 Identity and Access Management 201 10.1.2.1 Authentication 201 10.1.2.2 Authorization 201 10.1.3 Data Transfers 202 10.2 Data Privacy 202 10.2.1 Data Encryption 202 10.2.1.1 File System Layer Encryption 203 10.2.1.2 Database Layer Encryption 203 10.2.1.3 Transport Layer Encryption 203 10.2.1.4 Application Layer Encryption 203 10.2.2 Data Anonymization 204 10.2.2.1 k-Anonymity 204 10.2.2.2 l-Diversity 204 10.2.3 Data Perturbation 204 10.3 Law Enforcement 205 10.3.1 PII 205 10.3.1.1 Identifying PII Tables/Columns 205 10.3.1.2 Segregating Tables Containing PII 205 10.3.1.3 Protecting PII Tables via Access Control 206 10.3.1.4 Masking and Anonymizing PII Data 206 10.3.2 Privacy Regulations/Acts 207 10.3.2.1 Collecting Data 207 10.3.2.2 Erasing Data 207 10.4 Data Security Tools 208 10.4.1 Apache Ranger 208 10.4.1.1 Ranger Policies 208 10.4.1.2 Managed Components 209 10.4.1.3 Architecture Overview 211 10.4.2 Apache Sentry 212 10.4.2.1 Managed Components 212 10.4.2.2 Architecture Overview 213 10.4.3 Apache Knox 214 10.4.3.1 Authentication and Authorization 215 10.4.3.2 Supported Hadoop Services 215 10.4.3.3 Client Services 216 10.4.3.4 Architecture Overview 217 10.4.3.5 Audit 218 11 Putting All Together 219 11.1 Platforms 219 11.1.1 In-house Solutions 220 11.1.1.1 Cloud Provisioning 220 11.1.1.2 On-premise Provisioning 221 11.1.2 Cloud Providers 221 11.1.2.1 Vendor Lock-in 222 11.1.2.2 Outages 223 11.1.3 Hybrid Solutions 223 11.1.3.1 Kubernetes 224 11.2 Big Data Systems and Tools 224 11.2.1 Storage 224 11.2.1.1 File-Based Storage 225 11.2.1.2 NoSQL 225 11.2.2 Processing 226 11.2.2.1 Batch Processing 226 11.2.2.2 Stream Processing 227 11.2.2.3 Combining Batch and Streaming 227 11.2.3 Model Training 228 11.2.4 A Holistic View 228 11.3 Challenges 229 11.3.1 Growth 229 11.3.2 SLA 230 11.3.3 Versioning 231 11.3.4 Maintenance 232 11.3.5 Deprecation 233 11.3.6 Monitoring 233 11.3.7 Trends 234 11.3.8 Security 235 11.3.9 Testing 235 11.3.10 Organization 236 11.3.11 Talent 237 11.3.12 Budget 238 12 An Ideal Platform 239 12.1 Event Sourcing 240 12.1.1 How It Works 240 12.1.2 Messaging Middleware 241 12.1.3 Why Use Event Sourcing 242 12.2 Kappa Architecture 242 12.2.1 How It Works 243 12.2.2 Limitations 244 12.3 Data Mesh 245 12.3.1 Domain-Driven Design 245 12.3.2 Self-Serving Infrastructure 247 12.3.3 Data as Product Approach 247 12.4 Data Reservoirs 248 12.4.1 Data Organization 248 12.4.2 Data Standards 249 12.4.3 Data Policies 249 12.4.4 Multiple Data Reservoirs 250 12.5 Data Catalog 250 12.5.1 Data Feedback Loop 251 12.5.2 Data Synthesis 252 12.6 Self-service Platform 252 12.6.1 Data Publishing 253 12.6.2 Data Processing 254 12.6.3 Data Monitoring 254 12.7 Abstraction 254 12.7.1 Abstractions via User Interface 255 12.7.2 Abstractions via Wrappers 256 12.8 Data Guild 256 12.9 Trade-offs 257 12.9.1 Quality vs Efficiency 258 12.9.2 Real time vs Offline 258 12.9.3 Performance vs Cost 258 12.9.4 Consistency vs Availability 259 12.9.5 Disruption vs Reliability 259 12.9.6 Build vs Buy 259 12.10 Data Ethics 260 Appendix A Further Systems and Patterns 261 A.1 Lambda Architecture 261 A.1.1 Batch Layer 262 A.1.2 Speed Layer 262 A.1.3 Serving Layer 262 A.2 Apache Cassandra 263 A.2.1 Cassandra Data Modeling 263 A.2.2 Cassandra Architecture 265 A.2.2.1 Cassandra Components 265 A.2.2.2 Storage Engine 267 A.3 Apache Beam 267 A.3.1 Programming Overview 268 A.3.2 Execution Model 270 Appendix B Recipes 271 B.1 Activity Tracking Recipe 271 B.1.1 Problem Statement 271 B.1.2 Design Approach 271 B.1.2.1 Data Ingestion 271 B.1.2.2 Computation 272 B.2 Data Quality Assurance 273 B.2.1 Problem Statement 273 B.2.2 Design Approach 273 B.2.2.1 Ingredients 273 B.2.2.2 Preparation 275 B.2.2.3 The Menu 277 B.3 Estimating Time to Delivery 277 B.3.1 Problem Definition 278 B.3.2 Design Approach 278 B.3.2.1 Streaming Reference Architecture 278 B.4 Incident Response Recipe 283 B.4.1 Problem Definition 283 B.4.2 Design Approach 283 B.4.2.1 Minimizing Disruptions 284 B.4.2.2 Categorizing Disruptions 284 B.4.2.3 Handling Disruptions 284 B.5 Leveraging Spark SQL Metrics 286 B.5.1 Problem Statement 286 B.5.2 Design Approach 286 B.5.2.1 Spark SQL Metrics Pipeline 286 B.5.2.2 Integrating SQL Metrics 288 B.5.2.3 The Menu 289 B.6 Airbnb Price Prediction 289 B.6.1 Problem Statement 290 B.6.2 Design Approach 290 B.6.2.1 Tech Stack 290 B.6.2.2 Data Understanding and Ingestion 291 B.6.2.3 Data Preparation 291 B.6.2.4 Modeling and Evaluation 293 B.6.2.5 Monitor and Iterate 294 Bibliography 295 Index 301

Autorenportrait

Yusuf Aytas is an experienced software engineer. He holds a Master and Bachelor of Computer Science from Bilkent University, Ankara, Turkey. Yusuf has experience in building highly scalable, optimized analytics platforms. He is proficient in agile methodologies, continuous delivery, and software development best practices.

Back cover copy

Provides expert guidance and valuable insights on getting the most out of Big Data systemsAn array of tools are currently available for managing and processing data—some are ready-to-go solutions that can be immediately deployed, while others require complex and time-intensive setups. With such a vast range of options, choosing the right tool to build a solution can be complicated, as can determining which tools work well with each other. Designing Big Data Platforms provides clear and authoritative guidance on the critical decisions necessary for successfully deploying, operating, and maintaining Big Data systems.This highly practical guide helps readers understand how to process large amounts of data with well-known Linux tools and database solutions, use effective techniques to collect and manage data from multiple sources, transform data into meaningful business insights, and much more. Author Yusuf Aytas, a software engineer with a vast amount of big data experience, discusses the design of the ideal Big Data platform: one that meets the needs of data analysts, data engineers, data scientists, software engineers, and a spectrum of other stakeholders across an organization. Detailed yet accessible chapters cover key topics such as stream data processing, data analytics, data science, data discovery, and data security. This real-world manual for Big Data technologies:<ul><li>Provides up-to-date coverage of the tools currently used in Big Data processing and management</li><li>Offers step-by-step guidance on building a data pipeline, from basic scripting to distributed systems</li><li>Highlights and explains how data is processed at scale</li><li>Includes an introduction to the foundation of a modern data platform</li></ul>Designing Big Data Platforms: How to Use, Deploy, and Maintain Big Data Systems is a must-have for all professionals working with Big Data, as well researchers and students in computer science and related fields.