Details

Data Analytics in the AWS Cloud

Building a Data Platform for BI and Predictive Analytics on AWS
1st edition

by: Joe Minichino
38,99 €
Publisher:	Wiley
Format:	PDF
Published:	31.03.2023
ISBN/EAN:	9781119909262
Language:	English
Number of pages:	416

DRM-protected eBook; to read it you need, for example, Adobe Digital Editions and an Adobe ID.

Descriptions

Title description

A comprehensive and accessible roadmap to performing data analytics in the AWS cloud

In Data Analytics in the AWS Cloud: Building a Data Platform for BI and Predictive Analytics on AWS, accomplished software engineer and data architect Joe Minichino delivers an expert blueprint for storing, processing, and analyzing data on the Amazon Web Services cloud platform. In the book, you’ll explore every relevant aspect of data analytics, from data engineering to analysis, business intelligence, DevOps, and MLOps, as you discover how to integrate machine learning predictions with analytics engines and visualization tools. You’ll also find:

- Real-world use cases of AWS architectures that demystify the applications of data analytics
- Accessible introductions to data acquisition, importation, storage, visualization, and reporting
- Expert insights into serverless data engineering and how to use it to reduce overhead and costs, improve stability, and simplify maintenance

A can't-miss resource for data architects, analysts, engineers, and technical professionals, Data Analytics in the AWS Cloud will also earn a place on the bookshelves of business leaders seeking a better understanding of data analytics on the AWS cloud platform.

Table of contents

Introduction xxiii

Chapter 1 AWS Data Lakes and Analytics Technology Overview 1 Why AWS? 1 What Does a Data Lake Look Like in AWS? 2 Analytics on AWS 3 Skills Required to Build and Maintain an AWS Analytics Pipeline 3

Chapter 2 The Path to Analytics: Setting Up a Data and Analytics Team 5 The Data Vision 6 Support 6 DA Team Roles 7 Early Stage Roles 7 Team Lead 8 Data Architect 8 Data Engineer 8 Data Analyst 9 Maturity Stage Roles 9 Data Scientist 9 Cloud Engineer 10 Business Intelligence (BI) Developer 10 Machine Learning Engineer 10 Business Analyst 11 Niche Roles 11 Analytics Flow at a Process Level 12 Workflow Methodology 12 The DA Team Mantra: “Automate Everything” 14 Analytics Models in the Wild: Centralized, Distributed, Center of Excellence 15 Centralized 15 Distributed 16 Center of Excellence 16 Summary 17

Chapter 3 Working on AWS 19 Accessing AWS 20 Everything Is a Resource 21 S3: An Important Exception 21 IAM: Policies, Roles, and Users 22 Policies 22 Identity-Based Policies 24 Resource-Based Policies 25 Roles 25 Users and User Groups 25 Summarizing IAM 26 Working with the Web Console 26 The AWS Command-Line Interface 29 Installing AWS cli 29 Linux Installation 30 macOS Installation 30 Windows 31 Configuring AWS cli 31 A Note on Region 33 Setting Individual Parameters 33 Using Profiles and Configuration Files 33 Final Notes on Configuration 36 Using the AWS cli 36 Using Skeletons and File Inputs 39 Cleaning Up! 43 Infrastructure-as-Code: CloudFormation and Terraform 44 CloudFormation 44 CloudFormation Stacks 46 CloudFormation Template Anatomy 47 CloudFormation Changesets 52 Getting Stack Information 55 Cleaning Up Again 57 CloudFormation Conclusions 58 Terraform 58 Coding Style 58 Modularity 59 Limitations 59 Terraform vs. CloudFormation 60 Infrastructure-as-Code: CDK, Pulumi, Cloudcraft, and Other Solutions 60 AWS CDK 60 Pulumi 62 Cloudcraft 62 Infrastructure Management Conclusions 63

Chapter 4 Serverless Computing and Data Engineering 65 Serverless vs. Fully Managed 65 AWS Serverless Technologies 66 AWS Lambda 67 Pricing Model 67 Laser Focus on Code 68 The Lambda Paradigm Shift 69 Virtually Infinite Scalability 70 Geographical Distribution 70 A Lambda Hello World 71 Lambda Configuration 74 Runtime 74 Container-Based Lambdas 75 Architectures 75 Memory 75 Networking 76 Execution Role 76 Environment Variables 76 AWS EventBridge 77 AWS Fargate 77 AWS DynamoDB 77 AWS SNS 77 Amazon SQS 78 AWS CloudWatch 78 Amazon QuickSight 78 AWS Step Functions 78 Amazon API Gateway 79 Amazon Cognito 79 AWS Serverless Application Model (SAM) 79 Ephemeral Infrastructure 80 AWS SAM Installation 80 Configuration 80 Creating Your First AWS SAM Project 81 Application Structure 83 SAM Resource Types 85 SAM Lambda Template 86 !! Recursive Lambda Invocation !! 88 Function Metadata 88 Outputs 89 Implicitly Generated Resources 89 Other Template Sections 90 Lambda Code 90 Building Your First SAM Application 93 Testing the AWS SAM Application Locally 96 Deployment 99 Cleaning Up 104 Summary 104

Chapter 5 Data Ingestion 105 AWS Data Lake Architecture 106 Serverless Data Lake Architecture Structure 106 Ingestion 106 Storage and Processing 108 Cataloging, Governance, and Search 108 Security and Monitoring 109 Consumption 109 Sample Processing Architecture: Cataloging Images into DynamoDB 109 Use Case Description 109 SAM Application Creation 110 S3-Triggered Lambda 111 Adding DynamoDB 119 Lambda Execution Context 121 Inserting into DynamoDB 121 Cleaning Up 123 Serverless Ingestion 124 AWS Fargate 124 AWS Lambda 124 Example Architecture: Fargate-Based Periodic Batch Import 125 The Basic Importer 125 ECS CLI 128 AWS Copilot cli 128 Clean Up 136 AWS Kinesis Ingestion 136 Example Architecture: Two-Pronged Delivery 137 Fully Managed Ingestion with AppFlow 146 Operational Data Ingestion with Database Migration Service 151 DMS Concepts 151 DMS Instance 151 DMS Endpoints 152 DMS Tasks 152 Summary of the Workflow 152 Common Use of DMS 153 Example Architecture: DMS to S3 154 DMS Instance 154 DMS Endpoints 156 DMS Task 162 Summary 167

Chapter 6 Processing Data 169 Phases of Data Preparation 170 What Is ETL? Why Should I Care? 170 ETL Job vs. Streaming Job 171 Overview of ETL in AWS 172 ETL with AWS Glue 172 ETL with Lambda Functions 172 ETL with Hadoop/EMR 173 Other Ways to Perform ETL 173 ETL Job Design Concepts 173 Source Identification 174 Destination Identification 174 Mappings 174 Validation 174 Filter 175 Join, Denormalization, Relationalization 175 AWS Glue for ETL 176 Really, It’s Just Spark 176 Visual 176 Spark Script Editor 177 Python Shell Script Editor 177 Jupyter Notebook 177 Connectors 177 Creating Connections 178 Creating Connections with the Web Console 178 Creating Connections with the AWS cli 179 Creating ETL Jobs with AWS Glue Visual Editor 184 ETL Example: Format Switch from Raw (JSON) to Cleaned (Parquet) 184 Job Bookmarks 187 Transformations 188 Apply Mapping 189 Filter 189 Other Available Transforms 190 Run the Edited Job 191 Visual Editor with Source and Target Conclusions 192 Creating ETL Jobs with AWS Glue Visual Editor (without Source and Target) 192 Creating ETL Jobs with the Spark Script Editor 192 Developing ETL Jobs with AWS Glue Notebooks 193 What Is a Notebook? 194 Notebook Structure 194 Step 1: Load Code into a DynamicFrame 196 Step 2: Apply Field Mapping 197 Step 3: Apply the Filter 197 Step 4: Write to S3 in Parquet Format 198 Example: Joining and Denormalizing Data from Two S3 Locations 199 Conclusions for Manually Authored Jobs with Notebooks 203 Creating ETL Jobs with AWS Glue Interactive Sessions 204 It’s Magic 205 Development Workflow 206 Streaming Jobs 207 Differences with a Standard ETL Job 208 Streaming Sources 208 Example: Process Kinesis Streams with a Streaming Job 208 Streaming ETL Jobs Conclusions 217 Summary 217

Chapter 7 Cataloging, Governance, and Search 219 Cataloging with AWS Glue 219 AWS Glue and the AWS Glue Data Catalog 219 Glue Databases and Tables 220 Databases 220 The Idea of Schema-on-Read 221 Tables 222 Create Table Manually 223 Creating a Table from an Existing Schema 225 Creating a Table with a Crawler 225 Summary on Databases and Tables 226 Crawlers 226 Updating or Not Updating? 230 Running the Crawler 231 Creating a Crawler from the AWS CLI 231 Retrieving Table Information from the CLI 233 Classifiers 235 Classifier Example 236 Crawlers and Classifiers Summary 237 Search with Amazon Athena: The Heart of Analytics in AWS 238 A Bit of History 238 Interface Overview 238 Creating Tables Manually 239 Athena Data Types 240 Complex Types 241 Running a Query 242 Connecting with JDBC and ODBC 243 Query Stats 243 Recent Queries and Saved Queries 243 The Power of Partitions 244 Athena Pricing Model 244 Automatic Naming 245 Athena Query Output 246 Athena Peculiarities (SQL and Not) 246 Computed Fields Gotcha and WITH Statement Workaround 246 Lowercase! 247 Query Explain 248 Deduplicating Records 249 Working with JSON, Flattening, and Unnesting 250 Athena Views 251 Create Table as Select (CTAS) 252 Saving Queries and Reusing Saved Queries 253 Running Parameterized Queries 254 Athena Federated Queries 254 Athena Lambda Connectors 255 Note on Connection Errors 256 Performing Federated Queries 257 Creating a View from a Federated Query 258 Governing: Athena Workgroups, Lake Formation, and More 258 Athena Workgroups 259 Fine-Grained Athena Access with IAM 262 Recap of Athena-Based Governance 264 AWS Lake Formation 265 Registering a Location in Lake Formation 266 Creating a Database in Lake Formation 268 Assigning Permissions in Lake Formation 269 LF-Tags and Permissions in Lake Formation 271 Data Filters 277 Governance Conclusions 279 Summary 280

Chapter 8 Data Consumption: BI, Visualization, and Reporting 283 QuickSight 283 Signing Up for QuickSight 284 Standard Plan 284 Enterprise Plan 284 Users and User Groups 285 Managing Users and Groups 285 Managing QuickSight 286 Users and Groups 287 Your Subscriptions 287 SPICE Capacity 287 Account Settings 287 Security and Permissions 287 VPC Connections 288 Mobile Settings 289 Domains and Embedding 289 Single Sign-On 289 Data Sources and Datasets 289 Creating an Athena Data Source 291 Creating Other Data Sources 292 Creating a Data Source from the AWS cli 292 Creating a Dataset from a Table 294 Creating a Dataset from a SQL Query 295 Duplicating Datasets 296 Note on Creating Datasets 297 QuickSight Favorites, Recent, and Folders 297 SPICE 298 Manage SPICE Capacity 298 Refresh Schedule 299 QuickSight Data Editor 299 QuickSight Data Types 302 Change Data Types 302 Calculated Fields 303 Joining Data 305 Excluding Fields 309 Filtering Data 309 Removing Data 310 Geospatial Hierarchies and Adding Fields to Hierarchies 310 Unsupported Format Dates 311 Visualizing Data: QuickSight Analysis 312 Adding a Title and a Description to Your Analysis 313 Renaming the Sheet 314 Your First Visual with AutoGraph 314 Field Wells 314 Visuals Types 315 Saving and Autosaving 316 A First Example: Pie Chart 316 Renaming a Visual 317 Filtering Data 318 Adding Drill-Downs 320 Parameters 321 Actions 324 Insights 328 ML-Powered Insights 330 Sharing an Analysis 335 Dashboards 335 Dashboard Layouts and Themes 335 Publishing a Dashboard 336 Embedding Visuals and Dashboards 337 Data Consumption: Not Only Dashboards 337 Summary 338

Chapter 9 Machine Learning at Scale 339 Machine Learning and Artificial Intelligence 339 What Are ML/AI Use Cases? 340 Types of ML Models 340 Overview of ML/AI AWS Solutions 341 Amazon SageMaker 341 SageMaker Domains 342 Adding a User to the Domain 344 SageMaker Studio 344 SageMaker Example Notebook 346 Step 1: Prerequisites and Preprocessing 346 Step 2: Data Ingestion 347 Step 3: Data Inspection 348 Step 4: Data Conversion 349 Step 5: Upload Training Data 349 Step 6: Train the Model 349 Step 7: Set Up Hosting and Deploy the Model 351 Step 8: Validate the Model 352 Step 9: Use the Model 353 Inference 353 Real Time 354 Asynchronous 354 Serverless 354 Batch Transform 354 Data Wrangler 356 SageMaker Canvas 357 Summary 358

Appendix Example Data Architectures in AWS 359 Modern Data Lake Architecture 360 ETL in a Lake House 361 Consuming Data in the Lake House 361 The Modern Data Lake Architecture 362 Batch Processing 362 Stream Processing 363 Architecture Design Recommendations 364 Automate Everything 365 Build on Events 365 Performance = Cost Savings 365 AWS Glue Catalog and Athena-Centric Workflow 365 Design Flexible 365 Pick Your Battles 365 Parquet 366 Summary 366

Index 367

About the author

GIONATA “JOE” MINICHINO is Principal Software Engineer and Data Architect on the Data & Analytics Team at Teamwork. He specializes in cloud computing, machine/deep learning, and artificial intelligence and designs end-to-end Amazon Web Services pipelines that move large quantities of diverse data for analysis and visualization.

Back cover copy

Accessible and hands-on guidance for data analytics solutions built on the AWS cloud

In Data Analytics in the AWS Cloud: Building a Data Platform for BI and Predictive Analytics on AWS, veteran data architect and software engineer Joe Minichino delivers an insightful and practical blueprint for storing, processing, and analyzing data on the Amazon Web Services cloud platform. The author explains every relevant aspect of AWS data analytics, from data engineering to analysis, business intelligence, DevOps, MLOps, and more, as he walks you through how to integrate machine learning predictions with analytics engines and data visualization tools.

The book includes real-world case studies of businesses using AWS architectures to apply cutting-edge data analytics and offers comprehensive coverage of data acquisition, importation, storage, visualization, and reporting in the Amazon cloud environment. It also offers expert insights into serverless data engineering, showing you how to use it to reduce overhead and costs, simplify maintenance, and improve stability across the board.

An essential resource for data analysts, architects, and engineers, Data Analytics in the AWS Cloud will also benefit a wide variety of technical professionals and business leaders who seek a fuller understanding of how to use Amazon Web Services to enable and deploy superior data analytics solutions.