Details

Designing Big Data Platforms


Designing Big Data Platforms

How to Use, Deploy, and Maintain Big Data Systems
1. Aufl.

von: Yusuf Aytas

114,99 €

Verlag: Wiley
Format: PDF
Veröffentl.: 29.06.2021
ISBN/EAN: 9781119690948
Sprache: englisch
Anzahl Seiten: 336

DRM-geschütztes eBook, Sie benötigen z.B. Adobe Digital Editions und eine Adobe ID zum Lesen.

Beschreibungen

<b>DESIGNING BIG DATA PLATFORMS</b> <p><b>Provides expert guidance and valuable insights on getting the most out of Big Data systems</b><p>An array of tools are currently available for managing and processing data—some are ready-to-go solutions that can be immediately deployed, while others require complex and time-intensive setups. With such a vast range of options, choosing the right tool to build a solution can be complicated, as can determining which tools work well with each other. <i>Designing Big Data Platforms</i> provides clear and authoritative guidance on the critical decisions necessary for successfully deploying, operating, and maintaining Big Data systems.<p>This highly practical guide helps readers understand how to process large amounts of data with well-known Linux tools and database solutions, use effective techniques to collect and manage data from multiple sources, transform data into meaningful business insights, and much more. Author Yusuf Aytas, a software engineer with a vast amount of big data experience, discusses the design of the ideal Big Data platform: one that meets the needs of data analysts, data engineers, data scientists, software engineers, and a spectrum of other stakeholders across an organization. Detailed yet accessible chapters cover key topics such as stream data processing, data analytics, data science, data discovery, and data security. This real-world manual for Big Data technologies:<ul><li>Provides up-to-date coverage of the tools currently used in Big Data processing and management</li><li>Offers step-by-step guidance on building a data pipeline, from basic scripting to distributed systems</li><li>Highlights and explains how data is processed at scale</li><li>Includes an introduction to the foundation of a modern data platform</li></ul><p><i>Designing Big Data Platforms: How to Use, Deploy, and Maintain Big Data Systems</i> is a must-have for all professionals working with Big Data, as well researchers and students in computer science and related fields.
<p>List of Contributors <i>x</i>vii</p> <p>Preface xix</p> <p>Acknowledgments xxi</p> <p>Acronyms xxiii</p> <p>Introduction xxv</p> <p><b>1 An Introduction: What’s a Modern Big Data Platform </b><b>1</b></p> <p>1.1 Defining Modern Big Data Platform 1</p> <p>1.2 Fundamentals of a Modern Big Data Platform 2</p> <p>1.2.1 Expectations from Data 2</p> <p>1.2.1.1 Ease of Access 2</p> <p>1.2.1.2 Security 2</p> <p>1.2.1.3 Quality 3</p> <p>1.2.1.4 Extensibility 3</p> <p>1.2.2 Expectations from Platform 3</p> <p>1.2.2.1 Storage Layer 4</p> <p>1.2.2.2 Resource Management 4</p> <p>1.2.2.3 ETL 5</p> <p>1.2.2.4 Discovery 6</p> <p>1.2.2.5 Reporting 7</p> <p>1.2.2.6 Monitoring 7</p> <p>1.2.2.7 Testing 8</p> <p>1.2.2.8 Lifecycle Management 9</p> <p><b>2 A Bird’s Eye View on Big Data </b><b>11</b></p> <p>2.1 A Bit of History 11</p> <p>2.1.1 Early Uses of Big Data Term 11</p> <p>2.1.2 A New Era 12</p> <p>2.1.2.1 Word Count Problem 12</p> <p>2.1.2.2 Execution Steps 13</p> <p>2.1.3 An Open-Source Alternative 15</p> <p>2.1.3.1 Hadoop Distributed File System 15</p> <p>2.1.3.2 HadoopMapReduce 17</p> <p>2.2 What Makes Big Data 20</p> <p>2.2.1 Volume 20</p> <p>2.2.2 Velocity 21</p> <p>2.2.3 Variety 21</p> <p>2.2.4 Complexity 21</p> <p>2.3 Components of Big Data Architecture 22</p> <p>2.3.1 Ingestion 22</p> <p>2.3.2 Storage 23</p> <p>2.3.3 Computation 23</p> <p>2.3.4 Presentation 24</p> <p>2.4 Making Use of Big Data 24</p> <p>2.4.1 Querying 24</p> <p>2.4.2 Reporting 25</p> <p>2.4.3 Alerting 25</p> <p>2.4.4 Searching 25</p> <p>2.4.5 Exploring 25</p> <p>2.4.6 Mining 25</p> <p>2.4.7 Modeling 26</p> <p><b>3 A Minimal Data Processing and Management System </b><b>27</b></p> <p>3.1 Problem Definition 27</p> <p>3.1.1 Online Book Store 27</p> <p>3.1.2 User Flow Optimization 28</p> <p>3.2 Processing Large Data with Linux Commands 28</p> <p>3.2.1 Understand the Data 28</p> <p>3.2.2 Sample the Data 28</p> <p>3.2.3 Building the Shell Command 29</p> <p>3.2.4 Executing the Shell Command 30</p> <p>3.2.5 Analyzing the Results 31</p> <p>3.2.6 Reporting the Findings 32</p> <p>3.2.7 Automating the Process 33</p> <p>3.2.8 A Brief Review 33</p> <p>3.3 Processing Large Data with PostgreSQL 34</p> <p>3.3.1 Data Modeling 34</p> <p>3.3.2 Copying Data 35</p> <p>3.3.3 Sharding in PostgreSQL 37</p> <p>3.3.3.1 Setting up Foreign Data Wrapper 37</p> <p>3.3.3.2 Sharding Data over Multiple Nodes 38</p> <p>3.4 Cost of Big Data 39</p> <p><b>4 Big Data Storage </b><b>41</b></p> <p>4.1 Big Data Storage Patterns 41</p> <p>4.1.1 Data Lakes 41</p> <p>4.1.2 Data Warehouses 42</p> <p>4.1.3 Data Marts 43</p> <p>4.1.4 Comparison of Storage Patterns 43</p> <p>4.2 On-Premise Storage Solutions 44</p> <p>4.2.1 Choosing Hardware 44</p> <p>4.2.1.1 DataNodes 44</p> <p>4.2.1.2 NameNodes 45</p> <p>4.2.1.3 Resource Managers 45</p> <p>4.2.1.4 Network Equipment 45</p> <p>4.2.2 Capacity Planning 46</p> <p>4.2.2.1 Overall Cluster 46</p> <p>4.2.2.2 Resource Sharing 47</p> <p>4.2.2.3 Doing the Math 47</p> <p>4.2.3 Deploying Hadoop Cluster 48</p> <p>4.2.3.1 Networking 48</p> <p>4.2.3.2 Operating System 48</p> <p>4.2.3.3 Management Tools 49</p> <p>4.2.3.4 Hadoop Ecosystem 49</p> <p>4.2.3.5 A Humble Deployment 49</p> <p>4.3 Cloud Storage Solutions 53</p> <p>4.3.1 Object Storage 54</p> <p>4.3.2 Data Warehouses 55</p> <p>4.3.2.1 Columnar Storage 55</p> <p>4.3.2.2 Provisioned Data Warehouses 56</p> <p>4.3.2.3 Serverless Data Warehouses 56</p> <p>4.3.2.4 Virtual Data Warehouses 57</p> <p>4.3.3 Archiving 58</p> <p>4.4 Hybrid Storage Solutions 59</p> <p>4.4.1 Making Use of Object Store 59</p> <p>4.4.1.1 Additional Capacity 59</p> <p>4.4.1.2 Batch Processing 59</p> <p>4.4.1.3 Hot Backup 60</p> <p>4.4.2 Making Use of Data Warehouse 60</p> <p>4.4.2.1 Primary Data Warehouse 60</p> <p>4.4.2.2 Shared Data Mart 61</p> <p>4.4.3 Making Use of Archiving 61</p> <p><b>5 Offline Big Data Processing </b><b>63</b></p> <p>5.1 Defining Offline Data Processing 63</p> <p>5.2 MapReduce Technologies 65</p> <p>5.2.1 Apache Pig 65</p> <p>5.2.1.1 Pig Latin Overview 66</p> <p>5.2.1.2 Compilation To MapReduce 66</p> <p>5.2.2 Apache Hive 67</p> <p>5.2.2.1 Hive Database 68</p> <p>5.2.2.2 Hive Architecture 69</p> <p>5.3 Apache Spark 70</p> <p>5.3.1 What’s Spark 71</p> <p>5.3.2 Spark Constructs and Components 71</p> <p>5.3.2.1 Resilient Distributed Datasets 71</p> <p>5.3.2.2 Distributed Shared Variables 73</p> <p>5.3.2.3 Datasets and DataFrames 74</p> <p>5.3.2.4 Spark Libraries and Connectors 75</p> <p>5.3.3 Execution Plan 76</p> <p>5.3.3.1 The Logical Plan 77</p> <p>5.3.3.2 The Physical Plan 77</p> <p>5.3.4 Spark Architecture 77</p> <p>5.3.4.1 Inside of Spark Application 78</p> <p>5.3.4.2 Outside of Spark Application 79</p> <p>5.4 Apache Flink 81</p> <p>5.5 Presto 83</p> <p>5.5.1 Presto Architecture 83</p> <p>5.5.2 Presto System Design 84</p> <p>5.5.2.1 Execution Plan 84</p> <p>5.5.2.2 Scheduling 86</p> <p>5.5.2.3 Resource Management 86</p> <p>5.5.2.4 Fault Tolerance 87</p> <p><b>6 Stream Big Data Processing </b><b>89</b></p> <p>6.1 The Need for Stream Processing 89</p> <p>6.2 Defining Stream Data Processing 90</p> <p>6.3 Streams via Message Brokers 92</p> <p>6.3.1 Apache Kafka 92</p> <p>6.3.1.1 Apache Samza 93</p> <p>6.3.1.2 Kafka Streams 98</p> <p>6.3.2 Apache Pulsar 100</p> <p>6.3.2.1 Pulsar Functions 102</p> <p>6.3.3 AMQP Based Brokers 105</p> <p>6.4 Streams via Stream Engines 106</p> <p>6.4.1 Apache Flink 106</p> <p>6.4.1.1 Flink Architecture 107</p> <p>6.4.1.2 System Design 109</p> <p>6.4.2 Apache Storm 111</p> <p>6.4.2.1 Storm Architecture 114</p> <p>6.4.2.2 System Design 115</p> <p>6.4.3 Apache Heron 116</p> <p>6.4.3.1 Storm Limitations 116</p> <p>6.4.3.2 Heron Architecture 117</p> <p>6.4.4 Spark Streaming 118</p> <p>6.4.4.1 Discretized Streams 119</p> <p>6.4.4.2 Fault-tolerance 120</p> <p><b>7 Data Analytics </b><b>121</b></p> <p>7.1 Log Collection 121</p> <p>7.1.1 Apache Flume 122</p> <p>7.1.2 Fluentd 122</p> <p>7.1.2.1 Data Pipeline 123</p> <p>7.1.2.2 Fluent Bit 124</p> <p>7.1.2.3 Fluentd Deployment 124</p> <p>7.2 Transferring Big Data Sets 125</p> <p>7.2.1 Reloading 126</p> <p>7.2.2 Partition Loading 126</p> <p>7.2.3 Streaming 127</p> <p>7.2.4 Timestamping 127</p> <p>7.2.5 Tools 128</p> <p>7.2.5.1 Sqoop 128</p> <p>7.2.5.2 Embulk 128</p> <p>7.2.5.3 Spark 129</p> <p>7.2.5.4 Apache Gobblin 130</p> <p>7.3 Aggregating Big Data Sets 132</p> <p>7.3.1 Data Cleansing 132</p> <p>7.3.2 Data Transformation 134</p> <p>7.3.2.1 Transformation Functions 134</p> <p>7.3.2.2 Transformation Stages 135</p> <p>7.3.3 Data Retention 135</p> <p>7.3.4 Data Reconciliation 136</p> <p>7.4 Data Pipeline Scheduler 136</p> <p>7.4.1 Jenkins 137</p> <p>7.4.2 Azkaban 138</p> <p>7.4.2.1 Projects 139</p> <p>7.4.2.2 Execution Modes 139</p> <p>7.4.3 Airflow 139</p> <p>7.4.3.1 Task Execution 140</p> <p>7.4.3.2 Scheduling 141</p> <p>7.4.3.3 Executor 141</p> <p>7.4.3.4 Security and Monitoring 142</p> <p>7.4.4 Cloud 143</p> <p>7.5 Patterns and Practices 143</p> <p>7.5.1 Patterns 143</p> <p>7.5.1.1 Data Centralization 143</p> <p>7.5.1.2 Singe Source of Truth 144</p> <p>7.5.1.3 Domain Driven Data Sets 145</p> <p>7.5.2 Anti-Patterns 146</p> <p>7.5.2.1 Data Monolith 146</p> <p>7.5.2.2 Data Swamp 147</p> <p>7.5.2.3 Technology Pollution 147</p> <p>7.5.3 Best Practices 148</p> <p>7.5.3.1 Business-Driven Approach 148</p> <p>7.5.3.2 Cost of Maintenance 148</p> <p>7.5.3.3 Avoiding Modeling Mistakes 149</p> <p>7.5.3.4 Choosing Right Tool for The Job 150</p> <p>7.5.4 Detecting Anomalies 150</p> <p>7.5.4.1 Manual Anomaly Detection 151</p> <p>7.5.4.2 Automated Anomaly Detection 151</p> <p>7.6 Exploring Data Visually 152</p> <p>7.6.1 Metabase 152</p> <p>7.6.2 Apache Superset 153</p> <p><b>8 Data Science </b><b>155</b></p> <p>8.1 Data Science Applications 155</p> <p>8.1.1 Recommendation 156</p> <p>8.1.2 Predictive Analytics 156</p> <p>8.1.3 Pattern Discovery 157</p> <p>8.2 Data Science Life Cycle 158</p> <p>8.2.1 Business Objective 158</p> <p>8.2.2 Data Understanding 159</p> <p>8.2.3 Data Ingestion 159</p> <p>8.2.4 Data Preparation 160</p> <p>8.2.5 Data Exploration 160</p> <p>8.2.6 Feature Engineering 161</p> <p>8.2.7 Modeling 161</p> <p>8.2.8 Model Evaluation 162</p> <p>8.2.9 Model Deployment 163</p> <p>8.2.10 Operationalizing 163</p> <p>8.3 Data Science Toolbox 164</p> <p>8.3.1 R 164</p> <p>8.3.2 Python 165</p> <p>8.3.3 SQL 167</p> <p>8.3.4 TensorFlow 167</p> <p>8.3.4.1 Execution Model 169</p> <p>8.3.4.2 Architecture 170</p> <p>8.3.5 Spark MLlib 171</p> <p>8.4 Productionalizing Data Science 173</p> <p>8.4.1 Apache PredictionIO 173</p> <p>8.4.1.1 Architecture Overview 174</p> <p>8.4.1.2 Machine Learning Templates 174</p> <p>8.4.2 Seldon 175</p> <p>8.4.3 MLflow 175</p> <p>8.4.3.1 MLflow Tracking 175</p> <p>8.4.3.2 MLflow Projects 176</p> <p>8.4.3.3 MLflow Models 176</p> <p>8.4.3.4 MLflow Model Registry 177</p> <p>8.4.4 Kubeflow 177</p> <p>8.4.4.1 Kubeflow Pipelines 177</p> <p>8.4.4.2 Kubeflow Metadata 178</p> <p>8.4.4.3 Kubeflow Katib 178</p> <p><b>9 Data Discovery </b><b>179</b></p> <p>9.1 Need for Data Discovery 179</p> <p>9.1.1 Single Source of Metadata 180</p> <p>9.1.2 Searching 181</p> <p>9.1.3 Data Lineage 182</p> <p>9.1.4 Data Ownership 182</p> <p>9.1.5 Data Metrics 183</p> <p>9.1.6 Data Grouping 183</p> <p>9.1.7 Data Clustering 184</p> <p>9.1.8 Data Classification 184</p> <p>9.1.9 Data Glossary 185</p> <p>9.1.10 Data Update Notification 185</p> <p>9.1.11 Data Presentation 186</p> <p>9.2 Data Governance 186</p> <p>9.2.1 Data Governance Overview 186</p> <p>9.2.1.1 Data Quality 186</p> <p>9.2.1.2 Metadata 187</p> <p>9.2.1.3 Data Access 187</p> <p>9.2.1.4 Data Life Cycle 188</p> <p>9.2.2 Big Data Governance 188</p> <p>9.2.2.1 Data Architecture 188</p> <p>9.2.2.2 Data Source Integration 189</p> <p>9.3 Data Discovery Tools 189</p> <p>9.3.1 Metacat 189</p> <p>9.3.1.1 Data Abstraction and Interoperability 190</p> <p>9.3.1.2 Metadata Enrichment 190</p> <p>9.3.1.3 Searching and Indexing 190</p> <p>9.3.1.4 Update Notifications 191</p> <p>9.3.2 Amundsen 191</p> <p>9.3.2.1 Discovery Capabilities 191</p> <p>9.3.2.2 Integration Points 192</p> <p>9.3.2.3 Architecture Overview 192</p> <p>9.3.3 Apache Atlas 193</p> <p>9.3.3.1 Searching 194</p> <p>9.3.3.2 Glossary 194</p> <p>9.3.3.3 Type System 194</p> <p>9.3.3.4 Lineage 195</p> <p>9.3.3.5 Notifications 196</p> <p>9.3.3.6 Bridges and Hooks 196</p> <p>9.3.3.7 Architecture Overview 196</p> <p><b>10 Data Security </b><b>199</b></p> <p>10.1 Infrastructure Security 199</p> <p>10.1.1 Computing 200</p> <p>10.1.1.1 Auditing 200</p> <p>10.1.1.2 Operating System 200</p> <p>10.1.1.3 Network 200</p> <p>10.1.2 Identity and Access Management 201</p> <p>10.1.2.1 Authentication 201</p> <p>10.1.2.2 Authorization 201</p> <p>10.1.3 Data Transfers 202</p> <p>10.2 Data Privacy 202</p> <p>10.2.1 Data Encryption 202</p> <p>10.2.1.1 File System Layer Encryption 203</p> <p>10.2.1.2 Database Layer Encryption 203</p> <p>10.2.1.3 Transport Layer Encryption 203</p> <p>10.2.1.4 Application Layer Encryption 203</p> <p>10.2.2 Data Anonymization 204</p> <p>10.2.2.1 <i>k</i>-Anonymity 204</p> <p>10.2.2.2 <i>l</i>-Diversity 204</p> <p>10.2.3 Data Perturbation 204</p> <p>10.3 Law Enforcement 205</p> <p>10.3.1 PII 205</p> <p>10.3.1.1 Identifying PII Tables/Columns 205</p> <p>10.3.1.2 Segregating Tables Containing PII 205</p> <p>10.3.1.3 Protecting PII Tables via Access Control 206</p> <p>10.3.1.4 Masking and Anonymizing PII Data 206</p> <p>10.3.2 Privacy Regulations/Acts 207</p> <p>10.3.2.1 Collecting Data 207</p> <p>10.3.2.2 Erasing Data 207</p> <p>10.4 Data Security Tools 208</p> <p>10.4.1 Apache Ranger 208</p> <p>10.4.1.1 Ranger Policies 208</p> <p>10.4.1.2 Managed Components 209</p> <p>10.4.1.3 Architecture Overview 211</p> <p>10.4.2 Apache Sentry 212</p> <p>10.4.2.1 Managed Components 212</p> <p>10.4.2.2 Architecture Overview 213</p> <p>10.4.3 Apache Knox 214</p> <p>10.4.3.1 Authentication and Authorization 215</p> <p>10.4.3.2 Supported Hadoop Services 215</p> <p>10.4.3.3 Client Services 216</p> <p>10.4.3.4 Architecture Overview 217</p> <p>10.4.3.5 Audit 218</p> <p><b>11 Putting All Together </b><b>219</b></p> <p>11.1 Platforms 219</p> <p>11.1.1 In-house Solutions 220</p> <p>11.1.1.1 Cloud Provisioning 220</p> <p>11.1.1.2 On-premise Provisioning 221</p> <p>11.1.2 Cloud Providers 221</p> <p>11.1.2.1 Vendor Lock-in 222</p> <p>11.1.2.2 Outages 223</p> <p>11.1.3 Hybrid Solutions 223</p> <p>11.1.3.1 Kubernetes 224</p> <p>11.2 Big Data Systems and Tools 224</p> <p>11.2.1 Storage 224</p> <p>11.2.1.1 File-Based Storage 225</p> <p>11.2.1.2 NoSQL 225</p> <p>11.2.2 Processing 226</p> <p>11.2.2.1 Batch Processing 226</p> <p>11.2.2.2 Stream Processing 227</p> <p>11.2.2.3 Combining Batch and Streaming 227</p> <p>11.2.3 Model Training 228</p> <p>11.2.4 A Holistic View 228</p> <p>11.3 Challenges 229</p> <p>11.3.1 Growth 229</p> <p>11.3.2 SLA 230</p> <p>11.3.3 Versioning 231</p> <p>11.3.4 Maintenance 232</p> <p>11.3.5 Deprecation 233</p> <p>11.3.6 Monitoring 233</p> <p>11.3.7 Trends 234</p> <p>11.3.8 Security 235</p> <p>11.3.9 Testing 235</p> <p>11.3.10 Organization 236</p> <p>11.3.11 Talent 237</p> <p>11.3.12 Budget 238</p> <p><b>12 An Ideal Platform </b><b>239</b></p> <p>12.1 Event Sourcing 240</p> <p>12.1.1 How It Works 240</p> <p>12.1.2 Messaging Middleware 241</p> <p>12.1.3 Why Use Event Sourcing 242</p> <p>12.2 Kappa Architecture 242</p> <p>12.2.1 How It Works 243</p> <p>12.2.2 Limitations 244</p> <p>12.3 Data Mesh 245</p> <p>12.3.1 Domain-Driven Design 245</p> <p>12.3.2 Self-Serving Infrastructure 247</p> <p>12.3.3 Data as Product Approach 247</p> <p>12.4 Data Reservoirs 248</p> <p>12.4.1 Data Organization 248</p> <p>12.4.2 Data Standards 249</p> <p>12.4.3 Data Policies 249</p> <p>12.4.4 Multiple Data Reservoirs 250</p> <p>12.5 Data Catalog 250</p> <p>12.5.1 Data Feedback Loop 251</p> <p>12.5.2 Data Synthesis 252</p> <p>12.6 Self-service Platform 252</p> <p>12.6.1 Data Publishing 253</p> <p>12.6.2 Data Processing 254</p> <p>12.6.3 Data Monitoring 254</p> <p>12.7 Abstraction 254</p> <p>12.7.1 Abstractions via User Interface 255</p> <p>12.7.2 Abstractions via Wrappers 256</p> <p>12.8 Data Guild 256</p> <p>12.9 Trade-offs 257</p> <p>12.9.1 Quality vs Efficiency 258</p> <p>12.9.2 Real time vs Offline 258</p> <p>12.9.3 Performance vs Cost 258</p> <p>12.9.4 Consistency vs Availability 259</p> <p>12.9.5 Disruption vs Reliability 259</p> <p>12.9.6 Build vs Buy 259</p> <p>12.10 Data Ethics 260</p> <p><b>Appendix A Further Systems and Patterns </b><b>261</b></p> <p>A.1 Lambda Architecture 261</p> <p>A.1.1 Batch Layer 262</p> <p>A.1.2 Speed Layer 262</p> <p>A.1.3 Serving Layer 262</p> <p>A.2 Apache Cassandra 263</p> <p>A.2.1 Cassandra Data Modeling 263</p> <p>A.2.2 Cassandra Architecture 265</p> <p>A.2.2.1 Cassandra Components 265</p> <p>A.2.2.2 Storage Engine 267</p> <p>A.3 Apache Beam 267</p> <p>A.3.1 Programming Overview 268</p> <p>A.3.2 Execution Model 270</p> <p><b>Appendix B Recipes </b><b>271</b></p> <p>B.1 Activity Tracking Recipe 271</p> <p>B.1.1 Problem Statement 271</p> <p>B.1.2 Design Approach 271</p> <p>B.1.2.1 Data Ingestion 271</p> <p>B.1.2.2 Computation 272</p> <p>B.2 Data Quality Assurance 273</p> <p>B.2.1 Problem Statement 273</p> <p>B.2.2 Design Approach 273</p> <p>B.2.2.1 Ingredients 273</p> <p>B.2.2.2 Preparation 275</p> <p>B.2.2.3 The Menu 277</p> <p>B.3 Estimating Time to Delivery 277</p> <p>B.3.1 Problem Definition 278</p> <p>B.3.2 Design Approach 278</p> <p>B.3.2.1 Streaming Reference Architecture 278</p> <p>B.4 Incident Response Recipe 283</p> <p>B.4.1 Problem Definition 283</p> <p>B.4.2 Design Approach 283</p> <p>B.4.2.1 Minimizing Disruptions 284</p> <p>B.4.2.2 Categorizing Disruptions 284</p> <p>B.4.2.3 Handling Disruptions 284</p> <p>B.5 Leveraging Spark SQL Metrics 286</p> <p>B.5.1 Problem Statement 286</p> <p>B.5.2 Design Approach 286</p> <p>B.5.2.1 Spark SQL Metrics Pipeline 286</p> <p>B.5.2.2 Integrating SQL Metrics 288</p> <p>B.5.2.3 The Menu 289</p> <p>B.6 Airbnb Price Prediction 289</p> <p>B.6.1 Problem Statement 290</p> <p>B.6.2 Design Approach 290</p> <p>B.6.2.1 Tech Stack 290</p> <p>B.6.2.2 Data Understanding and Ingestion 291</p> <p>B.6.2.3 Data Preparation 291</p> <p>B.6.2.4 Modeling and Evaluation 293</p> <p>B.6.2.5 Monitor and Iterate 294</p> <p>Bibliography 295</p> <p>Index 301</p>
<p><b>Yusuf Aytas</b> is an experienced software engineer. He holds a Master and Bachelor of Computer Science from Bilkent University, Ankara, Turkey. Yusuf has experience in building highly scalable, optimized analytics platforms. He is proficient in agile methodologies, continuous delivery, and software development best practices.</p>
<p><b>Provides expert guidance and valuable insights on getting the most out of Big Data systems</b></p><p>An array of tools are currently available for managing and processing data—some are ready-to-go solutions that can be immediately deployed, while others require complex and time-intensive setups. With such a vast range of options, choosing the right tool to build a solution can be complicated, as can determining which tools work well with each other. <i>Designing Big Data Platforms</i> provides clear and authoritative guidance on the critical decisions necessary for successfully deploying, operating, and maintaining Big Data systems.</p><p>This highly practical guide helps readers understand how to process large amounts of data with well-known Linux tools and database solutions, use effective techniques to collect and manage data from multiple sources, transform data into meaningful business insights, and much more. Author Yusuf Aytas, a software engineer with a vast amount of big data experience, discusses the design of the ideal Big Data platform: one that meets the needs of data analysts, data engineers, data scientists, software engineers, and a spectrum of other stakeholders across an organization. Detailed yet accessible chapters cover key topics such as stream data processing, data analytics, data science, data discovery, and data security. This real-world manual for Big Data technologies:</p><ul><li>Provides up-to-date coverage of the tools currently used in Big Data processing and management</li><li>Offers step-by-step guidance on building a data pipeline, from basic scripting to distributed systems</li><li>Highlights and explains how data is processed at scale</li><li>Includes an introduction to the foundation of a modern data platform</li></ul><p><i>Designing Big Data Platforms: How to Use, Deploy, and Maintain Big Data Systems</i> is a must-have for all professionals working with Big Data, as well researchers and students in computer science and related fields.</p>

Diese Produkte könnten Sie auch interessieren:

Statistics for Microarrays
Statistics for Microarrays
von: Ernst Wit, John McClure
PDF ebook
90,99 €