Social Media Data Mining and Analytics

Gabor Szabo

Gungor Polatkan

Oscar Boykin

Antonios Chalkiopoulos

About the Authors

Gabor Szabo works on large-scale data analysis and modeling problems in social networks, self-organized online ecosystems, transportation systems, and autonomous driving. Previously, his research focused on the description of randomly organized networks in online communities and biological systems at Harvard Medical School, the University of Notre Dame, and HP Labs. After that, he built distributed algorithms to understand and predict user behavior at Twitter. He has created models for resource allocation in Lyft's ride-sharing network, and most recently he led a team at Tesla's Autopilot.

Gungor Polatkan is a machine learning expert and engineering leader with experience in building massive-scale distributed data pipelines serving personalized content at LinkedIn and Twitter. Most recently, he led the design and implementation of the AI backend for LinkedIn Learning and ramped the recommendation engine from scratch to hyper-personalized models learning billions of coefficients for 500M+ users. He deployed some of the first deep ranking models for search verticals at LinkedIn, improving Talent Search. He enjoys leading teams, mentoring engineers, and fostering a culture of technical rigor and craftsmanship while iterating fast. He worked in several notable applied research groups at Twitter, Princeton, Google, MERL, and UC Berkeley before joining LinkedIn. He has published and refereed papers at top-tier ML and AI venues such as UAI, ICML, and PAMI.

Oscar Boykin works on machine learning infrastructure at Stripe, building systems to predict fraud at scale. Prior to Stripe, Oscar spent more than four years at Twitter, first working on modeling and prediction for ads, and later on data infrastructure systems. At Twitter, Oscar co-developed many open-source Scala libraries, including Scalding, Algebird, Summingbird, and Chill. Before Twitter, Oscar was an assistant professor of electrical and computer engineering at the University of Florida. Oscar has a Ph.D. in physics from the University of California, Los Angeles, and is the coauthor of dozens of academic papers in top journals and conferences.

Antonios Chalkiopoulos is a fast/big data distributed system specialist with experience in delivering production-grade data pipelines in the media, IoT, retail, and finance industries. Antonios is a published author in big data, an open source contributor, and the co-founder and CEO of Landoop LTD. Landoop LTD builds the innovative and award-winning Lenses platform for data in motion, which provides visibility and control over streaming data, data discovery via an intuitive web interface, and a comprehensive SQL experience for data in motion, together with monitoring, alerting, data governance, multi-tenancy, and security. Lenses is a complete user experience for building and managing real-time data pipelines and micro-services.

About the Technical Editors

Sriram Krishnan is a senior director of the Einstein Platform team at Salesforce, where he is responsible for the foundational services that bring machine learning capabilities to Salesforce. Prior to Salesforce, Sriram was head of the Data Platform team and a tech lead on the Big Data Platform team at Twitter. He holds a Ph.D. in Computer Science from Indiana University, and spent several years as a researcher and group lead at the San Diego Supercomputer Center, enabling scientific applications to use grid and cloud technologies. Sriram has co-authored more than 50 publications in the area of data, grid, and cloud computing, and his work has been cited more than 1,700 times. Sriram has contributed to several influential open source projects that are widely used in industry and academia.

Ben Peirce is director of XR Analytics at Samsung, which he joined on the acquisition of Vrtigo, a virtual reality analytics startup he co-founded. Previously, Ben built analytics systems at early-stage startups in healthcare and advertising technology for over a decade. He holds a Ph.D. from Harvard, where he studied control systems and robotics.

Dashun Wang is an associate professor of management and organizations at the Kellogg School of Management and (by courtesy) of industrial engineering and management sciences at the McCormick School of Engineering, and a core faculty member at NICO, the Northwestern Institute on Complex Systems. Dashun received his Ph.D. in physics in 2013 from Northeastern University, where he was a member of the Center for Complex Network Research. From 2009 to 2013, he also held an affiliation with Dana-Farber Cancer Institute, Harvard University as a research associate. He is a recipient of the AFOSR Young Investigator Award (2016).

Dr. Jian Wu is an assistant professor in the Department of Computer Science at Old Dominion University. Dr. Wu obtained his Ph.D. in 2011 from Pennsylvania State University and then worked with Dr. C. Lee Giles on the CiteSeerX project as a tech leader. Dr. Wu's research interests are text mining and knowledge extraction on scholarly big data using machine learning, deep learning, and natural language processing. He has published nearly 30 peer-reviewed papers in ACM, IEEE, and AAAI conferences and magazines, including best papers and nominations. He was named best reviewer at the ACM/IEEE Joint Conference on Digital Libraries (JCDL) 2018. As a tech leader, Dr. Wu made critical improvements to the architecture, web crawling, and extraction modules of CiteSeerX, increasing the collection to 10 million by 2017.

Credits

Project Editor
Tom Dinse

Technical Editors
Sriram Krishnan
Ben Peirce
Dashun Wang
Dr. Jian Wu

Production Editor
Athiyappan Lalith Kumar

Copy Editor
San Dee Phillips

Production Manager
Kathleen Wisor

Content Enablement and Operations Manager
Pete Gaughan

Marketing Manager
Christie Hilbrich

Associate Publisher
Jim Minatel

Project Coordinator, Cover
Brent Savage

Proofreader
Evelyn Wellborn

Indexer
Johnna VanHoose Dinse

Cover Designer
Wiley

To our families, who supported us even though writing this book took so much of our time away from them.

Acknowledgments

We would like to express our gratitude to our friends and colleagues at Twitter. Invaluable discussions and collaborations with them opened new perspectives that let us look at social media data in unexpected ways, and allowed us to work on tools and approaches that expanded our understanding of social media users. Their open-minded support throughout has always been greatly appreciated.

A very special thank you to Prof. David Blei, who provided the innovative research on topic modeling and a proper methodology for teaching machine learning through his Princeton class “Interacting with Data.” In this book, we followed his examples to cover the topics of representation learning and its applications in recommendation problems.

We would like to thank Jonathan Chang, the author of the R LDA package, for providing a machine learning tool for efficient and easy-to-use topic modeling techniques.

We would also like to thank Tom Dinse, Robert Elliott, and Jim Minatel, our editors at Wiley, who have led us down the path of publishing this book since the beginning, for their great project management and editorial review of the content. We also thank our team of technical editors for their review and insightful suggestions throughout the process. Moreover, we would like to thank all the people who worked behind the scenes to help put this book together.

Finally, the rest of the authors would like to thank one of us, Gabor Szabo, who patiently shepherded the entire book-writing process.