
Cover
Introduction
1. Human Interactions Measured
2. Asking and Answering Questions with Data
3. The Datasets Used in This Book
4. The Languages and Frameworks Used in This Book
5. System Requirements to Run the Examples
6. Overview of the Chapters
7. Online Repository for the Book
CHAPTER 1: Users: The Who of Social Media
1. Measuring Variations in User Behavior in Wikipedia
2. Long Tails Everywhere: The 80/20 Rule (p/q Rule)
3. Online Behavior on Twitter
4. Summary
CHAPTER 2: Networks: The How of Social Media
1. Types and Properties of Social Networks
2. Visualizing Networks
3. Degrees: The Winner Takes All
4. Capturing Correlations: Triangles, Clustering, and Assortativity
5. Summary
CHAPTER 3: Temporal Processes: The When of Social Media
1. What Traditional Models Tell You About Events in Time
2. Inter-Event Times
3. Bursty Activities of Individuals
4. Forecasting Metrics in Time
5. Summary
CHAPTER 4: Content: The What of Social Media
1. Defining Content: Focus on Text and Unstructured Data
2. Using Content Features to Identify Topics
3. Extracting Low-Dimensional Information from High-Dimensional Text
4. Summary
CHAPTER 5: Processing Large Datasets
1. MapReduce: Structuring Parallel and Sequential Operations
2. Multi-Stage MapReduce Flows
3. Patterns in MapReduce Programming
4. Sampling and Approximations: Getting Results with Less Computation
5. Executing on a Hadoop Cluster (Amazon EC2)
6. Summary
CHAPTER 6: Learn, Map, and Recommend
1. Social Media Services Online
2. Problem Formulation
3. Learning and Mapping
4. Prediction and Recommendation
5. Summary
CHAPTER 7: Conclusions
1. The Surprising Stability of Human Interaction Patterns
2. Averages, Standard Deviations, and Sampling
3. Removing Outliers
Index
End User License Agreement

List of Tables

Introduction
1. Table I.1: Descriptions and Locations of the Datasets Used in This Book
2. Table I.2: The R Packages Used in This Book's Code Examples
3. Table I.3: The Python Packages Used in This Book's Code Examples
Chapter 1
1. Table 1.1: The Most Active Wikipedia Accounts in Our Period 1, January 2013
Chapter 4
1. Table 4.1: The expectation for the relative frequency of the top five most used tags by any individual user. On average, the most frequently used tag appears in 52.1% of a user's posts, the second most frequent tag appears 25.8% of the time, and so on.
Chapter 5
1. Table 5.1: A Comparison of Traditional and the Corresponding Probabilistic Data Structures, and the Questions They Are Designed to Answer
2. Table 5.2: The Estimation Error of the HLL as a Function of the Amount of Storage Allocated to It
3. Table 5.3: The Sizes of the Datasets We Generated for the HLL Performance Evaluation
4. Table 5.4: Bloom Filter Size Per 100 Thousand Elements
5. Table 5.5: The Count-Min Sketch Sizes as a Function of the Desired Error Bounds
Chapter 7
1. Table 7.1: The Average Revision Counts after a Certain Percentage of the Most Active Users Have Been Removed
2. Table 7.2: The measured means and standard deviations of the sampling distributions with varying sample sizes. The expected mean of the sampling distribution is 5/3 ≈ 1.67. The “SD Expectation” column is calculated using Equation 7.7, divided by the square root of the sample size, as in Listing 7.2. This column should be comparable to the “Measured Standard Deviation” column.
3. Table 7.3: The changes in the means and the standard deviations of the remaining samples, after the largest values are removed from the power-law and the normal distribution, respectively. In this example, the power law had an exponent of −3.5, and the normal distribution had a mean of 10 and a standard deviation of 1. The fractions by which the size of the original samples were truncated are shown in the top row of the table.

List of Illustrations

Introduction
1. Figure I.1: An entry from the online encyclopedia Wikipedia about Wikipedia
2. Figure I.2: A screen shot of a typical Twitter search timeline. Tweets appear in the main section, whereas trending topics and “who to follow” recommendations are shown on the side.
3. Figure I.3: Stack Exchange is a question answering service with a lot of topical sub-sites. We chose the Science Fiction & Fantasy category as it is not overly technical in nature (compared to computer-related categories or those focused on mathematics, for instance), yet has a decent number of users and amount of content.
4. Figure I.4: The main page of LiveJournal, a blogging platform that encourages the creation of communities as well
5. Figure I.5: An interactive IPython session with plotting
Chapter 1
1. Figure 1.1: Possible choices for sampling time windows to measure aggregate user activity. In scenario (a), we pick non‐consecutive time windows randomly. In (b), we choose a continuous time window between two given points in time.
2. Figure 1.2: The number of editors who made a certain number of revisions to any Wikipedia article during January 2013. The horizontal axis has been truncated to show no more than 20 edits a month; however, the data shows that you can find users with tens of thousands of edits as well.
3. Figure 1.3: The number of revisions for three different time windows: for the first, the first 2, and first 3 months of 2013, respectively, as is also indicated by the figure's legend. The calculations were made the exact same way as for Figure 1.2.
4. Figure 1.4: The probability that a user will make a given number of Wikipedia edits. Note that the functions for the three periods overlap to a large degree, and it is hard to see a difference for any but the first data point.
5. Figure 1.5: The number of active users was taken for the three periods we used before (Jan 2013 for Period 1, Jan–Feb 2013 for Period 2, and Jan–Mar 2013 for Period 3) for any given revision count. For this plot, we divided the number of editors with a given number of revisions in Period 2 by the number in Period 1 with the same number of revisions, and plotted the ratio with the dark line. Similarly, we took the ratio of the user counts in Period 3 to those in Period 1 and plotted it with the lighter line.
6. Figure 1.6: The average number of edits made by users in Periods 2 (and 3), given that they made a certain number of edits in Period 1. The average values seem to be in a linear relationship with the number of edits in Period 1, and the best fitting linear functions are shown with the straight lines.
7. Figure 1.7: Similar to Figure 1.3, we show the number of users who made a certain number of revisions in the three time periods. However, in this figure, we rescaled both axes logarithmically, so we can now clearly observe the power law relationships.
8. Figure 1.8: The cumulative distribution function of the number of users with a given number of edits. The CDF is the fraction of users with no greater than a specific number of edits. We rescaled the horizontal axis for better visibility.
9. Figure 1.9: The tail distribution of the users' revisions: This shows what fraction of users had more revisions than a threshold. Comparing this with Figure 1.8, we can immediately see that the two functions add up to 1.
10. Figure 1.10: The tail distribution of the users' revisions again as in Figure 1.9, but this time on double‐logarithmic axes. We can now see that, similar to the PDF, this tail distribution also follows a power law (or two power laws, given the slightly faster decay at the end as we can recognize by the steeper linear section of the plot starting at approximately $images$ revisions).
11. Figure 1.11: Imagine that we are summing up the values of a function indicated by the five white dots. This is the same as the sum of the areas of the white (unfilled) rectangles. However, we can approximate this area by taking the integral of the underlying continuous function as well (shaded by gray): $images$ . Although we cover a slightly smaller area as can be seen from the figure, the error we're making is negligible compared to the actual differences between our model and the actual social media system.
12. Figure 1.12: The fraction of all edits made by users up to a certain rank. We rescaled the axis for the user rank logarithmically.
13. Figure 1.13: The expectation for the fraction of activities generated by the most active users, in a hypothetical system whose user activity distribution perfectly follows the $images$ law. The horizontal axis shows how many of the most active users we consider, and the vertical axis is the proportion of activities attributable to them. We show three examples with different γ exponents. The lighter unmarked line shows the real measurements for Wikipedia, which happens to be the initial part of the line in Figure 1.12.
14. Figure 1.14: The probability distribution function for the number of users with a given number of Tweets sent, for a one‐week, a two‐week, and a three‐week period, respectively.
15. Figure 1.15: Logarithmic binning illustrated. In this example our original range is 1 … 100, and we split this range up into six buckets that increase exponentially in size: The length of every bucket is a constant multiple of the previous one. You can see that in the beginning the buckets are short, whereas their size is growing rapidly on this natural, linear scale.
16. Figure 1.16: The average number of Tweets that Twitter users sent in Periods 2 and 3, as a function of the number of Tweets they sent in Period 1, respectively.
Chapter 2
1. Figure 2.1: A small graph where the source S wants to send a message to destination node D. The time it takes for one node to relay the message to another one is shown on the links, and these are the weights of the links. To indicate that a node cannot contact another one, we can use infinite (∞) weights, such as between S and $images$ and between $images$ and $images$ .
2. Figure 2.2: A small subset of the Wikipedia user talk connection network. We started from a randomly chosen center node (see callout), and performed a breadth-first search on the network to include all nodes with a distance of at most 3 to our center node. Only the connections among the included nodes are kept. Also, we shaded nodes according to how many hops they are from the center: Black nodes are at distance 1 (there are few of these), gray ones at distance 2, and white nodes are at distance 3.
3. Figure 2.3: The in- and out-degree distributions of the Wikipedia talk network. We counted how many other users a given user contacted, or was contacted by, through talk; this is what we call out- and in-degree, respectively.
4. Figure 2.4: The in- and out-degree distributions of the Twitter follower network. Here the number of followers and followees were taken at a specific time for a sample of representative users. We can see an idiosyncratic spike for the number of followees at 2,000, which is because the Twitter service used to limit the number of other users a user may follow under normal circumstances at this count.
5. Figure 2.5: Suppose we have three bins, with six, three, and one ball in each, respectively. We would like to pick a bin (1 to 3) for a new ball to be put into, and this should happen with a probability proportional to the balls already in them.
6. Figure 2.6: An illustration of how to calculate the local clustering coefficient in a directed network. The central node i has 5 neighbors (considering both incoming and outgoing links). Among the neighbors, we find 4 links, but if all the connections were there, we could have 5 × 4 = 20 links. Therefore, the node's local clustering coefficient is 4 / 20 = 0.2.
7. Figure 2.7: The average local clustering coefficients in the Wikipedia talk network as a function of the node's degree
8. Figure 2.8: The average local clustering coefficients in the LiveJournal social network as a function of the node's degree. We show both the actual measurements (circles) and the reference rewired version (squares), as described in the text.
9. Figure 2.9: The repeated procedure through which we rewire the graph to get a reference graph for comparison for the clustering coefficient. Originally, nodes A and C, and B and D were connected, respectively, as shown by the thinner arrows. After we rewire the pair, A and D, and B and C will be linked.
10. Figure 2.10: The average degrees of the neighbors of nodes with a given degree. We call the latter reference nodes. We consider both in- and out-degrees for the neighbors, and neighbors were also taken as nodes on the other ends of both the incoming and outgoing edges of the reference node. We only show reference node degrees up to 200, beyond which the data becomes noisy. The dashed line shows the y = x identity relationship.
Chapter 3
1. Figure 3.1: The distribution of times between Tweets mentioning “lunch” over a 1-hour, and a 24-hour period, respectively. Tweet arrival times could only be measured in seconds. The vertical axis has been rescaled logarithmically to show correspondences with the exponential function.
2. Figure 3.2: Number of mentions of “lunch” per minute on Twitter on a given day
3. Figure 3.3: Time of the $images$ mention of “lunch” on Twitter
4. Figure 3.4: Time between successive mentions of “lunch” on Twitter on a particular day. Note that the solid-looking black plot is the result of high variances between the inter-event times of the adjacent Tweets.
5. Figure 3.5: Exponentially smoothed time-between-mentions of lunch
6. Figure 3.6: Inter-arrival times for the memoryless (Poisson) process as a function of the event index
7. Figure 3.7: Counts in windows of 60 seconds for the memoryless, Poisson data
8. Figure 3.8: The exponential moving average of the memoryless, generated data
9. Figure 3.9: The Q-Q (quantile-quantile) plot of the measured inter-arrival time distribution of the Tweets versus the theoretical exponential distribution. You want to test if the measured distribution follows an exponential. You can see that at the higher quantile ranges, the distributions differ.
10. Figure 3.10: Autocorrelation of the time between mentions of lunch, as a function of the number of time step lags, which is measured in minutes
11. Figure 3.11: Autocorrelation of the shuffled data: not much to see
12. Figure 3.12: Time between events after 0 (E) and after chaining 1 (E → I) and 2 (E → I → I) impatient filtering agents
13. Figure 3.13: Tweets per minute containing the string “lunch” for 30 days
14. Figure 3.14: Distribution of the number of Tweets containing the word “lunch” per minute for 30 days
15. Figure 3.15: The total number of Tweets that fell on each minute of the day for each day of a 30-day period
16. Figure 3.16: Difference of the actual number of Tweets per minute from the average daily cycle
17. Figure 3.17: Distribution of the difference between the number of Tweets per minute and what we expect from the daily average cycle. Note how much closer to Gaussian this is.
18. Figure 3.18: Minute-of-week total “lunch” Tweets per minute for 4 weeks
19. Figure 3.19: Difference of the actual data and the 4-week average referred to in Figure 3.18
20. Figure 3.20: Distribution of difference in “lunch” Tweets per minute from the weekly cycle average
21. Figure 3.21: The number of revisions per hour for four randomly selected Wikipedia editors, during the first week of 2013. We partitioned the 1-week date range by non-overlapping hours and counted the times that a particular user made edits in any hour. User A was active at irregular times but made quite a few edits on 4 days; User B's pattern is similar to this. User C, however, made edits more often and was active for longer times in a “session.” In User D's time series, you can see low levels of activities most of the time and active hours 7 times on each day at approximately the same hour during the week.
22. Figure 3.22: The bold line shows the revisions per hour for a randomly chosen (but high-activity) Wikipedia editor. The lighter line is a simulation of a Poisson process that results in approximately the same number of total events during the whole week as the chosen editor had (approximately 1,200). We achieve this by choosing the rate parameter of the Poisson process such that the expected number of events per time unit is the same as the number of events we observed for the given user, λ = (# of events) / (1 week).
23. Figure 3.23: The empirical probability density function of times between two edits of any given user, with a bold line. The measured inter-edit time distribution has a power-law exponent of approximately −1.3. With the lighter line, we show the probability density function for a Poisson process with a rate parameter λ that would result in the same number of total edit events as what we have in the measurement. We used logarithmic binning to compute the empirical PDF for the Wikipedia inter-edit times.
24. Figure 3.24: The PDF of the times between two edits on any Wikipedia page for the first 90 days of 2013. We consider all pages together here, similar to how we aggregated all users for Figure 3.23.
25. Figure 3.25: The heatmap of the conditional probability density function of an inter-edit time, given that it follows another inter-edit time of a certain length. (Both axes are in seconds, and the times are logarithmically binned.) The way to look at this plot is to imagine that we have three successive edits made by a Wikipedia user, which gives us two inter-edit times. The first inter-event time is represented by the x axis in this figure, and assuming we fix that, the distribution of the second inter-edit time is given by the corresponding column. Therefore, the columns of the heatmap sum to 1, and the shade of the cells is a representation of the value of the PDF at that point.
26. Figure 3.26: The daily number of unique editors on Wikipedia who contribute some content (make an edit) at least once a day. The first stage shows accelerating growth, whereas the second stage shows declining daily user counts and has a strong apparent periodic component.
27. Figure 3.27: The first stage of the time series describing the number of daily active users on Wikipedia, this time on a logarithmic vertical axis
28. Figure 3.28: Trend found using median filtering on the log-transformed first stage of editor growth in Wikipedia
29. Figure 3.29: Trend found using median filtering, second stage
30. Figure 3.30: The trend removed from the log-transformed signal, first stage
31. Figure 3.31: The trend removed from the original signal, second stage
32. Figure 3.32: Autocorrelation of the detrended signal
33. Figure 3.33: Autocorrelation of the signal with the daily and weekly seasonalities removed
34. Figure 3.34: ARIMA model applied to forecast the number of Wikipedia editors
Chapter 4
1. Figure 4.1: The first question posted on Stack Exchange in the “Science Fiction & Fantasy” category (http://scifi.stackexchange.com/questions/1/). We labeled the parts of the post that are also stored as part of the downloadable data from the site. Although the body of the post is obviously easy to parse for a human reader, we need to transform it into a format that is amenable for automatic analysis by a computer.
2. Figure 4.2: The distribution of the number of times a given term appears in any post in the Science Fiction & Fantasy section of Stack Exchange. Apparently, the stemmed term frequency distribution follows a power law, which has an exponent of −1.45. (We performed the linear fit on the double logarithmic axes in the range indicated by the lighter straight line, but naturally shifted this line down by approximately one-half a decade so that the curves do not overlap each other.)
3. Figure 4.3: Assume that the items we would like to find groups of can be situated in a two-dimensional Euclidean space, and their “similarities” are exactly the distances between them: The further away they are, the less similar they are to each other. This is a natural way of looking at clustering, where for instance the items are fishing boats on the sea, and we're looking for groups of fishermen who are fishing together.
4. Figure 4.4: A small illustrative dendrogram for a clustering on posts. The leaves of the tree (the small, hanging stubs) correspond to the original items in the clustering, and the horizontal lines represent the distance at which the merging occurred for the two sub-clusters into a new cluster.
5. Figure 4.5: The hierarchy of post cluster merges shown for all of the Science Fiction & Fantasy questions on Stack Exchange, for the merges that took place at a distance of 0.9 or higher. This way we highlight the last stages of the agglomerative clustering algorithm, when it is already operating on the larger clusters that lie below the area shown. Using Listing 4.4, we selected three subtrees of the dendrogram, shaded and labeled as A, B, and C. Underneath each subtree we also show the word clouds of the most frequently appearing terms in their leaf posts. The size of the terms is proportional to the number of times they appear in the union set of all the subtree posts.
6. Figure 4.6: Word clouds for the tags that users attached to their questions, in the same subtrees A, B, and C as in Figure 4.5.
7. Figure 4.7: The number of times that any given tag appears together with any of the questions. We can see a power-law distribution again, with an exponent of −1.4.
8. Figure 4.8: The distributions of the numbers of posts (questions) we find in the clusters if we were to stop the clustering at distances of 0.5, 0.7, and 0.9, respectively. We didn't actually stop the algorithm at any point but instead calculated these sizes using the history of cluster merges with Listing 4.5.
9. Figure 4.9: The expected relative frequency of how often a user uses a tag, versus the rank of a tag. Again, we had to rescale both axes logarithmically to see the function in full detail.
10. Figure 4.10: A Tweet about music. Keywords in tags represent clues for the message. The users are interested in music, and we explicitly see words such as “music” and “instrument” that might help us decide that this Tweet is about music indeed.
11. Figure 4.11: A Tweet about TV broadcast news. Keywords in boxes tell us that this Tweet is about a piece of news as it appeared on television.
12. Figure 4.12: The human brain processes raw text and associates it with latent semantic themes.
13. Figure 4.13: Topic modeling analyzes raw text and finds what topics it is most likely about.
14. Figure 4.14: A Tweet about a proposal to write a blog post about a programming language. We can claim that this Tweet consists of two topics: (1) social/blog: “blog post” is a clue; and (2) tech/computer science: “ecosystem” and “language” are keywords.
15. Figure 4.15: A graphical model for Latent Dirichlet Allocation (LDA). Shaded node w represents the observed words in the text and unshaded nodes represent the latent variables such as topics β, topic proportions θ, and per-word topic assignments z. α is a fixed parameter to the model. The plates represent replications; for instance, there are M documents in the corpus and N words within each document.
16. Figure 4.16: Topic proportions of 10 samples from the corpus. To characterize the topics, we pick the top 5 words from them and concatenate them with periods. The bar height of a topic within each subplot represents the weight of that topic in the corresponding document. The sum of topic proportions within each subplot is 1.
17. Figure 4.17: A graphical model for supervised Latent Dirichlet Allocation (sLDA). The shaded node W represents the observed words in the text as in LDA, and Y represents the response variables (for example, age and gender of a user). Unshaded nodes represent the latent variables: topics β, topic proportions Θ, and per-word topic assignments Z, as in LDA. Moreover, we added supervision parameters η and $images$ . Similarly to LDA, α is a fixed parameter passed to the model. (For the meaning of the parameters, see https://arxiv.org/abs/1003.0783 . The figure originally appeared in that paper.) The plates represent replications. In particular, there are D documents in the corpus, N words within each document, and K topics to learn.
18. Figure 4.18: Coefficients η for each topic connecting topics with political ideology. The higher the coefficient of a topic on the x axis, the higher the probability that the conservative bloggers talk about it, and the lower the coefficient, the higher the probability that liberal users talk about the topic.
19. Figure 4.19: Prediction density for each group of users. The solid curve represents conservative, and the dashed curve represents liberal. Overlapping areas represent potential errors either as false positives or false negatives, depending on a decision threshold.
20. Figure 4.20: A graphical model for relational topic modeling (RTM). d and d′ represent two nodes in the graph. Shaded nodes w_d and w_d′ represent the observed words in the content generated by users d and d′, and y_d,d′ represents the link variable (e.g., friends in a social graph, citations between documents). Unshaded nodes represent the latent variables such as topics β, topic proportions Θ_d and Θ_d′, per-word topic assignments z_d and z_d′ as in LDA, and relational parameters such as η. Similarly to LDA, α is a fixed parameter to the model. The plates represent replications; for instance, there are N_d′ words within each document and K topics to learn. The figure and more explanations can be found at http://proceedings.mlr.press/v5/chang09a/chang09a.pdf.
21. Figure 4.21: Link probabilities by RTM versus LDA for a sample of 100 existing edges.
Chapter 5
1. Figure 5.1: Associative functions, here written as +, allow aggregation over a range of size T with time complexity O(log T) assuming maximal parallelism
2. Figure 5.2: The runtime it takes to estimate the cardinalities of our toy data sets
3. Figure 5.3: Bloom filter hashing to set specific bits of the bit array to 1
4. Figure 5.4: The number of bits per element in BF, and the resulting false positive rate
5. Figure 5.5: Aggregating Bloom filters on reducers
6. Figure 5.6: CMS increments the matrix values pointed to by the indices calculated from hashing the items with each of the d hash functions
7. Figure 5.7: Setting up the Security Group to open the firewall for certain traffic
8. Figure 5.8: Configure the instance type, and select an appropriate VPC and security group
9. Figure 5.9: Selecting the user key
10. Figure 5.10: Discover hosts
11. Figure 5.11: Set the user name and key
12. Figure 5.12: Collaborator access key
13. Figure 5.13: Collaborator password
14. Figure 5.14: IAM access link
15. Figure 5.15: Stopping the cluster with Cloudera Manager
16. Figure 5.16: Stop Amazon EC2 instances
Chapter 6
1. Figure 6.1: A typical Web interface for search engine results
2. Figure 6.2: Netflix on-demand movie library
3. Figure 6.3: A location-based social media service: Foursquare check-ins
4. Figure 6.4: Representation of the observation matrix $images$ (e.g., movie ratings by users, in which rows represent the users and columns represent the items). The image is from “The Indian Buffet Process: An Introduction and Review” by Griffiths et al. (http://jmlr.org/papers/volume12/griffiths11a/griffiths11a.pdf)
5. Figure 6.5: A generic problem-solving paradigm explaining why we transform data into different model representations
6. Figure 6.6: Intuitive explanations for under- and overfitting of models. Underfitting explains the observed data with a curve that is not complex enough to generalize well to new data. Overfitting explains the observed data with a curve of unnecessarily high complexity, which also causes poor generalization. The plot is from http://pingax.com/regularization-implementation-r/.
7. Figure 6.7: Weights for a given latent stereotype movie choice. This is a plot of $images$ where k = 1. We observe that the weights are all positive due to the non-negativity assumption of the model, and that they give high weights to different movies.
8. Figure 6.8: This is similar to Figure 6.7, but this plot is for k = 5.
9. Figure 6.9: The distribution of the actual genres of the top movies for the first latent stereotype we learned from data. This is a plot for $images$ where k = 1. We get the top 10 movies with the highest weights in the stereotype. The genre information is part of the MovieLens dataset. Each movie in the top 10 is a binary vector of genres, and one movie can have multiple genres, e.g., drama, fantasy, animation, etc. This data suggests that drama is the most common genre for this stereotype, whereas the others exist with a secondary relevance.
10. Figure 6.10: The distribution of the actual genres of the top movies for the k = 5 latent stereotype we learned from the data. The genre information states that “horror” is the most common genre.
11. Figure 6.11: This is a plot for $images$ where k = 8. According to movies seen at the top 10 locations in the stereotype (see the following list), we conclude that this stereotype consists of artistic and independent top-rated classic movies.
12. Figure 6.12: Here we plot the factor scores. Every user is represented by a point in the figures, and we mapped users by their ages, genders, and occupations. This is shown for the stereotype k = 1. Recall from Figure 6.9 that this stereotype prefers classic movies. We observe a slight bias towards adults and males. We do not observe any specific bias for occupation.
13. Figure 6.13: The factor scores for the k = 5 stereotype
14. Figure 6.14: The factor scores for the k = 8 stereotype
15. Figure 6.15: Factor scores of a new user after watching and highly rating two movies in the Terminator series. This is a representation of the new user in the latent stereotype space.
16. Figure 6.16: The densities of the predicted probabilities P. The held-out data are first split into two: Positive examples, and negative examples. The distribution of the predicted probabilities is plotted here separately for the two groups: For positives (dashed line) and for negatives (solid line). The more separable the two hills, the better.
17. Figure 6.17: Receiver operating characteristics (ROC) curve. This curve explains how much we gain in the true positive rate as we start making more false positive errors.
18. Figure 6.18: Precision-recall curve. This curve explains the trade-off we make between precision and recall and helps determine the classifier threshold, the operating point.
Chapter 7
1. Figure 7.1: The sampling distributions of the sample means for sample sizes 64, 256, and 1024, respectively. The larger the sample size, the narrower the distribution. Also, the one vertical line at approximately 1.67 is actually three overlapping vertical lines, and these indicate the calculated means of the sampling distributions. According to our expectations, these should be equal to the population mean, 5/3 ≈ 1.67.
2. Figure 7.2: The relative change in the means of two samples, drawn from a normal and a power-law distribution, respectively. We removed 0–90% of the largest values in 5% increments. The removed fraction is shown on the x axis.
3. Figure 7.3: The relative change in the standard deviations of two samples, in similar setups to Figure 7.2.

Social Media Data Mining and Analytics

About the Authors

About the Technical Editors

Credits

Acknowledgments