Demographic Prediction Based on User’s Browsing Behavior


Jian Hu, Hua-Jun Zeng, Hua Li, Cheng Niu, Zheng Chen

Microsoft Research Asia

49 Zhichun Road

Beijing 100080, P.R. China

{jianh, hjzeng, huli, chengniu, zhengc}@microsoft.com


Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others.

WWW 2007, May 8–12, 2007, Banff, Alberta, Canada.

ACM 978-1-59593-654-7/07/0005.


ABSTRACT

Demographic information plays an important role in personalized web applications. However, personal data such as age and gender is usually not easy to obtain. In this paper, we make a first attempt to predict users’ gender and age from their Web browsing behavior, in which the Webpage view information is treated as a hidden variable to propagate demographic information between different users. There are three main steps in our approach. First, learning from the Webpage click-through data, Webpages are associated with users’ (known) age and gender tendency through a discriminative model. Second, users’ (unknown) age and gender are predicted from the demographic information of the associated Webpages through a Bayesian framework. Third, based on the observation that Webpages visited by similar users may be associated with similar demographic tendency, and that users with similar demographic information tend to visit similar Webpages, a smoothing component is employed to overcome the data sparseness of the web click-through log. Experiments are conducted on a real web click-through log to demonstrate the effectiveness of the proposed approach. The experimental results show that the proposed algorithm can achieve up to 30.4% improvement on gender prediction and 50.3% on age prediction in terms of macro F1, compared to baseline algorithms.

Categories and Subject Descriptors

I.5.2 [Pattern Recognition]: Design Methodology - Classifier design and evaluation.

General Terms

Algorithms, Experimentation, Performance, Human Factors

Keywords

Demographic Prediction, Singular Value Decomposition, Supervised Regression, Browsing Behavior.

1.     INTRODUCTION

Many general web services, such as search engines and websites, have started to pay more and more attention to customized service for a better user experience. My Yahoo! [13] and Google Personal [7] are two good examples of these approaches. My Yahoo! allows users to build their preferences explicitly and shows only the sections and details that they may be interested in. Google Personal organizes users’ search results according to their search histories, including their previous search results and the news headlines they clicked. Along with the prosperity of general web services, online advertising has been growing rapidly in recent years, and behavioral targeting is becoming particularly popular [22]. Behavioral targeting helps advertisers target the proper users based on their behavior while surfing online. As reported in [18], companies like Tacoda Systems, Claria, Revenue Science and TM Advertising provide advertisers with behavioral targeting technologies. According to recent studies by TM Advertising, compared to simple web ads, behavior-based ads gain 115% more business traffic a year, and the targeted consumers also scored 3% higher than average viewers in brand awareness. The user profile, which includes prior search results, demographic information, geographic information and topics of interest, plays a key role in these systems in providing personalized targeting.

However, demographic information is usually not easy to obtain. Internet users are reluctant to expose this kind of personal data to the public. Alternative ways to predict users’ demographic information are therefore of great interest to both industry and academia. In Koppel’s work [11], bloggers’ writing styles are used to predict their actual gender and age. However, only 8% of internet users write blogs [29]. In contrast, the majority of users browse news, products, or other webpages on the internet, which provides us with a large amount of webpage click-through log data.

Previous studies show that there is a correlation between users’ browsing behavior and their demographic attributes. As reported in “Computerworld” [3], 74% of women seek health or medical information online, while only 58% of men do so, and 34% of women seek religious information from the Web versus 25% of men. Similar phenomena occur in the movie domain, where demographic information correlates with the genres of the movies that audiences appreciate. “Action for men”, “love for women”, or “cartoon for teenager” are common mappings between movie genres and audience demographic categories. So the diversity of users’ online browsing activities can be exploited to determine an unknown user’s demographic attributes, such as gender and age.

In this paper we investigate the problem of predicting internet users’ gender and age based on their browsing behavior, in which the webpage view information is treated as a hidden variable to propagate demographic information between different users. The solution consists of three steps. First, based on users’ profiles and their browsing history, users’ age and gender information is propagated to the browsed pages, and a supervised regression model is then trained to predict a Webpage’s gender and age tendency, i.e. the probability distribution of the ages and genders of a given Webpage’s readers. Second, within a Bayesian framework, an internet user’s age and gender are predicted based on the age and gender tendency of the Webpages that he or she has browsed. Our error analysis shows that the prediction model resulting from these two steps suffers from serious data sparseness and hence leaves much room for improvement. In the user browsing history data, many Webpages are browsed by only a few users, and a significant portion of users have a browsing history as short as a few pages. So neither the Webpage demographic tendency prediction nor the users’ demographic prediction is accurate. To deal with the noise and data sparseness problem, in Step 3 a smoothing approach is employed, making use of the observation that Webpages visited by similar users may be associated with similar demographic tendency, and users visiting similar Webpages may also share similar demographic attributes. First, latent semantic indexing is applied to the user browsing data to derive the similarity among users and the similarity among Webpages. Then linear interpolation is used to combine the content- and category-based demographic prediction with the demographic attributes of similar Webpages and similar users.

Experiments are conducted on a real web click-through log to demonstrate the effectiveness of the proposed approach. The experimental results show that the proposed algorithm can achieve up to 30.4% improvement on gender prediction and 50.3% on age prediction in terms of macro F1, compared with baseline algorithms.

The rest of the paper is organized as follows. In Section 2, we present related work. In Section 3, we define the demographic prediction problem. In Section 4, we propose our solution for demographic prediction. The experimental results are shown in Section 5. Finally, we draw conclusions and highlight future research directions in Section 6.

2.     RELATED WORKS

In this section we briefly present some of the research literature related to demographic prediction.

Previous research on demographic prediction mainly focused on modeling the diversity of linguistic writing and speaking styles associated with demographic attributes. The works in [4, 6] classified the user’s gender by differences in spoken language, including intentional, phonological and conversational cues. Mulac et al. studied gender differences in primary and secondary students' impromptu essays [13, 14]. Herring et al. studied gender differences in written electronic communications [5]. Palander studied male and female styles in 17th century correspondence [15]. Biber investigated male and female differences in language structure using a correspondence corpus [2]. Berryman-Fink [1] and Simkins-Bullock [17] investigated male and female writing styles in formal contexts such as books and articles, and asserted that there is no significant difference between male and female writing styles in such formal contexts. However, there is scant evidence that the differences between male and female writing studied in these works are sufficient to be parlayed into an algorithm for categorizing an unseen text as being authored by a male or by a female [12].

Koppel [12] proposed to automatically categorize written texts by author gender. Based on a corpus drawn from the British National Corpus, a simple Balanced Winnow algorithm is used with features including function words and part-of-speech n-grams for author gender prediction. This model achieves a classification accuracy of approximately 80%. After analyzing a corpus of tens of thousands of blogs, Koppel [11] found that there are significant differences in both writing style and content between male and female bloggers as well as among authors of different ages. Based on such differences in blog content and style, they used the Multi-Class Real Winnow algorithm to learn models that classify blogs according to author gender and age, and obtained 80.1% accuracy on gender and 76.2% accuracy on age segmented into three categories (13-17, 23-27, 33-42).

These research works mainly focused on classifying users’ demographic attributes based on authorship. As far as we know, there is little work on predicting users’ gender or age according to what they browse on the Web.

3.     PROBLEM DEFINITION

Before introducing our technique for demographic prediction, we formalize the problem in this section.

The demographic attributes considered in this paper are gender and age. We represent a user’s demographic attributes as two vectors, gender and age. Gender prediction is defined as classifying users as male or female, while age prediction is defined as classifying users into one of the groups listed in Table 1.

Table 1. Age Group

Group        Age
Teenage      < 18
Youngster    18-24
Young        25-34
Mid-Age      35-49
Elder        > 49

We define the browsing data as a set of records, where each record is a pair consisting of a user and the Webpages that the user viewed. The browsing data can thus be modeled as a weighted directed bipartite graph G=(V, E). A node in V represents a user or a Webpage, and each edge in E denotes that a user has clicked on a page. We can divide the nodes in V into two subsets, U={u1, u2, …, ui} and W={w1, w2, …, wj}, where U represents the users and W represents the Webpages. A matrix R is used as the adjacency matrix, whose element rij is the weight of the edge from user ui to Webpage wj. In this paper, we simply take the weight to be the number of times the user viewed the Webpage.
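As an illustration of this representation, the following minimal sketch builds the weighted user-page matrix R from a list of (user, page) view records. The record format, the function name `build_click_matrix`, and the sample identifiers are assumptions made for the example and are not taken from the data set used in this paper.

```python
from collections import Counter

def build_click_matrix(records):
    """Build the user-page weight matrix R from (user, page) view records.

    R[i][j] is the number of times user i viewed page j (the edge weight r_ij).
    Users and pages are indexed in order of first appearance.
    """
    users, pages = [], []
    user_index, page_index = {}, {}
    counts = Counter()
    for user, page in records:
        if user not in user_index:
            user_index[user] = len(users)
            users.append(user)
        if page not in page_index:
            page_index[page] = len(pages)
            pages.append(page)
        counts[(user_index[user], page_index[page])] += 1
    R = [[0] * len(pages) for _ in users]
    for (i, j), n in counts.items():
        R[i][j] = n
    return users, pages, R

# Hypothetical log: u1 viewed page "a" twice and page "b" once; u2 viewed "b" once.
log = [("u1", "a"), ("u1", "b"), ("u1", "a"), ("u2", "b")]
users, pages, R = build_click_matrix(log)
```

A dense list of lists is used only for readability; a real click-through log is sparse, so a sparse representation would be the more practical choice.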

Given the webpage click-through log of some users with known demographic attributes, the problem is to find a general method that predicts the demographic attributes of other users, whose attributes are unknown, from their webpage click-through log.

4.     DEMOGRAPHIC PREDICTION

One intuitive way to perform demographic prediction is to use Collaborative Filtering (CF) [8]. For a user with unknown gender/age, we could “recommend” the user’s gender/age based on users with similar online behavior. However, the webpage click-through log is quite sparse (see the experiments section), while CF is quite sensitive to data sparseness [24]. Another simple way is to train a classifier on the user side directly: we can aggregate all the Webpages a user clicked into one document and train a classifier on these user documents. Since different users have different tastes in Webpages, the features of users may contain many more non-discriminative features than those of Webpages, so directly training the classifier on the user side leads to poor classification performance. Our experimental results also show that the classifier on the user side has lower performance.

In the following subsections, we first predict a Webpage’s gender and age tendency by training a supervised regression model based on users’ self-reported gender and age and their browsing history. Then, based on the age and gender tendency of the Webpages that a user has browsed, we predict the user’s gender and age within a Bayesian framework. To address the data sparseness problem suffered by these two steps, we propose an approach that makes use of the similarity relationships among users and among Webpages.

4.1     Webpages’ Demographic Tendency Prediction

4.1.1     Gender and age tendency of Webpages

Since Webpages don’t have explicit demographic attributes, we cannot simply label a Webpage as Male, Female or Teenage directly. Instead, we propose to predict the demographic distribution among the readers of a given Webpage, and the demographic attributes of a Webpage are described as follows:

(1)

Equation 1 defines the probability of a demographic attribute c for the jth Webpage from the value of the same attribute for the ith user and the weight rij of the edge between the ith user and the jth Webpage. There are seven demographic attributes for a Webpage: male, female, teenage, youngster, young, mid-age and elder, and each has a real value. For example, the male attribute gives the male tendency of the Webpage. Obviously, the male and female tendencies sum to 1, and the five age-group tendencies, from teenage to elder, sum to 1.
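The propagation of reader attributes to a page can be sketched as a view-weighted average over the users who visited it. This is only one plausible reading of Equation 1: it assumes that a known user's attribute value is 1 for the user's own group and 0 otherwise, and the function name `page_tendency` and the toy data are hypothetical.

```python
def page_tendency(R, user_labels, classes):
    """Estimate, for every page j, the distribution of its readers over `classes`.

    R[i][j]        -- view count of user i on page j (edge weight r_ij)
    user_labels[i] -- known class of user i (e.g. "male" or "female")
    Returns one {class: probability} dict per page; a page with no views
    gets None because no reader distribution can be estimated for it.
    """
    n_pages = len(R[0]) if R else 0
    tendencies = []
    for j in range(n_pages):
        total = sum(R[i][j] for i in range(len(R)))
        if total == 0:
            tendencies.append(None)
            continue
        dist = {c: 0.0 for c in classes}
        for i, label in enumerate(user_labels):
            dist[label] += R[i][j] / total
        tendencies.append(dist)
    return tendencies

# Toy example: three users (rows) and two pages (columns).
R = [[3, 0],
     [1, 2],
     [0, 2]]
labels = ["male", "female", "female"]
gender = page_tendency(R, labels, ["male", "female"])
```

Under this assumption the values for one attribute group sum to 1 for every visited page, which matches the constraint stated above.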

4.1.2     Learning Gender and age Tendency of Webpages

To learn the gender and age tendency of Webpages, we need to select some pages for training. Since the gender and age tendency of a Webpage is based on the demographic distribution of the readers of this page, the demographic tendency of pages visited by few users is not reliable. So we selected Webpages which are read by at least 10 users. Based on the demographic attributes of a Webpage computed by Equation 1, we use the linear form of Support Vector Machine (SVM) Regression [20] to learn the gender and age tendency of Webpages. For each gender and age attribute, we learn a separate model. After we obtain the tendency value of each gender/age attribute from its model, we normalize the values into the range [0, 1] using max-min normalization [26], so that the male and female tendencies sum to 1, and the five age-group tendencies sum to 1.
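The normalization step can be sketched as follows. The text does not spell out how max-min scaling and the sum-to-1 constraint are combined, so this sketch assumes a two-stage procedure: each attribute's raw regression outputs are first scaled into [0, 1] across pages, and the scaled values of one attribute group are then divided by their per-page sum. All names and numbers are hypothetical.

```python
def min_max_scale(values):
    """Scale a list of raw scores linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]  # constant scores carry no ranking information
    return [(v - lo) / (hi - lo) for v in values]

def normalize_group(raw_scores):
    """raw_scores maps each attribute of one group (e.g. the two gender
    attributes) to its list of raw regression outputs, one per page.
    Returns per-page distributions that sum to 1 within the group."""
    scaled = {c: min_max_scale(v) for c, v in raw_scores.items()}
    n_pages = len(next(iter(scaled.values())))
    result = []
    for p in range(n_pages):
        total = sum(scaled[c][p] for c in scaled)
        if total == 0:
            # No attribute received any mass: fall back to a uniform distribution.
            result.append({c: 1.0 / len(scaled) for c in scaled})
        else:
            result.append({c: scaled[c][p] / total for c in scaled})
    return result

# Hypothetical raw regression outputs for three pages.
raw = {"male": [0.9, 0.2, 0.5], "female": [0.1, 0.7, 0.4]}
dists = normalize_group(raw)
```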

4.1.2.1     Support Vector Machine Regression

The Support Vector Machine (SVM) model is a powerful classification and regression method based on a solid theoretical foundation -- structural risk minimization [21]. The classification and regression performance is outstanding in practice.

In the linear kernel mode, SVM regression constructs the hyper-plane that lies “close” to as many of the data points as possible. The decision function is f(x)=<w·x>+b, where <w·x> is the dot product of the hyper-plane's normal vector w and the example's feature vector x, and b is a constant threshold. For an input vector xi and its correct value yi, the aim of SVM is to select a hyper-plane and threshold (w, b) so that w has a small norm, while simultaneously minimizing the sum of the distances from our points to the hyper-plane, measured using Vapnik's ε-insensitive loss function:

$$|y_i - f(x_i)|_{\varepsilon} = \max\left(0,\; |y_i - f(x_i)| - \varepsilon\right) \qquad (2)$$
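To make the ε-insensitive loss concrete, the short sketch below evaluates the linear decision function and the loss for a single example. The parameter values are hypothetical and chosen only so that the arithmetic is easy to follow; the sketch does not train a model.

```python
def linear_predict(w, b, x):
    """Decision function f(x) = <w, x> + b of a linear model."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Errors smaller than epsilon cost nothing; larger ones cost the excess."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

w, b = [0.5, -0.25], 0.1          # hypothetical model parameters
x, y = [1.0, 2.0], 0.4            # hypothetical page features and target tendency
pred = linear_predict(w, b, x)    # 0.5*1.0 - 0.25*2.0 + 0.1
loss_inside = epsilon_insensitive_loss(y, pred, epsilon=0.5)
loss_outside = epsilon_insensitive_loss(y, pred, epsilon=0.1)
```

With the wider tube (ε = 0.5) the absolute error of about 0.3 is ignored, whereas with ε = 0.1 only the part of the error beyond the tube is penalized.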

4.1.2.2     Features

For the purpose of learning the age and gender tendency of Webpages, each selected document is represented as a numerical vector in which each entry represents the weight of a corresponding feature in some feature set. Two different kinds of potentially distinguishing features are considered: content-based features and category-based features.

Content-based features

We take the content words of the Webpages as the features. We first remove stopwords from the Webpages, and then select content words based on the distribution grade (DG) of a Webpage over the demographic attributes and on Information Gain (IG) [19]. DG can be readily calculated on the basis of the variance coefficient, which normalizes the variance of a distribution by its mean. Taking gender as an example, the calculation is as follows:

  where

(3)

The DG measures the variance on gender. The smaller the value, the more evenly the gender is distributed. The bigger the value, the more valuable the feature is for training. In our work, we set the minimal DG to 1.3. On the pages selected by DG, we select the top 20000 terms sorted by their IG value as the feature set.

Category-Based features

As new content that cannot be covered by the current model emerges on the Web every day, we use a hierarchy of web concepts (or categories) to alleviate the problem. Based on the Web concept hierarchy, we first use SVM to build a hierarchical classifier. Then, all the Webpages in the training data are classified into the concept hierarchy. Finally, based on the demographic attributes of the Webpages in each category of the concept hierarchy, we compute the demographic distribution of the categories. The first level of the concept hierarchy is too coarse for demographic prediction. For example, for the category “Health”, the majority gender is female, but for its subcategory “Health\Men”, the majority gender is male. We therefore build the classifier at a deeper category level. Based on the demographic distribution of the categories in the concept hierarchy, for each Webpage we take the demographic distribution values of its top 3 classified categories and use them as features.

4.2     Users’ Demographic Prediction

Based on the age and gender tendency of the Webpages that a user has browsed, we use a Bayesian framework [30] to predict the user’s demographic attributes. Suppose the pages a user clicked are independent, then

(4)

Where {w} is the collection of Webpages clicked by the user, c is the gender attribute (male or female) or the age attribute (teenage, youngster, …, elder), and the tendency of each clicked Webpage toward c is obtained from the Webpages’ gender/age tendency prediction.
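A minimal sketch of this user-level prediction is given below. It is not a transcription of Equation 4: it assumes that the clicked pages are independent and that the class prior is uniform, so that the posterior is proportional to the product of the page tendencies. The function name, the `floor` parameter and the example values are hypothetical.

```python
import math

def predict_user(page_tendencies, classes, floor=1e-6):
    """Combine page-level tendencies into a user-level posterior.

    page_tendencies -- one {class: P(class | page)} dict per page the user clicked
    Assumes clicked pages are independent and a uniform class prior, so the
    posterior is proportional to the product of the page tendencies.
    """
    log_score = {c: 0.0 for c in classes}
    for dist in page_tendencies:
        for c in classes:
            # The floor avoids log(0) when a page has no readers of class c.
            log_score[c] += math.log(max(dist[c], floor))
    top = max(log_score.values())
    unnorm = {c: math.exp(s - top) for c, s in log_score.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

# Hypothetical tendencies of three pages clicked by one user.
clicked = [
    {"male": 0.8, "female": 0.2},
    {"male": 0.6, "female": 0.4},
    {"male": 0.3, "female": 0.7},
]
posterior = predict_user(clicked, ["male", "female"])
predicted = max(posterior, key=posterior.get)
```

Working in log space keeps the product numerically stable for users with long browsing histories.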

4.3     Demographic Prediction by Leveraging Similarity among Users and Webpages

Since a user may click pages from different sites on different topics every day, predicting a user’s gender and age by analyzing the history of clicked pages within a few days is not accurate enough. As people of similar gender or age may have similar interests and preferences, they may visit the same or similar pages, so we can assist the prediction of a user’s gender/age by analyzing the gender/age of users with similar browsing behavior. Likewise, on the Webpage side, we can assist the prediction of a Webpage’s gender/age tendency by analyzing the gender and age tendency of Webpages visited by similar users. However, a Web site may contain hundreds of thousands of Webpages, and the pages a user clicks are relatively few. Thus, finding similar users or pages in this sparse data may introduce much noise. Latent Semantic Indexing (LSI) [28], which uses Singular Value Decomposition (SVD) as its underlying matrix factorization algorithm, has proved useful for addressing the data sparseness problem in many recommender systems [24, 25]. The reduced orthogonal dimensions resulting from SVD are less noisy than the original data and capture the latent associations between the pages and users [24]. In our work, we also use SVD to produce a low-dimensional representation of the original user-page space.

4.3.1     Singular Value Decomposition

SVD is a well-known matrix factorization technique that factors a matrix R into three matrices as the following:

$$R = U S V^{T} \qquad (5)$$

Where U and V are the matrices of the left and right singular vectors, and the column vectors of U and of V are orthogonal. S is the diagonal matrix of singular values, which satisfy $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0$, where r is the rank of R. By setting the smallest r−k singular values in S to zero, the matrix R is approximated by a rank-k matrix, and this approximation is the best rank-k approximation in terms of reconstruction error. Theoretical details on matrix SVD can be found in [23].

We start with a user-page click matrix R that is very sparse. To capture meaningful latent relationships, we first remove the sparseness by filling in the user-page click matrix. A constant-based smoothing is used: for pages that a user did not visit, an intuitive and straightforward smoothing method is to replace the zero elements with a small constant c (0<c<1). That is, even if a page p is not visited by user u in the data, it is assumed that page p is in general visited by u with a small probability if u browses the site. We also considered two normalization techniques. For normalization in the user dimension, all the values corresponding to u are divided by a constant so that the values sum to 1 for each user u. Normalization in the page dimension is similar. We found the former approach to provide better results. After normalization, we get a filled and normalized matrix.
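The filling and user-dimension normalization described above can be sketched in a few lines. The default value of the constant and the function name are assumptions; the text only requires 0 < c < 1.

```python
def fill_and_normalize(R, c=0.01):
    """Replace zero entries with the small constant c (0 < c < 1), then
    rescale every user row so that its entries sum to 1."""
    if not 0 < c < 1:
        raise ValueError("c must lie strictly between 0 and 1")
    filled = [[v if v != 0 else c for v in row] for row in R]
    return [[v / sum(row) for v in row] for row in filled]

# Two users, three pages; c=0.5 is used here only to keep the numbers simple.
R = [[2, 0, 0],
     [0, 1, 1]]
R_norm = fill_and_normalize(R, c=0.5)
```

Normalization in the page dimension would apply the same division to columns instead of rows.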

We factor the filled and normalized matrix and obtain a low-rank approximation by applying the following steps:

1.        Factor the matrix using SVD to obtain U, S and V.

2.        Reduce the matrix S to dimension k.

3.        Compute the two resultant matrices, which give the reduced representations of the user side and of the page side.

Based on the low-dimensional spaces of the user and page sides, we compute the neighborhood of each user and page respectively, and then use the demographic attributes of the neighbors to smooth the gender/age tendency learning on the page side and the gender/age prediction on the user side.
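The steps above can be sketched with NumPy as follows. Scaling both the user side and the page side by the square root of the retained singular values is an assumption of this sketch (a common choice in SVD-based recommenders), not something the text confirms; the function names and the example matrix are hypothetical.

```python
import numpy as np

def reduced_spaces(R_norm, k):
    """Truncated SVD of the filled, normalized user-page matrix.

    Returns one k-dimensional row vector per user and one per page.
    Scaling both sides by sqrt(S_k) is an assumption of this sketch.
    """
    U, s, Vt = np.linalg.svd(np.asarray(R_norm, dtype=float), full_matrices=False)
    root = np.sqrt(s[:k])
    user_vecs = U[:, :k] * root          # shape: (n_users, k)
    page_vecs = Vt[:k, :].T * root       # shape: (n_pages, k)
    return user_vecs, page_vecs

def top_neighbors(vectors, index, n):
    """Indices of the n rows most cosine-similar to row `index` (itself excluded)."""
    norms = np.linalg.norm(vectors, axis=1)
    norms[norms == 0] = 1.0              # avoid division by zero for all-zero rows
    unit = vectors / norms[:, None]
    sims = unit @ unit[index]
    sims[index] = -np.inf                # never return the query row itself
    return [int(i) for i in np.argsort(-sims)[:n]]

# Hypothetical matrix: users 0 and 1 browse alike, user 2 differs.
R_norm = [[0.6, 0.3, 0.1],
          [0.5, 0.4, 0.1],
          [0.1, 0.1, 0.8]]
user_vecs, page_vecs = reduced_spaces(R_norm, k=2)
nearest_user = top_neighbors(user_vecs, 0, 1)
```

The same `top_neighbors` call applied to `page_vecs` yields the page neighborhoods in the reduced space.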

4.3.2     Smooth Webpages’ Demographic Tendency Prediction

There are two kinds of neighbors for a Webpage: one is the neighbors computed by the vector similarity (cosine similarity) in the reduced space, which we denote as nbr; the other is the neighbors computed by the similarity of the Webpages’ content, which we denote as nbp. We use both of them to enhance the Webpages’ demographic tendency prediction.

Based on the top N most similar nbr neighbors of a page, we predict the gender/age tendency of the page using the Equation below:

(6)

Where the neighbor term is the gender/age tendency probability of the ith (1≤i≤N) neighbor.

Thus, we can smooth the gender/age tendency of the page by

(7)

Where the original gender/age tendency value is the one learned by SVM regression, and a weighting parameter controls the influence of the page’s gender/age tendency predicted from its nbr neighbors.

Based on the top M most similar nbp neighbors of the page, we predict the gender/age tendency of the page using the Equation below:

(8)

Where the neighbor term is the gender/age tendency of the jth (1≤j≤M) neighbor.

Then, Equation 7 can be extended into Equation 9 as below:

(9)

Where two weighting parameters are used to balance the influence of the gender/age tendency probability based on nbr neighbors and that based on nbp neighbors.
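The smoothing of a page's tendency with its two neighborhoods can be sketched as a linear interpolation. Two points are assumptions here rather than the form of Equations 6 to 9: each neighborhood estimate is an unweighted mean over the neighbors, and the page's own value keeps the weight left over after the two parameters, named `alpha` and `beta` in this sketch. The data are hypothetical.

```python
def neighbor_average(tendency, neighbors):
    """Mean tendency over the given neighbor page indices (unweighted: an assumption)."""
    return sum(tendency[n] for n in neighbors) / len(neighbors)

def smooth_page(j, tendency, nbr, nbp, alpha, beta):
    """Interpolate page j's own tendency with its two neighborhood estimates.

    tendency -- list of tendency values for one attribute, one value per page
    nbr[j]   -- neighbors of page j in the reduced SVD space
    nbp[j]   -- neighbors of page j by content similarity
    alpha and beta weight the nbr and nbp estimates; the page's own value
    keeps the remaining weight 1 - alpha - beta (assumed form).
    """
    if alpha < 0 or beta < 0 or alpha + beta > 1:
        raise ValueError("alpha and beta must be non-negative with alpha + beta <= 1")
    own = tendency[j]
    from_nbr = neighbor_average(tendency, nbr[j]) if nbr[j] else own
    from_nbp = neighbor_average(tendency, nbp[j]) if nbp[j] else own
    return (1 - alpha - beta) * own + alpha * from_nbr + beta * from_nbp

# Hypothetical male tendency of four pages and their neighbor lists.
male = [0.9, 0.7, 0.5, 0.1]
nbr = {0: [1, 2], 1: [0], 2: [3], 3: [2]}
nbp = {0: [1], 1: [0, 2], 2: [1], 3: []}
smoothed_0 = smooth_page(0, male, nbr, nbp, alpha=0.25, beta=0.25)
```

A page without neighbors of one kind simply falls back to its own value for that term, so the result stays within [0, 1].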

Obviously, the smoothing can be further turned into an iterative procedure, where the smoothed Webpage demographic attributes are used to update the neighborhood average, which is then used to re-smooth the Webpage demographic attributes. In the later experiments, the iterative learning proceeds until the demographic attributes of each page are stable.
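The iterative procedure can be sketched as a fixed-point loop. For brevity the sketch uses a single neighborhood and a single weight `alpha`; the update rule, the stopping tolerance `tol` and the iteration cap `max_iter` are assumptions, since the text only states that iteration continues until the attributes are stable.

```python
def iterate_smoothing(original, neighbors, alpha, tol=1e-6, max_iter=100):
    """Repeat neighbor smoothing until no page tendency changes by more than tol.

    original  -- tendency values learned by the regression model (kept fixed)
    neighbors -- neighbors[j] lists the neighbor pages of page j
    Each round mixes the original value with the average of the current
    smoothed values of the neighbors (assumed update rule).
    Returns the smoothed values and the number of rounds performed.
    """
    if not original:
        return [], 0
    current = list(original)
    rounds = 0
    for rounds in range(1, max_iter + 1):
        updated = []
        for j, own in enumerate(original):
            if neighbors[j]:
                avg = sum(current[n] for n in neighbors[j]) / len(neighbors[j])
            else:
                avg = own
            updated.append((1 - alpha) * own + alpha * avg)
        change = max(abs(a - b) for a, b in zip(updated, current))
        current = updated
        if change < tol:
            break
    return current, rounds

# Hypothetical tendencies of three pages arranged in a chain of neighbors.
original = [0.9, 0.7, 0.1]
neighbors = [[1], [0, 2], [1]]
smoothed, rounds = iterate_smoothing(original, neighbors, alpha=0.3)
```

With a weight below 1 each round shrinks the change of the previous round, so the loop settles well before the iteration cap in this example.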