Demographic Prediction Based on
User’s Browsing Behavior
Jian Hu, Hua-Jun Zeng, Hua Li, Cheng Niu, Zheng Chen
Microsoft Research
{jianh, hjzeng, huli, chengniu, zhengc}@microsoft.com
Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2007, May 8-12, 2007, ACM 978-1-59593-654-7/07/0005.
ABSTRACT
Demographic information plays an important role in personalized web applications. However, personal data such as age and gender is usually not easy to obtain. In this paper, we make a first attempt to predict users’ gender and age from their Web browsing behavior, in which Webpage view information is treated as a hidden variable that propagates demographic information between different users. There are three main steps in our approach. First, learning from the Webpage click-through data, Webpages are associated with users’ (known) age and gender tendency through a discriminative model. Second, users’ (unknown) age and gender are predicted from the demographic information of the associated Webpages through a Bayesian framework. Third, based on the observation that Webpages visited by similar users may be associated with similar demographic tendency, and that users with similar demographic information tend to visit similar Webpages, a smoothing component is employed to overcome the data sparseness of the web click-through log. Experiments are conducted on a real web click-through log to demonstrate the effectiveness of the proposed approach. The experimental results show that the proposed algorithm can achieve up to 30.4% improvement on gender prediction and 50.3% on age prediction in terms of macro F1, compared to baseline algorithms.
Categories and Subject Descriptors
I.5.2 [Pattern Recognition]: Design Methodology - Classifier design and evaluation.
General Terms
Algorithms, Experimentation, Performance, Human Factors
Keywords
Demographic Prediction, Singular
Value Decomposition, Supervised Regression, Browsing Behavior.
Many general web services, such as search engines and websites, are paying more and more attention to customized service for a better user experience. My Yahoo! [13] and Google Personal [7] are two good examples of this trend. My Yahoo! allows users to specify their preferences explicitly and shows only the sections and details in which they may be interested. Google Personal organizes users’ search results according to their search histories, including their previous search results and the news headlines they clicked.
Along with the prosperity of general web services, online advertising has been growing rapidly in recent years, and behavioral targeting is becoming particularly popular [22]. Behavioral targeting helps advertisers reach the appropriate users based on their behavior while surfing online. As reported in [18], companies like Tacoda Systems, Claria, Revenue Science and TM Advertising provide advertisers with behavioral targeting technologies. According to recent studies by TM Advertising, behavior-based ads gain 115% more business traffic a year than simple web ads, and the targeted consumers also scored 3% higher than average viewers in brand awareness. The user profile, which includes prior search results, demographic information, geographic information and topics of interest, plays a key role in these systems in providing personalized targeting.
However, demographic information is usually not easy to obtain. Internet users are reluctant to expose this kind of personal data to the public. Alternative ways to predict users’ demographic information are therefore of great interest to both industry and academia. In Koppel’s work [11], bloggers’ writing styles are used to predict their actual gender and age. However, only 8% of internet users write blogs [29]. In contrast, the majority of users browse news, products, or other webpages on the internet, which provides a large amount of webpage click-through log data.
Previous studies show that there is a correlation between users’ browsing behavior and their demographic attributes. As reported in “Computerworld” [3], 74% of women seek health or medical information online, while only 58% of men do so; 34% of women seek religious information on the Web, versus 25% of men. Similar phenomena occur in the movie domain, where demographic information correlates with the genres of the movies that audiences appreciate. “Action for men”, “love for women”, and “cartoon for teenager” are common mappings between movie genres and audience demographic categories. The diversity of users’ online browsing activities can therefore be exploited to determine an unknown user’s demographic attributes, such as gender and age.
In this paper we investigate the problem of predicting internet users’ gender and age based on their browsing behavior, in which webpage view information is treated as a hidden variable that propagates demographic information between different users. The solution consists of three steps. First, based on users’ profiles and their browsing history, the users’ age and gender information is propagated to the browsed pages, and a supervised regression model is then trained to predict a Webpage’s gender and age tendency, i.e. the probability distribution of the ages and genders of a given Webpage’s readers. Second, within a Bayesian framework, an internet user’s age and gender are predicted based on the age and gender tendency of the Webpages that he/she has browsed. Our error analysis shows that the prediction model resulting from these two steps suffers from serious data sparseness and hence leaves much room for improvement. In the user browsing history data, many Webpages are browsed by only a few users, and a significant portion of users have a browsing history as short as a few pages. As a result, neither the Webpage demographic tendency prediction nor the user demographic prediction is accurate. To deal with the noise and data sparseness, in the third step a smoothing approach is employed that makes use of the observation that Webpages visited by similar users may be associated with similar demographic tendency, and users visiting similar Webpages may also share similar demographic attributes. First, latent semantic indexing is applied to the user browsing data to derive the similarity among users and the similarity among Webpages. Then linear interpolation is used to combine the content- and category-based demographic prediction with the demographic attributes of similar Webpages and similar users.
Experiments are conducted on a real web click-through log to demonstrate the effectiveness of the proposed approach. The experimental results show that the proposed algorithm can achieve up to 30.4% improvement on gender prediction and 50.3% on age prediction in terms of macro F1, compared with baseline algorithms.
The rest of the paper is organized as follows. In Section 2, we present related work. In Section 3, we define the demographic prediction problem. In Section 4, we propose our solution for demographic prediction. The experimental results are shown in Section 5. We draw conclusions and highlight future research directions in Section 6.
In this section we briefly present
some of the research literature related to demographic prediction.
Previous research on demographic prediction mainly focused on modeling the differences in linguistic writing and speaking styles associated with demographic attributes. The studies in [4, 6] classified users’ gender by differences in spoken language, including intentional, phonological and conversational cues. Mulac et al. studied gender differences in primary and secondary students' impromptu essays [13, 14]. Herring et al. studied gender differences in written electronic communications [5]. Palander studied male and female styles in 17th century correspondence [15]. Biber investigated male and female differences in language structure using a correspondence corpus [2]. Berryman-Fink [1] and Simkins-Bullock [17] investigated male and female writing styles in formal contexts such as books and articles, and asserted that there is no significant difference between male and female writing styles in such contexts. However, there is scant evidence that the differences between male and female writing studied in these works are sufficient to be parlayed into an algorithm for categorizing an unseen text as being authored by a male or by a female [12].
Koppel [12] proposed to automatically categorize written texts by author gender. On a corpus drawn from the British National Corpus, a simple Balanced Winnow algorithm is used with features including function words and part-of-speech n-grams for author gender prediction. This model achieves a classification accuracy of approximately 80%. After analyzing a corpus of tens of thousands of blogs, Koppel [11] found that there are significant differences in both writing style and content between male and female bloggers, as well as among authors of different ages. Based on these differences in blog content and style, they used the Multi-Class Real Winnow algorithm to learn models that classify blogs according to author gender and age, obtaining 80.1% accuracy on gender and 76.2% accuracy on age segmented into three categories (13-17, 23-27, 33-42).
These works mainly focused on classifying users’ demographic attributes based on authorship. As far as we know, there is little work on predicting users’ gender or age according to what they browse on the Web.
Before introducing our technology for demographic prediction, we formalize the problem in this section.
The demographic attributes concerned in this paper are gender and age. We represent a user’s demographic attributes as two vectors, gender and age. Gender prediction is defined as classifying users as male or female, while age prediction is defined as classifying users into one of the groups listed in Table 1.
Table 1. Age groups
| Group | Age |
|---|---|
| Teenage | < 18 |
| Youngster | 18-24 |
| Young | 25-34 |
| Mid-Age | 35-49 |
| Elder | > 49 |
We define the browsing data as a set of records, where each record is a pair consisting of a user and the corresponding Webpages that the user viewed. The browsing data can thus be modeled as a weighted directed bipartite graph G=(V, E). A node in V represents a user or a Webpage, and each edge in E denotes that a user has clicked on a page. We divide the nodes in V into two subsets, U={u1, u2, …, ui} and W={w1, w2, …, wj}, where U represents the users and W represents the Webpages. The adjacency matrix is denoted R, whose element rij is the weight of the edge from user ui to Webpage wj. In this paper, we simply take the weight to be the number of times the user viewed the Webpage.
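As an illustration, the matrix R can be assembled from a click log in a few lines. The sketch below assumes the log has been flattened into one (user, page) pair per view; the function name `build_click_matrix` and the sample records are hypothetical and not part of the original system.

```python
from collections import defaultdict

def build_click_matrix(records):
    """Return (users, pages, R), where R[i][j] counts views of pages[j] by users[i]."""
    counts = defaultdict(int)
    for user, page in records:
        counts[(user, page)] += 1          # edge weight = view frequency
    users = sorted({u for u, _ in counts})
    pages = sorted({p for _, p in counts})
    R = [[counts.get((u, p), 0) for p in pages] for u in users]
    return users, pages, R

# Hypothetical click log: u1 viewed a.html twice and b.html once, u2 viewed b.html once.
records = [("u1", "a.html"), ("u1", "a.html"), ("u1", "b.html"), ("u2", "b.html")]
users, pages, R = build_click_matrix(records)
```

A dense list of lists is used only for readability; a real click log is sparse enough that a sparse representation would be needed.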
Given the webpage click-through log of some users with known demographic attributes, the problem is to find a general method to predict the demographic attributes of other users, for whom these attributes are unknown, from their webpage click-through log.
One intuitive approach to demographic prediction is Collaborative Filtering (CF) [8]. For a user with unknown gender/age, we could “recommend” the user’s gender/age based on users with similar online behavior. However, the webpage click-through log is quite sparse (see the experiments), while CF is quite sensitive to data sparseness [24]. Another simple approach is to train a classifier directly on the user side: we aggregate all the Webpages a user clicked into one document and train a classifier on these documents. Since different users have different tastes for different Webpages, the feature representation of users may contain many more non-discriminative features than that of Webpages, so directly training the classifier on the user side leads to poor classification performance. Our experimental results also show that the user-side classifier performs worse.
In the following subsections, we first predict a Webpage’s gender and age tendency by training a supervised regression model based on users’ self-reported gender and age and their browsing history. Then, based on the age and gender tendency of the Webpages that a user has browsed, we predict the user’s gender and age within a Bayesian framework. To address the data sparseness problem encountered in these two steps, we propose an approach that makes use of the similarity relationships among users and among Webpages.
Since Webpages don’t have explicit demographic attributes, we cannot simply label a Webpage as Male, Female or Teenage directly. Instead, we propose to predict the demographic distribution among the readers of a given Webpage, and the demographic attributes of a Webpage are described as follows:
$$P(c \mid w_j) = \frac{\sum_{i} r_{ij}\, u_i^{c}}{\sum_{i} r_{ij}} \qquad (1)$$
Let $P(c \mid w_j)$ be the probability of a demographic attribute $c$ of the $j$th Webpage, $u_i^{c}$ be the value of the same attribute of the $i$th user, and $r_{ij}$ be the weight of the edge between the $i$th user and the $j$th Webpage. There are seven demographic attributes for a Webpage $w_j$: male, female, teenage, youngster, young, mid-age and elder, and each has a real value $P(c \mid w_j)$. For example, $P(\text{male} \mid w_j)$ is the male tendency of this Webpage. Obviously, the sum of $P(\text{male} \mid w_j)$ and $P(\text{female} \mid w_j)$ is 1, and the sum of $P(\text{teenage} \mid w_j)$ to $P(\text{elder} \mid w_j)$ is equal to 1.
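The click-weighted averaging described here can be sketched as follows. This is a minimal illustration under the assumption that each labeled user carries exactly one value per attribute group (so the indicator for attribute c is 0 or 1); `page_tendency` and the toy data are hypothetical names, and users whose label is not in `classes` are simply skipped.

```python
def page_tendency(R, user_labels, classes):
    """For each page j, return {c: click-weighted share of labeled visitors with label c}."""
    n_pages = len(R[0])
    result = []
    for j in range(n_pages):
        totals = {c: 0.0 for c in classes}
        for i, row in enumerate(R):
            label = user_labels[i]
            if label in totals:            # ignore users with unknown attribute
                totals[label] += row[j]    # r_ij contributes to the user's own class
        norm = sum(totals.values())
        result.append({c: (totals[c] / norm if norm else None) for c in classes})
    return result

# Three users (rows) and two pages (columns); entries are view counts.
R = [[2, 1],
     [0, 1],
     [1, 0]]
labels = ["male", "female", "female"]
tendency = page_tendency(R, labels, ["male", "female"])
```

For the first page, the male user contributes two views and one female user contributes one view, so the male tendency is 2/3 and the female tendency is 1/3.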
To learn the gender and age tendency of Webpages, we need to select some pages for training. Since the gender and age tendency of a Webpage is based on the demographic distribution of its readers, the tendency computed for pages visited by only a few users is not reliable. We therefore select Webpages that are read by at least 10 users. Based on the demographic attributes of a Webpage computed by Equation 1, we use the linear form of Support Vector Machine (SVM) regression [20] to learn the gender and age tendency of Webpages. We learn a separate model for each gender and age attribute. After obtaining the tendency value of each gender/age attribute from its model, we normalize the values to the range [0, 1] using max-min normalization [26], so that the sum of the male and female tendencies is 1 and the sum of the five age-group tendencies (teenage to elder) is 1.
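The normalization step might look like the sketch below. Max-min scaling alone does not make the values within a group sum to 1, so the sketch assumes a second step that divides each group by its total; that second step, the helper names `min_max` and `renormalize`, and the raw scores are all assumptions made for illustration.

```python
def min_max(values):
    """Scale a list of raw regression outputs to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)         # degenerate case: no spread to scale
    return [(v - lo) / (hi - lo) for v in values]

def renormalize(group):
    """Divide the values of one attribute group by their total so they sum to 1."""
    total = sum(group.values())
    if total == 0:
        return {c: 1.0 / len(group) for c in group}
    return {c: v / total for c, v in group.items()}

# Hypothetical raw outputs of the male and female regressors for three pages.
raw_male = [0.2, 0.5, 0.8]
raw_female = [0.9, 0.6, 0.3]
male, female = min_max(raw_male), min_max(raw_female)
pages = [renormalize({"male": m, "female": f}) for m, f in zip(male, female)]
```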
The Support Vector Machine (SVM) model is a powerful classification and regression method built on a solid theoretical foundation, structural risk minimization [21]. Its classification and regression performance is outstanding in practice.
In the linear kernel mode, an SVM constructs the hyper-plane that lies “close” to as many of the data points as possible. The decision function is f(x) = <w·x> + b, where <w·x> is the dot product of the hyper-plane's normal vector w and the example's feature vector x, and b is a constant offset. For an input vector xi and its correct value yi, the aim of the SVM is to select a hyper-plane and threshold (w, b) such that w has a small norm, while simultaneously minimizing the sum of the distances from the data points to the hyper-plane, measured using Vapnik's $\varepsilon$-insensitive loss function:
$$|y_i - f(x_i)|_{\varepsilon} = \max\bigl(0,\; |y_i - f(x_i)| - \varepsilon\bigr) \qquad (2)$$
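For concreteness, the linear decision function and the ε-insensitive loss can be written out directly. This is a small sketch of the standard definitions, not of the training procedure; the function names and the default `eps=0.1` are illustrative choices.

```python
def linear_predict(w, x, b):
    """Linear decision function f(x) = <w, x> + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """Vapnik's loss: errors smaller than eps cost nothing, larger ones grow linearly."""
    return max(0.0, abs(y_true - y_pred) - eps)
```

A prediction within eps of the target incurs zero loss, which is what lets the regressor ignore small deviations in the page tendency values.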
For the purpose of learning the age and gender tendency of Webpages, each selected document is represented as a numerical vector in which each entry represents the weight of a corresponding feature in some feature set. Two kinds of potentially distinguishing features are considered: content-based features and category-based features.
Content-based features
We take the content words of the Webpages as features. We first remove stopwords from the Webpages, and then select content words based on the distribution grade (DG) of a Webpage over the demographic attributes and on Information Gain (IG) [19]. DG can be readily calculated on the basis of the variance coefficient, which normalizes the variance of a distribution by its mean. Taking gender as an example, the calculation is as follows:
|
|
(3) |
The DG measures the variance on gender. The smaller the value, the more evenly the genders are distributed; the bigger the value, the more valuable it is for training. In our work, we set the minimal DG to 1.3. From the pages selected by DG, we take the top 20000 terms sorted by their IG value as the feature set.
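The IG ranking can be illustrated with the textbook definition for a binary term-presence feature. The sketch assumes that each DG-selected page is given a single class label (for example its majority gender); the source does not spell this labeling out, and the names `entropy` and `information_gain` are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(has_term, labels):
    """Reduction in label entropy from knowing whether a term occurs on the page."""
    total = len(labels)
    gain = entropy(labels)
    for value in (True, False):
        subset = [y for t, y in zip(has_term, labels) if t == value]
        if subset:
            gain -= len(subset) / total * entropy(subset)
    return gain

labels = ["male", "male", "female", "female"]
perfect = information_gain([True, True, False, False], labels)   # term splits the classes
useless = information_gain([True, False, True, False], labels)   # term is uninformative
```

Terms would then be sorted by this score and the highest-scoring ones kept as the feature set.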
Category-based features
Because new content that the current model cannot cover emerges on the Web every day, we use a hierarchy of web concepts (or categories) to alleviate the problem. Based on the Web concept hierarchy, we first use SVM to build a hierarchical classifier. Then, all the Webpages in the training data are classified into the concept hierarchy. Finally, based on the demographic attributes of the Webpages in each category, we compute the demographic distribution of each category in the concept hierarchy. The first level of the concept hierarchy is too coarse for demographic prediction. For example, for the category “Health” the majority gender is female, but for its subcategory “Health\Men” the majority gender is male. We therefore build the classifier at a deeper category level. For each Webpage, we take the demographic distribution values of its top 3 classified categories and use them as features.
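One way to turn the top-ranked categories into a feature vector is to concatenate their demographic distributions, as in the sketch below. The function `category_features`, the category names and the distribution values are invented for illustration, and the concatenation order is an assumption.

```python
def category_features(ranked_categories, category_distribution, classes, top_k=3):
    """Concatenate the demographic distributions of a page's top_k categories."""
    features = []
    for cat in ranked_categories[:top_k]:
        dist = category_distribution[cat]
        features.extend(dist[c] for c in classes)
    return features

# Hypothetical per-category gender distributions learned from the training pages.
category_distribution = {
    "Health\\Men": {"male": 0.7, "female": 0.3},
    "Sports\\Soccer": {"male": 0.8, "female": 0.2},
    "Health\\Diet": {"male": 0.3, "female": 0.7},
    "News": {"male": 0.5, "female": 0.5},
}
# Categories as ranked by the hierarchical classifier for one page (best first).
ranked = ["Health\\Men", "Sports\\Soccer", "Health\\Diet", "News"]
features = category_features(ranked, category_distribution, ["male", "female"])
```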
Based on the age and gender
tendency of the Webpages that a user has browsed, we use a Bayesian framework [30]
to predict the user’s demographic attributes. Suppose the pages a user clicked
are independent, then
|
|
(4) |
where {w} is the collection of Webpages clicked by the user $u$, $c$ is a gender attribute (male or female) or an age attribute (teenage, youngster, …, elder), and $P(c \mid w)$, the tendency of Webpage $w$ toward attribute $c$, is obtained from the Webpages’ gender/age tendency prediction.
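A possible reading of this step, under the independence assumption, is to multiply the per-page tendencies for each attribute and renormalize. The sketch below does this in log space; the product form, the implicit uniform prior over attributes, the `floor` guard against zero probabilities and the name `predict_user` are all assumptions, and the exact form used in the paper may differ.

```python
import math

def predict_user(page_tendencies, classes, floor=1e-6):
    """Combine per-page tendencies into one distribution by a normalized product."""
    log_scores = {c: 0.0 for c in classes}
    for tendency in page_tendencies:
        for c in classes:
            # floor avoids log(0) when a page has zero tendency for a class
            log_scores[c] += math.log(max(tendency[c], floor))
    peak = max(log_scores.values())
    unnorm = {c: math.exp(s - peak) for c, s in log_scores.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

# Hypothetical gender tendencies of the two pages a user clicked.
clicked = [{"male": 0.8, "female": 0.2}, {"male": 0.6, "female": 0.4}]
posterior = predict_user(clicked, ["male", "female"])
```

Here the products are 0.48 for male and 0.08 for female, so the normalized male score is 0.48 / 0.56, roughly 0.857.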
Since a user may click pages from different sites on different topics every day, predicting a user’s gender and age by analyzing only a few days of click history is not accurate enough. As people of the same gender or similar age may have similar interests and preferences, they may visit the same or similar pages; thus we can assist the prediction of a user’s gender/age by analyzing the gender/age of users with similar browsing behavior. Likewise, on the Webpage side, by analyzing the gender and age tendency of Webpages visited by similar users, we can assist the prediction of a Webpage’s gender/age tendency. However, a Web site may contain hundreds of thousands of Webpages, while the pages a user clicks are relatively few. Finding similar users or pages in such sparse data may therefore introduce much noise. Latent Semantic Indexing (LSI) [28], which uses Singular Value Decomposition (SVD) as its underlying matrix factorization algorithm, has proved useful in addressing the data sparseness problem in many recommender systems [24, 25]. The reduced orthogonal dimensions resulting from SVD are less noisy than the original data and capture the latent associations between pages and users [24]. In our work, we also use SVD to produce a low-dimensional representation of the original user-page space.
SVD is a well-known matrix factorization technique that factors an $m \times n$ matrix $R$ into three matrices as follows:
$$R = U S V^{T} \qquad (5)$$
where $U$ and $V$ are the matrices of the left and right singular vectors, respectively. The column vectors of $U$ and of $V$ are orthogonal. $S$ is the diagonal matrix of singular values, which satisfy $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0$, where $r$ is the rank of $R$. By setting the smallest $r - k$ singular values in $S$ to zero, the matrix $R$ is approximated by a rank-$k$ matrix, and this approximation is the best one as measured by reconstruction error. Theoretical details on matrix SVD can be found in [23].
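A small worked example may make the truncation concrete. The matrix below is chosen by hand so that its decomposition is trivial; it is not taken from the experimental data.

$$R = \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}^{T}, \qquad \sigma_1 = 3,\ \sigma_2 = 1$$

$$R_1 = \begin{pmatrix} 3 & 0 \\ 0 & 0 \end{pmatrix}, \qquad \lVert R - R_1 \rVert_F = \sigma_2 = 1$$

Setting the smaller singular value to zero gives the rank-1 matrix $R_1$, and the reconstruction error in the Frobenius norm equals the discarded singular value.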
We start with a user-page click matrix that is very sparse; we call this matrix R. To capture meaningful latent relationships, we first remove the sparseness by filling in the user-page click matrix. A constant-based smoothing is used: for pages that a user does not visit, an intuitive and straightforward smoothing method is to replace the zero elements with a small constant c (0<c<1). That is, even if a page p is not visited by user u in the data, it is assumed that page p is in general visited by u with a small probability if u browses the site. We also considered two normalization techniques. For normalization in the user dimension, all the values corresponding to a user u are divided by a constant so that they sum to 1 for each user u; normalization in the page dimension is similar. We found the former approach to provide better results. After normalization, we obtain a filled and normalized matrix $R_{norm}$.
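The filling and user-dimension normalization can be sketched as below. The function name `fill_and_normalize` and the value `c=0.5` in the example are illustrative only; the constant is required to lie strictly between 0 and 1.

```python
def fill_and_normalize(R, c=0.01):
    """Replace zeros with a small constant c, then scale each user's row to sum to 1."""
    filled = [[v if v != 0 else c for v in row] for row in R]
    return [[v / sum(row) for v in row] for row in filled]

# Two users and two pages; zeros mark pages the user never visited.
R = [[2, 0],
     [0, 1]]
R_norm = fill_and_normalize(R, c=0.5)
```

With c = 0.5 the first row becomes [2, 0.5] and is scaled to [0.8, 0.2]. Because c is positive, no row sum can be zero.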
We factor the filled and normalized matrix $R_{norm}$ and obtain a low-rank approximation by applying the following steps:
1. Factor $R_{norm}$ using SVD to obtain U, S and V.
2. Reduce the matrix S to dimension k, obtaining $S_k$ (and, correspondingly, $U_k$ and $V_k$).
3. Compute two resultant matrices, $U_k S_k^{1/2}$ and $S_k^{1/2} V_k^{T}$, which we denote $\hat{U}$ and $\hat{V}$.
Based on the low-dimensional spaces of the user and page sides ($\hat{U}$ and $\hat{V}$), we compute the neighborhood of each user and each page respectively, and then use the demographic attributes of the neighbors to smooth the gender/age tendency learning on the page side and the gender/age prediction on the user side.
There are two kinds of neighbors for a Webpage. The first kind is computed by vector similarity (cosine similarity) in the reduced page space; we denote these neighbors as nbr. The second kind is computed by the similarity of the Webpages’ content; we denote these neighbors as nbp. We use both of them to enhance the prediction of Webpages’ demographic tendency.
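Finding the most similar pages by cosine similarity can be sketched as follows. The vectors stand in for the rows of a reduced page representation; `cosine`, `top_neighbors` and the sample vectors are hypothetical names and data.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_neighbors(vectors, index, n):
    """Indices of the n vectors most similar to vectors[index], excluding itself."""
    scored = [(cosine(vectors[index], v), k) for k, v in enumerate(vectors) if k != index]
    scored.sort(key=lambda t: (-t[0], t[1]))   # highest similarity first, ties by index
    return [k for _, k in scored[:n]]

# Four pages in a hypothetical two-dimensional reduced space.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 1.0]]
nearest = top_neighbors(vectors, 0, 2)
```

The same routine could serve for content-based neighbors if the vectors are replaced by term-weight vectors of the pages.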
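Based on the top N most similar nbr neighbors of page $w_j$, we predict the gender/age tendency of page $w_j$ using the equation below:
$$P_{nbr}(c \mid w_j) = \frac{1}{N} \sum_{i=1}^{N} P(c \mid w_i) \qquad (6)$$
where $P(c \mid w_i)$ is the gender/age tendency probability of the $i$th most similar neighbor ($1 \le i \le N$).
Thus, we can smooth the gender/age tendency of page $w_j$ by
$$P'(c \mid w_j) = (1 - \alpha)\, P(c \mid w_j) + \alpha\, P_{nbr}(c \mid w_j) \qquad (7)$$
where $P(c \mid w_j)$ is the original gender/age tendency value learned by SVM regression, and $\alpha$ is the parameter that controls the influence of the page’s gender/age tendency predicted by its nbr neighbors.
Based on the top M most similar nbp neighbors of page $w_j$, we predict the gender/age tendency of page $w_j$ using the equation below:
$$P_{nbp}(c \mid w_j) = \frac{1}{M} \sum_{k=1}^{M} P(c \mid w_k) \qquad (8)$$
where $P(c \mid w_k)$ is the gender/age tendency of the $k$th most similar neighbor ($1 \le k \le M$).
Then Equation 7 can be extended into Equation 9 as below:
$$P'(c \mid w_j) = (1 - \alpha - \beta)\, P(c \mid w_j) + \alpha\, P_{nbr}(c \mid w_j) + \beta\, P_{nbp}(c \mid w_j) \qquad (9)$$
where $\alpha$ and $\beta$ balance the influence of the gender/age tendency probability based on nbr neighbors and the influence of the gender/age tendency probability based on nbp neighbors.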
Obviously, the smoothing can be turned into an iterative procedure in which the smoothed Webpage demographic attributes are used to update the neighborhood average, which is then used to re-smooth the Webpage demographic attributes. In the experiments reported later, the iterative learning proceeds until the demographic attributes of each page are stable.
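The iterative procedure can be sketched for a single attribute as follows. The sketch assumes that smoothing is a linear interpolation between a page's original value and the plain average of its neighbors' current values, with weight `alpha` on the neighbors; the interpolation form, the stopping tolerance and the name `smooth_pages` are assumptions rather than the exact procedure used in the experiments.

```python
def smooth_pages(initial, neighbors, alpha=0.3, tol=1e-6, max_iter=100):
    """Repeatedly blend each page's original value with its neighbors' current average."""
    current = list(initial)
    for _ in range(max_iter):
        updated = []
        for j, p0 in enumerate(initial):
            nbrs = neighbors[j]
            # pages without neighbors keep their original value
            avg = sum(current[i] for i in nbrs) / len(nbrs) if nbrs else p0
            updated.append((1 - alpha) * p0 + alpha * avg)
        change = max(abs(a - b) for a, b in zip(updated, current))
        current = updated
        if change < tol:                   # values are stable, stop iterating
            break
    return current

# Two pages that are each other's only neighbor, with opposite male tendencies.
smoothed = smooth_pages([0.9, 0.1], [[1], [0]], alpha=0.5)
```

In this toy case the values settle near 0.633 and 0.367: each page is pulled toward its neighbor while staying anchored to its original regression output.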