User Profiling

part of the ArsDigita Community System by Philip Greenspun, Tracy Adams, and Michael Yoon

The Categories Themselves

The profiling system provides an unlimited number of dimensions along which to place a user's interest. Each dimension is represented in the following table:

create table categories (
        category_id     integer not null primary key,
        category        varchar(50) not null,
        category_description    varchar(4000),
        -- e.g., for a travel site, 'country', or 'activity' 
        -- could also be 'language'
        category_type   varchar(50),
        -- language probably would weight higher than activity 
        profiling_weight        number default 1 check(profiling_weight >= 0),
        enabled_p       char(1) default 't' check(enabled_p in ('t','f')),
        mailing_list_info       varchar(4000)
);
Note that this is the same table that holds the dimensions for classifying content on a site. If this were not so, it would be painful to match up users and content.

The category_type column bears mentioning. It isn't used for score computation but only for user interface when querying the user or giving options for the site administrator. The profiling_weight column may be used to disable scoring along a certain dimension or to make language preference much more important than other preferences.

The mailing_list_info is legacy column from time where the predominant function of categorization system was to track down who was spammed about what. It survived only to support some legacy code that may still lurk around (note that current ACS no longer uses this column). The category_description can be used in other places where a user simply wants clarification about a category.

Optional Hierarchy

Suppose that the flat category_type system doesn't work for you. You are running a travel site and want to present the "Appalachian Trail" category to folks who have selected one of the states through which the AT runs (Georgia, Tennessee, North Carolina, Virginia, West Virginia, Maryland, Pennsylvania, New Jersey, New York, Connecticut, Massachusetts, Vermont, New Hampshire, Maine).

You represent this hierarchy with the following table:


create table category_hierarchy (
   parent_category_id     integer references categories,
   child_category_id      integer regerences categories
   unique (parent_category_id, child_category_id)
);
Notice that this isn't a strict hierarchy; a category can have more than one parent. Also notice that we use UNIQUE rather than PRIMARY KEY. That is because we signify the top-level hierarchies with a NULL parent_category_id column.

Mapping content to categories

Similar to the approach taken in the site_wide_search and general_comments module, one table stores all the mappings of content to categories. These content items will be in stored in disparate Oracle tables. Note that this replaces some older tables such as static_categories (mapped static pages to categories).

create sequence site_wide_cat_map_id_seq;

create table site_wide_category_map (
             map_id                  integer primary key,
             category_id             not null references categories,
             -- We are mapping a category in the categories table
             -- to another row in the database.  Which table contains
             -- the row?
             on_which_table          varchar(30) not null,
             -- What is the primary key of the item we are mapping to?
             -- With the bboard this is a varchar so we can't make this
             -- and integer
             on_what_id              varchar(500) not null,
             mapping_date            date not null,
             -- how strong is this relationship?
             -- (we can even map anti-relationships with negative numbers)
             mapping_weight          integer default 5 
                                     check(mapping_weight between -10 and 10),
             -- A short description of the item we are mapping
             -- this enables us to avoid joining with every table
             -- in the ACS when looking for the most relevant content 
             -- to a users' interests
             -- (maintain one_line_item_desc with triggers.)
             one_line_item_desc      varchar(200) not null,
             mapping_comment         varchar(200),
             -- only map a category to an item once
             unique(category_id, on_which_table, on_what_id)
);
To build user and admin interfaces when querying site_wide_category_map, we use the central table_acs_properties, which is shared by side-wide index and the general comments facility.

create table table_acs_properties (
             table_name      varchar(30) primary key,
             section_name    varchar(100) not null,
             user_url_stub   varchar(200) not null,
             admin_url_stub  varchar(200) not null
);
Here is an example entry for the bboard table: You're probably thinking that it would be nice to have table_name reference the Oracle data dictionary view user_tables but this doesn't seem to work.

To build admin pages for inserting, updating, and deleting data in the site_wide_category_map table, you can use the Tcl procs:

ad_categorization_widget

Call ad_categorization_widget within an HTML form to create a user interface for categorizing a specific row in the database. It returns a selection widget that contains options for each category; already mapped categories are pre-selected, and, for each category, the category_type (if one exists) is included in parentheses. If a category hierarchy is defined, then indentation makes it visually apparent.

Suppose that you're running a restaurant-rating site like ZAGAT.COM and you want to categorize restaurants geographically. The ad_categorization_widget for a restaurant with multiple locations like Nobuyuki Matsuhisa's excellent Nobu would look something like this:

ad_categorization_widget takes the following parameters: