In this blog post, I'll describe how we designed the MongoDB schema for the Facebook-like newsfeed in Entexis, a modern recruiting application that helps organizations optimize their recruiting efforts, provide online application forms, etc.

MongoDB is a schema-less document database. Nonetheless, schema design is as important as ever, if not even more important. Why so?

Mostly because schema design in NoSQL is generally determined not by the data you want to store, but by the operations you want to perform. This is a fundamentally different approach. What I describe here works well for us, but it will almost certainly not work well for, say, Twitter, where follower counts are probably heavily skewed (a few accounts with millions of followers) rather than evenly distributed.

Entexis Newsfeed Screenshot

Newsfeed Basics

You certainly know news feeds from Facebook, LinkedIn and many other software-as-a-service applications. A news feed's goal is to provide the user with relevant information, usually sorted roughly by time, if not delivered in real time.

The key to a good news feed is relevancy scoring (i.e., what does the user really care about?), which is often strongly, but not exclusively, based on recency and on the relationship between reader and author. Recency does not refer only to the action that triggered the news item itself; it might also be the time the last comment was posted, or the time somebody interacted with the item in any other way.

At first glance, this might seem like a perfect example for a join. Users post updates to the general news stream, and each user has her own 'view' on that stream where some information is highlighted and other information is invisible, based on who follows whom. But the complexity of the scoring and the size of the action table make joins very hard, if not impossible, to write and to compute.

Let's say there is an action "Mary Fanning applied as a software engineer" that is three weeks old. If your recruiting efforts are going somewhere, chances are that this entry currently ranks around 300th in your personal news feed. Now, a new comment is created. Obviously, for someone to see it, the news item must move up. Yet it doesn't have to pop up for everyone: it should rank very high only for people in the same department, or for those who are assigned to Mary Fanning. If the comment was posted by your supervisor, you might want to see it as well.

Now, if you don't do joins, you usually have to copy (or denormalize) data. In fact, we'll do a bit of both, referencing and copying. We store each user's individual news feed items in a collection, but those items don't contain a copy of the actual action information, for a number of reasons: most importantly the data size, and the pain (and performance hit) incurred whenever an action must be updated.

Tracking Activity

Let's back off a bit. First, we need to keep track of what happened in the first place. Let's call these items Action:

Action {
    ActorId : {UserId},
    Conducted: DateTime,
    IsHidden: bool,
    Scope: {Team | Department | Restricted},
    FanOut: {Pending | InProgress | Done}
    // and a descriptor that describes what actually happened
    // the descriptor is embedded, so fetching is very fast
}

An action is created whenever you do something. Let's say you rate an applicant for "Senior Software Engineer". First, we create the actual rating in the database. Second, the ActionService is invoked through an event aggregator and writes the new action to the database.

Of course, we don't have to use a separate collection for this. Instead, we could monitor the actual ratings. However, by separating the actions from the actual data, the code that is concerned with the news feed does not have to read or understand the 'actual' data, which allows for much better isolation and cleaner code. Also, each Action carries a FanOut field. It serves as a pessimistic lock flag: a worker marks an action InProgress before processing it, which lets several workers run in parallel and makes jobs that were aborted midway easy to find.
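The FanOut flag can be read as a small state machine. Here is a minimal in-memory sketch in Python (not the C# code used in Entexis). The `claimed_at` field, the `claim_next` and `requeue_stale` helpers and the five-minute timeout are assumptions made for illustration. Against MongoDB, the check-and-set inside `claim_next` would have to be a single atomic update (for example a find-and-modify) rather than a Python lock.

```python
import threading
from datetime import datetime, timedelta

_lock = threading.Lock()  # stands in for the database's atomic update


def claim_next(actions, now):
    """Atomically move one Pending action to InProgress and return it."""
    with _lock:
        for action in actions:
            if action["FanOut"] == "Pending":
                action["FanOut"] = "InProgress"
                action["claimed_at"] = now  # hypothetical bookkeeping field
                return action
    return None


def requeue_stale(actions, now, timeout=timedelta(minutes=5)):
    """Reset actions whose worker apparently died in the middle of the job."""
    requeued = 0
    with _lock:
        for action in actions:
            stuck = action["FanOut"] == "InProgress"
            if stuck and now - action["claimed_at"] > timeout:
                action["FanOut"] = "Pending"
                requeued += 1
    return requeued


actions = [{"_id": 1, "FanOut": "Pending"}, {"_id": 2, "FanOut": "Pending"}]
t0 = datetime(2012, 1, 1, 12, 0)
first = claim_next(actions, t0)   # worker A takes action 1
second = claim_next(actions, t0)  # worker B takes action 2, never action 1
first["FanOut"] = "Done"          # worker A finishes; worker B crashes
requeue_stale(actions, t0 + timedelta(minutes=10))  # action 2 is Pending again
```

Because a claimed action is no longer Pending, two workers can never pick up the same action, and anything that stays InProgress for too long is a candidate for a retry.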

Fanning Out

The second step is to fan this information out to the individual users' news feeds. However, we don't want to copy all the information; we just store a reference to the Action:

NewsfeedItem {
    UserId: {UserId},
    ActionId: {ActionId},
    Relevancy: DateTime
}

Here, UserId refers to the id of the reading user (i.e., the owner of the NewsfeedItem).

Note that we don't use the DBRef feature, because we know which collection we refer to. We simply store the foreign key. Keep in mind that MongoDB performs no referential integrity checks, whether you use DBRef or not.

A background job now iterates over all Actions that haven't been processed yet and creates a NewsfeedItem for each of your 'followers'. In that step, relevancy is determined as well. Still in private beta, Entexis doesn't yet allow following (or un-following) specific users; all actions are distributed according to the Scope parameter of the action. However, since the approach is so simple, that functionality can easily be added to the fan-out process: if you un-followed a user, no NewsfeedItem would be created for you.
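A rough Python sketch of that fan-out step, using plain lists instead of collections. The user records with `team` and `department` fields, the `recipients_for` helper and the `AllowedUserIds` list for the Restricted scope are all assumptions; the post does not say how Entexis actually resolves a Scope.

```python
from datetime import datetime

users = [
    {"_id": "u1", "team": "hr", "department": "people"},
    {"_id": "u2", "team": "hr", "department": "people"},
    {"_id": "u3", "team": "dev", "department": "people"},
    {"_id": "u4", "team": "ops", "department": "infra"},
]


def recipients_for(action, users):
    """Hypothetical scope resolution: who gets a NewsfeedItem for this action."""
    actor = next(u for u in users if u["_id"] == action["ActorId"])
    if action["Scope"] == "Team":
        return [u for u in users if u["team"] == actor["team"]]
    if action["Scope"] == "Department":
        return [u for u in users if u["department"] == actor["department"]]
    # "Restricted": modeled as an explicit allow-list (an assumption)
    return [u for u in users if u["_id"] in action.get("AllowedUserIds", [])]


def fan_out(action, users, now):
    """Create one NewsfeedItem per recipient; only a reference is stored."""
    items = [
        {"UserId": u["_id"], "ActionId": action["_id"], "Relevancy": now}
        for u in recipients_for(action, users)
    ]
    action["FanOut"] = "Done"
    return items


action = {"_id": "a1", "ActorId": "u1", "Scope": "Department",
          "FanOut": "InProgress"}
items = fan_out(action, users, datetime(2012, 1, 1, 12, 0))
```

A follow or un-follow feature would only change `recipients_for`; the NewsfeedItem shape stays the same.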

The details of relevancy scoring are convoluted and don't affect schema design. Relevancy is a DateTime because we need a score that stays globally comparable forever, rather than one that is only valid for a while: if the maximum relevancy were a fixed value, say 100, we'd have to update all older items to push them down whenever something new came in. Instead, the 'maximum relevancy' is simply DateTime.Now, which grows automatically. No need to shift old items down.
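To illustrate why a timestamp works as a score, here is a tiny Python sketch with made-up dates and a hypothetical `bump` helper. Moving one item to the top means writing one field on one item; nothing else in the feed is touched.

```python
from datetime import datetime

feed = [
    {"ActionId": "applied", "Relevancy": datetime(2012, 1, 1)},
    {"ActionId": "rated", "Relevancy": datetime(2012, 1, 15)},
    {"ActionId": "hired", "Relevancy": datetime(2012, 1, 20)},
]


def bump(feed, action_id, now):
    """Raise one item to the top by updating a single field on a single item."""
    for item in feed:
        if item["ActionId"] == action_id:
            item["Relevancy"] = now


def ranked(feed):
    """Action ids ordered by descending relevancy, as the feed displays them."""
    ordered = sorted(feed, key=lambda i: i["Relevancy"], reverse=True)
    return [i["ActionId"] for i in ordered]


before = ranked(feed)                          # hired, rated, applied
bump(feed, "applied", datetime(2012, 1, 22))   # a new comment arrives
after = ranked(feed)                           # applied, hired, rated
```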

The Queries

In order to display a user's news feed, all we have to do is this:

// find the ids of the 25 most relevant actions for the current user
var actionIds = 
    _db.Find<NewsfeedItem>(Query.EQ("UserId", AuthenticatedUser.Id)).
            SetSortOrder(SortOrder.Descending("Relevancy")).
            SetLimit(25).
            Select(p => p.ActionId).
            ToList();
// fetch those actions in a single query
var feedActions = _db.Find<Action>(Query.In("_id", actionIds));

We now have all actions that belong in the current user's news feed. One caveat: an $in query does not return documents in the order of the id list, so feedActions has to be re-sorted to match actionIds before it is displayed.
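The same two round-trips can be sketched in plain Python over in-memory data, including the client-side step that puts the fetched actions back into relevancy order. The sample data and the `load_feed` helper are illustrative assumptions, not Entexis code.

```python
from datetime import datetime

newsfeed_items = [
    {"UserId": "u1", "ActionId": "a1", "Relevancy": datetime(2012, 1, 3)},
    {"UserId": "u1", "ActionId": "a2", "Relevancy": datetime(2012, 1, 9)},
    {"UserId": "u2", "ActionId": "a2", "Relevancy": datetime(2012, 1, 5)},
    {"UserId": "u1", "ActionId": "a3", "Relevancy": datetime(2012, 1, 6)},
]
actions = {  # keyed by _id, like a primary-key lookup
    "a1": {"_id": "a1", "text": "Mary Fanning applied"},
    "a2": {"_id": "a2", "text": "John Doe rated Mary Fanning"},
    "a3": {"_id": "a3", "text": "Interview scheduled"},
}


def load_feed(user_id, limit=25):
    # Round-trip 1: the user's items, highest relevancy first, cut to `limit`.
    mine = [i for i in newsfeed_items if i["UserId"] == user_id]
    mine.sort(key=lambda i: i["Relevancy"], reverse=True)
    action_ids = [i["ActionId"] for i in mine[:limit]]
    # Round-trip 2: an $in-style fetch. Its result order is not tied to the
    # order of the id list, which is modeled here by going through a set.
    fetched = {a_id: actions[a_id] for a_id in set(action_ids) if a_id in actions}
    # Restore relevancy order on the client.
    return [fetched[a_id] for a_id in action_ids if a_id in fetched]


feed = load_feed("u1")
```

The final list comprehension also silently skips ids whose action no longer exists, which is one way to tolerate a dangling reference.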

Inserting Actions is easy, as is the fan-out, which really just iterates over the Action collection. If a user wants to delete an action, we have to delete the referencing NewsfeedItems as well ("on delete cascade", if you will).
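Since MongoDB won't cascade for us, the application has to. A minimal Python sketch with a hypothetical `delete_action` helper follows; deleting the references before the action itself is my own suggestion, not something the post prescribes.

```python
def delete_action(action_id, actions, newsfeed_items):
    """Remove the action and every NewsfeedItem that references it."""
    # Delete the references first: if the job is interrupted, the worst case
    # is an action nobody sees rather than feed items pointing at nothing.
    remaining = [i for i in newsfeed_items if i["ActionId"] != action_id]
    removed = len(newsfeed_items) - len(remaining)
    newsfeed_items[:] = remaining
    actions.pop(action_id, None)
    return removed


actions = {"a1": {"_id": "a1"}, "a2": {"_id": "a2"}}
newsfeed_items = [
    {"UserId": "u1", "ActionId": "a1"},
    {"UserId": "u2", "ActionId": "a1"},
    {"UserId": "u1", "ActionId": "a2"},
]
removed = delete_action("a1", actions, newsfeed_items)
```

In the real collection, finding items by ActionId efficiently would presumably need its own index on that field.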

Indexes

Since we need to find all news feed items for a specific user, we definitely need to index that field. In the very same query, we want to sort by relevancy, so we use this index: {"UserId" : 1, "Relevancy" : -1}.
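To see why this compound index serves both the filter and the sort, here is a toy model in Python: a sorted list of tuples stands in for the index, and integers of the form yyyymmdd stand in for the DateTime values. The `top_actions` helper is hypothetical.

```python
import bisect

# Index entries for {"UserId": 1, "Relevancy": -1}: sorted by user, then by
# descending relevancy (negated here so plain tuple ordering does the work).
index = sorted([
    ("u1", -20120103, "a1"),
    ("u1", -20120109, "a2"),
    ("u2", -20120105, "a2"),
    ("u1", -20120106, "a3"),
    ("u3", -20120101, "a1"),
])


def top_actions(user_id, limit):
    """Seek to the user's first entry, then read up to `limit` entries in order."""
    start = bisect.bisect_left(index, (user_id,))
    result = []
    for uid, _neg_relevancy, action_id in index[start:start + limit]:
        if uid != user_id:
            break  # walked past this user's range
        result.append(action_id)
    return result


newest_two = top_actions("u1", 2)
```

The entries for one user sit next to each other and are already in display order, so the query is one seek plus a short sequential read, with no separate sort step.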

The actions are fetched with an $in query, which is an efficient way to fetch multiple items because it avoids the n+1 problem: we need only two round-trips to the database, no matter how many items are displayed.

When I now post a comment, relevancy has to be re-calculated for each 'subscriber', i.e. everyone who can see this information in their news feed. Again, this process is well isolated and is performed by the background worker process.

The Ids

We haven't discussed the data type of the ids yet. Since we're a .NET shop, we normally use Guids as primary keys everywhere. However, Guids are distributed randomly, which is not the best fit for a news feed. It makes more sense to use MongoDB's ObjectIds instead, which increase roughly monotonically and thus improve data locality.

Whether this has a significant impact is currently hard to say, but it seems to be a good idea to not touch old items very often.
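The difference is easy to see in a small Python sketch. `object_id_like` only imitates the ObjectId layout (a 4-byte timestamp, 5 random bytes, a 3-byte counter); it is not the driver's generator, and the starting second is arbitrary.

```python
import os
import struct
import uuid

_PROCESS_RANDOM = os.urandom(5)  # fixed for the lifetime of the process
_counter = 0


def object_id_like(timestamp):
    """12 bytes laid out like an ObjectId: time, random filler, counter."""
    global _counter
    _counter = (_counter + 1) % 0x1000000
    return (struct.pack(">I", timestamp)      # 4-byte big-endian seconds
            + _PROCESS_RANDOM                 # 5 bytes, constant per process
            + _counter.to_bytes(3, "big"))    # 3-byte incrementing counter


# One id per day: the leading timestamp makes them sort by creation time.
oids = [object_id_like(1_325_419_200 + day * 86_400) for day in range(5)]

# Random GUIDs carry no such ordering; new keys land anywhere in the key space.
guids = [uuid.uuid4().bytes for _ in range(5)]
```

With time-prefixed ids, new documents always go to the "right edge" of the _id index, while random ids spread inserts across the whole index.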

Our news feed is now essentially complete, but we haven't talked about the very important actual action descriptors.

The Action Descriptors

To make those news feed items look right and have all the functionality we need, we have to store action descriptor objects. For instance, an application rating might look like this:

ApplicationRatingDescriptor {
    RaterId: {UserId},
    PositionId: {PositionId},
    ApplicantId: {ApplicantId},
    ApplicantName: "Mary Fanning",
    PositionName: "Senior Software Engineer",
    RaterName: "John Doe",
    RatingScore: 4.5,
    RatingText: "Really great candidate, but didn't eat the fortune cookie!",
    Tags: ["skilled", "friendly", "no-cookie"]
}


Note how applicant name, position name, etc. are all denormalized so we don't have to perform lots of subqueries or joins. This makes the queries very fast, but it requires an update whenever the name of a position, applicant or user changes.
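That update is the price of denormalization. A minimal Python sketch of what it amounts to is shown here; the embedded field name `Descriptor` and the `rename_applicant` helper are assumptions for illustration.

```python
def rename_applicant(actions, applicant_id, new_name):
    """Rewrite the copied ApplicantName in every descriptor that carries it."""
    updated = 0
    for action in actions:
        descriptor = action["Descriptor"]
        if descriptor.get("ApplicantId") == applicant_id:
            descriptor["ApplicantName"] = new_name
            updated += 1
    return updated


actions = [
    {"_id": "a1", "Descriptor": {"ApplicantId": 7, "ApplicantName": "Mary Fanning"}},
    {"_id": "a2", "Descriptor": {"ApplicantId": 8, "ApplicantName": "Someone Else"}},
    {"_id": "a3", "Descriptor": {"ApplicantId": 7, "ApplicantName": "Mary Fanning"}},
    {"_id": "a4", "Descriptor": {"PositionName": "Senior Software Engineer"}},
]
updated = rename_applicant(actions, 7, "Mary Fanning-Doe")
```

Names change rarely and feeds are read constantly, so paying for an occasional multi-document update in exchange for join-free reads is a reasonable trade here.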

The big point here is that there are dozens of these types. In a database with a fixed schema, we'd have to create dozens of tables (I wouldn't), or use a serialized XML or blob column. The former is very annoying and also quite tricky, because we'd have to query multiple tables to assemble a single feed. No good.

The XML or blob-approach comes with other disadvantages: we'd have to cope with another form of serialization, it's hard or impossible to read in the database, and it can't be indexed. Let's take a quick look at the code:

class ActionDescriptor 
{
    public List<string> Tags { get; set; }
}

class ActionDescriptorApplicationRating 
           : ActionDescriptor
{
    // ...
}

class ActionDescriptorNewApplicant : ActionDescriptor { }
class ActionDescriptorApplicantHired : ActionDescriptor { }
class ActionDescriptorApplicantRatingRequested : ActionDescriptor { }
// ...

With this inheritance in place, every descriptor has Tags, so an index on Tags is easy to justify. And if we really need to fetch all Actions by RaterId, a field that most ActionDescriptor classes don't have, we can create a sparse index on RaterId, so that only documents that actually contain the field take up space in the index. Eat that, XML!
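The idea behind a sparse index fits in a few lines of Python. This is a toy model with a hypothetical `build_sparse_index` helper and an assumed `Descriptor` field name, not how MongoDB stores its indexes.

```python
from collections import defaultdict


def build_sparse_index(actions, field):
    """Map field value -> action ids, skipping documents without the field."""
    index = defaultdict(list)
    for action in actions:
        descriptor = action["Descriptor"]
        if field in descriptor:  # documents lacking the field get no entry
            index[descriptor[field]].append(action["_id"])
    return dict(index)


actions = [
    {"_id": "a1", "Descriptor": {"RaterId": "u7", "RatingScore": 4.5}},
    {"_id": "a2", "Descriptor": {"ApplicantName": "Mary Fanning"}},
    {"_id": "a3", "Descriptor": {"PositionName": "Senior Software Engineer"}},
    {"_id": "a4", "Descriptor": {"RaterId": "u7", "RatingScore": 3.0}},
]
rater_index = build_sparse_index(actions, "RaterId")
entry_count = sum(len(ids) for ids in rater_index.values())
```

Four documents, two index entries: the descriptors without a RaterId cost nothing in the index.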

Slices

Now we're almost done. But there is one more feature we wanted to have, and it turns out a simple embedded array is very helpful here.

As you probably know from Facebook, an activity can pop up in your news feed not because you're best friends with the poster, but because you were tagged or mentioned. In Entexis, we allow something similar: each applicant details view shows a slice of the news feed that contains only actions related to the respective applicant, but without relevancy scoring.

Entexis Screenshot

We could model this with a dedicated field:

Action {
    // as above
    ApplicantId: {ApplicantId}
}

That is easy, but it has a number of disadvantages. First, we now have HR-related stuff in the Action object, which is bad separation of concerns. Second, it's very rigid: we might want to slice not only by applicant, but also by position. Hence, we're using an embedded array instead:

Action {
    // as before
    RefIds: [{Id1}, {Id2}, ... ]
}

RefIds is indexed as well. To find all actions that were performed on or related to a specific applicant, application or position, all we have to do is this:

    _db.Find<Action>(Query.EQ("RefIds", applicationId));
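Why a plain equality query is enough: MongoDB indexes an array field with one entry per element (a multikey index). A toy Python model of that idea, with made-up ids and hypothetical helper names:

```python
from collections import defaultdict

actions = [
    {"_id": "a1", "RefIds": ["applicant-1", "position-9"]},
    {"_id": "a2", "RefIds": ["applicant-2", "position-9"]},
    {"_id": "a3", "RefIds": ["applicant-1"]},
]


def build_multikey_index(actions):
    """One index entry per array element, each pointing back at its action."""
    index = defaultdict(list)
    for action in actions:
        for ref_id in action["RefIds"]:
            index[ref_id].append(action["_id"])
    return index


def find_by_ref(index, ref_id):
    """Equality lookup: every action whose RefIds contains ref_id."""
    return index.get(ref_id, [])


index = build_multikey_index(actions)
applicant_slice = find_by_ref(index, "applicant-1")
position_slice = find_by_ref(index, "position-9")
```

The same lookup serves the applicant slice and the position slice, which is exactly what the dedicated ApplicantId field could not do.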
Note that we don't have to indicate that RefIds is really an array: it just works and finds all documents where applicationId is an element of RefIds, which is exactly the semantics we need. Displaying those references as links in the frontend is a bit more complicated, because we need to know what type of object we're referring to, and ideally we also store a string representation, but that lies a bit outside the scope of this post. Thanks for reading!