What's the best manner of implementing a social activity stream? [closed]

I'm interested in hearing your opinions on the best way of implementing a social activity stream (Facebook is the most famous example). Problems/challenges involved are:

Different types of activities (posting, commenting ..)
Different types of objects (post, comment, photo ..)
1-n users involved in different roles ("User x replied to User y's comment on User z's post")
Different views of the same activity item ("you commented .." vs. "your friend x commented" vs. "user x commented .." => 3 representations of a "comment" activity)

.. and some more, especially if you take it to a high level of sophistication, as Facebook does, for example by combining several activity items into one ("users x, y and z commented on that photo").

Any thoughts or pointers on patterns, papers, etc on the most flexible, efficient and powerful approaches to implementing such a system, data model, etc. would be appreciated.

Although most of the issues are platform-agnostic, chances are I'll end up implementing such a system in Ruby on Rails.

Solution 1:

I have created such a system and took this approach:

Database table with the following columns: id, userId, type, data, time.

userId is the user who generated the activity
type is the type of the activity (e.g. wrote blog post, added photo, commented on user's photo)
data is a serialized object with meta-data for the activity where you can put in whatever you want

This limits the searches/lookups you can do in the feeds to users, time and activity types, but in a Facebook-type activity feed this isn't really limiting. And with the correct indices on the table, the lookups are fast.

With this design you would have to decide what metadata each type of event should require. For example a feed activity for a new photo could look something like this:

{"id": 1, "userId": 1, "type": "PHOTO", "time": "2008-10-15 12:00:00", "data": {"photoId": 2089, "photoName": "A trip to the beach"}}

Although the name of the photo is certainly stored in some other table containing the photos, and I could retrieve it from there, I duplicate the name in the metadata field, because you don't want to do any joins on other database tables if you want speed. And in order to display, say, 200 different events from 50 different users, you need speed.
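As a rough illustration of the table described above, here is a minimal sketch using Python's built-in sqlite3 module. The table name `activity`, the index name, the helper functions `record_activity` and `fetch_feed`, and the choice of JSON as the serialization format are all assumptions made for this example, not part of the original design.

```python
import json
import sqlite3

# Hypothetical schema mirroring the columns id, userId, type, data, time.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE activity (
        id     INTEGER PRIMARY KEY,
        userId INTEGER NOT NULL,
        type   TEXT    NOT NULL,
        data   TEXT    NOT NULL,  -- serialized metadata (JSON is an assumption)
        time   TEXT    NOT NULL   -- "YYYY-MM-DD HH:MM:SS" sorts correctly as text
    )
    """
)
# Index for the common lookup: recent activity by a set of users.
conn.execute("CREATE INDEX idx_activity_user_time ON activity (userId, time)")


def record_activity(user_id, activity_type, data, time):
    """Store one activity; display fields are duplicated into `data`."""
    conn.execute(
        "INSERT INTO activity (userId, type, data, time) VALUES (?, ?, ?, ?)",
        (user_id, activity_type, json.dumps(data), time),
    )


def fetch_feed(user_ids, limit=200):
    """Return the newest activities by the given users without any joins."""
    if not user_ids:
        return []
    placeholders = ",".join("?" for _ in user_ids)
    rows = conn.execute(
        "SELECT id, userId, type, data, time FROM activity "
        f"WHERE userId IN ({placeholders}) ORDER BY time DESC LIMIT ?",
        (*user_ids, limit),
    ).fetchall()
    return [
        {"id": r[0], "userId": r[1], "type": r[2],
         "data": json.loads(r[3]), "time": r[4]}
        for r in rows
    ]


record_activity(1, "PHOTO",
                {"photoId": 2089, "photoName": "A trip to the beach"},
                "2008-10-15 12:00:00")
record_activity(2, "BLOG_POST", {"postId": 7, "title": "Hello"},
                "2008-10-15 13:00:00")
feed = fetch_feed([1, 2])
```

Because the photo name travels inside `data`, rendering the feed needs only this one query; the cost is that renaming a photo leaves older feed entries with the old name unless they are rewritten.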

Then I have classes that extend a basic FeedActivity class for rendering the different types of activity entries. Grouping of events would be built into the rendering code as well, to keep complexity out of the database.
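The rendering side could look roughly like the sketch below. Only the `FeedActivity` name comes from the answer; the subclasses, the `RENDERERS` registry, the `render_feed` function, the grouping key (activity type plus `photoId`) and the "You" wording for the viewer are hypothetical choices for illustration.

```python
class FeedActivity:
    """Base renderer; subclasses supply the phrase for one activity type."""

    def __init__(self, row):
        self.row = row

    def actor(self, viewer_id, names):
        # Viewer-dependent wording: "You" versus the actor's display name.
        uid = self.row["userId"]
        return "You" if uid == viewer_id else names[uid]

    def phrase(self):
        raise NotImplementedError

    def render(self, viewer_id, names):
        return f"{self.actor(viewer_id, names)} {self.phrase()}"


class PhotoActivity(FeedActivity):
    def phrase(self):
        name = self.row["data"]["photoName"]
        return f'added the photo "{name}"'


class CommentActivity(FeedActivity):
    def phrase(self):
        name = self.row["data"]["photoName"]
        return f'commented on the photo "{name}"'


# Hypothetical registry mapping the stored type to its renderer class.
RENDERERS = {"PHOTO": PhotoActivity, "COMMENT": CommentActivity}


def join_names(actors):
    if len(actors) == 1:
        return actors[0]
    return ", ".join(actors[:-1]) + " and " + actors[-1]


def render_feed(rows, viewer_id, names):
    """Render rows, merging activities of the same type on the same photo."""
    groups = {}  # dicts preserve insertion order, so feed order is kept
    for row in rows:
        key = (row["type"], row["data"]["photoId"])
        groups.setdefault(key, []).append(row)

    lines = []
    for group in groups.values():
        first = RENDERERS[group[0]["type"]](group[0])
        actors = []
        for row in group:
            actor = RENDERERS[row["type"]](row).actor(viewer_id, names)
            if actor not in actors:
                actors.append(actor)
        lines.append(f"{join_names(actors)} {first.phrase()}")
    return lines


beach = {"photoId": 2089, "photoName": "A trip to the beach"}
rows = [
    {"userId": 1, "type": "COMMENT", "data": beach},
    {"userId": 2, "type": "COMMENT", "data": beach},
    {"userId": 3, "type": "PHOTO",
     "data": {"photoId": 2090, "photoName": "Sunset"}},
]
names = {1: "Alice", 2: "Bob", 3: "Carol"}
lines = render_feed(rows, viewer_id=1, names=names)
```

Here the two comments on the same photo collapse into a single line, and the viewer (user 1) is shown as "You", all without touching the database layer.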

Solution 2:

This is a very good presentation outlining how Etsy.com architected their activity streams. It's the best example I've found on the topic, though it's not Rails-specific.

http://www.slideshare.net/danmckinley/etsy-activity-feeds-architecture

Solution 3:

We've open sourced our approach: https://github.com/tschellenbach/Stream-Framework. It's currently the largest open source library aimed at solving this problem.

The same team that built Stream Framework also offers a hosted API, which handles the complexity for you. Have a look at getstream.io. There are clients available for Node, Python, Rails and PHP.

In addition, have a look at this High Scalability post where we explain some of the design decisions involved: http://highscalability.com/blog/2013/10/28/design-decisions-for-scaling-your-high-traffic-feeds.html

This tutorial will help you set up a system like Pinterest's feed using Redis. It's quite easy to get started with.

To learn more about feed design I highly recommend reading some of the articles which we based Feedly on:

Yahoo Research Paper
Twitter 2013 Redis based, with fallback
Cassandra at Instagram
Etsy feed scaling
Facebook history
Django project, with good naming conventions. (But database only)
http://activitystrea.ms/specs/atom/1.0/ (actor, verb, object, target)
Quora post on best practises
Quora scaling a social network feed
Redis ruby example
FriendFeed approach
Thoonk setup
Twitter's Approach

Though Stream Framework is Python-based, it wouldn't be too hard to use from a Ruby app. You could simply run it as a service and stick a small HTTP API in front of it. We are considering adding an API to access Feedly from other languages. At the moment you'll have to roll your own, though.

Solution 4:

The biggest issues with event streams are visibility and performance; you need to restrict the events displayed to be only the interesting ones for that particular user, and you need to keep the amount of time it takes to sort through and identify those events manageable. I've built a smallish social network; I found that at small scales, keeping an "events" table in a database works, but that it gets to be a performance problem under moderate load.

With a larger stream of messages and users, it's probably best to go with a messaging system, where events are sent as messages to individual profiles. The trade-off is that you can't easily subscribe to someone's event stream and see their previous events, but rendering the stream for a particular user only requires reading a small group of messages.
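A minimal in-memory sketch of this message-per-profile idea (often called fan-out on write) is shown below. The dictionaries stand in for a real messaging backend, and the names `follow`, `publish`, `read_stream` and the `MAX_INBOX` cap are assumptions for illustration only.

```python
from collections import defaultdict, deque

MAX_INBOX = 100  # assumed cap on how many events each profile keeps

followers = defaultdict(set)  # author id -> ids of users following them
inboxes = defaultdict(lambda: deque(maxlen=MAX_INBOX))  # user id -> newest first


def follow(follower_id, author_id):
    followers[author_id].add(follower_id)


def publish(author_id, event):
    """Deliver the event to the inbox of every current follower."""
    for follower_id in followers[author_id]:
        # appendleft keeps newest first; the oldest entry drops off when full
        inboxes[follower_id].appendleft(event)


def read_stream(user_id, limit=20):
    """Rendering a stream is just reading that user's own inbox."""
    return list(inboxes[user_id])[:limit]


follow(2, 1)
publish(1, "user 1 posted a photo")
follow(3, 1)  # user 3 subscribes after the first event
publish(1, "user 1 wrote a comment")

stream_2 = read_stream(2)
stream_3 = read_stream(3)
```

User 3 followed after the first event was published, so only the second event reaches that inbox. That is the limitation on seeing previous events, while each read stays a cheap lookup of one small list.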

I believe this was Twitter's original design flaw: I remember reading that they were hitting the database to pull in and filter their events. This had everything to do with architecture and nothing to do with Rails, which (unfortunately) gave birth to the "Ruby doesn't scale" meme. I recently saw a presentation where the developer used Amazon's Simple Queue Service as the messaging backend for a Twitter-like application that would have far higher scaling capabilities. It may be worth looking into SQS as part of your system if your loads are high enough.

Solution 5:

If you are willing to use separate software, I suggest the Graphity server, which solves exactly this problem for activity streams (it is built on top of the Neo4j graph database).

The algorithms have been implemented as a standalone REST server so that you can host your own server to deliver activity streams: http://www.rene-pickhardt.de/graphity-server-for-social-activity-streams-released-gplv3/

In the paper and benchmark I showed that the cost of retrieving news streams depends only linearly on the number of items you want to retrieve, without the redundancy you would get from denormalizing the data:

http://www.rene-pickhardt.de/graphity-an-efficient-graph-model-for-retrieving-the-top-k-news-feeds-for-users-in-social-networks/

At the link above you will find screencasts and a benchmark of this approach (showing that Graphity is able to retrieve more than 10k streams per second).
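To give a feel for reading a stream from normalized per-user lists rather than denormalized copies, here is a generic k-way merge sketch in Python. This is not Graphity's index or algorithm, only a simple stand-in for the general idea of pulling the top k items at read time; the function name `top_k_stream` and the sample data are made up for the example.

```python
import heapq
from itertools import islice


def top_k_stream(followed_streams, k):
    """Merge newest-first lists of (timestamp, item) and return the k newest.

    Each input list must already be sorted by timestamp, newest first.
    heapq.merge is lazy, so only as many entries as needed are consumed.
    """
    merged = heapq.merge(
        *followed_streams, key=lambda entry: entry[0], reverse=True
    )
    return list(islice(merged, k))


# Each followed user's activities are stored once, newest first.
alice = [(105, "alice: photo"), (101, "alice: post")]
bob = [(104, "bob: comment"), (100, "bob: post")]
carol = [(103, "carol: post")]

top3 = top_k_stream([alice, bob, carol], 3)
```

Each activity is stored exactly once, under its author, and the work done after the initial setup grows with the number of items requested rather than with the total history of the followed users.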