API pagination best practices

I'd love some help handling a strange edge case with a paginated API I'm building.

Like many APIs, this one paginates large result sets. If you query /foos, you'll get 100 results (i.e. foos #1-100) and a link to /foos?page=2, which should return foos #101-200.

Unfortunately, if foo #10 is deleted from the data set before the API consumer makes the next query, /foos?page=2 will offset by 100 and return foos #102-201.

This is a problem for API consumers who are trying to pull all foos - they will not receive foo #101.

What's the best practice to handle this? We'd like to make it as lightweight as possible (i.e. avoiding handling sessions for API requests). Examples from other APIs would be greatly appreciated!


Solution 1:

I'm not completely sure how your data is handled, so this may or may not work, but have you considered paginating with a timestamp field?

When you query /foos you get 100 results. Your API should then return something like this (assuming JSON, but if it needs XML the same principles can be followed):

{
    "data" : [
        {  data item 1 with all relevant fields    },
        {  data item 2   },
        ...
        {  data item 100 }
    ],
    "paging":  {
        "previous":  "http://api.example.com/foo?since=TIMESTAMP1" 
        "next":  "http://api.example.com/foo?since=TIMESTAMP2"
    }
}

Just a note: using only one timestamp relies on an implicit limit in your results. You may want to add an explicit limit, or also use an until property.

The timestamp can be dynamically determined using the last data item in the list. This seems to be more or less how Facebook paginates in its Graph API (scroll down to the bottom to see the pagination links in the format I gave above).

One problem may arise if you add a data item, but based on your description it sounds like new items would be added to the end (if not, let me know and I'll see if I can improve on this).
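
To make that concrete, here's a rough server-side sketch (Python with SQLite; the foos table and its created_at epoch column are placeholder names, not part of the question). Note that if two rows can share a timestamp, you would also need a tiebreaker column:

import sqlite3

PAGE_SIZE = 100

def get_foos_page(conn, since=None):
    """Return one page of foos plus the cursor for the next page.

    The cursor is the created_at of the LAST item returned, so a row
    deleted earlier in the set cannot shift the window: the next page
    always starts strictly after what the client has already seen.
    """
    if since is None:
        rows = conn.execute(
            "SELECT id, created_at FROM foos "
            "ORDER BY created_at LIMIT ?", (PAGE_SIZE,)).fetchall()
    else:
        rows = conn.execute(
            "SELECT id, created_at FROM foos WHERE created_at > ? "
            "ORDER BY created_at LIMIT ?", (since, PAGE_SIZE)).fetchall()
    next_since = rows[-1][1] if rows else None
    return {
        "data": rows,
        "paging": {
            "next": f"/foos?since={next_since}" if rows else None
        }
    }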

Solution 2:

If you've got pagination, you are also sorting the data by some key. Why not let API clients include the key of the last element of the previously returned collection in the URL, and add a WHERE clause to your SQL query (or something equivalent, if you're not using SQL) so that it returns only those elements whose key is greater than that value?
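
For example, with the primary key as the sort key, the query might look something like this (a sketch only; the foos table and column names are made up):

PAGE_SIZE = 100

def next_page(conn, last_id=None):
    """Keyset pagination: anchor the page on the last key the client saw.

    Because the WHERE clause filters on the key itself rather than
    using an offset, deleting earlier rows cannot shift the window.
    """
    if last_id is None:
        return conn.execute(
            "SELECT id, name FROM foos ORDER BY id LIMIT ?",
            (PAGE_SIZE,)).fetchall()
    return conn.execute(
        "SELECT id, name FROM foos WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, PAGE_SIZE)).fetchall()

The client then calls something like /foos?after=<key of the last element it received>, and the server passes that value in as last_id.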

Solution 3:

You have several problems.

First, you have the example that you cited.

You also have a similar problem if rows are inserted, but in this case the user gets duplicate data (arguably easier to manage than missing data, but still an issue).

If you are not snapshotting the original data set, then this is just a fact of life.

You can have the user make an explicit snapshot:

POST /createquery
filter.firstName=Bob&filter.lastName=Eubanks

Which results in:

HTTP/1.1 201 Created
Location: http://www.example.org/query/12345

Then you can page over that all day long, since it's now static. This can be reasonably lightweight, since you can just capture the actual document keys rather than the entire rows.
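
As a rough sketch (the in-memory store and all names here are illustrative assumptions), the snapshot only needs to hold the ordered keys:

import uuid

snapshots = {}  # stand-in for wherever query snapshots would actually live

def create_query(conn, first_name, last_name):
    """POST /createquery: freeze the result set by storing only its keys."""
    keys = [row[0] for row in conn.execute(
        "SELECT id FROM foos WHERE first_name = ? AND last_name = ? "
        "ORDER BY id", (first_name, last_name))]
    query_id = uuid.uuid4().hex
    snapshots[query_id] = keys
    return query_id  # used to build the Location: /query/<id> header

def get_query_page(conn, query_id, page, page_size=100):
    """GET /query/<id>?page=n: slice the frozen key list, fetch by key.

    Rows deleted after the snapshot simply drop out of the page; the
    page boundaries themselves never move.
    """
    keys = snapshots[query_id][page * page_size:(page + 1) * page_size]
    if not keys:
        return []
    placeholders = ",".join("?" * len(keys))
    return conn.execute(
        "SELECT id, first_name, last_name FROM foos "
        f"WHERE id IN ({placeholders}) ORDER BY id", keys).fetchall()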

If the use case is simply that your users want (and need) all of the data, then you can simply give it to them:

GET /query/12345?all=true

and just send the whole kit.

Solution 4:

There may be two approaches, depending on your server-side logic.

Approach 1: When the server is not smart enough to handle object states.

You could send the unique IDs of all cached records to the server, for example ["id1","id2","id3","id4","id5","id6","id7","id8","id9","id10"], plus a boolean parameter to indicate whether you are requesting new records (pull to refresh) or old records (load more).

Your server should be responsible for returning the new records (more records via load more, or fresh records via pull to refresh) as well as the IDs of any records from ["id1","id2","id3","id4","id5","id6","id7","id8","id9","id10"] that have been deleted.

Example: if you are requesting more records (load more), your request should look something like this:

{
        "isRefresh" : false,
        "cached" : ["id1","id2","id3","id4","id5","id6","id7","id8","id9","id10"]
}

Now suppose you are requesting old records (load more), and suppose the "id2" record has been updated by someone and the "id5" and "id8" records have been deleted from the server. Then your server response should look something like this:

{
        "records" : [
                {"id" : "id2", "more_key" : "updated_value"},
                {"id" : "id11", "more_key" : "more_value"},
                {"id" : "id12", "more_key" : "more_value"},
                {"id" : "id13", "more_key" : "more_value"},
                {"id" : "id14", "more_key" : "more_value"},
                {"id" : "id15", "more_key" : "more_value"},
                {"id" : "id16", "more_key" : "more_value"},
                {"id" : "id17", "more_key" : "more_value"},
                {"id" : "id18", "more_key" : "more_value"},
                {"id" : "id19", "more_key" : "more_value"},
                {"id" : "id20", "more_key" : "more_value"}
        ],
        "deleted" : ["id5","id8"]
}

But in this case, if you have a lot of locally cached records, say 500, then your request string will be too long, like this:

{
        "isRefresh" : false,
        "cached" : ["id1","id2","id3","id4","id5","id6","id7","id8","id9","id10",………,"id500"]//Too long request
}
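
A rough server-side sketch of this approach (the records table and its columns are made up; note that detecting updated records, like "id2" above, would additionally need a version or timestamp column, which is really what Approach 2 adds):

def load_more(conn, cached_ids, page_size=10):
    """Return the next page after the client's cache, plus which of the
    client's cached IDs no longer exist on the server. Assumes keys
    compare in insertion order (numeric IDs would; the string IDs in
    the examples above are just for illustration)."""
    placeholders = ",".join("?" * len(cached_ids))
    # Detect deletions: any cached ID that is no longer present.
    alive = {row[0] for row in conn.execute(
        f"SELECT id FROM records WHERE id IN ({placeholders})",
        cached_ids)}
    deleted = [rid for rid in cached_ids if rid not in alive]
    # Next page: records strictly after the last cached one.
    records = conn.execute(
        "SELECT id, more_key FROM records WHERE id > ? "
        "ORDER BY id LIMIT ?", (cached_ids[-1], page_size)).fetchall()
    return {"records": records, "deleted": deleted}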

Approach 2: When the server is smart enough to handle object states by date.

You could send the ID of the first record, the ID of the last record, and the epoch time of the previous request. This way your request is always small, even if you have a large number of cached records.

Example: if you are requesting more records (load more), your request should look something like this:

{
        "isRefresh" : false,
        "firstId" : "id1",
        "lastId" : "id10",
        "last_request_time" : 1421748005
}

Your server is responsible for returning the IDs of records deleted after last_request_time, as well as any records updated after last_request_time between "id1" and "id10".

{
        "records" : [
                {"id" : "id2", "more_key" : "updated_value"},
                {"id" : "id11", "more_key" : "more_value"},
                {"id" : "id12", "more_key" : "more_value"},
                {"id" : "id13", "more_key" : "more_value"},
                {"id" : "id14", "more_key" : "more_value"},
                {"id" : "id15", "more_key" : "more_value"},
                {"id" : "id16", "more_key" : "more_value"},
                {"id" : "id17", "more_key" : "more_value"},
                {"id" : "id18", "more_key" : "more_value"},
                {"id" : "id19", "more_key" : "more_value"},
                {"id" : "id20", "more_key" : "more_value"}
        ],
        "deleted" : ["id5","id8"]
}
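
Server-side this becomes a delta query. A sketch, assuming an updated_at column on the records table and a deleted_records tombstone table with (id, deleted_at), neither of which is given in the question:

def delta_sync(conn, first_id, last_id, last_request_time, page_size=10):
    # Records within the client's cached window that changed since the
    # last request.
    updated = conn.execute(
        "SELECT id, more_key FROM records "
        "WHERE updated_at > ? AND id BETWEEN ? AND ? ORDER BY id",
        (last_request_time, first_id, last_id)).fetchall()
    # The next page of older records (load more).
    more = conn.execute(
        "SELECT id, more_key FROM records WHERE id > ? "
        "ORDER BY id LIMIT ?", (last_id, page_size)).fetchall()
    # IDs deleted since the client's last request.
    deleted = [row[0] for row in conn.execute(
        "SELECT id FROM deleted_records WHERE deleted_at > ?",
        (last_request_time,))]
    return {"records": updated + more, "deleted": deleted}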

Pull to refresh: [image illustrating the pull-to-refresh flow]

Load more: [image illustrating the load-more flow]

Solution 5:

It may be tough to find established best practices here, since most systems with APIs don't accommodate this scenario: either it is an extreme edge case, or they typically don't delete records at all (Facebook, Twitter). Facebook actually says each "page" may not have the number of results requested, due to filtering done after pagination. https://developers.facebook.com/blog/post/478/

If you really need to accommodate this edge case, you need to "remember" where you left off. jandjorgensen's suggestion is just about spot on, but I would use a field guaranteed to be unique, like the primary key. You may need to use more than one field.

Following Facebook's flow, you can (and should) cache the pages already requested, and if a client requests a page it has already requested, return the cached page with deleted rows filtered out.
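
A minimal sketch of that caching idea (the cache structures here are my own assumptions, not something Facebook prescribes):

page_cache = {}      # pages already served, keyed by (consumer, page number)
deleted_ids = set()  # IDs deleted since those pages were cached

def get_page(consumer, page, fetch_page):
    """Re-serve previously requested pages from the cache so their
    boundaries never shift; filter out rows deleted in the meantime."""
    key = (consumer, page)
    if key not in page_cache:
        page_cache[key] = fetch_page(page)
    return [row for row in page_cache[key] if row["id"] not in deleted_ids]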