Kulpreet aka Jungly

Wal-e for Managing Postgres WAL Backups

Pros

  1. Built-in smarts to find the last backup and use that number to delete older WAL segments

  2. Really active community

Cons

  1. Python and all sorts of other dependencies. I don't like my postgresql.conf having dependencies on libraries that have to be installed using not-so-robust package managers like easy_install or pip. [I do a lot more with Ruby, so Python's package managers still seem alien to me.]

  2. Keeping N backups is still a todo.

  3. There is no documentation on how to switch from an existing S3 WAL backup setup to wal-e.

Switching from an existing setup

I used to do the following:

# take a base backup by hand:
psql my_database
SELECT pg_start_backup('some label');
-- ... copy the data directory over to S3 ...
SELECT pg_stop_backup();

# and ship WAL segments with this archive_command in postgresql.conf:
archive_command = '/var/lib/postgresql/8.4/main/s3test %f && s3cmd -c /home/ubuntu/.s3cfg put --acl-private %p s3://pg_archive/%f'

In the above, s3test is:

#!/bin/bash
# check whether WAL segment $1 is already archived in the pg_archive bucket
listing=$(s3cmd -c /home/ubuntu/.s3cfg ls s3://pg_archive/$1)
# expr prints the number of characters matched, i.e. 0 when not found
res=$(expr "$listing" : ".*s3://pg_archive/$1$")
if [[ $res -gt 0 ]]
then
    # already archived: fail so the && chain skips the duplicate upload
    exit 1
else
    exit 0
fi

The switch is basically a hack, as I haven't found much help on the best way to switch directories.

Should I change the archive_command, restart the server, and then run the first backup-push?

Instead, I am running with the following steps (a sketch of the commands follows the list):

  1. Let the old archive_command (as shown above) run as it does.
  2. Run a wal-e backup-push. This will create a directory called basebackups_NNN under the s3-prefix path you specify.
  3. As soon as backup-push returns, restart the database to pick up the new archive_command.
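
For concreteness, here is a rough sketch of steps 2 and 3, following wal-e's documented envdir-based setup. The env directory, S3 prefix, and data directory below are assumptions; adjust them to your cluster.

# step 2: take the first base backup with wal-e
# (assumes /etc/wal-e.d/env holds AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY
#  and WALE_S3_PREFIX, e.g. WALE_S3_PREFIX=s3://pg_archive/wal-e)
envdir /etc/wal-e.d/env wal-e backup-push /var/lib/postgresql/8.4/main

# step 3: switch postgresql.conf over to wal-e and restart
archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'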

I really recommend that someone find out more about this first. But this is what I am doing for now.

There has to be a tool that is easier to set up and better documented. But I am using wal-e for now.

Dropbox API, a SCRUM Taskboard, and Offline Application Framework?

Offline Apps and Sharing Data

What about having an offline application that saves your data using offline storage, but syncs to a server that allows the same 'offline storage' to be edited by two or more users?

The idea is simple.

  1. Load an app from a service that stores and serves files - S3? Dropbox? Anything you fancy. That is where you get the 'app' from (a sketch follows this list).

  2. The app is essentially an HTML/JS page that allows you to create and edit data.

  3. The app saves your data in local offline storage, which seems to have a limit of 5MB. So of course this approach will only work for apps that keep data consumption under 5MB.

  4. Finally, on issuing a 'sync' command, the offline storage is saved back onto a file server. It could be S3, Dropbox, or a plain simple FTP server.

  5. Security will be an issue, as credentials for FTP or S3 will have to be saved in plain text in the application page.
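
Step 1 is easy to try already; here is a sketch of hosting the 'app' on S3 with s3cmd, where the bucket name is a hypothetical:

# publish the single-page app so that anyone with the URL can load it
s3cmd -c ~/.s3cfg put --acl-public index.html s3://my-taskboard/index.html
# the 'app' is then served from:
#   http://my-taskboard.s3.amazonaws.com/index.html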

Dropbox Apps to the rescue

I think the folks at Dropbox are really up to something.

While looking around at the current state of the art for this approach to building offline applications, I found that the Dropbox team has been busily building something that just works marvelously well. They call these apps Dropbox Apps.

Using the Dropbox Core API through their JavaScript client library, everything seems simple and straightforward.

No offline storage

You even start to wonder if you need offline storage at all. Just update the Dropbox files and let the Dropbox daemon sync them up with your team.

Locks

I do think it'll be cool to use a lock file so that multiple users can't edit the app at the same time. If we really need simultaneous edits, then we need to start thinking about merging, and probably using Git JS libs to do the syncing. All too complicated for people who are simply interested in, say, a SCRUM taskboard.

Having start editing -> edit -> save semantics will enable us to check for the lock file, and if two people do manage to get locks, then the one with the earlier version wins. The Dropbox Core API lets us get the versions and timestamps for files.
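
Here is a minimal sketch of the lock-file idea as a shell script against a Dropbox-synced folder (all paths hypothetical). Note that check-then-create is racy across syncing clients, which is exactly why the earlier-version-wins rule is still needed:

#!/bin/bash
# start editing: refuse if someone else already holds the lock
LOCK=~/Dropbox/taskboard/board.lock
if [ -e "$LOCK" ]; then
    echo "board is locked by: $(cat "$LOCK")" >&2
    exit 1
fi
# take the lock, recording who holds it and since when
echo "$USER $(date +%s)" > "$LOCK"

# ... edit ~/Dropbox/taskboard/board.json here ...

# save: release the lock so others can edit
rm "$LOCK"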

Current State of the Art

A few people have tried this, and it seems to be quite an active area.

Unhosted have been working on providing just these kinds of ‘offline’ applications.

RemoteStorage lets anyone run a remote storage server, and they provide client APIs to talk to it from JS applications.

Definitely worth keeping an eye on them.

Game Intelligence

In this post I argue for the need for an open source analytics engine for use in games and other gamified applications (or just regular web applications).

Why not just use Google Analytics (or any such analytics tool)

  1. Only tracks front end events, like page loads and clicks. Does a fantastic job of handling country, browser, etc. We need to be able to track back end events, and creating equivalent front end events just for Google Analytics might be too much.

  2. Can't really be used as a business intelligence tool. For example, we can't answer the question "How many users have reached level 10?", or the even more complex "What percentage of users have reached level 10, split by the week number they joined our system?"

The latter is called cohort analysis, and Google Analytics and other 'web analytics' tools (for example Piwik) don't help us do it.
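
Just to make that second question concrete, here is one way it could be expressed as a query, using a MongoDB aggregation (MongoDB comes up again below) over a hypothetical events collection with one document per level-up, each tagged with the week the user joined. It counts users per cohort; the percentage additionally needs each cohort's total size.

mongo analytics <<'EOF'
// hypothetical schema: { user_id, event, level, week_joined }
db.events.aggregate([
    // keep only level-ups that reached level 10 or beyond
    { $match: { event: "level_up", level: { $gte: 10 } } },
    // collect the distinct users per joining week
    { $group: { _id: "$week_joined", users: { $addToSet: "$user_id" } } },
    // count them per cohort
    { $project: { week_joined: "$_id", reached_level_10: { $size: "$users" } } }
])
EOF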

Open Source Cohort Analysis

So the next question is: what is on offer as a self-hosted, free, open source cohort analysis tool we can use? I really can't find anything useful out there. The closest I came was aarrr, a Ruby library backed by MongoDB to track cohorts.

It seems a lot of people build their own little tools for such analysis. Could it be because

  1. Describing cohorts and tracking activity across cohorts can result in a lot of data replication, so no one is releasing a framework, or

  2. Most of the analysis is carried out in a Business Intelligence ETL manner.

So Why Not an ETL Business Intelligence Tool?

  1. Not open source and can be pretty expensive

  2. Mostly written in Java, with a painful learning curve, especially the WYSIWYG query editors. The whole thing stinks of 'enterprise', I must say.

  3. ETL might not be flexible and dynamic enough. With databases like MongoDB, we don't need cron jobs hogging the database to return analytics results. Instead, we can try to set up a small DB to collect events from the backend as users/players hit certain landmarks.

MongoDB and Analytics

A lot has been written on how MongoDB can be used to aggregate analytics data. The difference is very clear: no more ETL, just simple 'pings' to MongoDB to track analytics. Later, the aggregated results can be shown using any of the freely available graphing libraries; Google Charts or RGraph come to mind.
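
For instance, a 'ping' could be a single fire-and-forget upsert that bumps a counter when a player hits a landmark; a sketch from the mongo shell, with the collection and field names being assumptions:

mongo analytics <<'EOF'
// another user from the 2013-02 cohort just reached level 10
db.counters.update(
    { event: "level_10", week_joined: "2013-02" },
    { $inc: { count: 1 } },
    { upsert: true }
)
EOF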