09 7 / 2014

Mobile filesystems - when native storage will let you down

At the beginning of 2012 I started developing for mobile devices. Before that, I did some graph visualization work and backend-oriented development. In the software engineering world, I have grown accustomed to a sense of gravity, where by “gravity” I mean a certain set of basic ideas/laws you can rely on unconditionally, similar to how you can rely on the planet’s gravity.

In this post, I want to share some of the experiences we’ve had at KeepSafe that left us questioning our own notion of gravity in the software engineering world.

File Systems are unreliable

At KeepSafe we do a lot of file system operations; as our app is essentially a photo vault, dealing with the file system comes with the territory. However, we didn’t just run into issues with the photo files but with smaller files we store beside the photos. We ultimately found that many of the things we take for granted about how file system operations should behave don’t always hold. And when things break, they break in strange, unpredictable ways. Typically with my previous work I would never question the reliability of concrete things such as file systems — it’s what I would consider a part of the software engineering gravity idea. Unfortunately, we’ve seen massive file system related inconsistencies over time that forced us to adhere to a different mindset.

Because our app by now has a good number of users, combined with the availability of good in-app crash analytics, we have repeatedly found weird edge cases that should not happen and that left us scratching our heads. We’ve grown to be more mindful and careful of certain operations and of how we structure our file system code in general. Here are a few things we’ve incorporated into our engineering mindset at KeepSafe.

Avoid changing or moving data

In the next few paragraphs I want to dive a little deeper into behaviors that we’ve encountered during different operations and how we managed to solve those.

As a guiding principle that by now applies to everything we do, we are very careful about not touching data once it has been written successfully to disk and has been verified as usable. We really try not to touch data, to avoid the risk of any type of data loss after a modification. In more detail:

Renaming files and folders can lead to data loss

Unfortunately we have seen problems during and after renaming different files or folders within our application. This was very surprising to us as the assumption has always been that

file.renameTo(destinationFile);

is one of the safest operations that you can do on the file system.

After renaming a directory that our app had created in the past, we’ve experienced FileNotFoundException: ENOENT (No such file or directory) or FileNotFoundException: EISDIR (Is a directory) exceptions while trying to open a new InputStream on existing files within the directory that we had just renamed. The File objects that we tried to open were returned from file.listFiles() on the directory that was just renamed.
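One small defensive step when a rename really is unavoidable: File.renameTo reports failure only through its boolean return value, so it is easy to miss. Below is a minimal sketch of a checked rename. The CheckedRename class and its method are names made up for illustration, not KeepSafe code, and the sketch does not claim to prevent the exceptions described above; it only makes a failed or inconsistent rename visible right away.

```java
import java.io.File;
import java.io.IOException;

final class CheckedRename {
    /** Renames source to target and fails loudly instead of silently. */
    static void rename(File source, File target) throws IOException {
        // File.renameTo signals failure only through its return value.
        if (!source.renameTo(target)) {
            throw new IOException("rename failed: " + source + " -> " + target);
        }
        // Extra paranoia: confirm that the file system agrees with the result.
        if (!target.exists() || source.exists()) {
            throw new IOException("rename reported success but the result is inconsistent: " + target);
        }
    }
}
```

Turning the silent false into an exception at least puts the failure in the crash reports at the point where it happened.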

We follow the rule of not touching files when possible pretty strictly now. After importing an image into KeepSafe, we do not touch the file except when exporting the file. All other modifications, such as changing a photo’s orientation, are stored in a different place, removing the need to modify the file itself.

Changing a file’s content

If you must change the content of a file, you want to make sure to NEVER change the actual file initially. It is not guaranteed that overwriting a file in-place via a write operation actually completes. When an overwrite does not complete, it will leave the file in an ambiguous state where the new content is not written completely and the old content of the file no longer exists, which is the worst case.

To help combat such a case, we wrote our own little TransactionSafeFileUtil class to facilitate good practices when attempting to write a new version of a file. Our TransactionSafeFileUtil stores the new content in an original_file_name.new file, and then renames the original file to original_file_name.old. Finally, the original_file_name.new file is renamed back to original_file_name.

With this technique we can always guarantee that there is at least one valid version of the file on disk in case anything fails during the process. Because the files can be in different states, we synchronize read and write operations per file to guarantee that it’s in the right state.
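The three steps can be sketched roughly as follows. This is not the actual TransactionSafeFileUtil; the SafeFileWriter class and its method are hypothetical, and a single class-wide lock stands in for the per-file synchronization mentioned above. The sketch also leaves the .old copy in place after a successful replace, which is an assumption on my part.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

final class SafeFileWriter {
    /** Replaces target with content so that a complete version always exists on disk. */
    static synchronized void replace(File target, byte[] content) throws IOException {
        File fresh = new File(target.getPath() + ".new");
        File old = new File(target.getPath() + ".old");

        // Step 1: write the new version next to the original and force it to disk.
        try (FileOutputStream out = new FileOutputStream(fresh)) {
            out.write(content);
            out.getFD().sync();
        }
        if (fresh.length() != content.length) {
            throw new IOException("incomplete write: " + fresh);
        }

        // Step 2: move the current version out of the way.
        // The complete .new file already exists, so dropping a stale .old is safe.
        if (old.exists() && !old.delete()) {
            throw new IOException("could not delete " + old);
        }
        if (target.exists() && !target.renameTo(old)) {
            throw new IOException("could not rename " + target + " to " + old);
        }

        // Step 3: promote the new version to the original name.
        if (!fresh.renameTo(target)) {
            throw new IOException("could not rename " + fresh + " to " + target);
        }
    }
}
```

A matching read path would fall back to the .old (or .new) file when the original name is missing, which is why reads and writes need to be coordinated.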

Despite our attempt to follow best practices, we’ve unfortunately seen incomplete writes occurring on quite a few end-user devices. Upon further investigation, we learned that our attempt to make sure that the content was written fully to disk was flawed. Contrary to what we previously thought, OutputStream.flush() does not ensure that all buffered data is completely written to disk. We now know to always call

fileOutputStream.getFD().sync();

if we need assurance that the file was completely written. The sync() function guarantees that the data buffered by the OS is written to the physical device (disk). You can read more at: http://docs.oracle.com/javase/6/docs/api/java/io/FileDescriptor.html#sync%28%29

Triple check what you write and read

Even with the technique described above we still run into strange problems on some phones. For this reason we double check everything we read and write, especially important files. When you read a file that is expected to be 32 bytes long, it’s important to check that the size read is indeed 32 bytes. On Android we see quite a few write operations fail without any exception being thrown.

Here is an example of some code:

// MessageDigest.getInstance can throw NoSuchAlgorithmException
MessageDigest md = MessageDigest.getInstance("MD5");

FileOutputStream fos = new FileOutputStream(file);
BufferedOutputStream bos = new BufferedOutputStream(fos);
bos.write(byteData);
bos.flush();
fos.getFD().sync(); // sync to disk as recommended: http://android-developers.blogspot.com/2010/12/saving-data-safely.html
fos.close();

// Only hash the file when the cheap length check already looks wrong
if (file.length() != byteData.length) {
    final byte[] originalMD5Hash = md.digest(byteData); // digest() also resets md

    InputStream is = new FileInputStream(file);
    BufferedInputStream bis = new BufferedInputStream(is);
    byte[] buffer = new byte[4096];
    int read;

    while ((read = bis.read(buffer)) > -1) {
        md.update(buffer, 0, read); // hash only the bytes that were actually read
    }
    is.close();

    final byte[] writtenFileMD5Hash = md.digest();

    if (!Arrays.equals(originalMD5Hash, writtenFileMD5Hash)) {
        // GiantWtfException and digestToHex are our own helpers
        String message = String.format(
                "After an fsync, the file's length is not equal to the number of bytes we wrote!\npath=%s, " +
                " expected=%d, actual=%d.  >>  " +
                "Original MD5 Hash: %s, written file MD5 hash: %s",
                file.getAbsolutePath(), byteData.length, file.length(),
                digestToHex(originalMD5Hash), digestToHex(writtenFileMD5Hash));
        throw new GiantWtfException(message);
    }
}

return true;

We didn’t believe we would hit that GiantWtfException() when we put that check in, but in reality, it happens quite often in the wild. The number of bytes from the file is always less than the number of bytes from the byte array.

Unfortunately, we see the same issue on iOS from time to time:

NSDictionary *fIn = [[NSFileManager defaultManager] attributesOfItemAtPath:in error: NULL];
NSDictionary *fOut = [[NSFileManager defaultManager] attributesOfItemAtPath:out error: NULL];

if (![fOut fileSize] || ([fOut fileSize] != [fIn fileSize])) {
    NSLog(@"File size is wrong");
    return nil;
}

Since our app contains user-generated content that is irreplaceable, checking for consistency is extremely important.
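The same idea applies on the read side, as in the 32-byte example mentioned earlier. Here is a minimal sketch of a length-checked read; ExactRead and readExactly are hypothetical names used only for illustration.

```java
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

final class ExactRead {
    /** Reads a file that must be exactly expectedLength bytes long. */
    static byte[] readExactly(File file, int expectedLength) throws IOException {
        // Cheap check first: compare the size reported by the file system.
        if (file.length() != expectedLength) {
            throw new IOException("expected " + expectedLength + " bytes but found "
                    + file.length() + " in " + file);
        }
        byte[] data = new byte[expectedLength];
        try (DataInputStream in = new DataInputStream(new FileInputStream(file))) {
            // readFully throws EOFException if fewer bytes can actually be read.
            in.readFully(data);
            // One more read makes sure there is no unexpected trailing data.
            if (in.read() != -1) {
                throw new IOException("file is longer than expected: " + file);
            }
        }
        return data;
    }
}
```

Checking both the reported size and the number of bytes actually delivered is deliberate, since the two do not always agree on the devices described here.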

If it doesn’t work the first time, try again

We also discovered that just because writing a file does not work the first time, it doesn’t mean that it won’t work when you try again.

I have seen a variety of IOExceptions that should never happen ideally. Retrying the same operation again sometimes works. Some examples:

  • java.io.FileNotFoundException: /storage/emulated/0/folder/filename: open failed: EACCES (Permission denied) - on a file that I write over and over, and I can see in the logs that I did just that
  • java.io.IOException: Read-only file system - In the location I use all the time to write
  • java.io.IOException: write failed: EDQUOT (Quota exceeded)

An exception that is even more confusing…

java.lang.IllegalStateException: neither file nor directory: /mnt/sdcard/Android/data/packageName/cache/filename_not_written_by_me

Thrown in my own code at

if(!d.isDirectory()) {
    throw new IllegalArgumentException("given param is not a directory");
}

File[] files = d.listFiles();

if (files != null) {
    for (File f: files) {
        if(f.isFile()) {
            f.delete();
        } else if (f.isDirectory()) {
            deleteDirectoryContent(f);
            f.delete();
        } else {
            throw new IllegalStateException("neither file nor directory: " + f.getAbsolutePath());
        }
    }
}

Every now and then you will also see some

java.io.FileNotFoundException: /path: open failed: ENOENT (No such file or directory) - On files that you wrote to disk and made 100% sure that everything is fine.

I had a great time reading up on Linux System Errors to see what can go wrong at a file system level.

Prepare for failure

Sometimes things just don’t work. In case we want to write important data to disk and it does not work, even after a few retries, we just throw a RuntimeException. I’d rather have the app crash hard, instead of it running in an unhealthy state. For example, we crash the client when we cannot persist the generated encryption key, to prevent encrypting files with a key that will be nonexistent soon. The goal is to always keep your user’s data in a state that is either recoverable or valid, with strong preference for validity.
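The retry-then-crash policy can be sketched like this. The Retry helper and its IoAction interface are hypothetical names for illustration, not our production code, and the number of attempts is up to the caller.

```java
import java.io.IOException;

final class Retry {
    /** A file operation that may fail with an IOException. */
    interface IoAction {
        void run() throws IOException;
    }

    /** Runs the action up to maxAttempts times and crashes hard if it never succeeds. */
    static void withRetries(int maxAttempts, IoAction action) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                action.run();
                return; // success, nothing more to do
            } catch (IOException e) {
                last = e; // remember the failure and try again
            }
        }
        // Every attempt failed: refuse to keep running in an unhealthy state.
        throw new IllegalStateException("giving up after " + maxAttempts + " attempts", last);
    }
}
```

A call such as Retry.withRetries(3, () -> writeKeyFile()) would then either persist the data or take the app down, where writeKeyFile is a placeholder for whatever write must not fail silently.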

Flash memory

Finally, I just want to touch on flash memory for a bit. By and large, for Android devices, all of the data is stored on flash memory. However, flash memory has its limitations. Each block can handle only a finite number of write operations before it fails, which is called memory wear. To mitigate memory wear, there are different kinds of wear leveling. Just a reminder that continuous file system operations degrade the underlying hardware, which is always something to keep in mind.


04 6 / 2014

Migrating A large Android Application From Ant to Gradle

Introduction

This post describes our experience migrating the KeepSafe Android app to the new Gradle-based build system. This is not meant to be a step-by-step tutorial. For a more detailed description, you might want to check out the links below first:

In this post, I want to focus on some specific cases we have had problems with. This also represents the state of the build tools at the time of writing. Some of it might be outdated and no longer valid.

Why move from an existing solution?

When starting to build an application, it’s easy to pick the tool that least gets in your way. With KeepSafe, we started with the default choice at the time, which was building the project with Eclipse/Ant. This choice ended up not being ideal, however. As we grew and more and more people started working on the same thing, managing many different Android library projects became a massive pain. We realized we needed an extendable, IDE-agnostic build system with dependency management that was also easy to use with CI.

Our options

There are several tools on the market that allow you to build an Android app. We spent some time investigating how each of them would fit in our environment:

  • Makefile/Scripts - the most flexible but also the most time consuming solution. The lack of dependency management makes this option no better than what we already had.
  • Ant - What we already had. Dependency management can be added through Ivy. It’s easy to write custom tasks, but it’s ugly and verbose. The more you need, the faster the scripts grow, and after a while you end up with thousands of lines of XML.
  • Maven - Seems to be the most used solution for building Android apps. While Maven is supported by most tools, doing anything custom is a pain. Also, it’s still XML.
  • Gradle - Picked by Google last year as the default Android build tool. It’s progressing quite fast, but it’s still missing features, and the API is not stable. Gradle has a nice DSL and it’s easy to extend and integrate with other tools.
  • Buck - Created by Facebook. It’s hard to judge this tool without using it. It seems to be a niche tool with very specific use cases; according to the documentation: “Buck is designed for building multiple deliverables from a single repository rather than across multiple repositories.”

Our decision

For us, it really came down to Maven and Gradle. While Maven is more mature, Gradle seems to be gaining ground quickly, and because it’s backed by Google, we can expect good, long-term support. It’s also the default build tool for Android Studio, it has some great features, and it isn’t XML-based.

The migration process

apk, apklib, aar - a whole bunch of different library formats

In the Android world there have been many different file formats to include libraries and Android library projects. So far there have been apk files for the actual app, jars for normal Java libs and apklibs for android library projects. Unfortunately, those are not compatible with each other.

Google decided to introduce a new format for Android libraries, .aar. This is way nicer than having to link to the actual Android library project in Eclipse and makes versioning very easy.

The old Maven plugin format, apklib, is not supported by Gradle. This means we have to migrate the existing projects that offer apklib as a format into aar lib files. Our solution was to build aar lib files for those projects ourselves.

Main app

If you’re using the default Android Developer Tools (ADT) in Eclipse you’ll find an option to export your current project to Gradle (only in ADT version 22.0 or higher). ADT itself does not support Gradle right now, so you will be better off switching to Android Studio. You can try to use Gradle outside of ADT on the command line or use the default Gradle plugin for Eclipse, but I didn’t have much pleasure/luck with that. You have to do some wonky symlink stuff for Eclipse to find your resources and other things.

There are several differences between the ADT Eclipse and the Gradle build system. The biggest two differences are the project structure and the test project integration into the main project. If you don’t want to move to the new default gradle structure you need to tell Gradle where the files are located. In our case:

android {
    ...
    sourceSets {
        main {
            java.srcDirs = ['src']
            resources.srcDirs = ['src']
            aidl.srcDirs = ['src']
            renderscript.srcDirs = ['src']
            res.srcDirs = ['res']
            assets.srcDirs = ['assets']
            jniLibs.srcDirs = ['libs']

            manifest.srcFile 'AndroidManifest.xml'
        }
        debug {
            java.srcDirs = ['debugSrc']
            res.srcDirs = ['debugRes']
        }
        release {
            java.srcDirs = ['releaseSrc']
            res.srcDirs = ['releaseRes']
        }
        androidTest.setRoot('test')
        androidTest {
            java.srcDirs = ['test/src']
            res.srcDirs = ['test/res']
            resources.srcDirs = ['test/src']

            manifest.srcFile file('test/AndroidManifest.xml')
        }
    }
    ...
}

This is roughly what we ended up with. The test project AndroidManifest.xml is usually auto-generated, but if you have something specific, you might want to use your own.

Spec everything

We’ve learned that you want to spec as much as possible in your build.gradle file. This is one of the main ways to get consistent builds across different environments. We define our Java compiler settings as well as the build tools and SDK versions.

    tasks.withType(JavaCompile) {
        options.encoding = "UTF-8"
        sourceCompatibility = "1.6"
        targetCompatibility = "1.6"

    }

    android {
        compileSdkVersion 17
        buildToolsVersion '19.0.1'
    }

Libraries

As I’ve already mentioned, there is a new Android library format .aar. If you have your code split into more than one project like us, you will need to migrate that as well. It’s not much different from migrating the main app. There are, however, some problems you should be aware of.

Build types

There are 2 default build types: debug and release. Unfortunately, this setting does not propagate to libraries when building your main app in debug. Even if you have your dependencies set up correctly and you try to build a debug apk, it will include the release aar. This means you want to have all of your configuration in the main app. We already had an existing library with debug settings. We ended up moving all of the library configurations that depend on the build settings to the main app, so that the lib is configured at runtime. One example is debug logs. We are now initializing our internal lib with the log configuration from the main app instead of with the config from the build type of the lib. Issues #1 and #2 have already been opened for this.

Maven and aar

There is some work done to be able to create aar from Maven:

It was not stable enough when we were trying to use it, so we ended up creating Gradle build scripts for the libraries we were building and including those in our CI.

Build types & signing

We have 3 different types: debug, beta and release. You can have different sources, resources and so on for each build. This is quite useful if you want to configure some services, endpoints, or different icons for debug build. These are working quite well and we had no problems here. If you want another build type, the easiest way is to inherit it from an existing one.

beta.initWith(buildTypes.debug)
beta {
    runProguard true
    proguardFile 'proguard.cfg'
    versionNameSuffix "-beta"
    packageNameSuffix ".beta"
}

The build types we ended up with are:

Debug - has to run as fast as possible as we run it all the time during development. This means we don’t sign the APK or run code optimization tools like Proguard.

Beta - a similar build to our release build, just without signing the APK at the end as we don’t feel comfortable having our Google Play store key password flying around. We load the same config as for release and also run Proguard during the build. This is the build type we run on our CI server, as it will detect errors around Proguard configurations that might have slipped during development.

Release - This is the final version that we ship to the app store. Because the release apk also needs to get signed, we wanted to make sure that the password for our release keystore is never stored anywhere besides our password managers. So we pass it in as a parameter when building with the following settings under signingConfigs:

release {
    storeFile file("release.keystore")
    storePassword project.hasProperty('storePass') ? project.storePass : "default_pass"
    keyAlias "release"
    keyPassword project.hasProperty('storePass') ? project.storePass : "default_pass"
}

It takes the password from the command line with -P<property_name>="<property_value>". You can read more here

    gradle assembleRelease -PstorePass='password'

If the password is not set for assembleRelease we throw an error:

gradle.taskGraph.whenReady { taskGraph ->
    if (taskGraph.hasTask(':assembleRelease') && !project.hasProperty('storePass')) {
        throw new IllegalArgumentException('Run with "-PstorePass=<value>" to sign the release build')
    }
}

There are a few other options on the web for getting the password from the console. The problem we had was an error if we tried to change it after the build had started, like getting it from System.console(). Gradle complained that the password had been messed with. This seems to be new with the 0.9.* release of the build tools.

Smoke/sanity tests & testing Proguard

Ideally we would like to run smoke/sanity tests on our release apk, but since we need to sign it separately, there is no way to do it with CI. For this reason we use our beta build that has Proguard enabled. You can choose which build type is used with the integration tests by specifying it in your script:

android {
   testBuildType "foo"
}

More details on this can be found here. We’ve ended up creating a separate Gradle file, importing the standard build, and overriding some values:

apply from: 'build.gradle'
android {
    sourceSets.androidTest.setRoot('smokeTests')
    sourceSets.androidTest {
        java.srcDirs = ['smokeTests/src']
        res.srcDirs = ['smokeTests/res']
        resources.srcDirs = ['smokeTests/src']
        manifest.srcFile file('test/AndroidManifest.xml')
    }
}

NDK

The Gradle plugin is able to run NDK. This was not officially supported when we were migrating to Gradle. For this reason, we ended up checking the compiled binary files (*.so) into our repo and including them with the build:

android {
    sourceSets {
        main {
            jniLibs.srcDirs = ['libs']
        }
    }
}

Full NDK support is coming and should work out of the box as soon as it’s stable in the plugin.

Package suffixes

Being able to have different package suffixes is something I’m very excited about. This finally allows us to have a debug build in parallel to the production version on one device. You can have a package suffix for different build types like this:

    android {
        buildTypes {
            beta {
                packageNameSuffix ".beta"
            }
        }
    }

Remember to not hard-code the package in your xml files. Instead use:

xmlns:header="http://schemas.android.com/apk/res-auto"

Additional info

Gradle tasks

The Android Gradle plugin updates the task graph dynamically. This means you can’t reference tasks in your scripts directly like:

task.doLast()

Instead you need to do it after the graph has been created:

gradle.taskGraph.whenReady { taskGraph ->
    ...
}

Some tasks might be included depending on your configuration. For instance, the preDex task won’t be included if you configure your build to run with proguard. It’s what it is :(

Performance

One of the major disadvantages of switching to a Gradle build system is increased compile/build time. However, we were able to speed up the compile time for a gradle clean assembleDebug build by a significant amount by adding

    DEFAULT_JVM_OPTS="-Xmx512m"

to our gradlew wrapper file.

We also use Facebook’s fork of proguard that is, according to Facebook, about 2.5X faster than the original Proguard implementation while maintaining identical binary output.

Several different integration tests

We wanted to run smoke/sanity tests, tests that generate screenshots, and normal tests separately. The simplest way we were able to find to do this was to create a separate build file and override the test configuration:

android {
    sourceSets.androidTest.setRoot('smokeTests')
    sourceSets.androidTest {
        java.srcDirs = ['smokeTests/src']
        res.srcDirs = ['smokeTests/res']
        resources.srcDirs = ['smokeTests/src']
        manifest.srcFile file('test/AndroidManifest.xml')
    }
}

This is far from ideal, but it gets the job done.

Trying to be dynamic

A few things we tried to make work dynamically didn’t pan out; e.g. getting the version from the manifest…

The mix of setting the data during the script compile time and generating the task graph dynamically somewhat limits flexibility. We ended up hard coding some of those things. This might not be as elegant or as concise as we would like, but it works.

Linting

If you haven’t used lint on a regular basis, it can be a pain to remove all errors at once. To skip failing on error add:

android {
    lintOptions {
        abortOnError false
    }
}

Unused resources

We’ve created a separate tool to remove unused resources reported by lint. Check out our android-resource-remover.

Bugs

Some bugs you might be interested in following:


15 5 / 2014

Remove unused Android resources to reduce APK size

I’ve been programming Android applications for 2.5 years. I mainly work on the KeepSafe Hide pictures app but I also built our Hide SMS app and some other small private projects.

Over time, one problem I keep running into over and over again is dealing with resource bloat — which happens when you keep adding new resources but there’s no good way to remove the ones that are no longer in use. In a typical web project, resource bloat doesn’t really affect performance because the browser will only load what’s needed. On mobile, however, it’s a real problem because space is much more of a commodity, and all of your unused resources will end up getting shipped to the user.

To fix this, I wrote a small tool called android-resource-remover that pinpoints unused resources and removes them from your codebase.

The problem

In Android, resources are everything. Your layout XML files define your views, images, strings, colors, dimensions, animations, and so on. Different screen densities across different devices make you store every image in 4-5 different resolutions with the same name in different folders. In cases with graphical buttons, you need to provide a different image for different button states and then wire it all up in an additional XML in your drawable folder. This makes it easy to have one button in 11 different files. The same goes for strings; e.g. our app is translated into 10 different languages. Adding a new string to the app will introduce a new <string> tag in 10 different XML files.

If I want to stop using a button, I have to find out what image resources my drawable XML is using and then delete everything. In the case of the string that I just defined, I have to delete the string in those 10 different files and remove them inline.

Because there’s so much overhead and manual work, I pretty much never did resource cleanup.

Solution and Usage

Today, the Android Developer Tools come with a tool called Lint that gives you great output on what is unused in your codebase. However, you still have to remove everything yourself. For this reason we wrote a small tool called android-resource-remover that will remove all of your unused resources based on Lint’s output.

You can install it with:

pip install android-resource-remover

Usage is pretty easy too. You just run

android-resource-remover

in your Android project directory. android-resource-remover expects that you have lint in your classpath. In case you don’t, you can specify the location of your lint tool with --lint

You can find more information about how to use it and the source under https://github.com/KeepSafe/android-resource-remover. In case you build your Android project with Gradle, take a look at the tool’s README for instructions.

One thing to note is that you might have to run android-resource-remover twice because of dependencies. In other words, new unused resources will likely emerge after you run it for the first time. For instance, an unused layout in your code will be removed on the first run. Then, when you run it again, it will point out all the strings and graphics that have been used by this layout, as there is no reference to them anymore.

Technical results

What is the difference in APK size after running android-resource-remover on our hide pictures app? I branched off from our develop branch to build a different APK after removing all unused resources to compare file size. The result is quite fascinating. Even with our rather large app (due to native code and other libs), I can reduce the APK size by 10% by shaving off ~1MB. Here is the ls -la output of that dir.

-rw-r--r--    1 philipp  staff    10M Apr 30 09:22 App_without_unused_resources.apk
-rw-r--r--    1 philipp  staff    11M Apr 30 09:07 App_including_unused_resources.apk

How does it work?

android-resource-remover is based off the result XML from the Android Lint tool. There are other tools out there that tell you about unused resources, but we feel strongly about using the output of the most supported tool over time. In this case, Lint from the Android project itself seemed like a solid approach; it allowed us to focus on removing the actual resources from disk and from within the other XML resource files.

The Lint tool reports a lot of things you can improve in your app. For this project, we’re only looking at issues that are tagged with UnusedResources. So we’re searching the entire output XML for

    root.findall('.//issue[@id="UnusedResources"]')

The slightly tricky part is that Lint output is not currently made for tools like this, so there is no clear indication if a resource is an entire file or just a single element in a resource.xml file (e.g. string.xml). Our current approach is to look at the location value of the unused resource from Lint’s output. If it contains line or column, we expect the resource to be within a file and not the entire file.

From here on it’s pretty easy: resources that are an entire file just get deleted. For resources that we identified as an element within an XML file, we first parse the resource XML and then remove that element. All this happens in android_clean_app.py, either in remove_resource_file() or in remove_resource_value().
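To illustrate the heuristic only: the tool itself is written in Python, and the Java sketch below is not its code. The class name, the FILE/VALUE labels, and the assumption that each issue carries a location element with file, line and column attributes are mine, so check them against a real Lint report before relying on them.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class UnusedResourceClassifier {
    /** Labels each UnusedResources issue as a whole FILE or a VALUE inside a file. */
    static List<String> classify(String lintXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(lintXml.getBytes(StandardCharsets.UTF_8)));
        List<String> result = new ArrayList<>();
        NodeList issues = doc.getElementsByTagName("issue");
        for (int i = 0; i < issues.getLength(); i++) {
            Element issue = (Element) issues.item(i);
            if (!"UnusedResources".equals(issue.getAttribute("id"))) {
                continue; // only unused resources are of interest here
            }
            NodeList locations = issue.getElementsByTagName("location");
            if (locations.getLength() == 0) {
                continue; // nothing to act on without a location
            }
            Element location = (Element) locations.item(0);
            // A line or column means the resource is an element inside the file.
            boolean insideFile = location.hasAttribute("line") || location.hasAttribute("column");
            result.add((insideFile ? "VALUE " : "FILE ") + location.getAttribute("file"));
        }
        return result;
    }
}
```

Entries labeled FILE would be deleted outright, while entries labeled VALUE would need the element removed from the XML file they live in.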

While writing this, I’m contemplating whether we should just submit a patch to the Android Lint tool that gives us clearer output about whether a resource is a value within an XML file or an entire file.


I hope this will help as many Android developers over time as possible. I’m very happy to hear your comments, suggestions, experiences and how you think we can improve android-resource-remover. Feel free to submit bugs or open pull requests for improvements.


31 7 / 2012

A/B testing for mobile apps made easy - how we built Switchboard

TLDR: Implement A/B testing for your app in hours. Switchboard is a lightweight mobile A/B testing framework with consistent user segmentation. It can be used for A/B testing, staged rollout and remote configuration. It’s designed to serve high traffic and be as flexible as possible. Released under the Apache License, you can download it from: http://www.github.com/KeepSafe/Switchboard

Problem: On mobile, you can’t just simply roll back

Many mobile devs have been through this: added a new feature, submitted it to the app store and just after it is finally out there you realize that something is not working. For whatever reason.

If you are building a web-only product, solving this is usually fairly easy. You fix your bugs, deploy your code to your servers - done!

On mobile, it’s different. Once it’s shipped, you can’t take it back. The only thing you can do is fix, push an update and pray that it gets published quickly. On Android it will take a couple hours. For iPhone…. you know, longer.

And then you still haven’t answered emails from those annoyed users who get too frequent updates.

You can avoid all that by plugging Switchboard into your app and using it to react quickly. Our mobile app, KeepSafe, has millions of users on a multitude of different devices, form factors, OS versions, language settings etc. We do thorough testing but we don’t have all the devices in all the configurations available. Our solution: Switchboard.

We built Switchboard for three main use cases

  1. Staged rollout of new features
  2. A/B testing of features
  3. remote configuration

Staged Rollout

Because of the Android device fragmentation, the staged rollout was really something we needed to ensure a good user experience. With Switchboard, we can release a new feature to only a subset of users and see what causes trouble and what does not. This comes in really handy because we also use Crittercism to get real-time crash reports. We can roll back for some or all devices as soon as we see problems with a particular configuration.
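To make the idea of releasing to a stable subset of users concrete, here is a minimal sketch of one common way to do consistent segmentation: hash a user ID into one of 100 buckets and enable the feature for the lowest buckets. This is an illustration under my own assumptions, not Switchboard’s actual implementation; the RolloutBucket class, the use of CRC32 and the 100-bucket split are all hypothetical choices.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

final class RolloutBucket {
    /** Maps a user ID to a stable bucket between 0 and 99. */
    static int bucketFor(String userId) {
        CRC32 crc = new CRC32();
        crc.update(userId.getBytes(StandardCharsets.UTF_8));
        // getValue() is a non-negative long, so the remainder is in 0..99.
        return (int) (crc.getValue() % 100);
    }

    /** Enables a feature for roughly rolloutPercent percent of all users. */
    static boolean isEnabled(String userId, int rolloutPercent) {
        return bucketFor(userId) < rolloutPercent;
    }
}
```

Because the bucket depends only on the ID, a user keeps the same variant across sessions, and raising the percentage only ever adds users to the group that already has the feature.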

A/B testing

Observation beats theorizing when it comes to determining user preferences. There are many optimization questions and design decisions where testing to find the best variant would have been really helpful.

In previous projects, we worked on web products and could really leverage the power of A/B testing. On mobile that’s a little harder, especially for parts of the app that just have to be native. We didn’t find a lightweight solution, but our staged rollout mechanism turned out to work just as well for this case.

Remote configuration

Switchboard allows you to wrap 3rd party libraries. This lets you turn off 3rd party code that you don’t control in case it starts breaking. There is nothing worse than a 3rd party SDK crashing your app once it’s out there. Be it an analytics package or the customer support API, with Switchboard we can turn it off. Another use case for remote configuration is changing the API endpoints of old app versions that users don’t update.
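A minimal sketch of what wrapping a 3rd party library behind a remote switch could look like. Everything here is hypothetical and only illustrates the pattern: `RemoteSwitches` stands in for whatever holds the downloaded configuration, `AnalyticsSdk` stands in for any 3rd party SDK, and the switch name `analytics_enabled` is made up. None of these names come from the Switchboard API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical holder for remotely configured switches (not the Switchboard API).
final class RemoteSwitches {
    private final Map<String, Boolean> switches = new ConcurrentHashMap<>();

    void set(String name, boolean on) {
        switches.put(name, on);
    }

    // Unknown switches fall back to the default that shipped with the app.
    boolean isOn(String name, boolean defaultValue) {
        return switches.getOrDefault(name, defaultValue);
    }
}

// Hypothetical interface standing in for any 3rd party SDK.
interface AnalyticsSdk {
    void track(String event);
}

// The app talks to this wrapper instead of calling the SDK directly.
final class GuardedAnalytics implements AnalyticsSdk {
    private final AnalyticsSdk delegate;
    private final RemoteSwitches switches;

    GuardedAnalytics(AnalyticsSdk delegate, RemoteSwitches switches) {
        this.delegate = delegate;
        this.switches = switches;
    }

    @Override
    public void track(String event) {
        if (!switches.isOn("analytics_enabled", true)) {
            return; // switched off remotely: never touch the SDK
        }
        try {
            delegate.track(event);
        } catch (RuntimeException e) {
            // A misbehaving SDK should not crash the app.
        }
    }
}
```

The point of the wrapper is that there is exactly one place where the SDK is called, so one remote switch is enough to take it out of the running app.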

Simple does it: Switchboard design

For weeks we wished for a tool to do staged rollouts with. We shied away from building it because it would distract us and take up a lot of time. Eventually, we came to a point where we changed a major part of KeepSafe and errors would impact millions of users. Now we really needed to be able to switch the next release off, should it break. With our next code cut-off 3 days away we said: “Build something we can make work by Tuesday”. And so we built it: quick, simple and not polished, but it does the job. Startup life :)

Switchboard was designed as a super lightweight tool that can handle lots of traffic from a few servers and scale horizontally. It should be as flexible as possible in terms of usage and robust against downtime or connectivity problems in the communication.

Client (example code from the Android SDK)

Every part of our app contains one or more multivariate switches. The app ships with a default setting for each of those.

On app start, the app downloads a configuration file, generated on the fly, that determines the latest settings for this specific user. The configuration file is generated based on device, OS version, language etc.

SwitchBoard.loadConfig(getApplicationContext());

If there is a successful response, the Switchboard config is updated. Otherwise, it just uses the last saved version and catches every possible exception so that your app does not crash. Requests are made asynchronously so they don’t block the UI. The generated configuration file can also contain custom variables that are set on the server and parsed in the client code at runtime.
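The fallback behavior described above can be sketched roughly as follows. This is not the Switchboard client code: `ConfigLoader` and its methods are hypothetical names, the config is reduced to a map of boolean switches, and the fetch runs synchronously here to keep the example short (the real client makes the request asynchronously).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;

// Hypothetical loader: keeps the last good config and never lets a failed fetch escape.
final class ConfigLoader {
    private volatile Map<String, Boolean> lastSaved;

    // Start from the defaults that ship with the app.
    ConfigLoader(Map<String, Boolean> shippedDefaults) {
        this.lastSaved = Collections.unmodifiableMap(new HashMap<>(shippedDefaults));
    }

    // Try to fetch a fresh config; on any failure keep using the last saved one.
    Map<String, Boolean> refresh(Callable<Map<String, Boolean>> fetch) {
        try {
            Map<String, Boolean> fresh = fetch.call();
            if (fresh != null) {
                lastSaved = Collections.unmodifiableMap(new HashMap<>(fresh));
            }
        } catch (Exception e) {
            // Network down, bad response, parse error: fall through to the saved config.
        }
        return lastSaved;
    }
}
```

A caller passes in whatever performs the download, for example `loader.refresh(() -> downloadConfig())`, where `downloadConfig` is again a placeholder for your own request code.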

The configuration update should happen at a point in the app where changes won’t confuse the user. In our case, we update our config on the login screen.

All the logic is on the server side, so you have fast access and ongoing control over the executed code.

We implemented a client library for Android and iOS that makes it easy to get started. The client supports production and staging environments and creates its own unique user ID (UUID) if you don’t have one already.

Embedding an experiment on the client is super easy.

// Get the setting from Switchboard
boolean isSmiling = SwitchBoard.isInExperiment(getApplicationContext(), "experimentName");

// Switch between the two code paths
if (isSmiling) {
    // Code for users who are in the experiment
    showSmileyWelcomeMessage();
} else {
    // Code for everyone else
    showNormalFaces();
}

Have a look at our example application for more details.

Server

All the configuration logic resides on the server. Client requests for configuration are processed here and the configuration is sent to the client as a simple JSON string. The core parameter is the unique device ID that is passed from the client. Based on that, each user is segmented into buckets. Each user remains in the same bucket for their lifetime. The server is designed to work without a database or any IO operations, for maximum performance on a large user base with as few servers as possible. We don’t want to worry about how to scale our tools.
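One way to get stable buckets without a database is to derive the bucket from a hash of the user ID, so the same ID always maps to the same bucket. The sketch below illustrates that idea only. It is written in Java to match the client examples (the Switchboard server itself is PHP), the choice of CRC32 is an assumption, and `Bucketing.bucketFor` is a made-up name, so do not expect it to reproduce Switchboard’s actual bucket numbers.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper: maps a user ID to a stable bucket without storing anything.
final class Bucketing {
    // Returns a bucket in the range 0 .. bucketCount - 1.
    static int bucketFor(String userId, int bucketCount) {
        CRC32 crc = new CRC32();
        crc.update(userId.getBytes(StandardCharsets.UTF_8));
        // getValue() is a non-negative 32-bit checksum held in a long.
        return (int) (crc.getValue() % bucketCount);
    }

    public static void main(String[] args) {
        String userId = "3f2b8c1e-example-user-id"; // made-up ID for illustration
        System.out.println("bucket = " + bucketFor(userId, 100));
    }
}
```

Because the bucket is a pure function of the ID, any server instance can answer any request and no state has to be shared between them.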

The device sends parameters to the server. By default these are the UUID (from which the user buckets are generated automatically), device OS version, app version, language and other system parameters.

Based on these the server decides which switch/experiment is set ON or OFF.

$manager->turnOnBucket(0, 50); // an experiment for 50% of all users

Besides a boolean status that indicates whether the switch/experiment is turned on for a specific user, you can pass values to the client for each experiment. Values are sent as a JSON object, so you can put everything in there that JSON can handle. These can then be used dynamically in code.

if($this->manager->isInBucket(0, 50)) {
    $values = array();
    $values['message'] = 'You are not an english user dude. So the message is not displayed';
    $values['messageTitle'] = 'get KeepSafe ver 2';
    return $this->manager->activeExperimentReturnArray($values);
}

Infrastructure

Because we wanted to get something out of the door fast that does the job, we couldn’t afford to think much about infrastructure setup. We went with an easy and robust solution, Heroku. Currently, we serve more than 1 million requests per day with two Dynos and we could serve way more, since there is not much heavy lifting involved. Heroku also allows us to scale Switchboard horizontally without any effort.

To make it easy for you to try Switchboard, we have set up a running instance of the server with a sample implementation. All the Switchboard example apps are pointing to that instance so you can play with it.

How we use Switchboard

User grouping - buckets

We divide our users into buckets based on a UUID that the client computes and sends to the server at request time. We split our whole user base into 100 buckets. This allows us to address our user base on a fairly granular level. You can divide your users into 1000 buckets if you have a larger user base.

This bucket grouping is orthogonal to any other parameter by which we segment our user base, be it device, OS, country, language. This makes it easy to keep an overview of how many people see what feature.
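To illustrate how bucket ranges combine with the other parameters, here is a small sketch of a rollout rule. It is written in Java to match the client examples rather than in the server’s PHP, and all of it is hypothetical: the `ClientInfo` fields, the half-open bucket range and the `newGalleryEnabled` rule are assumptions for illustration, not Switchboard code.

```java
// Hypothetical request parameters; field names are illustrative only.
final class ClientInfo {
    final int bucket;      // 0..99, derived from the user ID
    final String language; // e.g. "en"
    final int osVersion;   // e.g. Android API level

    ClientInfo(int bucket, String language, int osVersion) {
        this.bucket = bucket;
        this.language = language;
        this.osVersion = osVersion;
    }
}

final class RolloutRule {
    // True when the bucket falls in the half-open range [low, high).
    static boolean inBuckets(ClientInfo client, int low, int high) {
        return client.bucket >= low && client.bucket < high;
    }

    // Example rule: 20 of 100 buckets, English only, API level 16 or newer.
    static boolean newGalleryEnabled(ClientInfo client) {
        return inBuckets(client, 0, 20)
                && "en".equals(client.language)
                && client.osVersion >= 16;
    }
}
```

Since the bucket does not depend on language or OS version, widening the range from 20 to 50 buckets grows the audience by the same proportion within every segment.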

Keeping track

Switchboard does not take care of tracking. You can implement your existing logging/analytics solution to track results.

We currently use a simple combination of Google Spreadsheet, Google Analytics and Crittercism. In a basic spreadsheet we record when we turn an experiment on or off or change something, alongside our core metrics. This makes it super easy to see changes in the core metrics and to associate them with the experiments we run or the new features we released.

For tracking we use Google Analytics. The main reason is that it’s free. We use Google Analytics custom variables to build our own segmentation by app version and install date. To track experiments we use standard Google Analytics events. Depending on the experiment, we pass the event label from the server to the client config. This allows us to change experiments without updating the app, which is particularly useful when you want to test messaging and click-through rates.

For rolling out new features, we rely on Crittercism, which provides live crash reports at the device level. This is especially powerful on Android when you roll out a new feature that might break on some individual devices. Using Crittercism and the user feedback in our helpdesk, we can see problems post-launch and quickly roll back, or turn on new features for more users once they prove to be stable. Switchboard also allows us to remotely turn off features that are not supported by specific devices.

How you should use it

We don’t know. But please - download it, try it, fork it, improve it.

We have example apps for all clients and the server on the GitHub project page. The example applications work out of the box with the example server code running on our server.

We’d love to hear what you think. Here is the link again: http://www.github.com/KeepSafe/Switchboard

If you start implementing or using Switchboard, or want to help improve it, I would be happy to hear from you. Write me an email: philipp [at] getkeepsafe (dot) com

Comments and Discussion

Please comment on HN: http://news.ycombinator.com/item?id=4319905
