Creating a photo mosaic web app

A few months ago, my good friend Satish Talim had a great idea to create a series of Go challenges to help Go programmers up their game. The idea was to create a programming problem every month (or so) that presents a fresh and interesting challenge to the Go community. There would be prizes to win, but more importantly, it was a community effort to help each other improve. Satish asked me to write a challenge, and I readily agreed to create challenge #3.

Having been a web application programmer for most of my career, the most natural thing for me to do was to create a challenge based on building a web application. Some time back I had also written a mosaic-generating script in Ruby during a hackathon, so I thought of marrying the two ideas and making the challenge about creating a photo mosaic web app.

To be honest, at the time I issued the challenge I hadn’t actually written the photo mosaic web app yet. In fact, I only started writing it after the challenge was over. It took me the better part of 2 days to complete the photo mosaic web app. But I wasn’t done there; I wanted to go a bit further and use Go’s concurrency to improve its performance. This blog post is the result of what I did.

The code for this blog post is found at https://github.com/sausheong/mosaic. Note that the code in GitHub has slight variations from the code described below.

The live site is found at http://mosaic.saush.com. It is deployed using Docker to Digital Ocean, through Tutum. The performance on the live site is not as good as described here, as it is only running on a 1-CPU VM with 512MB of RAM.

Creating the photo mosaic

A photographic mosaic, or photo mosaic, is a picture (usually a photograph) that has been divided into (usually equal-sized) rectangular sections, each of which is replaced with another picture (called a tile picture). If we view it from far away, or if we squint at it, the original picture can be seen. If we look closer, though, we will see that the picture is in fact made up of many hundreds or thousands of smaller tile pictures.

The basic idea is simple – the web application allows a user to upload a target picture, which will be used to create a photo mosaic. To make things simple, I will assume that tile pictures are already available and are correctly sized.

Let’s start with the photo mosaic algorithm. The steps are simple and the whole web application can be written without the use of any third-party libraries.

  1. Build up a tile database, a hash of tile pictures, by scanning a directory of pictures and using the file name as the key and the average color of the picture as the value. The average color is a 3-tuple calculated by getting the red, green and blue (RGB) values of every pixel, adding up all the reds, greens and blues separately, then dividing each sum by the total number of pixels
  2. Cut the target picture into smaller pictures of the correct tile size
  3. For every tile-sized piece of the target picture, assume the average color to be the color of the top left pixel of that piece
  4. Find the corresponding tile in the tile database that is the nearest match to the average color of the piece of the target picture, and place that tile in the corresponding position in the photo mosaic. To find the nearest match, we calculate the Euclidean distance between the two color 3-tuples by converting each color 3-tuple into a point in a 3-dimensional space
  5. Remove the tile from the tile database so that each tile in the photo mosaic is unique

I placed all the mosaic creating code in a single source file named mosaic.go. Let’s look at each function in this file.

// find the average color of the picture
func averageColor(img image.Image) [3]float64 {
    bounds := img.Bounds()
    r, g, b := 0.0, 0.0, 0.0
    for y := bounds.Min.Y; y < bounds.Max.Y; y++ {
        for x := bounds.Min.X; x < bounds.Max.X; x++ {
            r1, g1, b1, _ := img.At(x, y).RGBA()
            r, g, b = r+float64(r1), g+float64(g1), b+float64(b1)
        }
    }
    totalPixels := float64(bounds.Dx() * bounds.Dy())
    return [3]float64{r / totalPixels, g / totalPixels, b / totalPixels}
}

First is the averageColor function, which takes the red, green and blue values of each pixel in the image, adds up all the reds, greens and blues, and then divides each sum by the total number of pixels in the image. The result is a 3-tuple (actually a 3-element array) of these averages.
Next is the resize function, which resizes an image to a new width.

// resize an image to its new width
func resize(in image.Image, newWidth int) image.NRGBA {
    bounds := in.Bounds()
    width := bounds.Max.X - bounds.Min.X
    ratio := width / newWidth
    out := image.NewNRGBA(image.Rect(bounds.Min.X/ratio, bounds.Min.Y/ratio, bounds.Max.X/ratio, bounds.Max.Y/ratio))
    for y, j := bounds.Min.Y, bounds.Min.Y; y < bounds.Max.Y; y, j = y+ratio, j+1 {
        for x, i := bounds.Min.X, bounds.Min.X; x < bounds.Max.X; x, i = x+ratio, i+1 {
            r, g, b, a := in.At(x, y).RGBA()
            out.SetNRGBA(i, j, color.NRGBA{uint8(r), uint8(g), uint8(b), uint8(a)})
        }
    }
    return *out
}

The tilesDB function creates a database of tile pictures by scanning the directory where the tile pictures are located.

// populate a tiles database in memory
func tilesDB() map[string][3]float64 {
    fmt.Println("Start populating tiles db ...")
    db := make(map[string][3]float64)
    files, _ := ioutil.ReadDir("tiles")
    for _, f := range files {
        name := "tiles/" + f.Name()
        file, err := os.Open(name)
        if err == nil {
            img, _, err := image.Decode(file)
            if err == nil {
                db[name] = averageColor(img)
            } else {
                fmt.Println(":", err, name)
            }
        } else {
            fmt.Println("cannot open file", name, err)
        }
        file.Close()
    }
    fmt.Println("Finished populating tiles db.")
    return db
}

The tile database is a map with a string as the key and a 3-tuple (in this case a 3-element array) as the value. I open up each image file in the directory, then get the average color of the image to create an entry in the map. The tile database is used to find the correct tile picture in the tile picture directory. It is passed into the nearest function, along with the target color 3-tuple.

// find the nearest matching image
func nearest(target [3]float64, db *map[string][3]float64) string {
    var filename string
    smallest := 1000000.0
    for k, v := range *db {
        dist := distance(target, v)
        if dist < smallest {
            filename, smallest = k, dist
        }
    }
    delete(*db, filename)
    return filename
}

Each entry in the tile database is compared with the target color and the entry with the smallest distance is returned as the nearest tile, and also removed from the tile database. The distance function calculates the Euclidean distance between two 3-tuples.

// find the Euclidean distance between 2 points
func distance(p1 [3]float64, p2 [3]float64) float64 {
    return math.Sqrt(sq(p2[0]-p1[0]) + sq(p2[1]-p1[1]) + sq(p2[2]-p1[2]))
}

// find the square
func sq(n float64) float64 {
    return n * n
}

Finally, scanning and loading the tile database every time a photo mosaic is created can be pretty cumbersome. I want to do that only once, and clone the tile database each time a photo mosaic is created. The source tile database, TILESDB, is therefore created as a global variable and populated at the start of the web application.

var TILESDB map[string][3]float64

// clone the tile database each time we generate the photo mosaic
func cloneTilesDB() map[string][3]float64 {
    db := make(map[string][3]float64)
    for k, v := range TILESDB {
        db[k] = v
    }
    return db
}

The photo mosaic web application

With the mosaic-generating functions in place, I can start writing my web application. The web application is placed in a source code file named main.go.

package main

import (
    "bytes"
    "encoding/base64"
    "fmt"
    "html/template"
    "image"
    "image/draw"
    "image/jpeg"
    "net/http"
    "os"
    "strconv"
    "time"
)

func main() {
    mux := http.NewServeMux()
    files := http.FileServer(http.Dir("public"))
    mux.Handle("/static/", http.StripPrefix("/static/", files))
    mux.HandleFunc("/", upload)
    mux.HandleFunc("/mosaic ", mosaic)
    server := &http.Server{
        Addr:    "127.0.0.1:8080",
        Handler: mux,
    }
    // building up the source tile database
    TILESDB = tilesDB()
    fmt.Println("Mosaic server started.")
    server.ListenAndServe()
}

// to display the upload page
func upload(w http.ResponseWriter, r *http.Request) {
    t, _ := template.ParseFiles("upload.html")
    t.Execute(w, nil)
}

// the HandlerFunc that contains all the photo mosaic generating algorithms
func mosaic(w http.ResponseWriter, r *http.Request) {
    t0 := time.Now()
    // get the content from the POSTed form
    r.ParseMultipartForm(10485760) // max body in memory is 10MB
    file, _, _ := r.FormFile("image")
    defer file.Close()
    tileSize, _ := strconv.Atoi(r.FormValue("tile_size"))
    // decode and get original image
    original, _, _ := image.Decode(file)
    bounds := original.Bounds()
    // create a new image for the mosaic
    newimage := image.NewNRGBA(image.Rect(bounds.Min.X, bounds.Min.Y, bounds.Max.X, bounds.Max.Y))
    // build up the tiles database
    db := cloneTilesDB()
    // source point for each tile, which starts with 0, 0 of each tile
    sp := image.Point{0, 0}
    for y := bounds.Min.Y; y < bounds.Max.Y; y = y + tileSize {
        for x := bounds.Min.X; x < bounds.Max.X; x = x + tileSize {
            // use the top left most pixel as the average color
            r, g, b, _ := original.At(x, y).RGBA()
            color := [3]float64{float64(r), float64(g), float64(b)}
            // get the closest tile from the tiles DB
            nearest := nearest(color, &db)
            file, err := os.Open(nearest)
            if err == nil {
                img, _, err := image.Decode(file)
                if err == nil {
                    // resize the tile to the correct size 
                    t := resize(img, tileSize)
                    tile := t.SubImage(t.Bounds())
                    tileBounds := image.Rect(x, y, x+tileSize, y+tileSize)
                    // draw the tile into the mosaic
                    draw.Draw(newimage, tileBounds, tile, sp, draw.Src)
                } else {
                    fmt.Println("error:", err, nearest)
                }
            } else {
                fmt.Println("error:", nearest)
            }
            file.Close()
        }
    }

    buf1 := new(bytes.Buffer)
    jpeg.Encode(buf1, original, nil)
    originalStr := base64.StdEncoding.EncodeToString(buf1.Bytes())

    buf2 := new(bytes.Buffer)
    jpeg.Encode(buf2, newimage, nil)
    mosaic := base64.StdEncoding.EncodeToString(buf2.Bytes())
    t1 := time.Now()
    images := map[string]string{
        "original": originalStr,
        "mosaic":   mosaic,
        "duration": fmt.Sprintf("%v ", t1.Sub(t0)),
    }
    t, _ := template.ParseFiles("results.html")
    t.Execute(w, images)
}

The main logic for creating the photo mosaic is in the mosaic function, which is a handler function. First, I get the uploaded file and also the tile size from the form.

// get the content from the POSTed form
r.ParseMultipartForm(10485760) // max body in memory is 10MB
file, _, _ := r.FormFile("image")
defer file.Close()
tileSize, _ := strconv.Atoi(r.FormValue("tile_size"))

Next, I decode the uploaded target image, and also create a new photo mosaic image.

// decode and get original image
original, _, _ := image.Decode(file)
bounds := original.Bounds()
// create a new image for the mosaic
newimage := image.NewNRGBA(image.Rect(bounds.Min.X, bounds.Min.Y, bounds.Max.X, bounds.Max.Y))

I also clone the source tile database, and set up the source point for each tile (this is needed by the image/draw package later).

// build up the tiles database
db := cloneTilesDB()
// source point for each tile, which starts with 0, 0 of each tile
sp := image.Point{0, 0}

I am now ready to iterate through each tile-sized piece of the target image.

for y := bounds.Min.Y; y < bounds.Max.Y; y = y + tileSize {
    for x := bounds.Min.X; x < bounds.Max.X; x = x + tileSize {
        // use the top left most pixel color as the average color
        r, g, b, _ := original.At(x, y).RGBA()
        color := [3]float64{float64(r), float64(g), float64(b)}
        // get the closest tile from the tiles DB
        nearest := nearest(color, &db)
        file, err := os.Open(nearest)
        if err == nil {
            img, _, err := image.Decode(file)
            if err == nil {
                // resize the tile to the correct size 
                t := resize(img, tileSize)
                tile := t.SubImage(t.Bounds())
                tileBounds := image.Rect(x, y, x+tileSize, y+tileSize)
                // draw the tile into the mosaic
                draw.Draw(newimage, tileBounds, tile, sp, draw.Src)
            } else {
                fmt.Println("error:", err, nearest)
            }
        } else {
            fmt.Println("error:", nearest)
        }
        file.Close()
    }
}

For every piece, I pick the top left pixel and assume that’s the average color. Then I find the nearest tile in the tile database that matches this color. The tile database gives me a filename, so I open up the tile picture, and resize it to the tile size. The resultant tile is drawn into the photo mosaic I created earlier.

Once the photo mosaic is created, I encode it into JPEG format, then encode it once again into a base64 string.

    buf1 := new(bytes.Buffer)
    jpeg.Encode(buf1, original, nil)
    originalStr := base64.StdEncoding.EncodeToString(buf1.Bytes())

    buf2 := new(bytes.Buffer)
    jpeg.Encode(buf2, newimage, nil)
    mosaic := base64.StdEncoding.EncodeToString(buf2.Bytes())
    t1 := time.Now()
    images := map[string]string{
        "original": originalStr,
        "mosaic":   mosaic,
        "duration": fmt.Sprintf("%v ", t1.Sub(t0)),
    }
    t, _ := template.ParseFiles("results.html")
    t.Execute(w, images)

The original target picture and the photo mosaic are then sent to the results.html template to be displayed on the next page. As you can see, each image is displayed using a data URL with the base64 content embedded in the web page itself.

<!DOCTYPE html>
<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    <title>Mosaic</title>
    ...
  </head>   
  <body>
    <div class='container'>
        <div class="col-md-6">
          <img src="data:image/jpg;base64,{{ .original }}" width="100%">
          <div class="lead">Original</div>
        </div>
        <div class="col-md-6">
          <img src="data:image/jpg;base64,{{ .mosaic }}" width="100%">
          <div class="lead">Mosaic – {{ .duration }} </div>
        </div>
        <div class="col-md-12 center">
          <a class="btn btn-lg btn-info" href="/">Go Back</a>
        </div>
    </div>   
    <br>
  </body>
</html>

Here’s a screenshot of the mosaic that’s created.

Basic photo mosaic web app

Now that we have the basic mosaic generating web application, let’s create the concurrent version of it.

Concurrent photo mosaic web application

One of the more frequent uses of concurrency is to improve performance. The web application I just showed creates a mosaic from a 151KB JPEG image in about 2.25 seconds. That performance is not fantastic and can definitely be improved with some concurrency. The algorithm I am using in this example to build some concurrency into the photo mosaic web application is quite straightforward.

  1. Split the original image into 4 quarters
  2. Process them at the same time
  3. Combine the results back into a single mosaic

From a diagrammatic point of view:

Concurrency algorithm

A word of caution – this is not the only way that performance can be improved or concurrency can be achieved, but only one relatively simple and straightforward way.

The only function that changes is the mosaic handler function. In the earlier program, I had a single handler function that created the photo mosaic. In the concurrent version of the photo mosaic web application, I break that function up into two separate functions, called cut and combine respectively. Both are called from the mosaic handler function.

func mosaic(w http.ResponseWriter, r *http.Request) {
    t0 := time.Now()
    r.ParseMultipartForm(10485760) // max body in memory is 10MB
    file, _, _ := r.FormFile("image")
    defer file.Close()
    tileSize, _ := strconv.Atoi(r.FormValue("tile_size"))
    original, _, _ := image.Decode(file)
    bounds := original.Bounds()
    db := cloneTilesDB()

    // fan-out
    c1 := cut(original, &db, tileSize, bounds.Min.X, bounds.Min.Y, bounds.Max.X/2, bounds.Max.Y/2)
    c2 := cut(original, &db, tileSize, bounds.Max.X/2, bounds.Min.Y, bounds.Max.X, bounds.Max.Y/2)
    c3 := cut(original, &db, tileSize, bounds.Min.X, bounds.Max.Y/2, bounds.Max.X/2, bounds.Max.Y)
    c4 := cut(original, &db, tileSize, bounds.Max.X/2, bounds.Max.Y/2, bounds.Max.X, bounds.Max.Y)

    // fan-in
    c := combine(bounds, c1, c2, c3, c4)

    buf1 := new(bytes.Buffer)
    jpeg.Encode(buf1, original, nil)
    originalStr := base64.StdEncoding.EncodeToString(buf1.Bytes())

    t1 := time.Now()
    images := map[string]string{
        "original": originalStr,
        "mosaic":   <-c,
        "duration": fmt.Sprintf("%v ", t1.Sub(t0)),
    }

    t, _ := template.ParseFiles("results.html")
    t.Execute(w, images)
}

Cutting up the image is handled by the cut function, in what is known as the fan-out pattern.

Splitting the target picture into 4 quadrants

The original image is cut up into 4 quadrants to be processed separately.

c1 := cut(original, &db, tileSize, bounds.Min.X, bounds.Min.Y, bounds.Max.X/2, bounds.Max.Y/2)
c2 := cut(original, &db, tileSize, bounds.Max.X/2, bounds.Min.Y, bounds.Max.X, bounds.Max.Y/2)
c3 := cut(original, &db, tileSize, bounds.Min.X, bounds.Max.Y/2, bounds.Max.X/2, bounds.Max.Y)
c4 := cut(original, &db, tileSize, bounds.Max.X/2, bounds.Max.Y/2, bounds.Max.X, bounds.Max.Y)

You might notice that these are regular function calls and not goroutines, so how can they run concurrently? The answer is that the cut function itself starts an anonymous goroutine and returns a channel.

func cut(original image.Image, db *map[string][3]float64, tileSize, x1, y1, x2, y2 int) <-chan image.Image {
    c := make(chan image.Image)
    sp := image.Point{0, 0}
    go func() {
        newimage := image.NewNRGBA(image.Rect(x1, y1, x2, y2))
        for y := y1; y < y2; y = y + tileSize {
            for x := x1; x < x2; x = x + tileSize {
                r, g, b, _ := original.At(x, y).RGBA()
                color := [3]float64{float64(r), float64(g), float64(b)}
                nearest := nearest(color, db)
                file, err := os.Open(nearest)
                if err == nil {
                    img, _, err := image.Decode(file)
                    if err == nil {
                        t := resize(img, tileSize)
                        tile := t.SubImage(t.Bounds())
                        tileBounds := image.Rect(x, y, x+tileSize, y+tileSize)
                        draw.Draw(newimage, tileBounds, tile, sp, draw.Src)
                    } else {
                        fmt.Println("error:", err)
                    }
                } else {
                    fmt.Println("error:", nearest)
                }
                file.Close()
            }
        }
        c <- newimage.SubImage(newimage.Rect)
    }()
    return c
}

The logic is exactly the same as in the original photo mosaic web application. I create a channel in the cut function, start up an anonymous goroutine that sends its result to this channel, and then return the channel. This way, the channel is returned to the mosaic handler function immediately, and the completed photo mosaic segment is sent into the channel when the processing is done. You might also notice that while I create the channel as a bi-directional channel, the function’s return type turns it into a receive-only channel for the caller.
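
Here is a minimal sketch of that pattern in isolation (the function name and values below are made up purely for illustration): a normal channel is created inside the function, a goroutine sends into it, and because the declared return type is a receive-only channel, the caller can only ever receive from it.

// sketch only: kick off some work and hand back a receive-only channel
func work() <-chan int {
    c := make(chan int) // created as a normal bi-directional channel
    go func() {
        c <- 42 // send the result when the work is done
    }()
    return c // the return type exposes this as a receive-only channel
}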

I’ve cut the original image into 4 separate pieces and converted each piece into a part of a photo mosaic. It’s time to put them together again, using what is commonly known as the fan-in pattern, in the combine function.

func combine(r image.Rectangle, c1, c2, c3, c4 <-chan image.Image) <-chan string {
    c := make(chan string)
    // start a goroutine
    go func() {
        var wg sync.WaitGroup
        img := image.NewNRGBA(r)
        copy := func(dst draw.Image, r image.Rectangle, src image.Image, sp image.Point) {
            draw.Draw(dst, r, src, sp, draw.Src)
            wg.Done()
        }
        wg.Add(4)
        var s1, s2, s3, s4 image.Image
        var ok1, ok2, ok3, ok4 bool
        for {
            select {
            case s1, ok1 = <-c1:
                go copy(img, s1.Bounds(), s1, image.Point{r.Min.X, r.Min.Y})
            case s2, ok2 = <-c2:
                go copy(img, s2.Bounds(), s2, image.Point{r.Max.X / 2, r.Min.Y})
            case s3, ok3 = <-c3:
                go copy(img, s3.Bounds(), s3, image.Point{r.Min.X, r.Max.Y / 2})
            case s4, ok4 = <-c4:
                go copy(img, s4.Bounds(), s4, image.Point{r.Max.X / 2, r.Max.Y / 2})
            }
            if ok1 && ok2 && ok3 && ok4 {
                break
            }
        }
        // wait till all copy goroutines are complete
        wg.Wait()
        buf2 := new(bytes.Buffer)
        jpeg.Encode(buf2, img, nil)
        c <- base64.StdEncoding.EncodeToString(buf2.Bytes())
    }()
    return c
}

As in the cut function, the main logic for combining the images is in an anonymous goroutine, and I create and return a receive-only channel. As a result, I can encode the original image while the 4 photo mosaic segments are being combined.

In the anonymous goroutine, I create another anonymous function and assign it to the variable copy. This function copies a photo mosaic segment into the final photo mosaic and will be run as a goroutine later. Because the copy function is called as a goroutine, I will not be able to control when it completes. To synchronize the completion of the copying, I use a WaitGroup. I create a WaitGroup wg, then set its counter to 4 using the Add method. Each time the copy function completes, it decrements the counter using the Done method. I call the Wait method just before encoding the image, so that all the copy goroutines have completed and I actually have a complete photo mosaic image.

Remember that the input to the combine function includes the 4 channels coming from the cut function, each carrying a photo mosaic segment, and I don’t know when the channels will actually have segments. I could try to receive from each of those channels in sequence, but that wouldn’t be very concurrent. What I would like to do is to start processing whichever segment comes first, and the select statement fits the bill nicely.

var s1, s2, s3, s4 image.Image
var ok1, ok2, ok3, ok4 bool
for {
    select {
    case s1, ok1 = <-c1:
        go copy(img, s1.Bounds(), s1, image.Point{r.Min.X, r.Min.Y})
    case s2, ok2 = <-c2:
        go copy(img, s2.Bounds(), s2, image.Point{r.Max.X / 2, r.Min.Y})
    case s3, ok3 = <-c3:
        go copy(img, s3.Bounds(), s3, image.Point{r.Min.X, r.Max.Y / 2})
    case s4, ok4 = <-c4:
        go copy(img, s4.Bounds(), s4, image.Point{r.Max.X / 2, r.Max.Y / 2})
    }
    if ok1 && ok2 && ok3 && ok4 {
        break
    }
}

I loop indefinitely, and in each iteration I select whichever channel is ready with a value (if more than one is ready, Go picks one at random). I take the image from that channel and start a goroutine with the copy function. Note that I’m using the multi-value format for receiving from the channel, meaning the second variable (ok1, ok2, ok3 or ok4) tells me whether I have successfully received from the channel. The for loop breaks once I have received successfully on all channels.

Moving on, and referring back to the WaitGroup wg I used earlier: even though I have received all the photo mosaic segments successfully, the 4 copy goroutines I started might not have completed at that point in time. The Wait method on the WaitGroup wg blocks the encoding of the assembled photo mosaic until all the copying is done.

Here’s a screenshot of the results, using the same target picture and tile pictures.

Photo mosaic web application with concurrency

If you’re sharp-eyed, you might see slight differences in the photo mosaic that’s generated. The final photo mosaic is assembled from 4 separate pieces and the algorithm doesn’t smooth out the rough edges. However, you can see the difference in performance – where the basic photo mosaic web application took 2.25 seconds, the concurrent one takes only about a quarter of that time, around 646 milliseconds.

Readers who are really observant might realize that both web applications are actually running on just one CPU! As Rob Pike put it, concurrency is not parallelism – what I’ve shown is how we can take a simple algorithm and break it down into a concurrent one, with no parallelism involved! None of the goroutines are actually running in parallel (since there is only one CPU) even though they are running independently.
Of course, it would be cruel not to go the last step and show how it runs on multiple CPUs. To do this, I simply need to set GOMAXPROCS to the actual number of CPUs on my system. The changes are in the main.go file. Remember to import the runtime package before making the change below.

func main() {
    // this prints out the number of CPUs in my system
    fmt.Println("Number of CPUs:", runtime.NumCPU())
    runtime.GOMAXPROCS(runtime.NumCPU())
    fmt.Println("Starting mosaic server ...")
    mux := http.NewServeMux()
    files := http.FileServer(http.Dir("public"))
    mux.Handle("/static/", http.StripPrefix("/static/", files))
    mux.HandleFunc("/", upload)
    mux.HandleFunc("/mosaic", mosaic)
    server := &http.Server{
        Addr:    "127.0.0.1:8080",
        Handler: mux,
    }
    TILESDB = tilesDB()
    fmt.Println("Mosaic server started.")
    server.ListenAndServe()
}

I compile and then upload the same target picture again.

Photo mosaic web application with concurrency and 8 CPUs

As you can see, the performance has improved 3 times, from 646 milliseconds to 216 milliseconds! And if we compare that with our original photo mosaic web application at 2.25 seconds, that’s a 10 times performance improvement! It is a fair comparison too: even though we did not run the original with 8 CPUs earlier, the original photo mosaic web application is not concurrent and therefore can only ever use one CPU – giving it 8 CPUs makes no difference at all.
What is also interesting to note is that there is no difference between the original and the concurrent web applications in terms of the photo mosaic algorithm. In fact, between the two applications, the mosaic.go source file was not modified at all. The only difference is the concurrency, and that is a testament to how powerful it is.

HTML Forms and Go

This is an excerpt from my new Go Web Programming book that talks about using Go to process HTML forms sent from the browser. This sounds pretty trivial, but as with much of web programming (and much of programming per se), it’s often the trivial things that trip us up.

Go Web Programming

Before we get into getting form data from a POST request, let’s take a closer look at HTML forms and see what they are. Most of the time, POST requests come in the form (pun intended) of an HTML form and often look like this:

<form action="/process" method="post">
<input type="text" name="first_name"/>
<input type="text" name="last_name"/>
<input type="submit"/>
</form>

Within the form tag, we place a number of HTML form elements like text input, text area, radio buttons, checkboxes, file uploads and so on. These elements allow users to enter data to be submitted to the server. Data is submitted to the server when the user clicks a button or somehow triggers the form submission.

We know the data is sent to the server through an HTTP POST request and is placed in the body of the request. But how is the data formatted? HTML form data is always sent as name-value pairs, but how are these name-value pairs formatted in the POST body? It is important for us to know this because when we receive the POST request from the browser, we need to be able to parse the data and extract the name-value pairs.
The format of the name-value pairs sent through a POST request is specified by the content type of the HTML form. This is defined using the enctype attribute like this:

<form action="/process" method="post" enctype="application/x-www-form-urlencoded">
<input type="text" name="first_name"/>
<input type="text" name="last_name"/>
<input type="submit"/>
</form>

The default value for enctype is application/x-www-form-urlencoded, but browsers are required to support at least application/x-www-form-urlencoded and multipart/form-data (HTML5 also supports a text/plain value).

If we set enctype to application/x-www-form-urlencoded, the browser will encode the HTML form data as a long query string, with the name-value pairs separated by an ampersand (&) and each name separated from its value by an equals sign (=). This is the same as URL encoding, hence the name. In other words, the HTTP body will look something like this:

first_name=sau%20sheong&last_name=chang

If we set enctype to multipart/form-data, each name-value pair is converted into a MIME message part, each with its own content type and content disposition. For example, the same form data as above will now look something like this:

------WebKitFormBoundaryMPNjKpeO9cLiocMw
Content-Disposition: form-data; name="first_name"

sau sheong
------WebKitFormBoundaryMPNjKpeO9cLiocMw
Content-Disposition: form-data; name="last_name"

chang
------WebKitFormBoundaryMPNjKpeO9cLiocMw--

When would we use one or the other? If we’re sending simple text data, the URL-encoded form is better as it is simpler, more efficient and requires less processing. If we’re sending large amounts of data, especially when uploading files, the multipart-MIME form is better. We can even specify base64 encoding to send binary data as text.

So far we’ve only talked about POST requests, but what about GET requests in an HTML form? HTML allows the method attribute to be either POST or GET, so this is also a valid form.

<form action="/process" method="get">
<input type="text" name="first_name"/>
<input type="text" name="last_name"/>
<input type="submit"/>
</form>

In this case, there is no request body (GET requests have no request body); all the data is set in the URL as name-value pairs.
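
For example, submitting the form above with the same hypothetical values used earlier would produce a request to a URL that looks something like this:

/process?first_name=sau%20sheong&last_name=chang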

Now that we know how data is sent from a HTML form to the server, let’s go back to the server and see how we use net/http to process the request.

Form

One way to extract data from the HTTP request is to take the URL and the body in their raw form and parse the data ourselves. However, we normally do not need to, because the net/http library provides us with a rather comprehensive set of functions which, although not named entirely intuitively, normally provide us with all we need. Let’s talk about each one of them in turn.

The functions in Request that allow us to extract data from the URL and/or the body revolve around the Form, PostForm and MultipartForm fields. The data is in the form of key-value pairs (which is what we normally get from a POST request anyway). The general algorithm is:

  • Call ParseForm or ParseMultipartForm to parse the request
  • Access Form, PostForm or MultipartForm accordingly

Let’s take a look at some code.

package main

import (
    "fmt"
    "net/http"
)

func process(w http.ResponseWriter, r *http.Request) {
    r.ParseForm()
    fmt.Fprintln(w, r.Form)
}

func main() {
    server := http.Server{
        Addr: "127.0.0.1:8080",
    }
    http.HandleFunc("/process", process)
    server.ListenAndServe()
}

The focus of this server is on these 2 lines:

r.ParseForm()
fmt.Fprintln(w, r.Form)

As mentioned earlier, we need to first parse the request using ParseForm, and then access the Form field.

Let’s take a look at the client that is going to call this server. We’ll create a simple, minimal HTML form to send the request to the server. Place the code in a file named client.html.

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Go Web Programming</title>
</head>
<body>
<form action="http://127.0.0.1:8080/process?hello=world&thread=123" method="post" enctype="application/x-www-form-urlencoded">
<input type="text" name="hello" value="sau sheong"/>
<input type="text" name="post" value="456"/>
<input type="submit"/>
</form>
</body>
</html>

In this form we are:

  • Sending the URL http://localhost:8080/process?hello=world&thread=123 to the server using the POST method
  • Specifying the content type (in the enctype field) to be application/x-www-form-urlencoded
  • Sending 2 HTML form key-value pairs – hello=sau sheong and post=456 to the server

Note that we have 2 values for the key hello. One of them is world in the URL and the other is sau sheong in the HTML form.

Open the client.html file directly in your browser (you don’t need to serve it out from a web server, just running it locally on your browser is fine) and click on the submit button. What you will see on the browser is:

map[thread:[123] hello:[sau sheong world] post:[456]]

This is the raw string version of the Form field in the POST request, after the request has been parsed. Form is a map whose keys are strings and whose values are slices of strings. Notice that the map is not sorted, so you might get a different ordering of the returned values. Nonetheless, what we get is the combination of the query values hello=world and thread=123 as well as the form values hello=sau sheong and post=456. As you can see, the values are URL decoded (there is a space between sau and sheong).

PostForm

Of course, if you just wanted the value for the key post, you can use r.Form["post"], which will give you a slice with 1 element – [456]. If the form and the URL have the same key, both values will be placed in the same slice, with the form value always coming before the URL value.
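
As a quick illustration (continuing the same handler and the same request as above, where ParseForm has already been called), these lines would print the underlying slices directly; the comments show the expected output:

fmt.Fprintln(w, r.Form["post"])  // [456]
fmt.Fprintln(w, r.Form["hello"]) // [sau sheong world]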

What if we need just the form key-value pairs and want to totally ignore the URL key-value pairs? For this we have the PostForm, which only provides key-value pairs for the form and not the URL. If we change from using r.Form to using r.PostForm in the code this is what we get:

map[post:[456] hello:[sau sheong]]

We used application/x-www-form-urlencoded for the content type. What happens if we use multipart/form-data? Make the change to the client HTML form, switch back to using r.Form and let’s find out:

map[hello:[world] thread:[123]]

What happened here? We only get the URL query key-value pairs this time and not the form key-value pairs, because PostForm only supports application/x-www-form-urlencoded. To get multipart key-value pairs from the body, we need to use the MultipartForm.

MultipartForm

Instead of using ParseForm and then calling Form on the request, we have to use ParseMultipartForm then use MultipartForm on the request. ParseMultipartForm also calls ParseForm when necessary.

r.ParseMultipartForm(1024)
fmt.Fprintln(w, r.MultipartForm)

We need to tell ParseMultipartForm how much data we want to extract from the multipart form, in bytes. Now let’s see what happens:

&{map[hello:[sau sheong] post:[456]] map[]}

This time we see the form key-value pairs but not the URL key-value pairs, because MultipartForm contains only the form key-value pairs. Notice that the returned value is no longer a map but a struct containing 2 maps: the first has keys that are strings and values that are slices of strings, while the second is empty. It’s empty because it’s a map with keys that are strings and values that are files.
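
For completeness, that struct is a *multipart.Form: the first map (Value) holds the regular form values, and the second map (File) is where uploaded files would end up. For a hypothetical file input named photo, the uploaded file headers could be reached like this:

fileHeaders := r.MultipartForm.File["photo"] // a slice of *multipart.FileHeader for that field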

There is one last set of functions that allows us to access the key-value pairs even more easily than what we’ve just gone through. The FormValue function allows us to access the key-value pairs just like Form does, except that it is for a specific key and we don’t need to call ParseForm or ParseMultipartForm beforehand – the FormValue function does that for us.

From our previous example, this means if we do this in our handler function:

fmt.Fprintln(w, r.FormValue("hello"))

And if we set the client’s form enctype to application/x-www-form-urlencoded, we will get this:

sau sheong

We get only sau sheong because FormValue only retrieves the first value, even though we actually have both values in Form. To prove this, let’s add another line below the earlier line of code, like this:

fmt.Fprintln(w, r.FormValue("hello"))
fmt.Fprintln(w, r.Form)

This time we’ll see:

sau sheong
map[post:[456] hello:[sau sheong world] thread:[123]]

The PostFormValue function does the same thing, except that it is for PostForm instead of Form. Let’s make some changes to the code to use the PostFormValue function:

fmt.Fprintln(w, r.PostFormValue("hello"))
fmt.Fprintln(w, r.PostForm)

This time we get this instead:

sau sheong
map[hello:[sau sheong] post:[456]]

As you can see we get only the form key-value pairs.

Both FormValue and PostFormValue call ParseMultipartForm for us, so we don’t need to call it ourselves, but there’s a slightly confusing gotcha you should be careful with (at least as of Go 1.4). If we set the client form’s enctype to multipart/form-data and try to get the value using either FormValue or PostFormValue, we won’t be able to get it, even though ParseMultipartForm has been called!

To be clearer, let’s make some changes to the server’s handler function again:

fmt.Fprintln(w, "(1)", r.FormValue("hello"))
fmt.Fprintln(w, "(2)", r.PostFormValue("hello"))
fmt.Fprintln(w, "(3)", r.PostForm)
fmt.Fprintln(w, "(4)", r.MultipartForm)

This is our result from using our form with enctype set to multipart/form-data:

(1) world
(2)
(3) map[]
(4) &{map[hello:[sau sheong] post:[456]] map[]}

The first line in the results gives us the value for hello that’s found in the URL and not the form. The second and third lines tell us why: if we take just the form key-value pairs, we actually get nothing. That’s because FormValue and PostFormValue correspond to Form and PostForm, not MultipartForm. The last line in the results proves that ParseMultipartForm was actually called, which is why we can see the data if we access MultipartForm.

We covered quite a bit in this blog post so let’s recap how these functions are different, in a nice table.

How Form, PostForm, MultipartForm, FormValue and PostFormValue differ

Undoubtedly the naming convention leaves much to be desired!

Go Web Programming

I’ve gone and done it. I’ve started writing another book. And not just any book, but a book on good old honest-to-goodness web programming. If I’m qualified to write about any programming topic, that’s probably it. I’ve been doing web application programming for so long now that almost the first thing I check out in any new programming language I get to know is its http library. I’ve written web applications in almost every programming language I know, and some I’m not even sure I actually know.

So now this. And on the Go language. We’ll see.

http://www.manning.com/chang/

Create 3D anaglyph images with 3 lines of Ruby code

3D has always fascinated me. When I was young, my brother and I had a ViewMaster and a Pan-Pet Panorama Stereo Viewer, both of which totally bowled us over when we first saw them. My brother, as usual, took them apart and fixed them up again, multiple times, while I simply spent hours goggling at them. I have no idea where they are now, but thinking back, they were my first recollection of understanding what stereoscopy is.

ViewMaster
Pan-Pet Panorama Stereo Viewer

A bit of history

It’s probably surprising to most people (at least it was to me) that the modern techniques of 3D imaging and stereoscopy date back to before even photography. In fact, the first few stereoscopic images were drawings. The picture below shows one of the earliest stereoscopic drawings, by Jacopo Chimenti, a painter from Florence, Italy.

Jacopo Chimenti’s first stereoscopic image

In 1838, Charles Wheatstone, a British inventor, published a paper that provided the scientific basis for stereography. He showed that the brain unifies the slightly different two-dimensional images from each eye into a single object of three dimensions. Wheatstone’s early stereographs were also drawings rather than photographs.

Wheatstone’s stereoscope

Photographic stereographs were first produced in 1849 by the Scottish physicist David Brewster, who improved the stereoscope and in the same year built the first true stereo camera with two lenses.

3D/stereographic imaging techniques

The principles of stereoscopy are quite simple. We see things in 3 dimensions (i.e. we perceive 3-dimensional depth) because each of our 2 eyes actually sees a slightly different image. This is because our eyes are positioned apart from each other, which generates what is called binocular disparity. Recreating this effect with 2-dimensional images then allows us to ‘see’ the image in 3D.

There are a number of ways to do this but generally the principle revolves around creating a set of 2 images, one for each eye and ‘forcing’ the left eye to view the left image and the right eye to view the right image.

Freeviewing

This method places the left image on the right side and the right image on the left side. To view the image in stereo, force your eyes to go cross-eyed, which will produce 3 images, then slowly ease your eyes to view the middle image in 3D. This is not as silly as it sounds, and it actually works, though it can be a strain on the eyes (obviously).

Stereogram with cross-eyed method

Wiggle method

Stereogram with wiggle method

The wiggle method surprises a lot of people (including me when I first read about it) but it can sometimes be pretty effective. Basically you take the 2 images and create a GIF that alternates between them.

Viewers

This method uses various kinds of viewers, from the 19th century Wheatstone stereoscope to the popular Holmes American stereoscope and transparency viewers like the ViewMaster and the Pan-Pet that I grew up with. It also includes high-tech head-mounted displays.

Holmes’ American Stereoscope (reproduction)

Parallax barrier and lenticular printing

These 2 methods are similar, though the parallax barrier is pretty high-tech while lenticular printing is about as low-tech as it can be. A parallax barrier essentially places a barrier with a series of precision slits in front of an image source, usually an LCD display, allowing each eye to see a different set of pixels. This is famously used in the Nintendo 3DS.

Nintendo 3DS

Lenticular printing uses a similar technique but with lenticular lenses. Lenticular prints are popular as novelty items and you’ve probably encountered them in many places without knowing what they were called.

These 2 methods are often also classified as ‘autostereoscopy’ or glasses-free 3D.

Difference between parallax barrier and lenticular printing
Lenticular print of a promotion item

3D glasses

This is probably the method you’re most likely to encounter nowadays in movies and in 3D TVs. I classify both passive and active glasses in this category, though the actual technologies can be vastly different, such as alternating frames with special projectors or using polarized light.

Which brings us to the type of 3D image we’ll be trying out today — anaglyphs.

The idea of anaglyphs is simple. We start with the 2 left and right images again. This time they are superimposed on each other, but the left is corrected to show only red while the right is corrected to show only cyan. Actually we can use other colors besides red and cyan, but these 2 colors are the most popular (and patent-free).

The image is then viewed with a pair of glasses with a red filter on the left lens and a cyan filter on the right lens. The result is that the left eye sees only the left image and the right eye only the right image, generating the perception of depth.

Red-cyan anaglyph glasses

The main problem with this technique (besides the need to wear glasses) is that the colors are a bit wonky. Also, if some color from the left image gets into the right eye (and vice versa), a faintly colored “ghost” will be seen. And if the filters on the two lenses let through different amounts of light, the resulting luminance imbalance can easily cause headaches (this happened to me lots of times during the experiments I did below).

However, there are plenty of advantages to anaglyphs. Firstly, there isn’t a need for fancy high-tech equipment. Anaglyph red-cyan glasses can easily be made at home or bought cheaply, and as you will see below, creating anaglyphs is child’s play.

Creating anaglyphs with Ruby

Creating anaglyphs is ridiculously easy with RMagick. This is the whole script I used.

#!/usr/bin/env ruby

require 'rubygems'
require 'rmagick'
include Magick

left = ImageList.new(ARGV[0]).gamma_correct(1,0,0)
right = ImageList.new(ARGV[1]).gamma_correct(0,1,1)
anaglyph = left.composite right, CenterGravity, ScreenCompositeOp

anaglyph.write('anaglyph.jpg')

As you can see, the real work is done in only 3 lines of code. First I create an ImageList object (assuming the first argument is the file name of the left image). Then I use #gamma_correct to filter off the greens and blues of the left image while keeping the reds. For the right image, I do the same thing, except this time I filter off the reds while keeping the greens and blues. Finally I use #composite to blend the 2 images together using the screen blending mode (which lightens the image after blending). I used CenterGravity to place the right image at the center of the left image here, but it really doesn’t matter since both images are supposed to be the same size anyway. What remains is just to write the anaglyph back into a file.

Of course, all of this means nothing if we can’t capture the left and right images. For this there are stereo cameras, ranging from the amazing to the weird and the totally slap-together.

3D World 120 Tr-Lens: Stereoscopic Three Lenses 3D Camera
Fujifilm FinePix REAL 3D W3
2 instant cameras as conjoined stereo camera

Alternatively you can do the same thing with a single camera, although I wouldn’t recommend it for anything except still shots. To do this, some recommend using what is known as the ‘cha-cha’ technique. This requires the photographer to snap a picture, then shift their weight slightly to the left or right, moving a few centimeters, and take a reasonably good second image.

Me? I didn’t want to buy a fancy 3D camera and wasn’t inclined to do the cha-cha, so I applied a bit of MacGyverism to my primary camera.

MacGyver’ed dual iPhones with rubber-bands

It’s not perfect but it does take a reasonably good picture.

Left image of Kai Wen
Right image of Kai Wen
Anaglyphic image of Kai Wen

Edge detection with the Sobel operator in Ruby

I was never much into image processing. Sure, like most programmers I dabbled in it, cropping images or doing some fancy-schmancy filtering effects. I even wrote a Flickr clone for my last book, which has a rather impressive photo editor (mashed up from Pixlr, not mine). But I never thought much about how those effects were done or who came up with them in the first place. That is, until I met Irwin Sobel.

For those who know their image processing, this should ring bells immediately. Yes, it’s that Sobel. But let me take a minute to give some background — Irwin is a colleague of mine working in the Mobile and Immersive Experience Lab at HP Labs. I was visiting about two weeks ago and was introduced to him and his current projects. Inevitably someone brought up the Sobel operator, a commonly used algorithm for edge detection. I was, unfortunately, totally clueless about what it was. Not good. So not surprisingly I ended up Googling for ‘Sobel operator’ at the first possible chance and found out what it was.

The Sobel operator is an algorithm for edge detection in images. Edge detection, for those not familiar with the term, is an image processing technique for discovering the boundaries between regions in an image. It’s an important part of detecting features and objects in an image. Simply put, edge detection algorithms help us determine and separate objects from the background in an image.

The Sobel operator does this in a rather clever way. An image gradient is a change in the intensity (or color) of an image (I’m oversimplifying, but bear with me). An edge in an image occurs where the gradient is greatest, and the Sobel operator makes use of this fact to find the edges in an image. The Sobel operator calculates the approximate image gradient at each pixel by convolving the image with a pair of 3×3 filters. These filters estimate the gradients in the horizontal (x) and vertical (y) directions, and the magnitude of the gradient is then computed from these 2 directional gradients.

The magnitude of the gradient, which is what we use, is calculated from the two directional gradients Gx and Gy as:

G = sqrt(Gx² + Gy²)

That’s the simplified, 2-paragraph theory behind the algorithm. If this fascinates you, you should grab a couple of books on image processing and computer vision and go through them.

Let’s look at how to implement the Sobel operator. We simply create the 2 filters and run them over each pixel in the image, starting from the left and going right. Note that because each filter is a 3×3 matrix, the pixels in the first and last rows as well as the first and last columns cannot be estimated, so the output image loses a 1-pixel border compared to the original image.

To calculate the output pixel at coordinates (1,1), for example with the x filter, the following equation is used:

output pixel [1,1] = ([0,0] x -1) + ([0,1] x 0) + ([0,2] x 1) + ([1,0] x -2) + ([1,1] x 0) + ([1,2] x 2) + ([2,0] x -1) + ([2,1] x 0) + ([2,2] x 1)

To simplify matters even more, the grayscale version of the original image is usually used.

Now let’s look at the Ruby implementation.


require 'chunky_png'

class ChunkyPNG::Image
  def at(x,y)
    ChunkyPNG::Color.to_grayscale_bytes(self[x,y]).first
  end
end

img = ChunkyPNG::Image.from_file('engine.png')

sobel_x = [[-1,0,1],
           [-2,0,2],
           [-1,0,1]]

sobel_y = [[-1,-2,-1],
           [0,0,0],
           [1,2,1]]

edge = ChunkyPNG::Image.new(img.width, img.height, ChunkyPNG::Color::TRANSPARENT)

for x in 1..img.width-2
  for y in 1..img.height-2
    pixel_x = (sobel_x[0][0] * img.at(x-1,y-1)) + (sobel_x[0][1] * img.at(x,y-1)) + (sobel_x[0][2] * img.at(x+1,y-1)) +
              (sobel_x[1][0] * img.at(x-1,y))   + (sobel_x[1][1] * img.at(x,y))   + (sobel_x[1][2] * img.at(x+1,y)) +
              (sobel_x[2][0] * img.at(x-1,y+1)) + (sobel_x[2][1] * img.at(x,y+1)) + (sobel_x[2][2] * img.at(x+1,y+1))

    pixel_y = (sobel_y[0][0] * img.at(x-1,y-1)) + (sobel_y[0][1] * img.at(x,y-1)) + (sobel_y[0][2] * img.at(x+1,y-1)) +
              (sobel_y[1][0] * img.at(x-1,y))   + (sobel_y[1][1] * img.at(x,y))   + (sobel_y[1][2] * img.at(x+1,y)) +
              (sobel_y[2][0] * img.at(x-1,y+1)) + (sobel_y[2][1] * img.at(x,y+1)) + (sobel_y[2][2] * img.at(x+1,y+1))

    val = Math.sqrt((pixel_x * pixel_x) + (pixel_y * pixel_y)).ceil
    edge[x,y] = ChunkyPNG::Color.grayscale(val)
  end
end

edge.save('engine_edge.png')

The first thing you’d notice is that I used a library called ChunkyPNG, which is a PNG manipulation library implemented in pure Ruby. While wrappers over ImageMagick (like RMagick) are probably the de facto image processing and manipulation libraries in Ruby, I thought it was kind of pointless to implement a Sobel operator with ImageMagick since it already has its own edge detection implementation.

To simplify the implementation, I opened up the Image class in ChunkyPNG and added a new method that returns a grayscale pixel at a specific location. Then I created the 2 Sobel filters as arrays of arrays. I used 2 nested loops to iterate through each pixel, column by column and row by row, and at each pixel I used the equation above to calculate the gradient, applying the x filter and then the y filter. Finally I used the gradient value to set a grayscale pixel at the same position in a new image.

Here you can see the original image, which I reused from the Wikipedia entry on Sobel operator.

And the edge detected image with the x filter applied only.

This is the edge detected image with the y filter only.

Finally this is the edge detected image with both x and y filters applied.

This short exercise might not be technically challenging, but it made me appreciate the pioneers who invented the things we now take for granted. Here’s a final picture, one with myself and Irwin (he’s the guy sitting opposite me) and a bunch of other colleagues at HP Labs Palo Alto over lunch. Thanks Irwin, for the Sobel operator!

Generate MP3 waveforms with Ruby and R

I blame Rully for this. If it wasn’t for him I wouldn’t have been obsessed with this and spent a good few hours at night figuring it out last week. It all started when Rully mentioned that he knew how many beeps there are in the Singapore MRT (subway system) ‘doors closing’ warning. There are 13 beeps, he explained and he said he found out by running a WAV recording of it through a MatLab package which in turn generated a waveform that allowed him to count the number of beeps accurately (it is normally too fast for the ear to determine the number of beeps). Naturally such talk triggered the inner competitive geek in me. I happened to be doing Ruby and R integration at the moment (watch out for my talk at RedDotRubyConf at the end of April) so I had no choice but to retrace his steps using my new toys. Really.

MP3 is a compressed and lossy audio encoding format, so to generate a waveform I decided to convert the file to WAV first. Doing this is relatively simple — there is a library called icanhasaudio that wraps around LAME to encode and decode audio. Naturally you need to have LAME installed on your machine before you can do this, but once you have, decoding the MP3 is a breeze:

reader = Audio::MPEG::Decoder.new
File.open('mrt_closing.mp3', 'rb') do |input|
  File.open('out.wav', 'wb')  do |output|
    reader.decode(input, output)
  end
end

That was easy enough. The next step was a bit trickier though. To understand how to create a waveform from a WAV file, let’s digress a bit into what a WAV file is. WAV is an audio file format, originally from IBM and Microsoft, used to store audio bitstreams. WAV is an extended RIFF format, which is a little-endian version of the AIFF format (which is big-endian). In RIFF, data is stored in ‘chunks’, and for WAV there are basically 2 types of chunks — a format chunk and a sound data chunk. The format chunk contains the parameters describing the waveform, for example its sample rate, and the data chunk contains the actual waveform data. There are other chunks, but because I’m really only interested in the waveform I’ll conveniently ignore them. This is what a minimal WAV file looks like:
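
In rough sketch form (not the exact byte layout), in order of appearance:

"RIFF" chunk ID | total file size | "WAVE" format ID
"fmt " chunk ID | chunk size      | format parameters (channels, sample rate, bits per sample, ...)
"data" chunk ID | chunk size      | the sample data itself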

The data chunk has a chunk ID, which is always ‘data’, and a chunk size that is a long integer. Data in the data chunk is stored in sample points. A sample point is a value that represents a sample of a sound at a given moment in time. Each sample point is stored as a linear 2’s-complement value from 9 – 32 bits wide, specified in the BitsPerSample field of the format chunk. Sounds in a WAV file can also come in multiple channels (e.g. a stereo sound will come in 2 channels, like our file). For such multi-channel sounds, the sample points are interleaved, one from each channel. A grouping of sample points for a single point in time across all the channels is called a sample frame. This graphic explains it best.
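
In sketch form, for a 2-channel (stereo), 16-bit file like ours, each sample point is a 16-bit integer, each sample frame takes up 4 bytes, and the data chunk is laid out roughly like this:

sample frame 1      sample frame 2      sample frame 3
[ch1 pt][ch2 pt]    [ch1 pt][ch2 pt]    [ch1 pt][ch2 pt]   ...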

If you open the WAV up with a hex editor it will look something like this:

I won’t go through the format chunk; in fact there is an easier way to find out the format, which is to open up the WAV file in QuickTime and inspect it.

For more information you can get the WAV specs here. This is the information we found about the WAV file, which we will use in a while (and, as a cross-check, the short sketch after this list reads the same fields directly from the fmt chunk):

  • Format : Linear PCM
  • Number of channels : 2
  • Number of bits per sample : 16
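
Instead of relying on QuickTime, the format chunk can also be read programmatically. Here is a rough sketch (not part of the original script) that pulls the same fields out of out.wav, assuming the canonical layout where the fmt chunk appears before the data chunk:

File.open('out.wav', 'rb') do |file|
  riff, _size, wave = file.read(12).unpack('a4Va4')     # RIFF header: 'RIFF', size, 'WAVE'
  raise 'not a RIFF/WAVE file' unless riff == 'RIFF' && wave == 'WAVE'

  # walk the chunks until we hit the format chunk (ID 'fmt ')
  loop do
    header = file.read(8) or raise 'fmt chunk not found'
    chunk_id, chunk_size = header.unpack('a4V')
    break if chunk_id == 'fmt '
    file.seek(chunk_size, IO::SEEK_CUR)                  # skip any other chunk
  end

  # the first 16 bytes of a PCM fmt chunk: format, channels, sample rate,
  # byte rate, block align and bits per sample
  format, channels, sample_rate, _byte_rate, _block_align, bits =
    file.read(16).unpack('vvVVvv')

  puts "format          : #{format == 1 ? 'Linear PCM' : format}"
  puts "channels        : #{channels}"         # expecting 2
  puts "bits per sample : #{bits}"             # expecting 16
  puts "sample rate     : #{sample_rate} Hz"
end

Running this against our file should report Linear PCM, 2 channels and 16 bits per sample, matching what QuickTime shows.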

In order to create the waveform, I opened up the WAV file, collected each sample point from each channel and converted each sample point into an integer. This is the data file I will use later in R to generate the waveform. Let’s take a look at the code we use to generate the data:

require 'fastercsv'   # fast CSV library (folded into the standard CSV library in Ruby 1.9+)
require 'stringio'

FasterCSV.open('wavdata.csv', 'w') do |csv|
  csv << %w(ch1 ch2 combined)                        # header row
  File.open('out.wav', 'rb') do |file|
    while !file.eof?
      if file.read(4) == 'data'                      # look for the data chunk ID
        length = file.read(4).unpack('l').first      # chunk size, as a long integer
        wavedata = StringIO.new file.read(length)
        while !wavedata.eof?
          ch1, ch2 = wavedata.read(4).unpack('ss')   # one 16-bit sample point per channel
          csv << [ch1, ch2, ch1 + ch2]
        end
      end
    end
  end
end

Note that I didn’t read the number of channels from the WAV file but instead assumed it has 2 channels (stereo), to simplify the code. First, I open up the WAV file. I ignore the format chunk completely and look for the data chunk only. Then I read the length of the data chunk by reading the next 4 bytes and unpacking them as a long integer (hence the ‘l’ format in the String#unpack call). This gives me the length of the data chunk that I will need to read next.

Next, for ease of reading, I wrap the returned data string in a StringIO object. As we found out earlier, each sample frame has 2 channels and each sample point is 16 bits, so we need to retrieve 32 bits, or 4 bytes. Since each sample point is 16 bits, it corresponds to a short integer, so we unpack the 4 bytes into 2 short integers, which gives us the 2 sample points of the 2 channels in that sample frame.
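
To make the unpacking a bit more concrete, here is a tiny standalone example (with made-up values, not data from the recording) of a single stereo sample frame being packed and then unpacked the same way:

frame = [1000, -2000].pack('ss')   # build a fake 4-byte stereo sample frame: ch1 = 1000, ch2 = -2000
ch1, ch2 = frame.unpack('ss')      # => [1000, -2000]
puts "ch1: #{ch1}, ch2: #{ch2}, combined: #{ch1 + ch2}"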

After that it’s a simple matter of stuffing the sample points into a CSV file.

Finally, to generate the waveform from the data file, I run it through a simple R script, which I integrated with Ruby using the Ruby Rserve client.

require 'rserve'   # Ruby client for Rserve (the rserve-client gem)

script = <<-EOF
  png(file='/Users/sausheong/projects/wavform/mrtplot.png', height=800, width=600, res=72)
  par(mfrow=c(3,1), cex=1.1)
  wav_data <- read.csv(file='/Users/sausheong/projects/wavform/wavdata.csv', header=TRUE)
  plot(wav_data$combined, type='n', main='Channel 1', xlab='Time', ylab='Frequency')
  lines(wav_data$ch1)
  plot(wav_data$combined, type='n', main='Channel 2', xlab='Time', ylab='Frequency')
  lines(wav_data$ch2)
  plot(wav_data$combined, type='n', main='Channel 1 + Channel 2', xlab='Time', ylab='Frequency')
  lines(wav_data$combined)
  dev.off()
EOF
Rserve::Connection.new.eval(script)

The script generates the following PNG file:

As you can see from the waveform (ignoring the first 2 bulges, which are ‘doors’ and ‘closing’ respectively), there are 13 sharp pointy shapes, each of which represents a beep.

You can get the code and images here from my GitHub repository. If you’re interested to hear more about Ruby and R integration, do come down to the RedDotRubyConf on 22 and 23 April 2011!

My new book is out!

It’s been a while since the day I started writing Cloning Internet Applications with Ruby but it’s finally out! You can get it from bookstores, Amazon or its main site at Packt. It’s available in both a paper and a digital version (PDF), so get it now!

The main idea behind this book is actually quite simple and it started out on this blog. The first ‘clone’ I wrote was the Internet search engine in 200 lines of code, which was really very much scratching an itch that I had while I was in Yahoo, about a year and a half ago. I was interested in search engines then and I wanted to write a very simple search engine to illustrate the principles behind an Internet search engine. That gave me a chance to try out Sinatra, the minimalist web application framework, which worked out really well for me eventually. In turn, that kickstarted me into a whimsical challenge to do the same with Twitter in the same number of lines of code, using Sinatra, and later, TinyURL in 40 lines of code. After that it was only a short leap to writing a whole book about it.

While the original idea revolved around writing clones with the smallest codebase possible, the book eventually evolved to be about writing minimal-feature clones using the Ruby libraries that I now love to use, i.e. Sinatra, DataMapper and Haml. The fundamental premise of the book still remained though: to illustrate how clones of popular Internet applications can be written with Ruby.

While this is a highly technical book with lots of code, I added plenty of the reasons and rationale (according to me, that is) for why and how certain features of those applications work. For example, Twitter’s and Facebook’s features for connecting their users (‘friending’ features) in a social network are different, because they target users differently. Twitter’s friending features are primarily one-way and do not need explicit approval, while Facebook’s friending features are two-way and need explicit approval from both parties. This means design and implementation differences, which are explained in detail in the book.

The experience of writing this book was good, and I have learnt tremendously in the process, though it was a struggle. I can say this now that it’s published, but there were certain times I wanted to throw in the towel because of the mess my career was in then. I was still in Yahoo when I started, and I continued while doing my consulting work which eventually led to Garena, then wrapping up before I left Garena, and finally being published now as I’m in HP Labs. It took a longer time to finish this than my first book, because of the upheaval in my career in the past year or so and also because overall I wanted to come up with a better book. This resulted in a book that has been revised repeatedly as companies, statistics and technologies changed. When I started, TinyURL was the king of the hill of URL shorteners while bit.ly was up and coming, having just taken over as the default URL shortener in Twitter. TinyURL is now just one of the players, with bit.ly probably the largest, but Twitter has come out with its own shortener. Facebook Connect was the way to go when I wrote the chapter on Facebook, but the Open Graph API has taken over since then. Twitter used both HTTP Basic Authentication and OAuth when I was writing, but has switched over completely to OAuth now. And I expect the list to go on and on.

Still, it has been a good journey and a good fight. Finally publishing it is a grand feeling second to none (except when I had my first child). Hope you enjoy the book!