Rethinking VLC’s mirror infrastructure

Around 2005, when we started gathering statistics about VLC, we were seeing around 150,000 downloads per day. Since then this number has increased significantly, reaching more than 1M on good days. In the beginning we used a few mirrors to handle file distribution, and it was a hassle to manage: back in the day a VLC release required a lot of human effort, mostly because we had to wait several hours (if not days) until all mirrors were synchronized. Frustrated by the situation, we moved to SourceForge.net in April 2010 to simplify the release process. We stayed there for 3 years, until recently.

To better understand why we backed off, let’s talk a bit about the SourceForge.net business model. Like any other company, they have to make money to pay their bills and employees. No problem with that. They do it by showing ads on their download pages while your download starts. No problem with that either, except when the ads are obviously designed to trick users into believing they are part of the download procedure, which is bad and misleading. Let’s illustrate what I’m talking about.

VLC’s page on SF.net as taken on April 15, 2013 in France with an IE8 user agent.

Do you see these big buttons? Of course! They are even bigger than the real download link and you have absolutely no idea where they link to (spoiler alert: it’s a scam). Obviously a lot of our users were tricked into clicking these ads and ended up downloading all kinds of crapware. I don’t blame SourceForge for this; it is more a matter of how most advertising programs work on the web nowadays. Still, we care enough about our users not to continue this way. And yes, we asked SF.net many times to be more vigilant about the ads they show, without much success. This is one of the reasons why we (the VideoLAN organization) decided to move away from SourceForge and return to a more typical distribution channel.

Back to the mirrors

We went back to the traditional way of distributing files in the free software world: using mirrors. But VLC is no traditional piece of software. We have millions of users to serve and tens of terabytes to transmit each day, everywhere in the world, in a reliable way. That’s not a trivial matter when you have no money to buy servers and bandwidth in every part of the world. So we had to rely on generous sponsors.

Finding the sponsors

Finding sponsors able to set up mirrors and handle all the related costs (disk storage, bandwidth, maintenance) is not easy. I’ve sent hundreds of emails to hosting providers, network operators and ISPs around the world and, surprisingly, most of them answered positively. One of the constraints we had to consider was where to put the mirrors so that their distribution more or less reflects our current user base in each country (dense areas tend to have more mirrors than others).

Every single server can (and will) fail

Having a failing mirror is scary, since there is no easy way to find out soon enough to disable it before too many users are unable to download the requested files. There is no silver bullet, but good tools help a lot in those situations. We opted for mirrorbrain, a full-featured, batteries-included, open-source geographic load balancer. Among other features, mirrorbrain monitors each server at both the network and the file level, which is great for availability. If one of our mirrors misbehaves, it is disabled automatically and requests are rerouted to the closest available mirror within minutes; it is re-enabled as soon as it comes back online.
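
The failover behaviour described above can be illustrated with a minimal sketch. This is not mirrorbrain’s code or its actual selection algorithm: the `Mirror` class, the `pick_mirror` function, the hostnames and the coordinates are all hypothetical, and a real geographic load balancer weighs more than raw distance. The sketch only shows the idea of skipping mirrors marked as failing and picking the nearest remaining one.

```python
import math
from dataclasses import dataclass


@dataclass
class Mirror:
    """A hypothetical mirror record; field names are illustrative only."""
    name: str
    lat: float
    lon: float
    enabled: bool = True  # set to False when a health probe fails


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    radius = 6371.0  # mean Earth radius in kilometres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(a))


def pick_mirror(mirrors, user_lat, user_lon):
    """Return the closest enabled mirror, or None if every mirror is down."""
    candidates = [m for m in mirrors if m.enabled]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda m: distance_km(user_lat, user_lon, m.lat, m.lon),
    )


# Made-up mirrors, for illustration only.
MIRRORS = [
    Mirror("paris.example.org", 48.86, 2.35),
    Mirror("frankfurt.example.org", 50.11, 8.68),
    Mirror("newyork.example.org", 40.71, -74.01),
]
```

With these made-up mirrors, a user near Lyon would be sent to the Paris mirror; once that mirror is flagged as disabled, the same call falls back to Frankfurt without any change on the client side.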

The setup

The first thing you need to know is that mirrorbrain only works as an Apache module. On a personal level I don’t like the Apache HTTP server: the configuration is a pain and, most of all, it scales badly under pressure, hogging CPU and memory quite fast once traffic exceeds a certain number of requests per second. Scalability was not an option but a requirement, so I put a fine-tuned nginx frontend in front of it.

Another requirement was to show a web page during the download, displaying the logo of the selected mirror, a checksum of the file and a few ads (we are currently supporting the open-source music player Tomahawk).

Putting things together, this is what the actual platform looks like and what happens when you download VLC or any other software from the VideoLAN website:

Nginx is used as the frontend here: all incoming requests go through it. It serves static files (images, CSS, JavaScript) itself and forwards download requests to a web application (the glue) in charge of querying mod_mirrorbrain for the best mirror for the given user and file. The web application then generates the page containing the redirect, ads and checksum. Only a few requests are forwarded directly to the Apache backend without going through the web app, but these are used only for monitoring and debugging purposes and are not part of the standard download process.
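
To make the role of the glue more concrete, here is a minimal sketch of the page-generation step. It is not the code running on the VideoLAN servers: the function names, the use of SHA-256, the meta-refresh redirect and the example URL are all assumptions made for illustration, and the mirror lookup itself (the query to mod_mirrorbrain) is left out.

```python
import hashlib
import html


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of the file contents (the algorithm is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def render_download_page(file_name, mirror_name, mirror_url, checksum, delay=3):
    """Build an HTML page that names the mirror, shows the checksum,
    and redirects the browser to the mirror after `delay` seconds."""
    safe_url = html.escape(mirror_url, quote=True)
    return "\n".join([
        "<!DOCTYPE html>",
        "<html>",
        "<head>",
        # Hypothetical redirect mechanism: a plain meta refresh.
        f'<meta http-equiv="refresh" content="{delay}; url={safe_url}">',
        f"<title>Downloading {html.escape(file_name)}</title>",
        "</head>",
        "<body>",
        f"<p>Your download is served by {html.escape(mirror_name)}.</p>",
        f'<p>If it does not start, use this <a href="{safe_url}">direct link</a>.</p>',
        f"<p>SHA-256: <code>{html.escape(checksum)}</code></p>",
        "</body>",
        "</html>",
    ])


if __name__ == "__main__":
    # Made-up file name, mirror and contents.
    print(render_download_page(
        "vlc.tar.xz",
        "Example Mirror",
        "https://mirror.example.org/vlc/vlc.tar.xz",
        sha256_hex(b"demo file contents"),
    ))
```

Keeping this step in a small web application means Apache only has to answer the mirror lookup, while nginx handles everything the user actually sees.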

Conclusion

One month after putting the whole thing into production, we are quite pleased with the result. We’re serving dozens of downloads (and VLC updates!) each second, everywhere in the world, in a reliable way, from a total of 42 mirrors provided by awesome sponsors. And we even survived a DDoS attack without any downtime!

 

3 Comments.

  1. Nice. Thanks for cutting down the confusing download junk :) Now if only there were a similar option for “all popular open source projects” … hmm … maybe you can make it into some kind of reusable system somehow?

    • Doing something reusable is really hard since tuning the whole system is definitely the most difficult part. But if other projects are interested I would be glad to share my experience with them!

  2. We average 560 requests/sec on a daily basis with mirrorbrain, and I’ve been thinking about reverse-proxying nginx as well. Not sure how much that will do on a plain mirrorbrain install – we do not need an intermediary page.