On 6th February, I went to the Manchester Web Performance meet up group at the BBC's new premises in Salford Quays. The attendance was very good, with around 60 developers and testers turning out.
The presentation was delivered by Andy "Bob" Brockhurst - Principal Engineer, BBC Platforms/Frameworks. Andy began by describing the BBC's web platform, which is based on Red Hat Linux. Interestingly, the BBC hosts all of its 360+ applications on the same shared platform. This means that something affecting service on, say, iPlayer could have a knock-on effect on other applications such as weather or news.
The BBC has small performance test environments (around 10% of the size of production). However, these environments aren't suitable for large-scale testing, and applications tend to be tested in isolation. The BBC accepts that this is a risk, but it has been forced to take it because of the difficulty of orchestrating tests to suit potentially hundreds of different development teams.
In anticipation of the extra load from the Olympics, the web platform was upgraded: server capacity was doubled and the web servers were upgraded from Red Hat 4 to Red Hat 6. Upgrades were also made to the load balancers' network interfaces. The quote 'doing the rounds' in the BBC was "What the abdication crisis did for radio and the Coronation did for TV, the Olympics would do for online." The pressure to deliver was building, and against this backdrop the test team had little confidence that their tests were realistic. They also knew that their BAU (business as usual) services had to be maintained despite the anticipated "Olympics onslaught" onto their shared platform.
Andy shared some of the "normal stats" for use of their platform. The sport servers typically handle 9m visitors and 90m page views per day, with significant peaks in demand. For example, the sport site sees 1000 pv/s (page views per second) each Saturday, but this peaks at 4000 pv/s when the final football scores are published. On a normal day the BBC has peaks of around 750,000 users on its platform, and they anticipated double this figure for the Olympics.
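To put those numbers in perspective, here is a quick back-of-the-envelope calculation. It is only a sketch using the figures quoted in the talk; the variable names are mine, and the "daily average" assumes page views are spread evenly over 24 hours, which real traffic never is.

```python
# Figures as quoted in the talk (sport servers and platform peaks).
PAGE_VIEWS_PER_DAY = 90_000_000
VISITORS_PER_DAY = 9_000_000
SATURDAY_PVS = 1_000        # page views per second on a typical Saturday
FINAL_SCORES_PVS = 4_000    # page views per second when final scores land
NORMAL_PEAK_USERS = 750_000

SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: traffic spread evenly across the day (a simplification).
avg_pvs = PAGE_VIEWS_PER_DAY / SECONDS_PER_DAY
pages_per_visitor = PAGE_VIEWS_PER_DAY / VISITORS_PER_DAY
spike_factor = FINAL_SCORES_PVS / SATURDAY_PVS
olympic_peak_users = 2 * NORMAL_PEAK_USERS

print(f"Daily average: {avg_pvs:.0f} pv/s")               # Daily average: 1042 pv/s
print(f"Pages per visitor: {pages_per_visitor:.0f}")      # Pages per visitor: 10
print(f"Final-scores spike: {spike_factor:.0f}x")         # Final-scores spike: 4x
print(f"Anticipated Olympic peak: {olympic_peak_users:,} users")
```

The point is less the exact values than the shape: a fourfold spike arriving within minutes is what the platform has to absorb, not the daily average.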
So how did they go about testing?
Unsurprisingly, they did it the same way that we all do, albeit on a grand scale! They worked out the likely usage patterns, wrote and prepared test scripts and test data, and started to test iteratively. They used a mixture of internal load generated with the Avalanche test tool and cloud-based external load from SOASTA.
As with all performance tests, they had to make compromises. The CDN provider was reluctant to allow too much simulated load on its systems, so a balance had to be struck between realism and practicality.
The tests identified problems with content caching and load balancer configurations, as well as a lack of adequate performance monitoring. Services were moved around to reduce the competition between different applications. For example, the sport, Twitter and news applications were moved to prevent them competing for the same server resources.
Due to the lack of sufficiently realistic test environments, all performance testing was performed against the live environment. This works well but can cause problems (which they found by "breaking" the platform and affecting service). Despite this mishap and the fact that testing has to stop when "real" usage increases, large-scale events will be tested in this way in the future.
Towards the end of the presentation, Andy shared his "Lessons learnt" and, despite the scale of the BBC's testing project, they were familiar to many of us.
- Test environments needed to be improved and they need a better understanding of usage patterns
- More testing needed to take place in "dev". This fits in the "test early, test often" mantra
- Changes were recommended for load balancing. They found that simple "round robin" is not adequate
- External specialist testers were brought in to supplement the internal team
- Detailed monitoring was essential to help diagnose performance bottlenecks and eradicate them
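
Andy didn't go into which load balancing approach replaced round robin, so the following is purely my own illustration of why simple round robin can fall short when requests vary in cost. It is a toy model: the request costs are made up, all requests are assumed to arrive at once, and `least_loaded` is a hypothetical alternative policy, not the BBC's.

```python
from itertools import cycle


def round_robin(costs, n_servers):
    """Hand requests to servers in strict rotation, ignoring their cost."""
    load = [0] * n_servers
    for server, cost in zip(cycle(range(n_servers)), costs):
        load[server] += cost
    return load


def least_loaded(costs, n_servers):
    """Hand each request to whichever server has the least work so far."""
    load = [0] * n_servers
    for cost in costs:
        server = load.index(min(load))
        load[server] += cost
    return load


# Made-up workload: every third request is ten times more expensive.
costs = [10, 1, 1] * 4

print(round_robin(costs, 3))   # [40, 4, 4]
print(least_loaded(costs, 3))  # [14, 21, 13]
```

With round robin the expensive requests all land on the same server, which ends up with ten times the work of its neighbours. Taking existing load into account spreads the same work far more evenly.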
From the meeting I took the message that as testers we all face the same problems: the suitability of test environments and test tools. We usually have too little time to complete thorough testing prior to immovable deadlines. Everything we do is a compromise between spending more time and money and delivering valid test results in enough time for them to be acted upon. Getting the balance right between realism, cost and delivering on time is one of the toughest skills that a tester can master. From Wednesday's presentation it seems that Andy and his team have these skills in abundance.