We’ve had so many questions from people asking how this project moved from pipe dream to reality that we decided to blog about it. The aim of this post is to give an overview of some of the technologies we have used and the challenges we have faced as the project has come to life.
The website aims to transform the way that people use historical newspapers and currently offers access to 4 million fully searchable pages, with a further 36 million to follow over the next 9 years. brightsolid online technology was chosen by the British Library due to our track record of delivering projects of this scale, such as the digitisation of the historic 1911 Census of England and Wales and the hosting of Scotland’s 2011 Census, which successfully protected the nation’s personal data.
All of this data is hosted and stored at our state-of-the-art data centres in Dundee, which provide the most secure and performant environment for our website. They were designed and built in partnership with IBM and use industry best practice and the latest techniques to deliver one of the country's leading facilities.
brightsolid’s agreement with the British Library was signed in May 2010, so the launch has been over 18 months in the planning. As expected with any project of this scale, detailed and lengthy planning was a necessity and involved internal expertise as well as support from external suppliers such as Content Conversion Specialists (CCS), who provide the document workflow system that our team uses, and Digital Divide Data (DDD), who dissect the digitised information ready for searching on the British Library website.
Our digitisation team consists of more than 36 in-house employees, with a further 160 people employed by outsourced companies to assist full-time in the project. Tens of thousands of man hours have been necessary to get this project to the launch stage, and very large volumes of data are still being digitised every working day.
As part of our move towards a Unified Publishing Platform across our family history websites, the British Library infrastructure is made up of three distinct elements – a processing backend (the fully-searchable newspaper images), a front-end website (customer facing interaction) and a backend for the website (indexing & searching via the website). A Unified Publishing Platform provides a more stable operating platform for our sites and provides a scalable infrastructure to meet the varying peak demands of traffic. The infrastructure diagram below illustrates the three elements and their supporting features.
While the meticulous planning and efficient project management detailed above is absolutely critical, in my experience an unexpected challenge usually surfaces at some stage. These unforeseen challenges are the reason that I thrive on working on such projects, as you are stretched beyond your comfort zone and are challenged to draw on all of your technical experience. In the case of the British Library project, this challenge came from the massive disk I/O required to maintain a usable service across the various components of the British Library website. We initially approached this problem by procuring a small number of servers, then went through a rigorous process of changing elements within these systems to best meet the requirements. This involved trying various caching techniques, altering the amount of memory and changing the final resting places of data to find the sweet spot of cost versus performance for the systems.
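The post doesn’t name the tooling behind these measurements, but the kind of isolated, repeatable disk test involved can be sketched in a few lines of Python. This is a hypothetical micro-benchmark for illustration only, not our actual harness; a real run would use a file far larger than RAM so that the operating system’s page cache doesn’t mask the disks:

```python
import os
import random
import tempfile
import time

def benchmark_random_reads(path, file_size, block_size=4096, reads=1000):
    """Time random block reads from a file and return throughput in MB/s."""
    with open(path, "rb") as f:
        start = time.perf_counter()
        for _ in range(reads):
            # Seek to a random offset and read one block, mimicking the
            # scattered access pattern of search and image requests.
            f.seek(random.randrange(0, file_size - block_size))
            f.read(block_size)
        elapsed = time.perf_counter() - start
    return (reads * block_size) / (1024 * 1024) / elapsed

# Small test file for the example; a genuine benchmark needs a much
# larger file than system memory to defeat the page cache.
size = 8 * 1024 * 1024  # 8 MB
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(size))
    path = tmp.name

print(f"random 4K reads: {benchmark_random_reads(path, size):.1f} MB/s")
os.unlink(path)
```

Running the same fixed workload against each candidate configuration is what makes the results comparable between iterations.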
The systems were exercised using models of user behaviour gathered from our previous large-scale site launches, with a repeatable and consistent test harness that allowed individual components to be changed and evaluated in isolation. We went through various iterations, changing the amount of system memory and moving the data stores between NFS shares, RAID1, RAID5 and RAID10 mechanical disks, and striped SSD drives, as well as combinations of these changes, to see how the component changes affected each other. At each stage an objective approach was taken: the number of transactions that could be achieved was calculated for a fixed amount of money, giving the best Return on Investment (ROI).
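The ROI comparison itself is straightforward arithmetic: transactions per second divided by hardware cost. The figures below are invented purely for illustration — the real benchmark results and prices were specific to the project and are not published here:

```python
# Hypothetical configurations and figures for illustration only; the
# actual benchmark numbers and hardware costs are not in the post.
configs = {
    "RAID5 mechanical":  {"cost": 8000,  "tps": 400},
    "RAID10 mechanical": {"cost": 11000, "tps": 700},
    "striped SSD":       {"cost": 15000, "tps": 2600},
    "large RAM + SSD":   {"cost": 19000, "tps": 4100},
}

def transactions_per_pound(cfg):
    """Throughput achieved per unit of spend: the ROI metric."""
    return cfg["tps"] / cfg["cost"]

ranked = sorted(configs, key=lambda name: -transactions_per_pound(configs[name]))
for name in ranked:
    print(f"{name:>18}: {transactions_per_pound(configs[name]):.3f} tx/s per £")
print("best ROI:", ranked[0])
```

With these example numbers the memory-plus-SSD configuration wins on cost per transaction, which mirrors the conclusion we reached with the real data.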
Ultimately the highest performance would have been achieved using large amounts of memory, but the best cost-per-transaction model was a large amount of memory paired with very fast striped SSD drives. Once this had been determined, the final servers were purchased and configured accordingly, with full site simulations run once again using the test harness. Further to this, an innovative approach was taken to image serving, with requests distributed across two sites to provide not only resilience for the service but also maximum utilisation of our internet connections, ensuring that should the number of users at launch exceed expectations, our single-site bandwidth would not be saturated.
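A minimal sketch of the two-site idea, under assumptions of my own (hypothetical hostnames, and a simple hash-based scheme rather than whatever mechanism we actually deployed): a stable hash of the image identifier picks a serving site, spreading load roughly evenly across both internet connections while keeping each image’s requests on the same site, which is friendlier to caches.

```python
import hashlib

# Hypothetical hostnames for the two serving sites.
SITES = ["images-a.example.com", "images-b.example.com"]

def site_for(image_id):
    """Deterministically map an image request to one of the two sites.

    A stable hash spreads requests roughly 50/50, so neither site's
    bandwidth is saturated; if one site fails, all requests can simply
    be sent to the survivor.
    """
    digest = hashlib.md5(image_id.encode()).digest()
    return SITES[digest[0] % len(SITES)]

print(site_for("page-000123.jp2"))
```

The same request for the same page always lands on the same site, while a large population of different pages splits evenly between the two.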
On the first day of the launch we saw over 1.2 million searches, and we will add 8,000 pages to the archive every working day. To achieve this, we are using four storage servers: two 50 terabyte servers and two 16 terabyte servers. The pace of progression for the website is pretty relentless and, of course, this brings further challenges and technical requirements in the future. Scanning, digitising and making publicly searchable a football field’s worth of newspaper data every day is a phenomenal task, but we’re confident that the architecture we have in place will provide a world-leading user experience for the global audience we’ve already seen using the site.
Why not give it a go for yourself at www.britishnewspaperarchive.co.uk and search out some gems from your family, local area or major events of the time?
If you have any questions or want to find out more information about what we can do for your organisation then follow us on Twitter @brightsolid_tec
Senior Unix Consultant
brightsolid Online Technology