How should we think about a National Data Library?

Gavin Freeguard
Jul 3, 2024

He inhaled deeply, thoughtfully, as he tilted his head. He raised his fingers, resting them against his eminently, imminently, strokable chin.

“What is a ‘library’?”, he asked, staring into the middle distance, wonkishly.

The Labour manifesto includes a pledge to:

create a National Data Library to bring together existing research programmes and help deliver data-driven public services, whilst maintaining strong safeguards and ensuring all of the public benefit.

This follows a report by Onward proposing:

The Government should establish a British Library for Data — a centralised, secure platform to collate high-quality data for scientists and start-ups.

The library should work with public services to make their data AI-ready and bring Government-held datasets together. It should include language and multimodal data with robust privacy-preserving mechanisms.

The library should be open to contributions from archives, universities, and private companies. Starting with NHS data, the library would create a potent resource for AI advancement, particularly for new powerful, tailored foundation models.

These sit alongside a similar suggestion of a National Data Trust for health data from the Tony Blair Institute for Global Change.

If we want to understand and shape what a National Data Library could look like, we could do worse than taking our more conventional idea of a library as a starting point.

Purpose

Libraries play several roles. One is a repository/collection/custodian role, gathering and stewarding information to support research and learning. The 1972 British Library Act establishes the British Library as ‘a comprehensive collection of books, manuscripts, periodicals, films and other recorded matter, whether printed or otherwise’. The Public Libraries and Museums Act 1964 lists, as two of the three objects of a library service, securing an adequate stock to meet ‘general requirements and any special requirements’, and encouraging adults and children to make full use of the service and providing advice on doing so.

Shelf life: the King’s Library, and people around tables, at the British Library (photograph by Mike Peel (www.mikepeel.net), CC-BY-SA-4.0, Wikimedia Commons)

There is another role in sitting at the apex of and supporting a wider ecosystem to do all of that properly. The British Library Act says it is to be managed ‘as a national centre for reference, study and bibliographical and other information services’ for both science and technology, and the humanities; make its services ‘available in particular to institutions of education and learning, other libraries and industry’; and ‘carry out and sponsor research’ where it is ‘expedient for achieving the objects of this Act and generally for contributing to the efficient management of other libraries and information services’. The third object in the Public Libraries Act is to secure cooperation between the library and ‘any other authority whose functions are exercisable within the library area’.

A third role is suggested by the Royal Charter of the National Library of Wales (‘TO ALL TO WHOM THESE PRESENTS SHALL COME, GREETING!’). Its object is ‘to collect, preserve and give access to all kinds and forms of recorded knowledge, especially relating to Wales and the Welsh and other Celtic peoples, for the benefit of the public, including those engaged in research and learning’. This cultural role preserves and projects forward a nation’s sense of itself, allowing the population to see itself in the collection.

A National Data Library might, in theory, be able to play all of these roles. It could be the place where people are able to access a lot of data and information — we know data availability remains a real challenge. But we also know the problems with portals; trying to bring together ALL THE DATA into one institution, rather than around compartmentalised sectors, problems and questions, could overwhelm the organisation and the user (a Library of Babel problem); and one person’s ‘one institution to bind them all’ is another’s ‘yet another in a long line and cluttered landscape’ (the xkcd 927 problem). How existing initiatives — national statistics, departmental data publications, the Integrated Data Service, ADR UK, HDR UK, data.gov.uk to name a few — fit (or don’t fit) into this is a vital question.

The Library could play a catalysing and coordinating role for a wider ecosystem. It could provide support — infrastructure, guidance, convening — for others looking to make the most effective use of their data. The legislative reference to, well, ‘reference’ services might make us think of reference data, which helps categorise our world and underpins research and services. It brings to mind everything from (now-retired) GOV.UK Registers to the National Information Infrastructure to the Postcode Address File to a register of government-commissioned research. Would this institution have a wider role in supporting data sharing across (as well as beyond) government? Where would the existing responsibilities of units like the Central Digital and Data Office fit?

And the Library could help the population see themselves in several ways. Better analysis of the country we live in using the Library’s data is only a part of it: it could also help citizens and communities use the data themselves, contribute their own data and have a meaningful say in what data the Library was responsible for and how it should be used.

The elephant in the (reading) room from the public proposals so far is that they focus on somewhat different purposes for different audiences. The Labour pledge brings together research for the improvement of public services, for the benefit of the public; the Onward and TBI proposals are more explicit in supporting AI startups. Our conventional library plays several roles catering to several audiences (including supporting businesses) — so why not our Data Library?

Our libraries are designed with research, learning, public benefit and — ultimately — a broad range of people in mind, rather than being targeted at business and profit (or any other specific group). When I go to the Library, I’m reading published works — books, articles — that authors wanted to share with the world (or left to posterity after their deaths). In our Data Library, non-personal, anonymised high-level statistics (for example) might be widely accessible.

But this is where detail about what ‘data’ we’re talking about becomes key — and the library analogy breaks down. I am not going to the Library to read sensitive, personal information about, or the innermost thoughts of, my fellow library goers. If our National Data Library is to hold more personal data — from our health records, or other public services — then we have a very different relationship with it. Bringing that data together could be powerful in positive ways (better policy, better services) or negative ones (hacks, leaks, social control, exploitative practices for private profit). We know from existing research that the details of this — who is using our data, in what way, with what safeguards, to what end — fundamentally change how we feel about it, and should fundamentally change how those data systems are — how the Data Library is — designed, accessed, and governed.

Lending and access

There are reference libraries and there are lending libraries. There are open shelves and there are special rooms, restrictions and equipment for more sensitive books, manuscripts and other media. There are books and manuscripts held on site and those that have to be ordered from different locations. There are facsimiles and online versions. There are libraries that can cater for several different types and levels of access, depending on the material.

An analogy with the National Data Library suggests that there might be some materials available to everyone, and some restricted to specialist researchers. There may be different access models for more sensitive material. There may be better and worse options — bringing together all the data in one place for accredited researchers to access, for example, would be a logistical and security nightmare (for starters). It may be possible to keep the data where it already is, but provide researchers with the ability to access different systems. It may be possible (as with eg OpenSAFELY) to allow researchers to send their analysis needs to the data.
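For the more technically minded, the last of those options, sending the analysis to the data rather than the data to the researcher, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Dataset`, `run_where_data_lives` and `MIN_GROUP_SIZE`, the threshold of five and the example figures are all invented here, and it is not a description of how OpenSAFELY or any existing service is actually built.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence

# Hypothetical threshold: withhold results computed from very few records.
MIN_GROUP_SIZE = 5


@dataclass(frozen=True)
class Dataset:
    name: str
    sensitive: bool
    rows: Sequence[float]  # stand-in for record-level data held by its custodian


def run_where_data_lives(
    dataset: Dataset,
    analysis: Callable[[Sequence[float]], float],
    accredited: bool,
) -> float:
    """Run the analysis next to the data and hand back only the aggregate."""
    if dataset.sensitive and not accredited:
        raise PermissionError(f"{dataset.name} is restricted to accredited researchers")
    if len(dataset.rows) < MIN_GROUP_SIZE:
        raise ValueError("too few records to release a result safely")
    # The researcher receives the number, never the underlying rows.
    return analysis(dataset.rows)


# Made-up example data, purely for illustration.
waiting_times = Dataset("waiting_times", sensitive=True, rows=[12.0, 30.0, 18.0, 25.0, 15.0])
average_wait = run_where_data_lives(waiting_times, mean, accredited=True)
```

The design choice worth noticing is that the function returns a single aggregate: the record-level rows stay with whoever holds them, and access rules are checked at the point where the analysis runs.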

There’s at least one obvious place where the analogy falls down. Many library materials can be accessed by only one person at a time — they will consult an item and send it back (and may reorder it in future). With data, lending is often not one-off and singular, but ongoing and simultaneous.

Keeping the data secure will always be important. If only there were a recent analogy with libraries to underline that.

Curation

What makes a library easy to navigate?

First, catalogues with key information about the collection, based on well-defined, consistent classification systems. The data analogy here is obvious: catalogues with key metadata, based on well-defined, consistent data standards.
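To make the data half of that analogy concrete, here is a minimal sketch of a catalogue entry whose subject must come from an agreed vocabulary. Everything in it is an assumption for illustration: the `CatalogueRecord` fields, the three-term `SUBJECTS` list and the example entries are invented, and a real catalogue would more likely follow an established metadata standard than a home-made one.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Hypothetical controlled vocabulary: every record must use one of these terms.
SUBJECTS = {"health", "education", "transport"}


@dataclass(frozen=True)
class CatalogueRecord:
    title: str
    publisher: str
    subject: str
    access_level: str  # for example "open" or "restricted"

    def __post_init__(self) -> None:
        # Reject records that do not follow the shared classification.
        if self.subject not in SUBJECTS:
            raise ValueError(f"unknown subject term: {self.subject!r}")


def find_by_subject(catalogue: Iterable[CatalogueRecord], subject: str) -> List[str]:
    """Return the titles of all records filed under the given subject."""
    return [record.title for record in catalogue if record.subject == subject]


# Made-up entries, purely for illustration.
catalogue = [
    CatalogueRecord("Hospital waiting times", "Example Health Body", "health", "open"),
    CatalogueRecord("School attendance", "Example Education Body", "education", "open"),
    CatalogueRecord("Linked patient records", "Example Health Body", "health", "restricted"),
]
health_titles = find_by_subject(catalogue, "health")
```

Because the vocabulary is enforced when a record is created, a search by subject can rely on everyone having used the same term for the same thing.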

But what if the information about a book or manuscript is incomplete or confusing? What if I have some idea of what I’m looking for, but not the exact details of where I might find it? What if I’m interested in a particular subject, but lack the domain knowledge to be able to find out more?

If only there were some well-trained actual *people* who were able to help me in my endeavours. We might call them ‘librarians’.

And if only there were a proper analogy with librarians in our more digitised data world.

You’d be surprised how difficult it was to find a usable image of librarians. This one is from Super Furry Librarian, licensed under CC BY-NC-SA 2.0. Happily, it was previously used by Martin Belam in a 2009 post about an unconference talk I went to, which makes similar points.

Knowledge and information management is a real challenge, particularly in government. Departments often have an imperfect idea of what information they hold — let alone where else in government (or beyond) they might find it. (A recent IfG report noted local government calls for a data brokering service, so they could find information they needed from central government.) In the digitisation of government (and other sectors), we seemed to think that saving things into the right folders would be enough — even though we can seldom be bothered, and laugh in the face of our organisation’s file naming conventions. In the world of Google, semantic search would somehow solve all — notwithstanding challenges ranging from different departments using different, incompatible systems to all manner of vocabulary issues, as we call the same thing different names and different things the same name.

A library is a service. A National Data Library would be a service. Being able to find the right data is a skill — helping people to find the right data is a service. And if you want to build a data service, you need to think about designing the whole thing end-to-end. (As per an ODI report on why the focus on data portals should shift to services for accessing data, and Jeni’s ‘data as a service’ thinking for Public Digital.)

Collection

The UK’s national libraries, along with the Bodleian in Oxford, Cambridge University Library and the library of Trinity College Dublin, are legal deposit libraries: they have a legal right to request published material, meaning they hold copies of all publications. This right has been extended to online material in recent years. And even without these powers, many libraries will order new material to fill gaps in their collections and service their users.

A National Data Library could have similar powers to ‘legal deposit’ to require data — something the Digital Economy Act already provides — from public and private sectors. It might have an explicit responsibility for identifying and filling data gaps — something the Centre for Public Data have been highlighting for several years, and which was recommended by PACAC’s report on the UK’s evidence base and the Lievesley review of the UK Statistics Authority. (PACAC also backs Lievesley’s recommendation that a Statistical Assembly with wide consultation be part of the solution.)

The central control analogy is less useful in some other respects. Building a top-down cathedral of knowledge may be less useful than drawing upon bottom-up bazaars, starting small, testing, iterating and evolving as good digital service design suggests. And outside a command and control centralised model, there may be other organisations (civil society, charities, community groups) with data they would like to contribute or share with others, lending into the library.

Conclusion

We already have a lot of government institutions with responsibility for data. There are risks in adding yet another that confuses rather than coordinates, aggravates rather than alleviates, and ultimately disappoints. But a new National Data Library could bring real opportunities, signalling a strong commitment to better use of data and providing the machinery to make that happen.

No really, it might (GIF of Tobias and Lindsay Fünke from Arrested Development, text reads ‘But it might work for us’)

Whether that does happen will depend on the detail. What purposes should be prioritised? What services will it provide? In whose interests will it operate? How will it demonstrate trustworthiness and build trust? What ‘data’ are we talking about? What problems is it trying to solve?

The final library lesson is that, whatever answers a new government gives to those questions and whichever direction it chooses to follow, we already have a lot of knowledge — from history, from practice, from theory, and from people — we can learn from. Let’s use it.
