Digital Continuity: Record Classification and Retention on Shared Drives and Email Vaults
VIDALIS, Stilianos and Angelopoulou, Olga and Emmanuel, Lesly (2011) Digital Continuity: Record Classification and Retention on Shared Drives and Email Vaults. Project Report. Welsh Government, Cardiff, UK.
Abstract or description
In 2007 the UK government identified several objectives for improving the storage of public sector information. In particular, and of direct relevance to this project, it wanted to:
improve the responsiveness to demands for public sector information
ensure the most appropriate supply of information for reuse
improve the supply of information for reuse
promote the innovative use of public sector information.
The aim of this project was to mine, categorise and classify information from a heterogeneous, large-scale computer infrastructure and then store the search results in a forensically sound manner. Duplicate information was to be identified for destruction, and the process was designed so that it could be implemented without disrupting staff operations.
The test data was a 217 GB (810,000 files) sample taken from the Welsh Government (WG) shared drives and email vault. The records largely related to the work of the Department of Education and Skills, though 25% of the sample was taken from the wider organisation to ensure that the classification system used was useful over a broad range of subjects. The test data was stored in an isolated test environment with virtualised structures. All development work within the project occurred within the test environment.
De-duplication of the test data was achieved. Some 35.88% of the files were identified as duplicates. Removing these files resulted in a saving of 29.49% of physical space. After one pass over the data, it was possible to generate usable metadata for 75.7% of the de-duplicated data set. This became the rich data set. The retention policies of the WG were used to design queries and rules for analysing the rich data set.
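As an illustration only, the kind of de-duplication pass described above can be sketched in Python. The abstract does not state how duplicates were detected in the project, so this sketch assumes exact-content matching by SHA-256 hash; the function names `file_digest` and `dedup_report` are hypothetical. It reports duplicate files and duplicate bytes separately, since the share of files removed and the share of space saved need not be equal (for instance, when duplicated files are smaller than average).

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dedup_report(root):
    """Group files under root by content hash and summarise the duplicates.

    Assumption: two files are duplicates only if their bytes are identical.
    """
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[file_digest(path)].append(path)

    total_files = total_bytes = dup_files = dup_bytes = 0
    for paths in groups.values():
        sizes = [p.stat().st_size for p in paths]
        total_files += len(paths)
        total_bytes += sum(sizes)
        # Keep the first copy; every further copy counts as a duplicate.
        dup_files += len(paths) - 1
        dup_bytes += sum(sizes[1:])

    return {
        "total_files": total_files,
        "duplicate_files": dup_files,
        "total_bytes": total_bytes,
        "duplicate_bytes": dup_bytes,
    }
```

Calling `dedup_report` on a directory returns the four counts, from which the percentage of duplicate files and the percentage of space saved can be derived. A sketch like this only reads the files; any actual removal would be a separate, audited step.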
It was possible to extract 65% of the files in the rich data set for long-term retention, together with their metadata, in a format that would allow transfer to the WG Electronic Document and Record Management System (EDRMS, known as iShare within the WG). This translates to 55% of the de-duplicated data set. Further analysis of the rich data set would have produced a better extraction rate. This would have been further facilitated by the use of knowledge extraction applications such as Pingar.
|Item Type:||Monograph (Project Report)|
|Additional Information:||The report was distributed to all the Governmental Departments and is currently being considered by the National Archives for further development.|
|Subjects:||G400 Computer Science; G500 Information Systems; P100 Information Services|
|Faculty:||Faculty of Computing, Engineering and Sciences > Computing|
|Depositing User:||Stilianos VIDALIS|
|Date Deposited:||01 Jul 2013 08:29|
|Last Modified:||01 Jul 2013 08:29|