This document offers a variety of general recommendations for creating publications that are more preservable. Preservation partners, however, vary greatly in their capabilities, the standards they use, and the services they offer. It is helpful to engage with these partners early to discuss new projects. This gives them an opportunity to indicate whether their local practices differ from the suggestions here. For example, some preservation services may prefer that specific file formats or metadata standards are used. Discussing the project early can improve the preservation outcomes.
These other guidelines can facilitate conversations with preservation partners:
5. Establish formatting rules for common features
6. Keep preservation partners informed of changes
9. Define “version of record” for your context
10. Define and document the core intellectual components of a work
Other guidelines in this publication lay out many aspects of platform technology that could be considered, but in brief, here are some indicators that a platform may have features that facilitate preservation. The platform uses appropriate standards relevant to the publishing community, e.g. standardized metadata, exports to common formats, and accessibility standards. The platform uses established technologies rather than depending on newer, more experimental technologies that may not be well supported. The platform itself is well established and broadly adopted. There are existing workflows for preservation. The platform has a comprehensive export option that includes all raw materials, dependencies (e.g. fonts), descriptive metadata, and packaging metadata that describe how it all fits together. The export package supports, through completeness and use of standards, a complete migration to a new platform with equivalent features, rather than being closely tied to the current platform. In the absence of an export, the platform includes a predictable structure or API that could facilitate content discovery, enumeration, or harvesting from an external source. Finally, the platform does not have an overabundance of built-in features that will not be used, as these can add bulk and complexity to preservation workflows.
If you are developing a new publishing platform, or have control over how publishing platform features are designed or implemented, use existing standards to guide decisions. For example, there are standards for bibliographic data (e.g. ONIX, Dublin Core), full-text data (e.g. TEI, EPUB), annotations (e.g. W3C’s Web Annotation Data Model), persistent identifiers (e.g. DOIs, Handles, ARK IDs), citations (e.g. MLA, BibTeX), metrics (e.g. COUNTER), accessibility (e.g. W3C’s Web Content Accessibility Guidelines) and more. Preservation workflows scale best when working with common standards.
When using out-of-the-box software solutions for your publishing platform, export and preservation workflows are often designed around the built-in functionality of that software. For this reason, it is helpful to use platform features as intended. If the built-in functionality of the publishing software does not meet local requirements, avoid making undocumented, one-off changes to core code in order to get something working quickly. Instead, attempt to formalize and document any changes to the out-of-the-box software so that the new functionality is reusable in other publications and internally consistent within the platform. If the platform software has a formal process for applying enhancements (e.g. a plugin process), make use of this. Ensure any export processes are modified to align with the local changes and, if working with a preservation partner, communicate any local changes to the software. Not following a formal process risks losing the new features during preservation, updates, or platform migration. An undocumented customization can disrupt the preservation of entire publications.
These other guidelines may be helpful when implementing new features:
3. Use existing standards when implementing features
5. Establish formatting rules for common features
6. Keep preservation partners informed of changes that affect the publications
Consistency is the key to a scalable preservation workflow, and so if a publisher or platform supports multimedia content or other enhanced features, establish basic rules early on and continue to express these features in a consistent way. Limit formats and arrangements as much as possible. For example, if one embedded video is an MP4 with no caption, another is a WebM and has a caption in a box, and another still is a Vimeo video with a caption but no box, for some approaches these minor inconsistencies can cause problems when performing preservation activities at scale. These potential variations should be clearly defined and constrained.
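One way to keep such rules enforceable is to express them as a small automated check. The following is a minimal sketch in Python, assuming a hypothetical house rule of one permitted video format plus a mandatory caption. The `EmbeddedVideo` record and its field names are illustrative and are not taken from any particular platform.

```python
from dataclasses import dataclass

# Hypothetical house rules: a single video format, and every video has a caption.
ALLOWED_VIDEO_FORMATS = {"mp4"}


@dataclass
class EmbeddedVideo:
    identifier: str
    file_format: str  # e.g. "mp4", "webm", or "vimeo" for a remote embed
    caption: str      # empty string when no caption was supplied


def find_rule_violations(videos):
    """Return (identifier, problem) pairs for videos that break the house rules."""
    problems = []
    for video in videos:
        if video.file_format.lower() not in ALLOWED_VIDEO_FORMATS:
            problems.append((video.identifier, f"unexpected format: {video.file_format}"))
        if not video.caption.strip():
            problems.append((video.identifier, "missing caption"))
    return problems


if __name__ == "__main__":
    sample = [
        EmbeddedVideo("fig-1", "mp4", ""),
        EmbeddedVideo("fig-2", "webm", "Interview excerpt"),
        EmbeddedVideo("fig-3", "vimeo", "Site walkthrough"),
    ]
    for identifier, problem in find_rule_violations(sample):
        print(f"{identifier}: {problem}")
```

A check like this could run before publication or before an export, so that variations are caught while they are still cheap to correct.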
These other guidelines may be helpful when implementing new features:
3. Use existing standards when implementing features
6. Keep preservation partners informed of changes that affect the publications
70. Consider systematically tagging material that should be excluded or tagging material that should be included as part of the preserved content
For platforms and publishers working with a preservation service, preservation workflows will be designed based on the sample publications provided. If these are not representative of the full range of functionality that the publishing platform supports, then the preservation workflow developed may miss things that the publisher wants to preserve. Keep a record of the scope of variations that might be found in a publication. When formatting rules for a publication change or expand, or when new file formats or arrangements can be expected, inform your preservation service so that they can adapt their workflows accordingly and avoid missing important features.
For more about changes that should be communicated to a preservation service:
4. Document any changes to the default functionality of a platform
5. Establish and document basic formatting rules
10. Define and document the core intellectual components of a work
70. Consider systematically tagging material that should be excluded from preservation
71. Document and share the platform-level approach to preserving components of a publication
If a publication platform enables user-contributed content and that content is managed by the platform, e.g. annotations or comments, the platform’s Terms of Use should clearly define the rights related to that content, especially if the publisher may wish to preserve it or migrate it as part of the context of the publication. If a publication is likely to be archived with this context intact, the implementation of these features and their associated terms should factor in ethical consideration of how a user’s information is displayed on the platform, and how users are informed about and consent to the use of the content.
See also:
55. Ethical concerns of user-contributed content
70. Consider systematically tagging material that should be excluded from preservation
PubPub supports features that allow users to contribute content through annotations and comments. This content is integrated into the page and can’t be excluded from web crawls. The default PubPub Terms of Service template includes language that covers User-Generated Content under a Creative Commons Attribution 4.0 License:
By submitting User-Generated Content, you hereby make that User-Generated Content available under the Creative Commons Attribution 4.0 License, and you represent and warrant that you have the right to provide your User Generated Content under that license, that all of that User Generated Content is either authored by you, or provided by third parties under the Creative Commons Attribution 4.0 License or in the public domain, and that your User Generated Content contains no personally identifiable information of third parties who have not expressly authorized you to provide it as part of your User Generated Content. All of your User-Generated Content must be appropriately marked with licensing and attribution information.
These terms allow for preservation of User-Generated Content on PubPub.
If a publication platform integrates third party applications for features such as annotations or comments, the publisher should ensure that the terms of service for that application provide appropriate permission for preserving and migrating that content over time.
See also:
14. Avoid being dependent on third party services for core features
15. Plan a strategy for preservation when third party dependencies exist
Some third-party annotation services have restrictive default terms of service or do not define their terms of service. Hypothesis, an annotation tool that can be added to or used with most websites, grants a CC0 license for all annotation data stored on their servers. This means you don’t need to seek special permission to preserve the annotation data.
A preservation service will work with a publisher to determine the version(s) of record. If there may be multiple versions of record, or if draft versions are considered significant, the parameters of these should be clearly defined. In addition, these versions should be identified in a formal way so that automated updates can occur as needed while retaining clarity across the preservation copies.
These guidelines relate to other aspects of versioning:
23. Express versioning in bibliographic metadata
31. Assign new identifiers to significant versions
For each work, establish what readers need in order to perceive the authors’ intellectual and rhetorical contributions, acknowledging that the current form of the publication may not be available in the future with changing technologies and social frameworks. Preservation efforts should focus on these core components. Many publishers will define the rules about what is preserved at a platform level with a single model across multiple publications. It is important for authors and publishers to work together to understand the core components, and ensure that they are represented in the content that will be preserved.
These guidelines refer to preservation of core components:
70. Consider tagging material that should be included or excluded for preservation
71. Share documentation about what will be preserved at a platform level
Architecture & Memory, written by Robert Kirkbride and published by Columbia University Press in 2008, was an early enhanced digital book. It used videos of architectural features and included an interactive navigation feature that utilized Adobe Flash—a now obsolete technology that was popular in the early web. When browsers stopped supporting Flash, the videos and interactive menu stopped working. The publication was migrated to the Fulcrum platform around the same time. From an outside perspective, the interactive navigation component could be viewed as a secondary form of navigation after the standard browsable menu, which was composed of hyperlinks and still worked. A conversation with the author, however, revealed that the interactive navigation feature was critical to the work. The publisher worked with the author to recreate the interactive Flash component using HTML5. If the author had not been available to advocate for the importance of this feature, the publication would have lost a critical component.
During early conversations related to preserving Vulci: Urban Context and Waterscapes, the publisher was contemplating how to include a representation of the 3D visualizations referenced in the publication. They were stored in a third-party platform that specialized in supporting these visualizations. The publisher consulted the author to identify whether the 3D visualizations were integral to the research contribution and narrative of the book or whether the book could stand alone. It was decided to develop the manuscript so that the book could be independent of these features, thereby saving the time and money that would otherwise be invested in preserving features that were nice to have rather than essential.
The Library of Congress updates their Recommended Formats Statement regularly. This is a helpful quick reference for selecting a format that is stable when there is an opportunity to choose. If converting data from a proprietary format to an open file format results in some data loss, consider saving both. For less established or proprietary formats, consider recording the type, version, and software used to generate and play the file—this can be included in the metadata or documentation.
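As an illustration of recording format details alongside a file, here is a minimal sketch in Python that builds a small technical-metadata record and serializes it as JSON. The field names and the example values (including the "ExampleCAD" software) are hypothetical. A preservation partner may prefer an established metadata standard instead.

```python
import json


def describe_file_format(path, format_name, format_version, created_with, plays_with):
    """Build a small technical-metadata record for one file.

    All keys are illustrative; adapt them to the metadata schema in use.
    """
    return {
        "file": path,
        "format": {"name": format_name, "version": format_version},
        "created_with": created_with,  # software that generated the file
        "plays_with": plays_with,      # software known to open or play it
    }


if __name__ == "__main__":
    record = describe_file_format(
        path="media/site-model.xyz",
        format_name="ExampleCAD project",
        format_version="4.2",
        created_with="ExampleCAD 4.2",
        plays_with=["ExampleCAD Viewer 4.x"],
    )
    print(json.dumps(record, indent=2))
```

The resulting JSON could be stored next to the file or folded into the publication's documentation.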
These guidelines may also be considered during file format selection:
13. Acquire the highest quality version of media to use for preservation
34. For EPUBs, opt for core media types, as defined by the EPUB specification
Thinking through the best ways to present and preserve media assets such as video early in the publication cycle will allow for lead time to implement best practices for preservation, such as procuring and/or licensing media for local hosting or exclusively for preservation, or choosing remote services better suited to web harvesting.
Store supporting files such as multimedia, fonts, JavaScript, and CSS locally with the publication or inside the application used for publishing. This helps ensure the vital components of the work can be easily packaged together, reduces ongoing maintenance, and helps ensure exports contain all necessary resources.
If this is impractical in the live environment, other guidelines may be relevant:
15. Develop a strategy to capture any external media content
16. Captions for non-text features add meaningful context
20. Ensure all core intellectual components of a work are reflected in the export package
29. Consider a preservation-specific EPUB in your workflow
51. Host media files local to the website
72. Record a walkthrough of features with important layout or interactivity
Sometimes it is necessary or preferable to reference or embed third-party content that is outside of the control of the publisher but integral to the understanding of the work. For these features, anticipate that their availability may be temporary and make plans to ensure that they are not only preserved, but sustained in some form as part of the publication while they are on the publisher platform. In the case of an embedded YouTube video, for example, some options to support preservation might include: retaining or requesting a copy of the video file; getting permission to copy the content directly from YouTube using a downloader tool in order to bring it into the local publication; or web archiving the video page and linking to the archived copy, e.g. on the Internet Archive. An informative caption can help support future readers if the content is unavailable.
These guidelines may also improve preservability of third party hosted media:
12. Start discussions about multimedia early in the project
14. Avoid externally hosted media
16. Captions for non-text features add meaningful context
20. Ensure all core intellectual components of a work are reflected in the export package
39. Avoid the use of iframes to embed multimedia
42. Facilitate a local web archive workflow for iframe content
Owning My Masters (Mastered): The Rhetorics of Rhymes & Revolutions by A.D. Carson includes an annotated interactive timeline created using the Northwestern University Knight Lab’s TimelineJS. A simplified text representation of this timeline is included in the EPUB on the Fulcrum publishing platform. The interactive version, hosted at University of Virginia and embedded on the author's website using an iframe, is linked as an external resource. The timeline is configured from data stored in a Google Sheet owned by the author. A web archive file (WARC) of the interactive timeline site and a CSV of the Google Sheet are included as hosted resources on Fulcrum and available for download. Since Fulcrum resources are included in the export, the archived web page (WARC file) and the text version are both part of the preserved copy.
When referencing an external resource in a publication, see if there is a version of the resource that has a unique persistent identifier and if so use that identifier to reference it. While all “persistent” identifiers can eventually break depending on whether they are properly maintained, they are more likely to last than other links and uniquely identify a resource. Another option for tackling “link rot”—the term for when links stop working—is to use a web archiving snapshot service such as archive.today or Internet Archive’s Save Page Now service to archive the page and reference the resulting snapshot as an alternative link in the document. Robust Links are one way to present this to users.
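To make the snapshot approach concrete, the sketch below generates an HTML link that carries both the original URL and an archived snapshot. It assumes the attribute names `data-originalurl`, `data-versionurl`, and `data-versiondate` as described by the Robust Links specification. These should be verified against the current specification before use. The URLs in the example are placeholders, not real snapshots.

```python
from html import escape


def robust_link(original_url, snapshot_url, snapshot_date, link_text):
    """Return an HTML anchor annotated with an archived snapshot.

    snapshot_date is the date the snapshot was made, as YYYY-MM-DD.
    Attribute names are assumed from the Robust Links specification.
    """
    return (
        f'<a href="{escape(original_url, quote=True)}" '
        f'data-originalurl="{escape(original_url, quote=True)}" '
        f'data-versionurl="{escape(snapshot_url, quote=True)}" '
        f'data-versiondate="{escape(snapshot_date, quote=True)}">'
        f"{escape(link_text)}</a>"
    )


if __name__ == "__main__":
    print(
        robust_link(
            original_url="https://example.org/report",
            snapshot_url="https://web.archive.org/web/20240101000000/https://example.org/report",
            snapshot_date="2024-01-01",
            link_text="Example report",
        )
    )
```

Keeping the original URL as the default target means readers reach the live page while it exists, and the snapshot remains available as an alternative when it does not.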
These guidelines cover other instances that may benefit from use of identifiers:
27. Assign persistent identifiers to publication resources and use them
31. Assign identifiers to significant new versions of the work
If a publication contains digital enhancements that are important enough to warrant preservation, the publication inclusive of its enhancements may be substantial enough to warrant a new ISBN, DOI, or other persistent identifier. This practice would ensure that the new version can be easily distinguished from other unenhanced versions of the publication in the preservation system.
These guidelines also relate to management of versions and use of identifiers:
9. Define the “version of record” in your context
17. Use persistent identifiers to link or cite external resources
23. Include version information in bibliographic metadata
Key URLs for a publication, such as a publication’s home page, should not change over time. If they must change, redirect the original URL to the new location. Apart from helping to decrease broken links from other websites, using a well planned URL structure can help with website preservation. Ensuring the publication’s URL does not change over time can make it easier to manage and connect different versions of the publication that are preserved and avoid duplication.
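The redirect practice can be sketched as a simple lookup table of retired paths. This Python example is a minimal illustration, assuming a hypothetical `REDIRECTS` mapping. In practice the same table would usually be expressed in the web server or platform configuration rather than in application code.

```python
# Hypothetical mapping of retired paths to their current locations.
REDIRECTS = {
    "/books/my-book": "/publications/my-book",
    "/publications/my-book/index.html": "/publications/my-book",
}


def resolve(path, redirects=REDIRECTS, max_lookups=5):
    """Follow the redirect table and return (status, final_path).

    Status is 200 when the path is already current and 301 when it has moved.
    Raises ValueError if no final path is found within max_lookups lookups,
    which also guards against redirect loops.
    """
    current = path
    for _ in range(max_lookups):
        if current not in redirects:
            return (200 if current == path else 301, current)
        current = redirects[current]
    raise ValueError(f"redirect loop or chain too long starting at {path}")
```

A permanent redirect (HTTP 301) signals to browsers, crawlers, and preservation tools that the old and new addresses refer to the same publication.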
These guidelines discuss identifiers, another way to support URL persistence:
27. Persistent identifiers can be used at the publication resource level
31. Persistent identifiers should be assigned to new versions of the work
Where there are multiple publications on the same domain or subdomain, and each one spans multiple pages, using a consistent and hierarchical naming convention in the URL path helps web harvesting tools identify its scope. For example, if the publication content is organized in these directories: example.org/book-slug/text, example.org/book-slug/resources, a crawler can be set to generate an archive of the resources within the “book-slug” directory.
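The scoping idea can be illustrated with a short Python sketch of the test a crawler might apply to decide whether a URL belongs to a publication. The function name `in_scope` is hypothetical, and real harvesting tools have their own scoping rules and configuration.

```python
from urllib.parse import urlsplit


def in_scope(url, scope_prefix):
    """True when url is on the same host as scope_prefix and inside its path.

    Both arguments must be full URLs including the scheme (https://...).
    """
    target, scope = urlsplit(url), urlsplit(scope_prefix)
    # Compare whole path segments so "/book-slug-2" does not match "/book-slug".
    scope_path = scope.path.rstrip("/") + "/"
    target_path = target.path if target.path.endswith("/") else target.path + "/"
    same_host = target.netloc.lower() == scope.netloc.lower()
    return same_host and target_path.startswith(scope_path)


if __name__ == "__main__":
    scope = "https://example.org/book-slug/"
    for candidate in (
        "https://example.org/book-slug/text/chapter-1.html",
        "https://example.org/book-slug/resources/map.png",
        "https://example.org/other-book/text/chapter-1.html",
    ):
        print(candidate, in_scope(candidate, scope))
```

With a hierarchical path, a single prefix is enough to describe the whole publication. Without one, the scope would have to be listed page by page.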
If publishers are involved early enough in the development process for a custom web application that is being built for a single publication, they should encourage developers and authors to make choices that avoid external dependencies or to have fallback mechanisms when external dependencies fail. For example, if a connection to Google Maps fails, fall back to a still image and the vector coordinates. Developers can test their site by running it in a virtual environment with no internet connection. If it works, it is not only likely to be easier to preserve, but also much more sustainable and easier for the publisher to maintain.
These guidelines may be referred to when considering encapsulation:
14. Avoid depending on externally hosted web services
51. Embed multimedia locally
56. Avoid embedding map visualizations where a static representation would suffice
All websites have to be maintained in order to be sustained on the live web. An over-complicated web application will not only degrade more quickly and be more expensive to maintain, it will likely also be more difficult to preserve as an application. Unless the focus of the project is experimental technology, use technologies and programming languages that will be easily supported by technical staff. Do not unnecessarily overcomplicate the infrastructure and code. A helpful reference for building sustainable projects is the University of Victoria’s Endings Principles for Digital Longevity.
These guidelines may also be helpful when considering publication software:
2. General considerations for designing or selecting publication platforms
3. Favor existing standards
When a custom publication is developed using plain HTML5, CSS, and JavaScript that does not communicate with a live web server, it may be possible to run the entire application from a local machine by opening it in a browser. In this case, a clean application package should be created and retained by the publisher as a backup and for preservation. Work with the developer and author to ensure that this preservation copy: functions fully offline; does not contain any system files, server information, or logs; uses relative links that do not contain a specific domain name; and contains only local stylesheet, font, or JavaScript references. If there are features that depend on a third-party service, e.g. for search or commenting, that are not a core intellectual component of the work, these can be disabled. A README file should be placed in the root of the application folder to describe the project, instructions, dependencies, versions of technologies used, and details of any unique features that might be useful for playback later. The entire package can be stored as a zip file. If updates happen once the application is deployed on the live server, these should be reflected in the clean preservation copy and a version number should be expressed in the package.
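Part of this checklist can be automated. The sketch below, in Python using only the standard library, scans the HTML files in a package folder and reports references that point outside the package. It is a rough aid under stated assumptions: it only inspects `href`, `src`, and `poster` attributes in HTML, so references inside CSS or JavaScript files are not detected, and the function names are illustrative.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urlsplit

# Attributes that can pull in or point to another resource.
URL_ATTRIBUTES = {"href", "src", "poster"}


class ReferenceCollector(HTMLParser):
    """Collect (tag, url) pairs from the URL-bearing attributes of a page."""

    def __init__(self):
        super().__init__()
        self.references = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in URL_ATTRIBUTES and value:
                self.references.append((tag, value))


def is_external(reference):
    """True for absolute (https://...) and protocol-relative (//host/...) URLs."""
    parts = urlsplit(reference)
    return bool(parts.netloc) or parts.scheme in {"http", "https"}


def external_references(html_text):
    """Return the (tag, url) pairs in html_text that leave the package."""
    collector = ReferenceCollector()
    collector.feed(html_text)
    return [(tag, ref) for tag, ref in collector.references if is_external(ref)]


def audit_package(root):
    """Map each HTML file under root to the external references it contains."""
    report = {}
    for page in sorted(Path(root).rglob("*.html")):
        text = page.read_text(encoding="utf-8", errors="replace")
        found = external_references(text)
        if found:
            report[str(page.relative_to(root))] = found
    return report
```

Ordinary outbound hyperlinks, such as citations to other websites, will also appear in the report. Those are expected, so the output needs human review: the items to fix are stylesheets, fonts, scripts, and media loaded from elsewhere, and internal links that hard-code the publication's own domain.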
When other methods of preserving a web publication (export, web crawling) cannot appropriately capture the important properties of a publication because it is dynamic and data-driven, a preservation service may attempt to preserve the application itself with the goal of running it in an emulated web server environment in the future. In order to do this, the preservation service would require a clean installation package as well as documentation of the requirements, dependencies, and installation process. A preservation copy could be created during the publication process. Work with the developer and author to ensure this preservation copy: functions fully in a self-contained web server that does not have access to any resources outside of the machine; does not contain any server information or logs; uses relative links that do not contain a specific domain name; and contains only local stylesheet, font, or JavaScript references. Where features require a live third-party site, consider local functionality that could replace it adequately in this package. Overall, it would be beneficial for the developers of the publication to design any website with sustainability and encapsulation in mind, ensuring that files are local to the application where possible and that there is a simple way to fall back to local functionality for integrations such as third-party resources.
These guidelines also discuss the installation package for a web application:
58. Consider encapsulation of custom-built web applications early
60. Request an installation script for custom software and websites
61. Produce packages for software and websites that don’t require a live server
For publications where some content should not be preserved, consider tagging content in a consistent way, so that preservation export or harvesting processes can use the tags to exclude items that should not be preserved. Platforms may want to facilitate this tagging.
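As a minimal sketch of how an export process might act on such tags, the Python example below splits a list of resources into those to preserve and those to leave out. The tag name `preservation:exclude` and the shape of the resource records are hypothetical. A platform would define its own tagging convention and agree on it with its preservation service.

```python
# Hypothetical tag marking content that must not be sent for preservation.
EXCLUDE_TAG = "preservation:exclude"


def select_for_preservation(resources, exclude_tag=EXCLUDE_TAG):
    """Split resources into (to_preserve, excluded) lists based on a tag.

    Each resource is a dict with an optional "tags" list; untagged
    resources are preserved by default.
    """
    to_preserve, excluded = [], []
    for resource in resources:
        if exclude_tag in resource.get("tags", ()):
            excluded.append(resource)
        else:
            to_preserve.append(resource)
    return to_preserve, excluded


if __name__ == "__main__":
    manifest = [
        {"path": "text/chapter-1.html", "tags": []},
        {"path": "media/licensed-clip.mp4", "tags": ["preservation:exclude"]},
        {"path": "media/figure-2.png"},
    ]
    keep, skip = select_for_preservation(manifest)
    print("preserve:", [item["path"] for item in keep])
    print("exclude:", [item["path"] for item in skip])
```

Returning the excluded items as well makes it possible to log what was left out, which helps when confirming with authors that the exclusions are intended.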
These guidelines also concern the inclusion and exclusion of content in the preservation process:
10. Define and document core intellectual components that need to be preserved
20. Represent all core intellectual components of the work in the export package
40. Identify the rights for external web content
55. Consider whether it is ethical/appropriate to preserve social media
65. Ensure irrelevant or private administrative data is removed from data exports
Broadly describe the preservation approaches for the different types of content that can be added to a platform. This gives the publisher, authors, and preservation service a shared understanding of what can be preserved, so that authors can make informed decisions about which enhancements to include in their publication. This documentation could indicate to authors, for example, that they should have appropriate rights to files uploaded into the system and that these files will be shared with a preservation service. It might also define a platform’s approach to third-party content in iframes by stating that content in iframes may not be preserved or maintained. Alternatively, it could instruct authors that all content in iframes will be archived, so iframes should only be used if the content in them is owned by the author or they have rights that allow it to be harvested by a preservation service. Information about a platform-level approach can be incorporated into or connected to a Terms of Use document, or could be in the form of a publicly visible preservation policy.
See also:
6. Keep preservation partners informed of changes
10. Define and document the core intellectual components of a work