Successful website archiving is contingent on a harvester visiting every URL that forms the work. If a full list of URLs is not supplied to the harvesting tool via a sitemap or other configuration, automation may be used to discover them. Automated website crawling tools can easily identify the targets of simple HTML <a> or <link> tags with a relative or absolute URL and will include them in a crawl. Many websites, however, use JavaScript actions to fetch content. Crawlers may not be able to identify URLs that are loaded by JavaScript, causing that content to be missed during an automated archiving process. Similarly, hyperlinks within compiled features (e.g., compiled 3D visualizations) can be difficult or impossible for a crawler to discover. When designing web content, consider the value of using simple HTML links so that crawlers can identify the URLs that make up a work.

Note that, as with <link> tags, the target URLs of <a> tags will likely be crawled even if they do not display text on the page, so they can be used to guide a crawler to relevant content. Conversely, a crawler cannot determine which of these tags link to content that is not vital to the work; using these tags for other purposes, or leaving hidden link tags that are never used, can lead the crawler to material that is out of scope for an archived copy of the publication, such as previous or unused iterations of a page.
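As an illustrative sketch (not the behavior of any specific archiving tool), a crawler's link-discovery step can be approximated with Python's standard html.parser: it sees only the href values present in the served HTML, so a URL fetched later by JavaScript never appears. The page markup and URLs below are hypothetical.

```python
from html.parser import HTMLParser

class LinkFinder(HTMLParser):
    """Collect href targets from <a> and <link> tags, as a crawler would."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "link"):
            href = dict(attrs).get("href")
            if href:
                self.urls.append(href)

# The crawler sees only the static markup; the fetch() call inside the
# button's onclick is opaque to it, so /chapter-2.json is never discovered.
page = """
<a href="/chapter-1.html">Chapter 1</a>
<link rel="next" href="/chapter-2.html">
<button onclick="fetch('/chapter-2.json').then(render)">Next</button>
"""

finder = LinkFinder()
finder.feed(page)
print(finder.urls)  # ['/chapter-1.html', '/chapter-2.html']
```

The <link> tag's target is discovered even though it renders nothing on the page, while the JavaScript-loaded URL is invisible, matching the behavior described above.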
Following this guideline may make such changes for efficient crawling less critical:
43. Include a sitemap for all web-based publications
In an earlier version of the Manifold platform, the HTML for some of the buttons/links used to navigate a project was coded in a way that prevented a web crawler from discovering the target page of the link. Because web crawlers could not find these pages, portions of Manifold projects could be missing from a web-archived copy. Web archive tools typically crawl a website by finding <a> tags in the HTML and then using the href= value of each tag to identify other pages that should be archived.

Originally, the buttons used to page through the resources associated with a project in Manifold were all coded as <a href="#">. Clicking this tag triggered JavaScript that loaded the next set of resources into the page. This works fine for a human user, but most web archive tools would instead retrieve the page https://{domain}/resources# rather than the numbered resource pages a user would see. As a result, only the first resource page was discovered for the archived version; the subsequent pages were not, and neither were the resources linked on those pages. After feedback from the embedding team, these links were re-coded in the format <a href="?page=2">, which yields a URL that can be visited by a web crawler and also bookmarked by human users, e.g., https://{domain}/resources?page=2.

In another part of the system, instead of <a> tags, the code used <div> tags that were styled to look like buttons and had onclick() actions in JavaScript that loaded new content. A web crawler looks for <a> tags to identify links, but <div> tags have many uses and a crawler would not "know" to click on one to reach new content, so these pages would be missed by web archive tools. These were changed to the format <a href="/projects/path-to-page">, which a web crawler understands. The content on those pages can now be automatically discovered and archived.
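The before/after difference can be demonstrated with a small sketch. The markup below is modeled on the description above, not taken from Manifold's actual source; a crawler's discovery step is again approximated with Python's standard html.parser.

```python
from html.parser import HTMLParser

class AnchorFinder(HTMLParser):
    """Record href values of <a> tags, mimicking a crawler's discovery step."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)

def discovered(html):
    """Return the link targets a crawler would find in the given markup."""
    finder = AnchorFinder()
    finder.feed(html)
    return finder.hrefs

# Before: the pager link points at "#" and a styled <div> handles the click
# via JavaScript, so the crawler learns nothing about the later pages.
before = '<a href="#">Next</a> <div class="btn" onclick="loadPage(2)">2</div>'

# After: plain hrefs give the crawler real, bookmarkable URLs.
after = '<a href="?page=2">Next</a> <a href="/projects/path-to-page">2</a>'

print(discovered(before))  # ['#']
print(discovered(after))   # ['?page=2', '/projects/path-to-page']
```

In the "before" markup the only discovered target is "#", which resolves to the resources page itself; in the "after" markup both the pager URL and the project page URL are discoverable.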