Frequently Asked Questions

What is the Archiving Program?

The State Government Website and Social Media Archive is a joint project of the State Library of North Carolina and the State Archives of North Carolina. It captures selected state government websites and social media accounts: those that agencies have suggested for archiving and that the State Archives’ Records Analysts have vetted as containing historically significant information not available through other means.


The North Carolina State Government Web Archives uses the Internet Archive tool Archive-It to capture websites. Archive-It performs web crawls on websites to capture text, images, the structure and organization of data, and associated files (Word documents, PDFs, etc.) when possible. Archive-It can be used to crawl both traditional websites and social media. Several factors can interfere with Archive-It's ability to crawl a website, including website logins (as with some social media platforms) and data stored in databases. Archive-It can only access publicly displayed information, meaning it cannot capture private messages available to users of some social media platforms. Web crawls are initiated at regular intervals by a member of the Web and Social Media Archive Committee. Crawls are typically initiated once every two months or, in the case of infrequently updated websites, once a year. Crawls are regularly vetted for quality control by members of the Web and Social Media Archive Committee.

The North Carolina Social Media Archives uses the tool ArchiveSocial to collect selected state government social media. ArchiveSocial captures social media posts as well as other content that may not be publicly available, such as private messages. It does this by using the API (Application Programming Interface) feed available from some social media platforms to download this information. For ArchiveSocial to access the feed, the account owner must give their permission via the social media platform that is to be captured. This means that when account owners change their passwords, ArchiveSocial loses access to the API, and the account owners must give permission again to restart the collection of information. 

Because ArchiveSocial captures information as soon as it is posted, the archived content for the accounts it captures is always up to date; there is no need to wait for a web crawl to be initiated. Unlike Archive-It, ArchiveSocial has no means to vet or quality-check data before it is saved.

Archive-It captures all embedded elements on a seed site page (including images, style sheets, JavaScript, PDFs, and so on) for up to 100 hops from the original seed page within the same host domain. Archive-It does not capture links to other sites or subdomains (such as axaem.archives.ncdcr.gov) unless the subdomain is entered as a separate seed or the primary seed has been entered in a very specific way.
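The scoping rule described above can be pictured with a small sketch. This is only a simplified model built from the description in this answer (same host, at most 100 hops); it is not Archive-It's actual scoping logic, and the function name `in_scope` and the example page path are hypothetical.

```python
from urllib.parse import urlsplit

MAX_HOPS = 100  # hop limit described in this FAQ


def in_scope(seed_url: str, candidate_url: str, hops: int) -> bool:
    """Return True if a link would fall inside the crawl scope of a seed.

    Simplified model: the link must be on exactly the same host as the
    seed (subdomains count as different hosts) and within the hop limit.
    """
    seed_host = urlsplit(seed_url).hostname
    candidate_host = urlsplit(candidate_url).hostname
    if seed_host is None or candidate_host is None:
        return False
    return candidate_host == seed_host and hops <= MAX_HOPS


seed = "https://archives.ncdcr.gov/"
# A page on the seed's own host, a few hops away (hypothetical path).
print(in_scope(seed, "https://archives.ncdcr.gov/some-page", 3))
# A subdomain is treated as a different host unless it is its own seed.
print(in_scope(seed, "https://axaem.archives.ncdcr.gov/", 1))
```

Under this model, a subdomain only becomes crawlable when it is passed in as its own seed, which mirrors the requirement to enter subdomains as separate seeds.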

Several factors can limit or expand the amount of data captured for each seed, depending on how often the site is updated and how much data there is to crawl: 

  • We crawl most sites every 2 months; sites for boards and commissions are crawled only once a year. Data that has been added and removed within the window between crawls will not be captured. 
  • Our crawls are set to expire after 7 days, so depending on the rate of data capture, there may be data missing from a completed crawl. 
  • Robots.txt exclusions and other code built into websites can limit what Archive-It captures. For instance, as of December 2023, the crawl data from X (Twitter) and Facebook seeds is unusable because of log-in requirements and other coding barriers. These issues change as fast as the technology does. 
  • Crawler traps can create link loops that expand the data collected from a site by duplicating the same links infinitely (or until the crawl hits 100 hops or 7 days). 
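As a rough illustration of how a robots.txt exclusion keeps a well-behaved crawler away from part of a site, the sketch below uses Python's standard `urllib.robotparser` module. The robots.txt content, the `example.gov` URLs, and the `example-crawler` user agent are all made up for the example; they do not describe any particular state website or the crawler Archive-It actually uses.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every crawler is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The excluded path is off limits; everything else may be crawled.
print(allowed(ROBOTS_TXT, "example-crawler", "https://example.gov/private/report.pdf"))
print(allowed(ROBOTS_TXT, "example-crawler", "https://example.gov/minutes/2023.pdf"))
```

A document stored under an excluded path would be missing from the crawl even though it is publicly visible in a browser.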

Every seed undergoes quality control (QC) once a year by the web and social media archiving team, which allows the team to run patch crawls on any missing pages. For a seed crawled every two months, however, that means only 1 in 6 crawls has been checked for completeness. 
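The "1 in 6" figure follows from the crawl schedule: a seed crawled every two months is crawled six times a year, and one of those crawls is checked. A minimal sketch of that arithmetic, with a hypothetical helper name `qc_coverage`:

```python
from fractions import Fraction


def qc_coverage(crawl_interval_months: int, qc_per_year: int = 1) -> Fraction:
    """Fraction of a seed's crawls that receive quality control each year."""
    crawls_per_year = Fraction(12, crawl_interval_months)
    # Coverage cannot exceed 100 percent of the crawls.
    return min(Fraction(1), Fraction(qc_per_year) / crawls_per_year)


print(qc_coverage(2))   # seeds crawled every two months
print(qc_coverage(12))  # seeds crawled once a year
```

By the same arithmetic, a seed that is crawled only once a year has every crawl checked.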

For all these reasons, the data captured by Archive-It is considered only a snapshot of the website at a given time and cannot be relied upon as a perfect copy of the web content of a seed. Thus, even if other record types are captured via the Archive-It web crawls (for instance, meeting minutes loaded as PDFs on an agency website), the Archive-It copy should never be considered the record copy. 

By contrast, ArchiveSocial captures social media content in real time. It relies on APIs on the backend of social media platforms to download the content of each data component of the site, including posts, messages, events, and so on. ArchiveSocial is limited by the availability of APIs (for instance, Threads does not have an API as of December 2023) and by our account level, which only allows 125 accounts to be captured at a time. (As of December 2023, we have 117 historical accounts and 115 active accounts in ArchiveSocial, comprising 8,686,071 records and growing at an average rate of 47,817 records per month.) The platforms that are supported as of December 2023 are: 

  • Flickr 
  • Facebook 
  • Facebook Groups 
  • Facebook Pages 
  • Instagram Business 
  • Instagram Personal 
  • LinkedIn Company 
  • LinkedIn Personal 
  • Google+ 
  • Pinterest 
  • TikTok 
  • X (Twitter) 
  • Vimeo 
  • YouTube

For the website archives in Archive-It, several factors can limit or expand the amount of data captured for each seed, depending on how often the site is updated and how much data there is to crawl. In general:

  • We crawl most sites every two months. 
  • Sites for boards and commissions are crawled only once a year. 
  • Data that has been added and removed within the window between crawls will not be captured.

By contrast, ArchiveSocial captures social media content in real time. It relies on APIs on the backend of social media platforms to download the content of each data component of the site, including posts, messages, events, and so on.

The State Government Website and Social Media Archive is a joint venture between the State Archives of North Carolina (SANC) and the State Library of North Carolina (SLNC), meaning that both organizations are involved in implementing and managing web and social media archiving. We work together to crawl the requested websites and social media sites, capture the data in Archive-It, and create a record of this digital information.

State Archives of North Carolina Members: 

  • Digital Services Section Head 
  • Digital Archivist 
  • Systems Integration Librarian 
  • Information Management Archivist 
  • Digital Description Archivist 

State Library of North Carolina Members: 

  • Digital Projects Librarian 
  • State Publications Clearinghouse Liaison 
  • Systems Support Librarian

What we do: 

  1. We crawl websites and social media accounts using Archive-It and ArchiveSocial to capture a record of the agency’s online business. 
  2. We add URLs as seeds to Archive-It and potentially to ArchiveSocial (depending on appraisal value) when an agency requests it through its analyst. 
  3. We inactivate seeds that are no longer in operation and no longer need to be crawled. When a seed is marked inactive, the platform retains the previously captured information as a historical record, and the URL is no longer crawled. 
  4. We divide the results of the two-month and yearly crawls among the team to perform quality control on the captured information. If any issues arise during quality control, we discuss possible solutions as a group and send tickets to Archive-It for further details. 

What we don’t do: 

  • We do not appraise websites and social media accounts for archival content. This duty falls to the coordinating records analyst within the Records Analysis Unit of the State Archives, the Records Description Unit supervisor, and the appraisal archivist. 
  • We do not approach agencies to archive their sites. Agencies should speak with their coordinating analyst to determine if their websites or social media accounts can and should be archived.
  • We do not capture every social media account. As previously mentioned, we are limited to what the tools can capture.

For state agencies, websites and social media records are scheduled in the Functional Schedule for North Carolina State Agencies in records retention schedule “15. Public Relations,” available at https://archives.ncdcr.gov/public-relations.

RC No. 1515 Social Media and Websites identifies three possible retentions for state agency web and social media content:

Screenshot of rule 1515 from the State Agency Functional Schedules which deals with social media

Note that “social media sites and other websites that have historical content” are scheduled as permanent records, but an appraisal step is required to determine whether the State Archives will collect the content or the agency will be responsible for maintaining its social media and website records in office. 

Routine social media records have a 5-year retention, and records produced during planning and executing social media activities may be destroyed once they are superseded or obsolete. 

The Government Records Section has developed, and is in the process of expanding, a set of social media appraisal criteria and an accompanying workflow to identify social media of enduring value. This is particularly important for social media accounts that cannot be captured via Archive-It (for example, X (Twitter) and Facebook accounts), because our account with ArchiveSocial, the platform for social media capture, has limited storage space. In general, the criteria follow the guidance provided in the Functional Schedule “Overview” document’s section titled “Historical Value.” 

A list of records analysts and the agencies they work with is available on the State Archives of North Carolina website.

Most likely, your accounts are being archived through ArchiveSocial and something has changed: your password(s), the person managing your accounts, or both. Changes like these can break ArchiveSocial's access to the backend of your accounts, where archiving happens, meaning the accounts are no longer being archived. In those cases, someone from the State Archives will attempt to contact the person listed as managing your accounts. If we don't hear back from them, we will attempt to contact the Chief Records Officer (CRO) for the agency to update that information. If we do not hear back from the agency within two months, we will make the account inactive to free up space for collecting accounts that we do have access to archive. The previously archived information for the accounts will remain in the archives. If we have open space to offer, accounts can be reconnected at a later date.
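The two-month rule above can be sketched as a simple date check. This is an illustration only, not the State Archives' actual procedure: the function names are hypothetical, and it assumes the two-month window is counted from the date the agency was first contacted, which this answer does not spell out.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Return the date `months` calendar months after `start`.

    The day is clamped to the last day of the target month,
    so 31 December plus two months gives the last day of February.
    """
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def should_deactivate(first_contact: date, today: date, agency_responded: bool) -> bool:
    """True once two months have passed without a response from the agency."""
    return (not agency_responded) and today >= add_months(first_contact, 2)


# No response two months after first contact: the account would be made inactive.
print(should_deactivate(date(2023, 10, 15), date(2023, 12, 15), agency_responded=False))
```

Making an account inactive in this sense only stops new collection; the previously archived records are kept.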