
Text and Data Mining Resources

Disclaimer

  • When in doubt about how your intended use of library electronic resources will comply with library license agreements, please contact your subject librarian. The library can reach out to vendors to help you explore paid options that you can then write into your grant proposals.
  • Systematic downloading of materials from Atkins Library electronic resources is not supported by our license agreements and will result in a shutdown of access for all of our users. We can help you locate datasets to text mine, along with contact information or instructions/documentation for resources accessible to our library.
  • We cannot assist with programming or with using APIs to text or data mine these resources.

TDM Sources

Vendor Details

JSTOR - Data for Research (DfR) - free

A self-service system for text mining. By creating a free DfR account, you can download the metadata, word frequencies, citations, key terms, and N-grams of up to 1,000 documents. To get larger datasets or a type of data not available through the main site, contact JSTOR directly: support@ithaka.org (Guide for using DfR)
SpringerLink - free (through library subscription) You can download subscribed and open access content for TDM purposes directly from the SpringerLink platform. 
Full-text content can be accessed via friendly URLs: PDF: http://link.springer.com/[DOI].pdf OR HTML (when available): http://link.springer.com/[DOI].html
Content can be downloaded via a web browser or with an HTTP GET request using a scripting tool. Researchers are requested to be considerate and limit their downloading speed to a reasonable rate.
PLEASE read their instructions for more detailed information HERE
More information about Springer's Nature API Portal HERE or BioMed Central HERE
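The URL pattern and the request to limit download speed can be sketched in Python as follows. This is a minimal illustration rather than Springer's own tooling: the function names, the 5-second default delay, and the placeholder DOI in the usage note are assumptions made for the example. Check Springer's instructions and your license terms before running anything against the live site.

```python
import time
import urllib.request
from urllib.parse import quote

# Base of the "friendly URL" pattern quoted in the SpringerLink entry.
BASE_URL = "http://link.springer.com"


def fulltext_url(doi, fmt="pdf"):
    """Build the friendly full-text URL for a DOI (fmt is 'pdf' or 'html')."""
    if fmt not in ("pdf", "html"):
        raise ValueError("fmt must be 'pdf' or 'html'")
    # Keep the slash that separates DOI prefix and suffix; escape anything unusual.
    return f"{BASE_URL}/{quote(doi.strip(), safe='/')}.{fmt}"


def http_get(url):
    """Fetch a URL with a plain HTTP GET and return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return response.read()


def download_all(dois, fmt="pdf", delay_seconds=5.0, fetch=http_get):
    """Download each DOI in turn, pausing between requests.

    delay_seconds is an assumed, conservative pause; the source only asks
    for a "reasonable rate" and does not give a number.
    """
    results = {}
    for index, doi in enumerate(dois):
        if index > 0:
            time.sleep(delay_seconds)  # be considerate: one request at a time
        results[doi] = fetch(fulltext_url(doi, fmt))
    return results
```

For example, `fulltext_url("10.1007/xxxxx")` (a placeholder, not a real DOI) produces `http://link.springer.com/10.1007/xxxxx.pdf`. Passing a different `fetch` function makes it possible to try the loop without touching the network.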
National Center for Biotechnology Information Multiple collections of articles/abstracts from the National Library of Medicine. You can click "Tools" or "Web APIs" on the top menu to find out more information about accessing the data here
PLOS Search API (Public Library of Science)  Gives developers access to rich data that can be flexibly integrated into applications for the web, desktop or mobile devices.
HathiTrust

A partnership of academic and research institutions, offering a collection of millions of titles digitized from libraries around the world. Access HathiTrust Datasets and its Research Center for information about text and data mining.
Lesson on Text Mining in Python through the HTRC Feature Reader.

Also, a new HTRC Derived dataset (2.0) is available with documentation here. HTRC Extracted Features 2.0 is the most current version of a derived dataset consisting of metadata and data elements extracted from volumes in the HathiTrust Digital Library. The dataset is composed of 17+ million JSON files representing a snapshot of the HathiTrust corpus from February 2020.

Digital Public Library of America (DPLA) Access to DPLA API Codex, Bulk Download, Technical Documentation, and Sample Code and Libraries.
Internet Archive Instructions for developers and those interested in bulk download and API access.
Chronicling America (API)  Provides access to information about historic newspapers and select digitized newspaper pages. Search America's historic newspaper pages from 1789 to 1963 or use the U.S. Newspaper Directory to find information about American newspapers published from 1690 to the present.
Gale Eighteenth Century Collections Online (ECCO) Available at the Text Creation Partnership (ECCO-TCP). Online searching at the partnership is free, but text-mining the collection requires a fee to access "raw" data.
Early English Books Online (EEBO) - free The Text Creation Partnership creates standardized, accurate XML/SGML encoded electronic text editions of early print books. Phase I is freely available (25,000 titles).
Oxford English Dictionary and Oxford University Press (Oxford Scholarship Online) - free  Oxford accommodates TDM for non-commercial use. Researchers are not required to request permission for non-commercial text mining. If you have any questions, please e-mail Data.Mining@oup.com
LexisNexis Academic (now Nexis UNI) - free/fee-based Not specifically available for text mining, but since text files can be downloaded many articles at a time, mining is possible. You can contact LexisNexis to inquire about using/purchasing their "Data as a Service" for larger datasets. A link is also available for the LexisNexis Bulk Content API personal consultation service.
CAP API (from Harvard Law Library) - free The Caselaw Access Project API, also known as CAPAPI, serves all official US court cases published in books from 1658 to 2018. The collection includes over six million cases scanned from the Harvard Law Library shelves.
Adam Matthew Databases - free/fee-based Source of unique primary digital collections that are readily available for mining. Researchers may contact the company directly at info@amdigital.co.uk to discuss data mining requests/projects. More information here.
Govinfo Bulk Data Repository 

Bills, regulations, rules, and papers of the presidents of the U.S.

Royal Society of Chemistry (RSC) Researchers should send the following information to jnl_licences@rsc.org one month before the activity begins:
Start date, Completion date, Institution, Crawler IP address, Crawler user agent, Types of content (HTML / PDF), Institution contact email, and Researcher contact email.
Wiley databases Instructions can be found at the linked website.

 

More APIs
