Daniele Rotolo, PhD
Professor in Science, Technology and Innovation Policy – SPRU, University of Sussex Business School
Associate Professor - Department of Mechanics, Mathematics and Management, Politecnico di Bari
Matching MEDLINE/PubMed Data with Web of Science (WoS):
A Routine in R-language
Rotolo and Leydesdorff (2015)
Journal of the Association for Information Science and Technology, 66(10): 2155–2159
medlineR is a routine written in R that matches data from MEDLINE/PubMed with records indexed in the ISI Web of Science (WoS) database, thus providing access to fields (e.g. citation data, lists of cited references, full lists of the addresses of authors’ host organisations, WoS subject categories) that are available neither in MEDLINE/PubMed nor directly in the MEDLINE interface of WoS. For a similar routine for Windows-based computers, see www.leydesdorff.net/medline/.
The R-script of medlineR is available here.
To run medlineR, the user has to follow a number of steps:
-
The user first installs R and the package "stringr". The latter can be installed from the R command line with the command install.packages("stringr"). An Internet connection is required.
-
The user identifies a list of "PubMed Identifiers" (PMIDs) in MEDLINE/PubMed to match with WoS data. The list of PMIDs can be derived from the set of publications returned by the query the user is performing. The MEDLINE/PubMed interface allows PMIDs to be downloaded in txt format. It is worth noting that medlineR can also work directly in WoS MEDLINE with a valid search string in the advanced-search interface. In this case, the user skips step 2 and step 3.
-
Assuming that the list is composed of M PMIDs, the user builds an advanced search string in the WoS MEDLINE interface according to the following syntax: PM=PMID_1 or PM=PMID_2 or ... or PM=PMID_M. Conventional spreadsheets can be used to generate the search string.
-
This search string is then used to query WoS MEDLINE through the advanced search interface. The query returns a list of documents, each of which can be accessed through its own URL.
-
The user has to input three parameters in the medlineR script:
(a) One of the URL links to WoS MEDLINE documents (variable wosurl). To do so, the user can access the first document in the list and copy-paste the URL associated with this document, between quotation marks, into the R code. medlineR will use this URL to generate the remaining ones automatically. [If the user builds a search string in the advanced search interface of WoS MEDLINE and the search returns more than 10 documents, the URL associated with one of these documents will have the following format: "http://apps.webofknowledge.com/full_record.do?product=MEDLINE&search_mode=AdvancedSearch&qid=***&SID=***&page=***&doc=***". Stars replace codes that are generated by WoS MEDLINE according to the performed search and the session number. In the case of a search in the general interface, the URL will have the following format: "http://apps.webofknowledge.com/full_record.do?product=MEDLINE&search_mode=GeneralSearch&qid=***&SID=***&page=***&doc=***".]
(b) The number of documents to collect (variable numdocs).
(c) The path to the folder in which the outputs of the medlineR routine should be saved. This should be input in R between quotation marks (e.g. "C:\\Users\\user" in Windows or "/Users/username/Desktop/" in Mac OS X).
-
The user has to launch medlineR (the shortcut in Windows is "CTRL+A" and then "CTRL+R", whereas in Mac OS X it is "CMD+A" and then "CMD+Return"). The routine parses the wosurl variable to generate the whole set of links pointing to the identified documents and collects the full HTML code of the associated webpages. The HTML code is then parsed to retrieve the documents’ UTs, i.e. "Unique Article Identifiers" in WoS. An indication of the number of processed records is reported in the R interface as the routine advances.
-
The medlineR routine generates three different outputs: (i) a set of files (sequentially named wosPMID=1.txt, wosPMID=2.txt, etc.) containing the HTML code of each webpage, (ii) a document, called wosut.txt, which lists PMIDs with their associated UTs (when available), and (iii) a file, called search.txt, that provides the full search string for WoS based on the collected UTs. This string can be used in the advanced interface of WoS to retrieve the full records of the identified documents. These can then be downloaded from the WoS interface. The medlineR routine may return warning messages at the end of the data collection process. The user can ignore these messages since they do not affect the produced files.
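Steps 3 and 5 above can be sketched in a few lines of base R. This is only an illustration and not the code of medlineR itself: the function names build_pm_query and make_record_urls are hypothetical, the qid and SID values in the usage lines are placeholders, and the way the page parameter is recomputed (10 records per page) is an assumption suggested by the URL format described in step 5, not confirmed behaviour of WoS.

```r
# Hypothetical helper: build the advanced-search string "PM=... or PM=...".
build_pm_query <- function(pmids) {
  pmids <- trimws(as.character(pmids))
  pmids <- unique(pmids[nzchar(pmids)])
  # PMIDs are expected to be purely numeric identifiers.
  stopifnot(all(grepl("^[0-9]+$", pmids)))
  paste0("PM=", pmids, collapse = " or ")
}

# Hypothetical helper: derive one URL per record from a single example URL
# by rewriting the doc= value and, if present, the page= value.
# ASSUMPTION: records are numbered 1..numdocs and shown per_page to a page.
make_record_urls <- function(wosurl, numdocs, per_page = 10) {
  stopifnot(grepl("[?&]doc=", wosurl), numdocs >= 1)
  vapply(seq_len(numdocs), function(i) {
    page <- ceiling(i / per_page)
    u <- sub("(?<=[?&]doc=)[^&]*", as.character(i), wosurl, perl = TRUE)
    sub("(?<=[?&]page=)[^&]*", as.character(page), u, perl = TRUE)
  }, character(1))
}

# Usage with placeholder values (qid and SID are made up for illustration).
wosurl <- paste0(
  "http://apps.webofknowledge.com/full_record.do?product=MEDLINE",
  "&search_mode=AdvancedSearch&qid=1&SID=EXAMPLE&page=1&doc=1"
)
urls <- make_record_urls(wosurl, numdocs = 12)
query <- build_pm_query(c("123", " 456 ", "123"))
print(query)
print(urls[12])
```

The lookbehind patterns replace only the value after doc= or page=, so the rest of the pasted URL, including the session-specific codes, is left untouched.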
As an example, the files generated for the case of Brugada Syndrome (BRs) are included below:
-
PMID list from MEDLINE/PubMed: BRs_pmid.txt
-
Excel file and advanced search in WoS MEDLINE: BRs_wosmedline.xlsx
-
Html code of WoS webpages: BRs_wos.zip
-
PMIDs and associated UTs: BRs_wosut.txt
-
Search string for WoS: BRs_search.txt
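A PMID list such as BRs_pmid.txt can be loaded into R before building the search string. The sketch below assumes that the file holds one PMID per line, which is an assumption about the download format rather than something stated above. The function name read_pmids is hypothetical, and the identifiers written to the temporary file are made up for illustration.

```r
# Hypothetical helper: read PMIDs from a text file, one identifier per line.
# ASSUMPTION: the file contains one PMID per line, possibly with blank lines.
read_pmids <- function(path) {
  lines <- trimws(readLines(path, warn = FALSE))
  lines <- lines[nzchar(lines)]
  is_pmid <- grepl("^[0-9]+$", lines)
  if (any(!is_pmid)) {
    warning(sprintf("Dropping %d line(s) that are not numeric PMIDs",
                    sum(!is_pmid)))
  }
  # Remove duplicates while keeping the original order.
  unique(lines[is_pmid])
}

# Usage on a temporary file with made-up identifiers.
tmp <- tempfile(fileext = ".txt")
writeLines(c("11111111", "", " 22222222 ", "11111111"), tmp)
pmids <- read_pmids(tmp)
print(pmids)
```

Keeping the PMIDs as character strings avoids any numeric formatting surprises when they are later pasted into a search string.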
IMPORTANT NOTES
-
If the user has to match samples of publications (PMIDs) larger than 1000 records, it is strongly suggested to run medlineR (from step 3 to step 7) multiple times using sub-samples of 1000 or fewer records. This is due to the limitation WoS imposes on data collection from its interface: the connection established with the database may be closed by WoS after about 1000 records have been processed (see also the fair use policy at http://wos.isitrial.com/policy/Policy.htm).
-
medlineR may not work with remote access to WoS. This issue can be addressed with a remote desktop (I am grateful to Frederik Kok Hansen for noting this).
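The sub-sampling recommended in the first note can be prepared in R before any query is run. The following is a minimal sketch, assuming the PMIDs are already held in a character vector. The function name split_pmids is hypothetical and the PMIDs are generated artificially for the example.

```r
# Hypothetical helper: split a PMID vector into batches of at most batch_size.
split_pmids <- function(pmids, batch_size = 1000) {
  stopifnot(batch_size >= 1)
  if (length(pmids) == 0) return(list())
  batch_id <- ceiling(seq_along(pmids) / batch_size)
  unname(split(pmids, batch_id))
}

# Artificial PMIDs, for illustration only.
pmids <- sprintf("%d", 10000001:10002500)
batches <- split_pmids(pmids)

# One "PM=... or PM=..." search string per batch, to be used in separate runs.
queries <- vapply(
  batches,
  function(b) paste0("PM=", b, collapse = " or "),
  character(1)
)
print(lengths(batches))
```

Each element of queries can then be pasted into the WoS MEDLINE advanced search in turn, repeating steps 3 to 7 for every batch.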
© 2014 Daniele Rotolo