How Teseo Doctoral Thesis Data Was Collected

First, I would like to express my thanks for the widespread attention and warm reception given to the initiative to make the Teseo doctoral theses database available to the scientific community. My intention in this work is to provide the data that are accessible from the Teseo website. Given the significant impact of the results obtained, I detail here several aspects to consider and the methodology used for data collection, so that the methodology can be validated.

Aspects to Consider Regarding Teseo

To obtain records from the Teseo database without direct access to the source server, one can crawl permalinks that carry sequential record numbers. Every thesis available through the Teseo search engine is assigned a marker, or permalink, containing the reference number of its record. For example, my doctoral thesis has the permalink «https://www.educacion.gob.es/teseo/mostrarRef.do?ref=933534». It is therefore possible to write a program that requests every sequential link in a given range, for instance from reference 1 up to reference 5 million. This approach downloads every record in that range, provided it is published and accessible via the Teseo website.
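As an illustration only, the permalink scheme can be sketched in Python (the crawler itself is written in PHP). The function names `permalink` and `permalinks` are hypothetical; only the URL pattern comes from the Teseo website.

```python
from urllib.parse import urlencode

# URL pattern taken from the permalink quoted above.
BASE_URL = "https://www.educacion.gob.es/teseo/mostrarRef.do"


def permalink(ref):
    """Return the Teseo permalink for one reference number."""
    return f"{BASE_URL}?{urlencode({'ref': ref})}"


def permalinks(first, last):
    """Yield the permalink of every reference in the closed range [first, last]."""
    for ref in range(first, last + 1):
        yield permalink(ref)
```

Calling `permalinks(1, 5_000_000)` would lazily enumerate the range mentioned above without holding five million strings in memory.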

It must also be taken into account that there are many duplicate entries; in fact, most records have one or more duplicates. For instance, the record for my doctoral thesis, titled «Applications of Syndication for the Management of Bibliographic Catalogs», appears under both reference «933534» and reference «933535». The first record available in Teseo appears three times, under references «3», «4», and «5», and hundreds of similar cases could be cited. The duplication may stem from several factors, about which we can only propose hypotheses: a) each modification of a Teseo record generates a duplicate, as in a version control system; b) an unidentified fault duplicates records when they are updated in the Teseo database; c) some form of data redundancy is implemented to guarantee the presence of records through multiple entries. In every one of these scenarios, the duplicates undermine the purpose of Teseo’s markers or permalinks as unique identifiers.
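One way to make such duplicates visible, sketched here in Python under the assumption that duplicated entries share the same title, is to group reference numbers by a normalized title. The helper names are hypothetical and the sample titles other than my own thesis are placeholders.

```python
from collections import defaultdict


def normalize_title(title):
    """Collapse whitespace and ignore case so trivially different copies match."""
    return " ".join(title.split()).casefold()


def find_duplicates(records):
    """Group (ref, title) pairs by title and keep only titles seen more than once."""
    groups = defaultdict(list)
    for ref, title in records:
        groups[normalize_title(title)].append(ref)
    return {title: refs for title, refs in groups.items() if len(refs) > 1}


# Placeholder data: only the references and the thesis title come from the text above.
sample = [
    (3, "Placeholder title of the first record"),
    (4, "Placeholder title of the first record"),
    (5, "Placeholder title of the first record"),
    (933534, "Applications of Syndication for the Management of Bibliographic Catalogs"),
    (933535, "Applications of Syndication for the  Management of Bibliographic Catalogs"),
]
duplicates = find_duplicates(sample)
```

Grouping by title is a heuristic: two different theses with identical titles would be reported as duplicates of each other.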

Although the method applied to obtain the records from the Teseo database takes all these factors into account, as explained below, certain circumstances that may have influenced the results must also be considered. I wish to emphasize that the resources allocated to this data collection effort were very limited: specifically, a portable device with reduced capacity and an Internet connection that may have experienced interruptions during the process. These factors could have affected the results, and I acknowledge that the actual data might differ from those presented. Consequently, I am conducting a second analysis to confirm and, if necessary, correct the results obtained thus far. In any case, my intention is to provide accurate information, and I am open to constructive suggestions and contributions aimed at achieving a database that matches Teseo as closely as possible.

Teseo Automated Data Collection System

Having said this, I wish to share the automated method used for collecting data from the Teseo database. The code in question is shown below.

Although it is a simple program, it may appear quite complex to those unfamiliar with the PHP programming language. I therefore describe the web crawler’s operation in detail.

The program is designed to retrieve every Teseo web page whose link follows the pattern «https://www.educacion.gob.es/teseo/mostrarRef.do?ref=NNNN», where «NNNN» is a sequential number up to «5,000,000». This numeric reference is what allows the entire spectrum of records available in the Teseo database to be retrieved. The range can be verified in the «for($i=0; $i<=5000000; $i++)» loop on line 23; note that the loop starts at «0», so reference «0» is requested in addition to references «1» to «5,000,000». The program therefore crawls just over 5 million markers from the Teseo database and obtains their HTML source code using the cURL functions shown between lines 27 and 47.
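The shape of this loop can be sketched in Python with the standard library's `urllib` standing in for PHP's cURL functions. This is a minimal sketch, not the program described here: `fetch_html` and `crawl` are hypothetical names, and the `timeout` and `delay` parameters are my own assumptions.

```python
import time
import urllib.request

# URL pattern quoted in the text above.
URL_PATTERN = "https://www.educacion.gob.es/teseo/mostrarRef.do?ref={ref}"


def fetch_html(url, timeout=30):
    """Download one page; return its raw bytes, or None on a network or HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except OSError:  # covers urllib.error.URLError, HTTPError and socket timeouts
        return None


def crawl(first, last, fetch=fetch_html, delay=0.0):
    """Yield (ref, page) for each reference in [first, last] that could be retrieved."""
    for ref in range(first, last + 1):
        page = fetch(URL_PATTERN.format(ref=ref))
        if page is not None:
            yield ref, page
        if delay:
            time.sleep(delay)  # optional pause between requests (assumption)
```

`crawl(0, 5_000_000)` would mirror the bounds of the quoted `for` loop. Passing the download function as a parameter makes it possible to test the loop without touching the network.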

To make it easier to filter the information in the downloaded HTML pages, a DOM (Document Object Model) object is created that maps all the nodes of each page’s HTML structure; see lines 49 to 51.
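The idea of mapping a page to a tree and then querying it can be illustrated in Python using only the standard library. The sketch below is built on loud assumptions: the sample markup (a list of `<li><strong>Label:</strong> value</li>` items) is hypothetical and is not Teseo's real page structure, and `ElementTree` supports only a limited subset of XPath.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # elements with no end tag


class DomBuilder(HTMLParser):
    """Build an ElementTree from HTML, tolerating void and stray tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):  # text after a child element is stored as that child's tail
            parent[-1].tail = (parent[-1].tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def parse_html(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def extract_fields(root):
    """Map each label to its value using ElementTree's XPath subset."""
    fields = {}
    for item in root.findall(".//li"):
        label = item.find("strong")
        if label is not None and label.text:
            fields[label.text.strip().rstrip(":")] = (label.tail or "").strip()
    return fields


# Hypothetical markup, for illustration only.
SAMPLE = """<html><body><h1>Thesis record</h1><br>
<ul>
<li><strong>Title:</strong> Applications of Syndication for the Management of Bibliographic Catalogs</li>
<li><strong>University:</strong> EXAMPLE UNIVERSITY</li>
</ul></body></html>"""
fields = extract_fields(parse_html(SAMPLE))
```

A full XPath engine, such as the one available to PHP's DOM extension, allows richer queries than the subset shown here.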

The next step is the extraction of the elements containing the thesis metadata, such as the title, author, university, date of defense, supervisor(s), tribunal members, descriptors, and abstract. Data extraction from the thesis card can be verified from line 53 to line 229, where XPath queries select the relevant content. The extracted text is then cleaned: the character encoding is converted (the utf8_decode function), the text is preprocessed (the preg_replace function), and excess spaces and tabs that hinder data processing are removed (the trim function). Letter case is normalized with functions such as mb_strtolower, ucfirst, and ucwords. In addition to these text preparation mechanisms, the program detects common errors in which content appears in the wrong position and reassigns the information to the correct variables using regular expression functions such as [preg_match(«/(president|member|secretary)/»,…)].
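The normalization and reassignment steps can be approximated in Python as follows. This is a sketch under assumptions rather than a port of the PHP code: the function names are hypothetical, the sample names are made up, and I assume that a misplaced entry is a tribunal member that ended up in the supervisor field.

```python
import re

# Same alternatives as the regular expression quoted above.
ROLE_PATTERN = re.compile(r"(president|member|secretary)", re.IGNORECASE)


def clean(text):
    """Collapse runs of spaces, tabs and newlines, then strip both ends."""
    return re.sub(r"\s+", " ", text).strip()


def sentence_case(text):
    """Lower-case the text and capitalize only its first letter (titles, abstracts)."""
    text = clean(text).lower()
    return text[:1].upper() + text[1:]


def name_case(text):
    """Capitalize the first letter of every word (personal names)."""
    return " ".join(word.capitalize() for word in clean(text).split(" ") if word)


def reassign(supervisors, tribunal):
    """Move entries that carry a tribunal role out of the supervisor list."""
    kept, moved = [], list(tribunal)
    for entry in supervisors:
        (moved if ROLE_PATTERN.search(entry) else kept).append(entry)
    return kept, moved


# Made-up sample values, for illustration only.
title = sentence_case("  APPLICATIONS of\tSyndication ")
author = name_case(" JOHN   smith\tDOE ")
supervisors, tribunal = reassign(
    ["Supervisor One", "Person Two (president)"],
    ["Person Three (secretary)"],
)
```

Matching on role keywords is only as reliable as the source pages: an entry whose role label is missing would stay in the wrong field.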

Lines 231 to 235 check whether the collected record already exists in the target database used for data storage. This keeps Teseo’s duplicates out of the stored data: a record is inserted only if its title is not already present, so every stored title is unique. A side effect of comparing titles is that two distinct theses with identical titles would be stored only once.

If the thesis title does not yet appear in the target table, named «catalogoteseo», the instructions in lines 236 to 240 are executed; they insert the data extracted from the thesis card.
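The check-then-insert logic can be sketched in Python with `sqlite3` standing in for the actual database engine, which is an assumption on my part. Only the table name «catalogoteseo» comes from the program; the column names and the helper functions are hypothetical.

```python
import sqlite3


def open_catalog(path=":memory:"):
    """Open the target database and create the table if it does not exist."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS catalogoteseo ("
        "id INTEGER PRIMARY KEY, ref INTEGER, title TEXT, author TEXT)"
    )
    return conn


def insert_if_new(conn, ref, title, author):
    """Insert the record unless a row with the same title exists; return True if inserted."""
    exists = conn.execute(
        "SELECT 1 FROM catalogoteseo WHERE title = ? LIMIT 1", (title,)
    ).fetchone()
    if exists:
        return False
    conn.execute(
        "INSERT INTO catalogoteseo (ref, title, author) VALUES (?, ?, ?)",
        (ref, title, author),
    )
    conn.commit()
    return True
```

With this logic, the first of two duplicated references is stored and the second is skipped, so the table keeps whichever reference number the crawler reaches first.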

The final step of the program deletes all the variables and data arrays that were used, so that no residual data carries over into the next iteration of the loop. The entire process is repeated about five million times until the marker «https://www.educacion.gob.es/teseo/mostrarRef.do?ref=5000000» is reached, at which point the program terminates and displays the word «FIN» on screen.

Final Considerations

Teseo is a live database that can change, and indeed does change, daily. It is possible that Teseo does not display all the doctoral theses it actually holds, but publishes records only for those with at least minimal basic data available. This may explain the discrepancies between the data obtained and the official statistics available at [http://www.mecd.gob.es/educacion-mecd/areas-educacion/universidades/estadisticas-informes/estadisticas/tesis-doctorales.html].

It is also possible that gaps exist in certain chronological ranges within Teseo and that these are not apparent in the public records, even though such records may exist in third-party databases or registries; in such cases, the necessary access would be unavailable, making data extraction or download impossible. Furthermore, if Teseo contains more than five million references among its markers, it is conceivable that some records lie outside the range analyzed by the web crawler. In that case, it would suffice to specify a broader range encompassing the remaining Teseo records. This possibility, although unlikely, is currently under investigation to confirm or rule it out.

If new data on the Teseo datasets are obtained, I will promptly report them and update both the entries on the mblazquez portal, [ref1] and [ref2], and the project published on SourceForge [https://sourceforge.net/projects/teseo-database/], so that the information is always as accurate as possible.

Acknowledgments

Finally, I would like to express my thanks once again for all the support received and the trust placed in the work I have been developing. I have strived to be as transparent and faithful to reality as possible with the tools and resources available. I take this opportunity to extend a warm greeting to all readers and followers of mblazquez.es.
