First, I would like to express my thanks for the widespread attention and positive reception given to the initiative to make the Teseo doctoral theses database available to the scientific community. My intention in this work is to provide the data accessible from the Teseo website [https://www.educacion.gob.es/teseo]. Given the significant impact of the results obtained, I want to detail certain aspects to consider and the methodology used for data collection, so that the approach can be validated.
Aspects to consider regarding Teseo
To retrieve records from the Teseo database without direct access to the data source server, a crawling method based on sequential permalinks can be employed. Every thesis available through the Teseo search engine is assigned a marker or permalink that contains the reference number of its record. For example, my doctoral thesis has the permalink «https://www.educacion.gob.es/teseo/mostrarRef.do?ref=933534». It is therefore possible to write a program that systematically requests all sequential links over a given range, for instance from reference 1 up to reference 5 million. This approach ensures that every record published on the Teseo website whose reference falls within that range can be downloaded.
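The enumeration of sequential permalinks can be sketched as follows. This is a minimal illustration, not the crawler itself: the constant `TESEO_REF_URL` and the helpers `teseoPermalink()` and `teseoPermalinks()` are hypothetical names introduced here, and only the URL pattern comes from the text.

```php
<?php
// Base of the permalink pattern quoted in the text.
const TESEO_REF_URL = 'https://www.educacion.gob.es/teseo/mostrarRef.do?ref=';

// Hypothetical helper: build the permalink for one reference number.
function teseoPermalink(int $ref): string
{
    return TESEO_REF_URL . $ref;
}

// Hypothetical helper: lazily yield "reference => permalink" pairs for
// every reference in the inclusive range [$from, $to].
function teseoPermalinks(int $from, int $to): Generator
{
    for ($ref = $from; $ref <= $to; $ref++) {
        yield $ref => teseoPermalink($ref);
    }
}

// Print the permalinks for references 3 to 5.
foreach (teseoPermalinks(3, 5) as $ref => $url) {
    echo $ref, ' -> ', $url, PHP_EOL;
}
```

A generator keeps memory use flat even for a range of several million references, because each URL is built only when the loop asks for it.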
It must also be taken into account that there are many duplicate entries; in fact, most records have one or more duplicates. For instance, the record for my doctoral thesis, titled «Applications of Syndication for the Management of Bibliographic Catalogs», appears under both reference «933534» and reference «933535». The first record available in Teseo appears three times, under references «3», «4», and «5», and hundreds of similar cases could be cited. The duplication may result from multiple factors, of which we can only hypothesize a few: a) each modification of the data in a Teseo record generates a duplicate, as if it were a version control system; b) an unresolved issue causes indefinite duplication when records are updated in the Teseo database; c) some form of data redundancy is implemented to ensure the presence of records through multiple entries. In every scenario, these duplications undermine the purpose of Teseo's markers or permalinks as unique identifiers.
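One way to make such duplicates visible is to group reference numbers by a normalized title. The sketch below is an assumption-laden illustration: `normalizeTitle()` and `groupDuplicateRefs()` are hypothetical helpers, the normalization (lowercasing and collapsing whitespace) is a choice made here, and the third sample entry is made up for contrast.

```php
<?php
// Hypothetical helper: lowercase the title and collapse runs of whitespace,
// so that trivially different copies of one title compare equal.
function normalizeTitle(string $title): string
{
    $lower = function_exists('mb_strtolower')
        ? mb_strtolower($title, 'UTF-8')
        : strtolower($title);
    return trim(preg_replace('/\s+/', ' ', $lower));
}

// Hypothetical helper: map each normalized title to the references that
// carry it, keeping only titles found under more than one reference.
function groupDuplicateRefs(array $titlesByRef): array
{
    $groups = [];
    foreach ($titlesByRef as $ref => $title) {
        $groups[normalizeTitle($title)][] = $ref;
    }
    return array_filter($groups, function (array $refs): bool {
        return count($refs) > 1;
    });
}

$sample = [
    933534 => 'Applications of Syndication for the Management of Bibliographic Catalogs',
    933535 => 'Applications of Syndication for the  Management of bibliographic catalogs',
    42     => 'A made-up title used only for contrast', // not a real Teseo record
];

print_r(groupDuplicateRefs($sample));
```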
Although the method applied to obtain the records from the Teseo database takes all these factors into account, as will be explained below, certain aspects that may have influenced the results must also be considered. I wish to emphasize that the resources available for this data collection effort were very limited: specifically, a portable device with reduced capacity and an Internet connection that may have suffered interruptions during the process. These factors could have affected the results, and I acknowledge that the true figures might differ from those presented. Consequently, I am conducting a second analysis to confirm and, if necessary, correct the results obtained thus far. In any case, my intention is to provide accurate information, and I am open to constructive suggestions and contributions aimed at achieving a database as close as possible to Teseo.
Teseo Automated Data Collection System
Having said this, I wish to share the automated method used for collecting data from the Teseo database. The code in question is shown below.
<meta http-equiv='content-type' content='text/html; charset=UTF-8' />
<?php
$namedb = "teseo";
$con = mysqli_connect ( 'localhost', 'root', 'root', 'teseo' );
if (mysqli_connect_errno ()) {
$con = new mysqli ( "localhost", "root", "root" );
}
mysqli_select_db ( $con, "$namedb" );
$cf_agent = "MBOT webcrawler by Prof. Dr. Manuel Blázquez Ochando";
$cf_header = "Content-Type: text/plain, text/xml, text/html, text/htm, text/jsp, text/json, text/x-json, application/xml, application/xhtml+xml, text/plain, text/php, text/asp, application/jsp, application/json, application/x-httpd-php, application/php, application/asp, text/vcard, text/xvcard";
$cf_buffer = "6291456";
$cf_timecache = "1";
$cf_timeconnect = "1000";
$cf_timeout = "1000";
for($i = 0; $i <= 5000000; $i ++) {
$url1 = "https://www.educacion.gob.es/teseo/mostrarRef.do?ref=$i";
$thread1 = curl_init ();
curl_setopt ( $thread1, CURLOPT_URL, $url1 );
curl_setopt ( $thread1, CURLOPT_USERAGENT, $cf_agent );
// Pass the header line unchanged; wrapping it in literal quotes sends a malformed header
curl_setopt ( $thread1, CURLOPT_HTTPHEADER, array (
$cf_header
) );
curl_setopt ( $thread1, CURLOPT_SSL_VERIFYPEER, false );
curl_setopt ( $thread1, CURLOPT_FAILONERROR, true );
curl_setopt ( $thread1, CURLOPT_FOLLOWLOCATION, true );
curl_setopt ( $thread1, CURLOPT_LOW_SPEED_TIME, 3 );
curl_setopt ( $thread1, CURLOPT_LOW_SPEED_LIMIT, 1048576 );
curl_setopt ( $thread1, CURLOPT_AUTOREFERER, true );
curl_setopt ( $thread1, CURLOPT_RETURNTRANSFER, true );
curl_setopt ( $thread1, CURLOPT_FORBID_REUSE, true );
curl_setopt ( $thread1, CURLOPT_FRESH_CONNECT, true );
curl_setopt ( $thread1, CURLOPT_BUFFERSIZE, $cf_buffer );
curl_setopt ( $thread1, CURLOPT_DNS_CACHE_TIMEOUT, $cf_timecache );
curl_setopt ( $thread1, CURLOPT_CONNECTTIMEOUT_MS, $cf_timeconnect );
curl_setopt ( $thread1, CURLOPT_TIMEOUT_MS, $cf_timeout );
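$html1 = curl_exec ( $thread1 );
curl_close ( $thread1 );
// Skip references that could not be downloaded (timeout, HTTP error, empty body);
// loadHTML() throws on an empty string in PHP 8
if ($html1 === false || $html1 === '') {
continue;
}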
$dom1 = new DOMDocument ();
@$dom1->loadHTML ( $html1 );
$xpath1 = new DOMXPath ( $dom1 );
// Title
@$data00 = $xpath1->query ( "//div[@id='contenido']/div/ul/li" )->item ( 0 )->nodeValue;
$data00 = utf8_decode ( $data00 );
$data00 = preg_replace ( "/(Title:)/", "", $data00 );
$data00 = trim ( $data00, chr ( 0xC2 ) . chr ( 0xA0 ) );
$data00 = mb_strtolower ( $data00, 'UTF-8' );
$data00 = ucfirst ( trim ( $data00 ) );
// Author
@$data01 = $xpath1->query ( "//div[@id='contenido']/div/ul/li" )->item ( 1 )->nodeValue;
$data01 = utf8_decode ( $data01 );
$data01 = preg_replace ( "/(Author:)/", "", $data01 );
$data01 = trim ( $data01, chr ( 0xC2 ) . chr ( 0xA0 ) );
$data01 = mb_strtolower ( $data01, 'UTF-8' );
$data01 = ucwords ( trim ( $data01 ) );
// University
@$data02 = $xpath1->query ( "//div[@id='contenido']/div/ul/li" )->item ( 2 )->nodeValue;
$data02 = utf8_decode ( $data02 );
$data02 = preg_replace ( "/(University:)/", "", $data02 );
$data02 = trim ( $data02, chr ( 0xC2 ) . chr ( 0xA0 ) );
$data02 = mb_strtolower ( $data02, 'UTF-8' );
$data02 = ucfirst ( trim ( $data02 ) );
// Date of defense
@$data03 = $xpath1->query ( "//div[@id='contenido']/div/ul/li" )->item ( 3 )->nodeValue;
$data03 = utf8_decode ( $data03 );
$data03 = preg_replace ( "/(Date of Defense:)/", "", $data03 );
$data03 = trim ( $data03, chr ( 0xC2 ) . chr ( 0xA0 ) );
$array_data03 = explode ( "/", $data03 );
$data03 = $array_data03 [2] . "-" . $array_data03 [1] . "-" . $array_data03 [0];
if (preg_match ( "/department/i", $data03 )) {
@$data03 = $xpath1->query ( "//div[@id='contenido']/div/ul/li" )->item ( 4 )->nodeValue;
$data03 = utf8_decode ( $data03 );
$data03 = preg_replace ( "/(Date of Defense:)/", "", $data03 );
$data03 = trim ( $data03, chr ( 0xC2 ) . chr ( 0xA0 ) );
$array_data03 = explode ( "/", $data03 );
$data03 = $array_data03 [2] . "-" . $array_data03 [1] . "-" . $array_data03 [0];
}
// Committee (1*)
for($a0 = 0; $a0 <= 10; $a0 ++) {
@$st04 = $xpath1->query ( "//div[@id='contenido']/div/ul/li[5]/ul/li" )->item ( $a0 )->nodeValue;
$st04 = utf8_decode ( $st04 );
$st04 = preg_replace ( "/\x{00a0}/", "", $st04 );
$st04 = mb_strtolower ( $st04, 'UTF-8' );
$st04 = ucwords ( trim ( $st04 ) );
if (preg_match ( "/director/i", $st04 )) {
if (strlen ( $st04 ) <= 2) {
} else {
$array4 [] = "$st04";
}
} elseif (preg_match ( "/(chair|member|secretary)/i", $st04 )) {
if (strlen ( $st04 ) <= 2) {
} else {
$array5 [] = "$st04";
}
} else {
if (strlen ( $st04 ) <= 2) {
} else {
$array6 [] = "$st04";
}
}
unset ( $st04 );
}
// Tribunal (1*)
for($a0 = 0; $a0 <= 10; $a0 ++) {
@$st05 = $xpath1->query ( "//div[@id='contenido']/div/ul/li[6]/ul/li" )->item ( $a0 )->nodeValue;
$st05 = utf8_decode ( $st05 );
$st05 = preg_replace ( "/\x{00a0}/", "", $st05 );
$st05 = mb_strtolower ( $st05, 'UTF-8' );
$st05 = ucwords ( trim ( $st05 ) );
if (preg_match ( "/(director)/i", $st05 )) {
if (strlen ( $st05 ) <= 2) {
} else {
$array4 [] = "$st05";
}
} elseif (preg_match ( "/(chair|member|secretary)/i", $st05 )) {
if (strlen ( $st05 ) <= 2) {
} else {
$array5 [] = "$st05";
}
} else {
if (strlen ( $st05 ) <= 2) {
} else {
$array6 [] = "$st05";
}
}
unset ( $st05 );
}
// Keywords (1*)
for($a0 = 0; $a0 <= 10; $a0 ++) {
@$st06 = $xpath1->query ( "//div[@id='contenido']/div/ul/li[7]/ul/li" )->item ( $a0 )->nodeValue;
$st06 = utf8_decode ( $st06 );
$st06 = preg_replace ( "/\x{00a0}/", "", $st06 );
$st06 = mb_strtolower ( $st06, 'UTF-8' );
$st06 = ucfirst ( trim ( $st06 ) );
if (preg_match ( "/(director)/i", $st06 )) {
if (strlen ( $st06 ) <= 2) {
} else {
$array4 [] = "$st06";
}
} elseif (preg_match ( "/(chair|member|secretary)/i", $st06 )) {
if (strlen ( $st06 ) <= 2) {
} else {
$array5 [] = "$st06";
}
} else {
if (strlen ( $st06 ) <= 2) {
} else {
$array6 [] = "$st06";
}
}
unset ( $st06 );
}
// In case keywords are not found, attempt retrieval with an alternative key
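// $array6 is undefined when no keyword was found, and count() on it is a TypeError in PHP 8
$test6 = isset ( $array6 ) ? count ( $array6 ) : 0;
if ($test6 === 0) {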
for($a0 = 0; $a0 <= 10; $a0 ++) {
@$st06 = $xpath1->query ( "//div[@id='contenido']/div/ul/li[8]/ul/li" )->item ( $a0 )->nodeValue;
$st06 = utf8_decode ( $st06 );
$st06 = preg_replace ( "/\x{00a0}/", "", $st06 );
$st06 = mb_strtolower ( $st06, 'UTF-8' );
$st06 = ucfirst ( trim ( $st06 ) );
if (! preg_match ( "/(director|chair|member|secretary|marker)/i", $st06 )) {
if (strlen ( $st06 ) <= 2) {
} else {
$array6 [] = "$st06";
}
}
}
}
// Abstract
@$data07 = $xpath1->query ( "//div[@id='contenido']/div/ul/li[10]" )->item ( 0 )->nodeValue;
$data07 = utf8_decode ( $data07 );
$data07 = preg_replace ( "/(Abstract:)/", "", $data07 );
$data07 = trim ( $data07, chr ( 0xC2 ) . chr ( 0xA0 ) );
$data07 = mb_strtolower ( $data07, 'UTF-8' );
$data07 = ucfirst ( trim ( $data07 ) );
if (preg_match ( "/(marker)/i", $data07 )) {
// The abstract slot held the marker field, so the abstract is one item further down
@$data07 = $xpath1->query ( "//div[@id='contenido']/div/ul/li[11]" )->item ( 0 )->nodeValue;
$data07 = utf8_decode ( $data07 );
$data07 = preg_replace ( "/(Abstract:)/", "", $data07 );
$data07 = trim ( $data07, chr ( 0xC2 ) . chr ( 0xA0 ) );
$data07 = mb_strtolower ( $data07, 'UTF-8' );
$data07 = ucfirst ( trim ( $data07 ) );
}
// Restore the capital letter after each sentence-ending mark
// (a closure replaces create_function(), which was removed in PHP 8.0)
$data07 = preg_replace_callback ( '/[.!?].*?\w/', function ($matches) {
return strtoupper ( $matches [0] );
}, $data07 );
// Join each list with "|"; the ?? fallback covers records where a list stayed empty
$data04 = implode ( "|", $array4 ?? array () );
$data05 = implode ( "|", $array5 ?? array () );
$data06 = implode ( "|", $array6 ?? array () );
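// Escape every value before it is interpolated into SQL;
// an apostrophe in a title would otherwise break both queries
foreach ( array ( 'data00', 'data01', 'data02', 'data03', 'data04', 'data05', 'data06', 'data07' ) as $field ) {
$$field = mysqli_real_escape_string ( $con, ( string ) $$field );
}
// Check for duplication
$results = mysqli_query ( $con, "SELECT COUNT(*) AS nrows FROM catalogoteseo WHERE titulo LIKE '%$data00%';" );
$row = mysqli_fetch_array ( $results );
if ($row ['nrows'] >= 1) {
} else {
$datetime = date ( 'c' );
// Insert record
mysqli_query ( $con, "INSERT INTO catalogoteseo SET core='7', regdate='$datetime', ref='$url1', titulo='$data00', autor='$data01', universidad='$data02', fecha='$data03', director='$data04', tribunal='$data05', materia='$data06', resumen='$data07';" );
}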
unset ( $html1 );
unset ( $data00 );
unset ( $data01 );
unset ( $data02 );
unset ( $data03 );
unset ( $data04 );
unset ( $data05 );
unset ( $data06 );
unset ( $data07 );
unset ( $array4 );
unset ( $array5 );
unset ( $array6 );
unset ( $item4 );
unset ( $item5 );
unset ( $item6 );
}
echo "FIN";
Although the program is simple, it can be hard to follow for those unfamiliar with the PHP programming language, so a detailed description of the crawler's operation follows.
The program retrieves every Teseo web page whose link follows the pattern «https://www.educacion.gob.es/teseo/mostrarRef.do?ref=NNNN», where «NNNN» is a number that runs sequentially up to «5,000,000». Because the reference number in the link identifies the record, iterating over it covers the whole range of records published in the Teseo database. This can be verified in the «for($i=0; $i<=5000000; $i++)» loop that encloses the program; note that the loop starts at «0», so reference 0 is requested as well. The program therefore crawls some 5 million markers from the Teseo database and downloads the HTML source of each one with the following cURL calls:
$url1 = "https://www.educacion.gob.es/teseo/mostrarRef.do?ref=$i";
$thread1 = curl_init();
curl_setopt($thread1, CURLOPT_URL, $url1);
curl_setopt($thread1, CURLOPT_USERAGENT, $cf_agent);
curl_setopt($thread1, CURLOPT_HTTPHEADER, array($cf_header));
curl_setopt($thread1, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($thread1, CURLOPT_FAILONERROR, true);
curl_setopt($thread1, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($thread1, CURLOPT_LOW_SPEED_TIME, 3);
curl_setopt($thread1, CURLOPT_LOW_SPEED_LIMIT, 1048576);
curl_setopt($thread1, CURLOPT_AUTOREFERER, true);
curl_setopt($thread1, CURLOPT_RETURNTRANSFER, true);
curl_setopt($thread1, CURLOPT_FORBID_REUSE, true);
curl_setopt($thread1, CURLOPT_FRESH_CONNECT, true);
curl_setopt($thread1, CURLOPT_BUFFERSIZE, $cf_buffer);
curl_setopt($thread1, CURLOPT_DNS_CACHE_TIMEOUT, $cf_timecache);
curl_setopt($thread1, CURLOPT_CONNECTTIMEOUT_MS, $cf_timeconnect);
curl_setopt($thread1, CURLOPT_TIMEOUT_MS, $cf_timeout);
$html1 = curl_exec($thread1);
curl_close($thread1);
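Because some requests will fail (timeouts, HTTP errors, dropped connections), it can help to record which references could not be downloaded so that they can be retried in a later pass. The following is a minimal sketch under that assumption, not the code used for the published results: `curlFetch()` and `crawlRange()` are hypothetical names, the fetcher is passed in as a callable so the loop can be exercised without network access, and the nullable return types require PHP 7.1 or later.

```php
<?php
// Hypothetical fetcher: returns the page body, or null on any cURL or HTTP failure.
// The timeouts mirror the 1000 ms values used in the listing.
function curlFetch(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER    => true,
        CURLOPT_FOLLOWLOCATION    => true,
        CURLOPT_FAILONERROR       => true, // treat HTTP status >= 400 as a failure
        CURLOPT_CONNECTTIMEOUT_MS => 1000,
        CURLOPT_TIMEOUT_MS        => 1000,
    ]);
    $body = curl_exec($ch);
    return (is_string($body) && $body !== '') ? $body : null;
}

// Hypothetical loop: fetch every reference in [$from, $to], hand each
// downloaded page to $handle, and return the references that failed.
function crawlRange(int $from, int $to, callable $fetch, callable $handle): array
{
    $failed = [];
    for ($ref = $from; $ref <= $to; $ref++) {
        $html = $fetch("https://www.educacion.gob.es/teseo/mostrarRef.do?ref=$ref");
        if ($html === null) {
            $failed[] = $ref; // remember it for a second pass
            continue;
        }
        $handle($ref, $html);
    }
    return $failed;
}
```

A real run would pass 'curlFetch' as the fetcher and a parsing-and-storing callback as the handler; the returned list of failed references is what a confirmation pass would revisit.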
To simplify filtering of the information in the downloaded HTML pages, a DOM (Document Object Model) object is created that maps every node of the page's HTML structure, and an XPath object is built on top of it:
$dom1 = new DOMDocument();
@$dom1->loadHTML($html1);
$xpath1 = new DOMXPath($dom1);
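The DOM and XPath step can be tried in isolation on a small page fragment. In this sketch the HTML is made up to mimic the structure that the XPath expressions in the listing expect (a list inside `div#contenido`); the real Teseo markup may differ, and `extractFieldItems()` is a hypothetical helper.

```php
<?php
// Hypothetical helper: return the trimmed text of each top-level field item.
function extractFieldItems(string $html): array
{
    if (trim($html) === '') {
        return []; // loadHTML() rejects an empty string in PHP 8
    }
    // Collect parser warnings about imperfect HTML instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $items = [];
    foreach ($xpath->query("//div[@id='contenido']/div/ul/li") as $li) {
        $items[] = trim($li->nodeValue);
    }
    return $items;
}

// Made-up fragment, for illustration only.
$sample = <<<HTML
<html><body><div id="contenido"><div><ul>
<li>Title: Sample thesis title</li>
<li>Author: Doe, Jane</li>
</ul></div></div></body></html>
HTML;

print_r(extractFieldItems($sample));
```

Using `libxml_use_internal_errors()` is an alternative to the `@` operator: it silences only the parser's warnings rather than every error raised by the call.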
The next step extracts the elements that hold the metadata: title, author, university, date of defense, supervisor(s), thesis committee members, descriptors, and abstract. This extraction takes up most of the loop body, where XPath queries select the relevant content. Several preparation steps are then applied: character-encoding conversion (utf8_decode function), removal of field labels (preg_replace function), removal of excess spaces and tabs that hinder data processing (trim function), and case normalization (mb_strtolower, ucfirst, and ucwords functions). Beyond this text preparation, the program detects common cases where content appears in the wrong position and moves it into the appropriate variable, using regular-expression tests such as [preg_match("/(chair|member|secretary)/i",…)].
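Two of these preparation steps, label removal and date conversion, can be sketched as small functions. The names `cleanField()` and `toIsoDate()` are hypothetical, and validating the date explicitly with `checkdate()` is a choice made for this illustration rather than what the listing does; the input is assumed to be UTF-8.

```php
<?php
// Hypothetical helper: strip a leading field label and any surrounding
// spaces, including UTF-8 non-breaking spaces.
function cleanField(string $raw, string $label): string
{
    $text = str_replace("\u{00A0}", ' ', $raw); // NBSP -> plain space
    $text = preg_replace('/^\s*' . preg_quote($label, '/') . '\s*/', '', $text);
    return trim($text);
}

// Hypothetical helper: convert "dd/mm/yyyy" to "yyyy-mm-dd".
// Returns null when the text is not a valid date, for example when a
// different field ended up in the date slot.
function toIsoDate(string $text): ?string
{
    if (!preg_match('#^(\d{1,2})/(\d{1,2})/(\d{4})$#', trim($text), $m)) {
        return null;
    }
    [$day, $month, $year] = [(int) $m[1], (int) $m[2], (int) $m[3]];
    if (!checkdate($month, $day, $year)) {
        return null;
    }
    return sprintf('%04d-%02d-%02d', $year, $month, $day);
}

var_dump(cleanField("Title:\u{00A0}Sample thesis title ", 'Title:'));
var_dump(toIsoDate('05/11/2010'));
var_dump(toIsoDate('Department: History'));
```

A null result from `toIsoDate()` signals that the slot did not contain a date, which is the situation the listing handles by looking one item further down the list.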
Once the fields are assembled, the program checks whether the record duplicates one already stored in the target database. This keeps Teseo's duplicated entries out of the target table: a record is inserted only if no stored title contains the new title, so every stored title is unique.
// Check for duplication based on the title field
$results = mysqli_query($con, "SELECT COUNT(*) AS nrows FROM catalogoteseo WHERE titulo LIKE '%$data00%';");
$row = mysqli_fetch_array($results);
if($row['nrows'] >= 1){
// If at least one matching record already exists in the target table, the record is treated as a duplicate
} else {
// Otherwise, insert the record into the table
}
If the doctoral thesis title does not appear in the target table, named «catalogoteseo», the following statement inserts the data retrieved from the thesis record:
// Insert record
mysqli_query($con, "INSERT INTO catalogoteseo SET core='7', regdate='$datetime', ref='$url1', titulo='$data00', autor='$data01', universidad='$data02', fecha='$data03', director='$data04', tribunal='$data05', materia='$data06', resumen='$data07';");
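Values interpolated directly into SQL can break the query when they contain an apostrophe, and a `%` or `_` inside a title acts as a wildcard in `LIKE`. A possible alternative, sketched here with mysqli prepared statements, is shown below. It is not the code used for the collection: `likeContains()`, `titleExists()`, and `insertRecord()` are hypothetical helpers, and the column list is taken from the INSERT statement above.

```php
<?php
// Hypothetical helper: build a LIKE pattern that matches $text literally
// as a substring (backslash is MySQL's default LIKE escape character).
function likeContains(string $text): string
{
    return '%' . strtr($text, ['\\' => '\\\\', '%' => '\\%', '_' => '\\_']) . '%';
}

// Hypothetical helper: true when some stored title already contains $title.
function titleExists(mysqli $con, string $title): bool
{
    $pattern = likeContains($title);
    $stmt = $con->prepare('SELECT COUNT(*) FROM catalogoteseo WHERE titulo LIKE ?');
    $stmt->bind_param('s', $pattern);
    $stmt->execute();
    $stmt->bind_result($count);
    $stmt->fetch();
    $stmt->close();
    return $count > 0;
}

// Hypothetical helper: insert one record; $r maps column names to values.
function insertRecord(mysqli $con, array $r): bool
{
    $columns = ['regdate', 'ref', 'titulo', 'autor', 'universidad',
                'fecha', 'director', 'tribunal', 'materia', 'resumen'];
    $values = [];
    foreach ($columns as $column) {
        $values[] = (string) ($r[$column] ?? '');
    }
    $stmt = $con->prepare(
        "INSERT INTO catalogoteseo (core, " . implode(', ', $columns) . ") " .
        "VALUES ('7', ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)"
    );
    $stmt->bind_param(str_repeat('s', count($values)), ...$values);
    $ok = $stmt->execute();
    $stmt->close();
    return $ok;
}
```

With bound parameters the database driver handles quoting, so titles such as «L'estudi…» are stored unchanged.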
The final step of the program unsets all the variables and arrays that were used, so that no residual data carries over into the next loop iteration. The whole process described above is repeated several million times until the marker «https://www.educacion.gob.es/teseo/mostrarRef.do?ref=5000000» is reached, at which point the program terminates and prints the word «FIN» on screen.
Final considerations
Teseo is a dynamic database that changes daily. It is possible that Teseo does not display all the doctoral theses it actually holds and publishes only the records that contain a minimum of basic data. This would explain the discrepancies between the data obtained and the official figures available at [http://www.mecd.gob.es/educacion-mecd/areas-educacion/universidades/estadisticas-informes/estadisticas/tesis-doctorales.html].
It is also possible that gaps exist in certain chronological ranges within Teseo and that these are not apparent in the public records, even though they may be present in third-party databases or registries; in such cases, the necessary access would not be available, making retrieval or download impossible. Furthermore, since Teseo contains over five million references among its markers, it is also possible that some records fall outside the range analyzed by the crawler. In that case, it would only be necessary to specify a broader range encompassing the remaining records in Teseo. This possibility, although unlikely, is currently under study to confirm or rule it out.
In the event of obtaining new data on the Teseo datasets, I can assure you that I will promptly report them, and updates will be made to the references of the entries on the mblazquez portal [ref1] and [ref2], as well as to the project published on SourceForge [https://sourceforge.net/projects/teseo-database/], to ensure the information is always as accurate as possible.
Acknowledgments
Finally, I wish to thank everyone once again for the support and trust placed in the work I have been developing. I have strived to be as transparent and faithful to reality as possible with the tools and resources available. I take this opportunity to extend a warm greeting to all readers and followers of mblazquez.es.