The following diagram corresponds to Figure 21 from the doctoral thesis, Applications of Syndication for the Management of Bibliographic Catalogs. It illustrates the operation of a specialized parser program designed to analyze XML files encoded in syndication formats.
General operation of an XML parser
The diagram shows how the program first loads the URL pointing to the .xml file, which holds the code and the information enclosed in tags. Before any data can be extracted, the document type must be identified and validated. This is possible when the first line of the file resembles <?xml version="1.0" encoding="UTF-8"?>, a declaration that identifies the document as XML and states the character encoding used.
Next, the parser identifies the syndication format of the file by looking for the root tag, with its opening and closing elements, that is characteristic of each format: <feed> (Atom), <rdf:RDF> (RSS 1.0), <rss> (RSS 2.0), <opml> (OPML), or <collection> (MARC-XML). It must also read the declared namespaces in order to detect whether other formats or extension modules are in use.
Once the file type, format, and modules have been verified, the parser typically builds a tree map of the structure of the channel and its entries. This map of all tags that contain content is essentially an array of arrays, that is, a nested structure in which each entry is itself an array of fields. One of the most effective techniques for handling such structures in a parser is the DOM (Document Object Model), whose functions treat the information as objects and elements and make it easier to query. The structural map is queried to extract each item, news article, or entry one by one, together with its sections (title, summary, description, date, etc.). For this purpose, query languages designed for retrieving data from XML files are used, namely XPath and XQuery, which can filter and select the content of tags and their attributes and even search for text, in a manner analogous to SQL.
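The identification and extraction steps described above can be sketched in Python with the standard-library xml.etree.ElementTree module. This is a minimal illustration, not the parser from the thesis. The function names, the sample feed, and the namespace URIs in the lookup table are assumptions added for the example (the URIs are the ones commonly published for Atom, RDF, and MARC 21 slim). ElementTree supports only a limited subset of XPath and no XQuery, so the queries here are deliberately simple.

```python
import io
import xml.etree.ElementTree as ET

# Root tags named in the text, mapped to their formats. ElementTree writes
# namespaced tags as "{uri}local"; the URIs below are assumptions.
ROOT_FORMATS = {
    "{http://www.w3.org/2005/Atom}feed": "Atom",
    "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF": "RSS 1.0",
    "rss": "RSS 2.0",
    "opml": "OPML",
    "{http://www.loc.gov/MARC21/slim}collection": "MARC-XML",
}

# Invented sample document, used only to exercise the functions below.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Example catalog</title>
    <item>
      <title>First record</title>
      <description>Summary of the first record</description>
      <dc:creator>A. Author</dc:creator>
    </item>
    <item>
      <title>Second record</title>
      <description>Summary of the second record</description>
    </item>
  </channel>
</rss>"""


def has_xml_declaration(xml_bytes):
    """Check that the document starts with an XML declaration."""
    return xml_bytes.lstrip().startswith(b"<?xml")


def detect_format(xml_bytes):
    """Parse the document and name its format from the root tag."""
    root = ET.fromstring(xml_bytes)
    return ROOT_FORMATS.get(root.tag, "unknown"), root


def declared_namespaces(xml_bytes):
    """Collect prefix -> URI pairs for every namespace declaration."""
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start-ns",))
    return {prefix: uri for _event, (prefix, uri) in events}


def extract_items(root):
    """Query the tree for each RSS 2.0 item and return a list of dicts."""
    items = []
    for item in root.findall("./channel/item"):
        items.append({
            "title": item.findtext("title", default=""),
            "description": item.findtext("description", default=""),
        })
    return items


if __name__ == "__main__":
    print(has_xml_declaration(SAMPLE))
    fmt, root = detect_format(SAMPLE)
    print(fmt, declared_namespaces(SAMPLE))
    for entry in extract_items(root):
        print(entry)
```

The list of dictionaries returned by extract_items plays the role of the "array of arrays" described above: one element per entry, each holding that entry's fields. A fuller parser would need one query path per format, since Atom, RSS 1.0, and MARC-XML place their entries under different, namespaced tags.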
The result of these queries is the extraction of text and values from the file, which are stored in the program’s variables for subsequent use, such as outputting to a web page readable by users or generating SQL instructions for inserting data into the aggregator’s database, among other purposes.
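The two uses mentioned above, output to a web page and insertion into the aggregator's database, might look like the following sketch. It assumes the extracted entries are held as a list of dictionaries with title and description keys. The entries table, the function names, and the in-memory SQLite database standing in for the aggregator's database are all hypothetical. Instead of assembling SQL strings by hand, the sketch uses parameterized queries, which avoids quoting problems when feed text contains apostrophes or other special characters.

```python
import html
import sqlite3


def store_entries(conn, entries):
    """Insert the extracted entries into a hypothetical 'entries' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries (title TEXT, description TEXT)"
    )
    # Named placeholders are filled from each dictionary's keys.
    conn.executemany(
        "INSERT INTO entries (title, description) "
        "VALUES (:title, :description)",
        entries,
    )
    conn.commit()


def render_html(entries):
    """Render the entries as an HTML list, escaping the feed text."""
    rows = [
        "<li><strong>{}</strong>: {}</li>".format(
            html.escape(e["title"]), html.escape(e["description"])
        )
        for e in entries
    ]
    return "<ul>\n" + "\n".join(rows) + "\n</ul>"


if __name__ == "__main__":
    # Invented entries standing in for the parser's extracted values.
    extracted = [
        {"title": "First record", "description": "Summary of the first record"},
        {"title": "Second record", "description": "Summary of the second record"},
    ]
    conn = sqlite3.connect(":memory:")
    store_entries(conn, extracted)
    print(conn.execute("SELECT title FROM entries").fetchall())
    print(render_html(extracted))
```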