Building on the research presented at the 8th Hispano-Mexican Seminar on Librarianship and Documentation, this work continues the development of «Resync», an experimental platform for investigating syndication channels, with the aim of implementing several content classification methods for material published by Spanish and Mexican media outlets. The multilingual European thesaurus «Eurovoc» is adopted as the reference classification vocabulary because it is multidisciplinary and heterogeneous enough to classify media content across a wide variety of domains. To make the vocabulary usable, it is transformed into a functional ontology that supports the classification process through three precision-oriented thematic classification algorithms and two general thematic classification algorithms, all developed specifically for this research. Automated evaluation forms are also implemented to collect user assessments and so measure the classification accuracy of each algorithm. These developments are applied to a collection of 400,000 news items and other content published by media outlets through their syndication channels during the one-month operational period of the Resync platform, achieving classification rates ranging from 1.8% to 99%, depending on the algorithm employed. Finally, a comprehensive table gives the quantitative results for all Eurovoc categories and themes, along with the content classified under each by every algorithm used.

Reference

  • BLÁZQUEZ OCHANDO, M. 2012. [Paper]. Development of an automatic classification system for content in Spanish and Mexican media. In: 9th Hispano-Mexican Seminar on Librarianship and Documentation (Mexico, May 7-9).

Abstract

The objective of this research is to develop an automatic classification system for content retrieved through the Resync platform, which specializes in investigating information sources in the media. The work is justified by the lack of automated methods for organizing the information collected via this platform and by the need for in-depth studies of the thematic categories that media outlets cover in each country. To address both issues, the multilingual Eurovoc thesaurus is transformed into a pseudo-ontology, which serves as the classification vocabulary for a document corpus of over 400,000 news articles published by Mexican and Spanish media between June and July 2011. Five automatic classification algorithms, some precise and some generic in query type, are designed and tested by matching this vocabulary against the test collection. The experiment's quantitative results are reported in full and show the percentage of classified content increasing progressively with each algorithm's precision and matching conditions. Finally, the foundations are laid for a qualitative evaluation of the classification performed by the system, with the aim of refining the process described here.
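The contrast between precise and generic query types can be illustrated with a minimal Python sketch. It is not the algorithms developed for this research, whose details are not given on this page. The vocabulary is a hypothetical two-category sample rather than actual Eurovoc data, and the function names and matching rules (whole-phrase match versus single-word match) are assumptions made only for illustration.

```python
import re
import unicodedata

# Hypothetical mini-vocabulary (category -> terms); NOT actual Eurovoc data.
VOCABULARY = {
    "economics": ["monetary policy", "public debt"],
    "environment": ["climate change", "water pollution"],
}


def normalize(text):
    """Lowercase, strip accents, and reduce punctuation to single spaces."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"[^a-z0-9]+", " ", no_accents).strip()


def classify_precise(text, vocabulary):
    """Assumed 'precise' rule: a category applies only if one of its
    terms occurs in the text as a whole phrase."""
    padded = f" {normalize(text)} "
    return sorted(
        category
        for category, terms in vocabulary.items()
        if any(f" {normalize(term)} " in padded for term in terms)
    )


def classify_generic(text, vocabulary):
    """Assumed 'generic' rule: a category applies if any single word
    of any of its terms occurs in the text."""
    words = set(normalize(text).split())
    return sorted(
        category
        for category, terms in vocabulary.items()
        if any(words & set(normalize(term).split()) for term in terms)
    )


def classification_rate(items, classifier, vocabulary):
    """Percentage of items that received at least one category."""
    if not items:
        return 0.0
    classified = sum(1 for item in items if classifier(item, vocabulary))
    return 100.0 * classified / len(items)


# Invented sample headlines, used only to exercise the functions.
SAMPLE_ITEMS = [
    "Central bank reviews monetary policy",
    "New policy on urban transport",
    "Football results from the weekend",
    "Climate change summit opens",
]

if __name__ == "__main__":
    for name, classifier in [("precise", classify_precise),
                             ("generic", classify_generic)]:
        rate = classification_rate(SAMPLE_ITEMS, classifier, VOCABULARY)
        print(f"{name}: {rate:.1f}% of items classified")
```

In this toy setup the looser rule classifies more items than the strict one, at the cost of weaker matches such as a headline that merely contains the word "policy". This is one plausible reason why classification rates can differ widely between algorithms, though the sketch makes no claim about the actual figures reported for Resync.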

Keywords

Automatic classification, ontologies, thesauri, automation, content syndication, media, text normalization, information retrieval, evaluation

Download

Paper. 9o-seminario-hispanomexicano-manuel-blazquez-ochando

Presentation. http://prezi.com/lqc4-k5losi6/desarrollo-de-un-sistema-de-clasificacion-automatica-de-contenidos-en-medios-de-comunicacion-hispano-mexicanos/