Automated Content Classification for Hispanic-Mexican Media

Building on the research presented at the 8th Hispano-Mexican Seminar on Librarianship and Documentation, this work further develops «ReSync», the experimental platform for investigating syndication channels, in order to implement several content classification methods for material published by Spanish and Mexican media outlets. The multilingual European thesaurus «Eurovoc» is adopted as the reference classification vocabulary because it is multidisciplinary and heterogeneous enough to classify content from media outlets across diverse domains. To make the vocabulary usable, it is transformed into a functional ontology that supports the classification process through three precision-based thematic classification algorithms and two general thematic classification algorithms, all developed specifically for this research. Automated evaluation forms are also implemented to collect user assessments and thereby measure the classification accuracy of each algorithm. These developments are applied to a collection of 400,000 content and news items that media outlets published through their syndication channels during a one-month run of the platform, achieving classification rates from 1.8% to 99% depending on the algorithm. Finally, a comprehensive table gives quantitative results for every Eurovoc category and theme, along with the content classified by each algorithm.

Reference

Blázquez-Ochando, M. 2012. [Paper]. Development of a system for automatic classification of content in Spanish and Mexican media. In: 9th Hispano-Mexican Seminar on Librarianship and Documentation (Mexico, May 7-9). http://eprints.rclis.org/19031/

Abstract

The objective of this research is to develop an automatic classification system for content retrieved through the ReSync platform, which specializes in investigating information sources in media outlets. The system is justified by the absence of automated methods for organizing the information collected via this platform, and by the need for in-depth studies of the thematic categories that media outlets cover in each country. To address both issues, the multilingual Eurovoc thesaurus is transformed into a pseudo-ontology, which serves as the classification vocabulary for a document corpus of over 400,000 news articles published between June and July 2011 by Mexican and Spanish media sources. Five automatic classification algorithms, some precise and some generic in query design, are developed and tested by matching this classification vocabulary against the test collection. The full quantitative results of the experiment are reported; they show a progressive increase in the percentage of classified content according to the precision and conditioning of the algorithm. Finally, the foundations are laid for a qualitative evaluation of the system's classification, with the aim of refining the process described here.
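As a rough illustration of the difference between a precise and a generic query design, the following Python sketch matches a small vocabulary against news headlines. It is not the paper's implementation: the vocabulary entries, the theme labels, and the function names `normalize`, `classify_precise`, and `classify_generic` are all hypothetical, and the two matching rules (whole-phrase match versus any-word match) are assumptions chosen only to show the contrast.

```python
import re
import unicodedata

# Hypothetical mini-vocabulary: descriptor -> theme.
# Illustrative entries only; not taken from Eurovoc or from the paper.
VOCABULARY = {
    "política económica": "economics",
    "medio ambiente": "environment",
    "educación": "education",
}


def normalize(text):
    """Lowercase, strip accents, and reduce punctuation to single spaces."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(re.findall(r"[a-z0-9]+", no_accents))


def classify_precise(text):
    """Assumed 'precise' rule: the whole descriptor must appear as a phrase."""
    padded = f" {normalize(text)} "
    return {
        theme
        for term, theme in VOCABULARY.items()
        if f" {normalize(term)} " in padded
    }


def classify_generic(text):
    """Assumed 'generic' rule: any single word of the descriptor may match."""
    words = set(normalize(text).split())
    return {
        theme
        for term, theme in VOCABULARY.items()
        if words & set(normalize(term).split())
    }


headline_a = "El gobierno presenta su nueva política económica"
headline_b = "La crisis económica afecta al medio rural"

print(classify_precise(headline_a), classify_generic(headline_a))
print(classify_precise(headline_b), classify_generic(headline_b))
```

In this toy example the phrase match leaves the second headline unclassified, while the word match assigns it two themes. That contrast shows how the share of classified content can vary widely with query design.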

Keywords

Automatic classification, ontologies, thesauri, automation, content syndication, media outlets, text normalization, information retrieval, evaluation

Download

Paper. paper_9o-sistema-clasificacion-automatica-contenidos-medios-hispano-mexicanos.pdf

Presentation 1. http://prezi.com/lqc4-k5losi6/desarrollo-de-un-sistema-de-clasificacion-automatica-de-contenidos-en-medios-de-comunicacion-hispano-mexicanos/