The BSC-CNS (Barcelona Supercomputing Center – Centro Nacional de Supercomputación) and the National Library of Spain (Biblioteca Nacional de España) have developed MarIA, a massive Spanish language model built with deep learning, openly accessible in the following repository. The project has been funded by the Language Technologies Plan of the Ministry of Economic Affairs and Digital Agenda and by the Future Computing Center, an initiative of the BSC-CNS and IBM.
MarIA is a family of Large Language Models (LLMs) based on the Transformer deep learning architecture and closely related to GPT-2. These deep neural networks were trained on 59 terabytes (equivalent to 59,000 gigabytes) of the National Library's web archive which, after cleaning, yielded 201,080,084 documents totalling 570 gigabytes of duplicate-free text (accessible in this dataset).
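As a rough sanity check on these corpus figures, the reduction from raw crawl to clean text can be worked out directly. This is a back-of-the-envelope sketch; the use of decimal units (1 GB = 10^9 bytes) is an assumption, since the article does not specify.

```python
# Arithmetic on the MarIA corpus figures quoted above: 59 TB of raw
# web-archive data reduced to 570 GB of clean, duplicate-free text
# spread over 201,080,084 documents. Decimal units are assumed.

RAW_BYTES = 59e12          # 59 TB of raw National Library web archive
CLEAN_BYTES = 570e9        # 570 GB of clean text after deduplication
NUM_DOCS = 201_080_084     # clean documents remaining after filtering

retained_fraction = CLEAN_BYTES / RAW_BYTES
avg_doc_bytes = CLEAN_BYTES / NUM_DOCS

print(f"Clean text retained: {retained_fraction:.2%} of the raw crawl")
print(f"Average clean document size: {avg_doc_bytes / 1000:.1f} kB")
```

In other words, only about 1% of the raw web archive survives cleaning and deduplication, at an average of roughly 2.8 kB of text per document.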
The generation of this corpus was the first milestone of the MarIA project. The second is the creation and training of the models that constitute MarIA, which has required 184,000 processor hours and more than 18,000 GPU hours. The models released so far have 125 million and 355 million parameters, respectively.
The next steps for the MarIA team are to extend the corpus with CSIC scientific publications and to train models in Portuguese, Galician, Catalan and Basque.
Although there are currently models with more parameters, such as GPT-3 (OpenAI), Megatron-Turing NLG 530B (Microsoft/NVIDIA) and M6 (Alibaba DAMO Academy), they cover English or Mandarin. The importance of MarIA lies in the fact that it is the first major model in Spanish, which opens a door for Spanish speakers in the field of Natural Language Processing (NLP).
You can try MarIA at the following link.
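For readers who want to experiment with the released models programmatically, the sketch below loads one of them with the Hugging Face `transformers` library. The model identifier `PlanTL-GOB-ES/gpt2-base-bne` is an assumption; check the official repository linked above for the exact published names.

```python
# Hypothetical usage sketch: generate Spanish text with a MarIA model.
# Assumes the `transformers` library is installed and that the GPT-2-style
# model is published on the Hugging Face Hub under the identifier
# "PlanTL-GOB-ES/gpt2-base-bne" (an assumption, not confirmed by the article).
from transformers import pipeline

generator = pipeline("text-generation", model="PlanTL-GOB-ES/gpt2-base-bne")

# Complete a Spanish prompt; max_new_tokens bounds the continuation length.
outputs = generator("La inteligencia artificial es", max_new_tokens=20)
print(outputs[0]["generated_text"])
```

Note that the first call downloads the model weights, so it requires a network connection and several hundred megabytes of disk space.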