JAKARTA - The Interagency Implementation and Advanced Concepts Team (IMPACT), an interdisciplinary team at NASA, has collaborated with International Business Machines (IBM) to create INDUS. INDUS, short for Integrated Neural Discourse Understanding System, is a large language model (LLM) designed to analyze scientific data in Earth science, biological and physical sciences, astrophysics, and other fields. The models were trained on curated scientific data from various sources and come in two types: encoders and sentence transformers.
NASA explains that the encoders were trained on 60 billion tokens covering a wide range of data. Encoders convert natural language text into numerical representations that an LLM can process, and INDUS is equipped with a specialized scientific vocabulary for this purpose. With this encoder, INDUS is said to outperform other open LLMs.
According to IMPACT and IBM, INDUS has proven able to process research questions, retrieve relevant documents, and provide answers. Validation tests also show that INDUS can retrieve relevant passages from the science corpus. IBM researcher Bishwaranjan Bhattacharjee said that IBM and NASA's IMPACT have achieved superior performance, in part because the team built both large models and smaller, faster ones.
"For smaller and faster versions, we are using neural architecture tracking to obtain a model architecture and knowledge distillation to train it with larger model surveillance," Bhatacherjee said. Meanwhile, Syltan Costes as NASA's BPS Project Manager for Open Science said that INDUS can assist NASA in developing and testing chatbots. These LLMs will be integrated into the Open Science Data Repository (OSDR) API. "We are looking for ways to improve OSDR's internal curation data system by leveraging INDUS to increase the productivity of our curation team and reduce the manual effort required every day," said Costes. NASA and IBM are committed to publicly presenting INDUS on the Hugging Face, open source machine learning platform. The team developing INDUS will also release benchmark datasets that include entity recognition to support climate change.