JAKARTA - Apple has published another study showing how large language models (LLMs) can be used to analyze audio and motion data to understand user activity more accurately. The study opens the door to more advanced multimodal Apple Intelligence features and technology in future versions of iOS.

In a paper entitled 'Using LLMs for Late Multimodal Sensor Fusion for Activity Recognition', Apple researchers explain that LLMs can be combined with traditional sensor data to improve activity recognition accuracy, even when little sensor data is available.

The LLM does not receive raw audio. Instead, it receives short text descriptions generated by audio models, along with activity predictions from a model based on IMU (inertial measurement unit) data, that is, the accelerometer and gyroscope. This helps keep user privacy protected.

LLMs can classify activities in zero-shot and one-shot settings without task-specific training.

Accuracy is well above chance level, even without an activity-specific model.

Providing just one example can significantly improve model performance.

This approach allows multimodal models to be used without adding memory or computational overhead.

The researchers used Ego4D, a large dataset of first-person video footage showing a variety of daily activities.

They curated 20-second samples covering 12 activities:

vacuuming

cooking

washing clothes

eating

playing basketball

playing football

playing with pets

reading a book

using a computer

washing dishes

watching TV

working out/weightlifting

The activities were chosen to cover a range of common household and fitness activities.

The audio and motion data are first processed by smaller models to generate:

audio description text

audio labels

IMU-based activity predictions

All of these outputs are then passed to the LLMs (Gemini 2.5 Pro and Qwen 32B).
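As a rough illustration of this kind of late fusion, the sketch below merges the three text outputs into a single prompt. It is not the prompt Apple used; the `SensorSummary` structure, the `build_fusion_prompt` function, the wording of the prompt, and the example values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SensorSummary:
    """Text-only outputs of the small per-modality models (hypothetical structure)."""
    audio_caption: str = ""
    audio_labels: list[str] = field(default_factory=list)
    # (activity name, confidence score) pairs from the IMU model
    imu_predictions: list[tuple[str, float]] = field(default_factory=list)


def build_fusion_prompt(summary: SensorSummary) -> str:
    """Combine whichever modality outputs are present into one LLM prompt."""
    lines = ["The following was inferred from a 20-second recording."]
    if summary.audio_caption:
        lines.append(f"Audio description: {summary.audio_caption}")
    if summary.audio_labels:
        lines.append("Audio labels: " + ", ".join(summary.audio_labels))
    if summary.imu_predictions:
        # List the most confident IMU prediction first.
        ranked = sorted(summary.imu_predictions, key=lambda p: p[1], reverse=True)
        formatted = ", ".join(f"{name} ({score:.2f})" for name, score in ranked)
        lines.append("IMU activity predictions: " + formatted)
    lines.append("What activity is the person most likely doing?")
    return "\n".join(lines)


# Made-up example values, not taken from the study.
example = SensorSummary(
    audio_caption="Water running and dishes clinking.",
    audio_labels=["water tap", "dishes"],
    imu_predictions=[("cooking", 0.30), ("washing dishes", 0.55)],
)
prompt = build_fusion_prompt(example)
print(prompt)
```

Because the LLM only ever sees text, a missing modality simply means one fewer line in the prompt, which is one way to picture how the approach copes with incomplete sensor data.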

Apple compared two conditions:

Closed-set: the model is given the list of 12 activities to choose from

Open-ended: the model is free to answer anything
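The difference between the two conditions can be pictured as a difference in the instruction appended to the prompt. The sketch below is only an assumption about how that might look: the function names, the instruction wording, and the exact-match scoring are hypothetical, and the activity labels paraphrase the list above.

```python
from typing import Optional

# Labels paraphrased from the article's list of 12 activities.
ACTIVITIES = [
    "vacuuming", "cooking", "washing clothes", "eating",
    "playing basketball", "playing football", "playing with pets",
    "reading a book", "using a computer", "washing dishes",
    "watching TV", "working out",
]


def task_instruction(mode: str, activities: list[str]) -> str:
    """Return the instruction for the closed-set or open-ended condition."""
    if mode == "closed-set":
        options = "; ".join(activities)
        return f"Choose exactly one activity from this list: {options}."
    if mode == "open-ended":
        return "Name the activity in a few words."
    raise ValueError(f"unknown mode: {mode}")


def match_closed_set(answer: str, activities: list[str]) -> Optional[str]:
    """Map an LLM answer to a listed activity, or None if it matches nothing."""
    cleaned = answer.strip().strip(".").lower()
    for activity in activities:
        if activity.lower() == cleaned:
            return activity
    return None


# With 12 equally likely classes, random guessing is right 1 time in 12.
chance_accuracy = 1 / len(ACTIVITIES)

print(task_instruction("closed-set", ACTIVITIES))
print(match_closed_set(" Watching TV. ", ACTIVITIES))
print(f"chance level: {chance_accuracy:.1%}")
```

Closed-set answers can be scored by simple matching like this, while open-ended answers are free text and would need a more lenient comparison, which this sketch does not attempt.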

Various input combinations were tested: audio only, IMU only, both combined, and both with additional context.

The results show that the LLMs consistently produce accurate predictions, even without task-specific training.

Apple concludes that this LLM-based multimodal approach could be a breakthrough in:

activity analysis,

health monitoring,

habit tracking,

safety features, especially when sensor data is incomplete or difficult to interpret.

Apple has also released supporting data, including Ego4D segment IDs, timestamps, prompts, and one-shot examples, so that other researchers can replicate the study.

