JAKARTA - A new study by Apple's artificial intelligence scientists found that systems based on large language models (LLMs), such as those developed by Meta and OpenAI, still lack basic reasoning capabilities.

Apple is proposing a new benchmark called GSM-Symbolic to help measure the reasoning abilities of these models.

Preliminary testing found that small changes to the wording of a question could produce very different answers, which undermines the reliability of the models. The study highlights the "fragility" of the models' mathematical reasoning: adding contextual information that should not affect the computation leads to different results.

In particular, the performance of all models decreased when only the numerical values in a question were changed in the GSM-Symbolic benchmark. The research also shows that the more clauses a question contains, the worse the models perform.
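The idea of changing only the numbers in a question can be sketched as a small template generator. This is a minimal illustration, not the benchmark's actual code: the template text, the names, the number ranges, and the `make_variant` and `accuracy` helpers are all hypothetical.

```python
import random

# Hypothetical template: the wording, names, and ranges are illustrative only.
TEMPLATE = (
    "{name} picks {a} apples on Monday and {b} apples on Tuesday. "
    "How many apples does {name} have?"
)

def make_variant(rng):
    """Fill the template with fresh values and return (question, correct answer)."""
    a = rng.randint(2, 50)
    b = rng.randint(2, 50)
    name = rng.choice(["Sophie", "Liam", "Ava"])
    return TEMPLATE.format(name=name, a=a, b=b), a + b

def accuracy(model, n=100, seed=0):
    """Score a callable `model(question) -> int` on n generated variants."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n):
        question, answer = make_variant(rng)
        if model(question) == answer:
            correct += 1
    return correct / n
```

Because every variant needs the same reasoning steps, a model that truly reasons should score about the same on all of them; a drop in accuracy across variants points to sensitivity to surface details.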

In one example, Apple's team tested a simple math problem whose answer should not be affected by an added piece of information. However, models from OpenAI and Meta mistakenly factored the irrelevant information into their answers, suggesting that the models do not really understand the problem and rely on language patterns instead.
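The effect of an irrelevant clause can be illustrated with a toy example. This is only a sketch under stated assumptions: the question text is made up, and `sum_all_numbers` is a hypothetical stand-in for shallow pattern matching, not how any real model works.

```python
def add_distractor(question, clause):
    """Insert a clause before the final sentence; the correct answer is unchanged."""
    body, sep, final = question.rpartition(". ")
    if not sep:
        return f"{clause} {question}"
    return f"{body}. {clause} {final}"

def sum_all_numbers(question):
    """Naive stand-in for pattern matching: add up every number in the text."""
    return sum(int(tok) for tok in question.split() if tok.isdigit())

# Made-up question; the correct answer is 12 + 8 = 20 in both versions.
base = (
    "Sam picks 12 apples on Monday and 8 apples on Tuesday. "
    "How many apples does Sam have?"
)
noisy = add_distractor(base, "Of these, 3 apples are slightly smaller than average.")

print(sum_all_numbers(base))   # 20, correct
print(sum_all_numbers(noisy))  # 23, wrong: the irrelevant 3 was added in
```

The size of the apples has no bearing on the count, so a solver that understands the problem should return 20 for both versions.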

The study concludes that current LLMs lack critical reasoning capabilities and tend to rely on pattern matching that is easily thrown off by simple changes in wording. Apple plans to introduce its own more sophisticated version of AI, starting with iOS 18.1, to overcome the limitations of current LLMs.

