JAKARTA - In a recent research paper, Apple reiterated its claim that its artificial intelligence model, Apple Intelligence, was not trained on data illegally scraped from the internet.
In an era when many artificial intelligence (AI) systems collect data on a massive scale from all over the web, Apple insists that it adheres to ethical standards in its model training process.
In 2023, big companies such as OpenAI and Microsoft faced a lawsuit from The New York Times over allegations of copyright infringement related to unlicensed data collection.
In contrast to this general practice, Apple reportedly sought in 2023 to purchase licensing rights from major publishers such as Condé Nast and NBC News to use their work in training large language models (LLMs). Apple reportedly offered millions of dollars, although at the time it was not clear which publishers agreed or refused.
In the recently published research paper, Apple explains that it will not access or retrieve data from publishers who do not give permission.
"We believe in training models using diverse and high-quality data. This includes data we license from publishers, data curated from open-source or publicly available datasets, and information obtained by Applebot, our web crawler," Apple said in its blog.
Apple also emphasizes that it does not use users' personal data or user interactions when training its foundation models. The company takes various steps to filter out and remove personal information, as well as to avoid inappropriate or harmful material.
Most of the paper describes how Applebot works to retrieve relevant, high-quality data from an internet full of "noise" (invalid or spam data). However, Apple also emphasizes its commitment to copyright and ethics by following the robots.txt protocol commonly used by websites.
The robots.txt protocol allows publishers to specify which pages or parts of a site web crawlers are not allowed to access, including crawlers used to train AI models. Apple says it respects these rules and gives publishers fine-grained control over what content Applebot can access, while their pages can still appear in Siri and Spotlight search results.
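As a rough illustration of how the robots.txt rules described above are evaluated, the following minimal sketch uses Python's standard `urllib.robotparser` module. The robots.txt content, the example.com URLs, and the `OtherBot` name are hypothetical and chosen only for illustration; this is not Apple's implementation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one group for Applebot, one default group for everyone else.
ROBOTS_TXT = """\
User-agent: Applebot
Disallow: /drafts/

User-agent: *
Disallow: /admin/
"""


def build_parser(robots_text):
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Applebot is blocked from /drafts/ but may fetch ordinary pages.
print(parser.can_fetch("Applebot", "https://example.com/news/story.html"))
print(parser.can_fetch("Applebot", "https://example.com/drafts/wip.html"))

# A crawler with no group of its own falls under the "*" group.
print(parser.can_fetch("OtherBot", "https://example.com/admin/"))
print(parser.can_fetch("OtherBot", "https://example.com/drafts/wip.html"))
```

Note that a crawler with its own group follows only that group, so in this sketch the `/admin/` rule in the default group does not apply to Applebot.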
On the other hand, many other AI companies, such as OpenAI, claim to follow ethical standards but do not explicitly guarantee compliance with robots.txt. According to market analysis firm TollBit, about 13% of data-collection activity (scraping) by AI companies in the first quarter of 2025 ignored robots.txt rules, up from 3.3% in the last quarter of 2024.
This is likely because so much of the available internet has already been scraped that companies are pushing further. Moreover, in June 2025, a US district court ruled that such data collection for AI training was legal.
Every web crawler, including Applebot, identifies itself when accessing a site. If a site's robots.txt does not mention Applebot, then Applebot follows the rules that apply to Googlebot as a fallback.
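The fallback described above can be sketched in a few lines of Python. The standard `urllib.robotparser` module does not implement this behavior itself, so the helper functions `named_agents` and `applebot_can_fetch` below are hypothetical names introduced for illustration, as are the robots.txt content and URLs. This is a simplified approximation under those assumptions, not Apple's actual logic.

```python
from urllib.robotparser import RobotFileParser


def named_agents(robots_text):
    """Collect the lower-cased User-agent tokens that a robots.txt file names."""
    agents = set()
    for line in robots_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "user-agent":
            agents.add(value.strip().lower())
    return agents


def applebot_can_fetch(robots_text, url):
    """Evaluate a URL for Applebot, using Googlebot's rules as a fallback.

    Assumption: the fallback applies only when the file names Googlebot
    but does not name Applebot.
    """
    agents = named_agents(robots_text)
    if "applebot" not in agents and "googlebot" in agents:
        effective_agent = "Googlebot"
    else:
        effective_agent = "Applebot"
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(effective_agent, url)


# Hypothetical site that only wrote rules for Googlebot.
GOOGLEBOT_ONLY = """\
User-agent: Googlebot
Disallow: /archive/
"""

# Hypothetical site that wrote separate rules for each crawler.
BOTH_NAMED = """\
User-agent: Googlebot
Disallow: /archive/

User-agent: Applebot
Disallow: /members/
"""

print(applebot_can_fetch(GOOGLEBOT_ONLY, "https://example.com/archive/2024.html"))
print(applebot_can_fetch(BOTH_NAMED, "https://example.com/archive/2024.html"))
print(applebot_can_fetch(BOTH_NAMED, "https://example.com/members/home.html"))
```

In the first case the Googlebot rule blocks the archive for Applebot as well; once the site names Applebot explicitly, only the Applebot group is consulted.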
Several large publishers, such as the BBC, have blocked AI crawlers such as those of OpenAI and Common Crawl from their websites. A study of 1,156 news publishers found that 626 of them (about 54%) block data collection by AI.
There is also the case of Perplexity.ai, which Apple is expected to buy and which also claims to be an ethical AI company. However, Perplexity has been accused of still taking data without permission, and its CEO has admitted that its system is not yet perfect.
Overall, to date, Apple has never faced legal accusations of violating ethics or copyright in its AI training. This sets it apart from OpenAI and Microsoft, which have faced lawsuits, and from Perplexity, which has been criticized.
This does not mean that publishers are truly satisfied with large language models being trained on their data. So far, however, Apple seems to be the only company that has consistently conducted its AI training legally and ethically.
The English, Chinese, Japanese, Arabic, and French versions are automatically generated by AI, so there may still be inaccuracies in the translation. Please always refer to the Indonesian version as our main language. (system supported by DigitalSiber.id)