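I think, therefore I am

I write down what I feel and think in everyday life.

The evolution of AI... and what does its future hold?

The “2045 problem”: the singularity problem.

By combining big data that the human brain cannot process, AI will likely have a major impact on, or even transform, the economy and society.

What about morality and ethics?

If AI keeps advancing... and becomes a gathering of wisdom... I would love to experience the future of humanity beyond the singularity!!!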


Will AI Systems Run Out of Publicly Available Data on the Internet?

June 12, 2024

FILE - A Microsoft data center on Sept. 5, 2023, in West Des Moines, Iowa. Artificial intelligence systems like ChatGPT could soon run out of what keeps making them smarter. (AP Photo/Charlie Neibergall, File)

A research group says artificial intelligence (AI) companies could run out of publicly available data for their systems in less than eight years.

Training data includes writing and information publicly available on the Internet. AI companies use the internet to “train” AI systems to create human-sounding writing. This “training” is what developers use to create large language models. Currently, many technology companies are developing large language models this way.

The nonprofit research group Epoch AI examines issues relating to AI. It has been following the development of large language models for a few years. In a recent paper, the group said technology companies will exhaust the supply of publicly available training data for AI language models between 2026 and 2032.

The team’s latest paper has been reviewed by experts, or peer reviewed. It is to be presented at the International Conference on Machine Learning in Vienna, Austria, this summer. Epoch AI is linked to the research group Rethink Priorities based in San Francisco, California.

A ‘gold rush’

Researcher Tamay Besiroglu is one of the paper’s writers. He compared the current situation to a “gold rush” in which limited resources are depleted. He said the field of AI might face problems as the current speed of development uses up the current supply of human writing.

As a result, technology companies like OpenAI, the maker of ChatGPT, and Google are seeking to pay for high-quality data. Their goal is to ensure a flow of good material to train their systems. OpenAI has made deals with social media service Reddit and news provider News Corp. to use their material. The researchers consider this a short-term answer.

Over the long term, the group said, there will not be enough new blogs, news stories or social media writing to support the speed of AI development. That could lead companies to seek online data considered private, such as email and phone communications. They also might increasingly use AI-created data, such as chatbot content.

A ‘bottleneck’ in development?

Besiroglu described the issue as a “bottleneck” that can prevent companies from making improvements to their AI models, a process called “scaling up.”

“…Scaling up models has been probably the most important way of expanding their capabilities and improving the quality of their output.”

The Epoch AI group first made their predictions two years ago. That was weeks before the release of ChatGPT. At the time, the group said “high-quality language data” would be exhausted by 2026. Since then, AI researchers have developed new methods that make better use of data and that “overtrain” models on the same data many times. But there are limits to such methods.

While the amount of written information that is fed into AI systems has been growing, so has computing power, Epoch AI said. The parent company of Facebook, Meta Platforms, recently said the latest version of its Llama 3 model was trained on up to 15 trillion word pieces called tokens.
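The arithmetic behind an "exhaustion date" can be sketched roughly as follows. This is only an illustration, not Epoch AI's method: the 15 trillion tokens figure comes from the article, while the total stock of public text and the yearly growth factor are hypothetical placeholders chosen to show the calculation.

```python
import math


def years_until_exhausted(tokens_used: float, stock: float, growth_per_year: float) -> float:
    """Years until a training set that grows by a fixed factor each year reaches the stock."""
    if tokens_used <= 0 or stock <= 0:
        raise ValueError("token counts must be positive")
    if growth_per_year <= 1.0:
        raise ValueError("growth_per_year must be greater than 1")
    if tokens_used >= stock:
        return 0.0  # the stock is already used up
    # Solve tokens_used * growth_per_year ** years == stock for years.
    return math.log(stock / tokens_used) / math.log(growth_per_year)


LLAMA3_TOKENS = 15e12        # "up to 15 trillion" tokens, as reported in the article
HYPOTHETICAL_STOCK = 300e12  # assumption: placeholder for all usable public text
HYPOTHETICAL_GROWTH = 2.0    # assumption: training sets double in size every year

years = years_until_exhausted(LLAMA3_TOKENS, HYPOTHETICAL_STOCK, HYPOTHETICAL_GROWTH)
print(f"Stock reached after about {years:.1f} years under these assumptions")
```

Changing either placeholder moves the answer by years, which is one way to see why the forecast is given as a range rather than a single date.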

But whether a “bottleneck” in development is a concern remains the subject of debate.

Nicolas Papernot teaches computer engineering at the University of Toronto. He was not involved in the Epoch study. He said building more skilled AI systems can come from training them for specialized tasks. Papernot said he is concerned that training AI systems on AI-produced writing could lead to a situation known as “model collapse.”

Permission and quality

Also, internet-based services such as Reddit and the information service Wikipedia are considering how they are being used by AI models. Wikipedia has placed few restrictions on how AI companies use its articles, which are written by volunteers.

But professional writers are worried about their protected materials. Last fall, 17 writers brought a legal action against OpenAI for what they called “systematic theft on a mass scale.” They said ChatGPT was using their materials, which are protected by copyright laws, without permission.

AI developers are concerned about the quality of what they train their systems on. Epoch AI’s study noted that paying millions of humans to write for AI models “is unlikely to be an economical way” to improve performance.

The chief of OpenAI, Sam Altman, told a group at a United Nations event last month that his company has experimented with “generating lots of synthetic data” for training. He said both humans and machines produce high- and low-quality data.

Altman expressed concerns, however, about depending too heavily on synthetic data over other technical methods to improve AI models.

“There’d be something very strange if the best way to train a model was to just generate…synthetic data and feed that back in,” Altman said. “Somehow that seems inefficient.”

Words in This Story

exhaust –v. to completely use up a resource

depleted –adj. when a resource is almost used up

trajectory –n. the direction that something is taking or is predicted to take

synthetic –adj. created by a process that is not natural

scale –n. the level or size of a thing

generate –v. to create something through a process

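Will AI Systems Run Out of Publicly Available Data on the Internet? (translation)

June 12, 2024

FILE - A Microsoft data center in West Des Moines, Iowa, on Sept. 5, 2023. Artificial intelligence systems like ChatGPT may soon run out of the data that makes them smarter. (AP Photo/Charlie Neibergall, File)

According to a research group, artificial intelligence (AI) companies could use up the publicly available data for their systems within eight years.

Training data consists of text and information published on the internet. AI companies use the internet to “train” AI systems so that they produce human-like writing. Developers use this “training” to build large language models. Many technology companies are currently developing large language models in this way.

The nonprofit research group Epoch AI studies issues related to AI. It has followed the development of large language models for the past several years. In a recent paper, the group said technology companies will exhaust the supply of publicly available training data for AI language models between 2026 and 2032.

The team’s latest paper has been peer reviewed, that is, reviewed by experts. It is to be presented this summer at the International Conference on Machine Learning in Vienna, Austria. Epoch AI is affiliated with Rethink Priorities, a research group based in San Francisco, California.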

Note: peer review (from the Cambridge Dictionary, linked below)

https://dictionary.cambridge.org/ja/dictionary/english/peer-review

the process of someone reading, checking, and giving his or her opinion about something that has been written by another scientist or expert working in the same subject area, or a piece of work in which this is done

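A ‘gold rush’

Researcher Tamay Besiroglu is one of the paper’s authors. He compared the current situation to a “gold rush” in which limited resources are depleted. He said the field of AI could run into problems because the current pace of development is using up the existing supply of human writing.

As a result, technology companies such as OpenAI, the maker of ChatGPT, and Google are looking to pay for high-quality data. Their goal is to secure a steady flow of good material for training their systems. OpenAI has signed deals with the social media service Reddit and the news provider News Corp. to use their material. The researchers consider this a short-term answer.

Over the long term, the research group said, there will not be enough new blogs, news articles, or social media posts to sustain the pace of AI development. Companies might then seek out private online data, such as email and phone communications. They might also make increasing use of AI-created data, such as chatbot content.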

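A “bottleneck” in development?

Besiroglu described the issue as a “bottleneck” that can keep companies from improving their AI models through a process called “scaling up.”

Note: scaling up: designing and building equipment at a larger scale based on experience gained at a smaller scale. (From the Sekai Daihyakka Jiten, or World Encyclopedia)

“…Scaling up models has probably been the most important way of expanding their capabilities and improving the quality of their output.”

The Epoch AI group first made its predictions two years ago, a few weeks before ChatGPT was released. At the time, the group said “high-quality language data” would be exhausted by 2026. Since then, AI researchers have developed new methods that make better use of data and that “overtrain” models on the same data many times. But such methods have limits.

The amount of written information fed into AI systems has been growing, and so has computing power, Epoch AI said. Meta Platforms, the parent company of Facebook, recently said the latest version of its Llama 3 model was trained on up to 15 trillion word pieces called tokens.

But whether a “bottleneck” in development is a cause for concern is still open to debate.

Nicolas Papernot teaches computer engineering at the University of Toronto. He was not involved in the Epoch study. He said more skilled AI systems can be built by training them for specialized tasks. Papernot said he is concerned that training AI systems on AI-written text could lead to a situation known as “model collapse.”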

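Permission and quality

Internet services such as Reddit and Wikipedia are also considering how AI models use them. Wikipedia has placed few restrictions on how AI companies use its articles, which are written by volunteers.

But professional writers are worried about their protected materials. Last fall, 17 writers sued OpenAI over what they called “systematic theft on a mass scale.” They said ChatGPT was using their copyrighted material without permission.

AI developers are concerned about the quality of what they train their systems on. Epoch AI’s study noted that paying millions of people to write for AI models “is unlikely to be an economical way” to improve performance.

Sam Altman, the chief of OpenAI, told an audience at a United Nations event last month that his company has experimented with “generating lots of synthetic data” for training. He said both humans and machines produce high- and low-quality data.

Altman expressed concern, however, about relying too heavily on synthetic data over other technical methods of improving AI models.

“There’d be something very strange if the best way to train a model was to just generate…synthetic data and feed that back in,” Altman said. “Somehow that seems inefficient.”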
