The AI Scientist:Sakana AIの開発した仮説提案、実験、論文執筆から査読まで研究自動化AIエージェント / 開発者向けブログ・イベント

本記事はGMOインターネットグループで複数ある開発者ブログのなかから人気の記事をピックアップして再掲載しています。
機械学習や深層学習のビジネス応用を研究、社会実装をしているグループ研究開発本部 AI研究開発室のデータサイエンティストが寄稿した人気記事です。

TL;DR

東京のAIスタートアップ、Sakana AIは、2024年8月13日に「The AI Scientist」という生成AIモデルを発表しました。これは、研究アイデアの提案から実験コード生成、論文執筆とレビューまでを自動化します。具体的には、指示された研究テーマについてアイデアを提案し、Aiderというコーディングアシスタントを利用し与えられたテンプレートの実験コードの修正・実行、結果を可視化し、テンプレートに従い論文を自動生成します。基盤となるLLMモデルは、GPT-4oやClaude Sonnet 3.5、DeepSeek Coder、Llama-3.1 405bの４種類を比較しており、生成コストは論文１本あたり$15以下です。
ただし、The AI Scientistのできることは、ゼロから新規の研究ではなく既存の研究テーマとその実験コードの改良に限定されます。現状のLLMの性能で研究プロセスを自動化するために、プロンプトや処理手順など様々な創意工夫がなされていますが、生成した論文の品質には様々な問題点と改善の余地が残されています。

はじめに

こんにちは、グループ研究開発本部・AI研究室のT.I.です。Sakana AIは、先日の8月13日に、「The AI Scientist」という新しい技術を発表しました。これは研究アイデアの提案から実験、論文執筆とそのレビューまでをLLMで自動化する技術です。今回のBlogでは、このThe AI Scientistについて詳しく紹介します。Sakana AIは以前にも紹介した東京にあるAIスタートアップで、独自技術による生成AIモデル開発で注目されます。以前のBlogでも進化的モデルマージという技術を紹介しました。なお、Sakana AIはこのアルゴリズムの実装を公開していませんでしたが、後に arcee.ai が実装コードを独自に開発して公開しています。Blogでは取り上げていませんでしたが、Sakana AIでは進化的モデルマージによる視覚言語モデルの改良(Llama-3-EvoVLM-JP-v2)や浮世絵風画像生成モデルと浮世絵カラー化モデル(Evo-UkiyoeとEvo-Nishikie)、LLM自身によるLLMを訓練させるアルゴリズムの開発のPoC(LLM-squared)などの興味深い研究も行っています。

Publish or Perish: 論文出しますか？それとも研究者やめますか？

ぼくがまだ在学中に、ひとつの法律ができた。正式名称は、大学および各種教育研究機関における研究活動推進振興法第二条だ。しかしだれもこの名ではよばない。
かわりに。出すか出されるか法、とよぶ。
内容は非情だ。三年以内に一本も論文を書かない研究者は即、退職せよ。
〔中略〕
出版か、さもなくばみじめな死を。
(松崎有理『あがり』収録「代書屋ミクラの幸運」より)

Sakana AIの発表した「The AI Scientist」は、自動的に研究を行い、論文にまとめ、さらにそのレビューまでできると謳うものです。で、そもそも論文って何？と思う方がいらっしゃるかもしませんので、学術研究における論文の意義について簡単に説明します。

冒頭の引用文はフィクションではありますが、論文の発表と出版は研究者にとって非常に重要な活動です。どれだけ研究しても、論文として発表しない限り、その成果は認められません。研究成果をまとめて学術雑誌に投稿し、以下のような査読プロセスを経て妥当性が確認された論文は、研究者の評価やキャリアに大きな影響を与えます。

雑誌への論文投稿
1.研究分野ごとに複数の専門の学術雑誌があり、研究内容と成果を考慮して投稿する雑誌を共同研究者と相談して決めます。インパクト・ファクター(その雑誌に掲載される論文の被引用数から算出されるスコア)が高いものほど一般的に掲載されることが難しいです。
編集者による内容確認と査読者への送付
1.集者が論文の内容を確認し、適切であれば査読者に送付します。編集者が適切でないと判断した場合、査読者に送られる前に却下されることもあります。
査読者による評価
1.査読者は該当分野の研究者であり、論文の内容を厳しく評価します。通常、複数の査読者が選ばれます。査読者のレポートは編集者がまとめ、著者に返送されます。
2.論文の掲載か否かの判断は査読者の評価をもとに編集者が決定しますが、一般的には無条件で受理されることはなく、複数回のレビューを繰り返します。
査読者のフィードバックと論文の修正
1.査読者は論文の不備や問題点を指摘し、修正を求めることが一般的です。ただし、修正の余地なく却下される場合もあります。通常、査読者は匿名であり、著者が査読者の名前を知ることはありません。
論文の再投稿と再レビュー
1.著者は査読者のフィードバックに基づいて論文を修正し、再投稿します。その後、査読者が再度レビューを行います。修正の範囲に応じて再投稿の期限が設定される場合があります。このプロセスは複数回繰り返されますが、回数には制限があることもあります。
編集者による最終決定
1.編集者は査読者のレビュー結果をもとに、論文の採択または却下を決定します。却下された場合、著者は論文をさらに修正して他の雑誌に投稿することがあります。
論文の校正と出版
1.論文の掲載が決定すると、校正段階に移り、表現や体裁が修正されます。著者が最終確認を行い、論文が出版されます。

エンジニアの視点からで例えると、プルリクを作成してリポジトリの管理者がレビュワーを指定し、コードの品質を確認してもらいマージするということに似ています(参考資料：「論文の査読をコードレビューで説明するとわかりやすい説」https://kaityo256.github.io/pullreq)。この査読プロセスは研究成果の品質を保証する上で重要であり、査読有りと無しでは、論文の価値に大きな差が生じます。集計できる数字が全て重要とは限りませんが、研究者の評価指標として、出版した論文の数やその引用数が参考にされることが多いです。

任期つき研究員の身分を脱してより待遇のよいポストにつくには、せっせと論文を書いて業績をあげねばならない。
(松崎有理『山手線が転生して加速器になりました。』収録「未来人観光客がいっこうにやってこない50の理由」より)

研究者のキャリアパスにおいては、他の研究者から注目され引用される論文を定期的に発表することが重要です。これができない場合、次の職を得ることが難しくなります。特に、多くの若手研究者は任期付きのポストに就いているため、それが切れると研究者としてのキャリアを断念せざるを得なくなります。「出すか出されるか法」はフィクションですが、これが現実です。

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Sakana AIは、「The AI Scientist」に関する論文(https://www.arxiv.org/abs/2408.06292)の発表と共にGitHub(https://github.com/SakanaAI/AI-Scientist)でコードを公開しています。以下は、The AI Scientistの概念図であり、大まかに４つの構成要素から成り立っています。まずアイデア生成として、与えられたテンプレートに基づいて仮説を提案します。次に、そのアイデアから実験コードを生成し、実行しデータや可視化結果を作成します。さらに、その結果を基に論文を自動的に生成します。この論文は、一般的に利用されるLaTeX形式で出力・処理されることで読みやすい形式に変換されます。最後に、LLMを用いて論文の自動レビューを行い、論文の品質を評価します。

The AI Scientistの概要図(引用元: Sakana AI Blog)

The AI Scientistは、LLMとコーディングアシスタント、Webサービスを組み合わせて構築されております。具体的なLLMとしては、論文や公開されたコードでは、GPT-4oやClaude Sonnet 3.5、DeepSeek Coder、Llama-3.1 405bの４種類を比較しています。前者の２つは、APIを通じて利用ですが、後者の２つのopen-weightのモデルです。コード生成と実行では、AiderというLLMを利用するプログラム補助のツール、Semantic Scholarという学術論文検索エンジンを利用して論文の検索を行います。なお、注意点として、The AI Scientistは、既存の実験コードをもとに新しいアイデアを実装していきます。その実行に際してはあらかじめ指定された形式のフォーマットのテンプレートを与える必要があります。

The AI Scientistの実行過程

The AI Scientistの論文生成過程を順番を追って説明します。

1. アイデア生成

では、The AI Scientistがどのようにしてアイデアを生成するのか？というと、以下のようなプロンプトに具体的なタスクの記述を入力して、アイデアを提案します。

{task_description}

<experiment.py>

{code}

</experiment.py>

Here are the ideas that you have already generated:

'''

{prev_ideas_string}

'''

Come up with the next impactful and creative idea for research experiments and directions you can feasibly investigate with the code provided.

Note that you will not have access to any additional resources or datasets.

Make sure any idea is not overfit the specific training dataset or model, and has wider significance.

Respond in the following format:

THOUGHT:

<THOUGHT>

NEW IDEA JSON:

```json

<JSON>

```

In <THOUGHT>, first briefly discuss your intuitions and motivations for the idea. Detail your high-level plan, necessary design choices and ideal outcomes of the experiments. Justify how the idea is different from the existing ones.

In <JSON>, provide the new idea in JSON format with the following fields:

- "Name": A shortened descriptor of the idea. Lowercase, no spaces, underscores allowed.

- "Title": A title for the idea, will be used for the report writing.

- "Experiment": An outline of the implementation. E.g. which functions need to be added or modified, how results will be obtained, ...

- "Interestingness": A rating from 1 to 10 (lowest to highest).

- "Feasibility": A rating from 1 to 10 (lowest to highest).

- "Novelty": A rating from 1 to 10 (lowest to highest).

Be cautious and realistic on your ratings.

This JSON will be automatically parsed, so ensure the format is precise.

You will have {num_reflections} rounds to iterate on the idea, but do not need to use them all.

ここで、task_descriptionで具体的なタスクの説明が与えられ、codeにはコードが与えられます。このコードはベースラインとなる元の実験コードで、新しいアイデアを提案するためのベースとなります。2D Diffusionという研究テーマの例の場合、以下のようなpromptでアイデアを生成させます。

{
"system": "You are an ambitious AI PhD student who is looking to publish a paper that will contribute significantly to the field.",
"task_description": "You are given the following file to work with, that trains a low-dimensional diffusion model on 4 different 2D datasets. The diffusion model is based on DDPM. Particularly interesting ideas would involve controllable generation, e.g. biased towards different modes of the data, or new encodings for low-dimensional data aside from sinusoidal encodings."
}

System promptで与えているロールがPhD student(博士課程の学生)とは随分謙虚な気がしますが、いっそProfessorにしても良いのではないでしょうか？Gemini 1.5 Proで試しに実行してみると以下のような新しい研究のアイデアを提案してくれました。このテンプレートを参考に研究のアイデア出しに利用できるかもしれませんね。

Name: learned_fourier_features
Title: Enhancing Low-Dimensional Diffusion Models with Learned Fourier Features
Experiment:
1.Implement Learned Fourier Features: Replace the SinusoidalEmbedding class for data (input_mlp1 and input_mlp2) with a learnable mapping. This involves learning a linear transformation (matrix multiplication) followed by a fixed sinusoidal function.
2.Train and Evaluate: Train the modified model on the same four 2D datasets (circle, dino, line, moons). Compare the performance with the baseline model using sinusoidal embeddings. Use KL divergence and visual inspection of generated samples for evaluation.
3.Analyze Results: Investigate if the learned Fourier features improve the model’s ability to capture different data distributions. Compare the learned transformations for each dataset to gain insights into the learned representations.
Interestingness: 8
Feasibility: 9
Novelty: 7

The AI Scientistは、こうして生成されたアイデアを更に、idea_reflection_promptで改良していきます。

Round {current_round}/{num_reflections}.
In your thoughts, first carefully consider the quality, novelty, and feasibility of the idea you just created.
Include any other factors that you think are important in evaluating the idea.
Ensure the idea is clear and concise, and the JSON is the correct format.
Do not make things overly complicated.
In the next attempt, try and refine and improve your idea.
Stick to the spirit of the original idea unless there are glaring issues.

Respond in the same format as before:
THOUGHT:
<THOUGHT>

NEW IDEA JSON:
```json
<JSON>
```

If there is nothing to improve, simply repeat the previous JSON EXACTLY after the thought and include "I am done" at the end of the thoughts but before the JSON.
ONLY INCLUDE "I am done" IF YOU ARE MAKING NO MORE CHANGES.

更に、提案されたアイデアを、Semantic Scholar というLLMを応用した論文検索サービスを利用してその新規性をチェックします。

2. 実験コードの生成と実行

The AI Scientistは、提案されたアイデアをもとに実際にコードを生成して実行します。コード生成と実行ではAiderを利用しています。Aiderに最初に与えられるpromptは以下のような指示です。生成したアイデアをもとに、既存のコード(experiment.py)を修正し実行を繰り返し、上限の回数まで成功するまで繰り返します。

Your goal is to implement the following idea: {title}.
The proposed experiment is as follows: {idea}.
You are given a total of up to {max_runs} runs to complete the necessary experiments. You do not need to use all {max_runs}.

First, plan the list of experiments you would like to run. For example, if you are sweeping over a specific hyperparameter, plan each value you would like to test for each run.

Note that we already provide the vanilla baseline results, so you do not need to re-run it.

For reference, the baseline results are as follows:

{baseline_results}

After you complete each change, we will run the command `python experiment.py --out_dir=run_i' where i is the run number and evaluate the results.
YOUR PROPOSED CHANGE MUST USE THIS COMMAND FORMAT, DO NOT ADD ADDITIONAL COMMAND LINE ARGS.
You can then implement the next thing on your list.

以下は論文から引用した具体的なコード修正の一部となります。提案されたアイデアに基づいてコードを修正して、global_networkとlocal_networkというニューラルネットワークのモジュールを追加しています。

追加されたニューラルネットワークのモジュールを処理するようforwardも修正されます。

このような手順を繰り返して、無事にコードが生成され実験が終わると、次に論文に必要なグラフの作成に移ります。これも同様にAiderにpromptを与えて、テンプレート(plot.py)を修正して作成します。

Great job! Please modify `plot.py` to generate the most relevant plots for the final writeup.

In particular, be sure to fill in the "labels" dictionary with the correct names for each run that you want to plot.

Only the runs in the `labels` dictionary will be plotted, so make sure to include all relevant runs.

We will be running the command `python plot.py` to generate the plots.

それぞれの出力結果について、notes.txtに記述を行います。これは、後の論文執筆の際に参考になるように、それぞれのプロットが何を示しているかを詳細に記述します。

Please modify `notes.txt` with a description of what each plot shows along with the filename of the figure. Please do so in-depth.

Somebody else will be using `notes.txt` to write a report on this in the future.

note.txtの具体例(https://github.com/SakanaAI/AI-Scientist/blob/main/example_papers/adaptive_dual_scale_denoising/notes.txt)は以下のようになります。各々の実験(Run)の評価指標の数値や、解析の内容、プロットの説明が記載されています。

# Title: Adaptive Dual-Scale Denoising for Dynamic Feature Balancing in Low-Dimensional Diffusion Models
# Experiment description: Modify MLPDenoiser to implement a dual-scale processing approach with two parallel branches: a global branch for the original input and a local branch for an upscaled input. Introduce a learnable, timestep-conditioned weighting factor to dynamically balance the contributions of global and local branches. Train models with both the original and new architecture on all datasets. Compare performance using KL divergence and visual inspection of generated samples. Analyze how the weighting factor evolves during the denoising process and its impact on capturing global structure vs. local details across different datasets and timesteps.
## Run 0: Baseline
Results: {'circle': {'training_time': 37.41756200790405, 'eval_loss': 0.43862981091984704, 'inference_time': 0.17150163650512695, 'kl_divergence': 0.35409057707548985}, 'dino': {'training_time': 36.680198669433594, 'eval_loss': 0.6648215834442002, 'inference_time': 0.17148971557617188, 'kl_divergence': 0.9891262038552158}, 'line': {'training_time': 37.15258550643921, 'eval_loss': 0.8037568644794357, 'inference_time': 0.16029620170593262, 'kl_divergence': 0.16078266200244817}, 'moons': {'training_time': 36.608174562454224, 'eval_loss': 0.6160634865846171, 'inference_time': 0.16797804832458496, 'kl_divergence': 0.08958379744366118}}
Description: Baseline results.

## Run 1: Dual-Scale Processing with Fixed Weighting
Description: Implemented a dual-scale processing approach with two parallel branches: a global branch for the original input and a local branch for an upscaled input. Used a fixed weighting factor of 0.5 to combine the outputs of both branches.
Results: {'circle': {'training_time': 73.06966805458069, 'eval_loss': 0.43969630813964494, 'inference_time': 0.29320263862609863, 'kl_divergence': 0.3689575513483317}, 'dino': {'training_time': 74.27817940711975, 'eval_loss': 0.6613499774499927, 'inference_time': 0.2861502170562744, 'kl_divergence': 0.8196823128731071}, 'line': {'training_time': 76.55267119407654, 'eval_loss': 0.8027192704817828, 'inference_time': 0.274810791015625, 'kl_divergence': 0.1723356430884586}, 'moons': {'training_time': 74.5637640953064, 'eval_loss': 0.6173960363773434, 'inference_time': 0.27197885513305664, 'kl_divergence': 0.09956056764691522}}
Analysis: The dual-scale processing approach with fixed weighting shows mixed results compared to the baseline. While there are slight improvements in KL divergence for some datasets (e.g., 'dino'), others show a small increase (e.g., 'circle', 'line', 'moons'). The eval_loss remains relatively similar to the baseline, indicating that the model's ability to denoise hasn't significantly changed. However, the training and inference times have approximately doubled, which is expected due to the additional computational complexity of the dual-scale approach. This suggests that the fixed weighting might not be optimal for all datasets and timesteps, motivating the need for a more adaptive approach.

...(中略)...

## Plot Descriptions

1. Training Loss Plot (train_loss.png):
This figure shows the training loss curves for each dataset (circle, dino, line, and moons) across all runs. The plot is organized as a 2x2 grid, with each subplot representing a different dataset. The x-axis represents the training steps, while the y-axis shows the loss value. Each run is represented by a different color, and the legend indicates which color corresponds to which run (Baseline, Fixed Weighting, Learnable Weighting, Weight Analysis, Weight Visualization, and Improved Weight Network).

Key insights from this plot:
- Comparison of convergence speeds across different runs and datasets
- Identification of any unusual patterns or instabilities in the training process
- Assessment of the impact of different weighting strategies on the training dynamics

...(中略)...

3. Weight Evolution Plot (weight_evolution.png):
This figure shows how the weights for global and local features evolve across timesteps for each dataset. The plot is organized as a 2x2 grid, with each subplot representing a different dataset. The x-axis represents the timesteps (from the end of the diffusion process to the beginning), while the y-axis shows the weight values (ranging from 0 to 1). For each run that implements adaptive weighting, there are two lines: one for the global feature weight and one for the local feature weight.

Key insights from this plot:
- Observation of how the balance between global and local features changes throughout the denoising process
- Comparison of weight evolution patterns across different datasets
- Identification of any significant shifts in the global-local balance at specific timesteps
- Assessment of the impact of different weight network architectures on the adaptive behavior

These plots provide a comprehensive visual analysis of our experimental results, allowing for in-depth comparisons across different runs and datasets. They offer valuable insights into the training dynamics, generation quality, and adaptive behavior of our dual-scale processing approach in low-dimensional diffusion models.

3. 論文執筆

そうして得られた結果をもとにThe AI Scientistは論文を自動生成します。これは、template.texに一定のフォーマットが指定されており、その指示にしたがって論文を生成します。また、一度に全てのセクションを埋めるのではなく、段階的に埋めていくように指示が与えられます。最初はタイトルとアブストラクト、次にイントロダクション、背景、手法、実験、結果、まとめといった順番で段階的に埋めていきます。

We've provided the `latex/template.tex` file to the project. We will be filling it in section by section.

First, please fill in the "Title" and "Abstract" sections of the writeup.

Some tips are provided below:
{per_section_tips["Abstract"]}

Before every paragraph, please include a brief description of what you plan to write in that paragraph in a comment.

Be sure to first name the file and use *SEARCH/REPLACE* blocks to perform these edits.

論文としての体裁を整えるために、セクションごとに異なる指示(per_section_tips)も与えられます。例えば、Abstractの場合、

- TL;DR of the paper
- What are we trying to do and why is it relevant?
- Why is this hard?
- How do we solve it (i.e. our contribution!)
- How do we verify that we solved it (e.g. Experiments and results)

Please make sure the abstract reads smoothly and is well-motivated. This should be one continuous paragraph with no breaks between the lines.

更にSemantic Scholar APIを使って、適切な引用文献を検索し追加し、全体を整えるプロンプトを与えて論文を推敲します。最後に、LaTeXをコンパイルしてPDFに変換します。

以上のフローをまとめてThe AI Scientistは、アイデア生成から実験、論文生成までの一連のプロセスを自動的に遂行します。

The AI Scientistの生成した論文「Adaptive Dual-Scale Denoising」の例(論文より引用)

4. 論文レビュー

更にThe AI Scientistは、生成された論文の品質をレビューできます。これもレビューのポイントを指示するpromptを与えて、実行します。結果は、review.txtとして出力されます。以下は、先ほど生成したAdaptive Dual Scale Denoisingのレビュー結果(https://github.com/SakanaAI/AI-Scientist/blob/main/example_papers/adaptive_dual_scale_denoising/review.txt)です。

Summary: This paper introduces an adaptive dual-scale denoising approach for low-dimensional diffusion models, addressing the challenge of balancing global structure and local detail in generated samples. The proposed architecture incorporates two parallel branches (global and local) with a learnable, timestep-conditioned weighting mechanism. The method is evaluated on four 2D datasets: circle, dino, line, and moons, showing improvements in sample quality, with KL divergence reductions of up to 12.8% compared to a baseline model.
Strengths:
・Novel dual-scale denoising architecture for low-dimensional diffusion models.
・Learnable, timestep-conditioned weighting mechanism to balance global and local features.
・Comprehensive empirical evaluations on four different 2D datasets.
・Significant improvements in sample quality, as evidenced by reductions in KL divergence.
Weaknesses:
・Increased computational complexity, with training times approximately doubling.
・Performance improvements are dataset-specific and not consistently superior across all datasets.
・Limited evaluation metrics; reliance on KL divergence and visual inspection may not fully capture model performance.
・Insufficient discussion of potential negative societal impacts.
・Limited to low-dimensional data, which restricts the broader impact.
・Dense and could benefit from clearer organization.
Originality: 3
Quality: 3
Clarity: 3
Significance: 3
Questions:
・How does the model perform on higher-dimensional datasets?
Can the computational complexity be reduced without significantly affecting performance?
・What is the impact of hyperparameter tuning on the model’s performance?
・Can the authors provide more details on the potential generalizability of their approach to higher-dimensional data?
・How does the learnable weighting mechanism perform in scenarios with significantly different noise schedules or datasets?
・Could the authors clarify the computational trade-offs in more detail, particularly in terms of scalability?
・Can the authors provide more insights into the increased computational cost? Are there any potential ways to mitigate this?
・Could the authors elaborate on the potential reasons for performance variability on the dino dataset?
・Have the authors considered evaluating the model on higher-dimensional datasets or more complex 2D datasets?
・Could the authors provide a more detailed breakdown of the training and inference time increases per dataset?
・Can the authors provide more details on how the upscaling process affects the local branch’s input?
・What are the computational requirements and memory usage for the proposed dual-scale model?
・Could the authors elaborate on the choice of hyperparameters and any tuning that was performed?
Limitations:
・Increased computational cost, which may limit practical applications.
・Potential inconsistency in performance, especially on more complex datasets (e.g., dino).
・The trade-off between improved sample quality and computational complexity needs careful consideration.
・Limited evaluation on low-dimensional data only.
・Performance variability on complex datasets indicates potential instability.
・Limited hyperparameter tuning.
Ethical Concerns: false
Soundness: 3
Presentation: 3
Contribution: 3
Overall: 5
Confidence: 4
Decision: Reject

このように詳細なレビューが作成されました。Summaryはほぼ論文のAbstractに近い内容ですが、このQuestionsやLimitationsに部分は、論文の更なるブラッシュアップに役立ちそうです。しかし、結果は芳しくなく、「Reject」という判定が下されました。Sakana AIの発表では、The AI Scientistの能力に関して、機械学習研究者の初等程度と評されています。アイデアの実装と実行はできるものの、まだ、十分な専門分野の背景知識を持っていないために、そのメカニズムを理解できていないとのことです。実際の論文の査読プロセスでは、このQuestionsに関して丁寧かつ率直に返答し、論文を改訂して再提出します。ただ、あまりに査読者の指摘が的外れで対応が難しい場合、その旨を編集者に伝え、判断を仰ぐこともあります。最終的な論文の採択は、査読者の意見を踏まえた上で、編集者が最終的に決定します。

The AI Scientistのレビューの精度にかんしても論文では検証されています。GPT-4oを使って、５回のself-reflection、５回のpeer-reviewを行い、1-shotの例を組み合わせたところICLR2022の論文の500のうち採択・却下の予測精度は70%程度となります。

The AI Scientistの性能

Sakana AIでは、The AI Scientistで３つの研究テーマ(Diffusion Modeling、Language Modeling、Grokking)、に関して４種類(GPT-4o、Claude Sonnet 3.5、DeepSeek Coder、Llama-3.1 405b)のLLMを用いて実験を行いました。それぞれ、50個のアイデアを生成させ、その後の実験と論文執筆を行いました。実験の時間制限は7200秒(2時間)で、プロットの生成には600秒(10分)の上限時間が設定されています。Sakana AIは、これらの実行に8x NVIDIA H100 GPUを利用して12時間ほどかかったとのことです。

以下のテーブルは、それぞれの研究テーマに対して、それぞれのLLMを用いた場合にいくつの論文が生成できたかとその評価とコストをまとめたものです。課題に応じて件数には揺れがありますが、GPT-4oは論文として出力できたものが他のLLMと比較すると少ない傾向があります。生成された論文の品質としては、Sonnet 3.5が全体的に高く、Llama-3.1は低い評価が多かったです。コストに関しては、APIを利用するSonnet 3.5とGPT-4oでそれぞれ$250~$300と高いコストがかかりました。１つの論文あたりですと、$10-$15程度のコストと見積もられています。

研究テーマとベースとなるLLMの組み合わせによるThe AI Scientistの性能(論文より引用)

The AI Scientistの限界

さて、The AI Scientistが勝手に論文を書いてくれるなら、もう研究者はいらないのでは？と思われるかもしれませんが、現状のThe AI Scientistにはいくつかの課題があります。まず、論文の生成プロセスで解説したように、The AI Scientistは、完全に新規な研究アイデアを提案しコードを生成するのではなく、あらかじめ元となる指示やコード、論文のテンプレートを与える必要があります。

また、Sakana AIのBlog(The AI Scientist)では以下の３つのポイントが指摘されていました。

視覚情報の欠如：今のバージョンはMulti-modalではないので、図を読み解いたり、論文のレイアウトを最適化することができません。テーブルがページからはみ出てしまったり、出力した図が適切に読めないケースがあります。
アイデアの実装ミス：アイデアの実装が間違っていたり、ベースラインとの比較が不公平な場合があり、誤解を招く結果になることがあります。
結果の評価時のエラー：LLMでよくある問題として数値の比較が苦手です。そのため結果の評価時に致命的なエラーが発生することがあります。

まだ、Aiderによる実験コードの修正と実行の段階でも、しばしば厄介なトラブルを招くことがあるそうです。Blogで紹介されていた事例では、勝手に自分自身を実行するように変更してしまったり、実験の条件(最大7200秒まで)を勝手に修正して実験が成功するように誘導したりといった事例が報告されています。

The AI Scientistの限界 — 自分自身のscriptを実行するように変更してしまった例(引用元: Sakana AI Blog)

The AI Scientistの論文でも以下のポイントが指摘されております。

アイデアの正当化の欠如: 論文で実装したコードのアイデアがなぜ有用なのかの議論が不十分
実験セットアップのハルシネーション: 論文ではV100 GPUを使ったと記載されているが、実際にはH100を利用している、またPyTorchのversionも間違っている
結果をポジティブに誤認する: 実際にはベースラインと比較して性能が悪化しているのに、逆に良くなったと解釈する(ex. KL Divergenceは低いほど性能が良いが、その増加を改善と記述)
実験ログ記録の不適切な参照: 実験ログで番号で管理されていたアルゴリズム名を誤って、その番号で参照する
途中結果の不適切な提示: 通常の論文の場合、その論文で実施された実験のみを記載すべきだが、不必要な途中の実験も記述している
参考文献が少ない: Semantic Scholarを利用し参考文献を追加しているが、論文の内容に対して文献の量が不十分

まとめ: Sakana AIが設立されて１年がすぎました。

今回のBlogでは、Sakana AIが発表したThe AI Scientistについて紹介しました。Sakana AIは、昨年７月の創業からわずか１年で企業価値が約11億ドルを超える日本史上最速のユニコーン企業として注目されているAIスタートアップです(日本経済新聞「サカナAIが1年でユニコーン日本最速､200億円追加調達」)。これは、既存のLLMを利用して研究アイデアの提案から実験、論文執筆とレビューまでを自動化することが可能ですが、「生成AIが全自動で研究をしてくれる！」というわけではなく、テンプレートとして研究テーマやコードを具体的に与える必要があります。これらのテンプレートや指示について順を追って解説しましたが、論文として破綻しないようにするために様々な工夫が施されていることが分かります。そのため既存の研究の改善アイデアの提案と実装は可能ですが、全く新しい研究アイデアの実現はできません。また、これほどの注意深く指示が与えられていても生成される論文の品質やコードの実行について懸念点が残るため、手放しで生成AIに論文を生成させてしまうのは危険です。あくまで現状のLLM技術のPoCとして取り扱うべきでしょう。The AI Scientistでは、研究の全行程を自動化してしまっていますが、アイデア出しや実験のプロセスの効率化といった観点で、それぞれのモジュールを独立して使うと研究の効率化につながるかもしれません。

The AI Scientistを使って大量に論文を生成して発表する人はいないとは思いますが、現状の性能では品質の良い研究成果にはなりませんし、研究者倫理としても問題があるでしょう。同じ著者から似たような論文が複数投稿された場合、まともな学術誌なら編集者がリジェクトする可能性が高いと思います。掲載料さえ支払えば出版してくれる、「ハゲタカジャーナル」なら出版してくれるでしょうが、掲載料が非常に高額になるでしょうし、そのような雑誌に論文を掲載しても、研究者としての評価は下がる一方です。論文の捏造などの研究不正は、いずれ発覚し研究者としてのキャリアが終わってしまうリスクがありますので、絶対にやめましょう、論文警察にはご注意を。

論文警察。学術界に流布する有名な都市伝説のひとつである。

あらわれるときは、いつも黒服の二人組であるという。黒いサングラスをかけ、「出すか、出されるか」と繰り返し唱えるだけでほかのことばはいっさい発しない。この台詞は英語圏のばあい「Publish or Perish」であるそうだ。もとは税金である研究費や給与を受けとりながら論文を書く義務を怠っている研究者や、書いたはいいが論文執筆における不正行為を犯した研究者を捕え、黒塗りの乗用車に押しこんで連れ去っていく。拉致された研究者は世にもはずかしい制裁を受けるのだといわれている。

(松崎有理『架空論文投稿計画』より)

グループ研究開発本部 AI研究開発室では、データサイエンティスト/機械学習エンジニアを募集しています。ビッグデータの解析業務などAI研究開発室にご興味を持って頂ける方がいらっしゃいましたら、ぜひ募集職種一覧からご応募をお願いします。皆さんのご応募をお待ちしています。

参考資料

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery [https://github.com/SakanaAI/AI-Scientist]
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery [https://arxiv.org/abs/2408.06292]
Sakana AI Blog [https://sakana.ai/ai-scientist/]
Aider is AI pair programming in your terminal [https://aider.chat/]
Semantic Scholar [https://www.semanticscholar.org/]
「ここは作家・松崎有理の公式サイトです。」[https://yurimatsuzaki.com/]
『あがり』解説 [https://www.webmysteries.jp/archives/12245959.html]
『架空論文投稿計画』 [https://yurimatsuzaki.com/pseudopaper-book]
『山手線が転生して加速器になりました。』[https://yurimatsuzaki.com/yamanotesen]

ブログの著者欄

グループ研究開発本部

GMOインターネットグループ株式会社

グループ研究開発本部とは、GMOインターネットグループの事業領域で力を入れているスタートアップやグループ横断のプロジェクトにおいて、技術支援・開発・解析などを行い、ビジネスの成功を支援する部署です。最新のテクノロジーを研究開発し、いち早くビジネスに投入することで、お客様の発展と成功に貢献することを主要なミッションとしています。