用 AI 分析数据

E2B 用 AI 分析数据

参考: https://e2b.dev/docs/code-interpreting/analyze-data-with-ai

准备

mkdir -p ~/work/code/e2b/analyze-data-with-ai
cd ~/work/code/e2b/analyze-data-with-ai

下载:

https://www.kaggle.com/datasets/muqarrishzaib/tmdb-10000-movies-dataset

解压并重命名为 dataset.csv , 并移动到当前目录.

安装 python 依赖:

pip install e2b-code-interpreter anthropic python-dotenv

使用 AI 分析数据

vi analyze-data-with-ai.py

内容为:

import sys
import os
import base64
from dotenv import load_dotenv
load_dotenv()
from e2b_code_interpreter import Sandbox
from openai import OpenAI

# Create sandbox
sbx = Sandbox()

# Upload the dataset to the sandbox
with open("./dataset.csv", "rb") as f:
    dataset_path_in_sandbox = sbx.files.write("dataset.csv", f)


def run_ai_generated_code(ai_generated_code: str):
    print('Running the code in the sandbox....')
    execution = sbx.run_code(ai_generated_code)
    print('Code execution finished!')

    # First let's check if the code ran successfully.
    if execution.error:
        print('AI-generated code had an error.')
        print(execution.error.name)
        print(execution.error.value)
        print(execution.error.traceback)
        sys.exit(1)

    # Iterate over all the results and specifically check for png files that will represent the chart.
    result_idx = 0
    for result in execution.results:
        if result.png:
            # Save the png to a file
            # The png is in base64 format.
            with open(f'chart-{result_idx}.png', 'wb') as f:
                f.write(base64.b64decode(result.png))
            print(f'Chart saved to chart-{result_idx}.png')
            result_idx += 1

prompt = f"""
I have a CSV file about movies. It has about 10k rows. It's saved in the sandbox at {dataset_path_in_sandbox.path}.
These are the columns:
- 'id': number, id of the movie
- 'original_language': string like "eng", "es", "ko", etc
- 'original_title': string that's name of the movie in the original language
- 'overview': string about the movie
- 'popularity': float, from 0 to 9137.939. It's not normalized at all and there are outliers
- 'release_date': date in the format yyyy-mm-dd
- 'title': string that's the name of the movie in english
- 'vote_average': float number between 0 and 10 that's representing viewers voting average
- 'vote_count': int for how many viewers voted

I want to better understand how the vote average has changed over the years.
Write Python code that analyzes the dataset based on my request and produces right chart accordingly"""

client = OpenAI()

print("Waiting for model response...")
msg = client.messages.create(
  model="gpt-4o",
  max_tokens=1024,
  messages=[
    {"role": "user", "content": prompt}
  ],
  tools=[
    {
      "name": "run_python_code",
      "description": "Run Python code",
      "input_schema": {
        "type": "object",
        "properties": {
          "code": { "type": "string", "description": "The Python code to run" },
        },
        "required": ["code"]
      }
    }
  ]
)

for content_block in msg.content:
    if content_block.type == "tool_use":
        if content_block.name == "run_python_code":
            code = content_block.input["code"]
            print("Will run following code in the sandbox", code)
            # Execute the code in the sandbox
            run_ai_generated_code(code)

执行:

python analyze-data-with-ai.py

输出为: