Generate SQL Reports from Plain English using Large Language Models

Automating Data Retrieval for Non-Technical Users

Accessing valuable insights locked within databases often requires proficiency in SQL (Structured Query Language). This poses a significant barrier for non-technical users who need dynamic reports but lack the expertise to write complex queries. Manually crafting these queries can also be time-consuming and error-prone, especially in fast-paced production environments. A more efficient approach involves leveraging the power of Large Language Models (LLMs) to translate natural language requests directly into executable SQL.

This post explores a methodology for building a system that allows users to generate database reports simply by asking questions in plain English. While the examples provided utilize Ruby on Rails, the core concepts are language-agnostic and can be readily implemented using Python, Java, JavaScript, or other preferred stacks.

The Challenge: Bridging the Gap Between Users and Data

The primary goal is to empower users to retrieve data insights without needing direct SQL knowledge. An effective system must accomplish several key tasks:

  1. Interpret User Intent: Understand the user’s natural language query to identify the specific data required and the relevant database tables involved.
  2. Generate Accurate SQL: Convert the interpreted intent into a valid and efficient SQL query that can be executed against the database.
  3. Retrieve and Structure Data: Execute the generated query and return the results in a clear, structured format (like a table) that is easy for the user to understand and consume.

This approach aims to create a seamless, intuitive experience for users while utilizing LLMs to handle the technical complexity of SQL generation.

The Solution: A Modular Approach with LLMs

A robust solution can be built using a modular architecture, breaking the process down into distinct components:

1. LLM Interaction Layer

This component acts as an interface to the chosen Large Language Model’s API (e.g., OpenAI’s GPT series). It’s responsible for sending structured prompts to the LLM and parsing the responses.

require 'net/http'
require 'json'

module Llm; end # Define the namespace explicitly (Rails autoloading handles this for app/services/llm/chat.rb)

class Llm::Chat
  OPENAI_API_URL = "https://api.openai.com/v1/chat/completions"
  API_KEY = ENV['OPENAI_API_KEY'] # Store API key securely

  def initialize(payload:)
    @payload = payload
  end

  def call
    response = request_openai
    parse_response(response)
  end

  private

  def request_openai
    uri = URI(OPENAI_API_URL)
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = true
    request = Net::HTTP::Post.new(uri.path, headers)
    request.body = @payload.to_json
    http.request(request)
  end

  def headers
    {
      "Content-Type" => "application/json",
      "Authorization" => "Bearer #{API_KEY}"
    }
  end

  def parse_response(response)
    # dig avoids a NoMethodError when "choices" is missing or empty
    JSON.parse(response.body).dig("choices", 0, "message", "content")&.strip
  rescue StandardError
    nil # Handle malformed responses gracefully
  end
end
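
To exercise this client in isolation, a call might look like the following (assuming OPENAI_API_KEY is set in the environment); the payload mirrors the chat completions format used throughout this post:

payload = {
  "model": "gpt-4",
  "messages": [
    { "role": "user", "content": "Say hello in one word." }
  ],
  "max_tokens": 10
}

puts Llm::Chat.new(payload: payload).call # => e.g. "Hello"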

2. Identifying the Relevant Table(s)

Before generating SQL, the system needs to determine which database table(s) hold the relevant information for the user’s query. An LLM can be prompted to perform this identification based on the user’s input and a list of available tables.

class TableIdentifier
  def initialize(query:)
    @query = query
    # In a real application, you might fetch table names dynamically
    @available_tables = "users, departments, tickets"
  end

  def call
    chat
  end

  private

  def chat
    Llm::Chat.new(payload: chat_payload).call
  end

  def chat_payload
    {
      "model": "gpt-4", # Or another suitable model
      "messages": [
          { "role": "system", "content": "Given a user query, determine the most relevant table or tables from [#{@available_tables}]. If the query involves multiple tables (e.g., grouping users by department), return a comma-separated list of table names. Only return the table name(s) with no extra text." },
          { "role": "user", "content": "#{@query}" }
      ],
      "max_tokens": 100
    }
  end
end
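
To illustrate the expected behavior (actual output depends on the model and prompt), a couple of hypothetical calls:

TableIdentifier.new(query: "count of inactive users").call
# => "users"

TableIdentifier.new(query: "number of users per department").call
# => "users, departments"

In a Rails application, the hard-coded list can be replaced with ActiveRecord::Base.connection.tables.join(", ") so the prompt automatically reflects schema changes.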

Example Database Schema

To provide context to the LLM, the system needs information about the database structure. Here are example schemas for users, departments, and tickets tables:

Users Table

CREATE TABLE users (
  id INT PRIMARY KEY,
  name VARCHAR(255),
  email VARCHAR(255),
  status VARCHAR(50),
  department_id INT,
  created_at TIMESTAMP,
  updated_at TIMESTAMP
);

Departments Table

CREATE TABLE departments (
  id INT PRIMARY KEY,
  name VARCHAR(255),
  manager_id INT,
  created_at TIMESTAMP,
  updated_at TIMESTAMP
);

Tickets Table

CREATE TABLE tickets (
  id INT PRIMARY KEY,
  user_id INT,
  subject VARCHAR(255),
  status VARCHAR(50),
  created_at TIMESTAMP,
  updated_at TIMESTAMP
);

3. Generating and Executing the SQL Query

Once the target table(s) are identified, the ReportGenerator component uses the LLM again. This time, it provides the table structure(s) as context along with the user’s original query, instructing the LLM to generate the appropriate SQL. The generated SQL is then executed against the database.

class ReportGenerator
  # Assumes Llm::Chat and TableIdentifier classes are available
  # Assumes ActiveRecord or similar DB connection is configured

  def initialize(query:)
    @query = query
  end

  def call
    generated_sql = generate_sql_query
    if generated_sql
      execute_and_format(generated_sql)
    else
      # Handle cases where SQL generation failed
      [{ type: "text", data: "Sorry, I couldn't generate a report for that query." }]
    end
  end

  private

  def execute_and_format(sql_query)
    # IMPORTANT: Sanitize or validate the generated SQL before execution in production!
    begin
      results = ActiveRecord::Base.connection.select_all(sql_query).to_a
      [
        { type: "text", data: "Here is your report:" },
        { type: "table", data: results }
      ]
    rescue ActiveRecord::StatementInvalid => e
      # Handle SQL execution errors
      [{ type: "text", data: "Error executing the report query: #{e.message}" }]
    end
  end

  def fetch_table_structure(table_name)
    # This implementation is specific to MySQL's SHOW CREATE TABLE.
    # Adapt for your database (e.g., using information_schema).
    # Handle multiple tables if necessary.
    sanitized_table_name = ActiveRecord::Base.connection.quote_table_name(table_name)
    ActiveRecord::Base.connection.execute("SHOW CREATE TABLE #{sanitized_table_name}").first[1]
  rescue
    "Could not retrieve table structure for #{table_name}."
  end

  def identify_table_name
    # Handle potential comma-separated list if multiple tables are needed
    TableIdentifier.new(query: @query).call
  end

  def generate_sql_query
    target_tables = identify_table_name
    return nil unless target_tables # Exit if table identification failed

    # For simplicity, this example assumes single table or correctly joined tables inferred by LLM.
    # A more robust solution might fetch and combine structures for multiple tables explicitly.
    structure_context = target_tables.split(',').map { |t| fetch_table_structure(t.strip) }.join("\n\n")

    Llm::Chat.new(payload: query_payload(structure_context)).call
  end

  def query_payload(table_structure_context)
    {
      "model": "gpt-4", # Or another suitable model
      "messages": [
          { "role": "system", "content": "Generate a MySQL query based on the following table structure(s):\n\n#{table_structure_context}\n\nSupport queries involving multiple tables where applicable (e.g., joining users and departments). Only return the SQL query as plain text with no formatting, explanations, or markdown code fences (like ```sql)." },
          { "role": "user", "content": "#{@query}" }
      ],
      "max_tokens": 1000 # Adjust as needed
    }
  end
end
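
The fetch_table_structure method above relies on MySQL's SHOW CREATE TABLE. As a rough sketch of the portable alternative hinted at in its comment, the standard information_schema catalog (available in both MySQL and PostgreSQL) can be queried instead; the helper name and output format here are illustrative:

def fetch_table_structure_portable(table_name)
  # Quote the name as a string literal (it is a WHERE value here, not an identifier)
  quoted = ActiveRecord::Base.connection.quote(table_name)
  rows = ActiveRecord::Base.connection.select_all(
    "SELECT column_name, data_type FROM information_schema.columns " \
    "WHERE table_name = #{quoted} ORDER BY ordinal_position"
  ).to_a
  columns = rows.map do |row|
    row = row.transform_keys(&:downcase) # MySQL may return upper-case keys
    "#{row['column_name']} #{row['data_type']}"
  end.join(", ")
  "TABLE #{table_name} (#{columns})"
end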

Example Usage

With these components in place, generating a report becomes straightforward:

# Example 1: Simple count
ReportGenerator.new(query: "count of inactive users").call

# Example 2: List query
ReportGenerator.new(query: "list of active users").call

# Example 3: Query involving joins and grouping
ReportGenerator.new(query: "number of users per department").call

For the query “count of inactive users”, the system, after identifying the users table, would ideally generate:

SELECT COUNT(*) FROM users WHERE status = 'inactive';
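
Wrapped by execute_and_format, the full return value for this query might look like the following (the row contents are illustrative):

[
  { type: "text", data: "Here is your report:" },
  { type: "table", data: [{ "COUNT(*)" => 42 }] }
]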

For the query “number of users per department”, after identifying both the users and departments tables, the system would generate something like:

SELECT d.name, COUNT(u.id)
FROM users u
JOIN departments d ON u.department_id = d.id
GROUP BY d.name;

Key Considerations

  • Prompt Engineering: The quality of the LLM’s output heavily depends on the clarity and specificity of the prompts. Crafting effective system messages (like the instructions given to the LLM) often requires iteration and experimentation. You’ll need to refine prompts based on your specific database schema and the types of queries users make.
  • Security: Directly executing LLM-generated SQL in a production environment carries significant security risks (e.g., SQL injection, unintended data modification or deletion). Implement robust safeguards; a minimal guard combining the first and last points below is sketched after this list:
    • Validation: Parse and validate the generated SQL against allowed patterns or structures.
    • Permissions: Execute queries using database roles with the minimum necessary permissions (e.g., read-only access).
    • Sandboxing: Consider execution environments that limit the query’s potential impact.
    • Denylisting: Prevent the generation of potentially harmful SQL commands (DROP, DELETE, UPDATE without strict controls).
  • Accuracy and Hallucinations: LLMs can sometimes “hallucinate” or generate incorrect SQL. Include mechanisms for error handling and potentially allow users to flag incorrect results for review and prompt tuning.
  • Cost and Latency: API calls to LLMs incur costs and add latency. Optimize by caching results for common queries or using more efficient models where possible.
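
As one concrete illustration of the validation and denylisting points above, here is a minimal, hypothetical guard that only admits single read-only SELECT statements. It is a sketch, not a substitute for real SQL parsing or database-level permissions:

class SqlGuard
  # Hypothetical denylist; extend it to match your threat model
  FORBIDDEN = /\b(INSERT|UPDATE|DELETE|DROP|ALTER|TRUNCATE|CREATE|GRANT|REVOKE)\b/i

  def self.safe?(sql)
    statement = sql.to_s.strip
    return false unless statement.match?(/\ASELECT\b/i) # must start with SELECT
    return false if statement.match?(FORBIDDEN)         # no write/DDL keywords
    return false if statement.count(";") > 1            # no stacked statements
    true
  end
end

# Illustrative use inside ReportGenerator#call:
#   return [{ type: "text", data: "Query rejected." }] unless SqlGuard.safe?(generated_sql)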

The Advantages of Natural Language Reporting

Implementing such a system offers significant benefits:

  1. Accessibility: Empowers non-technical users to perform data analysis independently.
  2. Efficiency: Reduces the time and effort required to generate custom reports.
  3. Consistency: Ensures reports are generated using standardized logic defined via prompts.
  4. Scalability: Adapts to new reporting needs or schema changes, often by simply updating the context provided to the LLM, potentially reducing development bottlenecks.

Conclusion

Leveraging Large Language Models to translate natural language into SQL queries provides a powerful pathway to democratize data access within an organization. By carefully designing the system components, crafting effective prompts, and implementing essential security measures, businesses can create intuitive reporting tools that unlock valuable insights for a broader range of users, streamlining decision-making and enhancing operational efficiency.


At Innovative Software Technology, we specialize in harnessing the power of Large Language Models (LLMs) and natural language processing to build custom software solutions that simplify complex tasks. If you’re looking to empower your team with intuitive data analytics tools, automate SQL report generation, or integrate advanced AI into your business intelligence workflows, our expert developers can help. We design and implement robust, scalable systems, leveraging technologies like Ruby on Rails, Python, and cutting-edge AI, to transform your natural language requirements into actionable data insights, streamlining operations and unlocking the true potential of your database solutions. Partner with us to build intelligent applications that drive business value through accessible LLM-powered data retrieval and custom software development.
