Generate SQL Reports from Plain English using Large Language Models
Automating Data Retrieval for Non-Technical Users
Accessing valuable insights locked within databases often requires proficiency in SQL (Structured Query Language). This poses a significant barrier for non-technical users who need dynamic reports but lack the expertise to write complex queries. Manually crafting these queries can also be time-consuming and error-prone, especially in fast-paced production environments. A more efficient approach involves leveraging the power of Large Language Models (LLMs) to translate natural language requests directly into executable SQL.
This post explores a methodology for building a system that allows users to generate database reports simply by asking questions in plain English. While the examples provided utilize Ruby on Rails, the core concepts are language-agnostic and can be readily implemented using Python, Java, JavaScript, or other preferred stacks.
The Challenge: Bridging the Gap Between Users and Data
The primary goal is to empower users to retrieve data insights without needing direct SQL knowledge. An effective system must accomplish several key tasks:
- Interpret User Intent: Understand the user’s natural language query to identify the specific data required and the relevant database tables involved.
- Generate Accurate SQL: Convert the interpreted intent into a valid and efficient SQL query that can be executed against the database.
- Retrieve and Structure Data: Execute the generated query and return the results in a clear, structured format (like a table) that is easy for the user to understand and consume.
This approach aims to create a seamless, intuitive experience for users while utilizing LLMs to handle the technical complexity of SQL generation.
The Solution: A Modular Approach with LLMs
A robust solution can be built using a modular architecture, breaking the process down into distinct components:
1. LLM Interaction Layer
This component acts as an interface to the chosen Large Language Model’s API (e.g., OpenAI’s GPT series). It’s responsible for sending structured prompts to the LLM and parsing the responses.
require 'net/http'
require 'json'

class Llm::Chat
  OPENAI_API_URL = "https://api.openai.com/v1/chat/completions"
  API_KEY = ENV['OPENAI_API_KEY'] # Store the API key securely, never in source control

  def initialize(payload:)
    @payload = payload
  end

  def call
    response = request_openai
    parse_response(response)
  end

  private

  def request_openai
    uri = URI(OPENAI_API_URL)
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = true
    request = Net::HTTP::Post.new(uri.path, headers)
    request.body = @payload.to_json
    http.request(request)
  end

  def headers
    {
      "Content-Type" => "application/json",
      "Authorization" => "Bearer #{API_KEY}"
    }
  end

  def parse_response(response)
    # dig returns nil instead of raising when "choices" is missing or empty
    JSON.parse(response.body).dig("choices", 0, "message", "content")&.strip
  rescue JSON::ParserError
    nil # Handle malformed responses gracefully
  end
end
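To illustrate the response shape that parse_response handles, here is a minimal standalone sketch of the same extraction logic. The sample payloads are hypothetical stand-ins for a real Chat Completions response and an error response:

```ruby
require 'json'

# Hypothetical sample of an OpenAI Chat Completions response body.
sample_body = {
  "choices" => [
    { "message" => { "role" => "assistant", "content" => "  SELECT 1;  " } }
  ]
}.to_json

# dig avoids a NoMethodError when "choices" is missing or empty
# (e.g. on an API error payload), returning nil instead.
content = JSON.parse(sample_body).dig("choices", 0, "message", "content")&.strip
puts content # => "SELECT 1;"

# An error payload has no "choices" key, so extraction yields nil.
error_body = { "error" => { "message" => "rate limited" } }.to_json
puts JSON.parse(error_body).dig("choices", 0, "message", "content").inspect # => nil
```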
2. Identifying the Relevant Table(s)
Before generating SQL, the system needs to determine which database table(s) hold the relevant information for the user’s query. An LLM can be prompted to perform this identification based on the user’s input and a list of available tables.
class TableIdentifier
  def initialize(query:)
    @query = query
    # In a real application, you might fetch table names dynamically
    @available_tables = "users, departments, tickets"
  end

  def call
    chat
  end

  private

  def chat
    Llm::Chat.new(payload: chat_payload).call
  end

  def chat_payload
    {
      "model": "gpt-4", # Or another suitable model
      "messages": [
        { "role": "system", "content": "Given a user query, determine the most relevant table or tables from [#{@available_tables}]. If the query involves multiple tables (e.g., grouping users by department), return a comma-separated list of table names. Only return the table name(s) with no extra text." },
        { "role": "user", "content": @query }
      ],
      "max_tokens": 100
    }
  end
end
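Even with explicit instructions, the model may return extra text or an unrecognized table name, so it is prudent to validate the identifier's answer against an allowlist before using it. A minimal sketch, assuming the same hard-coded table list as above (extract_tables is an illustrative helper, not part of the classes shown):

```ruby
# Validates an LLM table-identification answer against a known allowlist.
# Returns the recognized table names, or nil if anything unexpected appears.
AVAILABLE_TABLES = %w[users departments tickets].freeze

def extract_tables(llm_answer)
  return nil if llm_answer.nil?

  tables = llm_answer.split(',').map { |t| t.strip.downcase }
  return nil if tables.empty? || tables.any? { |t| !AVAILABLE_TABLES.include?(t) }

  tables
end

puts extract_tables("users, departments").inspect # => ["users", "departments"]
puts extract_tables("DROP TABLE users").inspect   # => nil
```

Rejecting anything outside the allowlist also doubles as a cheap safety check: a hallucinated or malicious answer never reaches the SQL-generation step.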
Example Database Schema
To provide context to the LLM, the system needs information about the database structure. Here are example schemas for the users, departments, and tickets tables:
Users Table
CREATE TABLE users (
  id INT PRIMARY KEY,
  name VARCHAR(255),
  email VARCHAR(255),
  status VARCHAR(50),
  department_id INT,
  created_at TIMESTAMP,
  updated_at TIMESTAMP
);
Departments Table
CREATE TABLE departments (
  id INT PRIMARY KEY,
  name VARCHAR(255),
  manager_id INT,
  created_at TIMESTAMP,
  updated_at TIMESTAMP
);
Tickets Table
CREATE TABLE tickets (
  id INT PRIMARY KEY,
  user_id INT,
  subject VARCHAR(255),
  status VARCHAR(50),
  created_at TIMESTAMP,
  updated_at TIMESTAMP
);
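Full DDL can be verbose in a prompt; one option is to condense each CREATE TABLE statement into a compact column summary before including it as context. An illustrative sketch using a simple regex over DDL shaped like the statements above (a heuristic, not a full SQL parser; summarize_ddl is a hypothetical helper):

```ruby
# Condenses a CREATE TABLE statement into "table(col TYPE, ...)" for prompt
# context. A naive regex heuristic that assumes simple DDL like the examples
# above; types containing commas (e.g. DECIMAL(10,2)) would need a real parser.
def summarize_ddl(ddl)
  table = ddl[/CREATE TABLE\s+(\w+)/i, 1]
  body  = ddl[/\((.*)\)/m, 1]
  cols  = body.split(',').map(&:strip).map { |c| c.split(/\s+/).first(2).join(' ') }
  "#{table}(#{cols.join(', ')})"
end

ddl = <<~SQL
  CREATE TABLE users (
    id INT PRIMARY KEY,
    name VARCHAR(255),
    department_id INT
  );
SQL

puts summarize_ddl(ddl)
# => "users(id INT, name VARCHAR(255), department_id INT)"
```

Shorter context means fewer tokens per request, which matters once many tables are involved.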
3. Generating and Executing the SQL Query
Once the target table(s) are identified, the ReportGenerator component uses the LLM again. This time, it provides the table structure(s) as context along with the user’s original query, instructing the LLM to generate the appropriate SQL. The generated SQL is then executed against the database.
class ReportGenerator
  # Assumes Llm::Chat and TableIdentifier classes are available
  # Assumes ActiveRecord or similar DB connection is configured
  def initialize(query:)
    @query = query
  end

  def call
    generated_sql = generate_sql_query
    if generated_sql
      execute_and_format(generated_sql)
    else
      # Handle cases where SQL generation failed
      [{ type: "text", data: "Sorry, I couldn't generate a report for that query." }]
    end
  end

  private

  def execute_and_format(sql_query)
    # IMPORTANT: Sanitize or validate the generated SQL before execution in production!
    results = ActiveRecord::Base.connection.select_all(sql_query).to_a
    [
      { type: "text", data: "Here is your report:" },
      { type: "table", data: results }
    ]
  rescue ActiveRecord::StatementInvalid => e
    # Handle SQL execution errors
    [{ type: "text", data: "Error executing the report query: #{e.message}" }]
  end

  def fetch_table_structure(table_name)
    # This implementation is specific to MySQL's SHOW CREATE TABLE.
    # Adapt for your database (e.g., using information_schema).
    # Handle multiple tables if necessary.
    sanitized_table_name = ActiveRecord::Base.connection.quote_table_name(table_name)
    ActiveRecord::Base.connection.execute("SHOW CREATE TABLE #{sanitized_table_name}").first[1]
  rescue ActiveRecord::StatementInvalid
    "Could not retrieve table structure for #{table_name}."
  end

  def identify_table_name
    # May return a comma-separated list when multiple tables are needed
    TableIdentifier.new(query: @query).call
  end

  def generate_sql_query
    target_tables = identify_table_name
    return nil unless target_tables # Exit if table identification failed

    # For simplicity, this example assumes a single table or correctly joined
    # tables inferred by the LLM. A more robust solution might fetch and combine
    # structures for multiple tables explicitly.
    structure_context = target_tables.split(',').map { |t| fetch_table_structure(t.strip) }.join("\n\n")
    Llm::Chat.new(payload: query_payload(structure_context)).call
  end

  def query_payload(table_structure_context)
    {
      "model": "gpt-4", # Or another suitable model
      "messages": [
        { "role": "system", "content": "Generate a MySQL query based on the following table structure(s):\n\n#{table_structure_context}\n\nSupport queries involving multiple tables where applicable (e.g., joining users and departments). Only return the SQL query as plain text with no formatting, explanations, or markdown code fences (like ```sql)." },
        { "role": "user", "content": @query }
      ],
      "max_tokens": 1000 # Adjust as needed
    }
  end
end
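As the comment in execute_and_format warns, generated SQL should be validated before it ever reaches the database. A coarse sketch of such a guard (safe_select? is a hypothetical helper; keyword checks alone are easy to evade, so pair this with read-only database credentials in production):

```ruby
# Rejects generated SQL unless it looks like a single SELECT statement
# free of obviously destructive keywords. A coarse first line of defense,
# not a substitute for read-only permissions or a real SQL parser.
FORBIDDEN_SQL = /\b(DROP|DELETE|UPDATE|INSERT|ALTER|TRUNCATE|GRANT)\b/i

def safe_select?(sql)
  return false if sql.nil?

  statement = sql.strip
  statement.match?(/\ASELECT\b/i) &&   # must start as a SELECT
    !statement.match?(FORBIDDEN_SQL) && # no destructive keywords anywhere
    statement.count(';') <= 1           # no stacked statements
end

puts safe_select?("SELECT COUNT(*) FROM users WHERE status = 'inactive';") # => true
puts safe_select?("DROP TABLE users;")                                     # => false
puts safe_select?("SELECT 1; INSERT INTO users VALUES (1);")               # => false
```

Calling this check inside execute_and_format, before select_all, turns the "IMPORTANT" comment into enforced behavior.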
Example Usage
With these components in place, generating a report becomes straightforward:
# Example 1: Simple count
ReportGenerator.new(query: "count of inactive users").call
# Example 2: List query
ReportGenerator.new(query: "list of active users").call
# Example 3: Query involving joins and grouping
ReportGenerator.new(query: "number of users per department").call
For the query “count of inactive users”, the system, after identifying the users table, would ideally generate:
SELECT COUNT(*) FROM users WHERE status = 'inactive';
For the query “number of users per department”, the system, identifying users and departments, would generate something similar to:
SELECT d.name, COUNT(u.id)
FROM users u
JOIN departments d ON u.department_id = d.id
GROUP BY d.name;
Key Considerations
- Prompt Engineering: The quality of the LLM’s output heavily depends on the clarity and specificity of the prompts. Crafting effective system messages (like the instructions given to the LLM) often requires iteration and experimentation. You’ll need to refine prompts based on your specific database schema and the types of queries users make.
- Security: Directly executing LLM-generated SQL in a production environment carries significant security risks (e.g., SQL injection, unintended data modification/deletion). Implement robust safeguards:
- Validation: Parse and validate the generated SQL against allowed patterns or structures.
- Permissions: Execute queries using database roles with the minimum necessary permissions (e.g., read-only access).
- Sandboxing: Consider execution environments that limit the query’s potential impact.
- Denylisting: Prevent execution of potentially harmful SQL commands (DROP, DELETE, UPDATE) without strict controls.
- Accuracy and Hallucinations: LLMs can sometimes “hallucinate” or generate incorrect SQL. Include mechanisms for error handling and potentially allow users to flag incorrect results for review and prompt tuning.
- Cost and Latency: API calls to LLMs incur costs and add latency. Optimize by caching results for common queries or using more efficient models where possible.
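For the cost and latency point, generated SQL can be memoized per normalized question, so repeated requests skip the LLM call entirely. A minimal in-memory sketch (SqlCache is a hypothetical class; a Rails app might use Rails.cache with a TTL instead):

```ruby
require 'digest'

# In-memory cache of question => generated SQL, keyed by a digest of the
# normalized question. The block passed to the constructor stands in for
# the real LLM call.
class SqlCache
  attr_reader :calls

  def initialize(&generator)
    @store = {}
    @generator = generator
    @calls = 0 # counts how often the generator (i.e. the LLM) was invoked
  end

  def fetch(question)
    key = Digest::SHA256.hexdigest(question.strip.downcase.squeeze(' '))
    @store[key] ||= begin
      @calls += 1
      @generator.call(question)
    end
  end
end

cache = SqlCache.new { |q| "SELECT COUNT(*) FROM users;" } # stub "LLM"
cache.fetch("count of users")
cache.fetch("Count of  users") # normalizes to the same key: no second call
puts cache.calls # => 1
```

Normalizing case and whitespace before hashing catches trivially rephrased duplicates; semantically equivalent but differently worded questions would still miss the cache.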
The Advantages of Natural Language Reporting
Implementing such a system offers significant benefits:
- Accessibility: Empowers non-technical users to perform data analysis independently.
- Efficiency: Reduces the time and effort required to generate custom reports.
- Consistency: Ensures reports are generated using standardized logic defined via prompts.
- Scalability: Adapts to new reporting needs or schema changes, often by simply updating the context provided to the LLM, potentially reducing development bottlenecks.
Conclusion
Leveraging Large Language Models to translate natural language into SQL queries provides a powerful pathway to democratize data access within an organization. By carefully designing the system components, crafting effective prompts, and implementing essential security measures, businesses can create intuitive reporting tools that unlock valuable insights for a broader range of users, streamlining decision-making and enhancing operational efficiency.
At Innovative Software Technology, we specialize in harnessing the power of Large Language Models (LLMs) and natural language processing to build custom software solutions that simplify complex tasks. If you’re looking to empower your team with intuitive data analytics tools, automate SQL report generation, or integrate advanced AI into your business intelligence workflows, our expert developers can help. We design and implement robust, scalable systems, leveraging technologies like Ruby on Rails, Python, and cutting-edge AI, to transform your natural language requirements into actionable data insights, streamlining operations and unlocking the true potential of your database solutions. Partner with us to build intelligent applications that drive business value through accessible LLM-powered data retrieval and custom software development.