Validating External Data in CI Pipelines with DuckDB and HTTP Client Extension
Ensuring data quality is paramount, especially when your data depends on external resources. This article presents a practical approach for validating such external dependencies within a Continuous Integration (CI) pipeline, using DuckDB and its http_client extension. Specifically, we'll explore how to verify that listed GitHub accounts actually exist, preventing common data entry errors such as typos and making your CI process more robust.
The Challenge: Verifying External Resources in CI
A common scenario in data management involves maintaining lists of external identifiers, such as GitHub usernames, associated with internal records. The integrity of these lists often depends on the accuracy of these external IDs. Manual checks are prone to error and don’t scale well. Integrating an automated validation step directly into your CI pipeline is an ideal solution.
For instance, if your internal system tracks personnel and their GitHub handles, you need a mechanism to confirm that each listed GitHub account genuinely exists. This validation should ideally happen automatically during data updates or deployment, flagging any invalid entries before they cause issues.
DuckDB to the Rescue for Data Validation
DuckDB, a high-performance in-process analytical database, proves to be an excellent tool for this kind of data quality assurance. When combined with GitHub Actions, it offers a seamless way to integrate data validation directly into your development workflow.
The key component for our task is DuckDB's http_client extension. This extension allows DuckDB to make HTTP requests directly from SQL queries, enabling the database to interact with web services and retrieve information such as HTTP status codes. This capability is crucial for verifying the existence of external resources by checking their response status.
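As a minimal sketch of what such a call looks like (assuming the extension is installed and loaded as shown in the next step, and that the machine running DuckDB can reach github.com), a single request and its status code can be inspected directly from a SELECT:
-- Minimal smoke test: fetch one public GitHub profile page and read its HTTP status.
-- The .status field access mirrors the queries used later in this article.
SELECT CAST(http_get('https://github.com/duckdb').status AS INTEGER) AS http_status;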
Implementing GitHub Handle Validation with DuckDB SQL
Here’s a step-by-step guide using SQL to validate GitHub usernames:
1. Install and Load the http_client Extension
First, you need to enable the http_client extension in your DuckDB environment:
INSTALL http_client FROM community;
LOAD http_client;
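Optionally, you can confirm that the extension is available before running any checks. This sanity query, which relies on DuckDB's built-in duckdb_extensions() function and is not part of the original workflow, should report the extension as both installed and loaded:
-- Optional sanity check: the http_client row should show installed = true and loaded = true.
SELECT extension_name, installed, loaded
FROM duckdb_extensions()
WHERE extension_name = 'http_client';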
2. Prepare Your Data
Next, create a table to hold your data, including the GitHub handles you wish to validate. For demonstration, we’ll set up a sample table:
CREATE OR REPLACE TABLE person_gh_member
(
    sam_accountname VARCHAR PRIMARY KEY,
    gh_member VARCHAR NOT NULL
);
-- Insert example data, including some valid and potentially invalid handles
INSERT INTO person_gh_member (sam_accountname, gh_member)
VALUES
    ('adriens', 'adriens'),
    ('jdoe', 'johndoe'),
    ('asmith', 'annasmithRRRRR'); -- This one is likely invalid
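In a real pipeline the handles would typically live in a versioned file rather than in literal INSERT statements. As a hedged alternative, assuming a hypothetical people.csv in the repository with the same two columns, the table could be built directly from that file:
-- Hypothetical alternative: load the handles from a CSV tracked in the repository.
-- 'people.csv' and its column names are assumptions, not part of the original article.
CREATE OR REPLACE TABLE person_gh_member AS
SELECT sam_accountname, gh_member
FROM read_csv('people.csv', header = true);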
3. Query GitHub Statuses
Now, create a view that constructs the GitHub profile URL for each member and uses http_get to fetch the HTTP status code. A 200 OK status indicates a valid GitHub profile page, while a 404 Not Found response means the account does not exist.
CREATE OR REPLACE VIEW v_person_gh_status AS
SELECT
    sam_accountname,
    gh_member,
    'https://github.com/' || gh_member AS gh_url,
    CAST(http_get(gh_url).status AS INTEGER) AS http_gh_status
FROM person_gh_member;
-- To see the results of this view:
SELECT * FROM v_person_gh_status;
This query will return the http_gh_status for each GitHub handle. You can then filter for any entries where http_gh_status is not 200 to identify invalid handles:
SELECT *
FROM v_person_gh_status
WHERE http_gh_status <> 200;
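If you want this check to fail a CI run on its own, before introducing the CHECK constraints described next, one possible pattern (not taken from the original article) is DuckDB's error() function, which raises an error for any row that reaches it:
-- Raises an error if any handle did not return HTTP 200; returns zero rows otherwise.
SELECT error('Invalid GitHub handle: ' || gh_member ||
             ' (HTTP ' || CAST(http_gh_status AS VARCHAR) || ')')
FROM v_person_gh_status
WHERE http_gh_status <> 200;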
4. Enforce Data Quality with SQL CHECK Constraints
To fully integrate this validation into your data schema and make it part of a data linting process, you can use CHECK constraints on a dedicated table. This ensures that any data inserted or updated adheres to your validation rules.
CREATE OR REPLACE TABLE lint_gh_handle (
    gh_handle VARCHAR PRIMARY KEY,
    gh_url VARCHAR NOT NULL UNIQUE CHECK (gh_url LIKE 'https://github.com/%'),
    gh_status INTEGER CHECK (gh_status = 200)
);
-- Insert data into this linting table. Any row that fails the CHECK constraint
-- (i.e., gh_status is not 200) will cause an error, effectively "linting" your data.
INSERT INTO lint_gh_handle (gh_handle, gh_url, gh_status)
SELECT
    gh_member,
    'https://github.com/' || gh_member AS gh_url,
    CAST(http_get(gh_url).status AS INTEGER) AS gh_status
FROM person_gh_member;
When you attempt to INSERT data into lint_gh_handle, DuckDB will automatically check the gh_status. If any gh_status is not 200, the insertion will fail, signaling a data quality issue directly within your CI pipeline. This allows you to catch and correct invalid GitHub handles efficiently.
Conclusion
By combining DuckDB, its http_client extension, and SQL CHECK constraints, you can create a powerful and automated data validation system within your CI/CD pipeline. This method ensures the integrity of data that depends on external web resources, such as GitHub accounts, making your data management more reliable and your development process more robust. This approach exemplifies how a lightweight yet powerful database can streamline complex data quality challenges with simple SQL.