Building an Automated AWS Reporting Pipeline for Publisher Readership

This article describes an automated reporting pipeline built on Amazon EventBridge, AWS Lambda, Amazon Athena, and Amazon S3. Each day it generates a summary of publisher readership, identifies books with missing metadata, stores the results in S3, and delivers structured notifications to stakeholders via Slack. The primary goal was to eliminate manual reporting effort.

Architectural Overview

The reporting system operates on a serverless architecture, orchestrated by Amazon EventBridge and powered by AWS Lambda functions. The workflow is designed for efficiency and automation:

Scheduled Trigger: Amazon EventBridge initiates the entire process daily at a predefined time.
Report Generation (Lambda Function #1):
- Executes an Amazon Athena query to analyze reading activity data.
- Saves the generated report as a CSV file in an S3 bucket using a consistent naming convention.
- Sends a preliminary Slack notification to confirm report readiness.
Publisher Summary Aggregation (Lambda Function #2):
- Retrieves the most recent CSV report from S3.
- Aggregates readership data per publisher, grouped by quarter.
- Posts a neatly formatted summary table to Slack, showcasing publisher performance.
Missing Metadata Detection (Lambda Function #3 – Optional):
- Runs a separate Athena query to pinpoint books lacking publisher information.
- Dispatches a Slack notification, including a direct link to the generated file in S3 for easy access.
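
As an illustration of the notification step, here is a minimal sketch of building and sending a Slack message that links to a file in S3. It assumes Slack is reached through an incoming webhook and that the link is a virtual-hosted-style S3 URL; the article does not say which delivery mechanism or link format the project uses. The function names, the `missing_count` parameter, and the message wording are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_message(bucket, key, region, missing_count):
    """Build a Slack payload that links to an S3 object (assumed URL style)."""
    # quote() keeps "/" but escapes spaces and other unsafe characters.
    url = f"https://{bucket}.s3.{region}.amazonaws.com/{urllib.parse.quote(key)}"
    # Slack renders <url|label> as a clickable link.
    text = f"{missing_count} books are missing publisher metadata. <{url}|Open the report>"
    return {"text": text}


def post_to_slack(webhook_url, payload):
    """POST a JSON payload to a Slack incoming webhook (assumed mechanism)."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

A plain object URL like this only opens for readers who are allowed to access the object, so a presigned URL or a console link may be a better fit depending on how the bucket is secured.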

Project Organization

The project employs a structured directory layout to manage code, configuration, and infrastructure-as-code:

publisher-reporting/
│
├── deploy.sh                         # Script for infrastructure provisioning and Lambda packaging
├── config.yaml                       # Runtime configurations (e.g., bucket names, cron schedules)
│
├── lambda/                           # Directory for Lambda function code
│   ├── report_generator/             # Generates Athena query results and saves to S3
│   │   └── handler.py
│   ├── summary_report_notifier/      # Aggregates data and posts summary to Slack
│   │   └── handler.py
│   └── missing_publisher_report/     # Detects and reports missing publisher metadata
│       └── handler.py
│
└── terraform/                        # Terraform files for Infrastructure as Code (Lambda, EventBridge, IAM roles, etc.)

Core Code Snippets

Key logic within the Lambda functions includes:

Report Generator:
- Initiating Athena queries: athena.start_query_execution(...)
- Storing and copying results to S3: s3.copy_object(...)
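
The two calls above might fit together roughly as follows. This is a sketch rather than the project's actual handler: it assumes the function receives boto3 Athena and S3 clients, and the parameter names, polling interval, and timeout are illustrative. Because `start_query_execution` returns before the query finishes, the sketch polls `get_query_execution` before copying the result.

```python
import time


def run_query_and_copy(athena, s3, query, database, staging_location,
                       dest_bucket, dest_key, poll_seconds=2, max_polls=150):
    """Run an Athena query, wait for it to finish, then copy the result CSV.

    `athena` and `s3` are boto3 clients, e.g. boto3.client("athena").
    """
    started = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": staging_location},
    )
    query_id = started["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    for _ in range(max_polls):
        execution = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]
        state = execution["Status"]["State"]
        if state == "SUCCEEDED":
            break
        if state in ("FAILED", "CANCELLED"):
            reason = execution["Status"].get("StateChangeReason", "no reason given")
            raise RuntimeError(f"Athena query {query_id} {state}: {reason}")
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Athena query {query_id} did not finish in time")

    # Athena reports where it wrote the CSV, in the form s3://bucket/key.
    output = execution["ResultConfiguration"]["OutputLocation"]
    src_bucket, _, src_key = output[len("s3://"):].partition("/")

    # Copy the result to its final, consistently named location.
    s3.copy_object(
        Bucket=dest_bucket,
        Key=dest_key,
        CopySource={"Bucket": src_bucket, "Key": src_key},
    )
    return dest_key
```

Passing the clients in as arguments keeps the function easy to exercise with stubs; a Lambda handler would create them once with `boto3.client(...)` and call this function.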
Slack Summary Formatter:
- Processing CSV data: Iterating through rows to aggregate counts based on year_quarter and publisher.
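
The aggregation step can be sketched with the standard library alone. The `year_quarter` and `publisher` columns come from the description above; the `read_count` column name and the `(unknown)` placeholder for blank publishers are assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict


def aggregate_reads(csv_text):
    """Sum read counts per (year_quarter, publisher) from CSV text."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Treat a missing or blank publisher as its own bucket (assumption).
        publisher = (row.get("publisher") or "").strip() or "(unknown)"
        totals[(row["year_quarter"], publisher)] += int(row["read_count"])
    return dict(totals)
```

For example, two rows for the same publisher in the same quarter collapse into a single total, keyed by the `(year_quarter, publisher)` pair.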

Scheduling Mechanism

Each Lambda function is independently scheduled using EventBridge, with careful staggering to prevent conflicts:

Lambda Function	Purpose	Schedule (UTC Cron Format)
`report_generator`	Runs Athena query, saves CSV to S3	`cron(0 0 * * ? *)`
`summary_report_notifier`	Reads latest CSV, posts Slack summary	`cron(10 0 * * ? *)`
`missing_publisher_report`	Detects books without publisher info	`cron(15 0 * * ? *)`

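Note: The 10- and 15-minute offsets give the report generator time to finish before the downstream functions start. A fixed offset does not by itself guarantee completion, so it should comfortably exceed the generator's typical run time.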
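
Since a time offset alone cannot confirm that the generator has finished, the notifier could also check how fresh the newest report is before posting. The sketch below is one possible guard, not part of the described project: it takes the `Contents` list from an S3 `list_objects_v2` response, and the function name and the one-hour `max_age` default are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def latest_fresh_csv(objects, now=None, max_age=timedelta(hours=1)):
    """Return the key of the newest .csv object, or None if it is too old.

    `objects` is the "Contents" list from an S3 list_objects_v2 response,
    whose entries carry "Key" and a timezone-aware "LastModified".
    """
    now = now or datetime.now(timezone.utc)
    csvs = [obj for obj in objects if obj["Key"].endswith(".csv")]
    if not csvs:
        return None
    newest = max(csvs, key=lambda obj: obj["LastModified"])
    # A stale newest file suggests today's report has not been written yet.
    if now - newest["LastModified"] > max_age:
        return None
    return newest["Key"]
```

When the function returns `None`, the notifier can skip posting or raise an error instead of summarizing yesterday's data. Note that `list_objects_v2` returns at most 1,000 keys per call, so a large prefix would need pagination.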

Overcoming Challenges

Several issues were encountered during development and deployment, leading to crucial fixes:

Stale Slack Data: Initially, Slack reports showed static data due to an EventBridge schedule running only monthly. Fix: Updated to cron(0 0 * * ? *) for daily execution.
Missing Report Files: A period where no new publisher summary files appeared in S3 was traced back to the report_generator Lambda not being triggered. Fix: Corrected the EventBridge schedule for the generator Lambda to run daily.
Lambda Race Conditions: The Slack posting Lambda sometimes accessed an outdated CSV. Fix: Implemented a 10-minute delay between the report generation and Slack notification Lambdas using distinct EventBridge schedules.
Unreadable Slack Output: Publisher read counts in Slack were poorly formatted. Fix: Wrapped the summary in Slack's triple backticks (```) so it renders as a preformatted, monospaced block.
S3 Clutter: Temporary Athena query results were accumulating in the output bucket. Fix: Configured Athena to store temporary results in a dedicated temporary-athena-query-results/ folder and applied an S3 lifecycle policy to automatically delete these files after three days.
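
To illustrate the Slack formatting fix, here is a minimal sketch that renders per-publisher totals as a fixed-width table inside a preformatted block. The input shape (a dict keyed by `(year_quarter, publisher)`), the column headings, and the function name are assumptions for this example.

```python
def format_summary(totals):
    """Render {(year_quarter, publisher): count} as a Slack code block."""
    rows = sorted(totals.items())
    # Size the publisher column to the longest name (or the heading).
    width = max([len("Publisher")] + [len(pub) for (_, pub), _ in rows])
    fence = "`" * 3  # Slack's preformatted-text delimiter
    lines = [f"{'Quarter':<8} {'Publisher':<{width}} {'Reads':>8}"]
    for (quarter, publisher), count in rows:
        lines.append(f"{quarter:<8} {publisher:<{width}} {count:>8,}")
    return fence + "\n" + "\n".join(lines) + "\n" + fence
```

Because Slack shows preformatted text in a monospaced font, padding each column to a fixed width keeps the figures aligned.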

Tangible Impact

The implementation of this automated pipeline delivered significant benefits:

Zero Manual Effort: Eliminated all manual processes associated with report generation.
Enhanced Visibility: Provided the team with clear, daily insights into reader engagement.
Scalable Infrastructure: Ensured a robust, serverless solution adhering to AWS best practices.
Proactive Issue Detection: Automated alerts improved the tracking of data consistency issues.

This project shows how Amazon Athena, AWS Lambda, Amazon S3, and Amazon EventBridge can be combined into a cost-effective, automated data reporting system. The same pattern adapts to other data workflows, including product analytics, user activity tracking, and sales dashboard generation.