Migrating CMS Assets with MongoDB and Node.js: A Practical Guide
Content platforms and strategies evolve. Businesses often find themselves needing to redistribute content across various channels or migrate to new systems. This frequently involves moving data stored within a Content Management System (CMS), especially when that CMS relies on a flexible database like MongoDB. Handling both the structured content and associated media assets during such migrations requires a careful approach.
When faced with moving content managed within a MongoDB-backed CMS, a common challenge arises: exporting not just the text or markdown, but also ensuring linked media assets (images, documents), often stored on separate cloud storage, are preserved and correctly associated.
This guide outlines a practical process using Node.js to export content and download related media assets from a MongoDB database, preparing them for migration or redistribution.
Prerequisites
To follow along or adapt this process, ensure you have the following set up:
- A MongoDB database instance (either locally hosted or on a cloud service like MongoDB Atlas). Ensure network configurations allow connections from your script’s environment.
- Node.js (Version 22 or later is recommended, though other recent versions might work; consult API documentation if needed).
It’s important to recognize that CMS data structures vary significantly. While this guide uses specific field names from an example scenario (`calculated_slug`, `name`, `content`, etc.), the core concepts of querying MongoDB and handling external assets are widely applicable. Adapt the field names and queries to match your specific data model.
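For orientation, here is one hypothetical document shape that uses these field names (your actual documents will differ):
{
    "_id": { "$oid": "65f1c0ffee0ddba11ad0f00d" },
    "name": "My First Post",
    "calculated_slug": "/blog/my-first-post",
    "description": "A short introduction to the blog.",
    "authors": ["Jane Doe"],
    "tags": ["mongodb", "migration"],
    "content": "# My First Post\n\n![hero](https://cdn.example.com/images/hero-image.png)\n\nBody text..."
}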
Setting Up the Node.js Project
First, initialize a new Node.js project and install the necessary libraries. Open your terminal in your desired project directory and run:
npm init -y
npm install mongodb axios dotenv --save
These commands perform the following:
- `npm init -y`: Creates a `package.json` file for your project with default settings.
- `npm install ...`: Installs the required dependencies:
  - `mongodb`: The official MongoDB driver for Node.js, allowing interaction with your database.
  - `axios`: A promise-based HTTP client for making requests to download media assets.
  - `dotenv`: A module to load environment variables from a `.env` file into `process.env`.
Next, create the main script file and an environment configuration file:
touch main.js
touch .env
(If you’re on Windows, create these files using your preferred method, e.g., File Explorer or an editor).
All the migration logic will reside in `main.js`. Before writing the code, populate the `.env` file with your configuration details. This file keeps sensitive information like database credentials separate from your code.
Open `.env` and add the following lines, replacing the values after `=` with your specific details:
MONGODB_ATLAS_URI=your_mongodb_connection_string
MONGODB_DATABASE=your_database_name
MONGODB_COLLECTION=your_collection_name
EXPORT_URLS=./export_list.txt
OUTPUT_DIR=./output/
- `MONGODB_ATLAS_URI`: Your MongoDB connection string.
- `MONGODB_DATABASE`: The name of the database containing your CMS content.
- `MONGODB_COLLECTION`: The name of the collection holding the content documents.
- `EXPORT_URLS`: The path to a text file containing the unique identifiers (like URL slugs) of the content entries you want to export, one identifier per line.
- `OUTPUT_DIR`: The directory where the exported content and assets will be saved.
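For example, a hypothetical export_list.txt might contain one slug per line (the loader shown later strips trailing slashes):
/blog/my-first-post
/blog/migrating-to-mongodb/
/docs/getting-started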
Security Note: Remember to add `.env` to your `.gitignore` file if you’re using Git, to prevent accidentally committing sensitive credentials.
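On macOS or Linux, a quick way to do this from the terminal:
echo ".env" >> .gitignore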
Now, let’s add the basic structure to `main.js`, including required modules and database connection logic:
const { MongoClient } = require("mongodb");
const fs = require("node:fs/promises"); // For async file operations
const { createWriteStream } = require("fs"); // For streaming file writes
const axios = require("axios");
// Load environment variables from .env file
require("dotenv").config();
// Read configuration from environment variables
const MONGODB_URI = process.env.MONGODB_ATLAS_URI;
const MONGODB_DATABASE = process.env.MONGODB_DATABASE;
const MONGODB_COLLECTION = process.env.MONGODB_COLLECTION;
const EXPORT_URLS_FILE = process.env.EXPORT_URLS;
const OUTPUT_DIR = process.env.OUTPUT_DIR;
// Initialize MongoDB client
const mongoClient = new MongoClient(MONGODB_URI);
let database, collection;
// Main async function to orchestrate the export
(async () => {
try {
// Connect to MongoDB
await mongoClient.connect();
console.log("Connected to MongoDB.");
database = mongoClient.db(MONGODB_DATABASE);
collection = database.collection(MONGODB_COLLECTION);
// --- Core logic will go here ---
console.log("Starting content export process...");
// Placeholder for export steps
// 1. Load identifiers (slugs)
// 2. Query MongoDB for content
// 3. Process each content entry:
// - Save Markdown/content
// - Extract asset URLs
// - Download assets
// - Save metadata
console.log("Export process completed.");
} catch (e) {
console.error("ERROR during export process: ", e.message);
} finally {
// Ensure the client connection is closed
await mongoClient.close();
console.log("Disconnected from MongoDB.");
}
})();
// --- Helper functions will be defined below ---
This sets up the connection and provides a structure for the export logic.
Downloading Content from MongoDB
The first step is to retrieve the actual content (e.g., Markdown) and associated metadata from MongoDB based on a list of identifiers. In this example, identifiers are URL slugs stored in the file specified by `EXPORT_URLS`.
Let’s create a function to read these identifiers from the file:
// Function to load identifiers (e.g., URL slugs) from the specified file
async function loadIdentifiersFromFile(filePath, removeTrailingSlash = false) {
try {
const data = await fs.readFile(filePath, { encoding: "utf8" });
let identifiers = data.split('\n').map(id => id.trim()).filter(id => id !== ''); // Split by line, trim whitespace (handles Windows \r\n endings), and drop empty lines
if (removeTrailingSlash) {
identifiers = identifiers.map(id => id.replace(/\/$/, "")); // Remove trailing slash if needed
}
console.log(`Loaded ${identifiers.length} identifiers from ${filePath}`);
return identifiers;
} catch (error) {
console.error(`Error reading identifier file ${filePath}:`, error.message);
throw error; // Re-throw to stop the process if file read fails
}
}
Now, integrate this into the main async function and add the MongoDB query logic using an aggregation pipeline:
// Inside the main async function's try block...
// 1. Load identifiers (slugs)
const slugsToExport = await loadIdentifiersFromFile(EXPORT_URLS_FILE, true); // Remove trailing slashes
if (slugsToExport.length === 0) {
console.log("No identifiers found to export. Exiting.");
return;
}
// 2. Query MongoDB for content using aggregation
console.log(`Querying MongoDB for ${slugsToExport.length} entries...`);
const cursor = collection.aggregate([
{
"$match": {
// Adapt 'calculated_slug' to your actual field name for the identifier
"calculated_slug": { "$in": slugsToExport }
}
},
{
"$project": {
// Define the fields you want to export
_id: 0, // Exclude the default _id field
title: "$name", // Rename 'name' field to 'title'
description: 1, // Include 'description' field
slug: "$calculated_slug", // Include identifier field as 'slug'
authors: 1, // Include 'authors' field
tags: 1, // Include 'tags' field
content: 1 // Include the main content field
// Add or modify fields based on your data model
}
},
]);
const matchedContent = await cursor.toArray();
console.log(`Found ${matchedContent.length} matching content entries in MongoDB.`);
// 3. Process each content entry (Loop placeholder)
for (const entry of matchedContent) {
console.log(`Processing entry: ${entry.slug}`);
// Steps for saving content, assets, metadata go here...
}
This aggregation pipeline first filters (`$match`) documents whose identifier field (`calculated_slug` in this case) is in our list of `slugsToExport`. Then, it reshapes (`$project`) the output to include only the necessary fields, potentially renaming them for clarity.
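Given the hypothetical document shown earlier, each element of matchedContent would then look something like this:
{
    title: "My First Post",
    description: "A short introduction to the blog.",
    slug: "/blog/my-first-post",
    authors: ["Jane Doe"],
    tags: ["mongodb", "migration"],
    content: "# My First Post\n\n![hero](https://cdn.example.com/images/hero-image.png)\n\nBody text..."
}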
Saving Content and Metadata
With the data fetched, we need functions to save the main content (Markdown) and its metadata to the filesystem in an organized way.
// Function to save Markdown content to a file
// Creates directory structure: OUTPUT_DIR/entry-slug/entry-slug.md
async function saveMarkdownToFile(slug, markdownContent) {
const entryName = slug.split("/").filter(part => part !== "").pop(); // Get last part of slug for filename
if (!entryName) {
console.warn(`Could not determine entry name from slug: ${slug}. Skipping save.`);
return;
}
const entryDir = `${OUTPUT_DIR}${entryName}/`;
const filePath = `${entryDir}${entryName}.md`;
try {
await fs.mkdir(entryDir + 'assets/', { recursive: true }); // Ensure directory exists (including assets subdir)
await fs.writeFile(filePath, markdownContent || ''); // Write content (handle null/undefined)
console.log(` -> Saved Markdown to: ${filePath}`);
} catch (error) {
console.error(`Error saving Markdown for ${slug}:`, error.message);
}
}
// Function to save metadata (as JSON) to a file
// Creates directory structure: OUTPUT_DIR/entry-slug/meta.json
async function saveMetadataToFile(slug, metadata) {
const entryName = slug.split("/").filter(part => part !== "").pop();
if (!entryName) {
console.warn(`Could not determine entry name from slug: ${slug}. Skipping metadata save.`);
return;
}
const entryDir = `${OUTPUT_DIR}${entryName}/`;
const filePath = `${entryDir}meta.json`;
try {
await fs.mkdir(entryDir + 'assets/', { recursive: true }); // Ensure directory exists
// Remove the large content field before saving metadata
delete metadata.content;
await fs.writeFile(filePath, JSON.stringify(metadata, null, 4)); // Pretty print JSON
console.log(` -> Saved Metadata to: ${filePath}`);
} catch (error) {
console.error(`Error saving metadata for ${slug}:`, error.message);
}
}
These functions derive a directory name from the slug, create the necessary directories (including an `assets` subdirectory for media), and save the Markdown content and the remaining metadata (as a JSON file) respectively.
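Assuming a slug of /blog/my-first-post, the layout under OUTPUT_DIR would look like this (the file under assets/ appears once asset downloads are wired in below):
output/
└── my-first-post/
    ├── my-first-post.md
    ├── meta.json
    └── assets/
        └── hero-image.png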
Now, update the loop in the main function:
// Inside the main async function's try block, within the loop...
for (const entry of matchedContent) {
console.log(`Processing entry: ${entry.slug}`);
// Save the Markdown content
await saveMarkdownToFile(entry.slug, entry.content);
// Placeholder for asset handling...
// Save the metadata (content field will be removed by the function)
// Make a copy to avoid modifying the original object if needed elsewhere
const metadataToSave = { ...entry };
await saveMetadataToFile(entry.slug, metadataToSave);
}
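For the hypothetical entry above, the saved meta.json would contain everything except the stripped content field:
{
    "title": "My First Post",
    "description": "A short introduction to the blog.",
    "slug": "/blog/my-first-post",
    "authors": ["Jane Doe"],
    "tags": ["mongodb", "migration"]
}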
Saving Externally Hosted Media Assets
A crucial part of CMS migration is handling media assets. If asset URLs are embedded directly within your content (e.g., in Markdown image tags such as `![alt text](https://example.com/image.png)`), you’ll need to extract these URLs and download the files.
First, a function to extract asset URLs using regular expressions:
// Function to extract asset URLs (images, pdfs, etc.) from text content
function extractAssetUrls(textContent) {
if (!textContent) return [];
// Regex to find absolute URLs ending in common media extensions
// Adjust the extensions (png|jpg|...) as needed for your asset types
const assetUrlRegex = /(https?:\/\/.*?\.(?:png|jpg|jpeg|gif|webp|pdf|svg))/gi;
let assetUrls = textContent.match(assetUrlRegex) || [];
// Optional: Handle specific URL structures, like those hidden in query parameters
// Example: https://example.com/image-proxy?url=https://real-image.com/img.jpg
assetUrls = assetUrls.map(asset => {
try {
const urlObject = new URL(asset);
const urlParams = new URLSearchParams(urlObject.search);
if (urlParams.has("url")) {
// If a 'url' query parameter exists, use its value as the actual asset URL
return urlParams.get("url");
}
} catch (e) {
// Ignore invalid URLs during parsing
}
return asset; // Return original URL if no special handling needed
});
// Return unique URLs
return [...new Set(assetUrls)];
}
This function scans the content for URLs matching common image and PDF extensions. It also includes an example of handling URLs hidden within query parameters, which might be necessary depending on how your original CMS served assets.
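A quick illustration with hypothetical URLs shows both behaviors, including the query-parameter unwrapping:
const sample = `
![hero](https://cdn.example.com/images/hero-image.png)
[spec sheet](https://example.com/files/spec.pdf)
![proxied](https://example.com/image-proxy?url=https://real-image.com/img.jpg)
`;
console.log(extractAssetUrls(sample));
// [
//   'https://cdn.example.com/images/hero-image.png',
//   'https://example.com/files/spec.pdf',
//   'https://real-image.com/img.jpg'
// ]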
Next, a function to download a single asset using Axios:
// Function to download an asset from a URL and save it locally
async function downloadAsset(assetUrl, slug) {
const entryName = slug.split("/").filter(part => part !== "").pop();
if (!entryName) {
console.warn(`Could not determine entry name from slug: ${slug}. Skipping asset download: ${assetUrl}`);
return;
}
let fileName;
try {
// Extract filename from the asset URL path
const urlPath = new URL(assetUrl).pathname;
fileName = urlPath.substring(urlPath.lastIndexOf('/') + 1);
if (!fileName) throw new Error("Could not determine filename.");
} catch (e) {
console.error(` -> Error parsing asset URL or filename for ${assetUrl}: ${e.message}. Skipping download.`);
return; // Skip if URL is invalid or filename can't be determined
}
const assetsDir = `${OUTPUT_DIR}${entryName}/assets/`;
const filePath = `${assetsDir}${fileName}`;
try {
// Make HTTP GET request with Axios, requesting response as a stream
const response = await axios({
method: 'GET',
url: assetUrl,
responseType: 'stream',
});
// Pipe the response stream directly to a file write stream
const writer = createWriteStream(filePath);
response.data.pipe(writer);
// Return a promise that resolves on successful download or rejects on error
return new Promise((resolve, reject) => {
writer.on('finish', () => {
console.log(` -> Downloaded asset: ${fileName}`);
resolve(filePath);
});
writer.on('error', (err) => {
console.error(` -> Error writing asset file ${fileName}:`, err.message);
// Attempt to clean up incomplete file
fs.unlink(filePath).catch(e => console.error(` -> Failed to delete incomplete file ${filePath}: ${e.message}`));
reject(err);
});
response.data.on('error', (err) => { // Catch errors on the response stream itself
console.error(` -> Error downloading asset ${assetUrl}:`, err.message);
reject(err);
});
});
} catch (error) {
// Handle HTTP errors (like 404 Not Found)
if (error.response) {
console.warn(` -> Skipping asset download due to HTTP error (${error.response.status}): ${assetUrl}`);
} else {
console.error(` -> Skipping asset download due to network or other error: ${assetUrl}`, error.message);
}
return Promise.resolve(); // Resolve (rather than reject) on HTTP errors so the rest of the batch proceeds cleanly
}
}
This function takes the asset URL and the content slug (for directory structure), extracts the filename, uses Axios to fetch the asset as a data stream, and pipes that stream directly into a local file using `fs.createWriteStream`. It includes error handling for network issues, HTTP errors (like 404s), and file-writing problems.
Finally, integrate asset handling into the main loop. We’ll extract URLs from the content, then download them concurrently for efficiency. Note that `Promise.allSettled` is used rather than `Promise.all`, so a single failed download (whose promise rejects) doesn’t abort the remaining downloads or prevent the entry’s metadata from being saved:
// Inside the main async function's try block, within the loop...
for (const entry of matchedContent) {
console.log(`Processing entry: ${entry.slug}`);
// Save the Markdown content
await saveMarkdownToFile(entry.slug, entry.content);
// Extract and download assets
const assetUrls = extractAssetUrls(entry.content);
if (assetUrls.length > 0) {
console.log(` -> Found ${assetUrls.length} potential assets for ${entry.slug}. Downloading...`);
// Download all assets concurrently; allSettled ensures one failed
// download doesn't reject the batch and abort the whole export
const downloadPromises = assetUrls.map(url => downloadAsset(url, entry.slug));
await Promise.allSettled(downloadPromises);
console.log(` -> Finished asset downloads for ${entry.slug}.`);
} else {
console.log(` -> No assets found in content for ${entry.slug}.`);
}
// Save the metadata (content field will be removed by the function)
const metadataToSave = { ...entry };
await saveMetadataToFile(entry.slug, metadataToSave);
}
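If some entries reference a large number of assets, launching every download at once can strain the source server or your network. As an optional refinement (a sketch, not part of the script above; the helper name and batch size are illustrative), you could process downloads in fixed-size batches:
// Optional helper: download assets a few at a time instead of all at once.
// Uses Promise.allSettled so one failed download doesn't stop the batch.
async function downloadAssetsInBatches(assetUrls, slug, batchSize = 5) {
    for (let i = 0; i < assetUrls.length; i += batchSize) {
        const batch = assetUrls.slice(i, i + batchSize);
        await Promise.allSettled(batch.map(url => downloadAsset(url, slug)));
    }
}
To use it, replace the Promise.allSettled call in the loop with await downloadAssetsInBatches(assetUrls, entry.slug);.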
Running the Migration Script
With all the code in place (the full script combining all pieces should be assembled in `main.js`), ensure your `.env` file is correctly configured with your MongoDB details, the path to your identifiers list (`export_list.txt`), and your desired output directory.
Then, run the script from your terminal:
node main.js
The script will connect to MongoDB, read the identifiers, query the database, and then loop through each found entry, saving its Markdown, downloading its assets, and saving its metadata to your specified output directory. Monitor the console output for progress and any errors. Depending on the number of entries and assets, this process might take some time.
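For the hypothetical three-entry list shown earlier, a successful run prints output along these lines (abridged; the messages come from the script’s own console.log calls):
$ node main.js
Connected to MongoDB.
Starting content export process...
Loaded 3 identifiers from ./export_list.txt
Querying MongoDB for 3 entries...
Found 3 matching content entries in MongoDB.
Processing entry: /blog/my-first-post
 -> Saved Markdown to: ./output/my-first-post/my-first-post.md
 -> Found 1 potential assets for /blog/my-first-post. Downloading...
 -> Downloaded asset: hero-image.png
 -> Finished asset downloads for /blog/my-first-post.
 -> Saved Metadata to: ./output/my-first-post/meta.json
...
Export process completed.
Disconnected from MongoDB.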
Conclusion
Migrating CMS data, especially when dealing with both structured content in MongoDB and externally linked media assets, requires a systematic approach. This guide demonstrated how to build a Node.js script to:
- Connect to MongoDB.
- Query for specific content entries based on identifiers.
- Save the main content (e.g., Markdown) and metadata to local files.
- Parse content to find URLs of embedded media assets.
- Download these assets and store them locally, organized alongside their corresponding content.
While the specific field names and data structures will vary based on your CMS implementation, the fundamental techniques of using the MongoDB Node.js driver for querying, Axios for HTTP requests, and Node.js file system modules for saving data provide a solid foundation for tackling such migration tasks.
How Innovative Software Technology Can Help: Facing challenges with complex CMS migrations, data extraction from MongoDB, or managing digital assets? Innovative Software Technology specializes in crafting bespoke software solutions tailored to your unique needs. Our expertise spans MongoDB data handling, Node.js scripting for automation, and seamless integration strategies. We can help you design and implement robust export processes, ensure data integrity during transitions, and build custom tools to streamline your content management workflows. Partner with Innovative Software Technology for efficient, reliable, and scalable solutions to your data migration and software development challenges, ensuring your content and assets are handled with precision and care.