Efficiently Scraping and Storing Legal Documents from JDIHN with PHP

This article details a method for extracting and storing legal document data from the JDIHN (Jaringan Dokumentasi dan Informasi Hukum Nasional) website, which is Indonesia’s national legal documentation and information network. This process uses a PHP script that systematically scrapes data and stores it in a structured format.

Understanding the Process: Web Scraping and Data Storage

Web scraping is the automated process of extracting data from websites. In this scenario, we’re targeting specific information about legal documents available on JDIHN. The extracted data is then organized and inserted into a Supabase database for easy access and management.

The PHP Script: A Step-by-Step Breakdown

The core of this solution is a PHP script that leverages the DiDom library for HTML parsing and PHP’s cURL extension for handling HTTP requests. Let’s dissect the key components:

  1. Dependencies and Setup:
    • The script begins by requiring vendor/autoload.php, which is standard practice when using Composer to manage PHP dependencies. This likely loads the DiDom library.
    • The DiDom\Document class is imported, providing tools to navigate and extract data from HTML.
  2. safe_url() Function:
    • This utility function converts text input into a URL-friendly string, producing clean and consistent identifiers (a sketch appears after this list). It does this by:
      • Converting the text to lowercase.
      • Replacing any non-alphanumeric characters with underscores (_).
      • Trimming any leading or trailing underscores.
  3. insertToSupabase() Function:
    • This function handles the insertion of scraped data into a Supabase database (a sketch appears after this list).
    • It defines the Supabase API credentials, $supabaseUrl and $supabaseKey. (Important: these should be treated as sensitive information and stored securely, ideally in environment variables.)
    • $tableName specifies the target table in Supabase (in this case, “peraturan,” which translates to “regulations”).
    • The function reduces the input $data array to key-value pairs, skipping empty fields, and encodes the result as JSON.
    • A cURL request is initialized and configured to send a POST request to the Supabase REST API, with the necessary headers (including the API key and authorization token) set.
    • The JSON payload is sent, and the response from Supabase is received.
    • Finally, the function returns the HTTP status code and the API response (decoded from JSON).
  4. The Scraping Loop (a sketch appears after this list):
    • A for loop iterates backward from a starting ID (1986526) down to 0. This ID is used to construct the URL of the detail page on JDIHN.
    • Inside the loop, the target URL ($uri) is created using the current ID.
    • file_get_contents() attempts to retrieve the HTML content of the page. The @ symbol suppresses warnings if the page is not found.
    • Error Handling: If file_get_contents() fails (returns false), an error message (“page not found”) is displayed, and the loop continues to the next iteration.
    • HTML Parsing: If the page is successfully retrieved, a new DiDom\Document object is created, representing the HTML content.
    • Table Extraction: The script attempts to find the first <table> element within the document using $document->first('table').
    • Table Error Handling: If no table is found, an error message (“table not found”) is displayed, and the loop continues.
    • Row and Cell Extraction: If a table is found, the script retrieves all <tr> (table row) elements. It then iterates through each row and extracts the <td> (table data) elements.
    • Data Cleaning and Formatting:
      • The safe_url() function is used to clean the text content of the first cell ($cells[0]).
      • The script checks for links (<a> tags) within the third cell ($cells[2]). If links are found, the href attribute (the URL) is extracted. If no links are present, the text content of the cell is used.
      • Empty data rows are skipped.
    • Adding a URL ID: The current iteration ID ($i) is added to the data array as “url_id”.
    • Database Insertion: The insertToSupabase() function is called to insert the extracted data into the Supabase table.
    • Feedback and Delay: A success message (“maybe [OK]”) is printed to the console. A random delay (usleep()) is introduced between requests to avoid overloading the target server. This is good web scraping etiquette.
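
Based on the description above, a minimal sketch of safe_url() might look like the following; the exact regular expression used in the original script is an assumption:

<?php
// Sketch of safe_url(): lowercase the text, replace any run of
// non-alphanumeric characters with an underscore, and trim the result.
function safe_url(string $text): string
{
    $text = strtolower($text);
    $text = preg_replace('/[^a-z0-9]+/', '_', $text);
    return trim($text, '_');
}

For example, safe_url('Tanggal Penetapan') would return 'tanggal_penetapan', matching the column names in the table shown later.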
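
insertToSupabase() can be sketched in a similar way, assuming the standard Supabase REST endpoint /rest/v1/{table}; the placeholder credentials below stand in for the real values:

<?php
// Sketch of insertToSupabase(): drop empty fields, encode the row as JSON,
// and POST it to the Supabase REST API using cURL.
function insertToSupabase(array $data): array
{
    $supabaseUrl = 'https://your-project.supabase.co'; // placeholder; load from an environment variable in practice
    $supabaseKey = 'YOUR_SUPABASE_KEY';                // placeholder; never commit the real key
    $tableName   = 'peraturan';

    // Keep only non-empty fields and encode the row as JSON.
    $payload = array_filter($data, fn ($value) => $value !== null && $value !== '');
    $json    = json_encode($payload);

    $ch = curl_init($supabaseUrl . '/rest/v1/' . $tableName);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $json);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, [
        'apikey: ' . $supabaseKey,
        'Authorization: Bearer ' . $supabaseKey,
        'Content-Type: application/json',
        'Prefer: return=representation',
    ]);

    $response   = curl_exec($ch);
    $statusCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    return ['status' => $statusCode, 'body' => json_decode($response, true)];
}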
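
Finally, the scraping loop itself can be sketched as follows. The detail-page URL pattern and the exact cell layout are assumptions based on the description above:

<?php
use DiDom\Document;

require 'vendor/autoload.php';

for ($i = 1986526; $i >= 0; $i--) {
    // Hypothetical detail-page URL pattern; adjust to the real JDIHN route.
    $uri  = 'https://jdihn.go.id/pencarian/detail/' . $i;
    $html = @file_get_contents($uri); // @ suppresses warnings for missing pages

    if ($html === false) {
        echo "$i: page not found\n";
        continue;
    }

    $document = new Document($html);
    $table    = $document->first('table');

    if ($table === null) {
        echo "$i: table not found\n";
        continue;
    }

    $data = [];
    foreach ($table->find('tr') as $row) {
        $cells = $row->find('td');
        if (count($cells) < 3) {
            continue; // skip rows without the expected label/value cells
        }

        $key  = safe_url($cells[0]->text()); // first cell: field label
        $link = $cells[2]->first('a');       // third cell: value or download link

        $value = $link !== null ? $link->getAttribute('href') : trim($cells[2]->text());

        if ($key !== '' && $value !== '') {
            $data[$key] = $value;
        }
    }

    $data['url_id'] = (string) $i;
    insertToSupabase($data);

    echo "$i: maybe [OK]\n";
    usleep(random_int(500000, 1500000)); // random delay between requests
}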

Supabase Table Structure

The script is designed to work with a specific Supabase table structure. The provided SQL creates the peraturan table with the following columns:

CREATE TABLE peraturan (
    asal_dokumen TEXT,
    jenis_dokumen TEXT,
    nomor TEXT,
    tahun TEXT,
    judul TEXT,
    t_e_u TEXT,
    singkatan_jenis TEXT,
    tempat_terbit TEXT,
    tanggal_penetapan TEXT,
    tanggal_pengundangan TEXT,
    subyek TEXT,
    status TEXT,
    penandatangan TEXT,
    sumber TEXT,
    bahasa TEXT,
    unduhan TEXT,
    unduhan_alternatif TEXT,
    abstrak TEXT,
    unduhan_abstrak_alternatif TEXT,
    detil_dokumen TEXT,
    url_id TEXT,
    id SERIAL PRIMARY KEY -- Add this line
);

The important point is that the table needs an id column defined as an auto-incrementing primary key. id SERIAL PRIMARY KEY accomplishes this in PostgreSQL (which Supabase uses); you can choose a different integer type if you prefer, as long as the column remains an auto-incrementing primary key.

Running the Script

The instructions indicate that the script should be run from the terminal using nohup. This command ensures that the script continues running even if the terminal session is closed:

nohup php script.php

Replace script.php with the actual name of your PHP file. Appending & runs the script in the background, and nohup writes its output to nohup.out by default, so you can close the terminal and check progress later.

Key Improvements and Considerations

  • Error Handling: The script includes basic error handling for page and table not found scenarios. More robust error handling could be added to log errors, retry failed requests, or handle different HTTP status codes.
  • Rate Limiting: The usleep() function introduces a delay to avoid overwhelming the JDIHN server. Consider implementing a more sophisticated rate-limiting mechanism to comply with the website’s terms of service.
  • Data Validation: Adding data validation checks before inserting data into the database can improve data quality.
  • Scalability: For large-scale scraping, consider using a queue system to manage requests and distribute the workload.
  • Security: Store API keys and other sensitive information securely, preferably using environment variables (see the sketch after this list).
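
For instance, the Supabase credentials could be read from environment variables instead of being hard-coded; the variable names SUPABASE_URL and SUPABASE_KEY below are illustrative, not part of the original script:

<?php
// Load Supabase credentials from the environment rather than the source code.
$supabaseUrl = getenv('SUPABASE_URL');
$supabaseKey = getenv('SUPABASE_KEY');

if ($supabaseUrl === false || $supabaseKey === false) {
    fwrite(STDERR, "Missing SUPABASE_URL or SUPABASE_KEY environment variable\n");
    exit(1);
}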

How Innovative Software Technology Can Help

At Innovative Software Technology, we specialize in developing custom web scraping solutions and data management systems. We can help your organization efficiently extract, process, and store data from various online sources, just like the JDIHN legal document scraping example above. Our expertise in PHP development, web scraping best practices, database design (including Supabase), and API integration ensures that we can build a robust and scalable solution tailored to your specific needs. We prioritize SEO-friendly data structures and efficient data retrieval, enabling you to maximize the value of the information we gather. We also offer services in data cleaning, data transformation, and data visualization, turning raw scraped data into actionable insights. Contact us to discuss how we can leverage web scraping and data management to enhance your business operations and improve your search engine rankings through optimized data accessibility.
