Efficiently Scraping and Storing Legal Documents from JDIHN with PHP
This article details a method for extracting and storing legal document data from the JDIHN (Jaringan Dokumentasi dan Informasi Hukum Nasional) website, which is Indonesia’s national legal documentation and information network. This process uses a PHP script that systematically scrapes data and stores it in a structured format.
Understanding the Process: Web Scraping and Data Storage
Web scraping is the automated process of extracting data from websites. In this scenario, we’re targeting specific information about legal documents available on JDIHN. The extracted data is then organized and inserted into a Supabase database for easy access and management.
The PHP Script: A Step-by-Step Breakdown
The core of this solution is a PHP script that leverages the `DiDom` library for HTML parsing and `curl` for handling HTTP requests. Let's dissect the key components:
- Dependencies and Setup:
  - The script begins by requiring `vendor/autoload.php`, which is standard practice when using Composer to manage PHP dependencies; this loads the `DiDom` library.
  - The `DiDom\Document` class is imported, providing tools to navigate and extract data from HTML.
- `safe_url()` Function:
  - This utility function takes text input and converts it into a URL-friendly string. It does this by:
    - Converting the text to lowercase.
    - Replacing any non-alphanumeric characters with underscores (`_`).
    - Trimming any leading or trailing underscores.
  - The result serves as a clean, consistent identifier; a minimal sketch of this helper appears after this list.
- `insertToSupabase()` Function:
  - This function handles the insertion of scraped data into a Supabase database; a hedged sketch of it also follows the list.
  - It defines the Supabase API credentials `$supabaseUrl` and `$supabaseKey`. (Important: these should be treated as sensitive information and stored securely, ideally in environment variables.)
  - `$tableName` specifies the target table in Supabase (in this case "peraturan", which translates to "regulations").
  - The function formats the input `$data` array into key-value pairs, skipping empty fields.
  - It then encodes the formatted data as JSON.
  - A `curl` request is initialized and configured to send a POST request to the Supabase REST API, with the necessary headers set, including the API key and authorization token.
  - The JSON payload is sent, and the response from Supabase is received.
  - Finally, the function returns the HTTP status code and the API response (decoded from JSON).
- The Scraping Loop (a sketch of the complete loop appears after this list):
  - A `for` loop iterates backward from a starting ID (1986526) down to 0. This ID is used to construct the URL of the detail page on JDIHN.
  - Inside the loop, the target URL (`$uri`) is created using the current ID, and `file_get_contents()` attempts to retrieve the HTML content of the page. The `@` symbol suppresses warnings if the page is not found.
  - Error Handling: If `file_get_contents()` fails (returns `false`), an error message ("page not found") is displayed, and the loop continues to the next iteration.
  - HTML Parsing: If the page is successfully retrieved, a new `DiDom\Document` object is created, representing the HTML content.
  - Table Extraction: The script attempts to find the first `<table>` element within the document using `$document->first('table')`.
  - Table Error Handling: If no table is found, an error message ("table not found") is displayed, and the loop continues.
  - Row and Cell Extraction: If a table is found, the script retrieves all `<tr>` (table row) elements, then iterates through each row and extracts its `<td>` (table data) cells.
  - Data Cleaning and Formatting:
    - The `safe_url()` function cleans the text content of the first cell (`$cells[0]`), which becomes the field name.
    - The script checks for links (`<a>` tags) within the third cell (`$cells[2]`). If a link is found, its `href` attribute (the URL) is extracted; otherwise, the text content of the cell is used.
    - Empty data rows are skipped.
  - Adding a URL ID: The current iteration ID (`$i`) is added to the data array as "url_id".
  - Database Insertion: The `insertToSupabase()` function is called to insert the extracted data into the Supabase table.
  - Feedback and Delay: A success message ("maybe [OK]") is printed to the console, and a random delay (`usleep()`) is introduced between requests to avoid overloading the target server. This is good web scraping etiquette.
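To make the walkthrough concrete, here is a minimal sketch of the `safe_url()` helper. It is reconstructed from the description above rather than copied from the original script, so details may differ:

```php
<?php
// Lowercase the input, collapse any non-alphanumeric characters into
// underscores, and trim stray underscores from both ends.
function safe_url(string $text): string
{
    $text = strtolower($text);
    $text = preg_replace('/[^a-z0-9]+/', '_', $text);
    return trim($text, '_');
}

// Example: "Tanggal Penetapan" becomes "tanggal_penetapan".
echo safe_url('Tanggal Penetapan'), PHP_EOL;
```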
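The `insertToSupabase()` function can be sketched roughly as follows. This is an approximation based on the description: the credentials are read from environment variables here (the original hard-codes them), and the endpoint follows the standard Supabase REST (PostgREST) URL scheme:

```php
<?php
// Send one row of scraped data to the "peraturan" table via the Supabase REST API.
function insertToSupabase(array $data): array
{
    $supabaseUrl = getenv('SUPABASE_URL'); // e.g. https://your-project.supabase.co
    $supabaseKey = getenv('SUPABASE_KEY'); // keep this secret
    $tableName   = 'peraturan';

    // Keep only non-empty fields so Supabase applies column defaults elsewhere.
    $payload = array_filter($data, fn ($value) => $value !== null && $value !== '');
    $json    = json_encode($payload);

    $ch = curl_init($supabaseUrl . '/rest/v1/' . $tableName);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $json,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HTTPHEADER     => [
            'Content-Type: application/json',
            'apikey: ' . $supabaseKey,
            'Authorization: Bearer ' . $supabaseKey,
            'Prefer: return=representation',
        ],
    ]);

    $response = curl_exec($ch);
    $status   = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    return [$status, $response === false ? null : json_decode($response, true)];
}
```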
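Finally, a condensed sketch of the scraping loop itself, assuming the two helpers above. The detail-page URL pattern and the delay bounds are placeholders, since the article does not reproduce the exact values used by the original script:

```php
<?php
require 'vendor/autoload.php';

use DiDom\Document;

for ($i = 1986526; $i >= 0; $i--) {
    $uri  = 'https://example.jdihn.test/detail/' . $i; // placeholder: substitute the real JDIHN detail URL
    $html = @file_get_contents($uri);                  // @ suppresses warnings for missing pages

    if ($html === false) {
        echo "$i page not found\n";
        continue;
    }

    $document = new Document($html);
    $table    = $document->first('table');

    if ($table === null) {
        echo "$i table not found\n";
        continue;
    }

    $data = [];
    foreach ($table->find('tr') as $row) {
        $cells = $row->find('td');
        if (count($cells) < 3) {
            continue; // skip rows that do not hold a label/value pair
        }

        $key   = safe_url($cells[0]->text());
        $link  = $cells[2]->first('a');
        $value = $link !== null ? $link->getAttribute('href') : trim($cells[2]->text());

        if ($key === '' || $value === '') {
            continue; // skip empty rows
        }
        $data[$key] = $value;
    }

    $data['url_id'] = (string) $i;
    insertToSupabase($data);

    echo "$i maybe [OK]\n";
    usleep(rand(500000, 1500000)); // random 0.5-1.5 s pause between requests
}
```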
Supabase Table Structure
The script is designed to work with a specific Supabase table structure. The provided SQL creates the `peraturan` table with the following columns:
```sql
CREATE TABLE peraturan (
asal_dokumen TEXT,
jenis_dokumen TEXT,
nomor TEXT,
tahun TEXT,
judul TEXT,
t_e_u TEXT,
singkatan_jenis TEXT,
tempat_terbit TEXT,
tanggal_penetapan TEXT,
tanggal_pengundangan TEXT,
subyek TEXT,
status TEXT,
penandatangan TEXT,
sumber TEXT,
bahasa TEXT,
unduhan TEXT,
unduhan_alternatif TEXT,
abstrak TEXT,
unduhan_abstrak_alternatif TEXT,
detil_dokumen TEXT,
url_id TEXT,
id SERIAL PRIMARY KEY -- auto-incrementing primary key
);
```
The important point is the `id` column: you can replace `id SERIAL PRIMARY KEY` with an `id` column of whatever datatype you prefer, as long as it is marked as an auto-incrementing primary key.
Running the Script
The instructions indicate that the script should be run from the terminal using `nohup`, so it keeps running even if the terminal session is closed:

```bash
nohup php script.php &
```

Replace `script.php` with the actual name of your PHP file. The trailing `&` runs the script in the background, and output can additionally be redirected (for example, `> scrape.log 2>&1`) to keep a log of progress.
Key Improvements and Considerations
- Error Handling: The script includes basic error handling for the page-not-found and table-not-found cases. More robust error handling could log errors, retry failed requests, or react to specific HTTP status codes; a small retry sketch follows this list.
- Rate Limiting: The `usleep()` call introduces a delay to avoid overwhelming the JDIHN server. Consider implementing a more sophisticated rate-limiting mechanism to comply with the website's terms of service.
- Data Validation: Adding data validation checks before inserting data into the database can improve data quality.
- Scalability: For large-scale scraping, consider using a queue system to manage requests and distribute the workload.
- Security: Store API keys and other sensitive information securely, preferably using environment variables.
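As an illustration of the error-handling and rate-limiting points above, here is a small pair of helpers. The function names, retry counts, and delay values are hypothetical starting points, not part of the original script:

```php
<?php
// Retry a fetch with exponential backoff: wait 1 s, 2 s, 4 s, ... between attempts.
function fetchWithRetry(string $uri, int $maxAttempts = 3): ?string
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $html = @file_get_contents($uri);
        if ($html !== false) {
            return $html;
        }
        sleep(2 ** ($attempt - 1));
    }
    return null; // give up after $maxAttempts failures
}

// Polite, jittered pause between successive requests (1-3 seconds).
function politePause(): void
{
    usleep(rand(1000000, 3000000));
}
```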
How Innovative Software Technology Can Help
At Innovative Software Technology, we specialize in developing custom web scraping solutions and data management systems. We can help your organization efficiently extract, process, and store data from various online sources, just like the JDIHN legal document scraping example above. Our expertise in PHP development, web scraping best practices, database design (including Supabase), and API integration ensures that we can build a robust and scalable solution tailored to your specific needs. We prioritize SEO-friendly data structures and efficient data retrieval, enabling you to maximize the value of the information we gather. We also offer services in data cleaning, data transformation, and data visualization, turning raw scraped data into actionable insights. Contact us to discuss how we can leverage web scraping and data management to enhance your business operations and improve your search engine rankings through optimized data accessibility.