Facebook iconUsing Puppeteer on AWS Lambda for Headless Automation
WhatsApp Icon (SVG)
Using Puppeteer on AWS Lambda for Headless Automation Hero

Web automation and scraping have become essential tools for businesses and developers alike today. Enter Puppeteer, a powerful Node.js library that provides a high-level API to control Chrome or Chromium browsers programmatically. Puppeteer excels in tasks such as web scraping, automated testing, and generating screenshots and PDFs of web pages.

Key features of Puppeteer

  • Full control over a headless Chrome instance.
  • Ability to interact with web pages just like a real user.
  • Support for generating PDFs and screenshots.
  • Automated form submission and UI testing.

Common use cases for Puppeteer range from data extraction and market research to automated testing of web applications and generating pre-rendered content for static websites.

Benefits of Serverless Computing

Before we dive into combining Puppeteer with serverless architecture, let's briefly explore the benefits of serverless computing:

  • Scalability: Serverless platforms automatically scale your application in response to demand, ensuring optimal performance during traffic spikes.
  • Cost-efficiency: You only pay for the actual computing time used, making it ideal for sporadic or unpredictable workloads.
  • Reduced management overhead: Serverless eliminates the need for server provisioning, patching, and maintenance, allowing developers to focus on writing code.
  • Event-driven execution: Serverless functions can be triggered by various events, enabling responsive and efficient application architectures.

Maximizing Serverless Puppeteer Automation

Combining Puppeteer with serverless architecture offers a powerful solution for web automation tasks. This approach allows you to run Puppeteer scripts on-demand without managing dedicated servers.

The Benefits Include:

  • Reduced infrastructure costs.
  • Automatic scaling to handle varying workloads.
  • Simplified deployment and management.

Moreover, by utilizing layers in serverless platforms, we can enhance the reusability and modularity of our Puppeteer-based functions, making it easier to maintain and update our automation scripts.

Setting up 

To get started with serverless Puppeteer automation, follow these steps:

Step 1. Install Puppeteer

Step 2. Choose a serverless platform: Popular options include AWS Lambda, Azure Functions, and Google Cloud Functions. For this guide, we are going to use AWS Lambda.

Step 3. Set up the AWS CLI and configure your credentials.

Step 4. Create a new Lambda function and configure it to use the Node.js runtime.

Creating Serverless Functions with Puppeteer

Here's a basic example of a serverless function using Puppeteer to take a screenshot of a website:

import chromium from '@sparticuz/chromium';
import puppeteer from 'puppeteer-core';

export const handler = async (event) => {
  let browser = null;
    let result = null;

    try {
        browser = await puppeteer.launch({
           args: chromium.args,
            defaultViewport: chromium.defaultViewport,
            executablePath: await chromium.executablePath('/opt/nodejs/node_modules/@sparticuz/chromium/bin'),
            headless: chromium.headless,
            ignoreHTTPSErrors: true,
        });

        const page = await browser.newPage();

        await page.goto(event.url || 'https://example.com');

        const screenshot = await page.screenshot({ encoding: 'base64' });

        result = {
            statusCode: 200,
            headers: {
                'Content-Type': 'image/png',
            },
            body: screenshot,
            isBase64Encoded: true,
        };
    } catch (error) {
        console.error(error);
        result = {
            statusCode: 500,
            body: JSON.stringify({ error: error.message }),
        };
    } finally {
        if (browser) {
            await browser.close();
        }
    }

    return result;
};

This function takes a URL as input, navigates to the web page, and returns a base64-encoded screenshot. (note we are using @sparticuz/chromium for chromium-browser because we are using chromium lambda layers provided by sparticuz/chromium arn link in aws).

Leveraging Layers for Modularity and Reusability

Layers in serverless computing allow you to package and share common code and dependencies across multiple functions. For Puppeteer, we can create a layer containing Puppeteer and its dependencies:

When working with Puppeteer on Lambda, using layers can significantly improve the management and deployment of your functions. This is especially useful when you have multiple functions that require Puppeteer.

1. Create a directory for your layer

mkdir puppeteer-layer && cd puppeteer-layer
Create a directory for your puppeteer layer

2. Initialize a new Node.js project and install Puppeteer

npm init -y
npm install puppeteer
Initialize a new Node.js project and install Puppeteer

3. Zip the contents of the node_modules folder

zip -r puppeteer-layer.zip node_modules
Zip the contents of the node_modules folder

4. Upload this zip file as a new layer in your serverless platform

Upload this zip file as a new layer in your serverless platform
Upload this zip file as a new layer in your serverless platform

Create and add the layer in the aws layers window then attach it to the respective lambda function

5. Attach the layer to your Lambda functions that require Puppeteer

Attach the layer to your Lambda functions that require Puppeteer

Click on Add Layers and Specify ARN add the below link then verify and add

Note: Add the browser version along with the region based on your configuration and click on add. 

 arn:aws:lambda:ap-south-1:764866452798:layer:chrome-aws-lambda:46

Ref: https://github.com/shelfio/chrome-aws-lambda-layer?tab=readme-ov-file

Again add the layer custom layer we custom-made, We also need to change the Executable path to start from /opt/ for the custom layer to work with the lambda function.

By using layers, you can keep your function code lean and easily update Puppeteer across all your functions by updating the layer.

Now if you check the layers we have two layers that we added.

To run the browser in lambda we need at least 2 GB of RAM and a timeout of 3 minutes for the function to run because it needs to open a browser and perform the automation the default 3 seconds wound budge. So go to the configuration tab under the same lambda function.

After the configuration, it should look like this:

After saving the changes click the deploy button and test it with the test button

Conclusion

Serverless Puppeteer automation with layers provides a powerful, scalable, and cost-effective solution for web scraping, testing, and other automation tasks. By mastering AWS Lambda and Puppeteer integration, developers can create efficient and scalable web automation workflows. Whether you're scraping data, generating reports, or running automated tests, Puppeteer on AWS Lambda provides a flexible and powerful solution.

Ready to get started? Set up your first serverless Puppeteer function today and unlock the potential of scalable web automation!

Author Detail

Author-goutham sakthi
goutham sakthi

Hello world!! Cloud and Ai enthusiast here and i can solve captcha

Phone

Next for you