Content management systems (CMS) are software tools that help content providers manage and maintain their content efficiently. They are used for creating, updating, and managing website content, and they play a key role in the user experience and overall performance of a website.
In this article, you will learn how Puppeteer and Playwright compare when scraping Strapi-powered web applications, including the strengths and weaknesses of each library and possible workarounds.
To follow along with this tutorial and understand the code samples showcased in this article, you need to satisfy the following:
Strapi is an open-source headless content management system (CMS) that allows users to create, manage, and expose content-rich experiences to any digital device. It offers a user-friendly interface for content managers, allowing you to manage and edit content intuitively. This flexibility makes Strapi a valuable tool for various roles within a development team.
Strapi CMS significantly reduces development time by providing pre-built functionalities like user authentication, content management, API generation, templates, and starters. Building these functionalities from scratch can be time-consuming.
For instance, a case study on the Strapi website details how Société Générale built a complex e-training application in 3 months using Strapi, which would have taken an estimated one year with traditional development.
As a developer, you can use Strapi's features to build projects in hours or days instead of months. With Strapi, content can be managed and delivered to any digital platform, ensuring a seamless multi-device experience for end-users.
Puppeteer and Playwright are both web automation and testing libraries. They allow you to scrape websites and control a web browser with only a few lines of code, and in terms of web scraping, both libraries have similar capabilities.
Scraping websites is important for businesses because of the following:
The effectiveness and efficiency of web scraping largely depend on the tools you use. Inefficient tools are unproductive and not advisable, especially for web scraping: imagine having to install too many packages for simple functionality, worrying about WebDriver and browser compatibility issues, getting easily blocked while scraping, and more.
In this article, you will also learn how to make a better choice between the two most powerful browser automation tools available today for your web scraping needs. If you wish to use already existing web scrapers, there are tons of them available in the market such as social media scrapers like LinkedIn Job Postings Scraper.
Puppeteer and Playwright are powerful headless browser automation libraries that enable you to scrape data from websites, especially when working with Node.js-related projects for web-browser test automation. They are both useful for automating webpage interaction like clicking on buttons, and links, scrolling through pages, and even filling out forms on web pages, and ultimately, extracting data from websites.
Both are Node.js headless browser automation tools designed for end-to-end (E2E) automated testing of web applications. Their browser automation features also make them well suited to web scraping, including scraping Strapi-powered web applications.
A quick look at the download trends on NPM Charts reveals that both libraries continue to grow in downloads. Over the course of one week, however, Playwright was downloaded over four million times, whereas Puppeteer was downloaded over three million times.
In the next section, we will compare and contrast the different features of each library.
Puppeteer lets you control headless Chrome or Chromium browsers. Created by Google, it is a popular option for automating processes such as end-to-end testing and web scraping. A handful of Puppeteer's features are listed below:
Headless Browser Control: Puppeteer lets you automate a Chrome or Chromium browser without a graphical user interface. This makes it ideal for server-side environments, allowing for faster execution and efficient data processing.
Web Scraping Efficiency: Puppeteer can scrape large amounts of data swiftly, especially when configured to skip loading images and other heavy resources. This is especially helpful for extracting targeted data from web pages.
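To make this concrete, one common way to skip images in Puppeteer is request interception. The sketch below is a minimal helper built on Puppeteer's `page.setRequestInterception()` / `request.abort()` API (the helper name is mine); it takes any Puppeteer-style `page` object:

```javascript
// Minimal sketch: abort image, font, and media requests so pages
// load (and scrape) faster. Uses Puppeteer's request-interception API.
const BLOCKED_TYPES = new Set(["image", "font", "media"]);

async function blockHeavyResources(page) {
  await page.setRequestInterception(true);
  page.on("request", (request) => {
    if (BLOCKED_TYPES.has(request.resourceType())) {
      request.abort(); // skip heavy resources
    } else {
      request.continue(); // allow HTML, scripts, XHR, etc.
    }
  });
}
```

With a real page, you would call `await blockHeavyResources(page)` right after `browser.newPage()` and before `page.goto(...)`.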
Screenshot Generation: Puppeteer's screenshot feature allows you to take pictures of web pages. This might be useful for content archiving on websites or for visual testing.
Flexible Element Selection: Puppeteer offers several ways to find elements on a webpage, including text selectors, custom selectors, and XPath. This ensures you can target the precise data you need.
Chrome Extension Testing (Partially Supported): Puppeteer makes it easier to test extensions for Chrome, however, it's crucial to know that headless mode isn't supported for testing extensions because of Chrome's design limitations.
Playwright was created to make browser automation and online testing more efficient. Microsoft developed it, and it supports multiple languages and browsers, with client implementations that are both async and sync. Beyond only basic functions, it provides developers with an extensive feature set. Below are some features:
Cross-Browser Compatibility: Playwright comes with native support for Chromium, Firefox, and WebKit (which Safari uses), in contrast to Puppeteer. This enables you to create tests that function flawlessly in a variety of browsers.
Dynamic Web Page Handling: Playwright uses its auto-waiting feature to address the issues associated with dynamic websites. This ensures that your automation scripts function properly by doing away with the requirement for human waiting during testing.
Headful and Headless Modes: Playwright gives you the option to run browsers in both headless and headful modes, with or without a graphical user interface (GUI). While headful mode might be useful for debugging or visual reference, headless mode is best for server-side automation and site scraping.
Content Capture: Playwright also lets you take screen grabs, record videos of your automated tests, and create PDFs (in headless Chromium) via its page function. This adaptability makes visible feedback and thorough testing documentation possible.
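As a quick sketch of the capture APIs (the helper name and file names are mine; `page.screenshot()` and `page.pdf()` are real Playwright methods, with PDF generation limited to headless Chromium):

```javascript
// Minimal sketch: capture a full-page screenshot and a PDF from a
// Playwright page. Note: page.pdf() only works in headless Chromium.
async function capturePage(page, baseName) {
  const files = [];
  await page.screenshot({ path: `${baseName}.png`, fullPage: true });
  files.push(`${baseName}.png`);
  await page.pdf({ path: `${baseName}.pdf` });
  files.push(`${baseName}.pdf`);
  return files; // paths of the generated artifacts
}
```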
Flexible Element Selection: Playwright facilitates the targeting of particular webpage items for interaction or data extraction by supporting standard selection methods like CSS and XPath.
Multi-Language Support: Playwright provides bindings for Python, Java, JavaScript, TypeScript, and .NET (C#) to cater to a broader development community.
Proxy Integration: When performing online automation operations, Playwright gives you the ability to use proxies to get more control.
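For instance, Playwright's `browserType.launch()` accepts a `proxy` option. A small helper for assembling it is sketched below (the proxy address and credentials are placeholders):

```javascript
// Minimal sketch: build the `proxy` launch option Playwright expects.
function buildProxyOptions(server, username, password) {
  const proxy = { server }; // e.g. "http://myproxy.example:3128" (placeholder)
  if (username) proxy.username = username;
  if (password) proxy.password = password;
  return { proxy };
}

// With Playwright installed, you would use it roughly like:
// const browser = await chromium.launch(buildProxyOptions("http://myproxy.example:3128"));
```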
Compared to Puppeteer, which officially supports only JavaScript/TypeScript, Playwright offers cross-language support for Python, JavaScript/TypeScript, .NET, and Java. (Puppeteer does have an unofficial Python port named Pyppeteer.) This makes Playwright helpful for scraping if you're not comfortable with JavaScript, since you're not locked into a single language.
If cross-browser compatibility is a concern for you, Playwright is also the better choice in terms of browser support, thanks to its native support for Firefox, WebKit (Safari), Chrome, and Chromium-based browsers. This wide range of supported browsers makes Playwright a versatile option for projects requiring cross-browser testing or automation in non-Chromium environments.
Conversely, Puppeteer was created by the Google Chrome team and is limited to Chromium-based browsers and Google Chrome (with experimental support for Firefox and Edge). The original plan was to automate browsers that ran on the Chromium platform. Puppeteer is therefore a fantastic option for applications where Chromium compatibility is crucial.
Compared to Puppeteer, which has been on the market since 2017, Playwright is relatively new to the community, having been introduced in 2020. This helps explain the difference in community size noted below.
As of April 2024, Puppeteer boasts 86k+ GitHub stars, whereas Playwright has about 61k+. Puppeteer has the larger community, while Playwright's is smaller but very active.
Both Puppeteer and Playwright are renowned for their speed and efficiency. Playwright, however, has a few features that may give it a slight edge in particular situations.
One important distinction is Playwright's auto-waiting feature. By automating waiting periods after tasks like form completion, it simulates human behavior and may lower the chance of your bot being detected. Puppeteer's manual timer configuration, by contrast, can add complexity and slow down scraping.
Another aspect is Playwright's support for both synchronous and asynchronous clients (in its Python bindings). Synchronous clients are simpler for smaller scripts, while asynchronous clients are better suited to scaling complex scraping jobs. Since Puppeteer supports only asynchronous usage, Playwright gives you more options for tuning scraping performance to your project's requirements.
Playwright's auto-waiting feature removes the need for manually setting timers in your scraping scripts. Playwright automatically performs a series of checks to ensure elements are visible, stable, and responsive before attempting actions like clicking or filling out forms. This not only simplifies your code but also helps avoid errors caused by interacting with elements that aren't fully loaded or ready, thereby preventing flaky tests.
In Puppeteer, you can use the `page.waitForSelector()` method to wait for a selector to appear on the page. If the selector already exists when the method is called, it returns immediately. If the selector doesn't appear within the timeout (in milliseconds), the method throws an error.
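One way to keep a Puppeteer scrape from crashing on a missing selector is to wrap `page.waitForSelector()` in a try/catch. The helper below is a sketch (the function name is mine) that returns `null` on timeout instead of throwing:

```javascript
// Minimal sketch: wait for a selector, but return null on timeout
// instead of letting the error stop the whole scrape.
async function waitForSelectorSafe(page, selector, timeout = 5000) {
  try {
    return await page.waitForSelector(selector, { timeout });
  } catch (err) {
    return null; // selector never appeared within `timeout` ms
  }
}
```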
When it comes to scraping content from Strapi-powered applications, both Puppeteer and Playwright are great libraries for browser automation and testing, however, there are minor differences that you should be aware of.
In the next section, I will summarize all the feature comparisons and differences between both libraries in a table.
| Feature | Puppeteer | Playwright |
| --- | --- | --- |
| Language Support | JavaScript/TypeScript (unofficial Python port: Pyppeteer) | Python, Java, JavaScript/TypeScript, .NET (C#) |
| Browser Support | Chrome/Chromium (limited Firefox/Edge support) | Chrome/Chromium, WebKit (Safari), Firefox |
| Waiting Mechanism | Manual waits (e.g., `page.waitForSelector()`) | Auto-waiting for elements to be ready before interaction |
| Community & Documentation | Larger, established community | Growing community and good documentation, but not as extensive as Puppeteer's |
| Strapi Automation | Suitable for basic tasks like login, form filling, and content creation | More flexible for complex workflows due to multi-language support and browser emulation for responsive UI testing |
| Strapi Scraping | Efficient for scraping Chrome/Chromium-based Strapi applications | Ideal for scraping Strapi applications across various browsers to ensure data consistency |
| Ease of Use | Generally easier to start with due to larger community resources | Slightly steeper learning curve for those new to the tool |
| Integration with Strapi API | Can be integrated with Strapi's official REST API for data manipulation | Can also be integrated with Strapi's official REST API |
| Error Handling | Provides error-handling mechanisms for unexpected scenarios | Offers similar error-handling capabilities |
| Performance | Known for fast scraping speeds | Offers similar performance, especially for Chrome/Chromium scraping |
| Scalability | Scales well for large-scale scraping projects | Scales well thanks to asynchronous client support |
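As the table notes, both libraries pair well with Strapi's REST API when you control the backend. In Strapi v4, collection types are exposed under `/api/<pluralName>` with `populate` and `pagination` query parameters; the helper below sketches how to build such a URL (the host and content type are placeholders):

```javascript
// Minimal sketch: build a Strapi v4 REST URL for a collection type,
// using Strapi's populate and pagination query parameters.
function strapiUrl(baseUrl, contentType, { populate = "*", pageSize = 25 } = {}) {
  const params = new URLSearchParams({
    populate,
    "pagination[pageSize]": String(pageSize),
  });
  return `${baseUrl}/api/${contentType}?${params.toString()}`;
}

// e.g. strapiUrl("http://localhost:1337", "articles")
```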
In this section, I will walk you through how to scrape Strapi-powered web applications using both Puppeteer and Playwright, so you see them in action. For this article, I will be scraping the L'Équipe.fr website. The L'Équipe.fr website is powered by Strapi CMS; therefore, I will be using it for our scraping purposes.
I will use the Chrome browser DevTools to inspect the L'Équipe.fr to know the elements I will target to extract data. To open Chrome DevTools, you can press F12 or right-click anywhere on the page and choose Inspect.
Now go to L'Équipe.fr and open your DevTools there. Inspecting the same website as I do will make this article easier to follow.
I will start by walking you through how to build your scraper using Puppeteer first, then Playwright.
The first step is to create a folder. In this case, I will name my folder `puppeteer-scraper`. Run the command below to create your folder and `cd` into it.
mkdir puppeteer-scraper && cd puppeteer-scraper
Next, initialize an empty Node.js project by running this command:
npm init -y
Now, you need to install the Puppeteer library. Run this command to install Puppeteer:
npm install puppeteer
After running this command, you should find this `package.json` file in your project folder.

To get started, let's make sure your project can understand modern JavaScript (ES6 features) by adding `"type": "module"` to your `package.json` file.
{
  "name": "puppeteer-scraper",
  "version": "1.0.0",
  "type": "module",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC",
  "dependencies": {
    "puppeteer": "^22.6.3"
  }
}
After installing Puppeteer, create an `index.js` file in the root folder. This will serve as the entry point of your scraper application. Inside the `index.js` file, paste the following code:
import puppeteer from "puppeteer";

const getPosts = async () => {
  // Start a Puppeteer session with:
  // - a visible browser (`headless: false` - easier to debug because you'll see the browser in action)
  // - no default viewport (`defaultViewport: null` - website page will be in full width and height)
  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: null,
  });

  // Open a new page
  const page = await browser.newPage();

  // On this new page:
  // - open the "https://www.lequipe.fr/Chrono" website
  // - wait until the DOM content is loaded (HTML is ready)
  await page.goto("https://www.lequipe.fr/Chrono", {
    waitUntil: "domcontentloaded",
  });
};

// Start the scraping
getPosts();
Now, it is time to run our application, so we see what we have so far. To run your application, run the following command:
node index.js
This will spin up Chrome in a headful browser with a new page and the L'Équipe.fr loaded onto it, just like in the screenshot below:
Next, update your `index.js` file with this code:
import puppeteer from "puppeteer";

(async () => {
  // Launch the browser (headless: false so you can watch it work; switch to true in production)
  const browser = await puppeteer.launch({ headless: false });

  // Open a new page
  const page = await browser.newPage();

  // Navigate to the L'Équipe.fr page
  await page.goto("https://www.lequipe.fr/Chrono");

  // Wait for the first `.ChronoItem` card to appear
  await page.waitForSelector(".ChronoItem");

  // Wait for at least 5 article cards to be loaded
  await page.waitForFunction(() => {
    const articleCards = document.querySelectorAll(".ChronoItem");
    return articleCards.length > 5;
  });

  // Get page data
  const scrapedPosts = await page.evaluate(() => {
    const articleCards = document.querySelectorAll(".ChronoItem");

    return Array.from(articleCards).map((card) => {
      const link = card.querySelector(".Link.ChronoItem__link")?.href || null;
      const time = card.querySelector(".ChronoItem__time")?.innerText.trim() || null;
      const summary = card.querySelector(".ChronoItem__summary")?.innerText.trim() || null;
      // querySelectorAll returns a NodeList, so map over each tag node
      const tagItems = card.querySelectorAll(".ArticleTags__item");
      const tags = tagItems.length
        ? Array.from(tagItems).map((tag) => tag.innerText.trim())
        : null;

      return { link, time, summary, tags };
    });
  });

  // Display scraped posts
  console.log("Scraped posts from current page:", scrapedPosts);

  // Close the browser
  await browser.close();
})();
In the code above, you started a Puppeteer session with:

- a visible browser (`headless: false` - easier to debug because you'll see the browser in action)
- no default viewport (`defaultViewport: null` - the website page will be in full width and height)

The `index.js` file uses Puppeteer to launch the browser and open a new page. The script waits for the article cards to load, then calls the `page.evaluate` function, which injects JavaScript code into the browser to query the DOM (Document Object Model) for elements containing article links, times, summaries, and tags.

Now, stop the script by pressing `CTRL + C` and then run it again with the command below:
node index.js
The result of scraping the L'Équipe.fr website using Puppeteer is shown below:
[
  {
    "link": "https://www.lequipe.fr/Tennis/Actualites/Ugo-humbert-apres-son-elimination-en-quarts-de-finale-a-monte-carlo-je-vais-tout-faire-pour-me-rapprocher-du-top-10/1460777",
    "time": "21:21",
    "summary": "Humbert : «Tout faire pour me rapprocher du top 10»",
    "tags": null
  },
  {
    "link": "https://www.lequipe.fr/Rugby/Actualites/Pro-d2-vannes-et-beziers-vainqueurs-avec-le-bonus-grenoble-cartonne/1460776",
    "time": "21:16",
    "summary": "Vannes, Béziers et Grenoble vainqueurs avec le bonus",
    "tags": null
  },
  {
    "link": "https://www.lequipe.fr/Jo-2024-paris/Cyclisme-sur-piste/Actualites/Les-francaises-de-la-poursuite-par-equipes-qualifiees-pour-les-jo-2024/1460773",
    "time": "21:12",
    "summary": "La poursuite par équipes féminine qualifiée",
    "tags": null
  },
  {
    "link": "https://www.lequipe.fr/Rugby/Actualites/Ronan-o-gara-entraineur-de-la-rochelle-mon-president-ne-m-a-pas-choisi-pour-emmener-mes-joueurs-visiter-mon-ancienne-ecole/1460769",
    "time": "21:01",
    "summary": "O'Gara : «Pas emmener mes joueurs visiter mon ancienne école»",
    "tags": null
  }
]
Check out the complete code repository on GitHub.
The first step is to create a folder. In this case, I will name my folder `playwright-scraper`. Run the command below to create your folder and `cd` into it.
mkdir playwright-scraper && cd playwright-scraper
Next, initialize an empty project by running this command:
npm init -y
Now, you need to install the Playwright library. Run this command to install Playwright:
npm install playwright && npx playwright install
After running this command, you should find this `package.json` file at the root of your project.

To get started, let's make sure your project can understand modern JavaScript (ES6 features) by adding `"type": "module"` to your `package.json` file.
{
  "name": "playwright-scraper",
  "version": "1.0.0",
  "type": "module",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC",
  "dependencies": {
    "playwright": "^1.43.1"
  }
}
After installing Playwright, create an `index.js` file in the root folder. This will serve as the entry point of your scraper application. Inside it, paste the following code:
// Import the Chromium browser into our scraper.
import { chromium } from "playwright";

// Open a Chromium browser. We use headless: false
// to be able to watch the browser window.
const browser = await chromium.launch({
  headless: false,
});

// Open a new page/tab in the browser.
const page = await browser.newPage();

// Tell the tab to navigate to the L'Équipe Chrono page.
await page.goto("https://www.lequipe.fr/Chrono");

// Pause to see what's going on.
await page.waitForTimeout(120_000);

// other codes go here

// Close the browser to clean up after ourselves.
await browser.close();
This will spin up Chromium in a headful browser with a new page and L'Équipe.fr loaded onto it, just like in the screenshot below:
Up until now, you have not yet started scraping the website. The code above is to make sure Playwright is running as expected.
Next, add the following snippet where I added the comment `// other codes go here` above:
...

await page.waitForFunction(() => {
  // Wait until at least 6 article cards are on the page
  const articleCards = document.querySelectorAll(".ChronoItem");
  return articleCards.length > 5;
});

// Get page data
const scrapedPosts = await page.$$eval(".ChronoItem", (articleCards) => {
  return articleCards.map((card) => {
    // Fetch the sub-elements from each article card
    const link = card.querySelector(".Link.ChronoItem__link")?.href || null;
    const time = card.querySelector(".ChronoItem__time")?.innerText.trim() || null;
    const summary = card.querySelector(".ChronoItem__summary")?.innerText.trim() || null;
    // querySelectorAll returns a NodeList, so map over each tag node
    const tagItems = card.querySelectorAll(".ArticleTags__item");
    const tags = tagItems.length
      ? Array.from(tagItems).map((tag) => tag.innerText.trim())
      : null;

    return {
      link,
      time,
      summary,
      tags,
    };
  });
});

// Log scraped posts from this page
console.log("Scraped posts from current page:", scrapedPosts);

// Pause to see what's going on.
await page.waitForTimeout(120_000);

...
The code above uses the Playwright library to scrape data from the L'Équipe.fr website. It launches a Chromium browser (in headful mode here, so you can watch what happens), opens a new page, and navigates to the target URL (https://www.lequipe.fr/Chrono). A wait is implemented to ensure the page fully loads.
Then, the code waits for a specific condition on the page: at least six elements matching the `.ChronoItem` selector, representing article cards. Once that condition is met, the script extracts data using page evaluation. It finds all elements matching `.ChronoItem` and iterates through them, building an object for each containing details like the `link`, `time`, `summary`, and `tags`. These objects are collected into an array and printed to the console.
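When extracting the tags, note that `querySelectorAll` returns a NodeList, which has no `innerText` property of its own; the safe pattern is to map over the individual nodes. A standalone sketch of that pattern (the helper name is mine):

```javascript
// Minimal sketch: turn a list of tag elements into an array of
// trimmed tag strings, or null when there are no tags at all.
function extractTags(tagNodes) {
  const tags = Array.from(tagNodes).map((node) => node.innerText.trim());
  return tags.length > 0 ? tags : null;
}
```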
[
  {
    "link": "https://www.lequipe.fr/Tennis/Actualites/Ugo-humbert-apres-son-elimination-en-quarts-de-finale-a-monte-carlo-je-vais-tout-faire-pour-me-rapprocher-du-top-10/1460777",
    "time": "21:21",
    "summary": "Humbert : «Tout faire pour me rapprocher du top 10»",
    "tags": null
  },
  {
    "link": "https://www.lequipe.fr/Rugby/Actualites/Pro-d2-vannes-et-beziers-vainqueurs-avec-le-bonus-grenoble-cartonne/1460776",
    "time": "21:16",
    "summary": "Vannes, Béziers et Grenoble vainqueurs avec le bonus",
    "tags": null
  },
  {
    "link": "https://www.lequipe.fr/Jo-2024-paris/Cyclisme-sur-piste/Actualites/Les-francaises-de-la-poursuite-par-equipes-qualifiees-pour-les-jo-2024/1460773",
    "time": "21:12",
    "summary": "La poursuite par équipes féminine qualifiée",
    "tags": null
  },
  {
    "link": "https://www.lequipe.fr/Rugby/Actualites/Ronan-o-gara-entraineur-de-la-rochelle-mon-president-ne-m-a-pas-choisi-pour-emmener-mes-joueurs-visiter-mon-ancienne-ecole/1460769",
    "time": "21:01",
    "summary": "O'Gara : «Pas emmener mes joueurs visiter mon ancienne école»",
    "tags": null
  }
]
This scraped data from the L'Équipe.fr website is a collection of sports news containing a summary of the news, a link to read the full post, and tags.
Check out the complete code repository on GitHub.
Web scraping legality can be complex. Below are important tips to keep in mind while scraping:

- Check the website's terms of service and its robots.txt file before scraping.
- Rate-limit your requests so you don't overload the target server.
- Avoid collecting personal data or republishing copyrighted content without permission.
- Prefer an official API (such as Strapi's REST API) over scraping whenever one is available.
Both Puppeteer and Playwright are great automation libraries for scraping. The best choice depends on your specific needs. Web scraping with Puppeteer remains a strong option for its ease of use, extensive documentation, and focus on Chrome/Chromium, which might be sufficient for many Strapi CMS projects. However, web scraping with Playwright's multi-language support, auto-waiting functionality, and cross-browser capabilities offer a more versatile and potentially future-proof solution, especially for complex workflows or scraping across multiple browsers.
Emmanuel is an experienced and enthusiastic software developer and technical writer with proven years of professional experience. He focuses on full-stack web development. He is fluent in React, TypeScript, VueJS, and NodeJS and familiar with industry-standard technologies such as version control, headless CMS, and JAMstack. He is passionate about knowledge sharing.