If there is one message that I am trying to convey to the Strapi community, it is that it is possible to create any type of application with Strapi: a simple blog, a showcase site, a corporate site, an e-commerce site, or an API that can, by the way, also be consumed by a mobile application, and so on.
Thanks to its strong customization capabilities, you can create whatever you want with Strapi. Today, this tutorial will guide you through creating an application used to scrape a website. In the past, I used Strapi for a freelance project that consisted of scraping several websites in order to collect public information.
Strapi was very useful because I built an architecture allowing me to manage a scraper in just a few clicks. However, what follows may not be the best way to create a scraping app with Strapi: there are no official guidelines, so it depends on your needs and your imagination.
For this tutorial, we are going to scrape the jamstack.org site and more precisely the site generators section. This site lists headless CMSs like Strapi and others, but also site generators and frameworks such as Next.js, Gatsby or Jekyll.
We are simply going to create an app that, via a cron job, will collect the site generators every day and insert them into a Strapi app. To do this, we will use Puppeteer to control a headless browser and Cheerio to extract the necessary information from the pages.
GitHub repository of this article
Alright, let's get into it!
Web scraping involves collecting information from websites by simulating human browsing behavior with automated tools or scripts. This process extracts data from web pages and transforms it into a structured format for analysis or use in applications.
After extracting data, it's important to store or transmit it efficiently. Choosing the right method depends on your project's specific needs and how the data will be used. In this tutorial, we'll be storing the scraped data in our Strapi application to manage it effectively.
Create a Strapi application with the following command:
```bash
npx create-strapi-app jamstack-scraper --quickstart
```
This application will have an SQLite database. Feel free to remove the --quickstart option to select your favorite database.
Create the first collection type called scraper:
Content-Types Builder > + Create new collection type.
Great now you should be able to see your scraper by clicking on Scrapers in the side nav:
This view is not well organized, let's change that.
By creating a Scraper collection type, you'll be able to manage the behavior of your scraper directly in the admin. You can:
1. Enable/disable it
2. Update the frequency
3. See the errors of the last execution
4. Get a report of the last execution
5. See all the data scraped for this scraper via a relation
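The original screenshots of the Content-Types Builder steps aren't reproduced here, so here is a rough sketch of the fields the scraper collection type needs, in Strapi v3's model format (./api/scraper/models/scraper.settings.json). Treat the exact types as an assumption, inferred from how the fields are used later in this tutorial:

```json
{
  "kind": "collectionType",
  "collectionName": "scrapers",
  "info": {
    "name": "scraper"
  },
  "attributes": {
    "name": { "type": "string" },
    "slug": { "type": "uid", "targetField": "name" },
    "enabled": { "type": "boolean", "default": false },
    "frequency": { "type": "string" },
    "next_execution_at": { "type": "string" },
    "report": { "type": "json" },
    "error": { "type": "json" },
    "site_generators": {
      "collection": "site-generator",
      "via": "scraper"
    }
  }
}
```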
Perfect! Everything seems ready concerning the content in the admin! Now let's create our scraper.
What we have here is:
1. The name of the scraper.
2. The slug, which is generated from the name.
3. We disable it for now.
4. The frequency, expressed using a cron schedule expression. `* * * * *` means every minute. Since, after testing, we want to launch the scraper every day at, let's say, 3 PM, it would be something like this: `0 15 * * *` (see the sketch just after this list).
5. The `next_execution_at` field will contain the timestamp of the next execution. This way the scraper will respect the frequency. You will have more details on this later ;)
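If you want to check what a frequency resolves to, you can experiment with the cron-parser package (which we'll add to the project later in this tutorial). A minimal sketch:

```js
const parser = require('cron-parser');

// "0 15 * * *" means every day at 3 PM
const interval = parser.parseExpression('0 15 * * *');
const next = interval.next();

console.log(next.toDate());                     // the next 3 PM occurrence as a Date
console.log(Math.floor(next.getTime() / 1000)); // as a Unix timestamp, like next_execution_at
```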
Let's dive into the code! The first thing to do is make your application able to fetch your scraper with a slug.
Add the following code in your ./api/scraper/controllers/scraper.js
file:
```js
const { sanitizeEntity } = require('strapi-utils');

module.exports = {
  /**
   * Retrieve a record.
   *
   * @return {Object}
   */
  async findOne(ctx) {
    const { slug } = ctx.params;

    const entity = await strapi.services.scraper.findOne({ slug });
    return sanitizeEntity(entity, { model: strapi.models.scraper });
  },
};
```
Update only the findOne route of your ./api/scraper/config/routes.json file with the following code:
```json
{
  "routes": [
    ...
    {
      "method": "GET",
      "path": "/scrapers/:slug",
      "handler": "scraper.findOne",
      "config": {
        "policies": []
      }
    },
    ...
  ]
}
```
Awesome! Now you are able to fetch your scrapers with the associated slug.
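To try it out, you can hit the new endpoint directly. This assumes you've allowed the findOne action for the Public role under Settings > Roles, and that your scraper's slug is jamstack-org:

```bash
curl http://localhost:1337/scrapers/jamstack-org
```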
Now let's create our scraper.
Create a ./scripts folder at the root of your Strapi project. Then create a ./scripts/scrapers folder inside this previously created folder, and an empty ./scripts/scrapers/jamstack.js file.

This file will contain the logic to scrape the site generators on Jamstack.org. It will be called by the cron but, during development, we will call it from the ./config/functions/bootstrap.js file, which is executed every time the server restarts. This way, we will be able to try our script each time we save the file.

In the end, we will remove it from there and call it in our cron file every minute. The script will check whether it is time to execute or not, thanks to the frequency you define for your scraper.
Add the following code to your ./scripts/scrapers/jamstack.js file:
```js
const main = async () => {
  // Fetch the correct scraper thanks to the slug
  const slug = "jamstack-org"
  const scraper = await strapi.query('scraper').findOne({
    slug: slug
  });

  console.log(scraper);

  // If the scraper doesn't exist, is disabled, or doesn't have a frequency, then we do nothing
  if (scraper == null || !scraper.enabled || !scraper.frequency)
    return
}

exports.main = main;
```
Update your ./config/functions/bootstrap.js
file with the following:
```js
'use strict';

/**
 * An asynchronous bootstrap function that runs before
 * your application gets started.
 *
 * This gives you an opportunity to set up your data model,
 * run jobs, or perform some special logic.
 *
 * See more details here: https://strapi.io/documentation/developer-docs/latest/concepts/configurations.html#bootstrap
 */

const jamstack = require('../../scripts/scrapers/jamstack.js')

module.exports = () => {
  jamstack.main()
};
```
Now you should see your scraper (as JSON) in your terminal after saving your bootstrap file.

Great! Now it's time to scrape! First of all, we are going to create a file containing some useful functions for our script.
Create a ./scripts/scrapers/utils/utils.js
file containing the following:
```js
'use strict'
```
The first function will use the cron-parser package to check whether the script can be executed, depending on the frequency you set in the admin. We'll also use the chalk package to display messages in color.
Add these packages by running the following command:

```bash
yarn add cron-parser chalk
```
Add the following code to your ./scripts/scrapers/utils/utils.js
file.
```js
'use strict'

const parser = require('cron-parser');

const scraperCanRun = async (scraper) => {
  const frequency = parser.parseExpression(scraper.frequency);
  const current_date = parseInt((new Date().getTime() / 1000));
  let next_execution_at = ""

  if (scraper.next_execution_at){
    next_execution_at = scraper.next_execution_at
  }
  else {
    next_execution_at = (frequency.next().getTime() / 1000);
    await strapi.query('scraper').update({
      id: scraper.id
    }, {
      next_execution_at: next_execution_at
    });
  }

  if (next_execution_at <= current_date){
    await strapi.query('scraper').update({
      id: scraper.id
    }, {
      next_execution_at: (frequency.next().getTime() / 1000)
    });
    return true
  }
  return false
}

module.exports = { scraperCanRun }
```
This function checks whether it is time to execute the scraper or not. Depending on the frequency you set on your scraper, the cron-parser package parses it and verifies whether this is the right time to execute your scraper, thanks to the next_execution_at field.

Important: If you want to change the frequency of your scraper, you'll need to delete the next_execution_at value. It will be reset according to your new frequency.
The next function will get all the site generators we already inserted in our database, so that we don't insert them again during the execution of the script.
Add the following function to your ./scripts/scrapers/utils/utils.js
file.
```js
'use strict'
...
const chalk = require('chalk');

const getAllSG = async (scraper) => {
  const existingSG = await strapi.query('site-generator').find({
    _limit: 1000,
    scraper: scraper.id
  }, ["name"]);
  const allSG = existingSG.map(x => x.name);
  console.log(`Site generators in database: \t${chalk.blue(allSG.length)}`);

  return allSG;
}

module.exports = { getAllSG, scraperCanRun }
```
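Note that this query assumes you also created a site-generator collection type (the Content-Types Builder steps for it aren't shown here). Based on the fields used by the queries in this tutorial, its model (./api/site-generator/models/site-generator.settings.json) could roughly look like this sketch:

```json
{
  "kind": "collectionType",
  "collectionName": "site_generators",
  "info": {
    "name": "site-generator"
  },
  "attributes": {
    "name": { "type": "string" },
    "stars": { "type": "string" },
    "forks": { "type": "string" },
    "issues": { "type": "string" },
    "description": { "type": "text" },
    "language": { "type": "string" },
    "templates": { "type": "string" },
    "license": { "type": "string" },
    "deploy_to_netlify_link": { "type": "string" },
    "scraper": {
      "model": "scraper",
      "via": "site_generators"
    }
  }
}
```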
The next function will simply get the current date for our report and error logs.
Add the following function to your ./scripts/scrapers/utils/utils.js
file.
```js
'use strict'
...
const getDate = async () => {
  const today = new Date();
  const date = today.getFullYear()+'-'+(today.getMonth()+1)+'-'+today.getDate();
  const time = today.getHours() + ":" + today.getMinutes() + ":" + today.getSeconds();
  return date+' '+time;
}

module.exports = { getDate, getAllSG, scraperCanRun }
```
The last function will prepare the report of the last execution.
Add the following function to your ./scripts/scrapers/utils/utils.js
file.
```js
'use strict'
...
const getReport = async (newSG) => {
  return { newSG: newSG, date: await getDate() }
}

module.exports = { getReport, getDate, getAllSG, scraperCanRun }
```
Alright! Your ./scripts/scrapers/utils/utils.js
should look like this:
```js
'use strict'

const parser = require('cron-parser');
const chalk = require('chalk');

const scraperCanRun = async (scraper) => {
  const frequency = parser.parseExpression(scraper.frequency);
  const current_date = parseInt((new Date().getTime() / 1000));
  let next_execution_at = ""

  if (scraper.next_execution_at){
    next_execution_at = scraper.next_execution_at
  }
  else {
    next_execution_at = (frequency.next().getTime() / 1000);
    await strapi.query('scraper').update({
      id: scraper.id
    }, {
      next_execution_at: next_execution_at
    });
  }

  if (next_execution_at <= current_date){
    await strapi.query('scraper').update({
      id: scraper.id
    }, {
      next_execution_at: (frequency.next().getTime() / 1000)
    });
    return true
  }
  return false
}

const getAllSG = async (scraper) => {
  const existingSG = await strapi.query('site-generator').find({
    _limit: 1000,
    scraper: scraper.id
  }, ["name"]);
  const allSG = existingSG.map(x => x.name);
  console.log(`Site generators in database: \t${chalk.blue(allSG.length)}`);

  return allSG
}

const getDate = async () => {
  const today = new Date();
  const date = today.getFullYear()+'-'+(today.getMonth()+1)+'-'+today.getDate();
  const time = today.getHours() + ":" + today.getMinutes() + ":" + today.getSeconds();
  return date+' '+time;
}

const getReport = async (newSG) => {
  return { newSG: newSG, date: await getDate() }
}

module.exports = { getReport, getDate, getAllSG, scraperCanRun }
```
Let's update our ./scripts/scrapers/jamstack.js file a little bit now. But first, let's add Puppeteer ;)
Add puppeteer and cheerio to your package.json
file by running the following command:
```bash
yarn add puppeteer cheerio
```
Update your ./scripts/scrapers/jamstack.js file with the following:

```js
'use strict'

const chalk = require('chalk');
const puppeteer = require('puppeteer');
const {
  getReport,
  getDate,
  getAllSG,
  scraperCanRun
} = require('./utils/utils.js')

let report = {}
let errors = []
let newSG = 0

const scrape = async (allSG, scraper) => {
  console.log("Scrape function");
}

const main = async () => {
  // Fetch the correct scraper thanks to the slug
  const slug = "jamstack-org"
  const scraper = await strapi.query('scraper').findOne({
    slug: slug
  });

  // If the scraper doesn't exist, is disabled, or doesn't have a frequency, then we do nothing
  if (scraper == null || !scraper.enabled || !scraper.frequency){
    console.log(`${chalk.red("Exit")}: (Your scraper may not exist, is not activated, or does not have a frequency field filled in)`);
    return
  }

  const canRun = await scraperCanRun(scraper);
  if (canRun && scraper.enabled){
    const allSG = await getAllSG(scraper)
    await scrape(allSG, scraper)
    report = await getReport(newSG);
  }
}

exports.main = main;
```
Now save your file! You might have this message:
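```bash
Exit: (Your scraper may not exist, is not activated, or does not have a frequency field filled in)
```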
Yep, you didn't enable your scraper, so it will not go further. In order to continue, we will comment out this check, because we want to be able to work on the scrape function without having to wait for the next scheduled execution ;)

Comment out the following portion of code:
```js
'use strict'

const chalk = require('chalk');
const puppeteer = require('puppeteer');
const {
  getReport,
  getDate,
  getAllSG,
  scraperCanRun
} = require('./utils/utils.js')

let report = {}
let errors = []
let newSG = 0

const scrape = async (allSG, scraper) => {
  console.log("Scrape function");
}

const main = async () => {
  // Fetch the correct scraper thanks to the slug
  const slug = "jamstack-org"
  const scraper = await strapi.query('scraper').findOne({
    slug: slug
  });

  // If the scraper doesn't exist, is disabled, or doesn't have a frequency, then we do nothing
  // if (scraper == null || !scraper.enabled || !scraper.frequency){
  //   console.log(`${chalk.red("Exit")}: (Your scraper may not exist, is not activated, or does not have a frequency field filled in)`);
  //   return
  // }

  // const canRun = await scraperCanRun(scraper);
  // if (canRun && scraper.enabled){
    const allSG = await getAllSG(scraper)
    await scrape(allSG, scraper)
  // }
}

exports.main = main;
```
You can save your file and now it should be fine!
Perfect! Now let's dive into scraping data!
Update your ./scripts/scrapers/jamstack.js
file with the following:
```js
'use strict'

const chalk = require('chalk');
const cheerio = require('cheerio');
const puppeteer = require('puppeteer');
const {
  getReport,
  getDate,
  getAllSG,
  scraperCanRun
} = require('./utils/utils.js')
const {
  createSiteGenerators,
  updateScraper
} = require('./utils/query.js')

let report = {}
let errors = []
let newSG = 0

const scrape = async (allSG, scraper) => {
  const url = "https://jamstack.org/generators/"
  const browser = await puppeteer.launch({ args: ['--no-sandbox', '--disable-setuid-sandbox'] });
  const page = await browser.newPage();

  try {
    await page.goto(url)
  } catch (e) {
    console.log(`${chalk.red("Error")}: (${url})`);
    errors.push({
      context: "Page navigation",
      url: url,
      date: await getDate()
    })
    await browser.close();
    return
  }

  // Wait for the generator cards to render, then select them all
  const expression = "//div[@class='generator-card flex flex-col h-full']"
  await page.waitForXPath(expression, { timeout: 3000 })
  const elements = await page.$x(expression);

  for (const element of elements) {
    const card = await page.evaluate(el => el.innerHTML, element);
    const $ = cheerio.load(card)
    const name = $('.text-xl').text().trim() || null;

    // Skip this card if the site generator is already in the database
    if (allSG.includes(name))
      continue;

    const stars = $('span:contains("stars")').parent().text().replace("stars", "").trim() || null;
    const forks = $('span:contains("forks")').parent().text().replace("forks", "").trim() || null;
    const issues = $('span:contains("issues")').parent().text().replace("issues", "").trim() || null;
    const description = $('.text-sm.mb-4').text().trim() || null;
    const language = $('dt:contains("Language:")').next().text().trim() || null;
    const template = $('dt:contains("Templates:")').next().text().trim() || null;
    const license = $('dt:contains("License:")').next().text().trim() || null;
    const deployLink = $('a:contains("Deploy")').attr('href') || null;

    await createSiteGenerators(
      name,
      stars,
      forks,
      issues,
      description,
      language,
      template,
      license,
      deployLink,
      scraper
    )
    newSG += 1;
  }

  await page.close()
  await browser.close();
}

const main = async () => {
  // Fetch the correct scraper thanks to the slug
  const slug = "jamstack-org"
  const scraper = await strapi.query('scraper').findOne({
    slug: slug
  });

  // If the scraper doesn't exist, is disabled, or doesn't have a frequency, then we do nothing
  // if (scraper == null || !scraper.enabled || !scraper.frequency){
  //   console.log(`${chalk.red("Exit")}: (Your scraper may not exist, is not activated, or does not have a frequency field filled in)`);
  //   return
  // }

  // const canRun = await scraperCanRun(scraper);
  // if (canRun && scraper.enabled){
    const allSG = await getAllSG(scraper)
    await scrape(allSG, scraper)
    report = await getReport(newSG);
  // }
}

exports.main = main;
```
Let me explain what is going on with this modification.

First of all, we are importing functions from a file we haven't created yet: ./scripts/scrapers/utils/query.js. These two functions will allow us to create the site generators in our database and to update our scraper (errors and report). We will create this file right after, don't worry ;)

Then we have the scrape function that simply scrapes the data we want using puppeteer and cheerio, and then uses the query functions to insert this data into our database.
Create a ./scripts/scrapers/utils/query.js
file containing the following:
```js
'use strict'

const chalk = require('chalk');

const createSiteGenerators = async (name, stars, forks, issues, description, language, template, license, deployLink, scraper) => {
  try {
    const entry = await strapi.query('site-generator').create({
      name: name,
      stars: stars,
      forks: forks,
      issues: issues,
      description: description,
      language: language,
      templates: template,
      license: license,
      deploy_to_netlify_link: deployLink,
      scraper: scraper.id
    })
  } catch (e) {
    console.log(e);
  }
}

const updateScraper = async (scraper, report, errors) => {
  await strapi.query('scraper').update({
    id: scraper.id
  }, {
    report: report,
    error: errors,
  });
  console.log(`Job done for: ${chalk.green(scraper.name)}`);
}

module.exports = {
  createSiteGenerators,
  updateScraper,
}
```
As you can see, we update the scraper at the end with this updateScraper function. Now we can uncomment the code that decides whether our scraper should be executed, depending on its frequency, whether it's enabled, and so on.
Let's add it to our ./scripts/scrapers/jamstack.js
file:
```js
'use strict'

const chalk = require('chalk');
const cheerio = require('cheerio');
const puppeteer = require('puppeteer');
const {
  getReport,
  getDate,
  getAllSG,
  scraperCanRun
} = require('./utils/utils.js')
const {
  createSiteGenerators,
  updateScraper
} = require('./utils/query.js')

let report = {}
let errors = []
let newSG = 0

const scrape = async (allSG, scraper) => {
  const url = "https://jamstack.org/generators/"
  const browser = await puppeteer.launch({ args: ['--no-sandbox', '--disable-setuid-sandbox'] });
  const page = await browser.newPage();

  try {
    await page.goto(url)
  } catch (e) {
    console.log(`${chalk.red("Error")}: (${url})`);
    errors.push({
      context: "Page navigation",
      url: url,
      date: await getDate()
    })
    await browser.close();
    return
  }

  // Wait for the generator cards to render, then select them all
  const expression = "//div[@class='generator-card flex flex-col h-full']"
  await page.waitForXPath(expression, { timeout: 3000 })
  const elements = await page.$x(expression);

  for (const element of elements) {
    const card = await page.evaluate(el => el.innerHTML, element);
    const $ = cheerio.load(card)
    const name = $('.text-xl').text().trim() || null;

    // Skip this card if the site generator is already in the database
    if (allSG.includes(name))
      continue;

    const stars = $('span:contains("stars")').parent().text().replace("stars", "").trim() || null;
    const forks = $('span:contains("forks")').parent().text().replace("forks", "").trim() || null;
    const issues = $('span:contains("issues")').parent().text().replace("issues", "").trim() || null;
    const description = $('.text-sm.mb-4').text().trim() || null;
    const language = $('dt:contains("Language:")').next().text().trim() || null;
    const template = $('dt:contains("Templates:")').next().text().trim() || null;
    const license = $('dt:contains("License:")').next().text().trim() || null;
    const deployLink = $('a:contains("Deploy")').attr('href') || null;

    await createSiteGenerators(
      name,
      stars,
      forks,
      issues,
      description,
      language,
      template,
      license,
      deployLink,
      scraper
    )
    newSG += 1;
  }

  await page.close()
  await browser.close();
}

const main = async () => {
  // Fetch the correct scraper thanks to the slug
  const slug = "jamstack-org"
  const scraper = await strapi.query('scraper').findOne({
    slug: slug
  });

  // If the scraper doesn't exist, is disabled, or doesn't have a frequency, then we do nothing
  if (scraper == null || !scraper.enabled || !scraper.frequency){
    console.log(`${chalk.red("Exit")}: (Your scraper may not exist, is not activated, or does not have a frequency field filled in)`);
    return
  }

  const canRun = await scraperCanRun(scraper);
  if (canRun && scraper.enabled){
    const allSG = await getAllSG(scraper)
    await scrape(allSG, scraper)
    report = await getReport(newSG);
    await updateScraper(scraper, report, errors)
  }
}

exports.main = main;
```
Again, by saving your file you should have this message:

```bash
Exit: (Your scraper may not exist, is not activated, or does not have a frequency field filled in)
```
Now set the frequency of your scraper and remove any value that may be in the next_execution_at field. For this tutorial, we'll set a one-minute frequency to quickly see the result. Then you simply need to enable it.

Revert your ./config/functions/bootstrap.js file to its default value:
```js
'use strict';

/**
 * An asynchronous bootstrap function that runs before
 * your application gets started.
 *
 * This gives you an opportunity to set up your data model,
 * run jobs, or perform some special logic.
 *
 * See more details here: https://strapi.io/documentation/developer-docs/latest/concepts/configurations.html#bootstrap
 */

module.exports = () => {};
```
Update your ./config/functions/cron.js file with the following code:

```js
'use strict';

/**
 * Cron config that gives you an opportunity
 * to run scheduled jobs.
 *
 * The cron format consists of:
 * [SECOND (optional)] [MINUTE] [HOUR] [DAY OF MONTH] [MONTH OF YEAR] [DAY OF WEEK]
 *
 * See more details here: https://strapi.io/documentation/developer-docs/latest/concepts/configurations.html#cron-tasks
 */

const jamstack = require('../../scripts/scrapers/jamstack.js')

module.exports = {
  '* * * * *': () => {
    jamstack.main()
  }
};
```
Update your ./config/server.js file with the following to enable the cron:

```js
module.exports = ({ env }) => ({
  host: env('HOST', '0.0.0.0'),
  port: env.int('PORT', 1337),
  admin: {
    auth: {
      secret: env('ADMIN_JWT_SECRET', 'ea8735ca1e3318a64b96e79cd093cd2c'),
    },
  },
  cron: {
    enabled: true,
  }
});
```
Save your file!
Now, here is what will happen: at the next minute, the scraper will fill your next_execution_at value according to your frequency (here, the following minute). One minute later, that next_execution_at timestamp will be compared to the current one; since it will be less than or equal to it, your scraper will be executed, and a new next_execution_at timestamp will be set to the next minute, according to this frequency.
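To make the timing concrete, here is roughly the sequence you should observe with a one-minute frequency (the clock times are illustrative):

```js
// Illustrative timeline for a "* * * * *" frequency (hypothetical times):
// 12:00: cron fires; next_execution_at is empty, so it is set to 12:01 and the scraper does not run
// 12:01: cron fires; next_execution_at (12:01) <= now, so the scraper runs
//        and next_execution_at is pushed to 12:02
// 12:02: and so on, every minute
```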
Well, I think it's over for this quick and simple tutorial!
Web scraping can present various challenges, such as changes in website structure, anti-scraping measures, or network issues. It's important to handle these scenarios to ensure the reliability of your scraper.
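The tutorial's code already records navigation failures in the errors array; if you need more resilience, a small retry helper is a common pattern. Here is a minimal sketch (the withRetries name and the delays are illustrative, not part of the tutorial's code):

```js
// Minimal retry sketch: retry an async operation a few times with a growing delay.
const withRetries = async (fn, attempts = 3) => {
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (e) {
      if (i === attempts - 1) throw e; // out of attempts, surface the error
      await new Promise((resolve) => setTimeout(resolve, 1000 * (i + 1)));
    }
  }
};

// Possible usage in the scrape function:
// await withRetries(() => page.goto(url));
```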
Web scraping allows you to collect vast amounts of data from websites automatically. Before deciding if it's the right approach for your needs, weigh its advantages and disadvantages carefully.

Incorporating scraped data into your content management system can enhance the richness of your content; however, it may also introduce challenges such as the ones mentioned above.
Integrating web scraping with Strapi opens up new possibilities for automating data collection and content management. By following this tutorial, you can set up a scraper that collects data from external websites and manages it within Strapi, enhancing your application's capabilities.
Remember to consider the legal and ethical implications of web scraping. Always respect websites' terms of service and privacy policies.
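For instance, you can check a site's robots.txt before scraping it. This is a deliberately naive sketch (the isPathDisallowed helper is hypothetical and assumes Node 18+ for the global fetch; a real implementation should use a proper robots.txt parser):

```js
// Naive robots.txt check: returns true if any Disallow rule prefixes the path.
const isPathDisallowed = async (origin, path) => {
  const res = await fetch(`${origin}/robots.txt`);
  const robots = await res.text();
  return robots
    .split('\n')
    .filter((line) => line.toLowerCase().startsWith('disallow:'))
    .map((line) => line.slice('disallow:'.length).trim())
    .some((rule) => rule && path.startsWith(rule));
};

// Usage: await isPathDisallowed('https://jamstack.org', '/generators/')
```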
Get started with Strapi by creating a project using a starter or trying our live demo. Also, consult our forum if you have any questions. We will be there to help you.
See you soon!