Submitted by Arnout

A practical and ethical perspective on data scraping — with a concrete example from De Gouden Gids.

Today, web scraping is a powerful way to collect large amounts of publicly available information quickly and reliably. Yet many misconceptions surround it: what is allowed, what is not, and where should the line be drawn? In this article, I explain the principles, describe how I help organizations, and show, through a case study involving De Gouden Gids, why ethics are essential.

🧩 What is web scraping?

Web scraping means using a script to automatically read information from web pages, such as company details, product information, job listings, or geographic data. It saves time, reduces errors, and makes processes repeatable. Scraping is not hacking, however: it must stay within the boundaries of technology, legislation, and common sense.
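At its core, scraping is just parsing: fetch a page, then pull structured values out of the markup. A minimal sketch using only the Python standard library, run here on an invented HTML fragment (the class names and company data are made up for illustration):

```python
from html.parser import HTMLParser

# Minimal sketch: collect the text of elements carrying a target class.
# Real pages need real selectors; this fragment is invented.
class CompanyListingParser(HTMLParser):
    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.capture = False
        self.results = []

    def handle_starttag(self, tag, attrs):
        if ("class", self.target_class) in attrs:
            self.capture = True

    def handle_data(self, data):
        if self.capture and data.strip():
            self.results.append(data.strip())
            self.capture = False

html = """
<div class="listing"><span class="name">Acme BV</span>
<span class="address">Main Street 1, Ghent</span></div>
<div class="listing"><span class="name">Widget NV</span>
<span class="address">Station Road 5, Antwerp</span></div>
"""

parser = CompanyListingParser("name")
parser.feed(html)
print(parser.results)  # ['Acme BV', 'Widget NV']
```

In practice you would fetch the page with an HTTP client and a proper user agent, but the extraction step is no more than this: matching elements and reading their text.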

📘 Case study: De Gouden Gids

De Gouden Gids is a good example of why ethics in scraping matter. On the surface, their data seems easy to collect: via Inspect Element (F12), you can see that company listings are loaded through an internal API that returns JSON. However, this API is not publicly accessible and is intended solely for internal use by their own website. This is a first indication that, although the data is publicly visible, the organization does not want it harvested in bulk.

1. A first test with Puppeteer

With a simple Node.js script using Puppeteer, we can still retrieve the data by reading which values (such as address and email) appear in the elements with the relevant IDs. This works perfectly for a few requests: results load correctly and all elements appear as expected.
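Puppeteer drives a real browser, but the extraction step itself is simple: look up elements by ID and read their text. The same logic, sketched offline in Python (the IDs and values here are hypothetical, not De Gouden Gids' actual markup):

```python
from html.parser import HTMLParser

# Sketch: read the text of elements identified by hypothetical IDs.
class IdFieldParser(HTMLParser):
    def __init__(self, wanted_ids):
        super().__init__()
        self.wanted = set(wanted_ids)
        self.current = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attr_id = dict(attrs).get("id")
        if attr_id in self.wanted:
            self.current = attr_id

    def handle_data(self, data):
        if self.current and data.strip():
            self.fields[self.current] = data.strip()
            self.current = None

snippet = '<p id="address">Market Square 2, Bruges</p><p id="email">info@example.com</p>'
p = IdFieldParser(["address", "email"])
p.feed(snippet)
print(p.fields)
# {'address': 'Market Square 2, Bruges', 'email': 'info@example.com'}
```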

📷 Image 1 – first successful scraping test

After only a few requests, however, the anti-bot firewall intervenes and a clear message appears: the website detects automated behavior and blocks access.

Technically, this can be bypassed, for example by adding delays, simulated scrolling, or synthetic input events to mimic human behavior and evade anti-bot detection. However, it should be obvious that I do not help build scraping tools in such cases. For me, this is where it stops.

📷 Image 2 – De Gouden Gids bot protection page

2. Yes, it can be bypassed… but we don’t do that

From a technical standpoint, it is entirely possible to circumvent such blocks using user-agent rotation, delayed requests, proxies, or headless browser masking. But when a website explicitly signals that scraping is unwanted, you must stop. That is the only correct choice, both legally and professionally.

🔒 Where the line is drawn

1. Is the data intended to be publicly visible?
If yes, some use may be acceptable.

2. Is scraping actively blocked?
If so, it stops immediately.

3. Does it require bypassing security measures?
Then I don’t do it.

✔ What I help with

I support organizations in safely and legally collecting data that is intended to be publicly accessible, including:

  • Automating scraping within legal and technical limits
  • Respecting robots.txt, rate limits, and terms of use
  • Setting up secure ETL workflows (Extract – Transform – Load)
  • Integrating open data (Flanders, Statbel, INSPIRE, etc.)
  • Automating repetitive data tasks using Python or Node.js
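Respecting robots.txt does not require any external tooling: the Python standard library can parse the rules and report the crawl delay before a single request is made. A sketch against an invented robots.txt (the paths and delay are made up):

```python
import urllib.robotparser

# Invented robots.txt content for illustration.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Check permissions before fetching, and honor the requested delay.
print(rp.can_fetch("*", "https://example.com/listings"))      # True
print(rp.can_fetch("*", "https://example.com/private/data"))  # False
print(rp.crawl_delay("*"))                                    # 5
```

In a real scraper you would load the live file with `rp.set_url(...)` and `rp.read()`, then sleep for the crawl delay between requests.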

❌ What I don’t do

Some websites make their data visible but do not allow it to be harvested in bulk. They protect themselves with firewalls, anti-bot systems, or restricted APIs. When a company clearly indicates that scraping is unwanted, that’s where it ends.

  • No scripts that attempt to bypass security or anti-bot systems
  • No use of proxies, captcha bypasses, or headless detection workarounds
  • No scraping of data that is not intended for bulk access

🤝 How I can help

When scraping is allowed, I can create significant time savings by automating processes: harvesting open data, building integrations, designing ETL flows, feeding dashboards, processing geographic datasets, and more.
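The ETL pattern mentioned above fits in a few lines of Python. A minimal sketch with invented field names and data, using an in-memory CSV as the source and in-memory SQLite as the target:

```python
import csv
import io
import sqlite3

# Extract: read raw rows (here from an in-memory CSV; in practice,
# an open-data export or an API response).
raw = "name,city\nAcme BV,Ghent\nWidget NV,Antwerp\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: normalize values.
for row in rows:
    row["name"] = row["name"].strip().upper()

# Load: write into a database (in-memory SQLite for the sketch).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE companies (name TEXT, city TEXT)")
db.executemany("INSERT INTO companies VALUES (:name, :city)", rows)

print(db.execute("SELECT name FROM companies ORDER BY name").fetchall())
# [('ACME BV',), ('WIDGET NV',)]
```

The same three stages scale up to real pipelines: swap the CSV for an API or open-data feed, and SQLite for your warehouse or dashboard backend.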

You get a solution that is secure, scalable, and legally sound, fully tailored to your organization.

Want to learn more about legal and efficient scraping?

Curious about what’s possible within the boundaries of technology and legislation? I’d be happy to help.

Get in touch →