At Mercado Libre, we operate the largest e-commerce ecosystem in Latin America. Pricing intelligence, catalog enrichment, assortment benchmarking, and competitive monitoring are part of our daily modeling workflows.
Even though Amazon does not operate uniformly across LATAM markets, it remains the global reference point for:
For several of our internal analytics pipelines, we needed high-frequency, structured Amazon product data across multiple marketplaces (US, UK, DE, JP).
So we built our own scraper fleet.
It worked.
Until it didn’t.
Over two years, our team maintained:
The architecture:
We logged:
After 14 months of telemetry, one pattern stood out:
Selectors broke every 11–16 days (median: ~14 days).
This is not an official Amazon statistic. It reflects our monitored pages. But the pattern was consistent.
Why?
Because Amazon’s product pages are not static. They are continuously reshaped by:
This worked for two weeks:

```python
price = soup.select_one('.a-price-whole')
```

Then overnight it returned `None`.

The DOM changed.
`.a-offscreen` moved under a new wrapper container.
Flat selectors stopped matching.
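To make that failure mode concrete, here is a toy reproduction of what happens to a flat lookup when a wrapper appears. The markup is simplified and hypothetical; only the `.a-offscreen` class name comes from the real pages:

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical markup: the same price span, before and after
# a new wrapper div is inserted around it.
before = '<div id="price"><span class="a-offscreen">$19.99</span></div>'
after = ('<div id="price"><div class="wrapper">'
         '<span class="a-offscreen">$19.99</span></div></div>')

def direct_child(root):
    # "Flat" lookup: matches only one level below the root
    return root.find("span[@class='a-offscreen']")

def any_descendant(root):
    # Depth-agnostic lookup: survives wrapper insertion
    return root.find(".//span[@class='a-offscreen']")

old, new = ET.fromstring(before), ET.fromstring(after)
print(direct_child(old) is not None)   # True
print(direct_child(new) is not None)   # False: the wrapper broke the flat path
print(any_descendant(new).text)        # $19.99
```

The data never left the page; only the path to it changed.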
Selenium logs at 2AM:

```
NoSuchElementException: Unable to locate element: .a-price-whole
```

The fix? 30 minutes.
The real cost?
4–6 hours:
Multiply by product, search, category, reviews, seller pages.
We were burning 80+ engineering hours per month just keeping extraction alive.
Most scraping tutorials teach you how to extract data.
Few explain why that extraction will fail.
Here’s what we learned.
Amazon mixes:

- stable, semantic class names (`a-price-whole`)
- auto-generated, hashed class names (`_cDEzb_p13n-sc-css-line-clamp-1_1Fn1y`)

From our logs:
There is no such thing as a 30-day-stable selector.
A single product page can render in 7+ structural formats.
Price might appear inside:
- `#corePrice_feature_div`
- `#apex_offerDisplay_desktop`

We maintained 4–6 fallback paths per data field.
New variants appeared monthly.
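The fallback-path pattern itself is simple to sketch: try known selector variants in priority order and take the first match. The selector list below is illustrative, not a complete inventory of Amazon’s price variants:

```python
# Hypothetical fallback chain for one field. The selectors are
# illustrative stand-ins, ordered from most to least specific.
PRICE_SELECTORS = [
    "#corePrice_feature_div .a-offscreen",
    "#apex_offerDisplay_desktop .a-offscreen",
    ".a-price .a-offscreen",
    ".a-price-whole",
]

def first_match(soup, selectors):
    """Return the text of the first selector that matches, else None."""
    for sel in selectors:
        node = soup.select_one(sel)  # any BeautifulSoup-like object works
        if node is not None:
            return node.get_text(strip=True)
    return None
```

Every new page variant means appending another selector, which is exactly the maintenance treadmill described above.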
Amazon deploys layered defenses:
Our block rate climbed:
8% → 27% in six months
Same proxy pool.
The real danger wasn’t hard blocks.
It was silent garbage ingestion.
You must validate every response against an expected schema.
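A minimal version of that validation, sketched with illustrative field names, rejects records that are missing required fields or carry the wrong types, so blocked or CAPTCHA pages never flow downstream as silent garbage:

```python
# Expected shape of one scraped record (field names are illustrative).
EXPECTED_SCHEMA = {
    "asin": str,
    "title": str,
    "price": (int, float),
    "rating": (int, float),
}

def validate_record(record):
    """Return a list of schema violations; an empty list means the record passes."""
    errors = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record or record[field] is None:
            errors.append(f"missing: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"bad type: {field}={record[field]!r}")
    return errors

good = {"asin": "B0CX23V2ZK", "title": "Widget", "price": 19.99, "rating": 4.5}
bad = {"asin": "B0CX23V2ZK", "title": None, "price": "N/A", "rating": 4.5}
print(validate_record(good))  # []
print(validate_record(bad))   # ['missing: title', "bad type: price='N/A'"]
```

Records that fail go to a quarantine queue for inspection instead of into the warehouse.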
~40% of critical signals were JS-injected:
Scrapy alone missed half the data.
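Not every “JS-injected” field truly requires a browser: some values ship inside the raw HTML as embedded JSON that client-side scripts render later. A hypothetical sketch, with invented markup and field names, of recovering such a value without rendering:

```python
import json
import re

# Hypothetical page fragment: the deal price is already in the HTML as
# embedded JSON, even though it only appears on screen after JS runs.
html = '''
<html><body>
<script type="application/json" id="deal-state">
{"dealPrice": 17.49, "couponPercent": 10}
</script>
</body></html>
'''

match = re.search(
    r'<script type="application/json" id="deal-state">\s*(\{.*?\})\s*</script>',
    html,
    re.DOTALL,
)
state = json.loads(match.group(1)) if match else {}
print(state.get("dealPrice"))  # 17.49
```

When the value genuinely does not exist until scripts execute, though, only a real browser (or a managed renderer) can produce it.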
Headless browsers increased compute costs 5–8x.
And introduced:
Annualized:
This excludes opportunity cost.
During 14 months of maintenance:
For larger fleets (50K+ ASINs), peer teams report $150K–$300K annually.
We tested three approaches:
We needed:
We decided to evaluate Bright Data’s Amazon Scraper API based on recommendations from several of our worldwide partners.
```python
import requests
import time

API_KEY = "YOUR_API_KEY"
DATASET_ID = "gd_l7q7dkf244hwjntr0"  # your scraper's dataset_id
ASIN = "B0CX23V2ZK"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}

# 1) Trigger the scraper
trigger_url = f"https://api.brightdata.com/datasets/v3/trigger?dataset_id={DATASET_ID}"
payload = [
    {
        "url": f"https://www.amazon.com/dp/{ASIN}"
        # or "asin": ASIN  # depends on the scraper's input schema
    }
]
trigger_resp = requests.post(trigger_url, headers=headers, json=payload)
trigger_resp.raise_for_status()
snapshot_id = trigger_resp.json()["snapshot_id"]
print("Snapshot:", snapshot_id)

# 2) Poll snapshot status
snapshot_url = f"https://api.brightdata.com/datasets/v3/snapshot/{snapshot_id}"
while True:
    status_resp = requests.get(snapshot_url, headers=headers)
    status_resp.raise_for_status()
    data = status_resp.json()
    status = data["status"]
    print("Status:", status)
    if status == "ready":
        break
    if status == "failed":
        raise RuntimeError("Collection failed")
    time.sleep(5)

# 3) Download results as JSON
download_resp = requests.get(f"{snapshot_url}?format=json", headers=headers)
download_resp.raise_for_status()
products = download_resp.json()
print("Got", len(products), "records")
print(products[0])  # should include title, price, rating, reviews, etc.
```
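One caveat: a bare `while True` poll loop will spin forever if a snapshot never reaches a terminal state. A small poll-with-deadline helper closes that hole; the names and timings here are our own choices, not part of the API:

```python
import time

def poll_until(check, timeout_s=300, interval_s=5):
    """Call check() until it returns a truthy value or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        result = check()
        if result:
            return result  # first truthy value wins
        time.sleep(interval_s)
    raise TimeoutError(f"no result within {timeout_s}s")
```

The polling step above then becomes something like `poll_until(lambda: get_ready_snapshot(snapshot_id), timeout_s=600)`, where `get_ready_snapshot` is a hypothetical wrapper that returns the snapshot data once ready and `None` otherwise.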
We send an ASIN. We receive structured JSON.
No selector maintenance.
No browser cluster.
No proxy management.
The key differentiator wasn’t just proxy management or higher success rates.
It was auto-healing extraction logic.
In a traditional scraper architecture, the extraction layer is tightly coupled to the DOM. When Amazon:
…your selectors fail.
Even if the data is still present on the page, your parser no longer knows where to look.
In a DIY system, that triggers a predictable cycle:
This is the 4–6 hour “fix” that repeats every couple of weeks.
With an auto-healing managed scraper, that maintenance loop moves upstream.
When the DOM changes:
From our pipeline’s perspective, nothing changes.
The API contract stays identical:
The important nuance here is not just that selectors are updated — it’s that the abstraction layer is preserved.
Our downstream systems (pricing models, feature engineering pipelines, dashboards) depend on schema stability. If a field disappears or shifts type, it can cascade into:
Auto-healing essentially decouples our analytics layer from Amazon’s front-end volatility.
Instead of reacting to DOM changes, we stopped caring about them.
And for a data science team whose goal is to build models — not maintain parsers — that architectural separation made all the difference.
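Concretely, that separation can be as simple as binding downstream code to a fixed record type rather than to any page structure. A sketch with illustrative field names:

```python
from dataclasses import dataclass

# Downstream code depends on this shape (fields are illustrative),
# never on a DOM structure.
@dataclass
class ProductRecord:
    asin: str
    title: str
    price: float
    rating: float

def from_api(raw: dict) -> ProductRecord:
    # The only adapter we'd maintain; DOM changes never reach this layer.
    return ProductRecord(
        asin=raw["asin"],
        title=raw["title"],
        price=float(raw["price"]),
        rating=float(raw["rating"]),
    )
```

As long as the API contract holds, pricing models and dashboards built on `ProductRecord` never see front-end churn.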
After migration, we finally shipped:

- the repricing engine
- the trend dashboard
- the prediction model

All three had been blocked for over a year.
The migration took less than a day.
We replaced:
With a single endpoint.
Not every team should switch to managed scraping. Managed scraping APIs make sense when:
DIY in-house scraping still works if:
In the three months after we migrated to Bright Data’s Amazon Scraper API, our team shipped all three of those backlogged features: the repricing engine, the trend dashboard, and the prediction model. All three had been blocked for over a year because scraper maintenance consumed the engineering bandwidth required to build them.
Amazon scraping does not fail because engineers are careless.
It fails because Amazon’s front-end evolves continuously.
When your analytics pipeline depends on unstable DOM structures, maintenance becomes a permanent tax on innovation.
For us, the tipping point wasn’t cost.
It was opportunity cost.
And once we measured it, the decision became straightforward.