Data Scraping Procedure
1. Concepts & Terminology
| Term | Meaning |
|---|---|
| Scraping Batch (scrapingbatch) | The top-level orchestration record. One batch = one scrape job + multiple refine (enrichment) jobs. |
| Scraping Job (scrapingjob) | Phase 1 job. Fetches a raw list of vehicle listings from a marketplace (Craigslist or Facebook Marketplace). One per batch. |
| Raw Listing (rawvehiclelisting) | Minimal data captured in Phase 1: title, price, URL, thumbnail, location, mileage, year, make, model. Globally deduplicated by (sourceplatform, postid). |
| Refine Job (refinejob) | Phase 2 job. Deep-scrapes a single listing URL and enriches it with AI extraction. One refine job is created per unique URL that needs enrichment after Phase 1. |
| Refined Listing (refinedvehiclelisting) | Full structured vehicle record produced after Phase 2. Linked 1-to-1 with a raw listing. |
| Dealer Data Mapping (dealerdatamapping) | Links a dealer to a refined listing. Drives billing and export. |
| Crawler | Receives scraping requests via RabbitMQ and performs the actual browser-based scraping using Crawl4AI + Playwright. |
| ManagementService | .NET ASP.NET Core service. Owns the orchestration logic, DB writes, and queue management. |
2. Status Enums
ScrapingBatchStatus
| Value | Meaning |
|---|---|
| Queued | Created, not yet dispatched to crawler. |
| Fetching | Phase 1 in progress: raw scraping job is running. |
| Enriching | Phase 1 complete: refine jobs are being processed. |
| Completed | All phases done, no failures. |
| PartiallyCompleted | Some jobs succeeded, some failed. |
| Failed | All jobs failed. |
| Cancelled | Cancelled by user or admin before reaching a terminal state. |
ScrapingJobStatus (Phase 1)
| Value | Meaning |
|---|---|
| Queued | Created, not yet dispatched to crawler. |
| InProgress | Message published to scraping-job queue; crawler is working. |
| Completed | Raw listings received and processed. |
| Failed | Crawler returned failure or retry budget exhausted. |
| Cancelled | Batch was cancelled before or during this job. |
RefineJobStatus (Phase 2)
| Value | Meaning |
|---|---|
| Queued | Created and published to the batch queue; awaiting crawler. |
| InProgress | Crawler has acknowledged it and is working. |
| Completed | Enrichment succeeded. |
| PartialCompleted | Legacy status, treated as done during finalization. |
| Failed | Enrichment failed after all retries. |
| Cancelled | Batch was cancelled; refine job will not run. |
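
The status tables above imply a finalization rule: once every refine job is terminal, the batch becomes Completed, PartiallyCompleted, or Failed. The following is a minimal Python sketch of that rule, not the ManagementService implementation. The function name `derive_batch_status`, the choice to report Enriching while any job is still pending, and the treatment of an empty job list as Completed are all assumptions made for illustration.

```python
from enum import Enum


class RefineJobStatus(Enum):
    QUEUED = "Queued"
    IN_PROGRESS = "InProgress"
    COMPLETED = "Completed"
    PARTIAL_COMPLETED = "PartialCompleted"  # legacy, counted as done
    FAILED = "Failed"
    CANCELLED = "Cancelled"


# Statuses counted as successful during finalization (per the table above).
DONE = {RefineJobStatus.COMPLETED, RefineJobStatus.PARTIAL_COMPLETED}
PENDING = {RefineJobStatus.QUEUED, RefineJobStatus.IN_PROGRESS}


def derive_batch_status(jobs, is_cancelled=False):
    """Return a ScrapingBatchStatus name for a list of RefineJobStatus values.

    Assumptions: a cancelled batch always reports "Cancelled", and a batch
    with any pending refine job is still "Enriching".
    """
    if is_cancelled:
        return "Cancelled"
    if any(job in PENDING for job in jobs):
        return "Enriching"
    succeeded = sum(1 for job in jobs if job in DONE)
    failed = sum(1 for job in jobs if job is RefineJobStatus.FAILED)
    if failed == 0:
        return "Completed"  # also covers the empty list (assumption)
    if succeeded == 0:
        return "Failed"
    return "PartiallyCompleted"
```

Counting PartialCompleted alongside Completed mirrors the "treated as done during finalization" note in the RefineJobStatus table.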
RefineResult (per listing outcome)
| Value | Meaning |
|---|---|
| Inserted | New refined listing was created. |
| Updated | Existing refined listing was updated (data changed). |
| Skipped | Raw data unchanged; no new refined listing needed (dealer mapping only). |
3. Database Tables
scrapingbatch
├─ scrapingbatchid (PK)
├─ dealerid (FK → dealermaster)
├─ sourceid, dealerscrapingsourcesid
├─ status, scrapemode (Manual / Scheduled)
├─ iscancelled, cancelledat, cancelledby
├─ requestedat, rawcompletedat, refinecompletedat
└─ scrapecode (unique display code)
scrapingjob
├─ scrapingjobid (PK)
├─ scrapingbatchid (FK → scrapingbatch)
├─ status, retrycount
├─ isrefineneeded, totalrawdata
├─ newrawdatacount, updatedrawdatacount, skippedrawdatacount
├─ totalrefinerequested, completedrefinecount, failedrefinecount
└─ requestedat
refinejob
├─ refinejobid (PK)
├─ scrapingjobid (FK → scrapingjob)
├─ scrapingbatchid (FK → scrapingbatch)
├─ rawlistingid (FK → rawvehiclelisting)
├─ url
├─ status, retrycount
└─ requestedat
rawvehiclelisting
├─ rawlistingid (PK)
├─ (sourceplatform, postid) — UNIQUE dedup key
├─ title, price, location, mileage, year, make, model, thumbnail, url
├─ firstseenjobid, lastseenjobid
└─ scrapedat
refinedvehiclelisting
├─ refinedlistingid (PK)
├─ rawlistingid (FK → rawvehiclelisting, UNIQUE)
├─ refinejobid (FK → refinejob)
├─ attributes (JSONB — all AI-extracted vehicle fields)
├─ sellertype (private / dealer)
├─ refineresult (Inserted / Updated / Skipped)
└─ createdat, updatedat
dealerdatamapping
├─ dealerid + refinedlistingid — composite PK
├─ scrapingjobid
└─ isexported
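
The (sourceplatform, postid) dedup key and the newrawdatacount / updatedrawdatacount / skippedrawdatacount columns suggest an upsert that classifies each incoming listing. The sketch below illustrates that idea with an in-memory SQLite database as a stand-in for the production database. The helper name `upsert_raw_listings`, the reduced column set, and the rule that a listing counts as "updated" when its title or price differs are assumptions, not the actual ManagementService logic.

```python
import sqlite3


def upsert_raw_listings(conn, job_id, listings):
    """Insert or update raw listings; return (new, updated, skipped) counts."""
    new = updated = skipped = 0
    for item in listings:
        key = (item["sourceplatform"], item["postid"])
        row = conn.execute(
            "SELECT title, price FROM rawvehiclelisting "
            "WHERE sourceplatform = ? AND postid = ?",
            key,
        ).fetchone()
        if row is None:
            # First time this (sourceplatform, postid) pair is seen.
            conn.execute(
                "INSERT INTO rawvehiclelisting "
                "(sourceplatform, postid, title, price, firstseenjobid, lastseenjobid) "
                "VALUES (?, ?, ?, ?, ?, ?)",
                (*key, item["title"], item["price"], job_id, job_id),
            )
            new += 1
        elif row != (item["title"], item["price"]):
            # Known listing whose tracked fields changed.
            conn.execute(
                "UPDATE rawvehiclelisting SET title = ?, price = ?, lastseenjobid = ? "
                "WHERE sourceplatform = ? AND postid = ?",
                (item["title"], item["price"], job_id, *key),
            )
            updated += 1
        else:
            # Unchanged listing: only record that this job saw it again.
            conn.execute(
                "UPDATE rawvehiclelisting SET lastseenjobid = ? "
                "WHERE sourceplatform = ? AND postid = ?",
                (job_id, *key),
            )
            skipped += 1
    conn.commit()
    return new, updated, skipped


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE rawvehiclelisting (
        rawlistingid   INTEGER PRIMARY KEY,
        sourceplatform TEXT NOT NULL,
        postid         TEXT NOT NULL,
        title          TEXT,
        price          INTEGER,
        firstseenjobid INTEGER,
        lastseenjobid  INTEGER,
        UNIQUE (sourceplatform, postid)
    )
    """
)
```

The UNIQUE constraint is what makes the dedup global: a second job that sees the same post updates the existing row instead of creating a new one.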
4. Queue Architecture
Current Architecture (production)
RabbitMQ Queues / Exchanges
│
├── batch-scraping-topic [Topic Exchange]
│   └── {batchId} [Durable queue, one per active batch]
│       Routing key: batch.{batchId}
│       Contains: CrawlFilterRequested (Phase 1) + CrawlRequested (Phase 2)
│
├── job-registry [Durable queue, direct]
│   Contains: BatchQueueRegistered notifications (batchId, queueName)
│   Publisher: ManagementService (on batch creation)
│   Subscriber: Not yet consumed (planned for future dynamic subscription)
│
├── crawler-crawl-requests [Fanout, FastStream subscriber]
│   Contains: CrawlRequested (Phase 2 enrichment)
│
├── crawler-crawl-filter-requests [Fanout, FastStream subscriber]
│   Contains: CrawlFilterRequested (Phase 1 raw scraping)
│
└── crawl_status_data [Fanout, MassTransit subscriber]
    Contains: CrawlStatusData responses from crawler → ManagementService
Note: In the current implementation, Phase 1 and Phase 2 messages are both routed through the per-batch {batchId} queue on the .NET side, but the crawler subscribes to the generic fanout queues (crawler-crawl-filter-requests, crawler-crawl-requests). The batch queue exists to isolate and drop messages on cancellation. The job-registry queue is published to but not yet consumed by the crawler.
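
To make the isolation-and-drop behaviour of the per-batch queue concrete, here is a small in-memory model in Python. It is not a RabbitMQ client and does not reflect how ManagementService talks to the broker. The class `BatchTopicExchange` and its methods are hypothetical, and modelling cancellation as deleting the queue is an assumption, since the text above does not say how messages are dropped.

```python
from collections import deque


class BatchTopicExchange:
    """In-memory stand-in for batch-scraping-topic (not a RabbitMQ client)."""

    def __init__(self):
        self._queues = {}  # routing key -> deque of pending messages

    @staticmethod
    def routing_key(batch_id):
        # Matches the documented routing key format: batch.{batchId}
        return f"batch.{batch_id}"

    def register_batch(self, batch_id):
        """Create the per-batch queue and bind it to its routing key."""
        self._queues.setdefault(self.routing_key(batch_id), deque())

    def publish(self, batch_id, message):
        """Route a message; return False if no queue is bound (message dropped)."""
        queue = self._queues.get(self.routing_key(batch_id))
        if queue is None:
            return False
        queue.append(message)
        return True

    def cancel_batch(self, batch_id):
        """Delete the batch queue; return how many pending messages were dropped."""
        queue = self._queues.pop(self.routing_key(batch_id), None)
        return len(queue) if queue is not None else 0
```

Because each batch has its own queue, cancelling one batch discards only that batch's pending Phase 1 and Phase 2 messages and leaves other batches untouched.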
Planned Architecture (in design)
scraping-job [Durable queue — shared, all Phase 1 jobs]
job-registry [Durable queue — only batchIds with active refine work]
{batchId} queue [Durable queue — per-batch, Phase 2 only; created lazily after Phase 1 success]
See the architecture redesign plan for full details.
5. Message Contracts
All contracts live in ContractLibrary/Messages/.
CrawlFilterRequested — Phase 1 request
Published by: ScrapingOrchestrationService.ExecuteScrapingPipeline
Consumed by: Python crawler handle_crawl_filter_requested
{
"provider": "cgl",
"location_id": null,
"radius": 50,
"zip_code": 90210,
"min_price": 5000,
"max_price": 30000,
"min_year": 2015,
"max_year": 2024,
"min_miles": null,
"max_miles": 80000,
"makes": ["Toyota", "Honda"]
}
The AMQP MessageId property = scrapingJobId (used as dedup ID and correlation ID).
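
As an illustration of the CrawlFilterRequested shape, the sketch below models the payload as a Python dataclass and pairs the serialized body with the message ID. It is not the contract class from ContractLibrary. The field types are inferred from the sample JSON, `location_id` being a string is a guess because the sample only shows null, and the `validate` and `build_filter_message` helpers are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class CrawlFilterRequested:
    provider: str
    location_id: Optional[str] = None  # assumed type; the sample only shows null
    radius: Optional[int] = None
    zip_code: Optional[int] = None
    min_price: Optional[int] = None
    max_price: Optional[int] = None
    min_year: Optional[int] = None
    max_year: Optional[int] = None
    min_miles: Optional[int] = None
    max_miles: Optional[int] = None
    makes: List[str] = field(default_factory=list)

    def validate(self):
        """Reject ranges whose lower bound exceeds the upper bound."""
        for name in ("price", "year", "miles"):
            low = getattr(self, f"min_{name}")
            high = getattr(self, f"max_{name}")
            if low is not None and high is not None and low > high:
                raise ValueError(f"min_{name} must not exceed max_{name}")


def build_filter_message(scraping_job_id, request):
    """Return the message ID and JSON body for a Phase 1 request."""
    request.validate()
    return {
        "message_id": str(scraping_job_id),  # AMQP MessageId = scrapingJobId
        "body": json.dumps(asdict(request)),
    }
```

Keeping the ID outside the body follows the note above that scrapingJobId travels in the AMQP MessageId property rather than in the JSON payload.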
CrawlRequested — Phase 2 request
Published by: ScrapingOrchestrationService.ProcessRawBatch
Consumed by: Python crawler handle_crawl_requested
{
"id": "<refineJobId>",
"url": "https://craigslist.org/cto/d/listing/123",
"urls": null,
"priority": 0,
"requestedAt": "2026-06-03T12:00:00Z"
}
The AMQP MessageId = refineJobId.
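
Since one refine job is created per unique URL, the Phase 2 fan-out can be pictured as deduplicating the URLs and emitting one CrawlRequested per survivor. The Python sketch below shows that idea only. The function `build_crawl_requests` is hypothetical, UUIDs are used as a placeholder ID scheme, and the real ProcessRawBatch logic (including which URLs "need enrichment") is not reproduced here.

```python
import uuid
from datetime import datetime, timezone


def build_crawl_requests(urls, now=None):
    """Return one CrawlRequested-shaped message per unique URL, in first-seen order.

    `now` should be a timezone-aware datetime; it defaults to the current UTC time.
    """
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    stamp = moment.strftime("%Y-%m-%dT%H:%M:%SZ")
    messages = []
    for url in dict.fromkeys(urls):  # drops duplicates while preserving order
        refine_job_id = str(uuid.uuid4())  # placeholder ID scheme (assumption)
        messages.append(
            {
                "message_id": refine_job_id,  # AMQP MessageId = refineJobId
                "body": {
                    "id": refine_job_id,
                    "url": url,
                    "urls": None,
                    "priority": 0,
                    "requestedAt": stamp,
                },
            }
        )
    return messages
```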
CrawlStatusData — crawler response
Published by: Python crawler publish_crawl_status
Consumed by: .NET CrawlStatusConsumer
{
"id": "<scrapingJobId or refineJobId>",
"url": "https://...",
"result": "<JSON string — raw listings array or enriched vehicle object>",
"error": null,
"status": "SUCCESS",
"completedAt": "2026-06-03T12:01:00Z"
}
status values: STARTED | IN_PROGRESS | SUCCESS | FAILURE
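
Because `result` is a JSON string nested inside the JSON message, a consumer has to decode twice. The following Python sketch shows one way to do that; it is not the .NET CrawlStatusConsumer. The function `parse_crawl_status` is hypothetical, and treating SUCCESS and FAILURE as the terminal statuses is an assumption based on the status names.

```python
import json

TERMINAL_STATUSES = {"SUCCESS", "FAILURE"}  # assumed terminal set
KNOWN_STATUSES = TERMINAL_STATUSES | {"STARTED", "IN_PROGRESS"}


def parse_crawl_status(raw):
    """Decode a CrawlStatusData JSON message into a plain dict."""
    msg = json.loads(raw)
    status = msg["status"]
    if status not in KNOWN_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    result = msg.get("result")
    # "result" is itself a JSON string, so it needs a second decode.
    payload = json.loads(result) if isinstance(result, str) and result else None
    return {
        "id": msg["id"],
        "status": status,
        "is_terminal": status in TERMINAL_STATUSES,
        "payload": payload,
        "error": msg.get("error"),
    }
```

The `id` field alone does not say whether the message answers a Phase 1 or Phase 2 job, so a consumer would need to look the ID up against both job tables or track it separately.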