Section 02
How the scanner works today.
Each script invocation produces one pipeline_id and walks
every site listed in config/sites-config-full.json
sequentially. Per site, the flow below runs end-to-end. The numbered
steps are mirrored exactly in src/cookies-checker.ts.
-
1
Browser launch
One Chromium instance for the whole script run. --test-third-party-cookie-phaseout is passed in every mode (standard / stealth / headed).
src/cookies-checker.ts:80–95 · F-015
-
2
Per-site context, JS interceptors injected
A new browser context (UA pinned to Chrome 126, viewport 1920×1080) is created. preload/preload-trace.js overrides document.cookie's setter and proxies window.localStorage. Two listeners are attached: context.on('console') and context.on('response').
src/cookies-checker.ts:115–164 · S-006, S-001
-
3
Page load + 5 s wait
Page is opened with waitUntil: 'domcontentloaded', then the script sleeps 5 000 ms hoping the consent banner has rendered.
src/cookies-checker.ts:170–171 · S-004
-
4
Pre-consent JAR snapshot, tagged afterConsent=false
Native context.storageState() is dumped. Every cookie and every LocalStorage entry sitting in the jar at this moment is recorded.
src/cookies-checker.ts:173–177 · S-009
P0 The flag is a closure variable, not a timestamp boundary. Cookies arriving in steps 2–3 may be classified inconsistently with this snapshot.
F-001
-
5
Consent click
Inside the page, document.querySelector('button[id*="didomi-notice-agree-button"], button[id*="cpexSubs_consentButton"]')?.click(). On success, the closure variable afterConsent flips to true.
src/lib/uniweb-site.ts consent() · S-002
P1 Selector matches Didomi and CPEx only. OneTrust, Cookiebot, TrustArc, Sourcepoint, iframe-hosted banners all fall through silently.
F-003
P0 If the click throws, afterConsent never flips and the run continues; the post-consent dump is then mislabelled as pre-consent.
F-002
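One way to avoid the silent failure modes in F-002 and F-003 is to record an explicit outcome for every consent attempt instead of a flag that flips only on success. The following is a minimal sketch, not the scanner's code: `ClickResult`, `ConsentOutcome` and `toConsentOutcome` are hypothetical names, and the state values reuse three of the consent_state values proposed for site_visits in Section 05.

```typescript
// Hypothetical result of the in-page click attempt from step 5.
type ClickResult =
  | { kind: "clicked" }
  | { kind: "not_found" }               // selector matched no button
  | { kind: "threw"; error: unknown };  // the click or evaluate call threw

// Hypothetical outcome record; the state names mirror consent_state.
type ConsentOutcome =
  | { state: "accepted"; acceptedAt: Date }
  | { state: "failed_selector" }
  | { state: "error"; message: string };

// Map a click result to an explicit outcome. Only a confirmed click
// produces an acceptance timestamp; every other path is recorded as such.
function toConsentOutcome(result: ClickResult, at: Date): ConsentOutcome {
  switch (result.kind) {
    case "clicked":
      return { state: "accepted", acceptedAt: at };
    case "not_found":
      return { state: "failed_selector" };
    case "threw":
      return {
        state: "error",
        message: result.error instanceof Error ? result.error.message : String(result.error),
      };
  }
}
```

With an outcome like this, a thrown click can no longer leave the run looking like a successful pre-consent capture; the later snapshot can be labelled from the recorded state.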
-
6
10 s wait
Hard-coded adLoad = 10_000. Late-loading SDKs that finish after this window are missed.
src/cookies-checker.ts:23, 199 · F-004
-
7
Post-consent JAR snapshot, tagged afterConsent=true
Second context.storageState() dump. The fix from Finding 16 prevents double-counting cookies that persisted from snapshot 1, but the type field of merged rows still reflects the first source seen.
src/cookies-checker.ts:201–205 · S-009
P2 JAR provenance is lost when collapsed onto an HDR/JS row.
F-006
-
8
CSV write + DB insert
The CSV at csvoutput/cookiesoutput-<site>.csv contains the full cookie value. The MariaDB cookies table stores truncate(c.value, 50), i.e. the first 50 characters followed by "...". The cookie name written to both is the regex-normalised form, not the as-observed name.
src/lib/EdpsCookieStore.ts:435–506 · S-012, S-013
P0 CSV and DB diverge. Observed cookie name is never persisted.
F-009 / F-010
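The practical effect of the CSV/DB divergence can be illustrated with a small sketch. The `truncate` below is a hypothetical re-implementation based only on the description above; the real helper in EdpsCookieStore.ts may differ, and the handling of values of 50 characters or fewer is an assumption.

```typescript
// Hypothetical re-implementation of the truncation described above.
// Assumption: values of `max` characters or fewer are stored unchanged.
function truncate(value: string, max: number): string {
  return value.length > max ? value.slice(0, max) + "..." : value;
}

// Two distinct cookie values that share their first 50 characters.
const sharedPrefix = "x".repeat(50);
const valueA = sharedPrefix + "-visitor-A";
const valueB = sharedPrefix + "-visitor-B";

// The CSV keeps the full value, so the two stay distinguishable.
const csvDistinct = valueA !== valueB;
// The DB row keeps only the truncated form, so they collapse.
const dbDistinct = truncate(valueA, 50) !== truncate(valueB, 50);

console.log({ csvDistinct, dbDistinct }); // { csvDistinct: true, dbDistinct: false }
```

Any analysis that compares values across visits therefore gives different answers depending on whether it reads the CSV or the database.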
How records are de-duplicated
The dedup logic decides what counts as "the same cookie observed twice". The four code paths use four different keys.
| Path | Dedup key | Source | Issue |
|---|---|---|---|
| addCookie (JS, HDR) | name + domain + afterConsent | EdpsCookieStore.ts:286–296 | P0 domain="" collision with JAR |
| addPlaywrightStorageState cookies (JAR) | name + domain | EdpsCookieStore.ts:393–405 | P2 Loses type provenance on merge |
| addPlaywrightStorageState LS (JAR) | name + host | EdpsCookieStore.ts:417–431 | Asymmetric with event-driven LS path |
| addLocalStorage (event) | name + domain + afterConsent (domain always "") | EdpsCookieStore.ts:337–350 | P1 Same key counted twice across paths |
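To make the asymmetry between the two LocalStorage paths concrete, here is a minimal sketch of the four key shapes. It is illustrative only: the `Obs` type, the function names and the `|` separator are assumptions, and only the field combinations come from the table above.

```typescript
// Hypothetical observation shape; only the fields used by the keys.
type Obs = { name: string; domain: string; host: string; afterConsent: boolean };

// Key shapes as listed in the table. The "|" separator is an assumption.
const keyAddCookie = (o: Obs): string => [o.name, o.domain, String(o.afterConsent)].join("|");
const keyJarCookie = (o: Obs): string => [o.name, o.domain].join("|");
const keyJarLocalStorage = (o: Obs): string => [o.name, o.host].join("|");
// The event-driven LocalStorage path always uses an empty domain.
const keyEventLocalStorage = (o: Obs): string => [o.name, "", String(o.afterConsent)].join("|");

// One LocalStorage entry, seen once by the JS hook and once in the JAR dump.
const entry: Obs = { name: "uid", domain: "", host: "example.com", afterConsent: false };

const eventKey = keyEventLocalStorage(entry); // "uid||false"
const jarKey = keyJarLocalStorage(entry);     // "uid|example.com"

// Different keys mean a store keyed by these strings holds two rows
// for what is physically a single entry.
const stored = new Set<string>([eventKey, jarKey]);
console.log(stored.size); // 2
```

A single shared key function used by all four paths would remove this class of mismatch; which fields belong in that key is a design decision this sketch does not settle.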
What ends up in the database today
One MariaDB table — cookies — carries every observation. There is no runs table, no site_visits table, and no run-level metadata is recorded against the pipeline_id.
CREATE TABLE `cookies` (
`id` int AUTO_INCREMENT,
`timestamp` datetime,
`pipeline_id` varchar(128),
`site_id` varchar(128),
`type` varchar(10), -- JS / HDR / LS / JAR
`host` varchar(128), -- divergent semantics across sources
`cookie_source` text, -- the cookie's Domain attribute
`cookie_name` varchar(100), -- regex-NORMALISED, raw is lost
`cookie_value` text, -- TRUNCATED to 50 chars on insert
`source` text, -- the URL at observation time
`path` varchar(45), -- silently truncates long paths
`expires` datetime, -- includes deletion sentinels (1970)
`http_only` tinyint(1),
`secure` tinyint(1),
`same_site` varchar(20),
`callstack` text,
`known` boolean,
`known_from` varchar(10),
`after_consent` tinyint(1) -- can be wrong: race + failed-consent fallback
);
Source: src/tools/dbcreate.ts:20–43, src/lib/EdpsCookieStore.ts:475
Section 05
The proposed evidence database.
Three additive tables and one derived view. The old cookies table is
kept for one release for parallel comparison, then dropped. The SQL is
below; the same shape can be expressed in Postgres or SQLite.
runs — one row per script invocation
CREATE TABLE runs (
run_id VARCHAR(36) PRIMARY KEY,
started_at DATETIME(3) NOT NULL,
ended_at DATETIME(3) NULL,
tool_version VARCHAR(32) NOT NULL,
tool_git_sha CHAR(40) NOT NULL,
browser_engine VARCHAR(16) NOT NULL,
browser_version VARCHAR(32) NOT NULL,
user_agent TEXT NOT NULL,
viewport_w SMALLINT NOT NULL,
viewport_h SMALLINT NOT NULL,
mode VARCHAR(16) NOT NULL, -- std / stealth / headed
third_party_phaseout BOOL NOT NULL,
egress_ip VARCHAR(45) NULL,
egress_country CHAR(2) NULL,
dataset_version VARCHAR(32) NULL,
errors_count INT NOT NULL DEFAULT 0,
manifest_sha256 CHAR(64) NULL
);
site_visits — one row per (run, site)
CREATE TABLE site_visits (
visit_id VARCHAR(36) PRIMARY KEY,
run_id VARCHAR(36) NOT NULL,
site_id VARCHAR(128) NOT NULL,
site_url TEXT NOT NULL,
final_url TEXT NULL,
started_at DATETIME(3) NOT NULL,
page_loaded_at DATETIME(3) NULL,
consent_attempted_at DATETIME(3) NULL,
consent_accepted_at DATETIME(3) NULL,
consent_state ENUM('accepted','rejected','failed_selector',
'no_banner','timeout','error') NOT NULL,
consent_vendor VARCHAR(32) NULL,
cmp_id SMALLINT NULL,
cmp_version SMALLINT NULL,
tc_string TEXT NULL,
purpose_consents JSON NULL,
vendor_consents JSON NULL,
page_load_status VARCHAR(16) NOT NULL,
error_summary TEXT NULL,
cookies_observed_count INT NOT NULL DEFAULT 0,
ls_observed_count INT NOT NULL DEFAULT 0,
idb_observed_count INT NOT NULL DEFAULT 0,
KEY ix_site_run (run_id, site_id),
CONSTRAINT fk_visit_run FOREIGN KEY (run_id) REFERENCES runs(run_id)
);
observations — one row per storage write
CREATE TABLE observations (
obs_id BIGINT AUTO_INCREMENT PRIMARY KEY,
visit_id VARCHAR(36) NOT NULL,
observed_at DATETIME(3) NOT NULL,
source ENUM('JS','HDR','LS','JAR','IDB','CS') NOT NULL,
cookie_name_raw VARCHAR(255) NOT NULL, -- as observed, never rewritten
cookie_name_norm VARCHAR(255) NOT NULL, -- regex-normalised
cookie_value MEDIUMTEXT NULL, -- full value, no truncation
cookie_value_sha256 CHAR(64) NULL,
domain_raw VARCHAR(255) NULL, -- raw Domain attr (or NULL)
domain_resolved VARCHAR(255) NOT NULL, -- normalised host
path VARCHAR(2048) NULL,
expires DATETIME NULL,
max_age INT NULL,
is_session BOOL NOT NULL DEFAULT 0,
is_deletion BOOL NOT NULL DEFAULT 0,
http_only BOOL NULL,
secure BOOL NULL,
same_site VARCHAR(20) NULL,
partitioned BOOL NULL,
observed_origin VARCHAR(255) NOT NULL,
observed_via_url TEXT NULL,
callstack MEDIUMTEXT NULL,
is_known BOOL NOT NULL DEFAULT 0,
known_from VARCHAR(64) NULL,
is_leg BOOL NOT NULL DEFAULT 0,
occurrence_count INT NOT NULL DEFAULT 1,
KEY ix_visit (visit_id),
KEY ix_name (cookie_name_norm),
CONSTRAINT fk_obs_visit FOREIGN KEY (visit_id) REFERENCES site_visits(visit_id)
);
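The expires, is_session and is_deletion columns could be derived at insert time from the expiry reported for a JAR cookie. The sketch below assumes the Playwright convention that expires is a Unix time in seconds with -1 marking a session cookie; `deriveExpiryFlags` is a hypothetical helper, and treating any expiry at or before the observation time as a deletion is an assumption about how the 1970 sentinels should be handled.

```typescript
// Hypothetical subset of the observations columns related to expiry.
interface ExpiryFlags {
  expires: Date | null;  // NULL for session cookies
  isSession: boolean;
  isDeletion: boolean;
}

// expiresSeconds: Unix time in seconds, negative for session cookies
// (assumed Playwright storageState convention).
function deriveExpiryFlags(expiresSeconds: number, observedAt: Date): ExpiryFlags {
  const isSession = expiresSeconds < 0;
  const expires = isSession ? null : new Date(expiresSeconds * 1000);
  // An expiry at or before the observation time means the write removes
  // the cookie; this covers the 1970 deletion sentinels.
  const isDeletion = expires !== null && expires.getTime() <= observedAt.getTime();
  return { expires, isSession, isDeletion };
}

const observedAt = new Date("2024-06-01T12:00:00Z");
console.log(deriveExpiryFlags(-1, observedAt)); // session cookie
console.log(deriveExpiryFlags(0, observedAt));  // 1970 sentinel, a deletion
```

Keeping the deletion flag separate from expires lets queries exclude deletions without having to guess at sentinel dates.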
observations_classified — derived view
CREATE OR REPLACE VIEW observations_classified AS
SELECT
o.*,
v.consent_state,
v.consent_accepted_at,
CASE
WHEN v.consent_state <> 'accepted' THEN 'no_consent'
WHEN o.observed_at < v.consent_accepted_at THEN 'before_consent'
ELSE 'after_consent'
END AS consent_phase
FROM observations o
JOIN site_visits v ON v.visit_id = o.visit_id;
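The CASE expression in the view can be mirrored in application code, for example to unit-test the classification without a database. This is a sketch under stated assumptions: `consentPhase` is a hypothetical function, and the NULL handling follows SQL semantics, where a comparison against a NULL consent_accepted_at is not true and so falls through to the ELSE branch.

```typescript
type ConsentPhase = "no_consent" | "before_consent" | "after_consent";

// Mirrors the view's CASE: the phase is decided by comparing timestamps,
// not by a flag captured at write time.
function consentPhase(
  consentState: string,
  observedAt: Date,
  consentAcceptedAt: Date | null,
): ConsentPhase {
  if (consentState !== "accepted") return "no_consent";
  // As in SQL, a missing acceptance time makes the "<" test fail,
  // so the row lands in the ELSE branch.
  if (consentAcceptedAt !== null && observedAt.getTime() < consentAcceptedAt.getTime()) {
    return "before_consent";
  }
  return "after_consent";
}

const accepted = new Date("2024-06-01T12:00:05.000Z");
console.log(consentPhase("accepted", new Date("2024-06-01T12:00:04.500Z"), accepted)); // before_consent
console.log(consentPhase("accepted", new Date("2024-06-01T12:00:06.000Z"), accepted)); // after_consent
console.log(consentPhase("failed_selector", new Date("2024-06-01T12:00:06.000Z"), null)); // no_consent
```

Note the edge case this makes visible: a visit marked accepted without a consent_accepted_at value classifies every observation as after_consent, so the writer would need to guarantee that the two columns are set together.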
Per-run evidence pack — what gets shipped to storage
run-<run_id>.zip
├── manifest.json run metadata + SHA-256 of every file
├── manifest.json.sig detached signature (cosign / minisign / GPG)
├── runs.csv one row from runs
├── visits.csv rows from site_visits for this run
├── observations.csv full observations rows for this run
├── csvoutput/
│ └── cookiesoutput-<site>.csv
├── screenshots/
│ └── <site>.jpeg
├── har/
│ └── <site>.har Playwright HAR per visit (network trail)
└── README.md generated; scope, methodology, software version
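The manifest.json entry above pairs run metadata with a SHA-256 digest of every file in the pack. A minimal sketch of that step, using Node's built-in crypto module, could look as follows. The manifest layout (`run_id`, `files`, `path`, `sha256`) and the function names are assumptions for illustration; signing the manifest is out of scope here.

```typescript
import { createHash } from "node:crypto";

// Hex-encoded SHA-256 of a file's contents.
function sha256Hex(data: string | Uint8Array): string {
  return createHash("sha256").update(data).digest("hex");
}

// Hypothetical manifest shape: one digest entry per file in the pack.
interface Manifest {
  run_id: string;
  files: { path: string; sha256: string }[];
}

// Build the manifest from in-memory file contents. Entries are sorted by
// path so the manifest is byte-stable for the same set of files.
function buildManifest(runId: string, files: Record<string, string | Uint8Array>): Manifest {
  const entries = Object.entries(files)
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
    .map(([path, data]) => ({ path, sha256: sha256Hex(data) }));
  return { run_id: runId, files: entries };
}

const manifest = buildManifest("example-run", {
  "visits.csv": "abc",
  "runs.csv": "",
});
console.log(JSON.stringify(manifest, null, 2));
```

Sorting the entries matters because the detached signature covers the manifest bytes: the same files should always produce the same manifest.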