db Schema Reference¶
Column-by-column reference for every SQLite table sembr creates. The DDL source of truth is in sembr/db/*.py — this page mirrors it for SQL/ops use. If they ever disagree, the DDL wins.
All TEXT timestamp columns store ISO-8601 in UTC (e.g. 2026-05-05T12:34:56Z). All boolean-ish columns are INTEGER (0/1) — SQLite has no native bool.
feeds¶
User-facing feed registry. One row per RSS source.
| Column | Type | Default | Constraints | Meaning |
|---|---|---|---|---|
id |
INTEGER |
autoincrement | PRIMARY KEY |
Surrogate key |
name |
TEXT |
— | NOT NULL |
Display name |
url |
TEXT |
— | NOT NULL UNIQUE |
Feed URL — unique key for dedup at create time |
source_type |
TEXT |
'rss' |
NOT NULL |
Source plugin name; only rss ships in 1.0 |
config |
TEXT |
'{}' |
NOT NULL |
JSON-encoded source-specific config |
poll_interval_minutes |
INTEGER |
30 |
NOT NULL |
APScheduler interval (5–1440) |
last_collected_at |
TEXT |
NULL |
— | Last successful fetch (UTC ISO-8601) |
created_at |
TEXT |
datetime('now') |
NOT NULL |
Row creation time |
enabled |
INTEGER |
1 |
NOT NULL |
0 = paused (added via idempotent migration _ensure_enabled_column) |
Indexes: PK on id, UNIQUE on url.
feed_items¶
MD5 fingerprint dedup. One row per article ever ingested from any feed.
| Column | Type | Default | Constraints | Meaning |
|---|---|---|---|---|
md5 |
TEXT |
— | PRIMARY KEY |
MD5(url + title), 32 lowercase hex chars |
feed_id |
INTEGER |
— | NOT NULL, FK → feeds(id) ON DELETE CASCADE |
Owning feed |
collected_at |
TEXT |
datetime('now') |
NOT NULL |
Ingestion time |
Indexes: PK on md5, idx_feed_items_feed_id on feed_id (used for per-feed scans).
Lifecycle: written inside insert_article_pending together with pending_articles; survives forever as the dedup ledger. Cascade-deleted when the feed is deleted.
feed_tags¶
Tag set per feed.
| Column | Type | Default | Constraints | Meaning |
|---|---|---|---|---|
feed_id |
INTEGER |
— | NOT NULL, FK → feeds(id) ON DELETE CASCADE |
Feed owner |
tag |
TEXT |
— | NOT NULL |
Tag string |
Primary key: composite (feed_id, tag).
Indexes: idx_feed_tags_tag on (tag, feed_id) — supports "list all feeds with tag X" without table scan.
seeded_feeds¶
Tracks every URL ever seeded from INITIAL_FEEDS so deletes are sticky.
| Column | Type | Default | Constraints | Meaning |
|---|---|---|---|---|
url |
TEXT |
— | PRIMARY KEY |
Feed URL ever auto-seeded |
Invariant: once a URL is in seeded_feeds it stays forever, even if the feed row is deleted. seed_initial_feeds skips URLs already here, so user-deleted seed feeds never come back on restart.
Scope: only INITIAL_FEEDS entries — user POST /feeds writes do not populate this table.
pending_articles¶
Articles fetched but not yet embedded. Row presence is the only state indicator — there is no status column.
| Column | Type | Default | Constraints | Meaning |
|---|---|---|---|---|
md5 |
TEXT |
— | PRIMARY KEY |
Same MD5(url + title) as feed_items.md5 |
feed_id |
INTEGER |
— | NOT NULL, FK → feeds(id) ON DELETE CASCADE |
Owning feed |
url |
TEXT |
— | NOT NULL |
Article URL |
title |
TEXT |
— | NOT NULL |
Article title |
body |
TEXT |
— | NOT NULL |
Article body, hard-capped at 1 MB (_BODY_CAP_BYTES) |
published_at |
TEXT |
NULL |
— | Article publish time as reported by source |
retry_count |
INTEGER |
0 |
NOT NULL |
Embedding-attempt counter |
created_at |
TEXT |
datetime('now') |
NOT NULL |
When inserted into the queue |
Indexes:
- PK on
md5 idx_pending_articles_feed_idonfeed_ididx_pending_articles_retryonretry_count— coversWHERE retry_count < ?; ORDER BY rowid within the filtered set is FIFO
Lifecycle: inserted by collector inside insert_article_pending; pulled in batches by embedder via pull_pending_batch (FIFO by rowid); deleted on success or moved to dead_articles after retry_count >= max_retry.
dead_articles¶
Articles that exhausted retry budget. Kept for forensics — no FK cascade, so deleting a feed leaves its dead articles for postmortem.
| Column | Type | Default | Constraints | Meaning |
|---|---|---|---|---|
md5 |
TEXT |
— | PRIMARY KEY |
Same fingerprint |
feed_id |
INTEGER |
— | nullable, no FK | Original feed (NULL if feed since deleted by admin) |
url |
TEXT |
— | NOT NULL |
Article URL |
title |
TEXT |
— | NOT NULL |
Article title |
body |
TEXT |
— | NOT NULL |
Captured body |
published_at |
TEXT |
NULL |
— | As reported by source |
error_message |
TEXT |
NULL |
— | Last exception that caused demotion (per-batch attribution from demote_md5s_to_dead) |
failed_at |
TEXT |
datetime('now') |
NOT NULL |
When demoted |
Indexes: PK on md5, idx_dead_articles_failed_at on failed_at.
Conflict policy: INSERT OR REPLACE on demotion preserves the latest failed_at and error_message if a row was previously demoted then re-queued.
intents¶
User-defined monitoring intents.
| Column | Type | Default | Constraints | Meaning |
|---|---|---|---|---|
id |
INTEGER |
autoincrement | PRIMARY KEY |
Surrogate key |
name |
TEXT |
— | NOT NULL |
User label |
text |
TEXT |
— | NOT NULL |
Natural-language intent — embedded into Qdrant intents collection |
threshold |
REAL |
0.75 |
NOT NULL |
Cosine similarity cutoff (0.20–0.95) |
enabled |
INTEGER |
1 |
NOT NULL |
0 = paused; matcher skips |
channels |
TEXT |
'[]' |
NOT NULL |
JSON array of ChannelConfig — discriminated union; email is the only built-in type today |
tags |
TEXT |
'[]' |
NOT NULL |
JSON array of free-form tag strings |
system_template |
TEXT |
'default' |
NOT NULL |
Filename (no .md) under prompts/system/ |
instruction_template |
TEXT |
'default' |
NOT NULL |
Filename (no .md) under prompts/instruction/ |
feed_filter |
TEXT |
'null' |
NOT NULL |
JSON FeedFilter or 'null' to scan all feeds — distinguishes omitted from explicit-null via model_fields_set on update |
schedule |
TEXT |
'{}' |
NOT NULL |
JSON Schedule: {mode: 'cron', preset, hour, minute, weekday, lookback_seconds, skip_seen} or {mode: 'event', ...} |
timezone |
TEXT |
'UTC' |
NOT NULL |
IANA timezone for cron evaluation |
language |
TEXT |
'zh' |
NOT NULL |
Output language for LLM summary |
created_at |
TEXT |
datetime('now') |
NOT NULL |
Row creation |
updated_at |
TEXT |
datetime('now') |
NOT NULL |
Last mutation; update_intent_raw rollback intentionally writes the original value |
Indexes: PK on id. No secondary indexes (intent count is small — < 1000 expected).
match_seen¶
Deduplication log: which articles each intent has already alerted on.
| Column | Type | Default | Constraints | Meaning |
|---|---|---|---|---|
intent_id |
INTEGER |
— | NOT NULL, FK → intents(id) ON DELETE CASCADE |
Intent that matched |
article_id |
TEXT |
— | NOT NULL |
Qdrant point ID (string form) of the matched article |
first_matched_at |
TEXT |
datetime('now') |
NOT NULL |
When this pair was first seen |
Primary key: composite (intent_id, article_id).
Lifecycle: written by matcher via insert_unseen_returning_new (single multi-row INSERT OR IGNORE … RETURNING). Cleared on intent text mutation via clear_intent — the new query vector means previously seen articles must be re-evaluated.
event_pending¶
Buffer for the event-driven matcher path. DDL is in db/event_buffer.py; absorb/flush logic lives in matcher/event_buffer.py.
| Column | Type | Default | Constraints | Meaning |
|---|---|---|---|---|
intent_id |
INTEGER |
— | NOT NULL, FK → intents(id) ON DELETE CASCADE |
Intent owning the buffered group |
group_id |
INTEGER |
— | NOT NULL |
Cluster id within this intent (assigned during absorb) |
rep_article_id |
TEXT |
— | NOT NULL |
Representative article for the cluster (highest-score member) |
rep_title_norm |
TEXT |
— | NOT NULL |
Normalised title used for semantic-dedup comparison across new arrivals |
members_json |
TEXT |
— | NOT NULL |
JSON list of {article_id, score} cluster members |
created_at |
TEXT |
— | NOT NULL |
When the cluster was first created |
Primary key: composite (intent_id, group_id).
Cascade summary¶
DELETE FROM feeds WHERE id = X
├─ deletes feed_items WHERE feed_id = X (FK CASCADE)
├─ deletes feed_tags WHERE feed_id = X (FK CASCADE)
├─ deletes pending_articles WHERE feed_id = X (FK CASCADE)
└─ does NOT touch dead_articles (no FK — kept for forensics)
DELETE FROM intents WHERE id = Y
├─ deletes match_seen WHERE intent_id = Y (FK CASCADE)
└─ deletes event_pending WHERE intent_id = Y (FK CASCADE)
PRAGMA foreign_keys=ON is required for cascades to fire — set globally in init_sqlite.