
This page describes how to use OLAMIP as a first‑class data source in Retrieval‑Augmented Generation (RAG) pipelines. It covers architecture, chunking strategies, retrieval scoring, metadata‑aware embeddings, and includes example code snippets.
1. Overview
OLAMIP provides a structured, machine‑interpretable representation of a website’s content. Instead of scraping HTML or parsing inconsistent page layouts, a RAG pipeline can ingest olamip.json (and optionally olamip‑delta.json) as a clean, normalized dataset containing:
- A consistent hierarchy of sections, subsections, and entries
- Human‑curated summaries, tags, and content types
- Editorial signals such as priority
- Optional semantic alignment via schema.org or knowledge‑graph identifiers
- Multilingual metadata at the file, section, or entry level
Because OLAMIP is explicitly designed for LLM consumption, it provides a high‑quality foundation for retrieval pipelines with minimal preprocessing.
2. Architecture of an OLAMIP‑Powered RAG Pipeline
A typical RAG system using OLAMIP follows these stages:
- **Ingestion:** Load `olamip.json` and, if present, `olamip-delta.json`.
- **Normalization:** Flatten the hierarchical structure (sections → subsections → entries) into a list of documents while preserving all metadata.
- **Chunking:** Convert entries into retrieval chunks. Each chunk includes text plus structured metadata.
- **Embedding:** Generate vector embeddings for each chunk. Metadata may optionally be incorporated into the embedding text.
- **Indexing:** Store embeddings and metadata in a vector database.
- **Retrieval:** Retrieve the top-k chunks using similarity search and optional metadata filters.
- **Generation:** Provide the retrieved chunks to an LLM as contextual input.
3. Chunking strategies
OLAMIP entries contain structured fields such as title, summary, url, tags, priority, content_type, and optional metadata. Chunking determines how these fields are transformed into retrieval units.
3.1 Entry-level chunks
The simplest strategy is to treat each OLAMIP entry as a single chunk.
Chunk text:
- title
- summary
Metadata:
- url
- tags
- priority
- content_type
- section/subsection identifiers
- language
- custom metadata fields
This approach works well for concise summaries and page‑level retrieval.
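An entry-level chunk might be assembled like this. The sketch below is illustrative, not prescriptive: it assumes the entry and section are plain dicts with the field names listed above.

```python
def entry_to_chunk(entry: dict, section: dict) -> dict:
    """Turn one OLAMIP entry into a retrieval chunk:
    text = title + summary; everything else goes into metadata."""
    text = "\n\n".join(
        part for part in (entry.get("title"), entry.get("summary")) if part
    )
    return {
        "id": entry.get("url"),
        "text": text,
        "metadata": {
            "url": entry.get("url"),
            "tags": entry.get("tags", []),
            "priority": entry.get("priority", "medium"),
            "content_type": entry.get("content_type"),
            "section_title": section.get("title"),
            "language": entry.get("language"),
        },
    }
```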
3.2 Section-aware chunks
For sites with meaningful hierarchy, incorporate section context into the chunk text.
```
[Section: Time‑Lapse Projects]
Entry Title

Summary...
```
Metadata additions:
- section_title
- section_type
This helps the retriever understand relationships such as:
- “Photography Tips → Best Places to Photograph in Los Angeles”
- “Time‑Lapse Projects → Timeless Magic City – Miami Time‑Lapse”
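A small helper can produce this section-prefixed text. The function name and shape are a sketch; adapt the field access to your chunking code.

```python
def section_aware_text(section_title: str, entry: dict) -> str:
    """Prefix the chunk text with its section so the embedding captures
    hierarchy such as "Time-Lapse Projects -> <entry title>"."""
    lines = [f"[Section: {section_title}]", entry.get("title", "")]
    if entry.get("summary"):
        lines += ["", entry["summary"]]
    return "\n".join(lines)
```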
3.3 Multi-level hierarchy chunks
OLAMIP supports unlimited nesting of sections and subsections. You can create:
- Parent‑level chunks (section summaries)
- Entry‑level chunks (individual pages)
- Deep‑hierarchy chunks (subsections several levels down)
Metadata may include:
- parent_section_title
- parent_section_url
This enables multi‑granular retrieval: broad overviews and fine‑grained details.
4. Retrieval scoring
Standard RAG retrieval uses vector similarity (e.g., cosine similarity). OLAMIP allows enhanced scoring using metadata fields defined in the protocol.
Examples:
- priority: boost high‑priority entries
- content_type: emphasize certain types (e.g., project, blog_article)
- section_type: boost or filter specific sections
4.1 Example scoring formula
Let:
- `s` = vector similarity score
- `w_p` = weight derived from `priority`
- `w_t` = weight derived from `content_type`

Final score:

```
final_score = s * w_p * w_t
```
Example weights:
| Field | Value | Weight |
|---|---|---|
| priority | high | 1.2 |
| priority | medium | 1.0 |
| priority | low | 0.8 |
| content_type | project | 1.1 |
| content_type | blog_article | 1.0 |
| content_type | page | 0.9 |
This preserves semantic similarity as the primary signal while incorporating editorial intent.
5. Metadata-aware embeddings
Metadata can be incorporated directly into the text passed to the embedding model. This helps the model encode semantic signals that would otherwise be lost.
5.1 Text concatenation example
Include selected metadata fields in the text passed to the embedding model, for example:
```
[Section: Time‑Lapse Projects]
[Content Type: project]
[Tags: miami, time-lapse, cityscape]
[Priority: high]

Timeless Magic City – Miami Time‑Lapse

A large-scale Miami time‑lapse project capturing Downtown, Miami Beach, and the Florida Keys across multiple years with cinematic precision.
```
This improves clustering, retrieval accuracy, and entity disambiguation.
5.2 Storing metadata as structured fields
Metadata should also be stored separately in the vector database. This enables:
- Filtering (e.g., `section_title == "Time‑Lapse Projects"`)
- Boosting (e.g., `priority == "high"`)
- Faceting (e.g., grouping by `content_type`)

Combining metadata‑in‑text and metadata‑as‑fields yields the strongest retrieval performance.
6. Example code snippets
Below are simplified Python‑style examples using a generic embedding model and vector database. Adjust to your stack (e.g., OpenAI, Azure, local models, Pinecone, Qdrant, etc.).
6.1 Parsing OLAMIP and building chunks
```python
import json

with open("olamip.json", "r", encoding="utf-8") as f:
    olamip = json.load(f)

sections = olamip["content"]["sections"]
chunks = []

for section in sections:
    section_title = section.get("title")
    section_type = section.get("section_type")
    section_url = section.get("url")

    for entry in section.get("entries", []):
        title = entry.get("title")
        summary = entry.get("summary")
        url = entry.get("url")
        tags = entry.get("tags", [])
        priority = entry.get("priority", "medium")
        content_type = entry.get("content_type")

        text_parts = [
            f"[Section: {section_title}]",
            f"[Content Type: {content_type}]",
            f"[Tags: {', '.join(tags)}]",
            f"[Priority: {priority}]",
            "",
            title or "",
            "",
            summary or "",
        ]
        text = "\n".join(text_parts).strip()

        chunk = {
            "id": url,
            "text": text,
            "metadata": {
                "url": url,
                "title": title,
                "section_title": section_title,
                "section_type": section_type,
                "tags": tags,
                "priority": priority,
                "content_type": content_type,
                "section_url": section_url,
            },
        }
        chunks.append(chunk)
```
6.2 Embedding and indexing
```python
from typing import List


def embed_texts(texts: List[str]) -> List[List[float]]:
    # Plug in your embedding model here (OpenAI, Azure, local, etc.).
    raise NotImplementedError


texts = [c["text"] for c in chunks]
embeddings = embed_texts(texts)

for chunk, embedding in zip(chunks, embeddings):
    vector_db.upsert(
        id=chunk["id"],
        vector=embedding,
        metadata=chunk["metadata"],
    )
```
6.3 Retrieval with priority-aware scoring
```python
def priority_weight(priority: str) -> float:
    if priority == "high":
        return 1.2
    if priority == "low":
        return 0.8
    return 1.0


def content_type_weight(content_type: str) -> float:
    if content_type == "project":
        return 1.1
    if content_type == "page":
        return 0.9
    return 1.0


def rerank_with_metadata(results):
    reranked = []
    for r in results:
        base_score = r["score"]
        meta = r["metadata"]
        w_p = priority_weight(meta.get("priority", "medium"))
        w_t = content_type_weight(meta.get("content_type"))
        r["final_score"] = base_score * w_p * w_t
        reranked.append(r)
    reranked.sort(key=lambda x: x["final_score"], reverse=True)
    return reranked


def retrieve(query: str, top_k: int = 5):
    q_embedding = embed_texts([query])[0]
    # Over-fetch, then rerank with metadata weights and trim to top_k.
    raw_results = vector_db.search(vector=q_embedding, top_k=top_k * 3)
    return rerank_with_metadata(raw_results)[:top_k]
```
6.4 Using retrieved chunks in a RAG call
```python
def build_context(retrieved_chunks):
    parts = []
    for r in retrieved_chunks:
        meta = r["metadata"]
        parts.append(f"Title: {meta.get('title')}")
        parts.append(f"URL: {meta.get('url')}")
        parts.append(r["text"])
        parts.append("\n---\n")
    return "\n".join(parts)


query = "How do you create time-lapse footage in Miami?"
results = retrieve(query, top_k=5)
context = build_context(results)

prompt = f"""
You are a domain expert assistant.
Use the following context from an OLAMIP-powered RAG system to answer the question.

Context:
{context}

Question:
{query}

Answer:
"""
```
7. Summary
Using OLAMIP in RAG pipelines provides:
- Clean, structured input for embeddings
- Built‑in relevance signals via priority, tags, and hierarchy
- Metadata‑aware retrieval for higher accuracy
- Incremental updates via `olamip-delta.json`
- Canonical URLs for deduplication and cross‑referencing
By treating OLAMIP as the authoritative content layer and RAG as the reasoning layer, you create a retrieval system that is robust, maintainable, and semantically aligned with your website’s structure and editorial intent.