Extraction Rules (Layer C)

Layer C describes how to parse and normalize the response a scrape target returns. Layers B and C together bring a packet to level L2; at L2 and above, at least one extraction rule is required.

ExtractionRule fields

| Field | Type | Notes |
| --- | --- | --- |
| id | string | Required |
| rule_type | enum | Required; see Rule types below |
| name · description | string | |
| expression | string | The rule expression (jq filter, JSONPath, CSS selector, regex, …) |
| llm_prompt | string | For llm_extract: use {content} for the input text |
| output_contract_id | string | Id of the DataContract this rule outputs |
| field_map | object<string,string> | For field_map: source_field → dest_field |
| apply_to | string | JSONPath/field within the response to apply to |
| fallback | any | Default value if extraction yields null |
| post_process | enum[] | Post-processing pipeline; see Post-processing below |
| _extensions | object | |
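
As a rough illustration of the fields above, the following Python sketch models a rule as a dataclass and checks the two required fields. The class name `ExtractionRule` mirrors the heading, but `validate_rule`, the `extensions` attribute name, and the error messages are assumptions made for this example, not part of the specification.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

# Rule types as listed on this page.
RULE_TYPES = {
    "jq_transform", "jsonpath", "css_selector", "xpath", "regex",
    "llm_extract", "python_fn", "js_fn", "field_map",
}


@dataclass
class ExtractionRule:
    id: str                                   # required
    rule_type: str                            # required, one of RULE_TYPES
    name: Optional[str] = None
    description: Optional[str] = None
    expression: Optional[str] = None
    llm_prompt: Optional[str] = None
    output_contract_id: Optional[str] = None
    field_map: dict[str, str] = field(default_factory=dict)
    apply_to: Optional[str] = None
    fallback: Any = None
    post_process: list[str] = field(default_factory=list)
    # Stands in for "_extensions"; the attribute name is an assumption.
    extensions: dict[str, Any] = field(default_factory=dict)


def validate_rule(rule: ExtractionRule) -> list[str]:
    """Hypothetical helper: return a list of problems, empty if none."""
    errors = []
    if not rule.id:
        errors.append("id is required")
    if rule.rule_type not in RULE_TYPES:
        errors.append(f"unknown rule_type: {rule.rule_type!r}")
    return errors
```

Returning a list of messages instead of raising on the first problem lets a caller report every issue in a rule at once.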

Rule types

jq_transform · jsonpath · css_selector · xpath · regex ·
llm_extract · python_fn · js_fn · field_map
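
To make the role of `rule_type` concrete, here is a minimal dispatch sketch covering only `regex` and `field_map`, using the Python standard library. The function names, the "first match, preferring the first capture group" behaviour for `regex`, and the way `fallback` is applied are all assumptions for illustration; the page does not specify them.

```python
import re
from typing import Any


def apply_regex(expression: str, text: str) -> Any:
    """Return the first capture group (or the whole match), or None if no match."""
    match = re.search(expression, text)
    if match is None:
        return None
    return match.group(1) if match.groups() else match.group(0)


def apply_field_map(field_map: dict[str, str], record: dict[str, Any]) -> dict[str, Any]:
    """Rename source_field -> dest_field, dropping fields not in the map."""
    return {dest: record[src] for src, dest in field_map.items() if src in record}


def apply_rule(rule: dict[str, Any], data: Any) -> Any:
    """Dispatch on rule_type; only two types are covered in this sketch."""
    rule_type = rule["rule_type"]
    if rule_type == "regex":
        result = apply_regex(rule["expression"], data)
    elif rule_type == "field_map":
        result = apply_field_map(rule["field_map"], data)
    else:
        raise NotImplementedError(f"{rule_type} is not covered by this sketch")
    # fallback: default value if extraction yields null
    return rule.get("fallback") if result is None else result
```

For example, a `regex` rule with the expression `PMID:\s*(\d+)` applied to the text `PMID: 12345` yields `"12345"` under these assumptions, and a rule with a `fallback` returns that value when nothing matches.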

Post-processing

post_process[] applies an ordered pipeline of normalizers:

trim · lowercase · uppercase · parse_int · parse_float ·
parse_date · strip_html · truncate_512
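
The ordered pipeline can be pictured as a chain of small functions applied left to right. The sketch below is one possible reading, not the reference behaviour: it assumes `parse_date` accepts ISO 8601 strings, that `truncate_512` cuts to 512 characters, and that `strip_html` is a naive tag remover. `NORMALIZERS` and `post_process` are names invented for this example.

```python
import re
from datetime import datetime
from typing import Any, Callable


def _strip_html(value: str) -> str:
    # Naive tag removal; a real implementation would likely use an HTML parser.
    return re.sub(r"<[^>]+>", "", value)


NORMALIZERS: dict[str, Callable[[Any], Any]] = {
    "trim": lambda v: v.strip(),
    "lowercase": lambda v: v.lower(),
    "uppercase": lambda v: v.upper(),
    "parse_int": int,
    "parse_float": float,
    "parse_date": datetime.fromisoformat,  # assumption: ISO 8601 input
    "strip_html": _strip_html,
    "truncate_512": lambda v: v[:512],      # assumption: 512 characters
}


def post_process(value: Any, steps: list[str]) -> Any:
    """Apply each named normalizer in the order given."""
    for step in steps:
        value = NORMALIZERS[step](value)
    return value
```

Because the steps run in order, string normalizers such as `trim` or `strip_html` need to come before type-changing ones such as `parse_int`; in this sketch, `["parse_int", "trim"]` would fail because an integer has no `strip` method.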

Examples

```json
{
  "id": "rule-pubmed-ids",
  "rule_type": "xpath",
  "expression": "//IdList/Id/text()",
  "apply_to": "$",
  "post_process": ["trim"]
}
```
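
To see what the `rule-pubmed-ids` rule would produce, the following sketch approximates it with Python's standard `xml.etree.ElementTree`. Two caveats: ElementTree supports only a subset of XPath and has no `text()` step, so the sketch selects the `Id` elements and reads their text instead; and the `SAMPLE` document, including its ids, is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample response; the ids are placeholders, not real records.
SAMPLE = """<eSearchResult>
  <IdList>
    <Id> 38000001 </Id>
    <Id>38000002</Id>
  </IdList>
</eSearchResult>"""


def extract_ids(xml_text: str) -> list[str]:
    """Approximate //IdList/Id/text() followed by the "trim" post-processor."""
    root = ET.fromstring(xml_text)
    # ".//IdList/Id" is the ElementTree-compatible form of "//IdList/Id".
    return [(el.text or "").strip() for el in root.findall(".//IdList/Id")]


ids = extract_ids(SAMPLE)
```

A full XPath engine could evaluate the rule's expression as written; the standard library is used here only to keep the sketch dependency-free.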
```json
{
  "id": "rule-extract-findings",
  "rule_type": "llm_extract",
  "llm_prompt": "From the abstract below, extract the primary outcome and effect size as JSON. {content}",
  "output_contract_id": "contract-finding"
}
```
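
For `llm_extract`, the `{content}` placeholder has to be filled with the input text before the prompt is sent. How the runtime does this is not stated here; the sketch below assumes a plain string substitution, and `render_prompt` is a hypothetical helper name.

```python
def render_prompt(llm_prompt: str, content: str) -> str:
    """Fill the {content} placeholder with the input text."""
    # str.replace is used instead of str.format so that literal braces in the
    # prompt or in the content (for example, JSON snippets) are left untouched.
    return llm_prompt.replace("{content}", content)


prompt = render_prompt(
    "From the abstract below, extract the primary outcome and effect size as JSON. {content}",
    "Background: ... Results: ...",
)
```

Using `str.format` here would raise on prompts that contain example JSON such as `{"outcome": ...}`, which is why simple replacement is the safer choice for this sketch.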

Released under the MIT License.