---
name: nl2sql-quvi-yml-column-picker
description: 'Generates an optimal DSL + example SQL and full SQL for a natural-language query by selecting the best tables/columns from repository YAML schema descriptions. Returns per-column YAML excerpts and short rationales that justify each choice. Common triggers: "select table/columns for query", "generate DSL and example SQL based on yml", "explain column selection".'
---

# Skill purpose

This Skill analyzes natural-language queries and repository YAML schema descriptions to improve table and column selection for an nl2sql QuVI system. It produces:
- A suggested DSL (node workflow fragment) selecting tables/columns for the query
- A concise example SQL snippet that demonstrates how the DSL maps to a query
- A full SQL statement consistent with the DSL and DBT-style transformations where applicable
- For each chosen table and column: the full YAML description excerpt (or relevant fields) and a short paragraph explaining why that column was selected

This helps LLM-powered nodes pick the most relevant schema elements (improving column selection accuracy) and provides transparent evidence from YAML descriptions.

# Step-by-step instructions Claude must follow

1. Input parsing
   - Accept two required inputs: (A) the user's natural-language query (NLQ) and (B) a list of YAML schema files or YAML content for candidate tables. If the repository contains many YAMLs, prioritize the files referenced by the node's config; otherwise, consider all provided YAMLs.
2. Normalize NLQ
   - Normalize the NLQ: detect intent (aggregation, filter, join, grouping, time-range, top-k), canonicalize synonyms (e.g., "count" -> aggregate count), detect named entities, numeric constraints, and target output granularity.
3. Extract schema metadata
   - Parse each YAML file to extract: table name, column names, column type, and the column/table description fields. Preserve the full description text for later quoting.
4. Relevance scoring
   - For each column, compute a relevance score against the NLQ using a hybrid approach:
     - Exact token overlap between the NLQ and the YAML description/name (higher weight for column name matches).
     - Semantic similarity between the NLQ and the YAML description, using the LLM's embedding/semantic judgment (or a similarity heuristic if embeddings are not available).
     - Column type and NLQ intent pruning (e.g., time filters prefer timestamp/date columns; aggregations prefer numeric columns; boolean filters match boolean-like descriptions).
     - Table-level context: prefer columns that belong to tables whose descriptions mention the NLQ's domain or entity.
   - Normalize the scores, then rank columns and tables.
5. Choose minimal optimal set
   - Select a minimal set of tables and columns sufficient to answer the NLQ (include join keys if required). Prefer single-table solutions if sufficient; otherwise include the smallest number of tables that yields correct semantics.
6. Generate DSL fragment
   - Produce a DSL fragment (in the user's QuVI DSL style; if style unknown, use a clear placeholder DSL schema) that lists the chosen node(s), table references, selected columns, filters, and aggregations. Mark the selected columns explicitly.
7. Produce example SQL snippet
   - Generate a short, illustrative SQL snippet demonstrating core selection/filter/aggregation. Keep it compact (one or two statements) and consistent with the DSL mapping. Use canonicalized table/column names exactly as in YAML.
8. Produce full SQL
   - Generate a complete SQL statement that can be executed in the DBT/DB environment, including required joins, groupings, WHERE clauses, ORDER BY, LIMIT, and any DBT macros if applicable. Make reasonable assumptions about schema prefixes or dataset names, and state them in a one-line comment.
9. Provide YAML excerpts and rationales
   - For each chosen column (and table when relevant): include the full YAML description text (or the description field excerpt) and a concise 1–2 sentence rationale explaining why it influenced the choice.
10. Output format
    - Return a structured response with sections: DSL, Example SQL Snippet, Full SQL, Selection Summary (table -> column list), Per-column YAML Excerpts and Rationale, and Assumptions. Use clear, machine-parseable separators (e.g., labeled sections). Keep content factual and concise.
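
The relevance scoring and ranking in step 4 can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the QuVI implementation: the `Column` record, the weights, and the `INTENT_TYPES` mapping are hypothetical, and the semantic-similarity component is left out because it depends on the embedding model or LLM judgment available.

```python
import re
from dataclasses import dataclass

# Hypothetical weights and type groups, chosen only for illustration.
NAME_WEIGHT, DESC_WEIGHT, TYPE_BONUS, TABLE_BONUS = 2.0, 1.0, 1.5, 0.5
INTENT_TYPES = {
    "time-range": {"date", "timestamp"},
    "aggregation": {"int", "integer", "float", "numeric", "decimal"},
    "boolean-filter": {"bool", "boolean"},
}

@dataclass
class Column:
    table: str
    name: str
    type: str
    description: str
    table_description: str = ""

def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens; 'signup_date' splits into 'signup', 'date'."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def score(col: Column, nlq: str, intents: set[str]) -> float:
    q = tokens(nlq)
    s = NAME_WEIGHT * len(q & tokens(col.name))          # column name overlap
    s += DESC_WEIGHT * len(q & tokens(col.description))  # description overlap
    if any(col.type.lower() in INTENT_TYPES.get(i, set()) for i in intents):
        s += TYPE_BONUS                                  # type fits a detected intent
    if q & tokens(col.table_description):
        s += TABLE_BONUS                                 # table-level context
    return s

def rank(cols, nlq, intents):
    """Return (normalized_score, column) pairs, best first."""
    raw = [(score(c, nlq, intents), c) for c in cols]
    top = max((s for s, _ in raw), default=0.0) or 1.0   # avoid division by zero
    return sorted(((s / top, c) for s, c in raw), key=lambda p: p[0], reverse=True)

cols = [
    Column("users", "email", "string", "Contact email address"),
    Column("users", "user_status", "string", "Account status: active, suspended, or deleted"),
    Column("users", "signup_date", "date", "Date the user signed up"),
]
ranked = rank(cols, "How many active users signed up in January 2025?",
              {"time-range", "aggregation"})
for s, c in ranked:
    print(f"{s:.2f} {c.table}.{c.name}")
```

Plain token overlap is crude (it counts words such as "up"), so in practice it would only be one signal alongside the semantic comparison described in step 4.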

# Usage examples

Example 1 — Single-table, aggregation
- Input NLQ: "How many active users signed up in January 2025?"
- Expected outputs:
  - DSL fragment selecting users table, columns: signup_date (date/timestamp), user_status (description mentions active), user_id (for count)
  - Example SQL: SELECT COUNT(DISTINCT user_id) FROM users WHERE signup_date >= '2025-01-01' AND signup_date < '2025-02-01' AND user_status = 'active';
  - Per-column YAML excerpt and rationale for signup_date, user_status, user_id.
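
A small way to sanity-check the date filter for Example 1 is to run it against an in-memory SQLite database. The table layout and sample rows below are assumptions made for illustration only. The sketch compares a half-open range with an inclusive BETWEEN on date-only bounds: if `signup_date` holds timestamps, the inclusive form stops at midnight on January 31 and misses sign-ups later that day.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical users table; timestamps are stored as ISO-8601 text.
conn.execute("CREATE TABLE users (user_id INTEGER, signup_date TEXT, user_status TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [
        (1, "2025-01-01 00:00:00", "active"),
        (2, "2025-01-31 18:30:00", "active"),    # late on the last day of January
        (3, "2025-01-15 09:00:00", "inactive"),  # excluded by the status filter
        (4, "2025-02-01 00:00:00", "active"),    # first instant of February
    ],
)

HALF_OPEN = """
SELECT COUNT(DISTINCT user_id) FROM users
WHERE signup_date >= '2025-01-01' AND signup_date < '2025-02-01'
  AND user_status = 'active'
"""
INCLUSIVE = """
SELECT COUNT(DISTINCT user_id) FROM users
WHERE signup_date BETWEEN '2025-01-01' AND '2025-01-31'
  AND user_status = 'active'
"""

half_open_count = conn.execute(HALF_OPEN).fetchone()[0]
inclusive_count = conn.execute(INCLUSIVE).fetchone()[0]
print(half_open_count, inclusive_count)  # 2 1
```

If the YAML describes `signup_date` as a pure date column, both forms return the same rows; the half-open range is simply the safer default when the type is uncertain.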

Example 2 — Multi-table join
- Input NLQ: "Show total revenue by product category for last quarter."
- Expected outputs:
  - DSL selecting orders and products tables, join on product_id, columns: order_amount, product_category, order_date
  - Example SQL snippet and full SQL with GROUP BY product_category, filters for last quarter
  - YAML excerpts and short rationales for each column and join key.
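
Example 2 can be sketched end to end in the same way. The code below assumes that "last quarter" means the calendar quarter before the one containing the reference date, which is one common reading and should be listed in the Assumptions section. The `orders` and `products` layouts, the sample rows, and the `last_quarter_bounds` helper are all hypothetical.

```python
import sqlite3
from datetime import date

def last_quarter_bounds(today: date) -> tuple[date, date]:
    """Return [start, end) of the calendar quarter before the one containing today."""
    q_start_month = 3 * ((today.month - 1) // 3) + 1
    end = date(today.year, q_start_month, 1)  # start of the current quarter
    if q_start_month == 1:
        start = date(today.year - 1, 10, 1)   # previous quarter is Q4 of last year
    else:
        start = date(today.year, q_start_month - 3, 1)
    return start, end

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_id INTEGER, product_category TEXT)")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, product_id INTEGER, "
    "order_amount REAL, order_date TEXT)"
)
conn.executemany("INSERT INTO products VALUES (?, ?)", [(10, "books"), (11, "games")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, 10, 100.0, "2024-10-05"),
        (2, 11, 50.0, "2024-12-31"),
        (3, 10, 25.0, "2024-11-20"),
        (4, 10, 999.0, "2025-01-02"),  # outside the previous quarter
    ],
)

start, end = last_quarter_bounds(date(2025, 2, 10))
rows = conn.execute(
    """
    SELECT p.product_category, SUM(o.order_amount) AS total_revenue
    FROM orders AS o
    JOIN products AS p ON o.product_id = p.product_id
    WHERE o.order_date >= ? AND o.order_date < ?
    GROUP BY p.product_category
    ORDER BY total_revenue DESC
    """,
    (start.isoformat(), end.isoformat()),
).fetchall()
print(rows)  # [('books', 125.0), ('games', 50.0)]
```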

# Best practices

- Always quote/use the exact column and table identifiers as they appear in the YAML metadata to avoid naming mismatches.
- Prefer minimal column sets; include join keys and any additional columns required to preserve semantics (time zone columns, currency columns) with explicit rationale.
- If YAML descriptions are ambiguous or missing, state the uncertainty explicitly, keep the selection conservative, and label each choice as "recommended" or "possible".
- When multiple columns match similarly, provide ranked alternatives and a brief reason for each alternative.
- Keep DSL and SQL consistent; any assumption (dataset name, DBT model prefix, macro) should be stated in the Assumptions section.
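
The first practice, exact identifiers, is easy to check mechanically before returning a result. The sketch below is one possible check, assuming the YAML metadata has already been parsed into a mapping of table name to column names; `find_unknown_identifiers` and both data structures are hypothetical names introduced for illustration.

```python
def find_unknown_identifiers(selection, schema):
    """Return 'table' or 'table.column' entries that are absent from the schema.

    Matching is exact and case-sensitive, mirroring the text in the YAML files.
    """
    unknown = []
    for table, columns in selection.items():
        if table not in schema:
            unknown.append(table)  # whole table is missing, skip its columns
            continue
        unknown.extend(f"{table}.{c}" for c in columns if c not in schema[table])
    return unknown

# Assumed shape of parsed YAML metadata: table name -> set of column names.
schema = {"users": {"user_id", "signup_date", "user_status"}}
# Assumed shape of a selection: table name -> chosen columns.
selection = {"users": ["user_id", "Signup_Date"], "orders": ["order_amount"]}

problems = find_unknown_identifiers(selection, schema)
print(problems)  # ['users.Signup_Date', 'orders']
```

An empty result means every selected table and column appears verbatim in the metadata; anything else is worth reporting in the Assumptions section rather than silently correcting.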

# Templates & placeholders

- DSL template (placeholder if QuVI DSL specifics are not provided):
  node: select_table
  table: <table_name>
  columns:
    - <column1>
    - <column2>
  filters:
    - <filter_expressions>
  joins:
    - {left_table: <table>, left_key: <col>, right_table: <table>, right_key: <col>}

- Use this template to populate the chosen elements. If your environment uses a different DSL format, adapt fields accordingly.
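
Filling the placeholder template can be sketched as a small rendering function. This assumes the placeholder layout shown above, not the real QuVI DSL; `render_dsl` and its parameters are hypothetical, and the function does no escaping of filter text.

```python
def render_dsl(table, columns, filters=(), joins=()):
    """Render the placeholder DSL template as text.

    joins is an iterable of (left_table, left_key, right_table, right_key).
    The filters and joins sections are omitted when empty.
    """
    lines = ["node: select_table", f"table: {table}", "columns:"]
    lines += [f"  - {c}" for c in columns]
    if filters:
        lines.append("filters:")
        lines += [f"  - {f}" for f in filters]
    if joins:
        lines.append("joins:")
        for left_table, left_key, right_table, right_key in joins:
            lines.append(
                f"  - {{left_table: {left_table}, left_key: {left_key}, "
                f"right_table: {right_table}, right_key: {right_key}}}"
            )
    return "\n".join(lines)

dsl = render_dsl(
    "orders",
    ["order_amount", "order_date"],
    filters=["order_date >= '2024-10-01'"],
    joins=[("orders", "product_id", "products", "product_id")],
)
print(dsl)
```

Generating the fragment from the same selection structure that drives the SQL helps keep the DSL and SQL consistent with each other.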

# Links to examples

See the supporting sample files referenced below for a sample YAML schema and a sample NL query to test the Skill.

