Identify and Review Data Quality Issues

The MKK API exposes a dedicated data quality endpoint that aggregates extraction issues across documents for one or more funds. You can use it to identify documents with too few line items, values that failed to parse as numbers, mappings that may need human review, and documents with no usable content at all. This is the starting point for any data validation or remediation workflow.

The data quality endpoint

GET https://mkk-roan.vercel.app/api/data-quality

Filter by fund and configure the quality threshold:

curl "https://mkk-roan.vercel.app/api/data-quality?fund_code=OJB&low_line_item_threshold=5"

Parameters

Parameter	Type	Description
`fund_id`	integer	Filter to a single fund by internal ID.
`fund_code`	string	Filter to a single fund by code (e.g. `OJB`).
`limit`	integer	Maximum number of issue records to return.
`low_line_item_threshold`	integer	Documents with fewer line items than this value are flagged. Default varies by deployment.
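The parameters above can also be assembled programmatically. The following is a minimal Python sketch using only the standard library; the helper names `build_data_quality_url` and `fetch_data_quality` are illustrative and not part of the MKK API, and the 30-second timeout is an arbitrary choice.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://mkk-roan.vercel.app/api/data-quality"


def build_data_quality_url(fund_id=None, fund_code=None, limit=None,
                           low_line_item_threshold=None):
    """Return the endpoint URL with only the parameters that were set."""
    params = {
        "fund_id": fund_id,
        "fund_code": fund_code,
        "limit": limit,
        "low_line_item_threshold": low_line_item_threshold,
    }
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE_URL}?{query}" if query else BASE_URL


def fetch_data_quality(**filters):
    """Fetch and decode the JSON report (performs a network request)."""
    with urlopen(build_data_quality_url(**filters), timeout=30) as response:
        return json.load(response)


# Builds the same URL as the curl example above.
print(build_data_quality_url(fund_code="OJB", low_line_item_threshold=5))
```

Calling `fetch_data_quality(fund_code="OJB", low_line_item_threshold=5)` would issue the request and return the decoded report as a dictionary.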

Quality issue categories

The response groups issues into six named categories:

1. low_line_item_documents

Documents that contain fewer line items than the low_line_item_threshold. This usually indicates a parsing failure, an unusual document layout, or a document that was submitted without structured financial data.

2. portfolio_only_documents

Documents that contain portfolio (holdings) data but no structured line item values. These documents are partially usable — portfolio analysis is possible — but financial summary data (e.g. net asset value, expense ratios) is absent.

3. empty_documents

Documents with neither line items nor portfolio entries. These documents were processed but yielded no structured data at all. They may correspond to cover pages, amendments, or documents in an unsupported format.

4. review_mappings

Line item values where the mapping confidence score falls below the acceptable threshold. These values were extracted and mapped to a known line item slug, but the match is uncertain enough to warrant manual verification before use in analysis.

5. numeric_parse_failures

Values where the raw text was extracted but could not be parsed into a numeric value. Common causes include non-standard number formatting, footnote markers embedded in the value field, or text entries that are not numeric by nature.
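To illustrate the causes listed above, here is a hedged Python sketch of a client-side re-parse you might try on a raw value before sending it for manual review. It is not the API's own parser. It assumes Turkish-style formatting (dot as thousands separator, comma as decimal separator) and treats trailing `*` or `†` characters as footnote markers; `try_parse_number` is a hypothetical helper name.

```python
import re
from typing import Optional

# Optional sign, then either dot-grouped thousands or plain digits,
# then an optional comma-separated decimal part.
_NUMBER = re.compile(r"-?(?:\d{1,3}(?:\.\d{3})*|\d+)(?:,\d+)?")


def try_parse_number(raw: str) -> Optional[float]:
    """Best-effort parse; returns None when the text is not numeric."""
    # Strip surrounding whitespace and trailing footnote markers.
    text = raw.strip().rstrip("*† ")
    if not _NUMBER.fullmatch(text):
        return None
    # Remove thousands separators, then switch the decimal comma to a dot.
    return float(text.replace(".", "").replace(",", "."))


for raw in ["1.234,56 *", "12,5", "N/A"]:
    print(repr(raw), "->", try_parse_number(raw))
```

Values such as `N/A` still return `None`: they are text entries that are not numeric by nature and need a human decision rather than a formatting fix.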

6. missing_pdfs

Documents whose original PDF source file is not available via GET /documents/{docId}/pdf. The document record exists and may contain extracted data, but you cannot verify it against the source.

Response structure

A representative response for GET /data-quality?fund_code=OJB:

{
  "scope": { "fund_code": "OJB" },
  "limits": { "list_limit": 50, "low_line_item_threshold": 5 },
  "summary": {
    "documents": 48,
    "documents_with_line_items": 45,
    "documents_with_portfolio": 44,
    "documents_with_both": 42,
    "portfolio_only_documents": 1,
    "empty_documents": 0,
    "low_line_item_documents": 3,
    "review_mappings": 14,
    "numeric_parse_failures": 7,
    "portfolio_numeric_parse_failures": 2,
    "missing_pdfs": 2
  },
  "mapping_methods": [
    {
      "method": "exact_label_match",
      "count": 1240,
      "average_confidence": 0.99
    },
    {
      "method": "fuzzy_label_match",
      "count": 87,
      "average_confidence": 0.74
    },
    {
      "method": "positional_match",
      "count": 34,
      "average_confidence": 0.61
    }
  ],
  "low_line_item_documents": [
    {
      "document_id": 55,
      "period": "2022-Q2",
      "line_item_count": 2,
      "threshold": 5
    }
  ],
  "portfolio_only_documents": [
    {
      "document_id": 63,
      "period": "2022-Q4"
    }
  ],
  "empty_documents": [],
  "review_mappings": [
    {
      "line_item_value_id": 901,
      "document_id": 42,
      "period": "2023-Q4",
      "line_item_slug": "management-fee",
      "raw_label": "Yönetim Ücreti *",
      "mapping_confidence": 0.62
    }
  ],
  "numeric_parse_failures": [
    {
      "line_item_value_id": 870,
      "document_id": 42,
      "period": "2023-Q4",
      "line_item_slug": "other-income",
      "raw_value": "N/A"
    }
  ],
  "missing_pdfs": [
    {
      "document_id": 29,
      "period": "2021-Q1",
      "disclosure_index": "OLD98765"
    }
  ]
}
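Because the same document can appear in several category lists, it can help to regroup the issues per document. The sketch below does this in Python on a trimmed copy of the representative response. It assumes every issue record carries a `document_id` field, as the records shown above do; `issues_by_document` is an illustrative helper name.

```python
from collections import defaultdict

ISSUE_CATEGORIES = [
    "low_line_item_documents",
    "portfolio_only_documents",
    "empty_documents",
    "review_mappings",
    "numeric_parse_failures",
    "missing_pdfs",
]


def issues_by_document(report):
    """Map each document_id to the list of categories it appears in."""
    grouped = defaultdict(list)
    for category in ISSUE_CATEGORIES:
        for record in report.get(category, []):
            grouped[record["document_id"]].append(category)
    return dict(grouped)


# Trimmed copy of the representative response above.
sample_report = {
    "low_line_item_documents": [{"document_id": 55, "line_item_count": 2}],
    "portfolio_only_documents": [{"document_id": 63}],
    "empty_documents": [],
    "review_mappings": [{"document_id": 42, "mapping_confidence": 0.62}],
    "numeric_parse_failures": [{"document_id": 42, "raw_value": "N/A"}],
    "missing_pdfs": [{"document_id": 29}],
}

for doc_id, categories in sorted(issues_by_document(sample_report).items()):
    print(doc_id, categories)
```

In this sample, document 42 shows up under both review_mappings and numeric_parse_failures, which makes it a natural first candidate for review.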

The summary section

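The summary object provides aggregate counts across all documents matching your filters: coverage counts (documents, documents_with_line_items, documents_with_portfolio, documents_with_both) and one count per issue category. It also reports portfolio_numeric_parse_failures, which has no corresponding detail list in the representative response. Use the summary to triage: if review_mappings is large, focus on improving mapping rules; if empty_documents is non-zero, investigate whether those document formats are supported.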
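One simple way to triage is to rank the issue counts in the summary from largest to smallest. The following Python sketch does that with the counts from the representative response; `rank_issues` is an illustrative helper, and the ordering is only a rough signal because some counts refer to documents and others to individual values.

```python
ISSUE_KEYS = [
    "portfolio_only_documents",
    "empty_documents",
    "low_line_item_documents",
    "review_mappings",
    "numeric_parse_failures",
    "portfolio_numeric_parse_failures",
    "missing_pdfs",
]


def rank_issues(summary):
    """Return (issue, count) pairs with non-zero counts, largest first."""
    counts = [(key, summary.get(key, 0)) for key in ISSUE_KEYS]
    non_zero = [pair for pair in counts if pair[1] > 0]
    return sorted(non_zero, key=lambda pair: pair[1], reverse=True)


# Issue counts copied from the representative response.
summary = {
    "documents": 48,
    "portfolio_only_documents": 1,
    "empty_documents": 0,
    "low_line_item_documents": 3,
    "review_mappings": 14,
    "numeric_parse_failures": 7,
    "portfolio_numeric_parse_failures": 2,
    "missing_pdfs": 2,
}

for issue, count in rank_issues(summary):
    print(f"{issue}: {count}")
```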

The mapping_methods breakdown

The mapping_methods array shows how values were mapped to line item slugs and the average confidence for each method. Methods with low average confidence across many values indicate a systematic extraction or mapping issue that may need a rule update rather than individual review.
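As a sketch of how this breakdown might be read programmatically, the Python below computes a count-weighted overall confidence and lists the methods whose average falls below a cutoff. The cutoff of 0.7 and the helper names are assumptions chosen for illustration; the API does not document a specific confidence threshold here.

```python
def weighted_confidence(methods):
    """Average confidence across all mapped values, weighted by count."""
    total = sum(m["count"] for m in methods)
    if total == 0:
        return None
    return sum(m["count"] * m["average_confidence"] for m in methods) / total


def low_confidence_methods(methods, cutoff=0.7):
    """Names of methods whose average confidence is below the cutoff."""
    return [m["method"] for m in methods if m["average_confidence"] < cutoff]


# Values copied from the representative response.
mapping_methods = [
    {"method": "exact_label_match", "count": 1240, "average_confidence": 0.99},
    {"method": "fuzzy_label_match", "count": 87, "average_confidence": 0.74},
    {"method": "positional_match", "count": 34, "average_confidence": 0.61},
]

print(round(weighted_confidence(mapping_methods), 3))
print(low_confidence_methods(mapping_methods))
```

With the sample values, only positional_match falls below the assumed cutoff, and it accounts for a small share of all mapped values.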

Adjusting the low_line_item_threshold

The low_line_item_threshold parameter controls how many line items a document must contain before it is considered adequately populated. Adjust it to match the typical richness of your fund’s documents.

# Use a threshold of 5 — flag documents with fewer than 5 line items
curl "https://mkk-roan.vercel.app/api/data-quality?fund_code=OJB&low_line_item_threshold=5"

# Use a higher threshold for funds with rich disclosures
curl "https://mkk-roan.vercel.app/api/data-quality?fund_code=OJB&low_line_item_threshold=20"

The threshold only affects the low_line_item_documents category. The other five categories are always computed regardless of the threshold value.
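To see what a higher threshold changes, you can compare the low_line_item_documents lists from two reports. The Python sketch below works on already-decoded responses; the report at threshold 20 and its document 60 are hypothetical data made up for the example, and `newly_flagged` is an illustrative helper name.

```python
def flagged_ids(report):
    """Set of document IDs listed under low_line_item_documents."""
    return {doc["document_id"]
            for doc in report.get("low_line_item_documents", [])}


def newly_flagged(report_low, report_high):
    """Documents flagged at the higher threshold but not at the lower one."""
    return sorted(flagged_ids(report_high) - flagged_ids(report_low))


# Report at threshold 5, as in the representative response.
at_5 = {"low_line_item_documents": [
    {"document_id": 55, "line_item_count": 2, "threshold": 5},
]}
# Hypothetical report at threshold 20; document 60 is invented.
at_20 = {"low_line_item_documents": [
    {"document_id": 55, "line_item_count": 2, "threshold": 20},
    {"document_id": 60, "line_item_count": 12, "threshold": 20},
]}

print(newly_flagged(at_5, at_20))  # [60]
```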

Exporting the quality report as CSV

Use GET /exports/data-quality.csv to download the same report in CSV format for offline analysis or sharing with your data team.

curl "https://mkk-roan.vercel.app/api/exports/data-quality.csv?fund_code=OJB" \
  -o ojb-data-quality.csv

The CSV export accepts the same fund_id, fund_code, and low_line_item_threshold parameters as the JSON endpoint, plus an export_limit parameter.
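Once downloaded, the file can be inspected with Python's standard `csv` module. The sketch below builds the export URL and summarizes CSV text without assuming a particular schema. The column names in `sample_csv` are hypothetical, since the export's actual columns are not listed here, and both helper names are illustrative.

```python
import csv
import io
from urllib.parse import urlencode

EXPORT_URL = "https://mkk-roan.vercel.app/api/exports/data-quality.csv"


def build_export_url(**params):
    """Return the export URL with only the parameters that were set."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{EXPORT_URL}?{query}" if query else EXPORT_URL


def summarize_csv(text):
    """Return (column names, number of data rows) for CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    return list(reader.fieldnames or []), len(rows)


# Hypothetical content; the real export's columns may differ.
sample_csv = (
    "category,document_id,period\n"
    "missing_pdfs,29,2021-Q1\n"
    "low_line_item_documents,55,2022-Q2\n"
)

print(build_export_url(fund_code="OJB"))
print(summarize_csv(sample_csv))
```

For a file saved with the curl command above, read it with `open("ojb-data-quality.csv", newline="", encoding="utf-8")` and pass its contents to `summarize_csv`.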

Run the quality report regularly after new documents are ingested. Catching review_mappings and numeric_parse_failures early prevents low-quality values from propagating into downstream analysis.
