Credential Scanning

What you’ll learn

  • The 16 credential patterns and 4 PII patterns NabaOS detects
  • How to test the scanner from the command line
  • How redaction works and what the output looks like
  • How to verify detection with specific pattern examples

Overview

The credential scanner runs on every piece of text that enters or leaves the system: user input, LLM responses, chain step outputs, and log messages. It uses compiled regex patterns to detect secrets and personally identifiable information (PII) in under 1 ms.

When a match is found, the scanner replaces it with a placeholder that names the matched pattern type, such as [REDACTED:aws_access_key]. The original secret value is never logged, stored, or returned in any API response. Byte offsets are kept pub(crate) to prevent external code from reverse-engineering secret positions from match metadata.


16 Credential Patterns

The scanner detects the following credential types, listed in scan order:

| #  | Pattern ID         | What it matches                               | Example prefix                     |
|----|--------------------|-----------------------------------------------|------------------------------------|
| 1  | aws_access_key     | AWS access key ID                             | AKIA + 16 alphanumeric             |
| 2  | aws_secret_key     | AWS secret access key                         | 40-char base64-like string         |
| 3  | gcp_api_key        | Google Cloud Platform API key                 | AIza + 35 chars                    |
| 4  | openai_key         | OpenAI API key                                | sk- + 20+ chars                    |
| 5  | anthropic_key      | Anthropic API key                             | sk-ant- + 20+ chars                |
| 6  | github_pat         | GitHub personal access token                  | ghp_ + 36 chars                    |
| 7  | github_oauth       | GitHub OAuth token                            | gho_ + 36 chars                    |
| 8  | gitlab_pat         | GitLab personal access token                  | glpat- + 20+ chars                 |
| 9  | stripe_key         | Stripe secret key                             | sk_test_ or sk_live_ + 24+ chars   |
| 10 | stripe_restricted  | Stripe restricted key                         | rk_test_ or rk_live_ + 24+ chars   |
| 11 | private_key        | PEM private key header                        | -----BEGIN [RSA] PRIVATE KEY-----  |
| 12 | private_key_body   | Base64 private key material (no header)       | MII + 60+ base64 chars             |
| 13 | generic_secret     | Keyword-value pairs (password=, token=, etc.) | password = "..."                   |
| 14 | connection_string  | Database connection URIs                      | postgres://, mongodb://, redis://  |
| 15 | telegram_bot_token | Telegram bot API token                        | 8-10 digit ID + : + 35-char secret |
| 16 | huggingface_token  | HuggingFace API token                         | hf_ + 34+ chars                    |
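
As a rough illustration of how the fixed-prefix rows in the table above could translate into compiled patterns, the following Python sketch covers four of them. This chapter does not show the actual NabaOS expressions, so the regexes, the CREDENTIAL_PATTERNS list, and the scan_credentials helper are all assumptions derived only from the "Example prefix" column.

```python
import re

# Hypothetical patterns, derived from the "Example prefix" column only.
# The real NabaOS expressions may differ (for example in boundary rules
# or in which characters count as "alphanumeric").
CREDENTIAL_PATTERNS = [
    ("aws_access_key", re.compile(r"AKIA[0-9A-Z]{16}")),
    ("github_pat", re.compile(r"ghp_[A-Za-z0-9]{36}")),
    ("stripe_key", re.compile(r"sk_(?:test|live)_[A-Za-z0-9]{24,}")),
    ("huggingface_token", re.compile(r"hf_[A-Za-z0-9]{34,}")),
]


def scan_credentials(text):
    """Return the IDs of all patterns that match, in list (scan) order."""
    return [pid for pid, rx in CREDENTIAL_PATTERNS if rx.search(text)]
```

Compiling each expression once and reusing it for every scan is what keeps per-message cost low; only the pattern ID is returned, never the matched value.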

4 PII Patterns

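| # | Pattern ID  | What it matches                  | Example                         |
|---|-------------|----------------------------------|---------------------------------|
| 1 | us_ssn      | US Social Security Number        | 123-45-6789                     |
| 2 | credit_card | Visa, Mastercard, Amex, Discover | 4111111111111111                |
| 3 | email       | Email addresses                  | alice@example.com               |
| 4 | phone_us    | US phone numbers                 | (555) 123-4567, +1-555-123-4567 |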

PII matches use the PII_REDACTED prefix in placeholders instead of REDACTED, so downstream code can distinguish between credential leaks and personal data exposure.


How to Test

Use the nabaos admin scan command to test the scanner against any input:

nabaos admin scan "my AWS key is AKIAIOSFODNN7EXAMPLE and email is alice@example.com"

Expected output:

=== Security Scan Results ===

Credential matches: 1
  [1] aws_access_key

PII matches: 1
  [1] email

Redacted text:
  my AWS key is [REDACTED:aws_access_key] and email is [PII_REDACTED:email]

Test each pattern type

Here are test commands for most of the credential patterns and for all four PII patterns:

# AWS access key
nabaos admin scan "AKIAIOSFODNN7EXAMPLE"

# OpenAI key
nabaos admin scan "sk-abc123def456ghi789jkl012mno345"

# Anthropic key
nabaos admin scan "sk-ant-api03-abcdefghijklmnopqrst"

# GitHub PAT
nabaos admin scan "ghp_ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghij"

# GitLab PAT
nabaos admin scan "glpat-xxxxxxxxxxxxxxxxxxxx"

# Stripe key
nabaos admin scan "sk_live_abcdefghijklmnopqrstuvwx"

# Private key header
nabaos admin scan "-----BEGIN RSA PRIVATE KEY-----"

# Generic secret
nabaos admin scan 'password = "MyS3cretP@ssw0rd!"'

# Connection string
nabaos admin scan "postgres://user:pass@localhost:5432/mydb"

# Telegram bot token
nabaos admin scan "1234567890:ABCDefghIJKLmnopQRSTuvwxYZ123456789"

# HuggingFace token
nabaos admin scan "hf_abcdefghijklmnopqrstuvwxyz12345678"

# SSN
nabaos admin scan "SSN is 123-45-6789"

# Credit card
nabaos admin scan "Card: 4111111111111111"

# Email
nabaos admin scan "Contact alice@example.com"

# Phone
nabaos admin scan "Call (555) 123-4567"

How Redaction Works

The redaction process operates in four steps:

  1. Scan credentials: All 16 credential patterns are evaluated against the input text. Each match records its type, byte start offset, and byte end offset.

  2. Scan PII: All 4 PII patterns are evaluated. Matches are added to the same list.

  3. Deduplicate overlaps: Matches are sorted by position (descending). If two matches overlap in byte range, the more specific match (scanned first) is kept and the other is dropped.

  4. Replace: Working from the end of the string backward (so byte offsets remain valid), each match is replaced with its placeholder string.
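
Steps 3 and 4 above can be sketched in Python as follows. This is a simplified illustration, not the NabaOS implementation: the Match class, its priority field (standing in for scan order), and the redact function are hypothetical names, overlaps are resolved in scan order before the positional sort, and Python string slicing works on character offsets rather than the byte offsets the scanner records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    pattern_id: str
    start: int      # start offset (inclusive)
    end: int        # end offset (exclusive)
    priority: int   # scan order; lower means scanned earlier
    is_pii: bool = False


def redact(text, matches):
    """Drop overlapping matches, then replace the survivors back to front."""
    kept = []
    # Visit matches in scan order so an earlier-scanned match wins an overlap.
    for m in sorted(matches, key=lambda m: m.priority):
        if all(m.end <= k.start or m.start >= k.end for k in kept):
            kept.append(m)
    # Replace from the end of the string so earlier offsets stay valid.
    for m in sorted(kept, key=lambda m: m.start, reverse=True):
        prefix = "PII_REDACTED" if m.is_pii else "REDACTED"
        text = text[:m.start] + f"[{prefix}:{m.pattern_id}]" + text[m.end:]
    return text
```

Replacing back to front matters because each placeholder usually has a different length from the text it replaces; working forward would shift every later offset.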

Placeholder format

Credentials are replaced with:

[REDACTED:pattern_id]

PII is replaced with:

[PII_REDACTED:pattern_id]
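
Because the two prefixes differ, downstream code can separate credential leaks from PII exposure by parsing the placeholders alone. A minimal Python sketch, assuming pattern IDs consist of lowercase letters, digits, and underscores as in the tables on this page; the classify_placeholders helper is a hypothetical name, not part of NabaOS:

```python
import re

# Placeholder shapes as documented: [REDACTED:id] and [PII_REDACTED:id].
PLACEHOLDER = re.compile(r"\[(PII_REDACTED|REDACTED):([a-z0-9_]+)\]")


def classify_placeholders(redacted_text):
    """Split placeholder pattern IDs into (credentials, pii) lists."""
    credentials, pii = [], []
    for prefix, pattern_id in PLACEHOLDER.findall(redacted_text):
        (pii if prefix == "PII_REDACTED" else credentials).append(pattern_id)
    return credentials, pii
```

Applied to the redacted text from the earlier scan example, this yields one credential ID (aws_access_key) and one PII ID (email).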

Where redaction runs

| Location          | When                                | Why                                                         |
|-------------------|-------------------------------------|-------------------------------------------------------------|
| Input gate        | Before security classification     | Prevent secrets from reaching the BERT classifier context   |
| LLM output        | After every LLM response            | Catch secrets the model may have memorized or hallucinated  |
| Chain step output | After each tool call returns        | Catch secrets in API responses                              |
| Log pipeline      | Before any text is written to logs  | Ensure secrets never appear in log files                    |
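
To make the "log pipeline" row concrete, here is a hedged Python analogue built on the standard logging module. It is not how NabaOS wires its own logging; the RedactingFilter class, the capture_redacted helper, and the single AWS pattern are assumptions chosen to show redaction happening before any text reaches a log sink.

```python
import io
import logging
import re

# Hypothetical single pattern; a real scanner would apply the full set.
AWS_ACCESS_KEY = re.compile(r"AKIA[0-9A-Z]{16}")


class RedactingFilter(logging.Filter):
    """Redact the fully formatted message before the handler emits it."""

    def filter(self, record):
        message = record.getMessage()  # merge msg and args first
        record.msg = AWS_ACCESS_KEY.sub("[REDACTED:aws_access_key]", message)
        record.args = None             # args are already merged into msg
        return True                    # keep the record, now redacted


def capture_redacted(message, *args):
    """Log one message through the filter and return what was written."""
    stream = io.StringIO()
    logger = logging.getLogger("redaction_demo")
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.addFilter(RedactingFilter())
    logger.addHandler(handler)
    logger.info(message, *args)
    return stream.getvalue()
```

The filter formats the message first and only then redacts, so a secret passed as a format argument is caught as well as one embedded in the message string.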

Design Decisions

Why regex instead of ML? Most credential patterns have rigid, well-defined formats (fixed prefixes, known lengths). Regex detection is deterministic, auditable, and runs in under 1 ms. An ML classifier would add latency, require training data, and introduce additional false-negative risk for a problem that regex already handles reliably.

Why cap generic_secret at 200 characters? The value portion of the pattern is [^\s'"]{8,200}. With an unbounded quantifier in its place, the pattern could backtrack heavily on long non-matching strings, causing a regex denial-of-service (ReDoS). The 200-character cap bounds worst-case execution time.
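
The effect of the cap can be seen in a small Python sketch. Only the [^\s'"]{8,200} value class comes from the text above; the keyword list, the separators, and the secret_value helper are assumptions, and Python's re module is used here as a stand-in backtracking engine rather than whatever engine NabaOS uses.

```python
import re

# Hypothetical generic_secret pattern: the keyword list and separators are
# assumptions; only the [^\s'"]{8,200} value class comes from the docs.
GENERIC_SECRET = re.compile(
    r"""(?i)\b(?:password|token|secret|api_key)\s*[=:]\s*['"]?([^\s'"]{8,200})"""
)


def secret_value(text):
    """Return the captured value, or None when nothing matches."""
    m = GENERIC_SECRET.search(text)
    return m.group(1) if m else None
```

However long the input, the captured value never exceeds 200 characters, so the work done per candidate keyword is bounded.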

Why are byte offsets pub(crate)? Exposing match positions in a public API would allow an attacker to infer secret length and location from redaction metadata. By keeping offsets internal, the public interface reveals only the type of credential found, not where it was in the input.


Next Steps