feat(data-retention): granular PII redaction stages (input + block outputs) #5272

Open
TheodoreSpeaks wants to merge 16 commits into staging from feat/pii-granular-redaction

@TheodoreSpeaks

Collaborator

Summary

  • Add two execution-altering PII redaction stages alongside the existing log redaction: redact the workflow input before execution, and mask every block output in-flight before the next block reads it
  • Per-stage policy (entity types + language) for each of Logs / Workflow input / Block outputs; resolved most-specific-wins per workspace, with full back-compat for existing logs-only rules
  • In-flight stages fail fast (abort the run) on a Presidio error instead of scrubbing or leaking; the logs stage keeps scrub-to-marker
  • Reuse the shared HTTP → Presidio path; block-output redaction runs before payload compaction so offloaded large values are still masked
  • Settings UI: chip-tabs across the three stages, language-first picker with the entity grid filtered to that language's recognizers, and a confirmation before removing a workspace override
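
To make the per-stage resolution concrete, here is a minimal sketch of most-specific-wins with the legacy logs-only mapping. The type names, the `stages` key layout, the `"en"` default, and the choice to let a workspace override replace the organization rule wholesale (rather than merging stage by stage) are all assumptions for illustration; the resolver in `apps/sim/lib/billing/retention.ts` may differ.

```typescript
// Hypothetical shapes; not the PR's actual types.
type Stage = "logs" | "input" | "blockOutputs";

interface StagePolicy {
  enabled: boolean;
  entityTypes: string[];
  language: string;
}

// Legacy flat rule: predates stages and applies to logs only.
interface LegacyRule {
  entityTypes: string[];
  language?: string;
}

// New per-stage rule.
interface StagedRule {
  stages: Partial<Record<Stage, StagePolicy>>;
}

type PiiRule = LegacyRule | StagedRule;
type ResolvedPolicy = Record<Stage, StagePolicy>;

const DISABLED: StagePolicy = { enabled: false, entityTypes: [], language: "en" };

// Normalize either rule shape into one policy per stage.
function normalize(rule: PiiRule): ResolvedPolicy {
  if ("stages" in rule) {
    return {
      logs: rule.stages.logs ?? DISABLED,
      input: rule.stages.input ?? DISABLED,
      blockOutputs: rule.stages.blockOutputs ?? DISABLED,
    };
  }
  // Back-compat: a flat rule never turns on the execution-altering stages.
  return {
    logs: {
      enabled: rule.entityTypes.length > 0,
      entityTypes: rule.entityTypes,
      language: rule.language ?? "en",
    },
    input: DISABLED,
    blockOutputs: DISABLED,
  };
}

// Most-specific-wins (assumed here to be whole-rule replacement):
// a workspace override, when present, shadows the organization rule.
function resolvePolicy(
  orgRule: PiiRule | undefined,
  workspaceOverride: PiiRule | undefined
): ResolvedPolicy {
  const winner = workspaceOverride ?? orgRule;
  if (!winner) {
    return { logs: DISABLED, input: DISABLED, blockOutputs: DISABLED };
  }
  return normalize(winner);
}
```

Under these assumptions, an organization that only has a flat rule keeps exactly its current behavior: logs are redacted, and neither the workflow input nor block outputs are touched until a staged rule opts in.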

Type of Change

  • New feature

Testing

Tested manually. Unit tests cover resolver back-compat, redactObjectStrings and its failure modes, and the contract schema. bun run lint, check:api-validation:strict, and check:migrations origin/staging all pass.

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

@vercel

vercel Bot commented Jun 29, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| docs | Skipped | Skipped | Jun 30, 2026 7:46pm |

@cursor

cursor Bot commented Jun 29, 2026


PR Summary

High Risk
Changes execution-time data (inputs, block outputs, streams, memory) and log persistence with fail-fast vs scrub semantics; misconfiguration or Presidio outages can abort runs or affect workflow correctness, not only observability.

Overview
Adds three independently configurable PII redaction stages (workflow input, block outputs, logs), each with its own entity types and language, while legacy flat rules still map to logs-only.

Runtime: Workflow input is masked before execution when the input stage is on. Block outputs are masked in-flight (before compaction and downstream blocks), including buffer-only streaming so raw chunks are not forwarded when that stage is enabled; policy propagates to child workflows and agent memory writes. Input/block stages abort the run on Presidio failure (onFailure: 'throw'); log persistence keeps scrub-to-marker behavior. Log redaction now uses only the logs stage, applies without the pii-redaction feature flag (stored rules are the source of truth), and hydrates large-value refs before masking so offloaded content gets the logs policy.
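
The difference between the two failure modes can be sketched as a recursive string walker that takes an `onFailure` option. This is a simplified, synchronous stand-in, not the PR's `redactObjectStrings`: the `Masker` callback replaces the real Presidio call, and the marker text `[REDACTION_FAILED]` is a hypothetical placeholder.

```typescript
type OnFailure = "throw" | "scrub";

// Stand-in for the Presidio round trip; it may throw on an outage.
type Masker = (text: string) => string;

// Hypothetical marker; the real scrub marker is not shown in this thread.
const FAILURE_MARKER = "[REDACTION_FAILED]";

function redactStringsSketch(value: unknown, mask: Masker, onFailure: OnFailure): unknown {
  if (typeof value === "string") {
    try {
      return mask(value);
    } catch (err) {
      // In-flight stages: rethrow so the caller aborts the run.
      if (onFailure === "throw") throw err;
      // Logs stage: never persist the raw string, store a marker instead.
      return FAILURE_MARKER;
    }
  }
  if (Array.isArray(value)) {
    return value.map((item) => redactStringsSketch(item, mask, onFailure));
  }
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(source)) {
      out[key] = redactStringsSketch(source[key], mask, onFailure);
    }
    return out;
  }
  // Numbers, booleans, null, and undefined pass through unchanged.
  return value;
}
```

With `onFailure: "throw"` a masking error propagates and no partially masked value reaches the next block; with `"scrub"` the failing string is replaced by the marker so the log write can still complete without leaking the original text.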

Presidio / batching: New /analyze_batch and /anonymize_batch endpoints; masking paths chunk by shared byte/count budgets and use batched analyze/anonymize instead of per-string concurrency.
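
As an illustration of chunking by a shared byte and count budget, the sketch below greedily packs strings into batches. The function names and the greedy strategy are assumptions; the actual budgets and the way each batch is sent to `/analyze_batch` and `/anonymize_batch` are not shown in this thread.

```typescript
// Approximate UTF-8 size of a string without relying on runtime-specific APIs.
function utf8ByteLength(text: string): number {
  let bytes = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code < 0x80) {
      bytes += 1;
    } else if (code < 0x800) {
      bytes += 2;
    } else if (code >= 0xd800 && code <= 0xdbff) {
      // High surrogate: count the pair as one 4-byte code point.
      bytes += 4;
      i++;
    } else {
      bytes += 3;
    }
  }
  return bytes;
}

// Greedy packing: start a new batch whenever adding the next string would
// exceed either budget. A single oversized string still gets its own batch.
function chunkByBudget(texts: string[], maxBytes: number, maxCount: number): string[][] {
  const chunks: string[][] = [];
  let current: string[] = [];
  let currentBytes = 0;
  for (const text of texts) {
    const size = utf8ByteLength(text);
    const wouldOverflow =
      current.length > 0 &&
      (current.length >= maxCount || currentBytes + size > maxBytes);
    if (wouldOverflow) {
      chunks.push(current);
      current = [];
      currentBytes = 0;
    }
    current.push(text);
    currentBytes += size;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

Each returned batch would then map to one batched request instead of one request per string, which is the trade the summary describes: fewer round trips in exchange for bounded request size.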

Settings & contracts: Data retention UI uses stage tabs, language-filtered entity grids, and confirm-on-remove for overrides; API/schema accept stages with validation (enabled stages must pick at least one entity type).
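
The validation rule mentioned above (an enabled stage must select at least one entity type) can be sketched as a plain function. The input shape and error wording are hypothetical, and the real contract schema presumably lives in the project's schema layer rather than in hand-written checks like this.

```typescript
// Hypothetical request shape for the stages payload.
interface StageInput {
  enabled: boolean;
  entityTypes: string[];
  language: string;
}

interface StagesInput {
  logs?: StageInput;
  input?: StageInput;
  blockOutputs?: StageInput;
}

// Returns one message per invalid stage; an empty array means the payload passes.
function validateStages(stages: StagesInput): string[] {
  const errors: string[] = [];
  const names: Array<keyof StagesInput> = ["logs", "input", "blockOutputs"];
  for (const name of names) {
    const stage = stages[name];
    if (stage && stage.enabled && stage.entityTypes.length === 0) {
      errors.push(name + ": an enabled stage must select at least one entity type");
    }
  }
  return errors;
}
```

A disabled stage with an empty entity list is accepted in this sketch, since it has nothing to redact and cannot abort a run.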

Reviewed by Cursor Bugbot for commit f0c71cc. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread apps/sim/executor/execution/block-executor.ts
@greptile-apps

greptile-apps Bot commented Jun 29, 2026

Contributor

Greptile Summary

This PR adds granular PII redaction stages for workflow execution and logs. The main changes are:

  • Per-stage PII policies for workflow input, block outputs, and logs.
  • Input and block-output masking during workflow execution.
  • Batched Presidio masking paths for higher-volume redaction.
  • Settings UI updates for stage-specific entity and language selection.
  • Log redaction support for offloaded large values.

Confidence Score: 4/5

This is close, but the restored large-value path should be fixed before merging.

  • Normal execution now masks block outputs before compaction.
  • Log persistence now hydrates and masks offloaded large values.
  • Restored execution state can still carry refs to raw offloaded values into resumed blocks.

apps/sim/lib/workflows/executor/execution-core.ts

Security Review

Block-output redaction can still miss restored large-value refs from older paused or run-from-block snapshots, leaving raw PII reachable during resumed execution.

Important Files Changed

| Filename | Overview |
| --- | --- |
| apps/sim/lib/workflows/executor/execution-core.ts | Adds execution-time policy resolution and restore masking, but restored large-value refs can still bypass block-output redaction. |
| apps/sim/executor/execution/block-executor.ts | Masks block outputs before compaction and prevents raw streaming chunks from being forwarded when block-output redaction is enabled. |
| apps/sim/lib/logs/execution/pii-large-values.ts | Adds hydrate, mask, and re-store handling for offloaded values in the log redaction path. |
| apps/sim/lib/billing/retention.ts | Resolves per-stage PII policy with stored rules as the execution-time source of truth. |

Reviews (12): Last reviewed commit: "fix(data-retention): always apply logs p..." | Re-trigger Greptile

Comment thread apps/sim/lib/workflows/executor/execution-core.ts Outdated
Comment thread apps/sim/executor/execution/block-executor.ts
Comment thread apps/sim/lib/workflows/executor/execution-core.ts
@TheodoreSpeaks

Collaborator Author

@greptile review

Comment thread apps/sim/lib/workflows/executor/execution-core.ts Outdated
@TheodoreSpeaks

Collaborator Author

@greptile review

Comment thread apps/sim/executor/execution/block-executor.ts Outdated
Comment thread apps/sim/lib/workflows/executor/execution-core.ts Outdated
@TheodoreSpeaks

Collaborator Author

@greptile review

@TheodoreSpeaks

Collaborator Author

@greptile review

…redaction

# Conflicts:
#	apps/sim/ee/data-retention/components/data-retention-settings.tsx
Comment thread apps/sim/app/api/organizations/[id]/data-retention/route.ts
@TheodoreSpeaks

Collaborator Author

@greptile review

Comment thread apps/sim/lib/guardrails/validate_pii.ts
Comment thread apps/sim/lib/billing/retention.ts
@TheodoreSpeaks

Collaborator Author

@greptile review

@TheodoreSpeaks

Collaborator Author

@greptile review

Comment thread apps/sim/lib/billing/retention.ts Outdated
Comment thread apps/sim/ee/data-retention/components/data-retention-settings.tsx
Comment thread apps/sim/lib/workflows/executor/execution-core.ts
@TheodoreSpeaks

Collaborator Author

@greptile review

Comment on lines +678 to +682
// Limitation: this walks inline strings only — values offloaded to
// large-value storage are still refs here and are not re-masked. In the
// normal flow that is safe (a run with the stage on masks before offload);
// the gap is the narrow case of a run that offloaded a large value while
// the stage was OFF and is resumed after the stage is turned ON.

Contributor

P1 security Large values bypass masking

When block-output redaction is enabled after a workflow already offloaded large block outputs, this restore path only masks inline strings in the snapshot. The offloaded payloads stay behind large-value refs. On resume or run-from-block, downstream blocks can still read the raw restored payload, and log persistence can skip the large-value scrub because block-output redaction is now enabled. This leaves raw PII reachable from prior block outputs after the stage is turned on.

abortSignal: ctx.abortSignal,
// Propagate in-flight block-output redaction into child workflows so
// nested blocks mask outputs too (recurses: each child forwards it).
piiBlockOutputRedaction: ctx.piiBlockOutputRedaction,

Child workflows skip input redaction

Medium Severity

The new workflow-input PII stage runs only in executeWorkflowCore on top-level processedInput. Nested child runs are started with a direct Executor and pass childWorkflowInput unchanged. Only the block-output policy is forwarded on the context, so when the input stage is on and block outputs are off, mapped or explicit child input can execute and produce downstream state without in-flight input masking.
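
One possible shape for closing this gap is to forward the input-stage policy on the child context and mask the child input before the nested executor starts. The sketch below is only an illustration of that idea: `piiInputRedaction` is a hypothetical field (only `piiBlockOutputRedaction` appears in the diff), `prepareChildInput` is an invented helper, the masker is a synchronous stand-in for the Presidio call, and only top-level string values are masked for brevity.

```typescript
interface PiiStageOptions {
  entityTypes: string[];
  language: string;
}

interface ChildContextSketch {
  // Present in the PR: forwarded so nested blocks mask their outputs.
  piiBlockOutputRedaction?: PiiStageOptions;
  // Hypothetical: forwarded so nested runs can mask their input as well.
  piiInputRedaction?: PiiStageOptions;
}

type StageMasker = (text: string, opts: PiiStageOptions) => string;

// Mask the child workflow input when the input stage is on; otherwise
// hand the input through untouched. Shallow on purpose to keep this short.
function prepareChildInput(
  childWorkflowInput: Record<string, unknown>,
  ctx: ChildContextSketch,
  mask: StageMasker
): Record<string, unknown> {
  const opts = ctx.piiInputRedaction;
  if (!opts) return childWorkflowInput;
  const out: Record<string, unknown> = {};
  for (const key of Object.keys(childWorkflowInput)) {
    const value = childWorkflowInput[key];
    out[key] = typeof value === "string" ? mask(value, opts) : value;
  }
  return out;
}
```

Whether the child should re-resolve its own policy or inherit the parent's is a design decision this sketch does not settle; it simply inherits.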

Reviewed by Cursor Bugbot for commit 8f86d77. Configure here.

@TheodoreSpeaks

Collaborator Author

@greptile review

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit 6e9587a. Configure here.

Comment thread apps/sim/lib/logs/execution/logger.ts Outdated
Comment on lines +689 to +692
snapshot.state.blockStates = await redactObjectStrings(
  snapshot.state.blockStates,
  blockOutputOpts
)

Contributor

P1 security Refs stay unmasked

When a paused run or run-from-block snapshot contains a large-value ref that was created before block-output redaction was enabled, this call only masks inline strings. Large-value refs are treated as opaque by redactObjectStrings, so the ref still points at the original offloaded bytes. The later warm-up step can materialize that raw value for downstream blocks, letting them read or send unredacted PII even though the block-output stage is enabled.

@TheodoreSpeaks

Collaborator Author

@greptile review

Comment on lines +689 to +692
snapshot.state.blockStates = await redactObjectStrings(
  snapshot.state.blockStates,
  blockOutputOpts
)

Contributor

P1 security Refs stay raw

This restore path still only masks inline strings. When a paused run or run-from-block snapshot contains a large-value ref created before block-output redaction was enabled, redactObjectStrings leaves the ref untouched. The later warm-up can materialize that original offloaded value for downstream blocks, so the resumed workflow can read raw PII even though block-output redaction is now enabled. This path needs to hydrate, mask, and re-store restored refs before downstream state can use them.
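
The hydrate, mask, and re-store step described here could look roughly like the sketch below. Everything in it is an assumption made for illustration: the `__largeValueRef` ref shape, the `LargeValueStore` interface, and the synchronous masker are invented, and the real large-value helpers in this repository are not shown in the thread.

```typescript
// Hypothetical ref shape and store interface.
interface LargeValueRef {
  __largeValueRef: string;
}

interface LargeValueStore {
  get(key: string): unknown;
  put(value: unknown): string; // returns the key of the newly stored value
}

function isLargeValueRef(value: unknown): value is LargeValueRef {
  return (
    typeof value === "object" &&
    value !== null &&
    typeof (value as { __largeValueRef?: unknown }).__largeValueRef === "string"
  );
}

// Walk restored block state. Inline strings are masked directly; refs are
// hydrated, masked, and written back under a new key so the restored state
// no longer points at the raw bytes. Any error from `mask` or the store
// propagates, matching the fail-fast behavior of the in-flight stages.
function maskRestoredState(
  value: unknown,
  store: LargeValueStore,
  mask: (text: string) => string
): unknown {
  if (typeof value === "string") return mask(value);
  if (isLargeValueRef(value)) {
    const hydrated = store.get(value.__largeValueRef);
    const masked = maskRestoredState(hydrated, store, mask);
    return { __largeValueRef: store.put(masked) };
  }
  if (Array.isArray(value)) {
    return value.map((item) => maskRestoredState(item, store, mask));
  }
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(source)) {
      out[key] = maskRestoredState(source[key], store, mask);
    }
    return out;
  }
  return value;
}
```

The sketch leaves the original offloaded value in place and only repoints the restored state at a masked copy; whether the raw blob should also be deleted or expired is a separate retention question.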

1 participant