Adding a quick follow-up to my previous question! I just wanted to double-check that my understanding of the exact data extraction pipeline is accurate:

For the Math data: extract URLs from known high-quality math datasets, pull their original HTML from Common Crawl snapshots, and then process them strictly through the Lynx + LLM pipeline?

For the Code data (Nemotron-CC-Code-v1): first use a "fast pattern matching code classifier" to filter Common Crawl pages, and then run the surviving pages through that same Lynx + LLM pipeline to preserve syntax and indentation?

If my understanding of the Code pipeline is correct, I am very curious about the initial "fast pattern matching code classifier". Could you share a bit more about the specific heuristics, regex rules, or keywords it relies on? Alternatively, are these rules already integrated into the NeMo-Curator repository?

Thanks again!
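To clarify what I mean by "heuristics, regex rules, or keywords", here is a minimal sketch of the kind of fast pattern-matching classifier I am imagining. Every pattern, name, and threshold below is my own guess for illustration, not the actual Nemotron-CC-Code rules:

```python
import re

# Hypothetical code-like patterns (my assumptions, not the real filter rules).
CODE_PATTERNS = [
    re.compile(r"\bdef \w+\s*\("),                   # Python function definition
    re.compile(r"\bfunction\s+\w+\s*\("),            # JavaScript function
    re.compile(r"#include\s*<\w+"),                  # C/C++ include directive
    re.compile(r"\bimport\s+[\w.]+"),                # Python/Java import
    re.compile(r"[{};]\s*$", re.MULTILINE),          # statement/block terminators
    re.compile(r"</?(pre|code)\b", re.IGNORECASE),   # HTML code markup
]

def looks_like_code(page_text: str, min_hits: int = 3) -> bool:
    """Cheap first-pass filter: count how many distinct code-like
    patterns appear at least once, and keep the page if enough do."""
    hits = sum(1 for p in CODE_PATTERNS if p.search(page_text))
    return hits >= min_hits
```

Is the real classifier roughly in this spirit (a fixed set of patterns plus a hit threshold), or does it use something more involved, like token-level statistics or a trained model?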
Hi everyone,
I’ve been digging into the NeMo Curator documentation and the Nemotron-CC dataset papers. I’m trying to understand the "under the hood" logic for how they actually split raw Common Crawl data into specialized Math and Code streams for processing.
Could anyone share some insight into the specific mechanisms used for this initial separation? I'm particularly curious about the classifiers or heuristics deployed at the very beginning to route this domain-specific data accurately, especially before it reaches specialized stages like Lynx or Phi-4 cleaning.
Any details on the initial split logic would be incredibly helpful. Thanks in advance!
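To make sure I'm asking the right question, here is the kind of initial routing I have in mind: a simple host-based allowlist check before any heavy processing. The allowlist entries and the helper function are purely my own assumptions for illustration, not anything from the papers or NeMo Curator:

```python
from urllib.parse import urlparse

# Hypothetical allowlist of math-heavy hosts (my guesses, for illustration only).
MATH_URL_ALLOWLIST = {
    "artofproblemsolving.com",
    "math.stackexchange.com",
    "mathoverflow.net",
}

def route_record(url: str) -> str:
    """Route a Common Crawl record to a stream based on its host."""
    host = urlparse(url).netloc.lower()
    # Strip a leading "www." so www.mathoverflow.net matches too.
    host = host[4:] if host.startswith("www.") else host
    if host in MATH_URL_ALLOWLIST:
        return "math"
    return "general"
```

Is the actual split closer to this URL-level routing, or is it done on page content after extraction?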