Adding a quick follow-up to my previous question! I just wanted to double-check that my understanding of the exact data extraction pipeline is accurate:

For the Math data: extract URLs from known high-quality math datasets, pull their original HTML from Common Crawl snapshots, and then process them strictly through the Lynx + LLM pipeline?

For the Code data (Nemotron-CC-Code-v1): first use a "fast pattern matching code classifier" to filter Common Crawl pages, and then run the surviving pages through that same Lynx + LLM pipeline to preserve syntax and indentation?

If my understanding of the Code pipeline is correct, I am very curious about the initial "fast pattern matching code classifier". Could you share a bit more about the specific heuristics, regex rules, or keywords it relies on? Alternatively, are these rules already integrated into the NeMo-Curator repository?

Thanks again!
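To clarify what I mean by "heuristics, regex rules, or keywords", here is a minimal sketch of the kind of fast pattern-matching classifier I am imagining. Every pattern, name, and threshold below is my own guess for illustration, not the actual Nemotron-CC-Code rules:

```python
import re

# Hypothetical code-like patterns (my assumptions, not the real filter rules).
CODE_PATTERNS = [
    re.compile(r"\bdef \w+\s*\("),                   # Python function definition
    re.compile(r"\bfunction\s+\w+\s*\("),            # JavaScript function
    re.compile(r"#include\s*<\w+"),                  # C/C++ include directive
    re.compile(r"\bimport\s+[\w.]+"),                # Python/Java import
    re.compile(r"[{};]\s*$", re.MULTILINE),          # statement/block terminators
    re.compile(r"</?(pre|code)\b", re.IGNORECASE),   # HTML code markup
]

def looks_like_code(page_text: str, min_hits: int = 3) -> bool:
    """Cheap first-pass filter: count how many distinct code-like
    patterns appear at least once, and keep the page if enough do."""
    hits = sum(1 for p in CODE_PATTERNS if p.search(page_text))
    return hits >= min_hits
```

Is the real classifier roughly in this spirit (a fixed set of patterns plus a hit threshold), or does it use something more involved, like token-level statistics or a trained model?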
Hi everyone,
I’ve been digging into the NeMo Curator documentation and the Nemotron-CC dataset papers. I’m trying to understand the "under the hood" logic for how they actually split raw Common Crawl data into specialized Math and Code streams for processing.
Could anyone share some insight into the specific mechanisms used for this initial separation? I'm particularly curious about the classifiers or heuristics deployed at the very beginning to route this domain-specific data accurately, especially before it reaches specialized stages like Lynx or Phi-4 cleaning.
Any details on the initial split logic would be incredibly helpful. Thanks in advance!
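To make sure I'm asking the right question, here is the kind of initial routing I have in mind: a simple host-based allowlist check before any heavy processing. The allowlist entries and the helper function are purely my own assumptions for illustration, not anything from the papers or NeMo Curator:

```python
from urllib.parse import urlparse

# Hypothetical allowlist of math-heavy hosts (my guesses, for illustration only).
MATH_URL_ALLOWLIST = {
    "artofproblemsolving.com",
    "math.stackexchange.com",
    "mathoverflow.net",
}

def route_record(url: str) -> str:
    """Route a Common Crawl record to a stream based on its host."""
    host = urlparse(url).netloc.lower()
    # Strip a leading "www." so www.mathoverflow.net matches too.
    host = host[4:] if host.startswith("www.") else host
    if host in MATH_URL_ALLOWLIST:
        return "math"
    return "general"
```

Is the actual split closer to this URL-level routing, or is it done on page content after extraction?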