@@ -25,6 +25,8 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 - Custom dictionary: `php src/cmd/demo_user_dict.php`
 - Tokenization with positions: `php src/cmd/demo_tokenize.php`
 - **Custom POS tagging**: `php src/cmd/demo_custom_pos_tag.php`
+- **TF-IDF and POS integration**: `php src/cmd/demo_tf_idf_pos.php`
+- **Mixed CJK language processing**: `php src/cmd/demo_mixed_cjk.php`
 
 ### Memory Requirements
 All operations require significant memory allocation: `ini_set('memory_limit', '1024M');`
@@ -38,12 +40,20 @@ This is a PHP port of the Python jieba Chinese text segmentation library. The co
   - Supports custom word addition with `addWord($word, $freq, $tag)`
   - Enhanced input validation and security measures
   - Memory management improvements
+  - **NEW**: Support for `with_pos` and `with_scores` options in `cut()` method
 - **Finalseg**: HMM-based final segmentation for unknown words using Viterbi algorithm
 - **JiebaAnalyse**: TF-IDF keyword extraction functionality
+  - **NEW**: Modular TF calculation with `calculateTF($words)`
+  - **NEW**: Flexible TF-IDF calculation with `calculateTFIDF($tf_values, $detailed)`
 - **Posseg**: Part-of-speech tagging with HMM model
   - **Custom POS tag support**: Add custom tags with `addWordTag($word, $tag)`
   - **Input validation**: Secure tag validation with length limits and character restrictions
   - **Memory cleanup**: `removeWordTag($word)` for tag cleanup
+  - **NEW**: Support for `with_scores` option in `cut()` method
+- **JiebaMemory**: NEW unified memory management utility
+  - **Memory management**: `destroyAll()`, `initAll()`, `clearAllCaches()`
+  - **Statistics**: `getMemoryStats()`, `getAllCacheStats()`, `getInitializationStatus()`
+  - **Convenience**: `isAllInitialized()` for checking all classes
 
 ### Dictionary System (src/dict/)
 - **dict.txt**: Default dictionary with word frequencies
@@ -73,6 +83,9 @@ Jieba::init($options); // Load dictionary and build trie
 Finalseg::init(); // Load HMM models
 JiebaAnalyse::init(); // Load IDF data
 Posseg::init(); // Load POS models
+
+// NEW: Convenient initialization of all classes
+JiebaMemory::initAll($options); // Initialize all classes at once
 ```
 
 ### Dictionary Modes
@@ -89,7 +102,43 @@ Posseg::init(); // Load POS models
 ### Multi-language Support
 - Primary: Simplified/Traditional Chinese
 - Secondary: Japanese, Korean (with `'cjk'=>'all'`)
+- **ENHANCED**: Improved mixed-language text processing
+- **NEW**: Better handling of complex mixed CJK scenarios
 - Custom dictionaries can extend language support
+- **NEW Demo**: `demo_mixed_cjk.php` for testing multi-language capabilities
+
+## Enhanced TF-IDF and POS Integration Features
+
+### NEW: Integrated TF-IDF Scoring
+```php
+// Jieba::cut() with POS tags
+$pos_result = Jieba::cut($text, false, array('with_pos' => true));
+
+// Jieba::cut() with TF-IDF scores
+$scored_result = Jieba::cut($text, false, array('with_scores' => true));
+
+// Jieba::cut() with both POS tags and TF-IDF scores
+$full_result = Jieba::cut($text, false, array(
+    'with_pos' => true,
+    'with_scores' => true
+));
+
+// Posseg::cut() with TF-IDF scores
+$posseg_scored = Posseg::cut($text, array('with_scores' => true));
+```
+
+### NEW: Modular TF-IDF Calculation
+```php
+// Calculate Term Frequency
+$words = array('測試', '中文', '分詞', '測試');
+$tf_values = JiebaAnalyse::calculateTF($words);
+
+// Calculate TF-IDF (simple format)
+$tfidf_simple = JiebaAnalyse::calculateTFIDF($tf_values, false);
+
+// Calculate TF-IDF (detailed format with TF, IDF, TF-IDF)
+$tfidf_detailed = JiebaAnalyse::calculateTFIDF($tf_values, true);
+```
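The two-step split shown above (term frequency first, then TF-IDF) can be illustrated with a minimal, self-contained sketch. It is not the library's implementation: the helper names `sketch_term_frequency()` and `sketch_tf_idf()`, the count-over-total normalization, the IDF table, and the default IDF value are all assumptions made for illustration, and the real `calculateTF()` and `calculateTFIDF()` may normalize or shape their results differently. The sketch assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: relative term frequency (occurrences / total tokens).
// The normalization is an assumption, not necessarily what calculateTF() does.
function sketch_term_frequency(array $words): array
{
    $total = count($words);
    if ($total === 0) {
        return [];
    }
    $tf = [];
    foreach (array_count_values($words) as $word => $count) {
        $tf[$word] = $count / $total;
    }
    return $tf;
}

// Hypothetical helper: weight each TF value by an IDF value.
// $idf is a made-up lookup table; $default_idf is used for words missing from it.
function sketch_tf_idf(array $tf, array $idf, float $default_idf): array
{
    $scores = [];
    foreach ($tf as $word => $value) {
        $scores[$word] = $value * ($idf[$word] ?? $default_idf);
    }
    arsort($scores); // sort by score, highest first, keeping the word keys
    return $scores;
}

$words = ['測試', '中文', '分詞', '測試'];
$tf = sketch_term_frequency($words);

// Illustrative IDF weights only; real values would come from an IDF dictionary.
$scores = sketch_tf_idf($tf, ['測試' => 2.0, '中文' => 6.0], 3.0);
print_r($scores);
```

Keeping the two steps separate means a custom workflow can reuse the same TF values with a different weighting table.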
 
 ## Custom POS Tagging Features
 
@@ -144,12 +193,55 @@ Posseg::removeWordTag('詞彙');
 - **Security**: Input validation and injection prevention tests
 - **User Dictionaries**: Dictionary loading and processing tests
 - **Memory Management**: Memory cleanup and leak prevention tests
+- **NEW: TF-IDF Integration**: Enhanced TF-IDF and POS tagging features (`TfIdfPosTest.php`)
+- **NEW: Mixed CJK Support**: Multi-language text processing tests (`MixedCJKTest.php`)
 
 ### Test Coverage
-- 58+ tests with 259+ assertions
+- 70+ tests with 300+ assertions
 - PSR2 coding standard compliance
 - Edge case coverage for mixed character types
 - Security vulnerability testing
+- **NEW**: Comprehensive TF-IDF integration testing
+- **NEW**: Multi-language CJK text processing validation
+- **NEW**: Backward compatibility verification
+
+## Memory Management with JiebaMemory
+
+### NEW: Unified Memory Management
+```php
+use Fukuball\Jieba\JiebaMemory;
+
+// Initialize all classes at once
+JiebaMemory::initAll($options);
+
+// Check which classes are initialized
+$status = JiebaMemory::getInitializationStatus();
+if (!JiebaMemory::isAllInitialized()) {
+    // Handle partial initialization
+}
+
+// Get comprehensive memory statistics
+$stats = JiebaMemory::getMemoryStats();
+echo "Current Memory: " . $stats['current_memory_usage_formatted'] . "\n";
+echo "Peak Memory: " . $stats['peak_memory_usage_formatted'] . "\n";
+
+// Clear all caches while keeping classes initialized
+JiebaMemory::clearAllCaches();
+
+// Destroy all classes and free memory
+JiebaMemory::destroyAll();
+```
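The snippet above reads pre-formatted values such as `current_memory_usage_formatted`, but the diff does not show how `JiebaMemory` produces them. As a rough sketch only, similar figures can be obtained from PHP's built-in `memory_get_usage()` and `memory_get_peak_usage()`. The formatter `sketch_format_bytes()`, the array keys, and the two-decimal output format are assumptions for illustration and may not match what `getMemoryStats()` returns.

```php
<?php
// Hypothetical formatter: converts a byte count to a human-readable string.
function sketch_format_bytes(int $bytes): string
{
    $units = ['B', 'KB', 'MB', 'GB'];
    $value = (float) $bytes;
    $i = 0;
    // Divide by 1024 until the value fits the current unit or units run out.
    while ($value >= 1024 && $i < count($units) - 1) {
        $value /= 1024;
        $i++;
    }
    return sprintf('%.2f %s', $value, $units[$i]);
}

// memory_get_usage() and memory_get_peak_usage() are PHP built-ins;
// passing true reports memory allocated from the system rather than memory in use.
$sketch_stats = [
    'current' => sketch_format_bytes(memory_get_usage(true)),
    'peak'    => sketch_format_bytes(memory_get_peak_usage(true)),
];

echo "Current Memory: " . $sketch_stats['current'] . "\n";
echo "Peak Memory: " . $sketch_stats['peak'] . "\n";
```

Taking a reading before and after a cache-clearing call is one way to check whether the cleanup actually released memory.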
+
+### NEW: Cache Statistics Monitoring
+```php
+// Get detailed cache statistics for all classes
+$cacheStats = JiebaMemory::getAllCacheStats();
+
+// Monitor individual class cache usage
+echo "Jieba DAG Cache: " . $cacheStats['jieba']['dag_cache_size'] . "\n";
+echo "Posseg Word Tags: " . $cacheStats['posseg']['word_tag_size'] . "\n";
+echo "JiebaAnalyse IDF: " . $cacheStats['jieba_analyse']['idf_freq_size'] . "\n";
+```
 
 ## Best Practices & Guidelines
 
@@ -159,6 +251,8 @@ Posseg::removeWordTag('詞彙');
 Jieba::init();
 Finalseg::init();
 Posseg::init();
+// OR use convenient initialization
+JiebaMemory::initAll();
 
 // Add words with proper error handling
 try {
@@ -168,6 +262,22 @@ try {
 }
 ```
 
+### NEW: Enhanced Feature Usage
+```php
+// Use integrated TF-IDF and POS features
+$result = Jieba::cut($text, false, array(
+    'with_pos' => true,
+    'with_scores' => true
+));
+
+// Automatic JiebaAnalyse initialization when needed
+// No need to manually call JiebaAnalyse::init() for scoring features
+
+// Use modular TF-IDF calculation for custom workflows
+$tf_values = JiebaAnalyse::calculateTF($words);
+$tfidf_scores = JiebaAnalyse::calculateTFIDF($tf_values, true);
+```
+
 ### Security Considerations
 - Always validate user input before adding custom tags
 - Use safe characters only: alphanumeric, underscore, hyphen, Chinese characters
@@ -178,6 +288,8 @@ try {
 - Load user dictionaries during initialization, not runtime
 - Use appropriate dictionary modes ('small' for memory-constrained environments)
 - Clear unused tags with `removeWordTag()` to prevent memory leaks
+- **NEW**: Use `JiebaMemory::clearAllCaches()` for comprehensive cache management
+- **NEW**: Monitor memory with `JiebaMemory::getMemoryStats()` and `getAllCacheStats()`
 - Cache initialization results when possible
 
 ### Error Handling Patterns