
Commit aee1e1d

fukuball and claude committed
Update documentation with new features and enhancements
- Add JiebaMemory class for unified memory management
- Document enhanced TF-IDF and POS tagging integration features
- Update multi-language CJK text processing capabilities
- Add new demo scripts: demo_tf_idf_pos.php and demo_mixed_cjk.php
- Include comprehensive test coverage updates
- Update feature lists and usage examples in both Chinese and English
- Add memory management best practices and API documentation

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
1 parent 660062b commit aee1e1d


2 files changed: +238 -26 lines changed


CLAUDE.md

Lines changed: 113 additions & 1 deletion
@@ -25,6 +25,8 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 - Custom dictionary: `php src/cmd/demo_user_dict.php`
 - Tokenization with positions: `php src/cmd/demo_tokenize.php`
 - **Custom POS tagging**: `php src/cmd/demo_custom_pos_tag.php`
+- **TF-IDF and POS integration**: `php src/cmd/demo_tf_idf_pos.php`
+- **Mixed CJK language processing**: `php src/cmd/demo_mixed_cjk.php`
 
 ### Memory Requirements
 All operations require significant memory allocation: `ini_set('memory_limit', '1024M');`
@@ -38,12 +40,20 @@ This is a PHP port of the Python jieba Chinese text segmentation library. The co
 - Supports custom word addition with `addWord($word, $freq, $tag)`
 - Enhanced input validation and security measures
 - Memory management improvements
+- **NEW**: Support for `with_pos` and `with_scores` options in `cut()` method
 - **Finalseg**: HMM-based final segmentation for unknown words using Viterbi algorithm
 - **JiebaAnalyse**: TF-IDF keyword extraction functionality
+- **NEW**: Modular TF calculation with `calculateTF($words)`
+- **NEW**: Flexible TF-IDF calculation with `calculateTFIDF($tf_values, $detailed)`
 - **Posseg**: Part-of-speech tagging with HMM model
 - **Custom POS tag support**: Add custom tags with `addWordTag($word, $tag)`
 - **Input validation**: Secure tag validation with length limits and character restrictions
 - **Memory cleanup**: `removeWordTag($word)` for tag cleanup
+- **NEW**: Support for `with_scores` option in `cut()` method
+- **JiebaMemory**: NEW unified memory management utility
+- **Memory management**: `destroyAll()`, `initAll()`, `clearAllCaches()`
+- **Statistics**: `getMemoryStats()`, `getAllCacheStats()`, `getInitializationStatus()`
+- **Convenience**: `isAllInitialized()` for checking all classes
 
 ### Dictionary System (src/dict/)
 - **dict.txt**: Default dictionary with word frequencies
@@ -73,6 +83,9 @@ Jieba::init($options); // Load dictionary and build trie
 Finalseg::init(); // Load HMM models
 JiebaAnalyse::init(); // Load IDF data
 Posseg::init(); // Load POS models
+
+// NEW: Convenient initialization of all classes
+JiebaMemory::initAll($options); // Initialize all classes at once
 ```
 
 ### Dictionary Modes
@@ -89,7 +102,43 @@ Posseg::init(); // Load POS models
 ### Multi-language Support
 - Primary: Simplified/Traditional Chinese
 - Secondary: Japanese, Korean (with `'cjk'=>'all'`)
+- **ENHANCED**: Improved mixed-language text processing
+- **NEW**: Better handling of complex mixed CJK scenarios
 - Custom dictionaries can extend language support
+- **NEW Demo**: `demo_mixed_cjk.php` for testing multi-language capabilities
+
+## Enhanced TF-IDF and POS Integration Features
+
+### NEW: Integrated TF-IDF Scoring
+```php
+// Jieba::cut() with POS tags
+$pos_result = Jieba::cut($text, false, array('with_pos' => true));
+
+// Jieba::cut() with TF-IDF scores
+$scored_result = Jieba::cut($text, false, array('with_scores' => true));
+
+// Jieba::cut() with both POS tags and TF-IDF scores
+$full_result = Jieba::cut($text, false, array(
+'with_pos' => true,
+'with_scores' => true
+));
+
+// Posseg::cut() with TF-IDF scores
+$posseg_scored = Posseg::cut($text, array('with_scores' => true));
+```
+
+### NEW: Modular TF-IDF Calculation
+```php
+// Calculate Term Frequency
+$words = array('測試', '中文', '分詞', '測試');
+$tf_values = JiebaAnalyse::calculateTF($words);
+
+// Calculate TF-IDF (simple format)
+$tfidf_simple = JiebaAnalyse::calculateTFIDF($tf_values, false);
+
+// Calculate TF-IDF (detailed format with TF, IDF, TF-IDF)
+$tfidf_detailed = JiebaAnalyse::calculateTFIDF($tf_values, true);
+```
 
 ## Custom POS Tagging Features
 
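Reviewer sketch for the modular calculation documented in the hunk above: term frequency is taken here as a word's count divided by the total number of words, and each TF value is then multiplied by an IDF weight. The function names `sketch_calculate_tf` and `sketch_calculate_tfidf`, the toy IDF numbers, the fallback IDF, and the `tf`/`idf`/`tfidf` keys of the detailed format are all assumptions made for illustration. The diff does not show how `JiebaAnalyse::calculateTF()` or `calculateTFIDF()` are implemented or what shape they return.

```php
<?php
// Hypothetical stand-ins for illustration only; not the jieba-php implementation.

/** Relative term frequency: occurrences of each word divided by the word count. */
function sketch_calculate_tf(array $words): array
{
    $total = count($words);
    if ($total === 0) {
        return [];
    }
    $tf = [];
    foreach (array_count_values($words) as $word => $count) {
        $tf[$word] = $count / $total;
    }
    return $tf;
}

/**
 * Multiply each TF value by an IDF weight, falling back to $defaultIdf for
 * words missing from the table. The detailed format is an assumed shape.
 */
function sketch_calculate_tfidf(array $tf, array $idf, float $defaultIdf, bool $detailed = false): array
{
    $result = [];
    foreach ($tf as $word => $tfValue) {
        $idfValue = $idf[$word] ?? $defaultIdf;
        $score = $tfValue * $idfValue;
        $result[$word] = $detailed
            ? ['tf' => $tfValue, 'idf' => $idfValue, 'tfidf' => $score]
            : $score;
    }
    return $result;
}

$words = ['測試', '中文', '分詞', '測試'];
$toyIdf = ['測試' => 2.0, '中文' => 4.0]; // made-up weights, not from the library's IDF data

$tf = sketch_calculate_tf($words);
$simple = sketch_calculate_tfidf($tf, $toyIdf, 8.0);
$detailed = sketch_calculate_tfidf($tf, $toyIdf, 8.0, true);
```

With these toy weights, a rare word that is absent from the IDF table (`分詞`) ends up scoring higher than the repeated word, which is the usual motivation for weighting TF by IDF.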
@@ -144,12 +193,55 @@ Posseg::removeWordTag('詞彙');
 - **Security**: Input validation and injection prevention tests
 - **User Dictionaries**: Dictionary loading and processing tests
 - **Memory Management**: Memory cleanup and leak prevention tests
+- **NEW: TF-IDF Integration**: Enhanced TF-IDF and POS tagging features (`TfIdfPosTest.php`)
+- **NEW: Mixed CJK Support**: Multi-language text processing tests (`MixedCJKTest.php`)
 
 ### Test Coverage
-- 58+ tests with 259+ assertions
+- 70+ tests with 300+ assertions
 - PSR2 coding standard compliance
 - Edge case coverage for mixed character types
 - Security vulnerability testing
+- **NEW**: Comprehensive TF-IDF integration testing
+- **NEW**: Multi-language CJK text processing validation
+- **NEW**: Backward compatibility verification
+
+## Memory Management with JiebaMemory
+
+### NEW: Unified Memory Management
+```php
+use Fukuball\Jieba\JiebaMemory;
+
+// Initialize all classes at once
+JiebaMemory::initAll($options);
+
+// Check which classes are initialized
+$status = JiebaMemory::getInitializationStatus();
+if (!JiebaMemory::isAllInitialized()) {
+// Handle partial initialization
+}
+
+// Get comprehensive memory statistics
+$stats = JiebaMemory::getMemoryStats();
+echo "Current Memory: " . $stats['current_memory_usage_formatted'] . "\n";
+echo "Peak Memory: " . $stats['peak_memory_usage_formatted'] . "\n";
+
+// Clear all caches while keeping classes initialized
+JiebaMemory::clearAllCaches();
+
+// Destroy all classes and free memory
+JiebaMemory::destroyAll();
+```
+
+### NEW: Cache Statistics Monitoring
+```php
+// Get detailed cache statistics for all classes
+$cacheStats = JiebaMemory::getAllCacheStats();
+
+// Monitor individual class cache usage
+echo "Jieba DAG Cache: " . $cacheStats['jieba']['dag_cache_size'] . "\n";
+echo "Posseg Word Tags: " . $cacheStats['posseg']['word_tag_size'] . "\n";
+echo "JiebaAnalyse IDF: " . $cacheStats['jieba_analyse']['idf_freq_size'] . "\n";
+```
 
 ## Best Practices & Guidelines
 
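Reviewer sketch: the hunk above reads `current_memory_usage_formatted` and `peak_memory_usage_formatted` from `getMemoryStats()` but does not show how those entries are produced. One plausible way to build such a report uses PHP's built-in `memory_get_usage()` and `memory_get_peak_usage()`. The helper names `sketch_format_bytes` and `sketch_memory_stats`, the raw-byte keys, and the unit formatting are assumptions, not the `JiebaMemory` implementation.

```php
<?php
// Hypothetical helpers for illustration only; not part of jieba-php.

/** Render a byte count with a binary unit suffix, e.g. 1536 becomes "1.50 KB". */
function sketch_format_bytes(int $bytes): string
{
    $units = ['B', 'KB', 'MB', 'GB'];
    $value = (float) $bytes;
    $index = 0;
    while ($value >= 1024 && $index < count($units) - 1) {
        $value /= 1024;
        $index++;
    }
    // %F is the locale-independent float format.
    return sprintf('%.2F %s', $value, $units[$index]);
}

/** Collect current and peak memory usage, raw and formatted. */
function sketch_memory_stats(): array
{
    $current = memory_get_usage(true);   // true: memory allocated from the system
    $peak = memory_get_peak_usage(true);
    return [
        'current_memory_usage' => $current,                       // assumed key
        'current_memory_usage_formatted' => sketch_format_bytes($current),
        'peak_memory_usage' => $peak,                             // assumed key
        'peak_memory_usage_formatted' => sketch_format_bytes($peak),
    ];
}

$stats = sketch_memory_stats();
echo "Current Memory: " . $stats['current_memory_usage_formatted'] . "\n";
echo "Peak Memory: " . $stats['peak_memory_usage_formatted'] . "\n";
```

Sampling these numbers before and after a cache-clearing call is a simple way to check whether the call actually released memory.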
@@ -159,6 +251,8 @@ Posseg::removeWordTag('詞彙');
 Jieba::init();
 Finalseg::init();
 Posseg::init();
+// OR use convenient initialization
+JiebaMemory::initAll();
 
 // Add words with proper error handling
 try {
@@ -168,6 +262,22 @@ try {
 }
 ```
 
+### NEW: Enhanced Feature Usage
+```php
+// Use integrated TF-IDF and POS features
+$result = Jieba::cut($text, false, array(
+'with_pos' => true,
+'with_scores' => true
+));
+
+// Automatic JiebaAnalyse initialization when needed
+// No need to manually call JiebaAnalyse::init() for scoring features
+
+// Use modular TF-IDF calculation for custom workflows
+$tf_values = JiebaAnalyse::calculateTF($words);
+$tfidf_scores = JiebaAnalyse::calculateTFIDF($tf_values, true);
+```
+
 ### Security Considerations
 - Always validate user input before adding custom tags
 - Use safe characters only: alphanumeric, underscore, hyphen, Chinese characters
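Reviewer sketch of the character rule in the security list above: a caller-side check that accepts only ASCII letters and digits, underscore, hyphen, and Han characters before a tag is passed on. The function name `sketch_is_valid_tag` and the 20-character cap are assumptions; the document mentions length limits but this diff does not state the actual value or show the library's own validation.

```php
<?php
// Hypothetical pre-check for illustration only; not the library's own validation.

const SKETCH_MAX_TAG_LENGTH = 20; // assumed limit, not taken from the source

/** Return true when $tag uses only the allowed characters and fits the length cap. */
function sketch_is_valid_tag(string $tag): bool
{
    // \A and \z anchor the whole string, so a trailing newline is rejected.
    // The /u flag makes the quantifier count Unicode code points, not bytes.
    $pattern = sprintf('/\A[\p{Han}A-Za-z0-9_-]{1,%d}\z/u', SKETCH_MAX_TAG_LENGTH);

    // preg_match() returns false on malformed UTF-8, which also counts as invalid.
    return preg_match($pattern, $tag) === 1;
}

foreach (['n', 'custom_tag-1', '名詞', 'bad tag', "n\n"] as $candidate) {
    echo json_encode($candidate) . ' => ' . (sketch_is_valid_tag($candidate) ? 'ok' : 'rejected') . "\n";
}
```

Using an allow-list pattern rather than stripping "bad" characters keeps the rule easy to audit: anything not explicitly permitted is refused.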
@@ -178,6 +288,8 @@ try {
 - Load user dictionaries during initialization, not runtime
 - Use appropriate dictionary modes ('small' for memory-constrained environments)
 - Clear unused tags with `removeWordTag()` to prevent memory leaks
+- **NEW**: Use `JiebaMemory::clearAllCaches()` for comprehensive cache management
+- **NEW**: Monitor memory with `JiebaMemory::getMemoryStats()` and `getAllCacheStats()`
 - Cache initialization results when possible
 
 ### Error Handling Patterns
