-
Notifications
You must be signed in to change notification settings - Fork 112
Fix DOB/age parsing for player profiles #113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
WalkthroughReplaced strict DOB regex with a flexible dd[./-]dd[./-]yyyy pattern (optional parenthesized 1–2 digit age) and added DOB parsing utilities plus two optional DOB fields to player profile schemas; profile parsing now stores raw DOB and ISO-normalized DOB using the new utilities. Changes
Sequence Diagram(s)sequenceDiagram
participant S as Scraper / Source
participant P as profile parser
participant U as clean_dob util
participant D as PlayerProfile
S->>P: provide raw DOB string (e.g. "12/05/1990 (34)")
P->>P: trim & remove trailing age marker (if present)
P->>U: clean_dob(raw_trimmed)
U-->>P: ISO date string or null
P->>D: set dateOfBirthRaw = original trimmed (or None)
P->>D: set dateOfBirth = ISO (or None)
Note over P,D: TransfermarktPlayerProfile used in overall mapping
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (5)
app/schemas/players/profile.py (1)
63-64: Type normalized DOB as a date; keep raw as string.For consistency with other fields (e.g., joined/contract_expires), prefer the normalized value to be a date. Keep Raw as string.
If API clients expect a string, keep as-is; otherwise:
- dateOfBirth: Optional[str] = None + dateOfBirth: Optional[date] = None dateOfBirthRaw: Optional[str] = NoneAdditionally, consider deprecating
date_of_birthor wiring an alias to avoid duplicate fields diverging. If aliases are used in TransfermarktBaseModel, ensure both serialize consistently.app/utils/regex.py (1)
6-6: Tighten DOB regex with boundaries and valid day/month ranges.Reduce false positives and keep permissive matching without anchors.
-REGEX_DOB_AGE: str = r"(?P<dob>\d{1,2}[./-]\d{1,2}[./-]\d{4})(?:\s*\((?P<age>\d{1,2})\))?" +REGEX_DOB_AGE: str = ( + r"\b(?P<dob>(0?[1-9]|[12][0-9]|3[01])[./-](0?[1-9]|1[0-2])[./-](19|20)\d{2})\b" + r"(?:\s*\((?P<age>\d{1,2})\))?" +)app/services/players/profile.py (3)
5-7: Avoid regex drift: compile fallback from the shared constant.Using a local hardcoded pattern can diverge from app.utils.regex.REGEX_DOB_AGE over time.
If import order allows, prefer:
# after importing REGEX_DOB_AGE _DOB_FALLBACK = re.compile(REGEX_DOB_AGE)If needed, move the REGEX_DOB_AGE import above and initialize _DOB_FALLBACK there.
8-22: Type hint portability + narrower exception.
- Use Optional[str] for broader Python compatibility if <3.10 is supported.
- Catch ValueError instead of bare Exception.
-def clean_dob(raw_text: str) -> str | None: +from typing import Optional +def clean_dob(raw_text: Optional[str]) -> Optional[str]: @@ - except Exception: + except ValueError: return NoneIf you later switch schema dateOfBirth to a date type, have clean_dob return a date object and let Pydantic serialize it.
90-101: Reuse cached text; avoid second XPath; handle age in one place.You already have raw_dob_text. Use it for age extraction and drop the extra DOM read later.
-raw_dob_text = self.get_text_by_xpath(Players.Profile.DATE_OF_BIRTH_AGE) +raw_dob_text = self.get_text_by_xpath(Players.Profile.DATE_OF_BIRTH_AGE) @@ -# Store both versions -self.response["dateOfBirthRaw"] = raw_dob_only # "24/06/1987" -self.response["dateOfBirth"] = clean_dob(raw_dob_only) # "1987-06-24" +# Store both versions +self.response["dateOfBirthRaw"] = raw_dob_only # "24/06/1987" +self.response["dateOfBirth"] = clean_dob(raw_dob_only) # "1987-06-24" +# Extract age from the same text to avoid a second XPath query +self.response["age"] = safe_regex(raw_dob_text, REGEX_DOB_AGE, "age")Then remove the age block at Lines 106–110.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
app/schemas/players/profile.py(1 hunks)app/services/players/profile.py(2 hunks)app/utils/regex.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
app/services/players/profile.py (2)
app/services/base.py (1)
get_text_by_xpath(152-206)app/utils/xpath.py (5)
Players(1-102)Players(148-177)Profile(18-57)Profile(106-138)Profile(181-183)
🔇 Additional comments (1)
app/utils/regex.py (1)
6-6: safe_regex correctly uses re.search
The implementation in app/utils/utils.py (line 81) calls re.search, so the unanchored DOB/age pattern will match anywhere in the string as intended.
| def clean_dob(raw_text: str) -> str | None: | ||
| """ | ||
| Find a DD[./-]MM[./-]YYYY in the input and return ISO YYYY-MM-DD. | ||
| Returns None if not found or parsing fails. | ||
| """ | ||
| if not raw_text: | ||
| return None | ||
| m = _DOB_FALLBACK.search(raw_text) | ||
| if not m: | ||
| return None | ||
| dob = m.group("dob").replace("-", ".").replace("/", ".") # unify to dots | ||
| try: | ||
| return datetime.strptime(dob, "%d.%m.%Y").strftime("%Y-%m-%d") | ||
| except Exception: | ||
| return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
Quick sanity checks for DOB parsing.
Add a tiny table‑driven test to guard formats you intend to support.
Also confirm the repo’s Python requirement to validate str | None usage vs Optional:
Also applies to: 90-101
🏁 Script executed:
#!/bin/bash
python - << 'PY'
from app.services.players.profile import clean_dob
samples = [
"24/06/1987 (38)",
"1-2-2000 (5)",
"03.11.1999",
None,
"",
]
for s in samples:
print(repr(s), "->", clean_dob(s))
PYLength of output: 353
🏁 Script executed:
#!/bin/bash
rg -n "requires-python|python_requires" -C2 pyproject.toml setup.cfg setup.py || trueLength of output: 98
🏁 Script executed:
#!/bin/bash
python3 - << 'PY'
import sys, types
# Stub bs4 to avoid ModuleNotFoundError
sys.modules['bs4'] = types.ModuleType('bs4')
sys.modules['bs4'].BeautifulSoup = lambda *args, **kwargs: None
# Stub app.services.base before importing profile to bypass its imports
base_mod = types.ModuleType('app.services.base')
base_mod.TransfermarktBase = type('TransfermarktBase', (), {})
sys.modules['app.services.base'] = base_mod
from app.services.players.profile import clean_dob
samples = [
"24/06/1987 (38)",
"1-2-2000 (5)",
"03.11.1999",
None,
"",
]
for s in samples:
print(f"{s!r} -> {clean_dob(s)}")
PYLength of output: 168
🏁 Script executed:
#!/bin/bash
# Test clean_dob outputs with proper f-string and stubbed imports
python3 - << 'PY'
import sys, types
# Stub missing modules
sys.modules['bs4'] = types.ModuleType('bs4')
sys.modules['bs4'].BeautifulSoup = lambda *args, **kwargs: None
base_mod = types.ModuleType('app.services.base')
base_mod.TransfermarktBase = type('TransfermarktBase', (), {})
sys.modules['app.services.base'] = base_mod
from app.services.players.profile import clean_dob
samples = ["24/06/1987 (38)", "1-2-2000 (5)", "03.11.1999", None, ""]
for s in samples:
print(f"{s!r} -> {clean_dob(s)}")
PY
# Search for Python version requirements
rg -n "requires-python|python_requires|requires_python|requires-python" pyproject.toml || echo "No python requirement found"Length of output: 196
🏁 Script executed:
#!/bin/bash
# 1. Test clean_dob outputs with correct f-string
python3 - << 'PY'
import sys, types
# Stub missing modules
sys.modules['bs4'] = types.ModuleType('bs4')
sys.modules['bs4'].BeautifulSoup = lambda *args, **kwargs: None
base_mod = types.ModuleType('app.services.base')
base_mod.TransfermarktBase = type('TransfermarktBase', (), {})
sys.modules['app.services.base'] = base_mod
from app.services.players.profile import clean_dob
samples = ["24/06/1987 (38)", "1-2-2000 (5)", "03.11.1999", None, ""]
for s in samples:
print(f"{s!r} -> {clean_dob(s)}")
PY
# 2. Inspect Python version requirement in pyproject.toml
sed -n '1,200p' pyproject.tomlLength of output: 1508
Align type annotations with project’s Python requirement
- The pyproject.toml pins
python = "^3.9", but PEP 604 union syntax (str | None) isn’t valid until Python 3.10+. Either raise the requirement to^3.10in pyproject.toml or revert toOptional[str](withfrom typing import Optional) inapp/services/players/profile.py. - Add a small, table-driven pytest for
clean_dobcovering “24/06/1987 (38)” →"1987-06-24", “1-2-2000 (5)” →"2000-02-01", “03.11.1999” →"1999-11-03", plusNoneand empty string cases.
🤖 Prompt for AI Agents
In app/services/players/profile.py around lines 8 to 22, the function uses PEP
604 union syntax `str | None` which is invalid under the project's Python 3.9
constraint; replace the return annotation with Optional[str] and add `from
typing import Optional` at the top of the file, keeping behavior unchanged. Also
add a small table-driven pytest (e.g., in tests/unit/test_profile.py) that
parametrizes inputs and expected outputs for cases: "24/06/1987 (38)" ->
"1987-06-24", "1-2-2000 (5)" -> "2000-02-01", "03.11.1999" -> "1999-11-03", None
-> None, and "" -> None to ensure parsing and edge cases are covered.
Summary by CodeRabbit
New Features
Bug Fixes