Skip to content

Conversation

@KernowMaiden
Copy link

@KernowMaiden KernowMaiden commented Sep 26, 2025

Summary by CodeRabbit

  • New Features

    • Player profiles now include both a raw date-of-birth field and a parsed/standardized date-of-birth.
  • Bug Fixes

    • Improved date-of-birth and age parsing to accept multiple date separators and 1–2 digit ages, and to tolerate optional spacing — reducing parsing errors and improving extraction accuracy.

git commit -m "Fix DOB/age parsing for player profiles"
@coderabbitai
Copy link

coderabbitai bot commented Sep 26, 2025

Walkthrough

Replaced strict DOB regex with a flexible dd[./-]dd[./-]yyyy pattern (optional parenthesized 1–2 digit age) and added DOB parsing utilities plus two optional DOB fields to player profile schemas; profile parsing now stores raw DOB and ISO-normalized DOB using the new utilities.

Changes

Cohort / File(s) Summary
Regex update
app/utils/regex.py
Replaced REGEX_DOB_AGE pattern that required Mon D, YYYY (YY) with a flexible pattern capturing (?P<dob>\d{1,2}[./-]\d{1,2}[./-]\d{4})(?:\s*\((?P<age>\d{1,2})\))? allowing dd/mm/yyyy-like formats and an optional 1–2 digit age in parentheses.
Schema additions
app/schemas/players/profile.py
Added two optional string fields dateOfBirth and dateOfBirthRaw to PlayerProfile, defaulting to None, exposing raw and normalized DOB values.
Service logic & helpers
app/services/players/profile.py
Added `clean_dob(raw_text: str) -> str

Sequence Diagram(s)

sequenceDiagram
  participant S as Scraper / Source
  participant P as profile parser
  participant U as clean_dob util
  participant D as PlayerProfile

  S->>P: provide raw DOB string (e.g. "12/05/1990 (34)")
  P->>P: trim & remove trailing age marker (if present)
  P->>U: clean_dob(raw_trimmed)
  U-->>P: ISO date string or null
  P->>D: set dateOfBirthRaw = original trimmed (or None)
  P->>D: set dateOfBirth = ISO (or None)
  Note over P,D: TransfermarktPlayerProfile used in overall mapping
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

I nibble strings of dates and dots,
I hop through formats, ties, and knots.
Raw and tidy, both I store,
Carrots of data — one and more.
A rabbit's regex, neat and spry,
Hops DOBs into ISO sky. 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title succinctly captures the primary focus of the changes by indicating that the pull request fixes the date‐of‐birth and age parsing logic for player profiles, which aligns directly with the regex update, schema extensions, and service modifications introduced in the changeset. It is specific to the domain (player profiles) and clearly describes the key intent (fixing DOB/age parsing) without unnecessary detail.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (5)
app/schemas/players/profile.py (1)

63-64: Type normalized DOB as a date; keep raw as string.

For consistency with other fields (e.g., joined/contract_expires), prefer the normalized value to be a date. Keep Raw as string.

If API clients expect a string, keep as-is; otherwise:

-    dateOfBirth: Optional[str] = None
+    dateOfBirth: Optional[date] = None
     dateOfBirthRaw: Optional[str] = None

Additionally, consider deprecating date_of_birth or wiring an alias to avoid duplicate fields diverging. If aliases are used in TransfermarktBaseModel, ensure both serialize consistently.

app/utils/regex.py (1)

6-6: Tighten DOB regex with boundaries and valid day/month ranges.

Reduce false positives and keep permissive matching without anchors.

-REGEX_DOB_AGE: str = r"(?P<dob>\d{1,2}[./-]\d{1,2}[./-]\d{4})(?:\s*\((?P<age>\d{1,2})\))?"
+REGEX_DOB_AGE: str = (
+    r"\b(?P<dob>(0?[1-9]|[12][0-9]|3[01])[./-](0?[1-9]|1[0-2])[./-](19|20)\d{2})\b"
+    r"(?:\s*\((?P<age>\d{1,2})\))?"
+)
app/services/players/profile.py (3)

5-7: Avoid regex drift: compile fallback from the shared constant.

Using a local hardcoded pattern can diverge from app.utils.regex.REGEX_DOB_AGE over time.

If import order allows, prefer:

# after importing REGEX_DOB_AGE
_DOB_FALLBACK = re.compile(REGEX_DOB_AGE)

If needed, move the REGEX_DOB_AGE import above and initialize _DOB_FALLBACK there.


8-22: Type hint portability + narrower exception.

  • Use Optional[str] for broader Python compatibility if <3.10 is supported.
  • Catch ValueError instead of bare Exception.
-def clean_dob(raw_text: str) -> str | None:
+from typing import Optional
+def clean_dob(raw_text: Optional[str]) -> Optional[str]:
@@
-    except Exception:
+    except ValueError:
         return None

If you later switch schema dateOfBirth to a date type, have clean_dob return a date object and let Pydantic serialize it.


90-101: Reuse cached text; avoid second XPath; handle age in one place.

You already have raw_dob_text. Use it for age extraction and drop the extra DOM read later.

-raw_dob_text = self.get_text_by_xpath(Players.Profile.DATE_OF_BIRTH_AGE)
+raw_dob_text = self.get_text_by_xpath(Players.Profile.DATE_OF_BIRTH_AGE)
@@
-# Store both versions
-self.response["dateOfBirthRaw"] = raw_dob_only # "24/06/1987"
-self.response["dateOfBirth"] = clean_dob(raw_dob_only) # "1987-06-24"
+# Store both versions
+self.response["dateOfBirthRaw"] = raw_dob_only  # "24/06/1987"
+self.response["dateOfBirth"] = clean_dob(raw_dob_only)  # "1987-06-24"
+# Extract age from the same text to avoid a second XPath query
+self.response["age"] = safe_regex(raw_dob_text, REGEX_DOB_AGE, "age")

Then remove the age block at Lines 106–110.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 77cb1b2 and cbdf45c.

📒 Files selected for processing (3)
  • app/schemas/players/profile.py (1 hunks)
  • app/services/players/profile.py (2 hunks)
  • app/utils/regex.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
app/services/players/profile.py (2)
app/services/base.py (1)
  • get_text_by_xpath (152-206)
app/utils/xpath.py (5)
  • Players (1-102)
  • Players (148-177)
  • Profile (18-57)
  • Profile (106-138)
  • Profile (181-183)
🔇 Additional comments (1)
app/utils/regex.py (1)

6-6: safe_regex correctly uses re.search
The implementation in app/utils/utils.py (line 81) calls re.search, so the unanchored DOB/age pattern will match anywhere in the string as intended.

Comment on lines +8 to +22
def clean_dob(raw_text: str) -> str | None:
"""
Find a DD[./-]MM[./-]YYYY in the input and return ISO YYYY-MM-DD.
Returns None if not found or parsing fails.
"""
if not raw_text:
return None
m = _DOB_FALLBACK.search(raw_text)
if not m:
return None
dob = m.group("dob").replace("-", ".").replace("/", ".") # unify to dots
try:
return datetime.strptime(dob, "%d.%m.%Y").strftime("%Y-%m-%d")
except Exception:
return None
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Quick sanity checks for DOB parsing.

Add a tiny table‑driven test to guard formats you intend to support.

Also confirm the repo’s Python requirement to validate str | None usage vs Optional:

Also applies to: 90-101


🏁 Script executed:

#!/bin/bash
python - << 'PY'
from app.services.players.profile import clean_dob
samples = [
    "24/06/1987 (38)",
    "1-2-2000 (5)",
    "03.11.1999",
    None,
    "",
]
for s in samples:
    print(repr(s), "->", clean_dob(s))
PY

Length of output: 353


🏁 Script executed:

#!/bin/bash
rg -n "requires-python|python_requires" -C2 pyproject.toml setup.cfg setup.py || true

Length of output: 98


🏁 Script executed:

#!/bin/bash
python3 - << 'PY'
import sys, types

# Stub bs4 to avoid ModuleNotFoundError
sys.modules['bs4'] = types.ModuleType('bs4')
sys.modules['bs4'].BeautifulSoup = lambda *args, **kwargs: None

# Stub app.services.base before importing profile to bypass its imports
base_mod = types.ModuleType('app.services.base')
base_mod.TransfermarktBase = type('TransfermarktBase', (), {})
sys.modules['app.services.base'] = base_mod

from app.services.players.profile import clean_dob

samples = [
    "24/06/1987 (38)",
    "1-2-2000 (5)",
    "03.11.1999",
    None,
    "",
]
for s in samples:
    print(f"{s!r} -> {clean_dob(s)}")
PY

Length of output: 168


🏁 Script executed:

#!/bin/bash
# Test clean_dob outputs with proper f-string and stubbed imports
python3 - << 'PY'
import sys, types
# Stub missing modules
sys.modules['bs4'] = types.ModuleType('bs4')
sys.modules['bs4'].BeautifulSoup = lambda *args, **kwargs: None
base_mod = types.ModuleType('app.services.base')
base_mod.TransfermarktBase = type('TransfermarktBase', (), {})
sys.modules['app.services.base'] = base_mod

from app.services.players.profile import clean_dob
samples = ["24/06/1987 (38)", "1-2-2000 (5)", "03.11.1999", None, ""]
for s in samples:
    print(f"{s!r} -> {clean_dob(s)}")
PY

# Search for Python version requirements
rg -n "requires-python|python_requires|requires_python|requires-python" pyproject.toml || echo "No python requirement found"

Length of output: 196


🏁 Script executed:

#!/bin/bash
# 1. Test clean_dob outputs with correct f-string
python3 - << 'PY'
import sys, types
# Stub missing modules
sys.modules['bs4'] = types.ModuleType('bs4')
sys.modules['bs4'].BeautifulSoup = lambda *args, **kwargs: None
base_mod = types.ModuleType('app.services.base')
base_mod.TransfermarktBase = type('TransfermarktBase', (), {})
sys.modules['app.services.base'] = base_mod

from app.services.players.profile import clean_dob
samples = ["24/06/1987 (38)", "1-2-2000 (5)", "03.11.1999", None, ""]
for s in samples:
    print(f"{s!r} -> {clean_dob(s)}")
PY

# 2. Inspect Python version requirement in pyproject.toml
sed -n '1,200p' pyproject.toml

Length of output: 1508


Align type annotations with project’s Python requirement

  • The pyproject.toml pins python = "^3.9", but PEP 604 union syntax (str | None) isn’t valid until Python 3.10+. Either raise the requirement to ^3.10 in pyproject.toml or revert to Optional[str] (with from typing import Optional) in app/services/players/profile.py.
  • Add a small, table-driven pytest for clean_dob covering “24/06/1987 (38)” → "1987-06-24", “1-2-2000 (5)” → "2000-02-01", “03.11.1999” → "1999-11-03", plus None and empty string cases.
🤖 Prompt for AI Agents
In app/services/players/profile.py around lines 8 to 22, the function uses PEP
604 union syntax `str | None` which is invalid under the project's Python 3.9
constraint; replace the return annotation with Optional[str] and add `from
typing import Optional` at the top of the file, keeping behavior unchanged. Also
add a small table-driven pytest (e.g., in tests/unit/test_profile.py) that
parametrizes inputs and expected outputs for cases: "24/06/1987 (38)" ->
"1987-06-24", "1-2-2000 (5)" -> "2000-02-01", "03.11.1999" -> "1999-11-03", None
-> None, and "" -> None to ensure parsing and edge cases are covered.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant