parse_fragment does not parse whitespace in HTML (or XML) text properly #421

@calimeroteknik

Description

parse_fragment does not handle whitespace in HTML (or XML) text properly: runs of spaces, tabs and newlines are kept as-is when they should be collapsed.

To Reproduce

Steps to reproduce the behavior:

  • Using Floki v0.33.1
  • Using Elixir v1.13.2
  • Using Erlang OTP 24.3.2 [erts-12.3]
  • With this code:
      Floki.parse_document("<!DOCTYPE html>\n<html>\n\t<head>\n\t\t<title> \t&#110;&#111;&#116;&#104;&#105;&#110;&#103;\t\n\t\t\t &#116;&#111;\n&#115;&#101;&#101;  &#104;&#101;&#114;&#101;&#44;&#32;&#119;&#111;&#114;&#107;&#105;&#110;&#103;&#32;&#112;&#114;&#111;&#112;&#101;&#114;&#108;&#121; \n\n\t\t</title>\n\t</head>\n\t<body>\n\t</body>\n</html>\n")
        |> Rustic.Result.map_err(fn reason -> {:invalid_html, reason} end)
        |> Rustic.Result.and_then(fn doc ->
          doc
          |> Floki.find("head > title")
          |> Enum.take(1)
          |> Floki.text()
          |> Floki.HTMLParser.parse_fragment()
        end)
    I get the following output:
    {:ok, [" \tnothing\t\n\t\t\t to\nsee  here, working properly \n\n\t\t"]}

Expected behavior

The following output:

{:ok, [" nothing to see here, working properly "]}

(I think the leading and trailing whitespace must not be trimmed; like the other runs, it must be collapsed to a single space. This might need triple-checking against the standards.)
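
Until this is settled, the expected output can be approximated by collapsing whitespace on the extracted text after the fact. This is only a sketch of that workaround, not Floki's behaviour or API: WhitespaceSketch and collapse/1 are hypothetical names, and it assumes "whitespace" means the five ASCII whitespace characters (space, tab, LF, FF, CR), with leading and trailing runs collapsed to one space rather than trimmed.

```elixir
defmodule WhitespaceSketch do
  # Hypothetical helper, not part of Floki.
  # Replaces every run of ASCII whitespace (space, tab, LF, FF, CR)
  # with a single space. Leading and trailing runs are collapsed too,
  # not trimmed, matching the expected output in this report.
  def collapse(text) when is_binary(text) do
    String.replace(text, ~r/[ \t\n\f\r]+/, " ")
  end
end

# The text currently returned for the <title> element in the reproduction.
raw = " \tnothing\t\n\t\t\t to\nsee  here, working properly \n\n\t\t"

IO.inspect(WhitespaceSketch.collapse(raw))
# prints " nothing to see here, working properly "
```

In the reproduction above, such a helper would sit right after Floki.text() in the pipeline. Whether the collapsing belongs in the parser at all is the open question of this issue.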

Test file (HTML): floki-test.html.txt
