jsoup removes linebreak from element

The following Kotlin script reads and then writes an XML file.

#!/usr/bin/env kscript

@file:DependsOn("org.jsoup:jsoup:1.12.1")

import java.io.File
import org.jsoup.Jsoup
import org.jsoup.parser.Parser

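// Parse with the XML parser rather than the default HTML parser.
val pom = Jsoup.parse(File("pom.xml").readText(), "", Parser.xmlParser())
// Turn off pretty-printing so jsoup does not re-indent the document.
pom.outputSettings().prettyPrint(false)

// Serialize the document and overwrite the original file.
File("pom.xml").writeText(pom.html())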

jsoup seems to keep the formatting, comments, and even the attribute order, which is really nice, but it alters the root (<project>) element, as the git diff shows:

-<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
-         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
+<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">

The original element has a line break after the xmlns:xsi attribute, but jsoup seems to remove it and puts all attributes on a single line.

Is there any way to keep the line break?
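
Until something like this is supported by the parser, one possible workaround is to copy the root start tag verbatim from the original text back into the serialized output. The sketch below is plain string handling and does not use jsoup at all; `restoreRootStartTag` is a hypothetical helper name, and it assumes that the root element's attributes were not modified and that no attribute value contains a literal `>`.

```kotlin
// Hypothetical helper, not part of jsoup: replaces the root start tag in the
// serialized output with the start tag exactly as written in the original text.
fun restoreRootStartTag(original: String, serialized: String, rootName: String): String {
    // Matches "<rootName" followed by optional attributes, up to the first ">".
    // Assumption: no attribute value contains a literal ">".
    val startTag = Regex("<" + Regex.escape(rootName) + "(\\s[^>]*)?>")

    // If either text has no matching start tag, leave the output unchanged.
    val originalTag = startTag.find(original) ?: return serialized
    val serializedTag = startTag.find(serialized) ?: return serialized

    // Splice the original tag (including its line breaks) over the rewritten one.
    return serialized.replaceRange(serializedTag.range, originalTag.value)
}
```

In the script above, this would mean keeping the result of File("pom.xml").readText() in a variable and writing restoreRootStartTag(original, pom.html(), "project") instead of pom.html(). Any intentional change to the root element's attributes would be overwritten, so this only fits cases where the root tag is left alone.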

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
jhy commented, Jul 6, 2021

Hi @helpermethod - I’m definitely open to a PR! My main concern is increasing the memory footprint of the DOM by having to track a lot more state in the Attributes object, which obviously gets attached to so many Element objects.

One approach would be to extend Attributes to something like WhiteSpacePreservingAttribute, and use that when creating new attributes in the parse phase, if this option is enabled.

Maybe sketch out a design and we can discuss before putting too much code down. I think it would be a useful optional addition if we can keep the memory footprint of non-users down.
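
As a rough illustration of the subclass idea, and not jsoup code or an agreed design, the sketch below uses a stand-in `SimpleAttributes` class in place of jsoup's `Attributes`. All names here (`SimpleAttributes`, `WhitespacePreservingAttributes`, `newAttributes`, `separatorBefore`) are hypothetical. The point it tries to show is that only the opt-in subclass allocates the extra map, so documents parsed without the option carry no additional state.

```kotlin
// Illustrative stand-in for an attribute container; NOT jsoup's Attributes class.
open class SimpleAttributes {
    protected val attrs = LinkedHashMap<String, String>()

    // The base class accepts the leading whitespace but does not store it.
    open fun put(key: String, value: String, leadingWhitespace: String = " ") {
        attrs[key] = value
    }

    // Separator written before an attribute; the base class always uses one space.
    protected open fun separatorBefore(key: String): String = " "

    // Serializes the attributes in insertion order (no value escaping in this sketch).
    fun html(): String =
        attrs.entries.joinToString("") { (k, v) -> separatorBefore(k) + k + "=\"" + v + "\"" }
}

// Hypothetical opt-in subclass: only this variant pays for the extra map.
class WhitespacePreservingAttributes : SimpleAttributes() {
    private val whitespace = HashMap<String, String>()

    override fun put(key: String, value: String, leadingWhitespace: String) {
        super.put(key, value, leadingWhitespace)
        whitespace[key] = leadingWhitespace
    }

    override fun separatorBefore(key: String): String = whitespace[key] ?: " "
}

// What a parser might call when creating attributes, depending on the option.
fun newAttributes(preserveWhitespace: Boolean): SimpleAttributes =
    if (preserveWhitespace) WhitespacePreservingAttributes() else SimpleAttributes()
```

A real implementation would also have to decide what happens to the stored whitespace when attributes are renamed, removed, or added after parsing, which this sketch ignores.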

0 reactions
helpermethod commented, Jul 6, 2021

Hi @jhy,

thanks for your suggestions. I agree that the functionality should not impact users who do not use this feature.

I’ll have to think a little bit about it 😃.
