The HTML Sanitizer runs whenever rich content is submitted to a Telligent Enterprise site.  It ensures that user-submitted content is secure and well-formatted, guarding against JavaScript-, plugin-, and HTML-based attacks (for example, cross-site scripting and content spoofing).  Specifically, it performs the following checks and adjustments:

  1. Ensures that only allowed HTML elements, attributes, and style attributes/values are used.  Any invalid markup is removed.
  2. Ensures that only allowed URLs are referenced.  Invalid URLs are removed.
  3. Ensures that HTML is well-formed.  Improper nesting and formatting are corrected.
  4. Adjusts markup based on configured rules, for example, to adjust markup generated from Word/Outlook.

The behavior of the HTML Sanitizer is defined within the <HtmlSanitization> node of the communityserver.config file.

Configuring Allowed HTML / CSS

The allowed HTML tags and attributes and the allowed CSS style attributes are defined within the <AllowedHtml> node under the <HtmlSanitization> node in the communityserver.config file in the following format:

<HtmlSanitization>
  <AllowedHtml>
    <UrlProtocols>
      <Protocol name="..." />
    </UrlProtocols>
    <Attributes>
      <Attribute name="..." />
    </Attributes>
    <Tags>
      <Tag name="...">
        <Attribute name="..." />
      </Tag>
      <Tag name="..." />
    </Tags>
    <Style>
      <Attribute name="..." />
      <Attribute name="...">
        <Value>...</Value>
      </Attribute>
    </Style>
  </AllowedHtml>
</HtmlSanitization> 

URL Protocols

The <UrlProtocols> node identifies the URL protocols (HTTP, FTP, MAILTO, etc.) that may be used within markup processed by the HTML Sanitizer.  Any URL that doesn't match one of these protocols and is not a local URL is removed.

Each supported protocol should be identified via the name attribute of a new <Protocol /> node within the <UrlProtocols> node.
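
As a minimal sketch, a <UrlProtocols> node allowing the three protocols mentioned above might look like the following.  The lowercase name values are an assumption made for illustration; the default communityserver.config shows the exact form the HTML Sanitizer expects.

```xml
<!-- Illustrative values only; the exact form of each name is an assumption -->
<UrlProtocols>
  <Protocol name="http" />
  <Protocol name="ftp" />
  <Protocol name="mailto" />
</UrlProtocols>
```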

Attributes

The <Attributes> node identifies HTML attributes that are allowed on every allowed HTML tag.  Specifying attributes within this node avoids the need to repeat them for each tag within the <Tags> node.

Each globally-allowed HTML attribute should be identified via the name attribute of a new <Attribute /> node within the <Attributes> node.
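
For illustration, a globally-allowed attribute list could be sketched as follows.  The attribute names are hypothetical choices and are not taken from the default configuration.

```xml
<!-- Hypothetical example: these attributes would be allowed on every allowed tag -->
<Attributes>
  <Attribute name="class" />
  <Attribute name="title" />
  <Attribute name="style" />
</Attributes>
```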

Tags

The <Tags> node identifies HTML tags that are allowed within content processed by the HTML Sanitizer.  Any tag that is not allowed is removed.

Each allowed tag should be identified via the name attribute on a new <Tag /> node within the <Tags> node.  Each <Tag /> node can optionally identify allowed attributes that are enabled only for that tag (note that attributes identified within the <Attributes> node are enabled for all tags).
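
The sketch below shows both forms of the <Tag> node: tags with their own attributes and tags with none.  The tag and attribute names are illustrative assumptions rather than the shipped defaults.

```xml
<!-- Hypothetical example of a <Tags> node -->
<Tags>
  <!-- href is allowed only on <a>, in addition to any global attributes -->
  <Tag name="a">
    <Attribute name="href" />
  </Tag>
  <!-- src and alt are allowed only on <img> -->
  <Tag name="img">
    <Attribute name="src" />
    <Attribute name="alt" />
  </Tag>
  <!-- Allowed with global attributes only -->
  <Tag name="p" />
  <Tag name="strong" />
</Tags>
```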

Style

The <Style> node identifies allowed CSS style attributes. Only enabled style attributes are allowed to be defined within the style attribute of allowed HTML tags (assuming that the style attribute is allowed for the tag).

Each allowed style attribute should be identified via the name attribute on a new <Attribute /> node within the <Style> node.

If only specific values should be allowed for an individual style attribute, a <Value> node can be added for each allowed value within the associated <Attribute> node.  When an attribute is allowed, but its value is not allowed, the attribute and its value are removed by the HTML Sanitizer.
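
A possible <Style> node combining an unrestricted attribute with a value-restricted one is sketched below.  The CSS attribute names and values are assumptions chosen for illustration.

```xml
<!-- Hypothetical example of a <Style> node -->
<Style>
  <!-- Any value is accepted for color -->
  <Attribute name="color" />
  <!-- Only the listed values are accepted; any other text-align value
       causes the attribute and its value to be removed -->
  <Attribute name="text-align">
    <Value>left</Value>
    <Value>center</Value>
    <Value>right</Value>
  </Attribute>
</Style>
```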

Configuring HTML Nesting and Formatting

The HTML formatting correction logic of the HTML Sanitizer is configured within the <SelfContainedHtml>, <CloseBeforeNextHtml>, and <RemoveContentsWithTags> nodes under the <HtmlSanitization> node in the communityserver.config file in the following format:

<HtmlSanitization>
  <SelfContainedHtml>
    <Tag name="..." />
  </SelfContainedHtml>
  <CloseBeforeNextHtml>
    <Tag name="...">
      <ParentTag name="..." />
    </Tag> 
    <Tag name="..." />
  </CloseBeforeNextHtml>
  <RemoveContentsWithTags>
    <Tag name="..." />
  </RemoveContentsWithTags>
</HtmlSanitization>

Self Contained HTML

The <SelfContainedHtml> node identifies the tags that should always be self-contained (should never have separate opening and closing tags).  Each self-contained tag should be identified via the name attribute on a new <Tag> node within the <SelfContainedHtml> node.
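
As an example, HTML tags that never have a closing tag are natural candidates for this node.  The list below is a sketch under that assumption and may not match the default configuration.

```xml
<!-- Hypothetical example: tags written as <br />, <hr />, <img ... /> -->
<SelfContainedHtml>
  <Tag name="br" />
  <Tag name="hr" />
  <Tag name="img" />
</SelfContainedHtml>
```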

Close Before Next HTML

The <CloseBeforeNextHtml> node identifies the tags that should always be closed before the next occurrence of the same tag.  For example, an <li> tag should be closed before the next opening <li> tag, so "li" should be identified within the <CloseBeforeNextHtml> node.  For tags that have required parents, the valid parent tags should also be defined.  This enables the HTML Sanitizer to ensure that the tag is properly nested (that it is within the context of an allowed parent tag) and to allow the tag to be nested when it is within the context of a new parent tag.  For example, <ul><li><ul><li></li></ul></li></ul> is valid even though an <li> opening tag exists within an open <li> tag, because the inner <li> tag is within a valid parent, <ul>.

Each tag that should be closed before the next occurrence should be defined via the name attribute of a new <Tag> node within the <CloseBeforeNextHtml> node.  Additionally, if the tag must exist within a valid parent tag, each parent tag should be defined via the name attribute of a <ParentTag> node within the corresponding <Tag> node.
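
The <li> case described above could be expressed as in the following sketch.  The "ol" parent and the "p" entry are assumptions added for illustration; only "li" and its "ul" parent come from the example in the text.

```xml
<!-- Sketch of the <li> example; "ol" and "p" are illustrative assumptions -->
<CloseBeforeNextHtml>
  <!-- <li> must be closed before the next <li>, unless a new <ul> or <ol> is opened -->
  <Tag name="li">
    <ParentTag name="ul" />
    <ParentTag name="ol" />
  </Tag>
  <!-- No required parent: simply closed before the next <p> -->
  <Tag name="p" />
</CloseBeforeNextHtml>
```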

Remove Contents With Tags

By default, when the HTML Sanitizer removes a tag, the tag's contents are kept.  To also remove the contents (everything between the opening and closing tags) whenever a tag is removed, include the tag within the <RemoveContentsWithTags> node.  Each tag should be identified via the name attribute of a new <Tag> node within the <RemoveContentsWithTags> node.
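
Tags whose contents are not meant to be displayed as text are typical candidates.  The sketch below assumes that "script" and "style" are not listed in the <Tags> node, so both the tags and everything inside them would be removed; the tag names are illustrative.

```xml
<!-- Hypothetical example: remove these tags together with their contents -->
<RemoveContentsWithTags>
  <Tag name="script" />
  <Tag name="style" />
</RemoveContentsWithTags>
```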

Configuring HTML Replacements

To automate common HTML cleanup tasks, the HTML Sanitizer supports processing simple replacement rules.  As an example, HTML replacement rules are used by default to adjust MsoNormal and MsoListParagraph style rule usage in markup generated from Outlook/Word and to ensure that all <img /> tags in user-entered content have alt attributes.  HTML replacements are defined within the <HtmlReplacements> node of the <HtmlSanitization> node in the communityserver.config file in the following format:

<HtmlSanitization>
  <HtmlReplacements>
    <HtmlReplacement>
      <Match tag="..." />
      <Replace tag="..." />
    </HtmlReplacement>
    <HtmlReplacement>
      <Match tag="...">
        <Attribute name="..." contains="..." />
      </Match>
      <Replace tag="...">
        <Attribute name="..." replace="..." with="..." />
      </Replace>
    </HtmlReplacement>
  </HtmlReplacements>
</HtmlSanitization>    

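Each replacement consists of a <Match> node, which identifies the tag to match and, optionally, attributes to match, and a <Replace> node, which identifies an optional replacement tag and optional attribute value replacements.  The format above shows the simple form first, followed by the full form.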

For example, to adjust <p class="MsoNormal">...</p> to <div>...</div>, the following rule could be defined within the <HtmlReplacements> node:

<HtmlReplacement>
  <Match tag="p">
    <Attribute name="class" contains="MsoNormal" />
  </Match>
  <Replace tag="div">
    <Attribute name="class" replace="MsoNormal" with="" />
  </Replace>
</HtmlReplacement>

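This rule matches <p> tags containing "MsoNormal" in their class attribute, changes the tag name to "div", and removes "MsoNormal" from the class attribute.  If the class attribute is empty after the replacement, the attribute itself is removed (see the notes below).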

A few notes on HTML replacements

  1. The HTML Sanitizer automatically removes any empty attributes after rules are processed.
  2. To add content to an attribute or add an attribute that may not exist, set the replace attribute on the <Attribute> node to an empty value.
  3. If multiple <Attribute> nodes within a single <Replace> node affect the same attribute name, only the last <Attribute> node with that name will be processed.
  4. The replacement tag is optional.  If the tag attribute is not defined on the <Replace> node, the tag name will not be adjusted.
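
Notes 2 and 4 can be combined in a rule such as the sketch below, which leaves the tag name unchanged and adds a value to an attribute that may not exist.  The tag, attribute, and value are hypothetical, and the text above does not specify how an added value is combined with an existing one, so verify the result against a test site.

```xml
<!-- Sketch only: tag, attribute, and value are hypothetical examples -->
<HtmlReplacement>
  <Match tag="a" />
  <!-- No tag attribute on Replace, so the tag name is left unchanged (note 4) -->
  <Replace>
    <!-- Empty replace value: adds to, or creates, the rel attribute (note 2) -->
    <Attribute name="rel" replace="" with="nofollow" />
  </Replace>
</HtmlReplacement>
```

An attribute added this way presumably still needs to be allowed for the tag under the <AllowedHtml> node; that interaction is an assumption, not something stated above.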