Jul 14, 2025

Making HTML from MS Office & Google Docs Semantic

If you’ve ever let users paste content from Google Docs or Microsoft Office into your web app, you know the pain: the HTML is a mess. Non-semantic tags, inline styles, and proprietary attributes make it hard for frameworks and editors to parse the content cleanly.

In this post, I’ll show you a practical approach to cleaning up and converting this HTML into semantic, accessible markup—before it ever hits your rich text editor or content pipeline.

Why Bother Making HTML Semantic?

Accessibility: Semantic tags (like , <blockquote>, <ol>, <ul>, <li>, etc.) help screen readers and assistive tech.
Cleaner Parsing: Frameworks and editors (Lexical, Slate, ProseMirror, etc.) work better with semantic HTML.
Maintainability: Semantic HTML is easier to style and debug.

The Problem: What Google Docs & Office Output Looks Like

When you copy-paste from Google Docs or Microsoft Word, you get things like:

<b style="font-weight: normal;">Not actually bold</b>
<span style="font-weight: 700;">Actually bold</span>
<p class="MsoListParagraph" style="mso-list: l0 level1 lfo1;">1. List item</p>
<o:p></o:p>

Google Docs uses  for normal text and  for bold.
Microsoft Office uses class names like MsoTitle for headings, MsoQuote for quotes, and MsoListParagraph for list items.
Both use lots of inline styles and non-semantic tags.

The Solution: Walk & Transform the DOM

Here’s a TypeScript function that walks the DOM and replaces these non-semantic elements with proper HTML:

const orderedListBulletText = [
  "1", "2", "3", "4", "5", "6", "7",
  "8", "9", "0", "a", "b", "c", "d",
  "e", "f", "g", "h", "i", "j", "k",
  "l", "m", "n", "o", "p", "q", "r",
  "s", "t", "u", "v", "w", "x", "y",
  "z",
];

function walk(node: Node) {
  if (node instanceof HTMLElement) {
    const el = node;
    // Google Docs: <b style="font-weight: normal"> → <span>
    if (el.tagName === "B" && el.style.fontWeight === "normal") {
      const span = document.createElement("span");
      span.append(...el.childNodes);
      el.replaceWith(span);
      walk(span);
    // Google Docs: <span style="font-weight: 700"> → <strong>
    } else if (parseInt(el.style.fontWeight) >= 700) {
      const strong = document.createElement("strong");
      strong.append(...el.childNodes);
      el.replaceWith(strong);
      walk(strong);
    // MS Office: <p class="MsoTitle"> → <h1>
    } else if (el.className.startsWith("MsoTitle")) {
      const heading = document.createElement("h1");
      heading.append(...el.childNodes);
      el.replaceWith(heading);
      walk(heading);
    // MS Office: <p class="MsoQuote"> → <blockquote>
    } else if (
      el.className.startsWith("MsoIntenseQuote") ||
      el.className.startsWith("MsoQuote")
    ) {
      const quote = document.createElement("blockquote");
      quote.append(...el.childNodes);
      el.replaceWith(quote);
      walk(quote);
    // MS Office: <o:p> → remove
    } else if (el.tagName.toLowerCase() === "o:p") {
      el.remove();
    // MS Office: <span style="mso-list:ignore"> → remove
    } else if (el.getAttribute("style")?.toLowerCase() === "mso-list:ignore") {
      el.remove();
    }

    // ... (list transformation logic omitted for brevity)

    for (const child of el.childNodes) {
      walk(child);
    }
  }
}

export function makeHtmlSemantic(html: string): string {
  const parser = new window.DOMParser();
  const doc = parser.parseFromString(html, "text/html");
  walk(doc.body);
  return doc.body.innerHTML;
}

How It Works

Tag Replacement: Converts non-semantic tags to semantic ones (e.g.,  to ,  to , etc.).
Class & Style Detection: Looks for MS Office/Google Docs-specific class names and styles.
List Handling: (See full code) Detects and reconstructs ordered/unordered lists from paragraphs and inline styles.
Removes Junk: Strips out proprietary tags like <o:p> and elements with mso-list:ignore.

Usage Example

const dirtyHtml = /* HTML pasted from Google Docs or Word */;
const cleanHtml = makeHtmlSemantic(dirtyHtml);
// Now pass cleanHtml to your editor or parser

Framework Usage Examples

Lexical (React)

Here’s how you can intercept paste events in Lexical and clean up the HTML before it’s inserted:

import { $getSelection, PASTE_COMMAND } from 'lexical';
import { useLexicalComposerContext } from '@lexical/react/LexicalComposerContext';
import { $insertDataTransferForRichText } from '@lexical/clipboard';
import { useEffect } from 'react';
import { makeHtmlSemantic as cleanHtmlWithSemantics } from '../utils/makeHtmlSemantic';

export function CleanPastePlugin() {
  const [editor] = useLexicalComposerContext();

  useEffect(() => {
    return editor.registerCommand(
      PASTE_COMMAND,
      (event: ClipboardEvent) => {
        const clipboard = event.clipboardData;
        if (!clipboard) return false;

        const pastedHtml = clipboard.getData('text/html');
        const selection = $getSelection();
        if (pastedHtml && selection) {
          const cleanedClipboard = new DataTransfer();
          cleanedClipboard.setData('text/html', cleanHtmlWithSemantics(pastedHtml));
          $insertDataTransferForRichText(cleanedClipboard, selection, editor);
          return true;
        }
        return false;
      },
      2
    );
  }, [editor]);

  return null;
}

Slate (React)

In Slate, you can override the onPaste event to preprocess HTML before inserting:

import { ReactEditor } from 'slate-react';
import { makeHtmlSemantic } from '../utils/makeHtmlSemantic';

function handlePaste(event: React.ClipboardEvent, editor: ReactEditor) {
  const html = event.clipboardData.getData('text/html');
  if (html) {
    event.preventDefault();
    const cleanHtml = makeHtmlSemantic(html);
    // Use your favorite HTML-to-Slate conversion here
    const fragment = deserializeHtmlToSlate(cleanHtml);
    editor.insertFragment(fragment);
  }
}

// ...
<Editable onPaste={event => handlePaste(event, editor)} />

ProseMirror

For ProseMirror, you can use a custom paste handler in your plugin:

import { Plugin } from 'prosemirror-state';
import { DOMParser } from 'prosemirror-model';
import { makeHtmlSemantic } from '../utils/makeHtmlSemantic';

const semanticPastePlugin = new Plugin({
  props: {
    handlePaste(view, event, slice) {
      const html = event.clipboardData?.getData('text/html');
      if (html) {
        const cleanHtml = makeHtmlSemantic(html);
        const parser = DOMParser.fromSchema(view.state.schema);
        const dom = document.createElement('div');
        dom.innerHTML = cleanHtml;
        const docFragment = parser.parse(dom);
        const tr = view.state.tr.replaceSelectionWith(docFragment);
        view.dispatch(tr);
        return true;
      }
      return false;
    },
  },
});

Tips & Gotchas

Always sanitize user HTML before inserting it into the DOM (use DOMPurify or similar).
This approach is a starting point—expand the function to handle more edge cases as needed.
Test with real-world pasted content from both Google Docs and MS Office.