nlp - How to create a table of contents using unstructured (the Python package)

tl;dr

How can I use the unstructured package to extract a clean table of contents from a PDF document that has hierarchical section headers?

Some more details

I have a PDF document that is multiple pages long. The text in the document is organised into multiple sections, each with a header/title. Each of these sections may be split into subsections with their own header/title. These subsections can in turn have subsubsections, and so on.

The document does not have a table of contents page. How can I use the unstructured package to automatically extract a table of contents from my document? The table of contents should have the same hierarchy as the sections and subsections in my document.

Example

If my document looks like this:

This is the title of section 1

Bla bla bla.

This is the title of subsection 1.1

More bla bla bla.

This is the title of subsubsection 1.1.1

More bla bla bla.

This is the title of subsection 1.2

More bla bla bla.

This is the title of section 2

Even more bla bla.

Then I would like to extract a table of contents from this that includes the hierarchy of headers. For example:

{
    "This is the title of section 1": 0,
    "This is the title of subsection 1.1": 1,
    "This is the title of subsubsection 1.1.1": 2,
    "This is the title of subsection 1.2": 1,
    "This is the title of section 2": 0,
}

Where the number indicates the level of the header in the hierarchy.
