How can I export all OneNote pages to individual markdown files?

I am moving to Linux and the last hurdle is to get out of OneNote. I'd like to export all of my notebooks so that every page goes to an individual markdown file.

I've tried many things—this thread had several suggestions, but they are all outdated.

If I could get OneNote to export all of the pages as individual .docx files, it would be easy to use pandoc to convert them to individual markdown files. But OneNote will only export multiple pages as a single file. So one route would be to find a way to automate the export of every single page individually.

Another option is to export entire notebooks at a time as .docx files, convert them to markdown with pandoc, and then split the files. But I am not enough of a regex wizard to get csplit to cut the files correctly with just its basic regular expressions, and not enough of an awk wizard to get it to output files with the correct and full regular expression.

Can anyone help me with this?

Solution 1:

I did end up finding an export pipeline, but it was a pain. Here are my notes from doing that:

workflow:

1. Turn off your network(s) to prevent OneNote from performing a lengthy OneDrive sync after each export.
2. In the Notebooks list, expand the notebook to see all the tabs.
3. Right-click a tab and click "Export...".
4. Click the filetype dropdown and press M to select .docx format. Press Enter to select it.
5. Press Enter again to save the exported file.
6. Repeat steps 2-5 for each tab in the notebook.
7. Set up pandoc and open a PowerShell or cmd window.
8. cd into the directory where the exported .docx files are located.
9. For each exported .docx file, use the following pandoc command to convert it to markdown (replace journal with the name of your file):
```
pandoc --extract-media='' --wrap=preserve '.\journal.docx' -o journal.md
```
Here's an explanation of the command: --extract-media='' tells pandoc to extract images from the .docx file and put them in the default subfolder (named 'media' by default). --wrap=preserve tells pandoc not to hard-wrap the output file with linebreaks (which it does by default). The next field is the input filename, and -o stands for 'output', so journal.md is the output filename.

If you don't want to split this file (for example, if your tab contained only one page), skip to step 15.

(When you are doing a bunch of these, you can press the ↑ (up arrow) key to recall the previous command in the shell, then edit the filename.)
10. Create a new folder to store the pages in the tab. For this example, right now all the pages from our Journal tab in OneNote are mashed together in journal.md. Make a folder called journal which will store the final separated pages as individual .md files.
11. If there were any images in the .docx file, these will be exported to a new folder called media. Drag the media folder, if it exists, into the folder you just created. (This is why we need to do each pandoc operation separately: each export creates its own media folder, and we want to keep these separate so the links in the markdown files work correctly. We could write a clever script to do all this automatically, but it will take less time to just do it manually, unless you have a huge number of notebooks.) (Note: You can save a step by putting your desired folder name in the single quotes of the --extract-media='' argument: for .docx files with images, a folder will be created automatically for you.)
12. Open a bash terminal and cd to the directory containing the .md file. The folder you created in step 10 must be a subfolder of this one (unless you fix the path in the following command).
13. If you haven't already, click the Windows Bash window icon, click on Properties, check QuickEdit Mode, then click OK. Now click on the Windows Bash window icon again, this time click on Defaults, check QuickEdit Mode, and then click OK (so new Bash windows you create in the future will remember this setting). Now you can select text in the terminal and press Ctrl+C to copy, or right-click the terminal window to paste text from the clipboard. This lets us prepare our command in a separate location and quickly paste each version into Bash.
14. Customize the following command and run it for each .md file you want to split into individual pages:

csplit ./journal.md --keep-files --prefix='journal/journalentry ' --suffix-format='%i.md' --elide-empty-files '/^\(Monday\|Tuesday\|Wednesday\|Thursday\|Friday\|Saturday\|Sunday\),/-2' '{*}'

(Type it as one line.)

As you can see, journal.md is the name of our markdown file (in the current directory, denoted by ./), the second occurrence of journal (after --prefix=') is the name of our subfolder which will contain the split files, and journalentry is what each file will be named (followed by an index number).

If you want to understand the command, here's an explanation: --keep-files keeps the files already written if csplit hits an error, instead of deleting them, which protects the last page (since it doesn't end in the pattern of our regular expression). --prefix sets the naming scheme of the output files. --suffix-format allows us to set our file extension (.md in this case), but we must include %i, the printf-style conversion that outputs the index number of the file. --elide-empty-files skips outputting empty files, which we don't care about. Finally, the regular expression, which begins with '/ and ends with /-2', defines when to split the file: it says "When you find (/) at the start of the line (^) the following group (\() Monday or (\|) Tuesday or Wednesday or Thursday or Friday or Saturday or Sunday (\)) followed by a comma, step back two lines (-2)" and split the file there, outputting what we have up until now. The final bit, '{*}', repeats the previous pattern indefinitely, until the end of the file is reached.
15. Drag the .docx and .md files into a folder, say a folder you create now called intermediates. Or you can just delete them. It's nice to save them for a while, until you are comfortable with your new file format, in case you want to go back and reference something that happened during the conversion process. Moving them into the intermediates folder now will reduce the chance of forgetting where we are and repeating steps.
16. Repeat steps 9-14 for each .docx file you exported from OneNote.
17. Now you have one folder for each tab, with a bunch of separate .md files in it, one for each page! Plus a media folder in each subfolder that had images in the OneNote tab.
18. I recommend exporting each of your OneNote notebooks as a .mht file (Single File Web Page), or, if you prefer, a .pdf. This way, if formatting or other information was lost in some of your markdown files because of the multiple conversions, you can always go back and easily see how it was supposed to look in the .mht file. In addition, I would recommend exporting each of your OneNote notebooks as a .onepkg file (OneNote Package), so you have a nice final export copy if you ever want to reopen the notebook in OneNote in its native/original file format (this might be useful if, for example, the .mht file is also missing some original formatting that you want to recover).
19. As you finish each notebook, right-click the notebook in OneNote and click "Close This Notebook" so you won't accidentally edit the notebook and have to re-export your new changes. For the markdown folders, I also created a folder for each notebook, and put all the tab folders in it.
20. When you are finished with the whole export project, you can go to your OneDrive and delete all the original OneNote notebooks which have synced there (make sure you are backing up your own files now, of course! There is OneDrive for Linux, or you could try something like Syncthing).
21. Finally, we can rename all our .md files to their OneNote page title, which is the first line in each file, by using two scripts. Make the following files:

File 1: ~/scripts/rename-files-to-first-line.sh
```
#!/bin/bash
for i in *.md ; do mv -n "$i" "$(head -n1 "$i" | tr -d '\000-\037[]{}()/\?*')".md; done
```
File 2: ~/scripts/recurse.sh
```
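#!/bin/bash
# Run the command given as $1 inside every directory below the current one.
CDIR=$(pwd)
# find copes with spaces in directory names, which parsing `ls -R` output does not.
find . -type d -print0 | while IFS= read -r -d '' DIR; do
    cd "$DIR" || continue         # Enter the directory, or skip it on failure
    $1 < /dev/null                # Your command (kept away from find's output)
    cd "$CDIR" || exit 1          # Return to the starting directory
done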
```
Then make both scripts executable (chmod +x ~/scripts/*.sh), navigate to your notes folder, and run recurse.sh with rename-files-to-first-line.sh as its argument:
```
$ ~/scripts/recurse.sh ~/scripts/rename-files-to-first-line.sh
```
You will see the script go through all your files recursively, throwing some errors on files with weird first lines (that won't convert to a filename) and on other edge cases. However, the mv command in rename-files-to-first-line is executed with argument -n, which will prevent it from overwriting any files. There might be a few notes that don't get renamed, because the first line in them is blank or something else weird, but you can just fix those few files manually.
22. Bask in your clean escape from OneNote.
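
Before running the real rename, it may help to preview what each file would become. The following is a minimal sketch, not part of the original scripts: preview_renames is a hypothetical helper that applies the same head and tr pipeline as rename-files-to-first-line.sh to the .md files in the current directory, but only prints the planned names and flags files whose first line is empty.

```shell
#!/bin/bash
# Dry run of the rename step: print "old -> new" for each .md file in the
# current directory without moving anything.
# preview_renames is an illustrative name, not part of the original answer.
preview_renames() {
    local f title
    for f in *.md; do
        [ -f "$f" ] || continue      # nothing matched, or not a regular file
        # Same character filter as the rename script: drop control
        # characters and symbols that are awkward in filenames.
        title=$(head -n1 "$f" | tr -d '\000-\037[]{}()/\?*')
        if [ -z "$title" ]; then
            printf 'SKIP  %s (empty first line)\n' "$f"
        else
            printf '%s -> %s.md\n' "$f" "$title"
        fi
    done
}
```

Run preview_renames inside one tab folder (for example journal) and fix any SKIP entries by hand before doing the actual rename.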

Caveats:

This doesn't capture subpages—you will have to recreate those with sub-subfolders, if you like.
I don't know how well it does with tables—markdown is a bit ungainly for tables anyway.
There are probably other kinds of formatting, such as fonts, which get lost or screwed up in the export. But for rich text and images, it works pretty well!
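
If you have many exported .docx files, the per-file pandoc conversion and the media-folder shuffling described in the workflow could be scripted. This is only a sketch under assumptions: pandoc is on your PATH in a bash shell, convert_all_docx is a made-up function name, and running pandoc from inside each new folder with --extract-media=. is my substitution for dragging the media folder by hand, so that image links end up relative to the folder where the split pages will live.

```shell
#!/bin/bash
# Convert every exported .docx in the current directory to markdown.
# Each file gets its own folder (journal.docx -> ./journal/) that receives
# the extracted media, so images from different tabs do not collide.
# convert_all_docx is a hypothetical helper name.
convert_all_docx() {
    local docx name
    for docx in *.docx; do
        [ -e "$docx" ] || continue      # no .docx files: the glob stayed literal
        name="${docx%.docx}"            # journal.docx -> journal
        mkdir -p "$name"
        # Run pandoc from inside the folder: media goes to ./journal/media,
        # while the combined markdown is written next to the .docx as journal.md.
        ( cd "$name" &&
          pandoc --extract-media=. --wrap=preserve "../$docx" -o "../$name.md" )
    done
}
```

Call convert_all_docx from the folder that holds the exported .docx files; the csplit step can then write its pages into each folder, next to media.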

Solution 2:

The other answer didn't cut it for me, because my notes are not journal entries, but I found a solution using Microsoft's Graph API. This means you don't even have to run OneNote; it only requires that your notes are synced to your Microsoft account. You then get your notes as perfectly formatted HTML (which you can view in the browser or convert to whatever format you prefer using Pandoc).

The magic happens in this Python script. It runs a simple local web server that you can use to log in to your Microsoft account. Once you do that, it downloads all your notes as HTML, plus images and attachments in their original formats, and stores them in a file hierarchy that preserves the original structure of your notebooks (including page order and subpages).

Before you can run the script, you have to register an "app" in Microsoft Azure so it can access the Graph API:

1. Go to https://aad.portal.azure.com/ and log in with your Microsoft account.
2. Select "Azure Active Directory" and then "App registrations" under "Manage".
3. Select "New registration". Choose any name, set "Supported account types" to "Accounts in any organizational directory and personal Microsoft accounts" and under "Redirect URI", select Web and enter http://localhost:5000/getToken. Register.
4. Copy the "Application (client) ID" and paste it as client_id in the beginning of the Python script.
5. Select "Certificates & secrets" under "Manage". Press "New client secret", choose a name and confirm.
6. Copy the client secret and paste it as secret in the Python script.
7. Select "API permissions" under "Manage". Press "Add a permission", scroll down and select OneNote, choose "Delegated permissions" and check "Notes.Read" and "Notes.Read.All". Press "Add permissions".

Then you need to install the Python dependencies. Make sure you have Python 3.7 (or newer) installed and install the dependencies using the command pip install flask msal requests_oauthlib.

Now you can run the script. In a terminal, navigate to the directory where the script is located and run it using python onenote_export.py. This will start a local web server on port 5000.

In your browser navigate to http://localhost:5000 and log in to your Microsoft account. The first time you do it, you will also have to accept that the app can read your OneNote notes. (This does not give any third parties access to your data, as long as you don't share the client id and secret you created on the Azure portal). After this, go back to the terminal to follow the progress.

Note: Microsoft limits how many requests you can make within a given time period. Therefore, if you have many notes, you might eventually see messages like this in the terminal: Too many requests, waiting 20s and trying again. This is not a problem, but it means the entire process can take a while. Also, the login session can expire after a while, which results in a TokenExpiredError. If this happens, simply reload http://localhost:5000 and the script will continue (skipping the files it already downloaded).
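
Since the original goal was one markdown file per page, the downloaded HTML still needs a pass through pandoc. A rough sketch of that last step follows, under assumptions I have not verified against the script: that each page is saved with a .html extension, and that pandoc is on your PATH. html_tree_to_markdown is a made-up name, and the .md files are simply written next to their .html sources.

```shell
#!/bin/bash
# Walk a folder tree and convert every .html page to markdown in place,
# keeping the notebook/section folder structure untouched.
# html_tree_to_markdown is a hypothetical helper; the .html extension is an
# assumption about how the export script names its output.
html_tree_to_markdown() {
    local root html
    root="${1:-.}"                      # default: current directory
    find "$root" -type f -name '*.html' | while IFS= read -r html; do
        # page.html -> page.md in the same folder; stdin is redirected so
        # pandoc cannot swallow the list of files coming from find.
        pandoc --from=html --to=markdown --wrap=preserve \
               "$html" -o "${html%.html}.md" < /dev/null
    done
}
```

For example, html_tree_to_markdown ~/onenote-export would convert everything below that (hypothetical) folder. Relative image links should keep working as long as the images stay where the script put them.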

Solution 3:

To export your OneNote pages to individual markdown (.md) files, install Joplin and Evernote.

As suggested in this link, first you import the notes into Evernote. Then export all the notes into a .enex file from Evernote and import them into Joplin.

Joplin has the option to export the notes as .md files.

Note: I suggest using tags in Evernote beforehand if you want to group your notes, since the way Evernote keeps hierarchy between notes is different from OneNote's.

Solution 4:

Finally, someone has resolved this once and for all. All the above methods might work, but they have pros and cons. Migrating through Evernote and then Notion or Jotterpad, or doing it manually by exporting as .mht or .xps files, then to .html, and then to markdown, all have the drawback that 'complex' OneNote pages won't be copied in a workable way: if you have an image somewhere random on the page, your whole Markdown file will be a table, and working in it will be hell.

SjoerdV has written a PowerShell script which exports all the notebooks you have open to .docx documents and then converts them all into Markdown, keeping your whole file hierarchy, keeping images, everything! You can find it here: https://github.com/SjoerdV/ConvertOneNote2MarkDown

All you have to do: clone the repository (or get the .ps1 file (PowerShell script) from GitHub), open a command line in the directory where you've got that .ps1 file, run it (type .\ConvertOneNote2MarkDown.ps1), provide the full directory path where you wish to drop the files, and it starts running for you.

Don't forget to thank the original author for helping you out of your OneNote lock-in!

Drawback: only works on Windows...

Edit: forgot to add the link
