Tech-savvy time-saver: Notebook LM

Wednesday, January 08, 2025

How You Can Squeeze Notebook LM

Notebook LM claims to be happy with a URL as a source. Try it, and you find it doesn't do much. In fact, it does nothing at all.

So, what *can* you do?

Download the website - say it's one of those ancient online books with a TOC and Previous/Next buttons on each page. Use httrack.

Then use the script below to pull the text from each page *meaningfully*, that is, as Markdown, since Notebook LM accepts Markdown. The bet here is that Notebook LM is smart enough to make something of the headings. Otherwise, you could just give it plain text. Ja? :)

Yes, you might need to ask ChatGPT for a bash script to put all the text files into one, since Notebook LM says something about a 50-file limit.


import sys
import os
import html2text  # third-party package: pip install html2text

def convert_html_to_text(input_path, output_path):
    # Read the HTML content from the input file
    try:
        with open(input_path, "r", encoding="utf-8") as infile:
            html_content = infile.read()
    except FileNotFoundError:
        print(f"Error: The file '{input_path}' was not found.")
        sys.exit(1)
    except Exception as e:
        print(f"Error reading the file '{input_path}': {e}")
        sys.exit(1)
    
    # Convert HTML to text
    text_maker = html2text.HTML2Text()
    text_maker.ignore_links = True  # Optional: Ignore links in the output
    text = text_maker.handle(html_content)
    
    # Write the plain text to the output file
    try:
        with open(output_path, "w", encoding="utf-8") as outfile:
            outfile.write(text)
        print(f"Text successfully written to '{output_path}'.")
    except Exception as e:
        print(f"Error writing to the file '{output_path}': {e}")
        sys.exit(1)

if __name__ == "__main__":
    # Ensure correct usage
    if len(sys.argv) != 3:
        print("Usage: python html2text_converter.py <input_file> <output_file_or_directory>")
        sys.exit(1)
    
    # Get command-line arguments
    input_file = sys.argv[1]
    output_arg = sys.argv[2]
    
    # Check if input file exists
    if not os.path.isfile(input_file):
        print(f"Error: The input file '{input_file}' does not exist.")
        sys.exit(1)
    
    # Determine output path
    if os.path.isdir(output_arg):
        # If output_arg is a directory, create the output file path
        output_file = os.path.join(
            output_arg,
            os.path.splitext(os.path.basename(input_file))[0] + ".md"
        )
    else:
        # If output_arg is a file, use it directly
        output_file = output_arg
    
    # Perform conversion
    convert_html_to_text(input_file, output_file)

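And then (with the script saved as html2text_converter.py; destination_dir must already exist, otherwise every page gets written over the same single file):
$ mkdir -p destination_dir
$ for fil in download_folder_name/www.websitename.com/somefolder/text/*.html ;
do python3 html2text_converter.py "$fil" destination_dir; done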

Then, just upload the files from this destination directory and you're ready to start chatting with your data. It does work!
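
If the destination directory ends up with more files than the limit mentioned earlier allows, the Markdown files can be merged into a smaller number of bundles before uploading. Here is a minimal Python sketch of that merge step, offered as one possible approach rather than the only way to do it. The function name merge_markdown, the bundle_NNN.md naming, and the default of 50 (taken from the limit mentioned above, not verified) are all assumptions for illustration.

```python
from pathlib import Path


def merge_markdown(src_dir, out_dir, max_files=50):
    """Merge the .md files in src_dir into at most max_files bundles in out_dir.

    Hypothetical helper: max_files defaults to 50 only because the post
    mentions a 50-file limit; adjust it to whatever the real limit is.
    """
    pages = sorted(Path(src_dir).glob("*.md"))
    if not pages:
        return []

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Ceiling division: how many pages each bundle has to hold
    # so that the number of bundles never exceeds max_files.
    per_bundle = -(-len(pages) // max_files)

    written = []
    for start in range(0, len(pages), per_bundle):
        chunk = pages[start:start + per_bundle]
        target = out / f"bundle_{start // per_bundle + 1:03d}.md"
        parts = []
        for page in chunk:
            body = page.read_text(encoding="utf-8").strip()
            # Head each page with its file name so the source stays traceable.
            parts.append(f"# {page.name}\n\n{body}\n")
        target.write_text("\n".join(parts), encoding="utf-8")
        written.append(target)
    return written
```

Calling merge_markdown("destination_dir", "merged_dir") would then leave at most 50 bundle files in merged_dir (a hypothetical directory name) to upload. Use a separate output directory, so that a second run does not pick up the bundles themselves as input. Pages are merged in sorted file-name order, which may not match the book's reading order.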