Tech-savvy time-saver: How You Can Squeeze Notebook LM

Wednesday, January 08, 2025

How You Can Squeeze Notebook LM

Notebook LM claims to be happy with a URL as a source. Try it and you find it doesn't do much. In fact, it does nothing at all.

So, what *can* you do?

Download the website. Say it's one of those ancient online books with a TOC and Previous/Next buttons on each page: use httrack to mirror it to a local folder.

Then, use the script below to pull the text from each page *meaningfully*, as Markdown, since Notebook LM takes Markdown. That is, you're betting Notebook LM is smart enough to make something out of headings. Otherwise, you could just give it plain text. Ja? :)

Yes, you might also need to ask ChatGPT for a bash script to put all the text files into one, since Notebook LM says something about a 50-file limit.


import sys
import os
import html2text

def convert_html_to_text(input_path, output_path):
    # Read the HTML content from the input file
    try:
        with open(input_path, "r", encoding="utf-8") as infile:
            html_content = infile.read()
    except FileNotFoundError:
        print(f"Error: The file '{input_path}' was not found.")
        sys.exit(1)
    except Exception as e:
        print(f"Error reading the file '{input_path}': {e}")
        sys.exit(1)
    
    # Convert HTML to Markdown-formatted text (headings, lists and emphasis are kept)
    text_maker = html2text.HTML2Text()
    text_maker.ignore_links = True  # Optional: Ignore links in the output
    text = text_maker.handle(html_content)
    
    # Write the converted text to the output file
    try:
        with open(output_path, "w", encoding="utf-8") as outfile:
            outfile.write(text)
        print(f"Text successfully written to '{output_path}'.")
    except Exception as e:
        print(f"Error writing to the file '{output_path}': {e}")
        sys.exit(1)

if __name__ == "__main__":
    # Ensure correct usage
    if len(sys.argv) != 3:
        print("Usage: python html2text_converter.py <input_file> <output_file_or_directory>")
        sys.exit(1)
    
    # Get command-line arguments
    input_file = sys.argv[1]
    output_arg = sys.argv[2]
    
    # Check if input file exists
    if not os.path.isfile(input_file):
        print(f"Error: The input file '{input_file}' does not exist.")
        sys.exit(1)
    
    # Determine output path
    if os.path.isdir(output_arg):
        # If output_arg is a directory, create the output file path
        output_file = os.path.join(
            output_arg,
            os.path.splitext(os.path.basename(input_file))[0] + ".md"
        )
    else:
        # If output_arg is a file, use it directly
        output_file = output_arg
    
    # Perform conversion
    convert_html_to_text(input_file, output_file)

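And then (create the destination directory first; if it doesn't exist, the script treats the name as a single output file and overwrites it on every pass):
$ mkdir -p destination_dir
$ for fil in download_folder_name/www.websitename.com/somefolder/text/*.html ;
do python3 html2text_converter.py "$fil" destination_dir; done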
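
The shell loop starts a fresh Python process for every page. If you'd rather do the whole folder in one run, here is a minimal sketch. The names convert_tree and html_to_markdown are made up for this example and are not part of html2text. It assumes the pages are mostly UTF-8 (bad bytes are replaced rather than crashing), and it flattens sub-folders into the file name so two index.html files don't clobber each other.

```python
from pathlib import Path


def html_to_markdown(html: str) -> str:
    """Convert one HTML string to Markdown using the html2text package."""
    import html2text  # third-party; imported lazily so a custom converter can be used instead

    maker = html2text.HTML2Text()
    maker.ignore_links = True  # same setting as the single-file script
    return maker.handle(html)


def convert_tree(src_dir, dest_dir, convert=html_to_markdown):
    """Convert every .html file under src_dir into a .md file in dest_dir.

    Hypothetical helper for illustration. Returns the list of written paths.
    """
    src = Path(src_dir)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    written = []
    for page in sorted(src.rglob("*.html")):
        # Flatten "ch1/index.html" into "ch1__index.md" so same-named pages
        # from different folders do not overwrite each other.
        rel = page.relative_to(src).with_suffix(".md")
        out = dest / "__".join(rel.parts)

        # Assumption: pages are UTF-8; undecodable bytes are replaced.
        html = page.read_text(encoding="utf-8", errors="replace")
        out.write_text(convert(html), encoding="utf-8")
        written.append(out)
    return written
```

Call it as convert_tree("download_folder_name/www.websitename.com/somefolder/text", "destination_dir"). The convert argument is only there so you can swap in a different converter.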

Then, just upload the files from this destination directory and you're ready to start chatting with your data. It does work!
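
If the destination directory ends up with more files than Notebook LM accepts, the pages can be merged into a few bundles before uploading. The post suggests a bash script for this; the sketch below does the same job in Python instead, to match the converter. merge_markdown is a made-up name, and the default of 50 is just the limit mentioned in this post, not a value checked against Notebook LM's documentation. Files are merged in alphabetical order by name, which may not match the book's reading order.

```python
from pathlib import Path


def merge_markdown(src_dir, dest_dir, max_files=50):
    """Merge the .md files in src_dir into at most max_files bundles in dest_dir.

    Hypothetical helper for illustration. Returns the list of bundle paths.
    """
    files = sorted(Path(src_dir).glob("*.md"))
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    if not files:
        return []

    # Ceiling division: pages per bundle so the bundle count stays <= max_files.
    per_bundle = -(-len(files) // max_files)

    bundles = []
    for start in range(0, len(files), per_bundle):
        chunk = files[start:start + per_bundle]
        out = dest / f"bundle_{start // per_bundle + 1:03d}.md"
        parts = []
        for f in chunk:
            # Label each page with its file name so a passage can be traced back.
            body = f.read_text(encoding="utf-8").strip()
            parts.append(f"# Source file: {f.name}\n\n{body}\n")
        out.write_text("\n".join(parts), encoding="utf-8")
        bundles.append(out)
    return bundles
```

Use a separate output folder, for example merge_markdown("destination_dir", "merged_dir"), so a second run doesn't pick up the bundles themselves.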
