Wednesday, January 08, 2025

How You Can Squeeze Notebook LM


Notebook LM claims to be happy with a URL. Try it and you find it doesn't do much - in fact, nothing at all.

So, what *can* you do?

Download the website with httrack. Say it's one of those ancient online books with a TOC and Previous/Next buttons on each page.

Then, since Notebook LM takes markdown, use the script below to pull the text from each page *meaningfully*. That is, you're betting Notebook LM is smart enough to make something out of the headings. Otherwise, you could just give it plain text. Ja? :) The script needs the html2text package (pip install html2text).

You might also need to ask ChatGPT for a bash script to combine all the text files into one, since Notebook LM says something about a 50-file limit.

import sys
import os
import html2text


def convert_html_to_text(input_path, output_path):
    # Read the HTML content from the input file
    try:
        with open(input_path, "r", encoding="utf-8") as infile:
            html_content = infile.read()
    except FileNotFoundError:
        print(f"Error: The file '{input_path}' was not found.")
        sys.exit(1)
    except Exception as e:
        print(f"Error reading the file '{input_path}': {e}")
        sys.exit(1)

    # Convert HTML to text
    text_maker = html2text.HTML2Text()
    text_maker.ignore_links = True  # Optional: Ignore links in the output
    text = text_maker.handle(html_content)

    # Write the plain text to the output file
    try:
        with open(output_path, "w", encoding="utf-8") as outfile:
            outfile.write(text)
        print(f"Text successfully written to '{output_path}'.")
    except Exception as e:
        print(f"Error writing to the file '{output_path}': {e}")
        sys.exit(1)


if __name__ == "__main__":
    # Ensure correct usage
    if len(sys.argv) != 3:
        print("Usage: python html2text_converter.py <input_file> <output_file_or_directory>")
        sys.exit(1)

    # Get command-line arguments
    input_file = sys.argv[1]
    output_arg = sys.argv[2]

    # Check if input file exists
    if not os.path.isfile(input_file):
        print(f"Error: The input file '{input_file}' does not exist.")
        sys.exit(1)

    # Determine output path
    if os.path.isdir(output_arg):
        # If output_arg is a directory, create the output file path
        output_file = os.path.join(
            output_arg, os.path.splitext(os.path.basename(input_file))[0] + ".md"
        )
    else:
        # If output_arg is a file, use it directly
        output_file = output_arg

    # Perform conversion
    convert_html_to_text(input_file, output_file)

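And then (create the destination directory first, otherwise the script treats the name as a single output file and every page overwrites the last):

$ mkdir -p destination_dir
$ for fil in download_folder_name/www.websitename.com/somefolder/text/*.html ; do python3 script.py "$fil" destination_dir; done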

Then, just upload the files from this destination directory and you're ready to start chatting with your data. It does work!
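
If the destination directory ends up with more files than Notebook LM will take, the merging step can also be done in Python instead of bash. What follows is only a rough sketch, not something tested against Notebook LM itself: the function name merge_markdown, the bundle_NNN.md file names and the merged_dir folder are all made up for illustration, and the default of 50 simply mirrors the limit mentioned above rather than a verified figure. It adds a heading with each page's file name so you can still tell the pages apart inside a bundle.

```python
from pathlib import Path


def merge_markdown(source_dir, output_dir, max_files=50):
    """Merge the .md files in source_dir into at most max_files bundle files."""
    # Pages are taken in file-name order (plain string sort).
    pages = sorted(Path(source_dir).glob("*.md"))
    if not pages:
        return []

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Ceiling division: how many pages each bundle must hold so that
    # the number of bundles never exceeds max_files.
    per_bundle = -(-len(pages) // max_files)

    bundles = []
    for start in range(0, len(pages), per_bundle):
        chunk = pages[start:start + per_bundle]
        target = out / f"bundle_{start // per_bundle + 1:03d}.md"
        # Prefix each page with a heading made from its file name.
        parts = [
            f"# {page.stem}\n\n{page.read_text(encoding='utf-8').strip()}\n"
            for page in chunk
        ]
        target.write_text("\n".join(parts), encoding="utf-8")
        bundles.append(target)
    return bundles
```

Calling merge_markdown("destination_dir", "merged_dir") would write the bundles into merged_dir. Keep the output folder separate from the source folder, or a second run will pick up the bundles as pages.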
