Showing posts with label notebook lm. Show all posts

Wednesday, January 08, 2025

How You Can Squeeze Notebook LM


Notebook LM claims to be happy with a URL as a source. Try it and you find it doesn't do much; in fact, nothing at all.

So, what *can* you do?

Download the website with httrack. Say it's one of those ancient online books with a TOC and Previous/Next buttons on each page.
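If you'd rather drive the download from Python, here is a minimal sketch. It assumes httrack is installed and on your PATH, and it relies only on the basic `httrack <url> -O <folder>` form; the function names and the example URL are made up for illustration.

```python
import shutil
import subprocess


def build_httrack_command(url, mirror_dir):
    """Return the argument list for mirroring `url` into `mirror_dir`."""
    # "-O" tells httrack where to put the mirrored files.
    return ["httrack", url, "-O", mirror_dir]


def mirror_site(url, mirror_dir):
    """Run httrack; raises if it is missing or exits with an error."""
    if shutil.which("httrack") is None:
        raise RuntimeError("httrack is not installed or not on PATH")
    subprocess.run(build_httrack_command(url, mirror_dir), check=True)
```

Calling `mirror_site("https://www.websitename.com/somefolder/", "download_folder_name")` should leave the pages under a folder named after the host inside `download_folder_name`.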

Then, use the script below to pull the text from each page *meaningfully*, that is, as markdown, since Notebook LM takes markdown. The bet is that Notebook LM is smart enough to make something out of headings. Else, you could just give it plain text. Ja? :) The script needs the html2text package (pip install html2text).

Yes, you might also need to ask ChatGPT for a bash script to merge the text files into one, since Notebook LM says something about a 50-file limit.
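If you'd rather stay in Python for the merging step, here is a minimal sketch. The function name, the `merged_NN.md` output names and the `# Source:` heading written before each page are my own choices, not anything Notebook LM asks for. The default of 50 just mirrors the limit mentioned above. Files are joined in alphabetical order, which may not match the book's reading order.

```python
from pathlib import Path


def merge_markdown(src_dir, out_dir, max_outputs=50):
    """Concatenate the .md files in src_dir into at most max_outputs files."""
    files = sorted(Path(src_dir).glob("*.md"))
    if not files:
        return []
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    # Ceiling division: how many input files go into each merged file.
    per_output = -(-len(files) // max_outputs)
    written = []
    for start in range(0, len(files), per_output):
        batch = files[start:start + per_output]
        target = out_path / f"merged_{len(written) + 1:02d}.md"
        with open(target, "w", encoding="utf-8") as out:
            for md_file in batch:
                # Label each page with its file name so pages stay distinguishable.
                out.write(f"\n\n# Source: {md_file.name}\n\n")
                out.write(md_file.read_text(encoding="utf-8"))
        written.append(target)
    return written
```

For example, `merge_markdown("destination_dir", "merged")` run after the conversion step would leave no more than 50 files in `merged`, however many pages the book had.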

    import sys
    import os
    import html2text


    def convert_html_to_text(input_path, output_path):
        # Read the HTML content from the input file
        try:
            with open(input_path, "r", encoding="utf-8") as infile:
                html_content = infile.read()
        except FileNotFoundError:
            print(f"Error: The file '{input_path}' was not found.")
            sys.exit(1)
        except Exception as e:
            print(f"Error reading the file '{input_path}': {e}")
            sys.exit(1)

        # Convert HTML to text
        text_maker = html2text.HTML2Text()
        text_maker.ignore_links = True  # Optional: Ignore links in the output
        text = text_maker.handle(html_content)

        # Write the plain text to the output file
        try:
            with open(output_path, "w", encoding="utf-8") as outfile:
                outfile.write(text)
            print(f"Text successfully written to '{output_path}'.")
        except Exception as e:
            print(f"Error writing to the file '{output_path}': {e}")
            sys.exit(1)


    if __name__ == "__main__":
        # Ensure correct usage
        if len(sys.argv) != 3:
            print("Usage: python html2text_converter.py <input_file> <output_file_or_directory>")
            sys.exit(1)

        # Get command-line arguments
        input_file = sys.argv[1]
        output_arg = sys.argv[2]

        # Check if input file exists
        if not os.path.isfile(input_file):
            print(f"Error: The input file '{input_file}' does not exist.")
            sys.exit(1)

        # Determine output path
        if os.path.isdir(output_arg):
            # If output_arg is a directory, create the output file path
            output_file = os.path.join(
                output_arg,
                os.path.splitext(os.path.basename(input_file))[0] + ".md"
            )
        else:
            # If output_arg is a file, use it directly
            output_file = output_arg

        # Perform conversion
        convert_html_to_text(input_file, output_file)

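And then (create the destination directory first; if it doesn't exist, the script treats destination_dir as a file name and every page overwrites the last one):

$ mkdir -p destination_dir
$ for fil in download_folder_name/www.websitename.com/somefolder/text/*.html ; do python3 script.py "$fil" destination_dir; done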

Then, just upload the files from this destination directory and you're ready to start chatting with your data. It does work!