Automated Web-Based Book Page Content Extraction and HTML Transformation Framework

Problem

The project involved processing multiple subject-based book web pages and converting them into structured HTML files.

  1. Each book consisted of multiple pages that needed to be navigated and saved as HTML files after performing specific content modifications.
  2. A significant amount of repetitive code existed across the HTML files, which required standardization.
  3. Images had to be downloaded, and embedded YouTube links extracted, from multiple book pages and then inserted into their respective HTML files.
  4. All generated HTML files needed to be organized into module-wise folders for each subject.
  5. Images had to be stored in a separate folder, distinct from the HTML files.

Performing these tasks manually was highly time-consuming and posed a high risk of human error.

Solution

To address these challenges, an automated solution was designed and implemented using Selenium with Java. The solution was divided into modular scripts, each responsible for a specific task:

  1. Page Navigation and HTML Content Extraction
    • Selenium WebDriver was used to iterate through multi-level book pages.
    • Selenium's JavascriptExecutor interface was used to retrieve page content dynamically and modify the HTML as required.
  2. HTML Transformation and Persistence
    • Custom DOM manipulation scripts were executed via JavaScript to normalize repetitive HTML blocks.
    • The modified HTML files were saved into module-wise folders using FileWriter and buffered I/O streams.
  3. Media Asset Extraction
    • Image resources were extracted by parsing img[src] attributes from the DOM.
    • Images were downloaded using Java’s URLConnection and saved to module-wise folders.
    • Embedded YouTube video URLs were extracted by navigating into iframe elements and saved module-wise.
  4. Post-processing and Structural Normalization
    • The Jsoup library was used to modify common HTML attributes across all files and to insert the extracted images and video URLs into the HTML files of every subject, module by module.
  5. File Handling and Organization
      Java NIO (Files.move, Paths, and DirectoryStream) was used to implement:
      • Subject-wise folder creation
      • Module-based HTML file relocation
      • Media asset segregation into centralized and subject-specific directories
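
The download and file-organization steps above (items 3 and 5) can be sketched with the Java standard library alone. This is a minimal illustration, not the project's actual code: the class name BookFileOrganizer, the method names, and the "<module>_<page>.html" file naming convention are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; names are illustrative only.
class BookFileOrganizer {

    // Downloads one resource via URLConnection into targetDir.
    // The file name is taken from the last segment of the URL path.
    static Path download(URL source, Path targetDir) throws IOException {
        String path = source.getPath();
        String fileName = path.substring(path.lastIndexOf('/') + 1);
        if (fileName.isEmpty()) {
            throw new IOException("URL has no file name: " + source);
        }
        Files.createDirectories(targetDir);
        Path target = targetDir.resolve(fileName);
        URLConnection connection = source.openConnection();
        try (InputStream in = connection.getInputStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    // Moves every "<module>_<page>.html" file from stagingDir into
    // subjectDir/<module>/ and returns the number of files moved.
    // The underscore naming convention is an assumption of this sketch.
    static int organize(Path stagingDir, Path subjectDir) throws IOException {
        // Collect first, so files are not moved while the directory is being listed.
        List<Path> htmlFiles = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(stagingDir, "*.html")) {
            for (Path file : stream) {
                htmlFiles.add(file);
            }
        }
        int moved = 0;
        for (Path file : htmlFiles) {
            String name = file.getFileName().toString();
            int split = name.indexOf('_');
            if (split <= 0) {
                continue; // skip files that do not follow the assumed convention
            }
            Path moduleDir = subjectDir.resolve(name.substring(0, split));
            Files.createDirectories(moduleDir);
            Files.move(file, moduleDir.resolve(name), StandardCopyOption.REPLACE_EXISTING);
            moved++;
        }
        return moved;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: BookFileOrganizer <stagingDir> <subjectDir>");
            return;
        }
        int moved = organize(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("Moved " + moved + " HTML files");
    }
}
```

Keeping the download target as a parameter lets the same helper write images to a folder separate from the HTML files, as the requirements describe. Files that do not match the assumed naming convention are left in place rather than guessed at.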

Results

The automated solution delivered significant improvements:

  • Reduced processing time from an estimated year of manual effort to a few weeks
  • Eliminated repetitive manual coding tasks
  • Minimized the risk of human error
  • Ensured consistent folder structure and standardized HTML formatting across all subjects

Business Impact (Before and After)

| Metric | Before Automation | After Automation |
| --- | --- | --- |
| Content Extraction Time | Manual processing of each book page; several hours to days per subject | Automated extraction; completed in a few days per subject |
| Media (Images & YouTube URLs) Handling | Manual download and embedding; high chance of broken links | Automatic extraction, download, and embedding with accurate paths and folder mapping |
| HTML Standardization | Highly inconsistent; repetitive code manually edited across files | Uniform, fully standardized HTML structure via automated DOM and Jsoup transformations |
| Manual Errors | Frequent errors in HTML formatting, missing media, and incorrect folder structure | Human error has been nearly eliminated; consistent and reliable output |
| Folder Structuring | Manual creation of subject/module folders; prone to mistakes | Automated module-wise and subject-wise folder generation using Java NIO |
| Team Productivity | High manual workload; time spent on repetitive tasks | Significant time savings; team focuses on high-value tasks |
| Scalability | Difficult to scale; effort increased linearly with more subjects/modules | Highly scalable; new subjects/modules processed with minimal changes |
| Reusability of Code/Framework | No reusable components; work repeated for each set of pages | Modular automation scripts reused across all subjects and future projects |

Development Timeline

The end-to-end automation framework was rapidly designed and developed within just a few days to meet the evolving requirements of the development team. The work included requirement analysis, modular script development, iterative testing, and validation across multiple subjects.

What would have taken nearly a year of manual effort was transformed into a fully automated workflow capable of completing the same volume of work in a few weeks.

Through this automated process, the system successfully generated:

  • 2,208 structured folders
  • Over 114,000 HTML files across numerous subjects
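
Totals of this kind can be re-checked directly against the generated output tree. The following is a minimal sketch, assuming all output sits under a single root directory; the class name OutputTally and its method are illustrative and not part of the original framework.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical verification helper; names are illustrative only.
class OutputTally {

    // Returns {folderCount, htmlFileCount} for everything below root.
    // The root directory itself is not counted as a folder.
    static long[] tally(Path root) throws IOException {
        long folders = 0;
        long htmlFiles = 0;
        try (Stream<Path> paths = Files.walk(root)) {
            for (Path p : (Iterable<Path>) paths::iterator) {
                if (p.equals(root)) {
                    continue;
                }
                if (Files.isDirectory(p)) {
                    folders++;
                } else if (p.getFileName().toString().toLowerCase().endsWith(".html")) {
                    htmlFiles++;
                }
            }
        }
        return new long[] {folders, htmlFiles};
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: OutputTally <outputRoot>");
            return;
        }
        long[] totals = tally(Paths.get(args[0]));
        System.out.println(totals[0] + " folders, " + totals[1] + " HTML files");
    }
}
```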

Conclusion

The automation framework delivered for this project goes far beyond a simple script: it is a reusable foundation for large-scale content transformation. It showcases Sarvika’s expertise in building robust, future-ready automation projects that can process thousands of pages with accuracy, consistency, and speed.