Beautiful Soup is one of the most popular and useful libraries in Python for web scraping and data extraction purposes. In this comprehensive guide, we will go far beyond basic usage and really dive into advanced techniques for production-grade web scraping with Beautiful Soup.
A Brief History of Beautiful Soup
Before jumping into the technical details, let's briefly go over the history and background of this excellent library:
- Original version created in 2004 by Leonard Richardson to parse broken HTML code
- Named after the Mock Turtle's song "Beautiful Soup" from Alice in Wonderland
- Integrates well with popular Python web scraping tools like Requests and Selenium
- Official Beautiful Soup 4 release in 2012 modernized the library for Python 3 and added new features
- Active development continues, with new 4.x releases arriving regularly
- Available via pip and conda installs on all major platforms and environments
The creator summarizes the value of Beautiful Soup well:
"Programming is more fun when the tools you use aren't opaque and frustrating."
Let's now dive into setup and start using this productive library for our web scraping projects!
Installation and Setup
Beautiful Soup 4 runs on Python 3 (older releases also supported Python 2.7). We recommend using Python 3 for all new development.
Installation is simple using pip:
pip install beautifulsoup4
Or conda for Anaconda and data science focused environments:
conda install beautifulsoup4
That's it! We are now ready to start using this versatile library for parsing HTML and extracting valuable data from websites.
Importing and Creating BeautifulSoup Objects
In your Python code, import Beautiful Soup 4 and create an object to parse your target webpage content:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page_html, 'html.parser')
Here, page_html can be a variable holding the source HTML, or content fetched using Requests, Selenium, etc. The second argument selects the parser Beautiful Soup uses to interpret and navigate the document. Common options include 'html.parser', 'lxml' and 'html5lib'.
Now let's start using this soup object to extract useful information from HTML and XML documents!
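To make this concrete, here is a minimal, self-contained sketch that parses a hard-coded snippet instead of a fetched page (the page_html sample below is invented for illustration):

```python
from bs4 import BeautifulSoup

# A small HTML snippet standing in for fetched page content (hypothetical sample)
page_html = """
<html>
  <head><title>Example Page</title></head>
  <body><p>Hello, soup!</p></body>
</html>
"""

# Build the parse tree with the built-in parser
soup = BeautifulSoup(page_html, "html.parser")
print(soup.title.string)  # Example Page
```

Swapping "html.parser" for "lxml" or "html5lib" here changes only how forgiving and how fast the parsing is; the navigation API stays the same.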
Parsing and Navigating Web Pages
A key concept in Beautiful Soup is the parse tree – an XML/HTML document converted into a navigable tree of Python objects representing tags, attributes and text.
Beautiful Soup provides intuitive ways to traverse this parse tree and extract information using methods like:
soup.title – get the <title> tag
soup.p – get the first <p> tag
soup.find_all('div') – find all <div> tags
This makes scraping structured data from tag attributes and content much simpler than manual string processing.
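A quick sketch of these accessors against a made-up snippet (the HTML below is a hypothetical sample):

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title>News</title></head>"
    "<body><p>First paragraph</p><div>One</div><div>Two</div></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

print(soup.title)                  # The whole <title> tag
print(soup.p.string)               # Text of the first <p> tag
print(len(soup.find_all("div")))   # Number of <div> tags found
```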
Searching for Tags and Attributes
BeautifulSoup offers a wide range of methods like find(), find_all() and find_parents() to filter for tags with specific names, attributes, and textual content.
For example, to get all <p> tags with the class "summary":
summary_paras = soup.find_all('p', class_='summary')
We can retrieve further details like the .text attribute to access the textual contents. These search methods act on tag names, attributes, and nested navigational properties to zero in on the exact data you need to extract.
Accessing Tag Content and Attributes
In Beautiful Soup, searches and tree navigation mostly return full-fledged Tag objects. Useful properties to access data from these tag objects include:
tag.name – the tag's name
tag.attrs – dictionary of attributes
tag['class'] – get value of 'class' attribute
tag.string – the tag's single string child, if any
tag.text – all nested text, concatenated
tag.contents – list of children
tag.children – generator over children
For example:
first_p_tag = soup.p
print(first_p_tag.name)    # 'p'
print(first_p_tag.string)  # The paragraph's string, if it has exactly one
print(first_p_tag.text)    # All nested text as one plain string
So tags can contain both attributes in a dictionary and nested child objects – providing flexibility to access all kinds of structured data.
Extracting Data with find(), find_all() and select()
Now let‘s focus on some workhorse methods that you will rely on regularly in Beautiful Soup powered scraping scripts.
find()
Returns only the first matching tag or attribute based on provided filters in the parse tree.
single_result = soup.find('div', class_='article')
Useful for getting a very specific, individual result.
find_all()
Returns a list of all matching tags/attributes in the parse tree for supplied filters.
all_articles = soup.find_all('div', class_='article')
Helpful for extracting collections and datasets from page contents.
select()
CSS selectors provide another powerful way to specify required elements. Select filters using CSS id "#" and class "." syntax.
headlines = soup.select('#main .headline')
Returns a list matching the CSS style selectors.
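The three methods can be compared side by side on a small made-up fragment (the #main / .headline structure below is a hypothetical sample):

```python
from bs4 import BeautifulSoup

html = """
<div id="main">
  <div class="article"><h2 class="headline">A</h2></div>
  <div class="article"><h2 class="headline">B</h2></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("div", class_="article")             # first match only
all_articles = soup.find_all("div", class_="article")  # list of all matches
headlines = soup.select("#main .headline")             # CSS selector syntax

print(first.h2.text)                # A
print(len(all_articles))            # 2
print([h.text for h in headlines])  # ['A', 'B']
```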
Working with Navigation Trees and Traversal
While find methods search top down through the entire parse tree, you can also systematically traverse through the tree using navigation properties:
.contents – list of child tag objects
.children – generator over children
.descendants – all nested descendants
.parent – direct parent
.parents – generator over the full ancestry
For example:
outer_tag = soup.find('div', class_='outer')
for child in outer_tag.contents:
    print(child)  # Print direct children
for descendant in outer_tag.descendants:
    print(descendant)  # Nested child tags and strings
This allows methodically iterating and accessing different areas of interest in complex parse trees.
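A runnable sketch of .children versus .descendants on an invented 'outer' div:

```python
from bs4 import BeautifulSoup

# Hypothetical nested markup
html = "<div class='outer'><p>One</p><p><b>Two</b></p></div>"
soup = BeautifulSoup(html, "html.parser")
outer_tag = soup.find("div", class_="outer")

# Direct children only: the two <p> tags
child_names = [c.name for c in outer_tag.children]
print(child_names)

# All nested descendants, including text nodes and the inner <b>
print(len(list(outer_tag.descendants)))
```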
Handling Different Data Types
While navigating and searching, you will come across tag objects with different kinds of data:
- String values like paragraph text
- Lists and tuples containing multiple values
- Nested child tags and sub-trees
Methods like .text and .strings help handle them in your code:
for string in soup.stripped_strings:
    print(string)  # Loop through whitespace-stripped strings
for sibling in soup.tr.next_siblings:
    print(sibling)  # Navigate sibling tags
Datatype checks using isinstance() also help process values correctly. This flexibility allows Beautiful Soup to handle even poorly structured data effectively.
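Here is a small sketch of both techniques (the <div> snippet is a made-up sample):

```python
from bs4 import BeautifulSoup, NavigableString, Tag

html = "<div> <p>Alpha</p> <p>Beta</p> </div>"
soup = BeautifulSoup(html, "html.parser")

# stripped_strings skips whitespace-only text nodes
print(list(soup.stripped_strings))  # ['Alpha', 'Beta']

# isinstance checks distinguish text nodes from nested tags
for node in soup.div.children:
    if isinstance(node, NavigableString):
        print("text node:", repr(str(node)))
    elif isinstance(node, Tag):
        print("tag:", node.name)
```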
Dealing with Common Scraping Issues
Real world HTML can be messy and lead to missing data or encoding problems.
Here are some ways Beautiful Soup helps tackle typical scraping issues:
- Replace missing tags or attributes with defaults using the .get() method
- Catch encoding related errors and handle them by passing custom parsers like lxml
- Check for None return values before accessing properties
article = soup.find('div', 'article')
if article is not None:  # Check before accessing properties
    title = article.get('title', default='No title')  # Default for a missing attribute
    try:
        print(article.text[:50])
    except UnicodeEncodeError:
        pass  # Handle the encoding issue here
Robust handling of missing data, encoding issues and edge cases leads to reliable extraction workflows.
Best Practices for Writing Reliable Scraping Code
Like all programs dealing with external, unpredictable data, scraping code should assume little about the completeness, encoding formats and general tidiness of the HTML it receives.
Here are some best practices I follow for writing resilient Beautiful Soup scrapers:
- Be generous with exception handling blocks – escape gracefully
- Log critical output as you go, so a later crash does not lose it
- Expect and catch None return values with default behaviour
- Validate parsed data before processing further
- Handle chunks of HTML independently to limit errors spreading
- Quickly print() and check small pieces before building complex nested handling
Getting basics working first – then expanding in complexity often leads to robust scraping code.
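The points above can be sketched as a chunk-by-chunk loop; parse_article and the sample chunks below are hypothetical names invented for illustration, not a prescribed structure:

```python
from bs4 import BeautifulSoup

def parse_article(article_html):
    """Parse one HTML chunk; fall back to defaults instead of crashing. (Hypothetical helper.)"""
    soup = BeautifulSoup(article_html, "html.parser")
    title_tag = soup.find("h1")
    # Expect None and use a default rather than raising AttributeError
    title = title_tag.text if title_tag is not None else "No title"
    return {"title": title}

# Each chunk is handled independently, so one bad fragment can't break the batch
chunks = ["<h1>Good</h1>", "<p>No heading here</p>"]
results = []
for chunk in chunks:
    try:
        results.append(parse_article(chunk))
    except Exception as exc:
        print("skipping chunk:", exc)  # Escape gracefully, keep going

print(results)
```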
Comparing Beautiful Soup with Other Python Scraping Libraries
There are several excellent parsing and scraping libraries in Python – but Beautiful Soup remains one of the most popular and newbie friendly options. Here's how some alternatives like Scrapy and lxml compare:
| | Beautiful Soup | Scrapy | lxml |
|---|---|---|---|
| Best For | Navigating existing HTML | Large scale crawling | Very fast parsing |
| Learning Curve | Easy | Moderate | Moderate |
| Speed | Good | Very Fast | Extremely Fast |
| Features | HTML/XML parsing | Spidering, scaling | XML/HTML parsing |
| Ease of Debugging | Print tag contents | Can require echo pipeline | Print document nodes |
| Synchronous/Asynchronous | Synchronous | Asynchronous | Synchronous |

As highlighted in this table, combining Beautiful Soup with libraries like Scrapy and lxml gives you the flexibility to handle everything from simple data extraction to large scale distributed web crawling.
Real World Use Cases and Examples
While simple tutorials focus on parsing just a few tags – real world scraping requires building robust data pipelines from raw HTML to usable outputs.
Here are some examples of end-to-end workflows powered by Beautiful Soup:
Scraping News Articles
- Start from homepage or RSS feeds
- Extract item links with find_all()
- Iterate over links to scrape full article HTML
- Cleanly parse article sections despite missing fields
- Strip HTML tags while retaining text formatting
- Build JSON output with title, authors, publish date, topics, paragraph contents
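A condensed sketch of the parsing stage of such a pipeline; the article markup, class names and output fields below are hypothetical, and a real pipeline would also fetch pages and cope with missing fields per the steps above:

```python
import json
from bs4 import BeautifulSoup

# Hypothetical article markup; real pages vary and fields may be absent
article_html = """
<article>
  <h1>Soup Levels Rise</h1>
  <span class="author">A. Writer</span>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</article>
"""
soup = BeautifulSoup(article_html, "html.parser")

author_tag = soup.find("span", class_="author")
record = {
    "title": soup.h1.text if soup.h1 else None,        # Tolerate a missing headline
    "author": author_tag.text if author_tag else None,  # Tolerate a missing byline
    "paragraphs": [p.text for p in soup.find_all("p")],
}
print(json.dumps(record, indent=2))
```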
Creating Product Catalogs
- Crawl category pages and find product listings
- Standardize parsing despite diverse HTML layouts
- Scrape current price, images, descriptions, SKU
- Normalize inconsistent naming like size variants
- Generate Excel sheet or CSV catalog output
Search Engine Crawling
- Crawl result pages for search keywords
- Parse out <a> links, snippets and metadata
- Rate page relevance based on search terms
- Filter dead and low quality links
- Enrich documents with related entities, topics and facts
The common theme here is robust workflows to cleanly transform diverse HTML data into usable datasets – instead of just extracting a few sample tags from minimal examples.
Integrating Scraping Code into Apps and Crawlers
While great for learning and testing on small samples, Beautiful Soup code isn't designed to run standalone for large scale production workflows.
Here are some ways to integrate parsers effectively into web apps and scalable crawlers:
Build Pipelines for Data Collection
Rather than running Beautiful Soup snippets independently:
- Create executable scripts to run parsing jobs
- Parameterize configurations like start pages and tags to target
- Chain together with other harvester scripts into a data pipeline
- Add steps for data validation, storage and processing apps
This makes your project modular, configurable and scalable.
Schedule and Orchestrate Tasks
Static scripts still have limitations for real world crawling and extraction needs:
- Processing limits on single machines
- Difficulty debugging and updating pipelines
- No continuity between scrape attempts
Tools like Apache Airflow allow robust orchestration by providing:
- Graphical pipeline design
- Schedule timed jobs and dependencies
- Monitor and restart failed tasks
- Store interim state for continuity
- Scale across worker machines
This productionizes scraping code for resilient execution.
Latest Features in Beautiful Soup 4.4+
While Beautiful Soup provides a stable API for parsing, the library continues to improve with each new release.
Here are some useful updates in the latest 4.4+ versions:
- CSS Selectors: Compact querying syntax based on element class and ID strings
- Recovers from bad HTML: Skip tags with missing end quotes or angle brackets instead of crashing
- New navigation properties: .previous_elements and .next_elements to traverse the document in parse order
- Speed improvements: faster .decompose() method to remove tree branches
- Python type hints: improve static type checking and IDE support for code completion/analysis
So definitely keep your Beautiful Soup version updated to leverage all these benefits!
Tips and Tricks for Improved Scraping Performance
While Beautiful Soup provides a very intuitive API – some methods and approaches are better than others for efficiency and performance reasons when dealing with large datasets.
Here are some tips that can give your scripts a speed boost:
- Minimize DOM tree depth with .contents and .children rather than deep recursion with .descendants etc.
- Use generators like .children and .strings for lazy parsing instead of creating full lists in memory
- Avoid disk I/O when possible – stream response content directly into Beautiful Soup
- Parse only required sections instead of converting entire pages
- Simplify CSS selectors based on ID and direct descendants rather than long paths
- Install C libraries like lxml for faster XML/HTML parsing
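The "parse only required sections" tip maps to Beautiful Soup's SoupStrainer class, which builds a tree from just the tags you care about; a minimal sketch (the sample HTML is invented):

```python
from bs4 import BeautifulSoup, SoupStrainer

html = """
<html><body>
  <div class="sidebar"><p>Ignore me</p></div>
  <a href="/one">One</a>
  <a href="/two">Two</a>
</body></html>
"""

# Parse only <a> tags instead of building a tree for the whole page
only_links = SoupStrainer("a")
soup = BeautifulSoup(html, "html.parser", parse_only=only_links)

print([a["href"] for a in soup.find_all("a")])  # ['/one', '/two']
```

Note that parse_only works with 'html.parser' and 'lxml' but is ignored by 'html5lib', which always builds the full tree.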
Profiling scripts with %timeit and %prun helps identify optimization opportunities.
Debugging Tips and Troubleshooting Errors
Here is a quick troubleshooting guide for some common errors and unexpected results while scraping with Beautiful Soup:
| Issue | Troubleshooting Tips |
|---|---|
| HTML parsing errors | Switch parsers from 'html.parser' to 'lxml' or 'html5lib' |
| Can't find tags | Print the tag variable and inspect attributes; ensure case sensitivity is correct |
| None return value | Print and check the result before accessing it; handle Nones |
| UnicodeErrors | Catch the exception, print encoding details, pass a custom encoding to BeautifulSoup |
| Scripts running slow | Use Python profiling hooks to detect slow sections; optimize navigation |
| Same tag matches multiple times | Narrow down by adding more attributes/conditions for uniqueness |
| Script hangs | Enable timeouts; check for invalid HTTP responses or encoding issues |

Don't forget to leverage Python debugging staples like print statements liberally and exceptions effectively.
Legal and Ethical Considerations
While Beautiful Soup is a useful tool for extracting publicly available data – be mindful of intellectual property rights, anti-scraping policies and reasonable usage when building your web scrapers.
Some things to keep in mind:
- Restrict high frequency requests to avoid overloading publisher sites
- Identify yourself transparently via user-agent strings
- Adhere to robots.txt and scraping guidelines where specified
- Consider caching result data locally for reuse instead of hitting sites repeatedly
- Transform and enrich scraped data to provide additional analysis and insights
By respecting data sources and providing utility rather than duplication – your web scrapers can co-exist sustainably alongside content publishers.
I hope this detailed guide gives you a comprehensive overview of capabilities, advanced techniques and real world integration ideas for production-grade web scraping using Python's versatile Beautiful Soup library.
The simple yet powerful APIs make it approachable for novices, while support for modifying parsers and encodings provides flexibility for industrial strength usage.
Combine your new expertise in search methods, traversal and robust practices from this post with high concurrency frameworks like Scrapy and Airflow as needed to build performant and scalable data harvesting solutions.
So grab some coffee and code up your next beautiful web scraper with BeautifulSoup!
Let me know in the comments if you have any other questions. Happy parsing!