For our project, we decided to assess the usefulness of the physician rating website Vitals. To do so, we scraped data from the website and analyzed it using the pandas package. Ideally, we wanted to compare ratings across specialties and locations in order to draw conclusions about the perceived quality of doctors, but the heavily skewed ratings data limited the analyses we could conduct.
To run the project, several dependencies must be installed in addition to Python: selenium, beautifulsoup4, pandas, and matplotlib, along with a ChromeDriver executable matching the installed version of Chrome.
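For example, assuming pip is available, the third-party packages can be installed from a notebook cell (the names below are the standard PyPI package names; ChromeDriver itself is a separate download):

# Standard PyPI package names (assumed); ChromeDriver must be downloaded separately
!pip install selenium beautifulsoup4 pandas matplotlib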
Additionally, the file path for downloaded webpages and for the CSV must be set below:
# Be sure to include the final forward slash in the file path!
filepath = "C:/Desired/File/Path/"
Some problems were encountered while scraping the data. The search results are loaded by client-side JavaScript that needs to be executed in a browser, which posed initial complications; however, Selenium helped automate the process. Despite these problems, Python was well suited to the task and allowed us to complete this step with relatively few issues.
In past analyses, we have typically used R or SAS. While we are more comfortable with those languages, Python proved more than suitable for this analysis as well.
Data was scraped from the Vitals website, with only physicians from the state of New York included in our dataset. To accomplish this, we used webdriver from the selenium package along with BeautifulSoup. The scraped data was then exported to a CSV file using the csv package.
Data scraping proceeded as follows:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time

def save_page(url, out_file, wait_for_class='search-card'):
    """Save a page, using the Selenium browser automation framework to run any
    JavaScript, writing the resulting final DOM as HTML source to disk.
    Waits until an element of a given class is loaded to make sure the
    JavaScript has actually completed before the source is retrieved."""
    browser = webdriver.Chrome("C:/Path/To/Chromedriver.exe")
    browser.get(url)
    # Wait until an element with the given class is loaded
    while True:
        try:
            # if the element cannot be found, this will throw
            browser.find_element_by_class_name(wait_for_class)
            break  # otherwise the loop ends here
        except NoSuchElementException:
            time.sleep(0.5)  # pause briefly, then try again
    # source has loaded - write out the result
    with open(out_file, 'w', encoding="utf-8") as f:
        f.write(browser.page_source)
    browser.quit()
# Loop through all 42 pages of search results & save to the working directory
for page in range(42):
    url = 'https://www.vitals.com/search?display_type=Doctor&state=NY&page=' + str(page)
    save_page(url, filepath + 'Vitals' + str(page + 1) + ".html")
import re
from bs4 import BeautifulSoup

def parse_page(page_path):
    """Parse one saved search-results page, returning a list of
    (name, specialty, location, rating, reviews) tuples."""
    with open(page_path, encoding="utf-8") as f:
        page_contents = f.read().strip()
    my_soup = BeautifulSoup(page_contents, "html.parser")
    data = []
    for card in my_soup.find_all("div", attrs={"class": "search-card"}):
        # Scrape physician name
        name_element = card.find("span", attrs={"class": "name"})
        if name_element is None:
            continue
        name = name_element.text.strip()
        # Scrape physician specialty
        specialty_element = card.find("span", attrs={"class": "specialty"})
        if specialty_element is None:
            continue
        specialty = specialty_element.text.strip().split(',')[0]
        # Scrape physician location
        location_element = card.find("span", attrs={"class": "address"})
        if location_element is None:
            continue
        location = location_element.text.strip().split(',')[0]
        # Scrape physician rating
        rating_element = card.find("span", attrs={"class": "rating-text"})
        if rating_element is None:
            continue
        rating = rating_element.text.strip()
        # Scrape the number of reviews for each physician from the reviews link
        reviews_elements = card.select("a[href*=reviews]")
        if not reviews_elements:
            continue
        reviews = ''.join(re.findall(r'\d+', str(reviews_elements)))
        data.append((name, specialty, location, rating, reviews))
    return data
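As a quick sanity check before writing the full CSV, parse_page can be run on a single downloaded page (a hypothetical spot check, assuming page 1 has already been saved by the loop above):

# Parse one saved page and preview the first few rows
sample_rows = parse_page(filepath + "Vitals1.html")
print(len(sample_rows), "physician cards parsed")
print(sample_rows[:3])  # (name, specialty, location, rating, reviews) tuples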
import csv

with open(filepath + 'healthreviews.csv', 'w', newline='', encoding="utf-8") as csvfile:
    writer = csv.writer(csvfile)
    titlerow = ("Name", "Specialty", "Location", "Rating", "Reviews")
    writer.writerow(titlerow)
    # Loop through all downloaded pages of the website; each "card" forms a row of the CSV
    for page in range(1, 43):
        all_rows = parse_page(filepath + "Vitals" + str(page) + ".html")
        for row in all_rows:
            writer.writerow(row)
The resulting CSV was structured into a dataframe using the pandas package. The analysis and visualizations were produced with a combination of pandas and pyplot from the matplotlib package. After cleaning, the dataset contained 958 records with the following variables: Name, Specialty, Location, Rating, and Reviews.
Exploratory analysis of the dataset proceeded as follows:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Reading in the resulting csv. 1003 records for physicians
healthReview = pd.read_csv(filepath + "healthreviews.csv")
# Keeping doctors that had one or more reviews
healthReview = healthReview[healthReview['Reviews'] > 0]
# Checking the dataset for any missing values. 45 missing values returned
healthReview.isnull().sum().sum()
# Dropping missing values. 958 Rows remaining
healthReview = healthReview.dropna()
# Sorting dataset by number of reviews, descending
healthReview = healthReview.sort_values(['Reviews'], ascending = False)
healthReview.head(10).style.bar(subset=['Reviews'], color='#5fba7d')
# Generating some summary statistics for rating and reviews variables
healthReview.describe().round(2)
We can see that the average rating for a physician is approximately 5, which is the maximum rating available, and that each physician has an average of 9.4 reviews. To visualize these distributions, we next created box plots for the two variables.
# Box plot for ratings variable (On a 1-5 scale)
healthReview['Rating'].plot(kind = 'box')
This box plot is concerning: the interquartile range for ratings is nonexistent. The purpose of the website is to compare reviews of physicians, yet nearly all of them appear to be rated 5! To examine this more closely, we took a look at the five lowest-rated physicians.
healthReview.sort_values(['Rating']).head(5)
Only two physicians out of the 958 in our dataset have a rating of less than 5. It seems clear from this that the Vitals website cannot be used to accurately gauge the quality of a physician.
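A quick check (a hypothetical one-liner on the same dataframe) confirms the count:

# Count physicians whose rating falls below the 5.0 maximum
print((healthReview['Rating'] < 5).sum())  # expected: 2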
# Box plot for reviews variable
healthReview['Reviews'].plot(kind = 'box', ylim = (0,50))
It appears that the majority of physicians in our dataset have fewer than 10 reviews. Even so, that is a rather large number of 5-star ratings.
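The exact proportion can be checked directly (again, a hypothetical snippet on the same dataframe):

# Fraction of physicians with fewer than 10 reviews
print((healthReview['Reviews'] < 10).mean())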
# Preparing the data to visualize the number of physicians in each location
reviewLocation = pd.DataFrame(healthReview.groupby('Location')['Location'].count())
reviewLocation = reviewLocation.sort_values(['Location'], ascending = False)
reviewLocation.columns = ['Physicians Reviewed']
reviewLocation.head(10).plot(kind = 'bar')
We filtered our selection of doctors to those residing in the state of New York, but the resulting locations aren't completely clean. Flushing appears as its own location despite being part of Queens, and the Manhattan grouping (New York) includes some doctors in other New York City locations. Manhattan is the location with the largest number of physicians in our dataset, which is perhaps a little surprising given the relative populations of the boroughs. Overall, we would expect the highest number of reviews to come from within New York City, and that is reflected in the dataset.
# Bar graphing count of physicians for each specialty
healthReview.groupby('Specialty')['Reviews'].count().sort_values(ascending=False).head(10).plot(kind = 'bar')
Internal medicine, pediatrics, and family medicine have the largest numbers of physicians. This is consistent with what we would expect, as these specialties are among the most commonly chosen by rising medical students. The one outlier seems to be surgery, another commonly chosen specialty.
Unfortunately, the Vitals website does not seem to be an effective resource for making an informed choice of physician. Only two ratings out of 958 deviated from a perfect 5 stars. The proportion of physicians in each location and specialty at least makes some sense, which lends a little credibility to physician participation on the website. Still, the sheer number of 5-star reviews might suggest some manipulation of the rating system by the physicians themselves. While rating systems such as Uber, Airbnb, and Amazon typically trend toward the upper end, they generally show some variation; Vitals has essentially all perfect scores and almost no variation at all, which makes it ineffective as an analysis tool. It might be a question of website maturity or public awareness, but it seems that Vitals needs to develop more of a presence to compete with Healthgrades and ZocDoc.