For our project, we decided to assess the usefulness of the physician rating website Vitals. To do so, we scraped data from the website and analyzed it using the pandas package. Ideally, we wanted to compare ratings across specialties and locations in order to draw conclusions about the perceived quality of doctors, but the heavily skewed ratings data limited the analyses we could conduct.
To run the project, several dependencies must be installed in addition to Python: selenium, beautifulsoup4, pandas, and matplotlib, along with a ChromeDriver executable matching the installed version of Chrome.
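For example, assuming pip is available, the third-party packages can be installed from a notebook cell (the names below are the standard PyPI package names; ChromeDriver itself is a separate download):

# Standard PyPI package names (assumed); ChromeDriver must be downloaded separately
!pip install selenium beautifulsoup4 pandas matplotlib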
Additionally, the file path for downloaded webpages and for the CSV must be set below:
# Be sure to include the final forward slash in the file path!
filepath = "C:/Desired/File/Path/"
Some problems were encountered while scraping the data. The search results are loaded by client-side JavaScript that needs to be executed in a browser, which posed initial complications; however, Selenium helped automate the process. Despite these problems, Python was well suited to the task and allowed us to complete this step with relatively few issues.
In past analyses, we have typically used R or SAS. While we are more comfortable with those languages, Python proved more than suitable for this analysis as well.
Data was scraped from the Vitals website, with only physicians from the state of New York included in our dataset. To accomplish this, we used webdriver from the selenium package along with BeautifulSoup. The scraped data was then exported to a CSV file using the csv package.
Data scraping proceeded as follows:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time

def save_page(url, out_file, wait_for_class='search-card'):
    """Save a page, using the Selenium browser automation framework to run any
    JavaScript, writing the resulting final DOM as HTML source to disk.
    Waits until an element of a given class is loaded to make sure the
    JavaScript has actually completed before the source is retrieved."""
    browser = webdriver.Chrome("C:/Path/To/Chromedriver.exe")
    browser.get(url)
    # Wait until an element with the given class is loaded
    while True:
        try:
            # if the element cannot be found, this will throw
            browser.find_element_by_class_name(wait_for_class)
            break  # otherwise the loop ends here
        except NoSuchElementException:
            time.sleep(0.5)  # pause briefly, then try again
    # source has loaded - write out the result
    with open(out_file, 'w', encoding="utf-8") as f:
        f.write(browser.page_source)
    browser.quit()
# Loop through all 42 pages of search results & save to the working directory
for page in range(42):
    url = 'https://www.vitals.com/search?display_type=Doctor&state=NY&page=' + str(page)
    save_page(url, filepath + 'Vitals' + str(page + 1) + ".html")
import re
from bs4 import BeautifulSoup

def parse_page(page_path):
    """Parse one saved search-results page, returning a list of
    (name, specialty, location, rating, reviews) tuples."""
    with open(page_path, encoding="utf-8") as f:
        page_contents = f.read().strip()
    my_soup = BeautifulSoup(page_contents, "html.parser")
    data = []
    for card in my_soup.find_all("div", attrs={"class": "search-card"}):
        # Scrape physician name
        name_element = card.find("span", attrs={"class": "name"})
        if name_element is None:
            continue
        name = name_element.text.strip()
        # Scrape physician specialty
        specialty_element = card.find("span", attrs={"class": "specialty"})
        if specialty_element is None:
            continue
        specialty = specialty_element.text.strip().split(',')[0]
        # Scrape physician location
        location_element = card.find("span", attrs={"class": "address"})
        if location_element is None:
            continue
        location = location_element.text.strip().split(',')[0]
        # Scrape physician rating
        rating_element = card.find("span", attrs={"class": "rating-text"})
        if rating_element is None:
            continue
        rating = rating_element.text.strip()
        # Scrape the number of reviews for each physician from the reviews link
        reviews_elements = card.select("a[href*=reviews]")
        if not reviews_elements:
            continue
        reviews = ''.join(re.findall(r'\d+', str(reviews_elements)))
        data.append((name, specialty, location, rating, reviews))
    return data
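As a quick sanity check before writing the full CSV, parse_page can be run on a single downloaded page (a hypothetical spot check, assuming page 1 has already been saved by the loop above):

# Parse one saved page and preview the first few rows
sample_rows = parse_page(filepath + "Vitals1.html")
print(len(sample_rows), "physician cards parsed")
print(sample_rows[:3])  # (name, specialty, location, rating, reviews) tuples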
import csv

with open(filepath + 'healthreviews.csv', 'w', newline='', encoding="utf-8") as csvfile:
    writer = csv.writer(csvfile)
    titlerow = ("Name", "Specialty", "Location", "Rating", "Reviews")
    writer.writerow(titlerow)
    # Loop through all downloaded pages of the website; each "card" forms a row of the CSV
    for page in range(1, 43):
        all_rows = parse_page(filepath + "Vitals" + str(page) + ".html")
        for row in all_rows:
            writer.writerow(row)
The resulting CSV was structured into a dataframe using the pandas package. The analysis and visualizations were produced with a combination of pandas and pyplot from the matplotlib package. After cleaning, the dataset contained 958 records with the following variables: Name, Specialty, Location, Rating, and Reviews.
Exploratory analysis of the dataset proceeded as follows:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Reading in the resulting csv. 1003 records for physicians
healthReview = pd.read_csv(filepath + "healthreviews.csv")
# Keeping doctors that had one or more reviews
healthReview = healthReview[healthReview['Reviews'] > 0]
# Checking the dataset for any missing values. 45 missing values returned
healthReview.isnull().sum().sum()
# Dropping missing values. 958 Rows remaining
healthReview = healthReview.dropna()
# Sorting dataset by number of reviews, descending
healthReview = healthReview.sort_values(['Reviews'], ascending = False)
healthReview.head(10).style.bar(subset=['Reviews'], color='#5fba7d')
# Generating some summary statistics for rating and reviews variables
healthReview.describe().round(2)
We can see that the average rating for a physician is approximately 5, which is the maximum rating available, and that each physician has an average of 9.4 reviews. To visualize these distributions, we next created box plots for the two variables.
# Box plot for ratings variable (On a 1-5 scale)
healthReview['Rating'].plot(kind = 'box')
This box plot is concerning: the interquartile range for ratings is nonexistent. The purpose of the website is to compare reviews of physicians, yet nearly all of them appear to be rated 5! To examine this more closely, we took a look at the five lowest-rated physicians.
healthReview.sort_values(['Rating']).head(5)
Only two physicians out of the 958 in our dataset have a rating of less than 5. It seems clear from this that the Vitals website cannot be used to accurately gauge the quality of a physician.
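A quick check (a hypothetical one-liner on the same dataframe) confirms the count:

# Count physicians whose rating falls below the 5.0 maximum
print((healthReview['Rating'] < 5).sum())  # expected: 2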
# Box plot for reviews variable
healthReview['Reviews'].plot(kind = 'box', ylim = (0,50))
It appears that the majority of physicians in our dataset have fewer than 10 reviews. Even so, that is a rather large number of 5-star ratings.
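The exact proportion can be checked directly (again, a hypothetical snippet on the same dataframe):

# Fraction of physicians with fewer than 10 reviews
print((healthReview['Reviews'] < 10).mean())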
# Preparing the data to visualize the number of physicians in each location
reviewLocation = pd.DataFrame(healthReview.groupby('Location')['Location'].count())
reviewLocation = reviewLocation.sort_values(['Location'], ascending = False)
reviewLocation.columns = ['Physicians Reviewed']
reviewLocation.head(10).plot(kind = 'bar')
We filtered our selection of doctors to those residing in the state of New York, but the resulting locations aren't completely clean. Flushing appears as its own location despite being part of Queens, and the Manhattan grouping (New York) includes some doctors in other New York City locations. Manhattan is the location with the largest number of physicians in our dataset, which is perhaps a little surprising given the relative populations of the boroughs. Overall, we would expect the highest number of reviews to come from within New York City, and that is reflected in the dataset.
# Bar graphing count of physicians for each specialty
healthReview.groupby('Specialty')['Reviews'].count().sort_values(ascending=False).head(10).plot(kind = 'bar')
Internal medicine, pediatrics, and family medicine have the largest numbers of physicians. This is consistent with what we would expect, as these specialties are among the most commonly chosen by rising medical students. The one outlier seems to be surgery, another commonly chosen specialty.
Unfortunately, the Vitals website does not seem to be an effective resource for making an informed choice of physician. Only two ratings out of 958 deviated from a perfect 5 stars. The proportion of physicians in each location and specialty at least makes some sense, which lends a little credibility to physician participation on the website. Still, the sheer number of 5-star reviews might suggest some manipulation of the rating system by the physicians themselves. While rating systems such as Uber, Airbnb, and Amazon typically trend toward the upper end, they generally show some variation; Vitals has essentially all perfect scores and almost no variation at all, which makes it ineffective as an analysis tool. It might be a question of website maturity or public awareness, but it seems that Vitals needs to develop more of a presence to compete with Healthgrades and ZocDoc.