How Do I Avoid Data From Different Tabs To Be Concatenated In One Cell When I Scrape A Table?
I scraped this page https://www.capfriendly.com/teams/bruins, specifically looking for the tables under the Cap Hit tab (Forwards, Defense, Goaltenders). I used Python and Beautiful
Solution 1:
Based on the page source, some of the text in specific rows is conditionally visible depending on which tab you're on (as your title states). The class .hide is added to the child element in the td when it is meant to be hidden on that specific tab.
When you're parsing the td elements to retrieve the text, you can filter out the elements that are supposed to be hidden. That way, you retrieve only the text that would be visible if you were viewing the page in a web browser.
In the snippet below, I added a parse_td function which filters out the child span elements with a class of hide. From there, the text of the first remaining span is returned, or the cell's own text if it has no span children.
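import csv

import bs4
import requests

r = requests.get('https://www.capfriendly.com/teams/bruins')
soup = bs4.BeautifulSoup(r.text, 'lxml')
table = soup.find(id="team")

def parse_td(td):
    # Keep only direct-child spans that are not marked with the "hide" class.
    # .get('class', []) avoids a KeyError for spans without a class attribute.
    filtered_data = [tag.text for tag in td.find_all('span', recursive=False)
                     if 'hide' not in tag.attrs.get('class', [])]
    # Cells without span children fall back to the plain cell text.
    return filtered_data[0] if filtered_data else td.text

with open("csvfile.csv", "w", newline='') as team_data:
    writer = csv.writer(team_data)  # create the writer once, not per row
    for tr in table.find_all('tr', class_=['odd', 'even']):
        row = [parse_td(td) for td in tr.find_all('td')]
        writer.writerow(row)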
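To see the effect of the filter without hitting the site, here is a minimal sketch run against a small inline table. The markup, player name, and dollar values are made up for illustration and only imitate the structure described above (one span per tab, with inactive ones carrying the hide class). It uses the built-in html.parser backend instead of lxml so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one cell holds a span per tab, and the span for the
# inactive tab carries the "hide" class. Names and values are invented.
SAMPLE_HTML = """
<table id="team">
  <tr class="odd">
    <td>Player A</td>
    <td><span class="cap">$1,000,000</span><span class="aav hide">$2,000,000</span></td>
    <td><span>UFA</span></td>
  </tr>
</table>
"""

def parse_td(td):
    """Return the text of the first direct-child span that is not hidden."""
    visible = [
        span.get_text()
        for span in td.find_all("span", recursive=False)
        if "hide" not in span.get("class", [])
    ]
    # Cells without any span fall back to the plain cell text.
    return visible[0] if visible else td.get_text()

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
row = soup.find("tr", class_="odd")

# Plain text extraction concatenates the values from every tab.
naive = [td.get_text() for td in row.find_all("td")]
# Filtering on the "hide" class keeps only the visible value.
filtered = [parse_td(td) for td in row.find_all("td")]

print(naive)     # ['Player A', '$1,000,000$2,000,000', 'UFA']
print(filtered)  # ['Player A', '$1,000,000', 'UFA']
```

The second cell shows the original problem: reading the cell text directly glues both tab values together, while the filtered version returns only the value a browser would display.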