
Data Collection and Processing for Intelligent Technology Ecosystem Analysis

Packages

  1. scikit-learn (TfidfVectorizer, CountVectorizer, PCA, KMeans, DBSCAN)
  2. pandas
  3. requests / urllib
  4. BeautifulSoup (lxml)
  5. Selenium

Index

  • Research Procedure
    • Item-based Technology Type Analysis
    • Topic-based Technology Type Analysis
  • Research Results (Analysis of GitHub's Key Repositories)
    • Star-based
    • Big Tech Companies
    • Future Tech: Autonomous Vehicles, Metaverse

Research

แ„‰แ…ณแ„แ…ณแ„…แ…ตแ†ซแ„‰แ…ฃแ†บ 2022-03-02 แ„‹แ…ฉแ„’แ…ฎ 3 49 56

1. ๋ฐ์ดํ„ฐ ์ˆ˜์ง‘ ๋ฐ ์ „์ฒ˜๋ฆฌ

  1. ๊นƒํ—ˆ๋ธŒ ์˜คํ”ˆ์†Œ์Šค ์ •๋ณด ๋ฐ API ๋ถ„์„
  • ๊นƒํ—ˆ๋ธŒ๋Š” ๋Œ€์šฉ๋Ÿ‰์˜ ์˜คํ”ˆ์†Œ์Šค ์ •๋ณด๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ํ™•์ธํ•˜๊ธฐ ์œ„ํ•œ ๋„๊ตฌ๋กœ API๋ฅผ ์ œ๊ณตํ•˜๊ณ  ์žˆ์œผ๋ฉฐ, ๊ฐœ๋ฐœ์ž, ๊ฐœ๋ฐœํ™˜๊ฒฝ, ํ˜„ํ™ฉ ๋“ฑ ์—ฌ๋Ÿฌ ๊ธฐ์ˆ ์†์„ฑ์„ ๊ฒ€์ƒ‰ํ•  ์ˆ˜ ์žˆ๋„๋ก ์ง€์›ํ•˜๊ณ  ์žˆ์Œ.
  • ์•„๋ž˜ url์„ ํ†ตํ•ด "deep learning" ํ‚ค์›Œ๋“œ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ์— ๋Œ€ํ•œ ์ •๋ณด๋ฅผ ์›นํŽ˜์ด์ง€ ์ƒ์—์„œ API ํ˜•ํƒœ๋กœ ํ™•์ธ ๊ฐ€๋Šฅ https://api.github.com/search/repositories?q=deep%20learning&page,per_page,sort,order

แ„‰แ…ณแ„แ…ณแ„…แ…ตแ†ซแ„‰แ…ฃแ†บ 2022-03-02 แ„‹แ…ฉแ„’แ…ฎ 3 47 13

  1. API ์ด์šฉํ•˜์—ฌ ํ•„์š”ํ•œ ์ €์žฅ์†Œ ์ •๋ณด Crawling test
import json
from urllib.request import urlopen

import pandas as pd

def topic(t):
    # URL-encode spaces in the search keyword
    topic = t.replace(' ', '%20')
    response = urlopen('https://api.github.com/search/repositories?q={}&page,per_page,sort,order'.format(topic)).read().decode('utf-8')
    responseJson = json.loads(response)

    name_lst = []
    type_lst = []
    create_lst = []
    size_lst = []
    star_lst = []
    fork_lst = []
    login_lst = []

    items = responseJson.get('items')

    for lst in items:
        name_lst.append(lst.get('name'))
        type_lst.append(lst.get('owner').get('type'))
        create_lst.append(lst.get('created_at'))
        size_lst.append(lst.get('size'))
        star_lst.append(lst.get('stargazers_count'))
        fork_lst.append(lst.get('forks_count'))
        login_lst.append(lst.get('owner').get('login'))

    df = pd.DataFrame([name_lst, type_lst, create_lst, size_lst, star_lst, fork_lst, login_lst])
    df = df.transpose()
    df.columns = ['name', 'type', 'created_at', 'size', 'stargazers_count', 'fork', 'login']
    return df

# test
topic('deep learning')

แ„‰แ…ณแ„แ…ณแ„…แ…ตแ†ซแ„‰แ…ฃแ†บ 2022-03-02 แ„‹แ…ฉแ„’แ…ฎ 3 48 58

  1. ์ „์ฒด ํŽ˜์ด์ง€ Json ํ˜•ํƒœ๋กœ response, Crawling ๋ฐ Excel ํ˜•ํƒœ ์ €์žฅ
  • Github ์ž์ฒด์—์„œ ์ธํ„ฐํŽ˜์ด์Šค ๊ธฐ๋ฐ˜์˜ ํŽ˜์ด์ง€ ๋ณ€ํ™”์™€ ์ž๋™ ํฌ๋กค๋ง ๋ฐฉ์ง€๋ฅผ ์œ„ํ•œ ์š”์ฒญ์‹œ๊ฐ„ ํ™•์ธ ๋“ฑ์˜ ์ด์Šˆ์‚ฌํ•ญ ์กด์žฌ : time.sleep() ํ•จ์ˆ˜๋ฅผ ํ†ตํ•ด ๋Œ€๊ธฐ์‹œ๊ฐ„ ๋ฐœ์ƒ์‹œ์ผœ ๋ฐ˜๋ณต์ ์ธ ๋™์  Crawling
import json
import time
from urllib.request import urlopen

import pandas as pd

def topic(t):
    topic = t.replace(' ', '%20')
    name_lst = []
    type_lst = []
    create_lst = []
    size_lst = []
    star_lst = []
    fork_lst = []
    login_lst = []

    for i in range(1, 11):
        # retry after a pause when the request is rejected (e.g. rate limiting)
        while True:
            try:
                response = urlopen('https://api.github.com/search/repositories?q={}&sort=stars&per_page=100&page={}'.format(topic, i)).read().decode('utf-8')
                break
            except Exception:
                time.sleep(10)

        responseJson = json.loads(response)

        print(f'{i} response')

        items = responseJson.get('items')

        for lst in items:
            name_lst.append(lst.get('name'))
            type_lst.append(lst.get('owner').get('type'))
            create_lst.append(lst.get('created_at'))
            size_lst.append(lst.get('size'))
            star_lst.append(lst.get('stargazers_count'))
            fork_lst.append(lst.get('forks_count'))
            login_lst.append(lst.get('owner').get('login'))

    df = pd.DataFrame([name_lst, type_lst, create_lst, size_lst, star_lst, fork_lst, login_lst])
    df = df.transpose()
    df.columns = ['name', 'type', 'created_at', 'size', 'stargazers_count', 'fork', 'login']
    return df

# test
topic('deep learning')

แ„‰แ…ณแ„แ…ณแ„…แ…ตแ†ซแ„‰แ…ฃแ†บ 2022-03-02 แ„‹แ…ฉแ„’แ…ฎ 4 30 11

-> ์œ„์™€ ๊ฐ™์€ Crawling ๋ฐฉ์‹์œผ๋กœ ์ƒ์œ„ ์ธ์ง€๋„(star), ๋น…ํ…Œํฌ ๊ธฐ์—…(Google, MS, Intel, Facebook, Apple, Amazon ๋“ฑ), ๋ฏธ๋ž˜๊ธฐ์ˆ (์ž์œจ์ฃผํ–‰์ฐจ, ๋ฉ”ํƒ€๋ฒ„์Šค) ๊ธฐ์ค€ Crawling ๋ฐ ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šคํ™”

2. Technology Analysis

1) ์ธ์ง€๋„(star) ๊ธฐ๋ฐ˜ ๊นƒํ—ˆ๋ธŒ ์ฃผ์š” ์ €์žฅ์†Œ ๋ถ„์„

  • Run the crawl with project name, topic keyword, and star count as columns
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

stars = input("Enter a star-count threshold: ")
url = "https://github.com/search?p=1&q=stars%3A%3E{}&type=Repositories".format(stars)

def crawling_func(url):
    print(url)
    try:
        res = requests.get(url)
        res.raise_for_status()
        soup = BeautifulSoup(res.text, "lxml")
        return soup.find_all("a", attrs={"class": "v-align-middle"})
    except Exception:
        time.sleep(1)
        return crawling_func(url)  # retry recursively on request errors

topic_ads = []
pages = int(input("Number of pages to search (e.g. 10): "))
print("Crawling the top {1} pages of repositories with more than {0} stars.".format(stars, pages))
print()
for i in range(1, pages + 1):  # walk pages 1..pages
    url = "https://github.com/search?p={}&q=stars%3A%3E{}&type=Repositories".format(i, stars)  # page formatting
    time.sleep(15)  # even a 10-second wait still triggered errors
    for j in crawling_func(url):
        topic_ads.append(j.get_text())
        print(j.get_text())


topics_dic = {}
topics_list = []
for ad in topic_ads:
    url_topic = "https://github.com/" + ad
    res_topic = requests.get(url_topic)
    res_topic.raise_for_status()
    soup_topic = BeautifulSoup(res_topic.text, "lxml")

    topic = soup_topic.find("div", attrs={"class": "BorderGrid-cell"}).find_all("a", attrs={"class": "topic-tag topic-tag-link"})
    project_topics = [i.get_text().replace("\n", "").replace("\t", "").strip() for i in topic]

    # star count (bookmarks, i.e. popularity)
    star_num = soup_topic.find("ul", attrs={"class": "pagehead-actions flex-shrink-0 d-none d-md-inline"}).find("a", attrs={"class": "social-count js-social-count"}).get_text()
    star_num = star_num.replace('\t', '').replace('\n', '').strip()

    topics_list.append([ad, project_topics, star_num])

    for t in project_topics:
        if t in topics_dic:
            topics_dic[t] += 1
        else:
            topics_dic[t] = 1

# sort by value in descending order
topics_dic = sorted(topics_dic.items(), key=lambda x: x[1], reverse=True)
df_topic2 = pd.DataFrame(topics_list, columns=['project_name', 'topic_keyword', 'star_number'])

# save to Excel
df_topic2.to_excel('Topics_stars{}_project_keyword.xlsx'.format(stars), index=False)
print()
print("Data for the top {1} pages of projects with at least {0} stars has been saved.".format(stars, pages))
print("Done.")

แ„‰แ…ณแ„แ…ณแ„…แ…ตแ†ซแ„‰แ…ฃแ†บ 2022-03-02 แ„‹แ…ฉแ„’แ…ฎ 6 33 11

  • ๊ฐ Project๋ณ„ Topic Keyword ๋ฒกํ„ฐํ™” ์ง„ํ–‰ (TF-IDF) -> ์œ ์˜์–ด๋ผ๋ฆฌ ์„œ๋กœ ๋ฌถ๋Š” ์ž‘์—… ๋ฐ Topic์˜ ๋ˆ„์ ํ•ฉ์„ ๋”•์…”๋„ˆ๋ฆฌ ํ˜•ํƒœ๋กœ ๋งŒ๋“œ๋Š” ์ž‘์—… ํ›„ ์ง„ํ–‰
from sklearn.feature_extraction.text import TfidfVectorizer

vectorize = TfidfVectorizer(
    min_df=5    # ignore terms that appear only a handful of times, to keep the example readable
                # min_df = 0.01 : ignore terms that appear in less than 1% of the documents
                # min_df = 10   : ignore terms that appear in fewer than 10 documents
                # max_df = 0.80 : ignore terms that appear in 80% or more of the documents
                # max_df = 10   : ignore terms that appear in more than 10 documents
)
X = vectorize.fit_transform(df['topic_keyword_str'])
print('fit_transform, (sentence {}, feature {})'.format(X.shape[0], X.shape[1]))

# the array of features extracted from the sentences
features = vectorize.get_feature_names()

X.toarray()
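
The synonym-grouping step mentioned above relies on a same_things() helper that the clustering code below calls but this README never defines; a minimal sketch of what such a normalizer might look like (all mapping entries here are hypothetical):

def same_things(word):
    # hypothetical synonym map: collapse topic variants onto one canonical keyword
    synonyms = {
        'deeplearning': 'deep-learning',
        'machinelearning': 'machine-learning',
        'ml': 'machine-learning',
    }
    word = word.strip().lower()
    return synonyms.get(word, word)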

แ„‰แ…ณแ„แ…ณแ„…แ…ตแ†ซแ„‰แ…ฃแ†บ 2022-05-07 แ„‹แ…ฉแ„’แ…ฎ 8 06 00

  • First-pass DBSCAN clustering of similar topics -> run after reducing dimensionality with PCA -> here the data is reduced from 216 to 155 dimensions so that only 5% of the information (explained variance) is lost
# ์ฐจ์›์ถ•์†Œ๋ฅผ ํ•˜์ง€ ์•Š๊ณ ๋„ PCA๋ฅผ ๋Œ๋ ค๋ด๋ณด๊ธฐ

# ์ •๋ณด๋Ÿ‰์ด 95% ์ธ ๋งŒํผ์˜ ์นผ๋Ÿผ์ˆ˜๊ฐ€ 155์ž„
pca = PCA(n_components=155)
df_pca = pca.fit_transform(tfidf_vector_df)
df_pca = pd.DataFrame(df_pca, index=tfidf_vector_df.index,
                   columns=[f"pca{num+1}" for num in range(df_pca.shape[1])])

for i in df_dbscan_cluster['clusters']model = DBSCAN(eps=0.4, min_samples=5, metric='cosine')

result = model.fit_predict(df_pca)
set(result)
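
The choice of 155 components for 95% retained variance can be reproduced by fitting PCA with all components and inspecting the cumulative explained-variance ratio; a minimal sketch, assuming the tfidf_vector_df built in the vectorization step above:

import numpy as np
from sklearn.decomposition import PCA

pca_full = PCA().fit(tfidf_vector_df)                 # keep all components
cum_var = np.cumsum(pca_full.explained_variance_ratio_)
n_components = int(np.argmax(cum_var >= 0.95)) + 1    # smallest k reaching 95% variance
print(n_components)                                   # 155 for this data set, per the text above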

df_result = df.copy()
df_result['result'] = result

j = 0
keyword = []
topic = []
num = []

for cluster_num in set(result):

    if cluster_num == -1 or cluster_num == 0:
        continue  # discard noise points (and cluster 0)
    else:
        print('cluster num : {}'.format(cluster_num))
        temp_df = df_result[df_result['result'] == cluster_num]

        i = 0

        for k in temp_df['topic_keyword_str']:
            keyword.append(k)
            num.append(cluster_num)
            i = i + 1
        # print('number of points in this cluster: ', i)

        for t in temp_df['project_name']:
            topic.append(t)
    j += i
    print()

dic_cluster = {}
dic_cluster['topic'] = topic
dic_cluster['keyword'] = keyword
dic_cluster['number'] = num
df_cluster = pd.DataFrame(dic_cluster)  # columns: topic, keyword, number

clusters = {}
keywords = {}

num = []
for i in set(df_cluster['number']):
    n = 0
    cluster = []
    keyword = []
    for j in df_cluster.values:
        if j[2] == i:
            cluster.append(j[0])
            clusters[i] = cluster
            keyword += j[1].split(' ')
            n += 1
        else:
            pass
        keywords[i] = keyword
    num.append(n)

count_items = []

for i in keywords.values():
    count = {}
    for j in i:
        try:
            count[j] += 1
        except KeyError:
            count[j] = 1
    val = sorted(count.items(), key=lambda x: x[1], reverse=True)
    count_items.append(val[:15])

df_cluster_ = pd.DataFrame()
df_cluster_['clusters'] = clusters.values()
df_cluster_['cluster_num'] = clusters.keys()
df_cluster_['count'] = num
df_cluster_['top_15_topics'] = count_items
df_cluster_

แ„‰แ…ณแ„แ…ณแ„…แ…ตแ†ซแ„‰แ…ฃแ†บ 2022-05-07 แ„‹แ…ฉแ„’แ…ฎ 8 15 35

  • Second-pass DBSCAN clustering among the topics with cluster_num = 1
num_list = []
for idx, num in enumerate(result):
    if num == 1:
        num_list.append(df_pca.iloc[idx])
df_result2 = df_result[df_result['result'] == 1]

df_pca2 = pd.DataFrame(num_list)
df_pca2 = df_pca2.loc[~df_pca2.index.duplicated(keep='first')]
tfidf_vector_df2 = pd.merge(tfidf_vector_df, df_pca2, left_index=True, right_index=True, how='inner', sort=False)
tfidf_vector_df2.drop(tfidf_vector_df2.iloc[:, 217:], axis=1, inplace=True)
tfidf_vector_df2 = tfidf_vector_df2.loc[~tfidf_vector_df2.index.duplicated(keep='first')]

pca = PCA(n_components=155)
df_pca2 = pca.fit_transform(tfidf_vector_df2)
df_pca2 = pd.DataFrame(df_pca2, index=tfidf_vector_df2.index,
                       columns=[f"pca{num+1}" for num in range(df_pca2.shape[1])])

model2 = DBSCAN(eps=0.3, min_samples=5, metric='cosine')  # parameter values need re-tuning

result2 = model2.fit_predict(df_pca2)
set(result2)

# from here on, identical to the clustering method above

แ„‰แ…ณแ„แ…ณแ„…แ…ตแ†ซแ„‰แ…ฃแ†บ 2022-05-07 แ„‹แ…ฉแ„’แ…ฎ 8 18 03

2) ๋น…ํ…Œํฌ ๊ธฐ์—… ์ €์žฅ์†Œ ๋ถ„์„

  • ๊ฒ€์ƒ‰ํ•  ๊ธฐ์—…์ด๋ฆ„ ๊ธฐ์ค€ ์ •๋ณด ์ˆ˜์ง‘
import time

import requests
from bs4 import BeautifulSoup

# org = ["aws","facebook","google","naver","kakao","apple","alibaba","tencent","baidu","microsoft","samsung"]
org = list(input("Enter the company names to search, in English (separate multiple names with spaces): ").split())
org_dic = {}
for o in org:
    url = "https://github.com/orgs/{}/repositories".format(o)
    print("Starting collection for {}.".format(url))
    res = requests.get(url)
    try:
        res.raise_for_status()
    except Exception:
        print("No information exists for the company \"{}\" you entered.\n".format(o))
        continue
    soup = BeautifulSoup(res.text, "lxml")
    try:
        max_page = int(soup.find("div", attrs={"role": "navigation"}).find_all("a")[-2].get_text())
    except Exception:
        max_page = 1
    item_temp = []
    for p in range(1, max_page + 1):
        time.sleep(1)
        url = "https://github.com/orgs/{}/repositories?page={}".format(o, p)
        res = requests.get(url)
        res.raise_for_status()
        soup = BeautifulSoup(res.text, "lxml")
        print("{}collecting {}{}".format("*" * 10, o, "*" * 10))
        for item in soup.find("div", attrs={"class": "org-repos repo-list"}).find_all("li", attrs={"class": "Box-row"}):
            print(item.a.get_text().strip())
            item_temp.append(item.a.get_text().strip())
    org_dic[o] = item_temp
    print()

แ„‰แ…ณแ„แ…ณแ„…แ…ตแ†ซแ„‰แ…ฃแ†บ 2022-05-08 แ„‹แ…ฉแ„’แ…ฎ 11 02 37

  • DBSCAN Clustering
from sklearn.cluster import DBSCAN
import pandas as pd

for o in org:
    print("{} starting DBSCAN clustering for {} {}".format("*" * 10, o, "*" * 10))
    excel_name = "{}_vectors.xlsx".format(o)
    df_org = pd.read_excel("{}.xlsx".format(o))
    df_vector = pd.read_excel(excel_name)
    # tuning the eps value while clustering yields more accurate results
    dbscan = DBSCAN(eps=0.3)
    dbscan_cluster = dbscan.fit_predict(df_vector)
    dbscan_clustered_dic = {}
    dbscan_cluster_num = len(set(dbscan_cluster))

    for idx, i in enumerate(dbscan_cluster):
        if i not in dbscan_clustered_dic:
            dbscan_clustered_dic[i] = [df_org['ProjectName'][idx]]
        else:
            dbscan_clustered_dic[i].append(df_org['ProjectName'][idx])

    # the clustered packages
    # produced as 20 clusters

    dbscan_clustered_dic = sorted(dbscan_clustered_dic.items(), key=lambda x: x[0])

    df_dbscan_cluster = pd.DataFrame(dbscan_clustered_dic, columns=['num', 'clusters'])
    dbscan_cluster_num = [len(i) for i in df_dbscan_cluster['clusters']]
    df_dbscan_cluster['cluster_num'] = dbscan_cluster_num
    topic_dbscan_clustered_list = []
    for i in df_dbscan_cluster['clusters']:
        temp_dic = {}
        for j in i:
            topics = df_org[df_org['ProjectName'] == j]['Topics'].values[0].replace("[", "").replace("]", "").replace("'", "").strip().split(",")
            for t in topics:
                if len(t) == 0:
                    continue
                t = same_things(t)  # synonym-normalization helper (see sketch above)
                if t not in temp_dic:
                    temp_dic[t] = 1
                else:
                    temp_dic[t] += 1
        temp_dic = sorted(temp_dic.items(), key=lambda x: x[1], reverse=True)
        # print(temp_dic[:15])  # show only the top 15
        topic_dbscan_clustered_list.append(temp_dic[:15])
    df_dbscan_cluster['top_15_topics'] = topic_dbscan_clustered_list
    df_dbscan_cluster.to_excel("{}_DBSCAN_clusters.xlsx".format(o), index=False)
    print(df_dbscan_cluster)
    print("{}_DBSCAN_clusters.xlsx".format(o), "saved")
    print("*" * 50)

แ„‰แ…ณแ„แ…ณแ„…แ…ตแ†ซแ„‰แ…ฃแ†บ 2022-05-08 แ„‹แ…ฉแ„’แ…ฎ 11 04 03

3) Analysis of future-technology repositories

  • Collect information by the technology name to be searched
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

topic_name = input("Enter a technology name: ")
url = "https://github.com/topics/{}?o=desc&s=stars".format(topic_name)
res = requests.get(url)
res.raise_for_status()
soup = BeautifulSoup(res.text, "lxml")

options = webdriver.ChromeOptions()
options.add_argument("headless")  # run Chrome without a visible window
browser = webdriver.Chrome("./chromedriver", options=options)
# browser = webdriver.Chrome("./chromedriver")
browser.get(url)
soup = BeautifulSoup(browser.page_source, 'lxml')

# How many times should "Load more" be clicked?
# 100 is arbitrary; alternatively, keep clicking until the star count falls below some threshold x
Load_more_times = 100
for _ in range(Load_more_times):
    prev = len(soup.find_all("article", attrs={"class": "border rounded color-shadow-small color-bg-subtle my-4"}))
    try:
        browser.find_element_by_xpath("//*[@id=\"js-pjax-container\"]/div[2]/div[2]/div/div[1]/form/button").click()
    except Exception:
        print("End")
        break
    while True:  # wait until the newly loaded articles appear in the DOM
        soup = BeautifulSoup(browser.page_source, 'lxml')
        if prev < len(soup.find_all("article", attrs={"class": "border rounded color-shadow-small color-bg-subtle my-4"})):
            prev = len(soup.find_all("article", attrs={"class": "border rounded color-shadow-small color-bg-subtle my-4"}))
            break
    print(prev, "items loaded.")

# crawl and list the topics
soup = BeautifulSoup(browser.page_source, 'lxml')
topics = soup.find_all("h3", attrs={"class": "f3 color-fg-muted text-normal lh-condensed"})
topic_ads = []
for i in topics:
    topic_ad = "".join(i.get_text().strip().replace("\n", "").split())
    topic_ads.append(topic_ad)

แ„‰แ…ณแ„แ…ณแ„…แ…ตแ†ซแ„‰แ…ฃแ†บ 2022-05-08 แ„‹แ…ฉแ„’แ…ฎ 11 06 24

  • DBSCAN Clustering
print("{} starting DBSCAN clustering for {} {}".format("*" * 10, topic_name, "*" * 10))
excel_name = "keyword({})_vectors.xlsx".format(topic_name)
df_vector = pd.read_excel(excel_name)
# tuning the eps value while clustering yields more accurate results
dbscan = DBSCAN(eps=0.3)
dbscan_cluster = dbscan.fit_predict(df_vector)
dbscan_clustered_dic = {}
dbscan_cluster_num = len(set(dbscan_cluster))


for idx, i in enumerate(dbscan_cluster):
    if i not in dbscan_clustered_dic:
        dbscan_clustered_dic[i] = [df_topic['project_name'][idx]]
    else:
        dbscan_clustered_dic[i].append(df_topic['project_name'][idx])

# the clustered packages
# produced as 20 clusters

dbscan_clustered_dic = sorted(dbscan_clustered_dic.items(), key=lambda x: x[0])

df_dbscan_cluster = pd.DataFrame(dbscan_clustered_dic, columns=['num', 'clusters'])
dbscan_cluster_num = [len(i) for i in df_dbscan_cluster['clusters']]
df_dbscan_cluster['cluster_num'] = dbscan_cluster_num
topic_dbscan_clustered_list = []
for i in df_dbscan_cluster['clusters']:
    temp_dic = {}
    for j in i:
        topics = df_topic[df_topic['project_name'] == j]['topic_keyword'].values[0]
        for t in topics:
            if len(t) == 0:
                continue
            t = same_things(t)  # synonym-normalization helper (see sketch above)
            if t not in temp_dic:
                temp_dic[t] = 1
            else:
                temp_dic[t] += 1
    temp_dic = sorted(temp_dic.items(), key=lambda x: x[1], reverse=True)
    # print(temp_dic[:15])  # show only the top 15
    topic_dbscan_clustered_list.append(temp_dic[:15])
df_dbscan_cluster['top_15_topics'] = topic_dbscan_clustered_list
df_dbscan_cluster.to_excel("{}_DBSCAN_clusters.xlsx".format(topic_name), index=False)
print(df_dbscan_cluster)
print("{}_DBSCAN_clusters.xlsx".format(topic_name), "saved")
print("*" * 50)

แ„‰แ…ณแ„แ…ณแ„…แ…ตแ†ซแ„‰แ…ฃแ†บ 2022-05-08 แ„‹แ…ฉแ„’แ…ฎ 11 07 04
