How to create a new data frame from a larger dataset
I am working with a dataset (10000 data points) that provides 100 different account numbers with transaction amounts, date and time of transactions etc.
From this dataset I want to create a separate data frame for one account number, which then contains all the transactions (ordered by time) that that account number made throughout the year.
I tried to do this by:
group = df.groupby('account_num')
which then gives me
pandas.core.groupby.generic.DataFrameGroupBy
Then, when I want to get the group for a specific account number, say 51234:
group.get_group('51234')
I receive an error:
KeyError: 51234
How can I make a separate data frame containing all the transaction for one single account number?
(Sorry if this is a very basic question, Im a newbie)
Solution 1:
IIUC, you can get your output in a slightly different way. You can start by making sure your time column, which I assume is a date based on your description, is actually a datetime
object, and then filtering your dataframe for the specific account number - there are plenty of ways to do this, a common one is loc
, but in my case I use query
. Then you can sort based on your date, using sort_values
, and lastly you can use groupby
on the year part of your date column:
# Convert your date column to datetime
df['date'] = pd.to_datetime(df['date'])
# Filter and sort
>>> print(df.query('account_num == 51234')\
.sort_values(by=['date'],ascending=True))
# Equivalently with loc
print(
df.loc[df['account_num'] == 51234]\
.sort_values(by=['date'],ascending=True))
account_num date
0 51234 2020-01-01
1 51234 2020-02-01
2 51234 2020-03-01
7 51234 2020-08-01
9 51234 2020-08-01
11 51234 2020-08-01
13 51234 2020-08-01
3 51234 2021-04-01
4 51234 2021-05-01
5 51234 2023-06-01
6 51234 2023-07-01
8 51234 2023-07-01
10 51234 2023-07-01
12 51234 2023-07-01
# Filter, sort, and get yearly count
>>> print(
df.query('account_num == 51234')\
.sort_values(by=['date'],ascending=True)\
.groupby(df['date'].dt.year).account_num.count())
date
2020 7
2021 2
2023 5
Based on the below sample DF:
{'account_num': {0: 51234,
1: 51234,
2: 51234,
3: 51234,
4: 51234,
5: 51234,
6: 51234,
7: 51234,
8: 51234,
9: 51234,
10: 51234,
11: 51234,
12: 51234,
13: 51234,
14: 512346,
15: 512346,
16: 512346,
17: 512346,
18: 512346,
19: 512346,
20: 512346,
21: 512346,
22: 512346,
23: 13123,
24: 13123,
25: 13123,
26: 13123,
27: 13123,
28: 13123,
29: 13123,
30: 13123,
31: 13123},
'date': {0: '01/01/2020',
1: '02/01/2020',
2: '03/01/2020',
3: '04/01/2021',
4: '05/01/2021',
5: '06/01/2023',
6: '07/01/2023',
7: '08/01/2020',
8: '07/01/2023',
9: '08/01/2020',
10: '07/01/2023',
11: '08/01/2020',
12: '07/01/2023',
13: '08/01/2020',
14: '09/01/2020',
15: '10/01/2020',
16: '11/01/2020',
17: '12/01/2020',
18: '13/01/2020',
19: '14/01/2020',
20: '15/01/2020',
21: '16/01/2020',
22: '17/01/2020',
23: '18/01/2020',
24: '19/01/2020',
25: '20/01/2020',
26: '21/01/2020',
27: '22/01/2020',
28: '23/01/2020',
29: '24/01/2020',
30: '25/01/2020',
31: '26/01/2020'}}